I wish there were an HN for data content. Articles like this are always my favorite way to learn something new from a domain expert. I work in fraud detection and would like to write something similar!
https://datatau.net used to be kinda that… but it has been dead for a while, and it looks like the spam bots have taken over.
I admit this is way over my head; I am still trying to grok it. This seems to require an existing model to start from, and I am not sure how one would arrive at a model from scratch (I guess start from the same weights on all items?)
I think the point about A/B testing in production to confirm whether a new model is working is really important, but it is also important to do A/B/Control testing, where Control is either random recommendations (seeded to the context or user) or no recommendations at all. This not only helps compare A vs. B, it also validates that neither A nor B is performing worse than Control. What percentage of traffic (1% or 5%) goes to Control depends on traffic levels, but running a control group also takes some convincing.
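A minimal sketch of what a stable, user-seeded bucket assignment could look like; the salt, split percentages, and function name here are illustrative, not from the article:

```python
import hashlib

def assign_bucket(user_id: str, control_pct: float = 0.01, salt: str = "rec-exp-1") -> str:
    """Deterministically assign a user to control / A / B.

    Hashing (salt + user_id) keeps the assignment stable across visits,
    which is what "seeded to the user" means above. The percentages are
    illustrative; control_pct would depend on traffic volume.
    """
    h = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    u = int(h[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    if u < control_pct:
        return "control"  # random or no recommendations
    return "A" if u < control_pct + (1 - control_pct) / 2 else "B"

print(assign_bucket("user-42"))  # same bucket every visit for this user
```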
I think one important technique is to pre-aggregate your data on a user-centered or item-centered basis. This can make it much more palatable to collect this data on a massive scale without having to store a log for every event.
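A rough sketch of that kind of roll-up, assuming a simple (user, item, event) log; the event stream and field names are made up for illustration:

```python
from collections import defaultdict

# Hypothetical raw event stream: (user_id, item_id, event_type) tuples.
events = [
    ("u1", "i9", "view"), ("u1", "i9", "click"),
    ("u2", "i9", "view"), ("u1", "i3", "view"),
]

# Roll events up into per-(user, item, event_type) counts so the raw log
# can be archived or discarded; only the aggregates need to stay hot.
user_item_counts = defaultdict(int)
for user_id, item_id, event_type in events:
    user_item_counts[(user_id, item_id, event_type)] += 1

# Item-centered view derived from the same aggregate: total clicks per item.
item_clicks = defaultdict(int)
for (user_id, item_id, event_type), n in user_item_counts.items():
    if event_type == "click":
        item_clicks[item_id] += n

print(dict(item_clicks))  # {'i9': 1}
```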
Contextual bandits are one technique that attempts to deal with confounding factors and the bias introduced by the recommendations actually shown. However, I think there is a major challenge in scaling them to very large item catalogs.
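For concreteness, here is a minimal LinUCB-style sketch with one linear model per item; the item ids and feature dimension are made up, and a real system would need a far more scalable arm structure than a dict of per-item matrices:

```python
import numpy as np

class LinUCBArm:
    """One arm of a LinUCB-style contextual bandit.

    Keeps A = I + sum(x x^T) and b = sum(r x); the score is the ridge
    estimate theta @ x plus an exploration bonus. Inverting A per item is
    exactly what becomes painful at large item counts.
    """
    def __init__(self, dim: int, alpha: float = 1.0):
        self.A = np.eye(dim)
        self.b = np.zeros(dim)
        self.alpha = alpha

    def score(self, x: np.ndarray) -> float:
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x: np.ndarray, reward: float) -> None:
        self.A += np.outer(x, x)
        self.b += reward * x

# Pick the item whose arm scores highest for this context, then update
# that arm with the observed click / no-click reward.
arms = {item_id: LinUCBArm(dim=5) for item_id in ["i1", "i2", "i3"]}
context = np.random.rand(5)
chosen = max(arms, key=lambda i: arms[i].score(context))
arms[chosen].update(context, reward=1.0)  # pretend the user clicked
```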
I think the quality of collected non-click data is also important: did the user actually scroll down to see the recommendations, or were they served but never looked at? Likewise, I think it is important to add depth to the "views" or "clicks" metric: if something was clicked, how long did the user spend viewing or interacting with the item? Did they click and immediately go back, or did they click and look at it for a while? Did they add the item to cart? Or, if we are talking about articles, did they spend time reading? Item interest can be estimated more closely than with views, clicks, and purchases alone. Of course, purchases (or, more generally, conversions) have direct business value, but an add-to-cart, for example, is a proxy for purchase probability and can improve the quality of the training data (and thus carry a higher proxy business value).
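A hedged sketch of how those richer signals could be folded into a single implicit-feedback label; the event names and weights here are made up for illustration, not taken from the article:

```python
# Illustrative event weights; real values would be tuned against downstream conversion.
EVENT_WEIGHTS = {"impression_seen": 0.0, "click": 1.0, "add_to_cart": 3.0, "purchase": 10.0}

def engagement_score(events, dwell_seconds: float) -> float:
    """Fold event types and dwell time into one training label.

    A click followed by an immediate bounce (tiny dwell) ends up worth less
    than a click with a long read, which is the point made above.
    """
    base = sum(EVENT_WEIGHTS.get(e, 0.0) for e in events)
    dwell_bonus = min(dwell_seconds / 30.0, 2.0) if "click" in events else 0.0  # capped
    return base + dwell_bonus

print(engagement_score(["click"], dwell_seconds=2))                   # ~1.07: likely a bounce
print(engagement_score(["click", "add_to_cart"], dwell_seconds=120))  # 6.0: strong interest
```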
It's probably impractical to train on control interactions only (and also difficult to keep the same user in the control group between visits).
The SNIPS normalization technique reminds me of the Mutual Information correction used when training co-occurrence (or association) models, where Mutual Information rewards item pairs that are unlikely to co-occur by chance.
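For anyone who hasn't seen them side by side, here are hedged sketches of both quantities as I understand them; the function names and arguments are mine, not from the article:

```python
import numpy as np

def snips(rewards, new_probs, logged_probs) -> float:
    """Self-normalized inverse propensity scoring (SNIPS) estimate of a new policy's value.

    w_i = pi_new(a_i|x_i) / pi_logged(a_i|x_i). Plain IPS averages r_i * w_i over n;
    SNIPS divides by the sum of the weights instead, which reduces variance.
    """
    w = np.asarray(new_probs) / np.asarray(logged_probs)
    r = np.asarray(rewards)
    return float((r * w).sum() / w.sum())

def pmi(cooccur: int, count_a: int, count_b: int, total: int) -> float:
    """Pointwise mutual information for an item pair: log p(a,b) / (p(a) p(b)).

    Pairs that co-occur more often than chance get a positive score, which is
    the "rewards pairs unlikely to co-occur by chance" correction above.
    """
    return float(np.log((cooccur / total) / ((count_a / total) * (count_b / total))))
```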
Re: existing model, for recsys, as long as the product already exists you have some baseline available, even if it's not very good. Anything from "alphabetical order" to "random order" to "most popular" (a reasonable starting point for a lot of cases) is a baseline model.
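A most-popular baseline really is only a few lines; this sketch with a made-up interaction log is roughly all it takes:

```python
from collections import Counter

# Hypothetical interaction log: one entry per event, recording the item touched.
interactions = ["i3", "i1", "i3", "i2", "i3", "i1"]

def most_popular_baseline(interactions, k: int = 2):
    """Rank items by raw interaction count: the simplest non-random baseline."""
    return [item for item, _ in Counter(interactions).most_common(k)]

print(most_popular_baseline(interactions))  # ['i3', 'i1']
```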
I agree that a randomized control is extremely valuable, but more as a way to collect unbiased data than a way to validate that you're outperforming random: it's pretty difficult to do worse than random in most recommendation problems. A more palatable way to introduce some randomness is by showing a random item in a specific position with some probability, rather than showing totally random items for a given user/session. This has the advantage of not ruining the experience for an unlucky user when they get a page of things totally unrelated to their interests.
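A hedged sketch of that position-level exploration; the slot index, epsilon, and function name are illustrative:

```python
import random

def explore_at_position(ranked_items, candidate_pool, slot: int = 4,
                        epsilon: float = 0.05, rng=random):
    """With probability epsilon, replace one slot with a uniformly random item.

    Only a single position is perturbed, so the rest of the page still reflects
    the user's interests, while the randomized slot yields (near-)unbiased
    feedback data for that position.
    """
    items = list(ranked_items)
    if rng.random() < epsilon and slot < len(items):
        items[slot] = rng.choice(candidate_pool)
    return items

page = explore_at_position(["i1", "i2", "i3", "i4", "i5", "i6"],
                           candidate_pool=["i7", "i8", "i9"])
print(page)  # usually unchanged; occasionally slot 4 holds a random item
```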
There are a number of different types of counterfactuals in statistics and beyond: classical counterfactuals, Pearl's counterfactuals, quantum counterfactuals, and constructor theory counterfactuals.