Offline A/B testing for Recommender Systems
Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, Simon Dollé
Criteo Research
Online A/B testing evaluates the impact of a new technology by
running it in a real production environment and testing its performance
on a subset of the users of the platform. It is a well-known
practice to run a preliminary offline evaluation on historical data
to iterate faster on new ideas, and to detect poor policies in order
to avoid losing money or breaking the system. For such offline
evaluations, we are interested in methods that can compute offline
an estimate of the potential uplift of performance generated by a
new technology. Offline performance can be measured using estimators
known as counterfactual or off-policy estimators. Traditional
counterfactual estimators, such as capped importance sampling or
normalised importance sampling, exhibit unsatisfying bias-variance
compromises when experimenting on personalized product recommendation
systems. To overcome this issue, we model the bias
incurred by these estimators rather than bound it in the worst
case, which leads us to propose a new counterfactual estimator.
We provide a benchmark of the different estimators showing their
correlation with business metrics observed by running online A/B
tests on a large-scale commercial recommender system.
• Computing methodologies → Learning from implicit feedback;
• Information systems → Evaluation of retrieval results;
counterfactual estimation, off-policy evaluation, recommender system,
importance sampling.
ACM Reference Format:
Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham,
Simon Dollé. 2018. Offline A/B testing for Recommender Systems. In
Proceedings of 11th ACM International Conf. on Web Search and Data Mining
(WSDM 2018). ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/
Personalized product recommendation has become a central part
of most online marketing systems. Having efficient and reliable
methods to evaluate recommender systems is critical in accelerating
the pace of improvement of these marketing platforms.
WSDM 2018, February 5–9, 2018, Marina Del Rey, CA, USA
Online A/B tests have become ubiquitous in tech companies as a way
to make informed decisions on the rollout of a new technology such
as a recommender system. Each new software implementation is
tested by comparing its performance with the previous production
version through randomised experiments. In practice, to compare
two technologies, a pool of units (e.g. users, displays or servers) of
the platform is split into two populations and each of them is exposed
to one of the tested technologies. The careful choice of the unit
reflects the independence assumptions under which the test is run. At
the end of the experiment, business metrics such as the generated
revenue, the number of clicks or the time spent on the platform are
compared to make a decision on the future of the new technology.
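The split described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the salt string and the hash-based assignment are hypothetical and are not taken from the paper, which does not specify how units are assigned to populations.

```python
import hashlib
from typing import Dict, Iterable, Tuple


def assign_population(unit_id: str, salt: str = "experiment-1") -> str:
    """Deterministically map a unit (e.g. a user id) to population A or B.

    Hashing the unit id with an experiment-specific salt (a hypothetical
    choice) keeps a given unit in the same population for the whole test.
    """
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def mean_metric_per_population(
    logs: Iterable[Tuple[str, float]]
) -> Dict[str, float]:
    """Average a business metric (e.g. revenue) over each population.

    `logs` holds (unit_id, metric_value) pairs.
    """
    totals = {"A": 0.0, "B": 0.0}
    counts = {"A": 0, "B": 0}
    for unit_id, value in logs:
        population = assign_population(unit_id)
        totals[population] += value
        counts[population] += 1
    return {
        p: (totals[p] / counts[p] if counts[p] else float("nan"))
        for p in totals
    }
```

Splitting on the user id rather than on individual displays, for instance, would encode the assumption that users are independent of each other while displays shown to the same user are not.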
However, online A/B tests take time and cost money. Indeed, to
gather a sufficient amount of data to reach statistical sufficiency
and be able to study periodic behaviours (the signal can be different
from one day to the other), an A/B test is usually implemented over
several weeks. On top of this, prototypes need to be brought to
production standard to be tested. These reasons prevent companies
from iterating quickly on new ideas.
To avoid these pitfalls, people historically relied on offline
experiments based on rank-based metrics, such as NDCG [13],
MAP [1] or Precision@K. Such evaluations rely on very strong
assumptions, such as independence between products or the fact
that the feedback (e.g. click) can be translated into a supervised task
[10, 16, 22]. To overcome these limitations, some estimators were
introduced [2, 15] to estimate offline – i.e. using randomised historical
data gathered under only one policy – a business comparison
between two systems. In the following, we shall call the procedure
of comparing two systems offline, based on some business metric
defining the outcome, an offline A/B test.
This setting is called counterfactual reasoning or off-policy evaluation
(OPE) (see [2] for a comprehensive study). Several estimators
such as Basic Importance Sampling (BIS, [8, 11]), Capped Importance
Sampling (CIS, [2]), Normalised Importance Sampling (NIS, [21])
and Doubly Robust (DR, [6]) have been introduced to compute the
expected reward of the tested technology π_t based on logs collected
on the current technology in production π_p.
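To make the setting concrete, here is a minimal sketch of the textbook forms of basic and normalised importance sampling. The function names and the toy data are illustrative assumptions; each logged event is assumed to carry its reward together with the probabilities that π_p and π_t assign to the logged action, and π_p is assumed to be strictly positive on every logged action.

```python
from typing import List, Sequence


def importance_weights(
    pi_t: Sequence[float], pi_p: Sequence[float]
) -> List[float]:
    """Per-event weights w_i = pi_t(a_i | x_i) / pi_p(a_i | x_i)."""
    return [t / p for t, p in zip(pi_t, pi_p)]


def basic_is(
    rewards: Sequence[float], pi_t: Sequence[float], pi_p: Sequence[float]
) -> float:
    """Basic importance sampling: average of the reweighted rewards."""
    w = importance_weights(pi_t, pi_p)
    return sum(wi * ri for wi, ri in zip(w, rewards)) / len(rewards)


def normalised_is(
    rewards: Sequence[float], pi_t: Sequence[float], pi_p: Sequence[float]
) -> float:
    """Normalised importance sampling: divide by the sum of the weights
    instead of the number of events."""
    w = importance_weights(pi_t, pi_p)
    return sum(wi * ri for wi, ri in zip(w, rewards)) / sum(w)


# Toy logs (made up for illustration): rewards and action probabilities.
rewards = [1.0, 0.0, 1.0, 0.0]
pi_p = [0.5, 0.5, 0.25, 0.25]   # production policy
pi_t = [0.25, 0.5, 0.5, 0.25]   # tested policy

bis_estimate = basic_is(rewards, pi_t, pi_p)        # 0.625
nis_estimate = normalised_is(rewards, pi_t, pi_p)   # 2.5 / 4.5
```

The weights here are 0.5, 1, 2 and 1, so a single event with a small π_p probability can dominate the sum, which is the usual source of variance for the basic estimator.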
All these estimators achieve different trade-offs between bias
and variance. After explaining why BIS (Section 3) and DR (Section
4.1) suffer from high variance in the recommendation setting, we
shall focus on capped importance sampling (Section 4.2). [2] proposed
clipping the importance weights, which leads to a biased
estimator with lower variance. However, the control of the bias is
very loose in the general case and [2] only advocates modifying the
source probability (the current system) to further explore whether
a light clipping could be sufficient.
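The effect of clipping can be sketched as below. This assumes the common variant that replaces each weight w by min(w, c) for a cap c; the function name and the toy data are illustrative and are not the paper's own definitions.

```python
from typing import Sequence


def capped_is(
    rewards: Sequence[float],
    pi_t: Sequence[float],
    pi_p: Sequence[float],
    cap: float,
) -> float:
    """Capped importance sampling with weights min(pi_t / pi_p, cap).

    Assumes pi_p > 0 on every logged action and cap > 0.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")
    total = 0.0
    for r, t, p in zip(rewards, pi_t, pi_p):
        total += min(t / p, cap) * r
    return total / len(rewards)


# Toy logs (made up for illustration).
rewards = [1.0, 0.0, 1.0, 0.0]
pi_p = [0.5, 0.5, 0.25, 0.25]   # production policy
pi_t = [0.25, 0.5, 0.5, 0.25]   # tested policy

uncapped = capped_is(rewards, pi_t, pi_p, cap=10.0)  # 0.625, no weight is clipped
capped = capped_is(rewards, pi_t, pi_p, cap=1.0)     # 0.375, the weight 2.0 is clipped to 1.0
```

With non-negative rewards, clipping can only lower the estimate, so the variance reduction is paid for with a downward bias that grows as the cap gets tighter.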
The main caveat of such biased estimates is that a low bias is
present only under unrealistic conditions (see Section 4.2). Our
main contribution is to propose two variants of capped importance
