Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, Simon Dollé
Criteo Research
f.name@criteo.com
ABSTRACT
Online A/B testing evaluates the impact of a new technology by
running it in a real production environment and testing its performance
on a subset of the users of the platform. It is a well-known
practice to run a preliminary offline evaluation on historical data
to iterate faster on new ideas, and to detect poor policies in order
to avoid losing money or breaking the system. For such offline
evaluations, we are interested in methods that can compute offline
an estimate of the potential uplift of performance generated by a
new technology. Offline performance can be measured using estimators
known as counterfactual or off-policy estimators. Traditional
counterfactual estimators, such as capped importance sampling or
normalised importance sampling, exhibit unsatisfying bias-variance
compromises when experimenting on personalized product recommendation
systems. To overcome this issue, we model the bias
incurred by these estimators rather than bound it in the worst
case, which leads us to propose a new counterfactual estimator.
We provide a benchmark of the different estimators showing their
correlation with business metrics observed by running online A/B
tests on a large-scale commercial recommender system.
CCS CONCEPTS
• Computing methodologies → Learning from implicit feedback;
• Information systems → Evaluation of retrieval results;
KEYWORDS
counterfactual estimation, off-policy evaluation, recommender system,
importance sampling.
ACM Reference Format:
Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham,
Simon Dollé. 2018. Offline A/B testing for Recommender Systems. In
Proceedings of the 11th ACM International Conference on Web Search and Data Mining
(WSDM 2018). ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/
3159652.3159687
1 INTRODUCTION
Personalized product recommendation has become a central part
of most online marketing systems. Having efficient and reliable
methods to evaluate recommender systems is critical in accelerating
the pace of improvement of these marketing platforms.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
WSDM 2018, February 5–9, 2018, Marina Del Rey, CA, USA
© 2018 Copyright held by the owner/author(s). Publication rights licensed to Association
for Computing Machinery.
ACM ISBN 978-1-4503-5581-0/18/02...$15.00
https://doi.org/10.1145/3159652.3159687
Online A/B tests became ubiquitous in tech companies in order
to make informed decisions on the rollout of a new technology such
as a recommender system. Each new software implementation is
tested by comparing its performance with the previous production
version through randomised experiments. In practice, to compare
two technologies, a pool of units (e.g. users, displays or servers) of
the platform is split in two populations and each of them is exposed
to one of the tested technologies. The careful choice of the unit
reflects independence assumptions under which the test is run. At
the end of the experiments, business metrics such as the generated
revenue, the number of clicks or the time spent on the platform are
compared to make a decision on the future of the new technology.
However, online A/B tests take time and cost money. Indeed, to
gather a sufficient amount of data to reach statistical significance
and be able to study periodic behaviours (the signal can differ
from one day to the next), an A/B test usually runs over
several weeks. On top of this, prototypes need to be brought to
production standard to be tested. These reasons prevent companies
from iterating quickly on new ideas.
To avoid these pitfalls, practitioners have historically relied on offline
experiments based on rank-based metrics, such as NDCG [13],
MAP [1] or Precision@K. Such evaluations suffer from very strong
assumptions, such as independence between products or the fact
that the feedback (e.g. click) can be translated into a supervised task
[10, 16, 22]. To overcome these limitations, some estimators were
introduced [2, 15] to estimate offline – i.e. using randomised historical
data gathered under only one policy – a business comparison
between two systems. In the following, we shall call the procedure
of comparing offline two systems based on some business metric
defining the outcome an offline A/B test.
This setting is called counterfactual reasoning or off-policy evaluation
(OPE) (see [2] for a comprehensive study). Several estimators
such as Basic Importance Sampling (BIS, [8, 11]), Capped Importance
Sampling (CIS, [2]), Normalised Importance Sampling (NIS, [21])
and Doubly Robust (DR, [6]) have been introduced to compute the
expected reward of the tested technology π_t based on logs collected
on the current technology in production π_p.
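As a hedged illustration (a minimal sketch with assumed array inputs, not the paper's exact formulation), the basic importance sampling estimator reweights each logged reward by the probability ratio π_t/π_p of the logged action under the two policies:

```python
import numpy as np

def basic_importance_sampling(rewards, p_logging, p_target):
    """BIS estimate of the expected reward of a tested policy pi_t,
    computed from logs collected under the production policy pi_p.

    rewards:   observed reward for each logged recommendation
    p_logging: probability pi_p assigned to each logged action
    p_target:  probability pi_t assigns to the same logged action
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(p_target, dtype=float) / np.asarray(p_logging, dtype=float)
    # Each logged reward counts more (or less) according to how much
    # more likely pi_t was to take the logged action than pi_p.
    return float(np.mean(weights * rewards))

# Hypothetical example: three logged displays with binary click rewards.
est = basic_importance_sampling(
    rewards=[1.0, 0.0, 1.0],
    p_logging=[0.5, 0.5, 0.5],
    p_target=[0.25, 0.25, 0.75],
)
```

The estimator is unbiased whenever π_p puts positive probability on every action π_t can take, but its variance blows up when some ratios π_t/π_p are large, which is the issue the capped variants below address.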
All these estimators achieve different trade-offs between bias
and variance. After explaining why BIS (Section 3) and DR (Section
4.1) suffer from high variance in the recommendation setting, we
shall focus on capped importance sampling (Section 4.2). [2] proposed
clipping the importance weights, which leads to a biased
estimator with lower variance. However, the control of the bias is
very loose in the general case, and [2] only advocates modifying the
source probability (the current system) to further explore whether
a light clipping could be sufficient.
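To make the clipping concrete, here is a minimal sketch (assumed names and inputs, not the implementation of [2]) of capped importance sampling, where each importance weight π_t/π_p is clipped at a constant cap before averaging:

```python
import numpy as np

def capped_importance_sampling(rewards, p_logging, p_target, cap):
    """CIS estimate: importance weights pi_t/pi_p are clipped at `cap`.

    Large weights (actions rare under pi_p but frequent under pi_t)
    dominate the variance of plain importance sampling; capping them
    lowers the variance at the price of a bias that worst-case bounds
    control only loosely.
    """
    weights = np.asarray(p_target, dtype=float) / np.asarray(p_logging, dtype=float)
    clipped = np.minimum(weights, cap)  # the clipping is the source of the bias
    return float(np.mean(clipped * np.asarray(rewards, dtype=float)))

# Same hypothetical logs as before: with cap=1.0 the third weight (1.5)
# is clipped, lowering the estimate relative to the uncapped BIS value.
est = capped_importance_sampling(
    rewards=[1.0, 0.0, 1.0],
    p_logging=[0.5, 0.5, 0.5],
    p_target=[0.25, 0.25, 0.75],
    cap=1.0,
)
```

Because rewards are non-negative here, clipping can only shrink each term, so CIS systematically underestimates the true expected reward; the size of that underestimate is exactly the bias the paper proposes to model rather than bound in the worst case.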
The main caveat of such biased estimates is that the bias is low
only under unrealistic conditions (see Section 4.2). Our
main contribution is to propose two variants of capped importance
arXiv:1801.07030v1 [stat.ML] 22 Jan 2018