Achieving Reliable Causal Inference with Data-Mined Variables: A Random
Forest Approach to the Measurement Error Problem
---
Authors:
Mochen Yang, Edward McFowland III, Gordon Burtch and Gediminas
Adomavicius
---
Year of latest submission:
2020
---
Abstract:
Combining machine learning with econometric analysis is becoming increasingly prevalent in both research and practice. A common empirical strategy involves the application of predictive modeling techniques to 'mine' variables of interest from available data, followed by the inclusion of those variables into an econometric framework, with the objective of estimating causal effects. Recent work highlights that, because the predictions from machine learning models are inevitably imperfect, econometric analyses based on the predicted variables are likely to suffer from bias due to measurement error. We propose a novel approach to mitigate these biases, leveraging the ensemble learning technique known as the random forest. We propose employing random forest not just for prediction, but also for generating instrumental variables to address the measurement error embedded in the prediction. The random forest algorithm performs best when comprised of a set of trees that are individually accurate in their predictions, yet which also make 'different' mistakes, i.e., have weakly correlated prediction errors. A key observation is that these properties are closely related to the relevance and exclusion requirements of valid instrumental variables. We design a data-driven procedure to select tuples of individual trees from a random forest, in which one tree serves as the endogenous covariate and the other trees serve as its instruments. Simulation experiments demonstrate the efficacy of the proposed approach in mitigating estimation biases and its superior performance over three alternative methods for bias correction.
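The core idea in the abstract, that one noisy prediction can be instrumented by another whose errors are weakly correlated with it, can be illustrated with a minimal simulation. This is only a sketch under strong assumptions and not the authors' tree-selection procedure: the two "tree" predictions are hypothetical stand-ins generated as the true variable plus independent Gaussian noise (classical measurement error), and the names `ols_slope`, `iv_slope`, and `simulate` are invented for this illustration.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from regressing y on x with an intercept."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


def iv_slope(x, y, z):
    """Simple instrumental-variable estimator: cov(z, y) / cov(z, x)."""
    zc = z - z.mean()
    return float(zc @ (y - y.mean()) / (zc @ (x - x.mean())))


def simulate(n=20000, beta=2.0, noise_sd=1.0, seed=0):
    """Compare naive OLS with IV when the regressor is a noisy prediction."""
    rng = np.random.default_rng(seed)
    x_true = rng.normal(size=n)  # the unobserved true variable
    # Assumption: two hypothetical tree predictions with independent errors.
    tree_a = x_true + rng.normal(scale=noise_sd, size=n)  # endogenous covariate
    tree_b = x_true + rng.normal(scale=noise_sd, size=n)  # instrument
    y = beta * x_true + rng.normal(size=n)  # outcome depends on the true variable
    return ols_slope(tree_a, y), iv_slope(tree_a, y, tree_b)


naive, corrected = simulate()
print(f"naive OLS slope: {naive:.3f}")
print(f"IV slope:        {corrected:.3f}")
```

Under these assumptions the naive slope is attenuated toward zero (by roughly the ratio of signal variance to total variance of the prediction), while the IV slope should land close to the true coefficient. Real tree predictions will generally have correlated, non-classical errors, which is why the paper proposes a data-driven procedure for choosing which trees to pair.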
---
Classification:
Top-level category: Economics
Subcategory: Econometrics
Category description: Econometric Theory, Micro-Econometrics, Macro-Econometrics, Empirical Content of Economic Relations discovered via New Methods, Methodological Aspects of the Application of Statistical Inference to Economic Data.
--
Top-level category: Computer Science
Subcategory: Computers and Society
Category description: Covers impact of computers on society, computer ethics, information technology and public policy, legal aspects of computing, computers and education. Roughly includes material in ACM Subject Classes K.0, K.2, K.3, K.4, K.5, and K.7.
--
Top-level category: Computer Science
Subcategory: Machine Learning
Category description: Papers on all aspects of machine learning research (supervised, unsupervised, reinforcement learning, bandit problems, and so on) including also robustness, explanation, fairness, and methodology. cs.LG is also an appropriate primary category for applications of machine learning methods.
--