Title:
Bayesian Classification and Regression with High Dimensional Features
---
Author:
Longhai Li
---
Year of latest submission:
2007
---
Classification:
Primary category: Statistics
Secondary category: Machine Learning
Category description: Covers machine learning papers (supervised, unsupervised, semi-supervised learning, graphical models, reinforcement learning, bandits, high dimensional inference, etc.) with a statistical or theoretical grounding
---
Abstract:
This thesis responds to the challenges of using a large number, such as thousands, of features in regression and classification problems. There are two situations where such high dimensional features arise. One is when high dimensional measurements are available, for example, gene expression data produced by microarray techniques. For computational or other reasons, people may select only a small subset of features when modelling such data, by looking at how relevant the features are to predicting the response, based on some measure such as correlation with the response in the training data. Although it is used very commonly, this procedure will make the response appear more predictable than it actually is. In Chapter 2, we propose a Bayesian method to avoid this selection bias, with application to naive Bayes models and mixture models. High dimensional features also arise when we consider high-order interactions. The number of parameters will increase exponentially with the order considered. In Chapter 3, we propose a method for compressing a group of parameters into a single one, by exploiting the fact that many predictor variables derived from high-order interactions have the same values for all the training cases. The number of compressed parameters may have converged before considering the highest possible order. We apply this compression method to logistic sequence prediction models and logistic classification models. We use both simulated data and real data to test our methods in both chapters.
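
The selection bias mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the thesis's Bayesian method; it only shows the problem that method is meant to correct. All names (`correlation`, `selection_bias_demo`) and the settings (40 cases, 2000 binary features, 10 selected) are assumptions made for illustration. Every feature is pure noise, so no feature is truly related to the response, yet the features chosen for their training-set correlation look informative on the training data.

```python
import random


def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def selection_bias_demo(n_cases=40, n_features=2000, n_selected=10, seed=0):
    """Return (mean |corr| on training data, mean |corr| on fresh data)
    for the features selected by their training-set correlation."""
    rng = random.Random(seed)

    def draw():
        # Response and features are independent coin flips: nothing is predictive.
        y = [rng.randint(0, 1) for _ in range(n_cases)]
        # Feature-major layout: one list of n_cases values per feature.
        x = [[rng.randint(0, 1) for _ in range(n_cases)] for _ in range(n_features)]
        return x, y

    x_train, y_train = draw()
    x_test, y_test = draw()

    # Rank features by absolute correlation with the training response.
    train_corr = [abs(correlation(col, y_train)) for col in x_train]
    ranked = sorted(range(n_features), key=lambda j: train_corr[j], reverse=True)
    selected = ranked[:n_selected]

    mean_train = sum(train_corr[j] for j in selected) / n_selected
    mean_test = sum(abs(correlation(x_test[j], y_test)) for j in selected) / n_selected
    return mean_train, mean_test


if __name__ == "__main__":
    mean_train, mean_test = selection_bias_demo()
    print(f"selected features, training data: mean |corr| = {mean_train:.2f}")
    print(f"selected features, fresh data:    mean |corr| = {mean_test:.2f}")
```

Under these assumptions, the first value should come out clearly larger than the second, because the selected features were picked for their chance agreement with the training response. A model fitted only to the selected features, without accounting for how they were chosen, inherits this optimism.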
---
PDF link:
https://arxiv.org/pdf/709.2936

