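Abstract (translation):
Many regression problems involve not one but several response variables. Often these responses are suspected to share a common underlying structure, in which case sharing information across them may be advantageous; this is known as multitask learning. As a special case, multiple responses can be used to better identify shared predictive features, a project we might call multitask feature selection. This thesis is organized as follows. Section 1 introduces feature selection for regression, focusing on ell_0 regularization methods and their interpretation within the Minimum Description Length (MDL) framework. Section 2 proposes a novel extension of MDL feature selection to the multitask setting. The approach, called the "Multiple Inclusion Criterion" (MIC), is designed to borrow information across regression tasks by more readily selecting features that are associated with multiple responses. Experiments on synthetic and real biological data sets show that MIC can reduce prediction error when features are at least partially shared across responses. Section 3 surveys hypothesis testing by regression with a single response, focusing on the parallel between the standard Bonferroni correction and the MDL approach. Mirroring the ideas of Section 2, Section 4 proposes a novel MIC approach to hypothesis testing with multiple responses and shows that, on synthetic data in which features are substantially shared across responses, MIC sometimes outperforms standard FDR-controlling methods at finding true positives for a given level of false positives. Section 5 concludes.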
---
English title:
A Minimum Description Length Approach to Multitask Feature Selection
---
Author:
Brian Tomasik
---
Year of latest submission:
2009
---
Classification:
Primary category: Computer Science
Secondary category: Machine Learning
Category description: Papers on all aspects of machine learning research (supervised, unsupervised, reinforcement learning, bandit problems, and so on) including also robustness, explanation, fairness, and methodology. cs.LG is also an appropriate primary category for applications of machine learning methods.
--
Primary category: Computer Science
Secondary category: Artificial Intelligence
Category description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
---
English abstract:
Many regression problems involve not one but several response variables (y's). Often the responses are suspected to share a common underlying structure, in which case it may be advantageous to share information across them; this is known as multitask learning. As a special case, we can use multiple responses to better identify shared predictive features -- a project we might call multitask feature selection. This thesis is organized as follows. Section 1 introduces feature selection for regression, focusing on ell_0 regularization methods and their interpretation within a Minimum Description Length (MDL) framework. Section 2 proposes a novel extension of MDL feature selection to the multitask setting. The approach, called the "Multiple Inclusion Criterion" (MIC), is designed to borrow information across regression tasks by more easily selecting features that are associated with multiple responses. We show in experiments on synthetic and real biological data sets that MIC can reduce prediction error in settings where features are at least partially shared across responses. Section 3 surveys hypothesis testing by regression with a single response, focusing on the parallel between the standard Bonferroni correction and an MDL approach. Mirroring the ideas in Section 2, Section 4 proposes a novel MIC approach to hypothesis testing with multiple responses and shows that on synthetic data with significant sharing of features across responses, MIC sometimes outperforms standard FDR-controlling methods in terms of finding true positives for a given level of false positives. Section 5 concludes.
---
PDF link:
https://arxiv.org/pdf/0906.0052