Title:
Committee-Based Sample Selection for Probabilistic Classifiers
---
Authors:
S. Argamon-Engelson, I. Dagan
---
Year of latest submission:
2011
---
Classification:
Primary category: Computer Science
Secondary category: Artificial Intelligence
Category description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
---
Abstract:
In many real-world learning tasks, it is expensive to acquire a sufficient number of labeled examples for training. This paper investigates methods for reducing annotation cost by 'sample selection'. In this approach, during training the learning program examines many unlabeled examples and selects for labeling only those that are most informative at each stage. This avoids redundantly labeling examples that contribute little new information. Our work follows on previous research on Query By Committee, extending the committee-based paradigm to the context of probabilistic classification. We describe a family of empirical methods for committee-based sample selection in probabilistic classification models, which evaluate the informativeness of an example by measuring the degree of disagreement between several model variants. These variants (the committee) are drawn randomly from a probability distribution conditioned by the training set labeled so far. The method was applied to the real-world natural language processing task of stochastic part-of-speech tagging. We find that all variants of the method achieve a significant reduction in annotation cost, although their computational efficiency differs. In particular, the simplest variant, a two-member committee with no parameters to tune, gives excellent results. We also show that sample selection yields a significant reduction in the size of the model used by the tagger.
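
The two-member committee idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' tagger: it assumes a toy model that keeps independent per-word tag counts and draws each committee member from a Dirichlet posterior over those counts, whereas the paper works with a stochastic part-of-speech tagger. The names `ToyTagModel`, `two_member_select`, `run_selection`, and the `oracle` callback are hypothetical and exist only for this illustration.

```python
import random
from collections import defaultdict


def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


class ToyTagModel:
    """Per-word tag counts (assumption: a stand-in for a real probabilistic tagger)."""

    def __init__(self, tags, prior=1.0):
        self.tags = list(tags)
        self.prior = prior  # symmetric Dirichlet pseudo-count (assumed value)
        self.counts = defaultdict(lambda: [0] * len(self.tags))

    def update(self, word, tag):
        """Record one labeled example."""
        self.counts[word][self.tags.index(tag)] += 1

    def member_vote(self, word, rng):
        """Sample one committee member from the posterior and return its best tag."""
        alphas = [c + self.prior for c in self.counts[word]]
        probs = sample_dirichlet(alphas, rng)
        best = max(range(len(probs)), key=probs.__getitem__)
        return self.tags[best]


def two_member_select(model, word, rng):
    """Select `word` for labeling only if two sampled members disagree on its tag."""
    return model.member_vote(word, rng) != model.member_vote(word, rng)


def run_selection(stream, oracle, tags, seed=0):
    """Scan unlabeled examples in order, querying the oracle only on disagreement."""
    rng = random.Random(seed)
    model = ToyTagModel(tags)
    labeled = []
    for word in stream:
        if two_member_select(model, word, rng):
            tag = oracle(word)  # the costly human annotation step
            model.update(word, tag)
            labeled.append((word, tag))
    return model, labeled
```

In this sketch, a word that has already been labeled consistently several times yields a sharply peaked posterior, so two sampled members rarely disagree and the word is usually skipped. Rarely seen or ambiguous words keep a flat posterior and are selected more often, which is the intuition behind the annotation savings the abstract reports.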
---
PDF link:
https://arxiv.org/pdf/1106.0220