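Abstract (translation):
The importance of document clustering is widely recognized for better management, smart navigation, efficient filtering, and concise summarization of large document collections such as the World Wide Web (WWW). The next challenge is to perform clustering based on the semantic content of the documents. The document clustering problem has two main components: (1) representing each document in a form that inherently captures the semantics of the text, which also helps to reduce the dimensionality of the document, and (2) defining a similarity measure on that semantic representation so that it assigns higher values to document pairs with a stronger semantic relationship. The feature space of documents is very challenging for clustering: a document may cover multiple topics, and it may contain a large set of class-independent general words alongside a small number of class-specific core words. Traditional clustering algorithms based on the Document Vector model (DVM) or the Suffix Tree model (STC) are less efficient at producing high-quality clusters. A document clustering approach based on a Topic Map representation is proposed, in which each document is transformed into a compact form. A similarity measure is proposed based on information inferred from the topic maps' data and structures. The method is implemented with agglomerative hierarchical clustering and tested on standard information retrieval (IR) datasets. Comparative experiments show that the approach is effective in improving cluster quality.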
---
English title:
Document Clustering based on Topic Maps
---
Authors:
Muhammad Rafi, M. Shahid Shaikh, Amir Farooq
---
Latest submission year:
2011
---
Classification:
Top-level category: Computer Science
Subcategory: Information Retrieval
Category description: Covers indexing, dictionaries, retrieval, content and analysis. Roughly includes material in ACM Subject Classes H.3.0, H.3.1, H.3.2, H.3.3, and H.3.4.
--
Top-level category: Computer Science
Subcategory: Artificial Intelligence
Category description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
--
---
English abstract:
The importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collections of documents such as the World Wide Web (WWW). The next challenge lies in performing clustering based on the semantic content of the documents. The problem of document clustering has two main components: (1) representing the document in a form that inherently captures the semantics of the text, which may also help to reduce the dimensionality of the document, and (2) defining a similarity measure based on the semantic representation such that it assigns higher numerical values to document pairs that have a stronger semantic relationship. The feature space of the documents can be very challenging for document clustering. A document may contain multiple topics, and it may contain a large set of class-independent general words and a handful of class-specific core words. With these features in mind, traditional agglomerative clustering algorithms, which are based on either the Document Vector model (DVM) or the Suffix Tree model (STC), are less efficient in producing results with high cluster quality. This paper introduces a new approach for document clustering based on the Topic Map representation of the documents. Each document is transformed into a compact form. A similarity measure is proposed based upon the information inferred from the topic maps' data and structures. The suggested method is implemented using agglomerative hierarchical clustering and tested on standard information retrieval (IR) datasets. The comparative experiment reveals that the proposed approach is effective in improving the cluster quality.
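Illustrative sketch (not taken from the paper): the abstract names agglomerative hierarchical clustering driven by a pairwise similarity measure, but it does not specify that measure. The minimal Python example below assumes each document has already been reduced to a set of topic names and uses Jaccard overlap as a stand-in similarity. The average-link merge rule, the function names, and the toy documents are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Stand-in similarity of two topic sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def average_link(c1, c2, docs):
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(jaccard(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(docs, num_clusters):
    """Repeatedly merge the two most similar clusters until num_clusters remain."""
    clusters = [[name] for name in docs]  # start with one cluster per document
    while len(clusters) > num_clusters:
        # combinations() yields i < j, so deleting index j keeps index i valid
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_link(clusters[p[0]], clusters[p[1]], docs),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy input: document id -> set of topic names (assumed representation)
docs = {
    "d1": {"clustering", "similarity", "documents"},
    "d2": {"clustering", "similarity", "topic maps"},
    "d3": {"football", "league", "goal"},
    "d4": {"football", "goal", "referee"},
}
result = agglomerative(docs, 2)
print(result)
```

On this toy input the two documents about clustering end up in one cluster and the two about football in the other. A similarity that also exploits topic map structure, as the abstract describes, could be substituted for `jaccard` without changing the merge loop.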
---
PDF link:
https://arxiv.org/pdf/1112.6219