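Abstract (translated):
We present a suite of Dimension Independent Similarity Computation (DISCO) algorithms for computing all pairwise similarities between very high-dimensional sparse vectors. All of our results are independent of dimension: apart from the initial cost of reading in the data, every subsequent operation is independent of the dimension, so the dimension can be very large. We study the Cosine, Dice, Overlap, and Jaccard similarity measures. For Jaccard similarity, we include an improved version of MinHash. Our results are geared toward the MapReduce framework. We validate our theorems empirically at large scale using data from the social networking site Twitter. At the time of writing, our algorithms are running live in production at twitter.com.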
---
English title:
Dimension Independent Similarity Computation
---
Authors:
Reza Bosagh Zadeh, Ashish Goel
---
Year of latest submission:
2013
---
Classification:
Primary category: Computer Science
Secondary category: Data Structures and Algorithms
Category description: Covers data structures and analysis of algorithms. Roughly includes material in ACM Subject Classes E.1, E.2, F.2.1, and F.2.2.
--
Primary category: Computer Science
Secondary category: Artificial Intelligence
Category description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
--
Primary category: Computer Science
Secondary category: Distributed, Parallel, and Cluster Computing
Category description: Covers fault-tolerance, distributed algorithms, stability, parallel computation, and cluster computing. Roughly includes material in ACM Subject Classes C.1.2, C.1.4, C.2.4, D.1.3, D.4.5, D.4.7, E.1.
--
---
English abstract:
We present a suite of algorithms for Dimension Independent Similarity Computation (DISCO) to compute all pairwise similarities between very high dimensional sparse vectors. All of our results are provably independent of dimension, meaning apart from the initial cost of trivially reading in the data, all subsequent operations are independent of the dimension, thus the dimension can be very large. We study Cosine, Dice, Overlap, and the Jaccard similarity measures. For Jaccard similarity we include an improved version of MinHash. Our results are geared toward the MapReduce framework. We empirically validate our theorems at large scale using data from the social networking site Twitter. At time of writing, our algorithms are live in production at twitter.com.
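
As a rough illustration of the four measures named in the abstract, the sketch below uses the standard set-based definitions for binary sparse vectors, with each vector represented as the set of its nonzero indices. The function names and sample data are assumptions made for illustration. This is a brute-force computation for a single pair, not the DISCO sampling scheme or the improved MinHash described in the paper.

```python
from math import sqrt


def cosine(a: set, b: set) -> float:
    """Cosine similarity of two binary vectors given as sets of nonzero indices."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def dice(a: set, b: set) -> float:
    """Dice similarity: twice the intersection size over the sum of the set sizes."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def overlap(a: set, b: set) -> float:
    """Overlap similarity: intersection size over the size of the smaller set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: intersection size over union size."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical sample data: two sparse binary vectors sharing indices 3 and 4.
x = {1, 2, 3, 4}
y = {3, 4, 5}

for name, fn in [("cosine", cosine), ("dice", dice),
                 ("overlap", overlap), ("jaccard", jaccard)]:
    print(f"{name}: {fn(x, y):.3f}")
```

Each function touches only the nonzero indices, so the cost for one pair depends on the sparsity of the vectors rather than on the ambient dimension. Computing all pairs this way is still quadratic in the number of vectors, which is the cost that sampling-based schemes aim to reduce.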
---
PDF link:
https://arxiv.org/pdf/1206.2082









