Original: How to Build a Text Vector Space Model Using Hash Object Techniques

#11

jjtww posted on 2012-9-25 16:12:44


#12

小甲克虫 posted on 2012-11-5 22:40:45

Please add me on QQ, I have some questions I would like to ask you! 178684023

#13

jjtww posted on 2013-12-28 15:06:38

SAS makes this kind of low-level task look very complicated; in Python it is much simpler:


Output:

{'!': 0, '#': 0, '%': 0, '$': 0, '5': 0, '4': 0, '@': 0, 'a': 1, 'c': 1, 'b': 2, 'e': 0, 'd': 0, 'f': 0, 'i': 0, 'k': 0, 'j': 0, 'o': 0, 's': 0, 'r': 0, 'u': 0, 't': 0, 'w': 0, 'y': 0}
{'!': 0, '#': 0, '%': 0, '$': 0, '5': 0, '4': 0, '@': 0, 'a': 0, 'c': 2, 'b': 0, 'e': 0, 'd': 0, 'f': 0, 'i': 0, 'k': 0, 'j': 0, 'o': 0, 's': 0, 'r': 0, 'u': 0, 't': 0, 'w': 1, 'y': 0}
{'!': 0, '#': 0, '%': 0, '$': 0, '5': 0, '4': 0, '@': 0, 'a': 0, 'c': 0, 'b': 0, 'e': 0, 'd': 0, 'f': 0, 'i': 0, 'k': 0, 'j': 0, 'o': 0, 's': 0, 'r': 2, 'u': 0, 't': 1, 'w': 1, 'y': 0}
{'!': 0, '#': 0, '%': 0, '$': 0, '5': 1, '4': 1, '@': 0, 'a': 0, 'c': 0, 'b': 0, 'e': 1, 'd': 0, 'f': 0, 'i': 0, 'k': 0, 'j': 0, 'o': 0, 's': 0, 'r': 1, 'u': 0, 't': 0, 'w': 1, 'y': 0}
{'!': 0, '#': 0, '%': 0, '$': 0, '5': 0, '4': 0, '@': 0, 'a': 2, 'c': 0, 'b': 0, 'e': 0, 'd': 0, 'f': 2, 'i': 0, 'k': 0, 'j': 0, 'o': 0, 's': 1, 'r': 0, 'u': 0, 't': 0, 'w': 0, 'y': 0}
{'!': 0, '#': 0, '%': 0, '$': 0, '5': 0, '4': 0, '@': 0, 'a': 0, 'c': 0, 'b': 0, 'e': 0, 'd': 0, 'f': 0, 'i': 2, 'k': 1, 'j': 1, 'o': 0, 's': 0, 'r': 0, 'u': 1, 't': 1, 'w': 0, 'y': 1}
{'!': 0, '#': 0, '%': 0, '$': 0, '5': 0, '4': 0, '@': 0, 'a': 1, 'c': 0, 'b': 0, 'e': 0, 'd': 1, 'f': 2, 'i': 0, 'k': 0, 'j': 0, 'o': 1, 's': 1, 'r': 0, 'u': 0, 't': 0, 'w': 0, 'y': 0}
{'!': 1, '#': 1, '%': 1, '$': 1, '5': 0, '4': 0, '@': 1, 'a': 0, 'c': 0, 'b': 0, 'e': 0, 'd': 0, 'f': 0, 'i': 0, 'k': 0, 'j': 0, 'o': 0, 's': 0, 'r': 0, 'u': 0, 't': 0, 'w': 0, 'y': 0}
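
Each dictionary above reads like one row of a character-level vector space model: every document is counted against the same shared vocabulary, so a character that does not occur in a document still gets a 0. The following is only a minimal Python 3 sketch of that idea, not the code from this post. The helper names `build_vocabulary` and `char_vector` and the sample documents are assumptions made for illustration.

```python
def build_vocabulary(documents):
    """Collect every distinct character that occurs in any document."""
    vocabulary = set()
    for doc in documents:
        vocabulary.update(doc)
    return sorted(vocabulary)


def char_vector(doc, vocabulary):
    """Count how often each vocabulary character occurs in one document."""
    counts = dict.fromkeys(vocabulary, 0)  # start every dimension at 0
    for ch in doc:
        if ch in counts:
            counts[ch] += 1
    return counts


# Hypothetical sample documents, not the ones used in the thread.
documents = ["abbc", "ccw", "!@#"]
vocabulary = build_vocabulary(documents)
vectors = [char_vector(doc, vocabulary) for doc in documents]

for vector in vectors:
    print(vector)
```

Building a fresh dictionary for each document keeps the rows independent, so counts from one document cannot leak into the next.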


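Next, the Distinct function can be used to count how often each character occurs in an English novel.
As a simple test, I used the first paragraph of Chapter 1 of Ball-of-Fat (羊脂球):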

str2='For many days now the fag-end of the army had been straggling through the town．They were not troops，\
but a disbanded horde．The beards of the men were long and filthy，their uniforms in tatters，and they advanced \
at an easy pace without flag or regiment．All seemed worn-out and back-broken，incapable of a thought or a \
resolution，marching by habit solely， and falling from fatigue as soon as they stopped．In short，they were \
a mobilized，pacific people，bending under the weight of the gun；some little squads on the alert，easy to \
take alarm and prompt in enthusiasm，ready to attack or to flee；and in the midst of them，some red \
breeches，the remains of a division broken up in a great battle；some somber artillery men in line with \
these varied kinds of foot soldiers；and，sometimes the brilliant helmet of a dragoon on foot who followed \
with difficulty the shortest march of the lines．';
print Distinct(str2)


Output:

{'!': 1, ' ': 138, '#': 1, '%': 1, '$': 1, '-': 3, '\xac': 14, '\xae': 5, '5': 0, '4': 0, '\xbb': 4, 'A': 1, '@': 1, 'F': 1, 'I': 1, '\xa3': 23, 'T': 2, 'a': 58, 'c': 11, 'b': 16, 'e': 87, 'd': 33, 'g': 17, 'f': 23, 'i': 44, 'h': 40, 'k': 6, 'j': 0, 'm': 24, 'l': 31, 'o': 61, 'n': 50, 'q': 1, 'p': 11, 's': 38, 'r': 42, 'u': 14, 't': 67, 'w': 12, 'v': 3, 'y': 15, 'z': 1}


As the output shows, the most frequent character in the passage is the space, with 138 occurrences, followed by the letter e, with 87.
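
The keys '\xa3', '\xac', '\xae' and '\xbb' in the output appear to come from the full-width punctuation (，．；) being counted byte by byte in a Python 2 byte string: the '\xa3' count of 23 equals 14 + 5 + 4, the sum of the other three. The sketch below shows one way to avoid that in Python 3, where strings are Unicode and each full-width mark is a single character. It assumes the standard-library `collections.Counter` in place of the Distinct function from this thread, and the helper name `char_frequencies` is hypothetical.

```python
from collections import Counter


def char_frequencies(text):
    """Return a Counter mapping each character in text to its count."""
    return Counter(text)


# Short excerpt from the passage, ending in one full-width period.
excerpt = ("For many days now the fag-end of the army had been "
           "straggling through the town．")
freq = char_frequencies(excerpt)

# The single most frequent character and its count.
most_common_char, most_common_count = freq.most_common(1)[0]
print(repr(most_common_char), most_common_count)

# The full-width period is counted once, as one character.
print(freq["．"])
```

`Counter.most_common(n)` returns the n most frequent characters with their counts, which gives the ranking described above without sorting the dictionary by hand.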
