Original poster: nkwilling

[Original] How to build a text vector space model with hash object techniques

11
jjtww posted on 2012-9-25 16:12:44
%macro test;
  %LET ST = AK AZ CA CO CT DC FL GA HI IA IL IN KS MA MD ME MI MS MT MO NE;
  /* Count participating states: number of blanks + 1 */
  %LET E = %EVAL(%SYSFUNC(COUNT(&ST,%STR( )))+1);

  %DO A = 1 %TO &E;  /* loop state by state */
    /* Derive the FIPS code, state abbreviation and state name */
    %LET STATE = %SCAN(&ST,&A);
    %LET STFIPC = %SYSFUNC(STFIPS(&STATE));
    %LET STSTR = %SYSFUNC(FIPSTATE(&STFIPC));
    %LET STNAME = %SYSFUNC(FIPNAME(&STFIPC));

    %put &state &stfipc &ststr &stname;
  %end;
%mend;

%test;

12
小甲克虫 posted on 2012-11-5 22:40:45
Please add me on QQ, I have a question I would like to ask you! 178684023

13
jjtww posted on 2013-12-28 15:06:38
SAS gets quite complicated when handling this kind of low-level problem; Python makes it much simpler:
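def Distinct(strings, w=None):
    """Count how often each character occurs in strings; update and return w."""
    token = {} if w is None else w   # fresh dict per call instead of a shared w={} default
    for s in strings:
        for ch in s:
            token[ch] = token.get(ch, 0) + 1
    return token

words = ['cbba', 'wcc', 'rrtw', '45rwe', 'afsaf', 'uityikj', 'fdasfo', '!@#$%']
vocab = Distinct(words)              # every character that occurs in any string
for word in words:
    w = dict.fromkeys(vocab, 0)      # zero vector over the full vocabulary
    print(Distinct(word, w))         # counts for this string only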

Output (key order as printed by Python 2):
{'!': 0, '#': 0, '%': 0, '$': 0, '5': 0, '4': 0, '@': 0, 'a': 1, 'c': 1, 'b': 2, 'e': 0, 'd': 0, 'f': 0, 'i': 0, 'k': 0, 'j': 0, 'o': 0, 's': 0, 'r': 0, 'u': 0, 't': 0, 'w': 0, 'y': 0}
{'!': 0, '#': 0, '%': 0, '$': 0, '5': 0, '4': 0, '@': 0, 'a': 0, 'c': 2, 'b': 0, 'e': 0, 'd': 0, 'f': 0, 'i': 0, 'k': 0, 'j': 0, 'o': 0, 's': 0, 'r': 0, 'u': 0, 't': 0, 'w': 1, 'y': 0}
{'!': 0, '#': 0, '%': 0, '$': 0, '5': 0, '4': 0, '@': 0, 'a': 0, 'c': 0, 'b': 0, 'e': 0, 'd': 0, 'f': 0, 'i': 0, 'k': 0, 'j': 0, 'o': 0, 's': 0, 'r': 2, 'u': 0, 't': 1, 'w': 1, 'y': 0}
{'!': 0, '#': 0, '%': 0, '$': 0, '5': 1, '4': 1, '@': 0, 'a': 0, 'c': 0, 'b': 0, 'e': 1, 'd': 0, 'f': 0, 'i': 0, 'k': 0, 'j': 0, 'o': 0, 's': 0, 'r': 1, 'u': 0, 't': 0, 'w': 1, 'y': 0}
{'!': 0, '#': 0, '%': 0, '$': 0, '5': 0, '4': 0, '@': 0, 'a': 2, 'c': 0, 'b': 0, 'e': 0, 'd': 0, 'f': 2, 'i': 0, 'k': 0, 'j': 0, 'o': 0, 's': 1, 'r': 0, 'u': 0, 't': 0, 'w': 0, 'y': 0}
{'!': 0, '#': 0, '%': 0, '$': 0, '5': 0, '4': 0, '@': 0, 'a': 0, 'c': 0, 'b': 0, 'e': 0, 'd': 0, 'f': 0, 'i': 2, 'k': 1, 'j': 1, 'o': 0, 's': 0, 'r': 0, 'u': 1, 't': 1, 'w': 0, 'y': 1}
{'!': 0, '#': 0, '%': 0, '$': 0, '5': 0, '4': 0, '@': 0, 'a': 1, 'c': 0, 'b': 0, 'e': 0, 'd': 1, 'f': 2, 'i': 0, 'k': 0, 'j': 0, 'o': 1, 's': 1, 'r': 0, 'u': 0, 't': 0, 'w': 0, 'y': 0}
{'!': 1, '#': 1, '%': 1, '$': 1, '5': 0, '4': 0, '@': 1, 'a': 0, 'c': 0, 'b': 0, 'e': 0, 'd': 0, 'f': 0, 'i': 0, 'k': 0, 'j': 0, 'o': 0, 's': 0, 'r': 0, 'u': 0, 't': 0, 'w': 0, 'y': 0}
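Each dictionary printed above is effectively one row of a character-count matrix, which is the vector space model this thread is about, with characters standing in for terms. The following is only a sketch of that idea using `collections.Counter` from the standard library; the names `docs`, `vocab`, `to_vector` and `matrix` are made up for this illustration and do not come from the original post.

```python
from collections import Counter

docs = ['cbba', 'wcc', 'rrtw', '45rwe', 'afsaf', 'uityikj', 'fdasfo', '!@#$%']

# Vocabulary: every distinct character, in a fixed (sorted) order, so that
# position k refers to the same character in every vector.
vocab = sorted(set(''.join(docs)))

def to_vector(doc, vocab):
    """Return the character counts of doc as a list aligned with vocab."""
    counts = Counter(doc)
    return [counts[ch] for ch in vocab]  # a Counter yields 0 for missing keys

# One row per string, one column per vocabulary character.
matrix = [to_vector(doc, vocab) for doc in docs]

print(vocab)
for doc, row in zip(docs, matrix):
    print(doc, row)
```

Storing the rows as lists in a fixed column order, rather than as dictionaries, makes them directly comparable, for example when computing distances or similarities between strings.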


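Next, the Distinct function can be used to count how often each character occurs in an English novel.
As a simple test, here is the first paragraph of the first chapter of Boule de Suif (Ball of Fat):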


str2 = ('For many days now the fag-end of the army had been straggling through the town.They were not troops,'
        'but a disbanded horde.The beards of the men were long and filthy,their uniforms in tatters,and they advanced '
        'at an easy pace without flag or regiment.All seemed worn-out and back-broken,incapable of a thought or a '
        'resolution,marching by habit solely, and falling from fatigue as soon as they stopped.In short,they were '
        'a mobilized,pacific people,bending under the weight of the gun;some little squads on the alert,easy to '
        'take alarm and prompt in enthusiasm,ready to attack or to flee;and in the midst of them,some red '
        'breeches,the remains of a division broken up in a great battle;some somber artillery men in line with '
        'these varied kinds of foot soldiers;and,sometimes the brilliant helmet of a dragoon on foot who followed '
        'with difficulty the shortest march of the lines.')
# Adjacent string literals are concatenated, so no backslash continuations are needed.
print(Distinct(str2))


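Output:

{'!': 1, ' ': 138, '#': 1, '%': 1, '$': 1, '-': 3, '\xac': 14, '\xae': 5, '5': 0, '4': 0, '\xbb': 4, 'A': 1, '@': 1, 'F': 1, 'I': 1, '\xa3': 23, 'T': 2, 'a': 58, 'c': 11, 'b': 16, 'e': 87, 'd': 33, 'g': 17, 'f': 23, 'i': 44, 'h': 40, 'k': 6, 'j': 0, 'm': 24, 'l': 31, 'o': 61, 'n': 50, 'q': 1, 'p': 11, 's': 38, 'r': 42, 'u': 14, 't': 67, 'w': 12, 'v': 3, 'y': 15, 'z': 1}

Two things in this output do not come from the passage itself. The counts of 1 for '!', '#', '%', '$' and '@', and the zero entries for '5', '4' and 'j', are carried over from the previous test, because a default dictionary written as w={} is shared between calls. The '\xa3', '\xac', '\xae' and '\xbb' keys appear to be the bytes of full-width punctuation (comma, period, semicolon) in the text as originally pasted, which Python 2 counts byte by byte.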
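Stray keys in a result like the one above are characteristic of a mutable default argument: Python evaluates a default such as `w={}` once, when the function is defined, so every call that omits `w` works on the same dictionary. Below is a minimal sketch of the effect and of the usual `None` idiom that avoids it; the function names `count_shared` and `count_fresh` are hypothetical and exist only for this demonstration.

```python
def count_shared(text, counts={}):
    # The default dict is created once and reused by every call that omits counts.
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    return counts

def count_fresh(text, counts=None):
    # A new dict is created on each call unless the caller supplies one.
    if counts is None:
        counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    return counts

# Copy the shared results, because the underlying default dict keeps changing.
first_shared = dict(count_shared("ab"))
second_shared = dict(count_shared("b"))   # still contains 'a' from the first call

first_fresh = count_fresh("ab")
second_fresh = count_fresh("b")           # only what this call counted

print(second_shared)  # {'a': 1, 'b': 2}
print(second_fresh)   # {'b': 1}
```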


As the output shows, the most frequent character in the passage is the space, with 138 occurrences, followed by the letter e, with 87.
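Picking out the most frequent characters by eye gets tedious for longer texts. As a sketch, `collections.Counter` from the standard library can rank them directly with `most_common`. The `sample` string below is just the first sentence of the passage, and the variable names are illustrative rather than taken from the original post.

```python
from collections import Counter

sample = 'For many days now the fag-end of the army had been straggling through the town.'

# Counter counts every character of the string, including spaces and punctuation.
counts = Counter(sample)

# most_common(n) returns the n (character, count) pairs with the highest counts.
for ch, n in counts.most_common(3):
    print(repr(ch), n)
```

On this sample the top entry is again the space character; running the same two calls on the full passage would give the ranking for the whole paragraph.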
