[程序分享] 一个程序的比较(revisit) [推广有奖]

11楼

jingju11 发表于 2014-9-11 07:21:17

playmore 发表于 2014-9-10 14:39
对，基本上就是这个算法
但我这个算法的问题是稀疏矩阵+矩阵乘法
直接用R的话，10000×10000基本上就是 ...

http://blog.sina.com.cn/s/blog_a3a926360102v0w6.html

Thank you for raising the question.
Sometimes we call it 'Curse of Dimensionality'. Because of it, we come accross some computation problems.
As my blog shown, the maximum matrix dimension cannot exceed 20,000 X 1,000 (basically commensurate with 3.2 GB in SAS) in this case, even though the code was submitted on server and the compuation is fairly simple as per below:

The run-time was compared between matrix (FCMP) and method 1(DATA).
when n =16,250, the matrix cannnot be computed because of its dimension. On the other hand, the compuation is very fast for using simple formula.

JingJu

http://blog.sina.com.cn/s/blog_a3a926360102v0w6.html

已有 1 人评分	经验	论坛币	收起理由
crackman	+ 100	+ 100	热心帮助其他会员

总评分: 经验 + 100 论坛币 + 100 查看全部评分

12楼

playmore 发表于 2014-9-11 08:28:30

jingju11 发表于 2014-9-11 07:21
http://blog.sina.com.cn/s/blog_a3a926360102v0w6.html

Thank you for raising the questi ...

多谢京剧大哥，
没有想到用fcmp里的array完成矩阵运算速度还挺快

13楼

何必不淡定。 发表于 2014-9-11 16:57:58

mark xie xie louzhu

14楼

jingju11 发表于 2014-9-12 10:32:31

playmore 发表于 2014-9-11 08:28
多谢京剧大哥，
没有想到用fcmp里的array完成矩阵运算速度还挺快

Someone points out that XXT matrix is symmetric and thus the last transpose is redundant. This is true. Moreover, we can read out cross product matrix from some procedures, such as reg or corr, directly.
JingJu

15楼

ilovekate 发表于 2014-9-12 10:45:58

这个实在是经典。赞一下。

16楼

jingju11 发表于 2014-9-13 05:27:02

ilovekate 发表于 2014-9-12 10:45
这个实在是经典。赞一下。

Thanks.
This formula was more straightforward.

JingJu

17楼

jingju11 发表于 2014-9-13 06:20:19

jingju11 发表于 2014-9-13 05:27
Thanks.
This formula was more straightforward.

这个主题探索至此也差不多了.非常感谢大家的参与和建议。在这里简单总结以下:
这个数据的特点是:较多的不同的书目,在每个书目之下, 作者的个数相对较少. 比如<20.

方法 1 & 2 因为方案本身的构建而局限其效率。当数据增大时,效率较差,尤其是在行数较大的情况之下.
方法论3 效率明显较好, 较大的行数并不过多影响效率.
利用巨阵来计算的方法值得探索.但是通常受制于巨阵的维度而造成对内存的过度消耗.
以下是时间消耗的对比：方法1/方法3。运行时间主要考虑最重要的因素。因此具备很大的近似型。图示的横轴表示记录数，竖轴表示方法1 和方法3 运行时间的比值的对数值。竖轴0 表示两种方法的运行时间相等(log(1) = 0)。不同的线条表示不同的列数（每个题目的最大作者数）。保持同样的列数，随着记录数的增大，方法1消耗更多的运行时间比例增大。较大的列数使得方法3的效率降低。如果列数增大到30以上，方法1的效率或许高于方法3。如果列数超过35，方法3 几乎无法计算（2**30是个很大的数字）。

京剧

18楼

何必不淡定。 发表于 2014-9-15 11:37:25

谢谢楼主，对于method2，我提出一点我的疑问。
我觉得这个算法计算的cnt是任意两个不同title下相同name的个数，所以最后的max(cnt)应该是在两个不同title下存在最多相同的name时cnt的个数，所以计算的应该是两个title（titleA和titleB）之间作者重复的最大数；而这个问题要求却是有每个title多少作者重复，所以当存在titleC和titleA重复而与titleB不重复时，这个算法就可能不够准确。
不知道我的理解哪里存在漏洞，请楼主指正。谢谢：）

19楼

jingju11 发表于 2014-9-16 00:51:02

何必不淡定。发表于 2014-9-15 11:37
谢谢楼主，对于method2，我提出一点我的疑问。
我觉得这个算法计算的cnt是任意两个不同title下相同name的个 ...

原题的要求是:对于某个TITLE的作者里,在其他TITLE里再次合作的最大人数.
title1 have authors A B C.
title2 have authors A B C D.
title3 have authors A B E F.
then for title1, we look at title2 first. all A B C show at title2 then max-common-authors = 3; then we llook at title1's authors in title3, the number is 2 (A & B). so the result = max(3, 2) = 3;
same for title2, the results =max(3 in title1, 2 in title3) = 3;
for title3, the results =max(2 in title1, 2 in title2) = 2.
...
所以方法2应该是正确的.在所有的方法里,只有方法3 的思路略有不同.方法1和方法2的不同点在于方法1使用数据步,方法2 使用sql.而在距阵运算里,也是在找这个最大重复的值.因此这些方法的效率较差因为:假如10000个titles , 针对某个title ,必须要在 (10000-1)个title里寻找,切不说每个title包含不同的作者,增加了更多的循环.方法3先对title里的所有作者做组合.如果作者数较少,就比较快速.但是如果某个题目包含30个作者(虽然可能性不大),所造成的组合数是2**30-1 =1,000,000,000.这个程序也将不胜重负.
京剧
JingJu