OP: ReneeBK

Programming Collective Intelligence


Programming Collective Intelligence: Building Smart Web 2.0 Applications
Toby Segaran



Want to tap the power behind search rankings, product recommendations, social bookmarking, and online matchmaking? This fascinating book demonstrates how you can build Web 2.0 applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you've found it.

Programming Collective Intelligence takes you into the world of machine learning and statistics, and explains how to draw conclusions about user experience, marketing, personal tastes, and human behavior in general -- all from information that you and others collect every day. Each algorithm is described clearly and concisely with code that can immediately be used on your web site, blog, Wiki, or specialized application. This book explains:

  • Collaborative filtering techniques that enable online retailers to recommend products or media
  • Methods of clustering to detect groups of similar items in a large dataset
  • Search engine features -- crawlers, indexers, query engines, and the PageRank algorithm
  • Optimization algorithms that search millions of possible solutions to a problem and choose the best one
  • Bayesian filtering, used in spam filters for classifying documents based on word types and other features
  • Using decision trees not only to make predictions, but to model the way decisions are made
  • Predicting numerical values rather than classifications to build price models
  • Support vector machines to match people in online dating sites
  • Non-negative matrix factorization to find the independent features in a dataset
  • Evolving intelligence for problem solving -- how a computer develops its skill by improving its own code the more it plays a game
Each chapter includes exercises for extending the algorithms to make them more powerful. Go beyond simple database-backed applications and put the wealth of Internet data to work for you.

"Bravo! I cannot think of a better way for a developer to first learn these algorithms and methods, nor can I think of a better way for me (an old AI dog) to reinvigorate my knowledge of the details."
-- Dan Russell, Google

"Toby's book does a great job of breaking down the complex subject matter of machine-learning algorithms into practical, easy-to-understand examples that can be directly applied to analysis of social interaction across the Web today. If I had this book two years ago, it would have saved precious time going down some fruitless paths."
-- Tim Wolters, CTO, Collective Intellect

Product Details
  • Paperback: 362 pages
  • Publisher: O'Reilly Media; 1 edition (Aug. 26 2007)
  • Language: English
  • ISBN-10: 0596529325
  • ISBN-13: 978-0596529321

Hidden content in this post

Programming Collective Intelligence.rar (2.51 MB, price: 20 forum coins). This attachment includes:
  • Programming Collective Intelligence.pdf







#2 sunyiping, posted 2014-8-29 09:05:31

Chapter 2: Making Recommendations

# A dictionary of movie critics and their ratings of a small
# set of movies
critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
 'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,
 'The Night Listener': 3.0},
'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
 'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,
 'You, Me and Dupree': 3.5},
'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
 'Superman Returns': 3.5, 'The Night Listener': 4.0},
'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
 'The Night Listener': 4.5, 'Superman Returns': 4.0,
 'You, Me and Dupree': 2.5},
'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
 'You, Me and Dupree': 2.0},
'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
 'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}


from math import sqrt

# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs,person1,person2):
  # Get the list of shared_items
  si={}
  for item in prefs[person1]:
    if item in prefs[person2]: si[item]=1

  # if they have no ratings in common, return 0
  if len(si)==0: return 0

  # Add up the squares of all the differences
  sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2)
                      for item in prefs[person1] if item in prefs[person2]])

  return 1/(1+sum_of_squares)

# Returns the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs,p1,p2):
  # Get the list of mutually rated items
  si={}
  for item in prefs[p1]:
    if item in prefs[p2]: si[item]=1

  # if they have no ratings in common, return 0
  if len(si)==0: return 0

  # Sum calculations
  n=len(si)

  # Sums of all the preferences
  sum1=sum([prefs[p1][it] for it in si])
  sum2=sum([prefs[p2][it] for it in si])

  # Sums of the squares
  sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
  sum2Sq=sum([pow(prefs[p2][it],2) for it in si])

  # Sum of the products
  pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])

  # Calculate r (Pearson score)
  num=pSum-(sum1*sum2/n)
  den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
  if den==0: return 0

  r=num/den

  return r

# Returns the best matches for person from the prefs dictionary.
# Number of results and similarity function are optional params.
def topMatches(prefs,person,n=5,similarity=sim_pearson):
  scores=[(similarity(prefs,person,other),other)
                  for other in prefs if other!=person]
  scores.sort()
  scores.reverse()
  return scores[0:n]

# Gets recommendations for a person by using a weighted average
# of every other user's rankings
def getRecommendations(prefs,person,similarity=sim_pearson):
  totals={}
  simSums={}
  for other in prefs:
    # don't compare me to myself
    if other==person: continue
    sim=similarity(prefs,person,other)

    # ignore scores of zero or lower
    if sim<=0: continue
    for item in prefs[other]:
      # only score movies I haven't seen yet
      if item not in prefs[person] or prefs[person][item]==0:
        # Similarity * Score
        totals.setdefault(item,0)
        totals[item]+=prefs[other][item]*sim
        # Sum of similarities
        simSums.setdefault(item,0)
        simSums[item]+=sim

  # Create the normalized list
  rankings=[(total/simSums[item],item) for item,total in totals.items()]

  # Return the sorted list
  rankings.sort()
  rankings.reverse()
  return rankings

def transformPrefs(prefs):
  result={}
  for person in prefs:
    for item in prefs[person]:
      result.setdefault(item,{})

      # Flip item and person
      result[item][person]=prefs[person][item]
  return result


def calculateSimilarItems(prefs,n=10):
  # Create a dictionary of items showing which other items they
  # are most similar to.
  result={}
  # Invert the preference matrix to be item-centric
  itemPrefs=transformPrefs(prefs)
  c=0
  for item in itemPrefs:
    # Status updates for large datasets
    c+=1
    if c%100==0: print "%d / %d" % (c,len(itemPrefs))
    # Find the most similar items to this one
    scores=topMatches(itemPrefs,item,n=n,similarity=sim_distance)
    result[item]=scores
  return result

def getRecommendedItems(prefs,itemMatch,user):
  userRatings=prefs[user]
  scores={}
  totalSim={}
  # Loop over items rated by this user
  for (item,rating) in userRatings.items():

    # Loop over items similar to this one
    for (similarity,item2) in itemMatch[item]:

      # Ignore if this user has already rated this item
      if item2 in userRatings: continue
      # Weighted sum of rating times similarity
      scores.setdefault(item2,0)
      scores[item2]+=similarity*rating
      # Sum of all the similarities
      totalSim.setdefault(item2,0)
      totalSim[item2]+=similarity

  # Divide each total score by total weighting to get an average
  rankings=[(score/totalSim[item],item) for item,score in scores.items()]

  # Return the rankings from highest to lowest
  rankings.sort()
  rankings.reverse()
  return rankings

def loadMovieLens(path='/data/movielens'):
  # Get movie titles
  movies={}
  for line in open(path+'/u.item'):
    (id,title)=line.split('|')[0:2]
    movies[id]=title

  # Load data
  prefs={}
  for line in open(path+'/u.data'):
    (user,movieid,rating,ts)=line.split('\t')
    prefs.setdefault(user,{})
    prefs[user][movies[movieid]]=float(rating)
  return prefs
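A quick sanity check, following the book's own walkthrough and assuming the code above is saved as recommendations.py (the filename is my assumption; the critics dictionary is the book's sample data):

import recommendations

# Pearson similarity between two critics
print recommendations.sim_pearson(recommendations.critics,'Lisa Rose','Gene Seymour')

# Critics most similar to Toby
print recommendations.topMatches(recommendations.critics,'Toby',n=3)

# User-based recommendations for Toby
print recommendations.getRecommendations(recommendations.critics,'Toby')

# Item-based recommendations: build the item-similarity dataset first
itemsim=recommendations.calculateSimilarItems(recommendations.critics)
print recommendations.getRecommendedItems(recommendations.critics,itemsim,'Toby')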


#3 vaster, posted 2014-8-29 09:34:41

Chapter 2: Making Recommendations

from pydelicious import get_popular,get_userposts,get_urlposts
import time

def initializeUserDict(tag,count=5):
  user_dict={}
  # get the top `count` popular posts
  for p1 in get_popular(tag=tag)[0:count]:
    # find all users who posted this
    for p2 in get_urlposts(p1['href']):
      user=p2['user']
      user_dict[user]={}
  return user_dict

def fillItems(user_dict):
  all_items={}
  # Find links posted by all users
  for user in user_dict:
    for i in range(3):
      try:
        posts=get_userposts(user)
        break
      except:
        print "Failed user "+user+", retrying"
        time.sleep(4)
    for post in posts:
      url=post['href']
      user_dict[user][url]=1.0
      all_items[url]=1

  # Fill in missing items with 0
  for ratings in user_dict.values():
    for item in all_items:
      if item not in ratings:
        ratings[item]=0.0
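The book feeds the filled user dictionary straight into the Chapter 2 recommender. A sketch, assuming the code above is saved as deliciousrec.py, that pydelicious is installed and can still reach the (now long-defunct) del.icio.us API, and that 'programming' is just an example tag:

import deliciousrec
import recommendations   # the Chapter 2 module from reply #2

delusers=deliciousrec.initializeUserDict('programming')
deliciousrec.fillItems(delusers)

# Pick any user and look for similar bookmarkers and recommended links
user=delusers.keys()[0]
print recommendations.topMatches(delusers,user)
print recommendations.getRecommendations(delusers,user)[0:10]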


#4 找媳妇, posted 2014-8-29 10:05:47

Chapter 3: Discovering Groups

import feedparser
import re

# Returns title and dictionary of word counts for an RSS feed
def getwordcounts(url):
  # Parse the feed
  d=feedparser.parse(url)
  wc={}

  # Loop over all the entries
  for e in d.entries:
    if 'summary' in e: summary=e.summary
    else: summary=e.description

    # Extract a list of words
    words=getwords(e.title+' '+summary)
    for word in words:
      wc.setdefault(word,0)
      wc[word]+=1
  return d.feed.title,wc

def getwords(html):
  # Remove all the HTML tags
  txt=re.compile(r'<[^>]+>').sub('',html)

  # Split words by all non-alpha characters
  words=re.compile(r'[^A-Z^a-z]+').split(txt)

  # Convert to lowercase
  return [word.lower() for word in words if word!='']


apcount={}
wordcounts={}
feedlist=[line for line in file('feedlist.txt')]
for feedurl in feedlist:
  try:
    title,wc=getwordcounts(feedurl)
    wordcounts[title]=wc
    for word,count in wc.items():
      apcount.setdefault(word,0)
      if count>1:
        apcount[word]+=1
  except:
    print 'Failed to parse feed %s' % feedurl

wordlist=[]
for w,bc in apcount.items():
  frac=float(bc)/len(feedlist)
  if frac>0.1 and frac<0.5:
    wordlist.append(w)

out=file('blogdata1.txt','w')
out.write('Blog')
for word in wordlist: out.write('\t%s' % word)
out.write('\n')
for blog,wc in wordcounts.items():
  print blog
  out.write(blog)
  for word in wordlist:
    if word in wc: out.write('\t%d' % wc[word])
    else: out.write('\t0')
  out.write('\n')
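For reference, feedlist.txt is simply one RSS feed URL per line, and the script writes blogdata1.txt as a tab-separated matrix with one blog per row and one surviving word per column; the clustering code in reply #6 reads this file back in with readfile(). The first lines look roughly like this (blog names and counts here are only illustrative):

Blog        china   kids    music   yahoo
Gothamist   0       3       3       0
GigaOM      6       0       0       2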


#5 jojoan719, posted 2014-8-29 10:07:58

Chapter 3: Discovering Groups

from BeautifulSoup import BeautifulSoup
import urllib2
import re
chare=re.compile(r'[!-\.&]')
itemowners={}

# Words to remove
dropwords=['a','new','some','more','my','own','the','many','other','another']

currentuser=0
for i in range(1,51):
  # URL for the want search page
  c=urllib2.urlopen(
  'http://member.zebo.com/Main?event_key=USERSEARCH&wiowiw=wiw&keyword=car&page=%d'
  % (i))
  soup=BeautifulSoup(c.read())
  for td in soup('td'):
    # Find table cells of bgverdanasmall class
    if ('class' in dict(td.attrs) and td['class']=='bgverdanasmall'):
      items=[re.sub(chare,'',str(a.contents[0]).lower()).strip() for a in td('a')]
      for item in items:
        # Remove extra words
        txt=' '.join([t for t in item.split(' ') if t not in dropwords])
        if len(txt)<2: continue
        itemowners.setdefault(txt,{})
        itemowners[txt][currentuser]=1
      currentuser+=1

out=file('zebo.txt','w')
out.write('Item')
for user in range(0,currentuser): out.write('\tU%d' % user)
out.write('\n')
for item,owners in itemowners.items():
  if len(owners)>10:
    out.write(item)
    for user in range(0,currentuser):
      if user in owners: out.write('\t1')
      else: out.write('\t0')
    out.write('\n')
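The resulting zebo.txt has the same tab-separated layout (items as rows, users U0..Un as columns, 1 where a user listed the item). Because the data is binary, the book clusters it with the Tanimoto-based distance instead of Pearson; a sketch, assuming the clustering code in reply #6 is saved as clusters.py:

import clusters

wants,people,data=clusters.readfile('zebo.txt')
clust=clusters.hcluster(data,distance=clusters.tanamoto)
clusters.printclust(clust,labels=wants)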


#6 oink-oink, posted 2014-8-29 10:27:50

Chapter 3: Discovering Groups

from PIL import Image,ImageDraw

def readfile(filename):
  lines=[line for line in file(filename)]

  # First line is the column titles
  colnames=lines[0].strip().split('\t')[1:]
  rownames=[]
  data=[]
  for line in lines[1:]:
    p=line.strip().split('\t')
    # First column in each row is the rowname
    rownames.append(p[0])
    # The data for this row is the remainder of the row
    data.append([float(x) for x in p[1:]])
  return rownames,colnames,data


from math import sqrt

def pearson(v1,v2):
  # Simple sums
  sum1=sum(v1)
  sum2=sum(v2)

  # Sums of the squares
  sum1Sq=sum([pow(v,2) for v in v1])
  sum2Sq=sum([pow(v,2) for v in v2])

  # Sum of the products
  pSum=sum([v1[i]*v2[i] for i in range(len(v1))])

  # Calculate r (Pearson score)
  num=pSum-(sum1*sum2/len(v1))
  den=sqrt((sum1Sq-pow(sum1,2)/len(v1))*(sum2Sq-pow(sum2,2)/len(v1)))
  if den==0: return 0

  return 1.0-num/den

class bicluster:
  def __init__(self,vec,left=None,right=None,distance=0.0,id=None):
    self.left=left
    self.right=right
    self.vec=vec
    self.id=id
    self.distance=distance

def hcluster(rows,distance=pearson):
  distances={}
  currentclustid=-1

  # Clusters are initially just the rows
  clust=[bicluster(rows[i],id=i) for i in range(len(rows))]

  while len(clust)>1:
    lowestpair=(0,1)
    closest=distance(clust[0].vec,clust[1].vec)

    # loop through every pair looking for the smallest distance
    for i in range(len(clust)):
      for j in range(i+1,len(clust)):
        # distances is the cache of distance calculations
        if (clust[i].id,clust[j].id) not in distances:
          distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)

        d=distances[(clust[i].id,clust[j].id)]

        if d<closest:
          closest=d
          lowestpair=(i,j)

    # calculate the average of the two clusters
    mergevec=[
    (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
    for i in range(len(clust[0].vec))]

    # create the new cluster
    newcluster=bicluster(mergevec,left=clust[lowestpair[0]],
                         right=clust[lowestpair[1]],
                         distance=closest,id=currentclustid)

    # cluster ids that weren't in the original set are negative
    currentclustid-=1
    del clust[lowestpair[1]]
    del clust[lowestpair[0]]
    clust.append(newcluster)

  return clust[0]

def printclust(clust,labels=None,n=0):
  # indent to make a hierarchy layout
  for i in range(n): print ' ',
  if clust.id<0:
    # negative id means that this is branch
    print '-'
  else:
    # positive id means that this is an endpoint
    if labels==None: print clust.id
    else: print labels[clust.id]

  # now print the right and left branches
  if clust.left!=None: printclust(clust.left,labels=labels,n=n+1)
  if clust.right!=None: printclust(clust.right,labels=labels,n=n+1)

def getheight(clust):
  # Is this an endpoint? Then the height is just 1
  if clust.left==None and clust.right==None: return 1

  # Otherwise the height is the sum of the heights of
  # each branch
  return getheight(clust.left)+getheight(clust.right)

def getdepth(clust):
  # The distance of an endpoint is 0.0
  if clust.left==None and clust.right==None: return 0

  # The distance of a branch is the greater of its two sides
  # plus its own distance
  return max(getdepth(clust.left),getdepth(clust.right))+clust.distance


def drawdendrogram(clust,labels,jpeg='clusters.jpg'):
  # height and width
  h=getheight(clust)*20
  w=1200
  depth=getdepth(clust)

  # width is fixed, so scale distances accordingly
  scaling=float(w-150)/depth

  # Create a new image with a white background
  img=Image.new('RGB',(w,h),(255,255,255))
  draw=ImageDraw.Draw(img)

  draw.line((0,h/2,10,h/2),fill=(255,0,0))

  # Draw the first node
  drawnode(draw,clust,10,(h/2),scaling,labels)
  img.save(jpeg,'JPEG')

def drawnode(draw,clust,x,y,scaling,labels):
  if clust.id<0:
    h1=getheight(clust.left)*20
    h2=getheight(clust.right)*20
    top=y-(h1+h2)/2
    bottom=y+(h1+h2)/2
    # Line length
    ll=clust.distance*scaling
    # Vertical line from this cluster to children
    draw.line((x,top+h1/2,x,bottom-h2/2),fill=(255,0,0))

    # Horizontal line to left item
    draw.line((x,top+h1/2,x+ll,top+h1/2),fill=(255,0,0))

    # Horizontal line to right item
    draw.line((x,bottom-h2/2,x+ll,bottom-h2/2),fill=(255,0,0))

    # Call the function to draw the left and right nodes
    drawnode(draw,clust.left,x+ll,top+h1/2,scaling,labels)
    drawnode(draw,clust.right,x+ll,bottom-h2/2,scaling,labels)
  else:
    # If this is an endpoint, draw the item label
    draw.text((x+5,y-7),labels[clust.id],(0,0,0))

def rotatematrix(data):
  newdata=[]
  for i in range(len(data[0])):
    newrow=[data[j][i] for j in range(len(data))]
    newdata.append(newrow)
  return newdata

import random

def kcluster(rows,distance=pearson,k=4):
  # Determine the minimum and maximum values for each point
  ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
  for i in range(len(rows[0]))]

  # Create k randomly placed centroids
  clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0]
  for i in range(len(rows[0]))] for j in range(k)]

  lastmatches=None
  for t in range(100):
    print 'Iteration %d' % t
    bestmatches=[[] for i in range(k)]

    # Find which centroid is the closest for each row
    for j in range(len(rows)):
      row=rows[j]
      bestmatch=0
      for i in range(k):
        d=distance(clusters[i],row)
        if d<distance(clusters[bestmatch],row): bestmatch=i
      bestmatches[bestmatch].append(j)

    # If the results are the same as last time, this is complete
    if bestmatches==lastmatches: break
    lastmatches=bestmatches

    # Move the centroids to the average of their members
    for i in range(k):
      avgs=[0.0]*len(rows[0])
      if len(bestmatches[i])>0:
        for rowid in bestmatches[i]:
          for m in range(len(rows[rowid])):
            avgs[m]+=rows[rowid][m]
        for j in range(len(avgs)):
          avgs[j]/=len(bestmatches[i])
        clusters[i]=avgs

  return bestmatches

def tanamoto(v1,v2):
  c1,c2,shr=0,0,0

  for i in range(len(v1)):
    if v1[i]!=0: c1+=1 # in v1
    if v2[i]!=0: c2+=1 # in v2
    if v1[i]!=0 and v2[i]!=0: shr+=1 # in both

  return 1.0-(float(shr)/(c1+c2-shr))

def scaledown(data,distance=pearson,rate=0.01):
  n=len(data)

  # The real distances between every pair of items
  realdist=[[distance(data[i],data[j]) for j in range(n)]
             for i in range(0,n)]

  # Randomly initialize the starting points of the locations in 2D
  loc=[[random.random(),random.random()] for i in range(n)]
  fakedist=[[0.0 for j in range(n)] for i in range(n)]

  lasterror=None
  for m in range(0,1000):
    # Find projected distances
    for i in range(n):
      for j in range(n):
        fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
                                 for x in range(len(loc[i]))]))

    # Move points
    grad=[[0.0,0.0] for i in range(n)]

    totalerror=0
    for k in range(n):
      for j in range(n):
        if j==k: continue
        # The error is percent difference between the distances
        errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]

        # Each point needs to be moved away from or towards the other
        # point in proportion to how much error it has
        grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm
        grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm

        # Keep track of the total error
        totalerror+=abs(errorterm)
    print totalerror

    # If the answer got worse by moving the points, we are done
    if lasterror and lasterror<totalerror: break
    lasterror=totalerror

    # Move each of the points by the learning rate times the gradient
    for k in range(n):
      loc[k][0]-=rate*grad[k][0]
      loc[k][1]-=rate*grad[k][1]

  return loc

def draw2d(data,labels,jpeg='mds2d.jpg'):
  img=Image.new('RGB',(2000,2000),(255,255,255))
  draw=ImageDraw.Draw(img)
  for i in range(len(data)):
    x=(data[i][0]+0.5)*1000
    y=(data[i][1]+0.5)*1000
    draw.text((x,y),labels[i],(0,0,0))
  img.save(jpeg,'JPEG')
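Typical use on the blog dataset produced in reply #4, assuming this module is saved as clusters.py and blogdata1.txt is in the working directory:

import clusters

blognames,words,data=clusters.readfile('blogdata1.txt')

# Hierarchical clustering, printed as an indented tree and drawn as a dendrogram
clust=clusters.hcluster(data)
clusters.printclust(clust,labels=blognames)
clusters.drawdendrogram(clust,blognames,jpeg='blogclust.jpg')

# K-means clustering into 10 groups
kclust=clusters.kcluster(data,k=10)
print [blognames[r] for r in kclust[0]]

# Multidimensional scaling to a 2D map of the blogs
coords=clusters.scaledown(data)
clusters.draw2d(coords,blognames,jpeg='blogs2d.jpg')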


#7 redflame, posted 2014-8-29 11:07:33

Chapter 4: Searching and Ranking

import urllib2
import re
from BeautifulSoup import *
from urlparse import urljoin
from pysqlite2 import dbapi2 as sqlite
import nn
mynet=nn.searchnet('nn.db')

# Create a list of words to ignore
ignorewords={'the':1,'of':1,'to':1,'and':1,'a':1,'in':1,'is':1,'it':1}


class crawler:
  # Initialize the crawler with the name of database
  def __init__(self,dbname):
    self.con=sqlite.connect(dbname)

  def __del__(self):
    self.con.close()

  def dbcommit(self):
    self.con.commit()

  # Auxiliary function for getting an entry id and adding
  # it if it's not present
  def getentryid(self,table,field,value,createnew=True):
    cur=self.con.execute(
    "select rowid from %s where %s='%s'" % (table,field,value))
    res=cur.fetchone()
    if res==None:
      cur=self.con.execute(
      "insert into %s (%s) values ('%s')" % (table,field,value))
      return cur.lastrowid
    else:
      return res[0]


  # Index an individual page
  def addtoindex(self,url,soup):
    if self.isindexed(url): return
    print 'Indexing '+url

    # Get the individual words
    text=self.gettextonly(soup)
    words=self.separatewords(text)

    # Get the URL id
    urlid=self.getentryid('urllist','url',url)

    # Link each word to this url
    for i in range(len(words)):
      word=words[i]
      if word in ignorewords: continue
      wordid=self.getentryid('wordlist','word',word)
      self.con.execute("insert into wordlocation(urlid,wordid,location) values (%d,%d,%d)" % (urlid,wordid,i))


  # Extract the text from an HTML page (no tags)
  def gettextonly(self,soup):
    v=soup.string
    if v==None:
      c=soup.contents
      resulttext=''
      for t in c:
        subtext=self.gettextonly(t)
        resulttext+=subtext+'\n'
      return resulttext
    else:
      return v.strip()

  # Separate the words by any non-word character
  def separatewords(self,text):
    splitter=re.compile('\\W*')
    return [s.lower() for s in splitter.split(text) if s!='']


  # Return true if this url is already indexed
  def isindexed(self,url):
    return False

  # Add a link between two pages
  def addlinkref(self,urlFrom,urlTo,linkText):
    words=self.separatewords(linkText)
    fromid=self.getentryid('urllist','url',urlFrom)
    toid=self.getentryid('urllist','url',urlTo)
    if fromid==toid: return
    cur=self.con.execute("insert into link(fromid,toid) values (%d,%d)" % (fromid,toid))
    linkid=cur.lastrowid
    for word in words:
      if word in ignorewords: continue
      wordid=self.getentryid('wordlist','word',word)
      self.con.execute("insert into linkwords(linkid,wordid) values (%d,%d)" % (linkid,wordid))

  # Starting with a list of pages, do a breadth
  # first search to the given depth, indexing pages
  # as we go
  def crawl(self,pages,depth=2):
    for i in range(depth):
      newpages={}
      for page in pages:
        try:
          c=urllib2.urlopen(page)
        except:
          print "Could not open %s" % page
          continue
        try:
          soup=BeautifulSoup(c.read())
          self.addtoindex(page,soup)

          links=soup('a')
          for link in links:
            if ('href' in dict(link.attrs)):
              url=urljoin(page,link['href'])
              if url.find("'")!=-1: continue
              url=url.split('#')[0]  # remove location portion
              if url[0:4]=='http' and not self.isindexed(url):
                newpages[url]=1
              linkText=self.gettextonly(link)
              self.addlinkref(page,url,linkText)

          self.dbcommit()
        except:
          print "Could not parse page %s" % page

      pages=newpages


  # Create the database tables
  def createindextables(self):
    self.con.execute('create table urllist(url)')
    self.con.execute('create table wordlist(word)')
    self.con.execute('create table wordlocation(urlid,wordid,location)')
    self.con.execute('create table link(fromid integer,toid integer)')
    self.con.execute('create table linkwords(wordid,linkid)')
    self.con.execute('create index wordidx on wordlist(word)')
    self.con.execute('create index urlidx on urllist(url)')
    self.con.execute('create index wordurlidx on wordlocation(wordid)')
    self.con.execute('create index urltoidx on link(toid)')
    self.con.execute('create index urlfromidx on link(fromid)')
    self.dbcommit()

  def calculatepagerank(self,iterations=20):
    # clear out the current page rank tables
    self.con.execute('drop table if exists pagerank')
    self.con.execute('create table pagerank(urlid primary key,score)')

    # initialize every url with a page rank of 1
    for (urlid,) in self.con.execute('select rowid from urllist'):
      self.con.execute('insert into pagerank(urlid,score) values (%d,1.0)' % urlid)
    self.dbcommit()

    for i in range(iterations):
      print "Iteration %d" % (i)
      for (urlid,) in self.con.execute('select rowid from urllist'):
        pr=0.15

        # Loop through all the pages that link to this one
        for (linker,) in self.con.execute(
        'select distinct fromid from link where toid=%d' % urlid):
          # Get the page rank of the linker
          linkingpr=self.con.execute(
          'select score from pagerank where urlid=%d' % linker).fetchone()[0]

          # Get the total number of links from the linker
          linkingcount=self.con.execute(
          'select count(*) from link where fromid=%d' % linker).fetchone()[0]
          pr+=0.85*(linkingpr/linkingcount)
        self.con.execute(
        'update pagerank set score=%f where urlid=%d' % (pr,urlid))
      self.dbcommit()

class searcher:
  def __init__(self,dbname):
    self.con=sqlite.connect(dbname)

  def __del__(self):
    self.con.close()

  def getmatchrows(self,q):
    # Strings to build the query
    fieldlist='w0.urlid'
    tablelist=''
    clauselist=''
    wordids=[]

    # Split the words by spaces
    words=q.split(' ')
    tablenumber=0

    for word in words:
      # Get the word ID
      wordrow=self.con.execute(
      "select rowid from wordlist where word='%s'" % word).fetchone()
      if wordrow!=None:
        wordid=wordrow[0]
        wordids.append(wordid)
        if tablenumber>0:
          tablelist+=','
          clauselist+=' and '
          clauselist+='w%d.urlid=w%d.urlid and ' % (tablenumber-1,tablenumber)
        fieldlist+=',w%d.location' % tablenumber
        tablelist+='wordlocation w%d' % tablenumber
        clauselist+='w%d.wordid=%d' % (tablenumber,wordid)
        tablenumber+=1

    # Create the query from the separate parts
    fullquery='select %s from %s where %s' % (fieldlist,tablelist,clauselist)
    print fullquery
    cur=self.con.execute(fullquery)
    rows=[row for row in cur]

    return rows,wordids

  def getscoredlist(self,rows,wordids):
    totalscores=dict([(row[0],0) for row in rows])

    # This is where we'll put our scoring functions
    weights=[(1.0,self.locationscore(rows)),
             (1.0,self.frequencyscore(rows)),
             (1.0,self.pagerankscore(rows)),
             (1.0,self.linktextscore(rows,wordids)),
             (5.0,self.nnscore(rows,wordids))]
    for (weight,scores) in weights:
      for url in totalscores:
        totalscores[url]+=weight*scores[url]

    return totalscores

  def geturlname(self,id):
    return self.con.execute(
    "select url from urllist where rowid=%d" % id).fetchone()[0]

  def query(self,q):
    rows,wordids=self.getmatchrows(q)
    scores=self.getscoredlist(rows,wordids)
    rankedscores=[(score,url) for (url,score) in scores.items()]
    rankedscores.sort()
    rankedscores.reverse()
    for (score,urlid) in rankedscores[0:10]:
      print '%f\t%s' % (score,self.geturlname(urlid))
    return wordids,[r[1] for r in rankedscores[0:10]]

  def normalizescores(self,scores,smallIsBetter=0):
    vsmall=0.00001 # Avoid division by zero errors
    if smallIsBetter:
      minscore=min(scores.values())
      return dict([(u,float(minscore)/max(vsmall,l)) for (u,l) in scores.items()])
    else:
      maxscore=max(scores.values())
      if maxscore==0: maxscore=vsmall
      return dict([(u,float(c)/maxscore) for (u,c) in scores.items()])

  def frequencyscore(self,rows):
    counts=dict([(row[0],0) for row in rows])
    for row in rows: counts[row[0]]+=1
    return self.normalizescores(counts)

  def locationscore(self,rows):
    locations=dict([(row[0],1000000) for row in rows])
    for row in rows:
      loc=sum(row[1:])
      if loc<locations[row[0]]: locations[row[0]]=loc

    return self.normalizescores(locations,smallIsBetter=1)

  def distancescore(self,rows):
    # If there's only one word, everyone wins!
    if len(rows[0])<=2: return dict([(row[0],1.0) for row in rows])

    # Initialize the dictionary with large values
    mindistance=dict([(row[0],1000000) for row in rows])

    for row in rows:
      dist=sum([abs(row[i]-row[i-1]) for i in range(2,len(row))])
      if dist<mindistance[row[0]]: mindistance[row[0]]=dist
    return self.normalizescores(mindistance,smallIsBetter=1)

  def inboundlinkscore(self,rows):
    uniqueurls=dict([(row[0],1) for row in rows])
    inboundcount=dict([(u,self.con.execute('select count(*) from link where toid=%d' % u).fetchone()[0]) for u in uniqueurls])
    return self.normalizescores(inboundcount)

  def linktextscore(self,rows,wordids):
    linkscores=dict([(row[0],0) for row in rows])
    for wordid in wordids:
      cur=self.con.execute('select link.fromid,link.toid from linkwords,link where wordid=%d and linkwords.linkid=link.rowid' % wordid)
      for (fromid,toid) in cur:
        if toid in linkscores:
          pr=self.con.execute('select score from pagerank where urlid=%d' % fromid).fetchone()[0]
          linkscores[toid]+=pr
    maxscore=max(linkscores.values())
    normalizedscores=dict([(u,float(l)/maxscore) for (u,l) in linkscores.items()])
    return normalizedscores

  def pagerankscore(self,rows):
    pageranks=dict([(row[0],self.con.execute('select score from pagerank where urlid=%d' % row[0]).fetchone()[0]) for row in rows])
    maxrank=max(pageranks.values())
    normalizedscores=dict([(u,float(l)/maxrank) for (u,l) in pageranks.items()])
    return normalizedscores

  def nnscore(self,rows,wordids):
    # Get unique URL IDs as an ordered list
    urlids=[urlid for urlid in dict([(row[0],1) for row in rows])]
    nnres=mynet.getresult(wordids,urlids)
    scores=dict([(urlids[i],nnres[i]) for i in range(len(urlids))])
    return self.normalizescores(scores)
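A first run might look like the following, assuming the code above is saved as searchengine.py, the nn module from reply #8 is importable, and the seed URL is reachable (the URL below is only an example; the book crawls a small wiki mirror of its own):

import nn
nn.searchnet('nn.db').maketables()   # only needed once, before the first query

import searchengine

# Build a fresh index
crawler=searchengine.crawler('searchindex.db')
crawler.createindextables()
pagelist=['http://en.wikipedia.org/wiki/Programming_language']   # example seed
crawler.crawl(pagelist)
crawler.calculatepagerank()

# Query it
e=searchengine.searcher('searchindex.db')
e.query('functional programming')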


#8 长线小白龙, posted 2014-8-29 11:53:15

Chapter 4: Searching and Ranking

from math import tanh
from pysqlite2 import dbapi2 as sqlite

def dtanh(y):
    return 1.0-y*y

class searchnet:
    def __init__(self,dbname):
      self.con=sqlite.connect(dbname)

    def __del__(self):
      self.con.close()

    def maketables(self):
      self.con.execute('create table hiddennode(create_key)')
      self.con.execute('create table wordhidden(fromid,toid,strength)')
      self.con.execute('create table hiddenurl(fromid,toid,strength)')
      self.con.commit()

    def getstrength(self,fromid,toid,layer):
      if layer==0: table='wordhidden'
      else: table='hiddenurl'
      res=self.con.execute('select strength from %s where fromid=%d and toid=%d' % (table,fromid,toid)).fetchone()
      if res==None:
          if layer==0: return -0.2
          if layer==1: return 0
      return res[0]

    def setstrength(self,fromid,toid,layer,strength):
      if layer==0: table='wordhidden'
      else: table='hiddenurl'
      res=self.con.execute('select rowid from %s where fromid=%d and toid=%d' % (table,fromid,toid)).fetchone()
      if res==None:
        self.con.execute('insert into %s (fromid,toid,strength) values (%d,%d,%f)' % (table,fromid,toid,strength))
      else:
        rowid=res[0]
        self.con.execute('update %s set strength=%f where rowid=%d' % (table,strength,rowid))

    def generatehiddennode(self,wordids,urls):
      if len(wordids)>3: return None
      # Check if we already created a node for this set of words
      sorted_words=[str(id) for id in wordids]
      sorted_words.sort()
      createkey='_'.join(sorted_words)
      res=self.con.execute(
      "select rowid from hiddennode where create_key='%s'" % createkey).fetchone()

      # If not, create it
      if res==None:
        cur=self.con.execute(
        "insert into hiddennode (create_key) values ('%s')" % createkey)
        hiddenid=cur.lastrowid
        # Put in some default weights
        for wordid in wordids:
          self.setstrength(wordid,hiddenid,0,1.0/len(wordids))
        for urlid in urls:
          self.setstrength(hiddenid,urlid,1,0.1)
        self.con.commit()

    def getallhiddenids(self,wordids,urlids):
      l1={}
      for wordid in wordids:
        cur=self.con.execute(
        'select toid from wordhidden where fromid=%d' % wordid)
        for row in cur: l1[row[0]]=1
      for urlid in urlids:
        cur=self.con.execute(
        'select fromid from hiddenurl where toid=%d' % urlid)
        for row in cur: l1[row[0]]=1
      return l1.keys()

    def setupnetwork(self,wordids,urlids):
        # value lists
        self.wordids=wordids
        self.hiddenids=self.getallhiddenids(wordids,urlids)
        self.urlids=urlids

        # node outputs
        self.ai = [1.0]*len(self.wordids)
        self.ah = [1.0]*len(self.hiddenids)
        self.ao = [1.0]*len(self.urlids)

        # create weights matrix
        self.wi = [[self.getstrength(wordid,hiddenid,0)
                    for hiddenid in self.hiddenids]
                   for wordid in self.wordids]
        self.wo = [[self.getstrength(hiddenid,urlid,1)
                    for urlid in self.urlids]
                   for hiddenid in self.hiddenids]

    def feedforward(self):
        # the only inputs are the query words
        for i in range(len(self.wordids)):
            self.ai[i] = 1.0

        # hidden activations
        for j in range(len(self.hiddenids)):
            sum = 0.0
            for i in range(len(self.wordids)):
                sum = sum + self.ai[i] * self.wi[i][j]
            self.ah[j] = tanh(sum)

        # output activations
        for k in range(len(self.urlids)):
            sum = 0.0
            for j in range(len(self.hiddenids)):
                sum = sum + self.ah[j] * self.wo[j][k]
            self.ao[k] = tanh(sum)

        return self.ao[:]

    def getresult(self,wordids,urlids):
      self.setupnetwork(wordids,urlids)
      return self.feedforward()

    def backPropagate(self, targets, N=0.5):
        # calculate errors for output
        output_deltas = [0.0] * len(self.urlids)
        for k in range(len(self.urlids)):
            error = targets[k]-self.ao[k]
            output_deltas[k] = dtanh(self.ao[k]) * error

        # calculate errors for hidden layer
        hidden_deltas = [0.0] * len(self.hiddenids)
        for j in range(len(self.hiddenids)):
            error = 0.0
            for k in range(len(self.urlids)):
                error = error + output_deltas[k]*self.wo[j][k]
            hidden_deltas[j] = dtanh(self.ah[j]) * error

        # update output weights
        for j in range(len(self.hiddenids)):
            for k in range(len(self.urlids)):
                change = output_deltas[k]*self.ah[j]
                self.wo[j][k] = self.wo[j][k] + N*change

        # update input weights
        for i in range(len(self.wordids)):
            for j in range(len(self.hiddenids)):
                change = hidden_deltas[j]*self.ai[i]
                self.wi[i][j] = self.wi[i][j] + N*change

    def trainquery(self,wordids,urlids,selectedurl):
      # generate a hidden node if necessary
      self.generatehiddennode(wordids,urlids)

      self.setupnetwork(wordids,urlids)
      self.feedforward()
      targets=[0.0]*len(urlids)
      targets[urlids.index(selectedurl)]=1.0
      error = self.backPropagate(targets)
      self.updatedatabase()

    def updatedatabase(self):
      # set them to database values
      for i in range(len(self.wordids)):
          for j in range(len(self.hiddenids)):
              self.setstrength(self.wordids[i],self.hiddenids[j],0,self.wi[i][j])
      for j in range(len(self.hiddenids)):
          for k in range(len(self.urlids)):
              self.setstrength(self.hiddenids[j],self.urlids[k],1,self.wo[j][k])
      self.con.commit()
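The network can be exercised on its own with made-up integer IDs (the IDs below are only illustrative; in the search engine they are rowids from the wordlist and urllist tables):

import nn

mynet=nn.searchnet('nn.db')
mynet.maketables()

# Three query words and three candidate URLs, as arbitrary IDs
wWorld,wRiver,wBank=101,102,103
uWorldBank,uRiver,uEarth=201,202,203

# Pretend the user repeatedly clicked the WorldBank result for the query (world, bank)
for i in range(30):
    mynet.trainquery([wWorld,wBank],[uWorldBank,uRiver,uEarth],uWorldBank)

# After training, the first output (uWorldBank) should be the strongest
print mynet.getresult([wWorld,wBank],[uWorldBank,uRiver,uEarth])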


#9 lhf8059, posted 2014-8-29 18:03:00

Chapter 5: Optimization

import math

people=['Charlie','Augustus','Veruca','Violet','Mike','Joe','Willy','Miranda']

links=[('Augustus', 'Willy'),
       ('Mike', 'Joe'),
       ('Miranda', 'Mike'),
       ('Violet', 'Augustus'),
       ('Miranda', 'Willy'),
       ('Charlie', 'Mike'),
       ('Veruca', 'Joe'),
       ('Miranda', 'Augustus'),
       ('Willy', 'Augustus'),
       ('Joe', 'Charlie'),
       ('Veruca', 'Augustus'),
       ('Miranda', 'Joe')]


def crosscount(v):
  # Convert the number list into a dictionary of person:(x,y)
  loc=dict([(people[i],(v[i*2],v[i*2+1])) for i in range(0,len(people))])
  total=0

  # Loop through every pair of links
  for i in range(len(links)):
    for j in range(i+1,len(links)):

      # Get the locations
      (x1,y1),(x2,y2)=loc[links[i][0]],loc[links[i][1]]
      (x3,y3),(x4,y4)=loc[links[j][0]],loc[links[j][1]]

      den=(y4-y3)*(x2-x1)-(x4-x3)*(y2-y1)

      # den==0 if the lines are parallel
      if den==0: continue

      # Otherwise ua and ub are the fraction of the
      # line where they cross
      ua=((x4-x3)*(y1-y3)-(y4-y3)*(x1-x3))/den
      ub=((x2-x1)*(y1-y3)-(y2-y1)*(x1-x3))/den

      # If the fraction is between 0 and 1 for both lines
      # then they cross each other
      if ua>0 and ua<1 and ub>0 and ub<1:
        total+=1

  for i in range(len(people)):
    for j in range(i+1,len(people)):
      # Get the locations of the two nodes
      (x1,y1),(x2,y2)=loc[people[i]],loc[people[j]]

      # Find the distance between them
      dist=math.sqrt(math.pow(x1-x2,2)+math.pow(y1-y2,2))
      # Penalize any nodes closer than 50 pixels
      if dist<50:
        total+=(1.0-(dist/50.0))

  return total

from PIL import Image,ImageDraw

def drawnetwork(sol):
  # Create the image
  img=Image.new('RGB',(400,400),(255,255,255))
  draw=ImageDraw.Draw(img)

  # Create the position dict
  pos=dict([(people[i],(sol[i*2],sol[i*2+1])) for i in range(0,len(people))])

  for (a,b) in links:
    draw.line((pos[a],pos[b]),fill=(255,0,0))

  for n,p in pos.items():
    draw.text(p,n,(0,0,0))

  img.show()


domain=[(10,370)]*(len(people)*2)
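crosscount is a cost function meant to be minimized by the optimizers in reply #10; a sketch, assuming that code is saved as optimization.py and this post's code as socialnetwork.py (note that optimization.py reads schedule.txt at import time, so that file must be present):

import optimization
import socialnetwork

# domain gives every person an (x,y) pair inside the 400x400 canvas
sol=optimization.annealingoptimize(socialnetwork.domain,
                                   socialnetwork.crosscount,
                                   step=50,cool=0.99)
print socialnetwork.crosscount(sol)
socialnetwork.drawnetwork(sol)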


#10 hsiaoyan, posted 2014-8-29 18:52:30

Chapter 5: Optimization

import time
import random
import math

people = [('Seymour','BOS'),
          ('Franny','DAL'),
          ('Zooey','CAK'),
          ('Walt','MIA'),
          ('Buddy','ORD'),
          ('Les','OMA')]
# Laguardia
destination='LGA'

flights={}
# Load the flight schedule from schedule.txt
for line in file('schedule.txt'):
  origin,dest,depart,arrive,price=line.strip().split(',')
  flights.setdefault((origin,dest),[])

  # Add details to the list of possible flights
  flights[(origin,dest)].append((depart,arrive,int(price)))

def getminutes(t):
  x=time.strptime(t,'%H:%M')
  return x[3]*60+x[4]

def printschedule(r):
  for d in range(len(r)/2):
    name=people[d][0]
    origin=people[d][1]
    out=flights[(origin,destination)][int(r[2*d])]
    ret=flights[(destination,origin)][int(r[2*d+1])]
    print '%10s%10s %5s-%5s $%3s %5s-%5s $%3s' % (name,origin,
                                                  out[0],out[1],out[2],
                                                  ret[0],ret[1],ret[2])

def schedulecost(sol):
  totalprice=0
  latestarrival=0
  earliestdep=24*60

  for d in range(len(sol)/2):
    # Get the inbound and outbound flights
    origin=people[d][1]
    outbound=flights[(origin,destination)][int(sol[2*d])]
    returnf=flights[(destination,origin)][int(sol[2*d+1])]

    # Total price is the price of all outbound and return flights
    totalprice+=outbound[2]
    totalprice+=returnf[2]

    # Track the latest arrival and earliest departure
    if latestarrival<getminutes(outbound[1]): latestarrival=getminutes(outbound[1])
    if earliestdep>getminutes(returnf[0]): earliestdep=getminutes(returnf[0])

  # Every person must wait at the airport until the latest person arrives.
  # They also must arrive at the same time and wait for their flights.
  totalwait=0
  for d in range(len(sol)/2):
    origin=people[d][1]
    outbound=flights[(origin,destination)][int(sol[2*d])]
    returnf=flights[(destination,origin)][int(sol[2*d+1])]
    totalwait+=latestarrival-getminutes(outbound[1])
    totalwait+=getminutes(returnf[0])-earliestdep

  # Does this solution require an extra day of car rental? That'll be $50!
  if latestarrival>earliestdep: totalprice+=50

  return totalprice+totalwait

def randomoptimize(domain,costf):
  best=999999999
  bestr=None
  for i in range(0,1000):
    # Create a random solution
    r=[float(random.randint(domain[i][0],domain[i][1]))
       for i in range(len(domain))]

    # Get the cost
    cost=costf(r)

    # Compare it to the best one so far
    if cost<best:
      best=cost
      bestr=r
  return bestr

def hillclimb(domain,costf):
  # Create a random solution
  sol=[random.randint(domain[i][0],domain[i][1])
      for i in range(len(domain))]
  # Main loop
  while 1:
    # Create list of neighboring solutions
    neighbors=[]

    for j in range(len(domain)):
      # One away in each direction
      if sol[j]>domain[j][0]:
        neighbors.append(sol[0:j]+[sol[j]+1]+sol[j+1:])
      if sol[j]<domain[j][1]:
        neighbors.append(sol[0:j]+[sol[j]-1]+sol[j+1:])

    # See what the best solution amongst the neighbors is
    current=costf(sol)
    best=current
    for j in range(len(neighbors)):
      cost=costf(neighbors[j])
      if cost<best:
        best=cost
        sol=neighbors[j]

    # If there's no improvement, then we've reached the top
    if best==current:
      break
  return sol

def annealingoptimize(domain,costf,T=10000.0,cool=0.95,step=1):
  # Initialize the values randomly
  vec=[float(random.randint(domain[i][0],domain[i][1]))
       for i in range(len(domain))]

  while T>0.1:
    # Choose one of the indices
    i=random.randint(0,len(domain)-1)

    # Choose a direction to change it
    dir=random.randint(-step,step)

    # Create a new list with one of the values changed
    vecb=vec[:]
    vecb[i]+=dir
    if vecb[i]<domain[i][0]: vecb[i]=domain[i][0]
    elif vecb[i]>domain[i][1]: vecb[i]=domain[i][1]

    # Calculate the current cost and the new cost
    ea=costf(vec)
    eb=costf(vecb)
    p=pow(math.e,-(eb-ea)/T)

    # Is it better, or does it make the probability
    # cutoff?
    if (eb<ea or random.random()<p):
      vec=vecb

    # Decrease the temperature
    T=T*cool
  return vec

def geneticoptimize(domain,costf,popsize=50,step=1,
                    mutprob=0.2,elite=0.2,maxiter=100):
  # Mutation Operation
  def mutate(vec):
    i=random.randint(0,len(domain)-1)
    if random.random()<0.5 and vec[i]>domain[i][0]:
      return vec[0:i]+[vec[i]-step]+vec[i+1:]
    elif vec[i]<domain[i][1]:
      return vec[0:i]+[vec[i]+step]+vec[i+1:]
    else:
      return vec

  # Crossover Operation
  def crossover(r1,r2):
    i=random.randint(1,len(domain)-2)
    return r1[0:i]+r2[i:]

  # Build the initial population
  pop=[]
  for i in range(popsize):
    vec=[random.randint(domain[i][0],domain[i][1])
         for i in range(len(domain))]
    pop.append(vec)

  # How many winners from each generation?
  topelite=int(elite*popsize)

  # Main loop
  for i in range(maxiter):
    scores=[(costf(v),v) for v in pop]
    scores.sort()
    ranked=[v for (s,v) in scores]

    # Start with the pure winners
    pop=ranked[0:topelite]

    # Add mutated and bred forms of the winners
    while len(pop)<popsize:
      if random.random()<mutprob:

        # Mutation
        c=random.randint(0,topelite)
        pop.append(mutate(ranked[c]))
      else:

        # Crossover
        c1=random.randint(0,topelite)
        c2=random.randint(0,topelite)
        pop.append(crossover(ranked[c1],ranked[c2]))

    # Print current best score
    print scores[0][0]

  return scores[0][1]
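Typical use on the group-travel problem, assuming this module is saved as optimization.py and schedule.txt (comma-separated origin,dest,depart,arrive,price rows) is in the working directory. Each person needs two entries in the solution vector (outbound and return flight index), so the domain has len(people)*2 entries; the upper bound of 9 assumes ten flights per route, as in the book's sample file:

import optimization

domain=[(0,9)]*(len(optimization.people)*2)

s=optimization.randomoptimize(domain,optimization.schedulecost)
print optimization.schedulecost(s)
optimization.printschedule(s)

s=optimization.annealingoptimize(domain,optimization.schedulecost)
print optimization.schedulecost(s)

s=optimization.geneticoptimize(domain,optimization.schedulecost)
optimization.printschedule(s)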

