OP: ReneeBK

Programming Collective Intelligence


Programming Collective Intelligence: Building Smart Web 2.0 Applications
Toby Segaran



Want to tap the power behind search rankings, product recommendations, social bookmarking, and online matchmaking? This fascinating book demonstrates how you can build Web 2.0 applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you've found it.

Programming Collective Intelligence takes you into the world of machine learning and statistics, and explains how to draw conclusions about user experience, marketing, personal tastes, and human behavior in general -- all from information that you and others collect every day. Each algorithm is described clearly and concisely with code that can immediately be used on your web site, blog, Wiki, or specialized application. This book explains:

  • Collaborative filtering techniques that enable online retailers to recommend products or media
  • Methods of clustering to detect groups of similar items in a large dataset
  • Search engine features -- crawlers, indexers, query engines, and the PageRank algorithm
  • Optimization algorithms that search millions of possible solutions to a problem and choose the best one
  • Bayesian filtering, used in spam filters for classifying documents based on word types and other features
  • Using decision trees not only to make predictions, but to model the way decisions are made
  • Predicting numerical values rather than classifications to build price models
  • Support vector machines to match people in online dating sites
  • Non-negative matrix factorization to find the independent features in a dataset
  • Evolving intelligence for problem solving -- how a computer develops its skill by improving its own code the more it plays a game
Each chapter includes exercises for extending the algorithms to make them more powerful. Go beyond simple database-backed applications and put the wealth of Internet data to work for you.

"Bravo! I cannot think of a better way for a developer to first learn these algorithms and methods, nor can I think of a better way for me (an old AI dog) to reinvigorate my knowledge of the details."
-- Dan Russell, Google

"Toby's book does a great job of breaking down the complex subject matter of machine-learning algorithms into practical, easy-to-understand examples that can be directly applied to analysis of social interaction across the Web today. If I had this book two years ago, it would have saved precious time going down some fruitless paths."
-- Tim Wolters, CTO, Collective Intellect

Product Details
  • Paperback: 362 pages
  • Publisher: O'Reilly Media; 1 edition (Aug. 26 2007)
  • Language: English
  • ISBN-10: 0596529325
  • ISBN-13: 978-0596529321

Hidden content in this post

Programming Collective Intelligence.rar (2.51 MB, price: 20 forum coins). This attachment includes:
  • Programming Collective Intelligence.pdf







#2 sunyiping, posted 2014-8-29 09:05:31

Chapter 2: Making Recommendations

# A dictionary of movie critics and their ratings of a small
# set of movies
critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
 'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,
 'The Night Listener': 3.0},
'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
 'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,
 'You, Me and Dupree': 3.5},
'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
 'Superman Returns': 3.5, 'The Night Listener': 4.0},
'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
 'The Night Listener': 4.5, 'Superman Returns': 4.0,
 'You, Me and Dupree': 2.5},
'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
 'You, Me and Dupree': 2.0},
'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
 'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}


from math import sqrt

# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs,person1,person2):
  # Get the list of shared_items
  si={}
  for item in prefs[person1]:
    if item in prefs[person2]: si[item]=1

  # if they have no ratings in common, return 0
  if len(si)==0: return 0

  # Add up the squares of all the differences
  sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2)
                      for item in prefs[person1] if item in prefs[person2]])

  return 1/(1+sum_of_squares)

# Returns the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs,p1,p2):
  # Get the list of mutually rated items
  si={}
  for item in prefs[p1]:
    if item in prefs[p2]: si[item]=1

  # if they have no ratings in common, return 0
  if len(si)==0: return 0

  # Sum calculations
  n=len(si)

  # Sums of all the preferences
  sum1=sum([prefs[p1][it] for it in si])
  sum2=sum([prefs[p2][it] for it in si])

  # Sums of the squares
  sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
  sum2Sq=sum([pow(prefs[p2][it],2) for it in si])

  # Sum of the products
  pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])

  # Calculate r (Pearson score)
  num=pSum-(sum1*sum2/n)
  den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
  if den==0: return 0

  r=num/den

  return r

# Returns the best matches for person from the prefs dictionary.
# Number of results and similarity function are optional params.
def topMatches(prefs,person,n=5,similarity=sim_pearson):
  scores=[(similarity(prefs,person,other),other)
                  for other in prefs if other!=person]
  scores.sort()
  scores.reverse()
  return scores[0:n]

# Gets recommendations for a person by using a weighted average
# of every other user's rankings
def getRecommendations(prefs,person,similarity=sim_pearson):
  totals={}
  simSums={}
  for other in prefs:
    # don't compare me to myself
    if other==person: continue
    sim=similarity(prefs,person,other)

    # ignore scores of zero or lower
    if sim<=0: continue
    for item in prefs[other]:
      # only score movies I haven't seen yet
      if item not in prefs[person] or prefs[person][item]==0:
        # Similarity * Score
        totals.setdefault(item,0)
        totals[item]+=prefs[other][item]*sim
        # Sum of similarities
        simSums.setdefault(item,0)
        simSums[item]+=sim

  # Create the normalized list
  rankings=[(total/simSums[item],item) for item,total in totals.items()]

  # Return the sorted list
  rankings.sort()
  rankings.reverse()
  return rankings

def transformPrefs(prefs):
  result={}
  for person in prefs:
    for item in prefs[person]:
      result.setdefault(item,{})

      # Flip item and person
      result[item][person]=prefs[person][item]
  return result


def calculateSimilarItems(prefs,n=10):
  # Create a dictionary of items showing which other items they
  # are most similar to.
  result={}
  # Invert the preference matrix to be item-centric
  itemPrefs=transformPrefs(prefs)
  c=0
  for item in itemPrefs:
    # Status updates for large datasets
    c+=1
    if c%100==0: print "%d / %d" % (c,len(itemPrefs))
    # Find the most similar items to this one
    scores=topMatches(itemPrefs,item,n=n,similarity=sim_distance)
    result[item]=scores
  return result

def getRecommendedItems(prefs,itemMatch,user):
  userRatings=prefs[user]
  scores={}
  totalSim={}
  # Loop over items rated by this user
  for (item,rating) in userRatings.items():

    # Loop over items similar to this one
    for (similarity,item2) in itemMatch[item]:

      # Ignore if this user has already rated this item
      if item2 in userRatings: continue
      # Weighted sum of rating times similarity
      scores.setdefault(item2,0)
      scores[item2]+=similarity*rating
      # Sum of all the similarities
      totalSim.setdefault(item2,0)
      totalSim[item2]+=similarity

  # Divide each total score by total weighting to get an average
  rankings=[(score/totalSim[item],item) for item,score in scores.items()]

  # Return the rankings from highest to lowest
  rankings.sort()
  rankings.reverse()
  return rankings

def loadMovieLens(path='/data/movielens'):
  # Get movie titles
  movies={}
  for line in open(path+'/u.item'):
    (id,title)=line.split('|')[0:2]
    movies[id]=title

  # Load data
  prefs={}
  for line in open(path+'/u.data'):
    (user,movieid,rating,ts)=line.split('\t')
    prefs.setdefault(user,{})
    prefs[user][movies[movieid]]=float(rating)
  return prefs
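A quick sanity check, following the book's own walkthrough and assuming the code above is saved as recommendations.py (the filename is my assumption; the critics dictionary is the book's sample data):

import recommendations

# Pearson similarity between two critics
print recommendations.sim_pearson(recommendations.critics,'Lisa Rose','Gene Seymour')

# Critics most similar to Toby
print recommendations.topMatches(recommendations.critics,'Toby',n=3)

# User-based recommendations for Toby
print recommendations.getRecommendations(recommendations.critics,'Toby')

# Item-based recommendations: build the item-similarity dataset first
itemsim=recommendations.calculateSimilarItems(recommendations.critics)
print recommendations.getRecommendedItems(recommendations.critics,itemsim,'Toby')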


#3 vaster, posted 2014-8-29 09:34:41

Chapter 2: Making Recommendations

from pydelicious import get_popular,get_userposts,get_urlposts
import time

def initializeUserDict(tag,count=5):
  user_dict={}
  # get the top `count` popular posts
  for p1 in get_popular(tag=tag)[0:count]:
    # find all users who posted this
    for p2 in get_urlposts(p1['href']):
      user=p2['user']
      user_dict[user]={}
  return user_dict

def fillItems(user_dict):
  all_items={}
  # Find links posted by all users
  for user in user_dict:
    for i in range(3):
      try:
        posts=get_userposts(user)
        break
      except:
        print "Failed user "+user+", retrying"
        time.sleep(4)
    for post in posts:
      url=post['href']
      user_dict[user][url]=1.0
      all_items[url]=1

  # Fill in missing items with 0
  for ratings in user_dict.values():
    for item in all_items:
      if item not in ratings:
        ratings[item]=0.0
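The book feeds the filled user dictionary straight into the Chapter 2 recommender. A sketch, assuming the code above is saved as deliciousrec.py, that pydelicious is installed and can still reach the (now long-defunct) del.icio.us API, and that 'programming' is just an example tag:

import deliciousrec
import recommendations   # the Chapter 2 module from reply #2

delusers=deliciousrec.initializeUserDict('programming')
deliciousrec.fillItems(delusers)

# Pick any user and look for similar bookmarkers and recommended links
user=delusers.keys()[0]
print recommendations.topMatches(delusers,user)
print recommendations.getRecommendations(delusers,user)[0:10]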


#4 找媳妇, posted 2014-8-29 10:05:47

Chapter 3: Discovering Groups

import feedparser
import re

# Returns title and dictionary of word counts for an RSS feed
def getwordcounts(url):
  # Parse the feed
  d=feedparser.parse(url)
  wc={}

  # Loop over all the entries
  for e in d.entries:
    if 'summary' in e: summary=e.summary
    else: summary=e.description

    # Extract a list of words
    words=getwords(e.title+' '+summary)
    for word in words:
      wc.setdefault(word,0)
      wc[word]+=1
  return d.feed.title,wc

def getwords(html):
  # Remove all the HTML tags
  txt=re.compile(r'<[^>]+>').sub('',html)

  # Split words by all non-alpha characters
  words=re.compile(r'[^A-Z^a-z]+').split(txt)

  # Convert to lowercase
  return [word.lower() for word in words if word!='']


apcount={}
wordcounts={}
feedlist=[line for line in file('feedlist.txt')]
for feedurl in feedlist:
  try:
    title,wc=getwordcounts(feedurl)
    wordcounts[title]=wc
    for word,count in wc.items():
      apcount.setdefault(word,0)
      if count>1:
        apcount[word]+=1
  except:
    print 'Failed to parse feed %s' % feedurl

wordlist=[]
for w,bc in apcount.items():
  frac=float(bc)/len(feedlist)
  if frac>0.1 and frac<0.5:
    wordlist.append(w)

out=file('blogdata1.txt','w')
out.write('Blog')
for word in wordlist: out.write('\t%s' % word)
out.write('\n')
for blog,wc in wordcounts.items():
  print blog
  out.write(blog)
  for word in wordlist:
    if word in wc: out.write('\t%d' % wc[word])
    else: out.write('\t0')
  out.write('\n')
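For reference, feedlist.txt is simply one RSS feed URL per line, and the script writes blogdata1.txt as a tab-separated matrix with one blog per row and one surviving word per column; the clustering code in reply #6 reads this file back in with readfile(). The first lines look roughly like this (blog names and counts here are only illustrative):

Blog        china   kids    music   yahoo
Gothamist   0       3       3       0
GigaOM      6       0       0       2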


#5 jojoan719, posted 2014-8-29 10:07:58

Chapter 3: Discovering Groups

from BeautifulSoup import BeautifulSoup
import urllib2
import re
chare=re.compile(r'[!-\.&]')
itemowners={}

# Words to remove
dropwords=['a','new','some','more','my','own','the','many','other','another']

currentuser=0
for i in range(1,51):
  # URL for the want search page
  c=urllib2.urlopen(
  'http://member.zebo.com/Main?event_key=USERSEARCH&wiowiw=wiw&keyword=car&page=%d'
  % (i))
  soup=BeautifulSoup(c.read())
  for td in soup('td'):
    # Find table cells of bgverdanasmall class
    if ('class' in dict(td.attrs) and td['class']=='bgverdanasmall'):
      items=[re.sub(chare,'',str(a.contents[0]).lower()).strip() for a in td('a')]
      for item in items:
        # Remove extra words
        txt=' '.join([t for t in item.split(' ') if t not in dropwords])
        if len(txt)<2: continue
        itemowners.setdefault(txt,{})
        itemowners[txt][currentuser]=1
      currentuser+=1

out=file('zebo.txt','w')
out.write('Item')
for user in range(0,currentuser): out.write('\tU%d' % user)
out.write('\n')
for item,owners in itemowners.items():
  if len(owners)>10:
    out.write(item)
    for user in range(0,currentuser):
      if user in owners: out.write('\t1')
      else: out.write('\t0')
    out.write('\n')
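The resulting zebo.txt has the same tab-separated layout (items as rows, users U0..Un as columns, 1 where a user listed the item). Because the data is binary, the book clusters it with the Tanimoto-based distance instead of Pearson; a sketch, assuming the clustering code in reply #6 is saved as clusters.py:

import clusters

wants,people,data=clusters.readfile('zebo.txt')
clust=clusters.hcluster(data,distance=clusters.tanamoto)
clusters.printclust(clust,labels=wants)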


#6 oink-oink, posted 2014-8-29 10:27:50

Chapter 3: Discovering Groups

from PIL import Image,ImageDraw

def readfile(filename):
  lines=[line for line in file(filename)]

  # First line is the column titles
  colnames=lines[0].strip().split('\t')[1:]
  rownames=[]
  data=[]
  for line in lines[1:]:
    p=line.strip().split('\t')
    # First column in each row is the rowname
    rownames.append(p[0])
    # The data for this row is the remainder of the row
    data.append([float(x) for x in p[1:]])
  return rownames,colnames,data


from math import sqrt

def pearson(v1,v2):
  # Simple sums
  sum1=sum(v1)
  sum2=sum(v2)

  # Sums of the squares
  sum1Sq=sum([pow(v,2) for v in v1])
  sum2Sq=sum([pow(v,2) for v in v2])

  # Sum of the products
  pSum=sum([v1[i]*v2[i] for i in range(len(v1))])

  # Calculate r (Pearson score)
  num=pSum-(sum1*sum2/len(v1))
  den=sqrt((sum1Sq-pow(sum1,2)/len(v1))*(sum2Sq-pow(sum2,2)/len(v1)))
  if den==0: return 0

  return 1.0-num/den

class bicluster:
  def __init__(self,vec,left=None,right=None,distance=0.0,id=None):
    self.left=left
    self.right=right
    self.vec=vec
    self.id=id
    self.distance=distance

def hcluster(rows,distance=pearson):
  distances={}
  currentclustid=-1

  # Clusters are initially just the rows
  clust=[bicluster(rows[i],id=i) for i in range(len(rows))]

  while len(clust)>1:
    lowestpair=(0,1)
    closest=distance(clust[0].vec,clust[1].vec)

    # loop through every pair looking for the smallest distance
    for i in range(len(clust)):
      for j in range(i+1,len(clust)):
        # distances is the cache of distance calculations
        if (clust[i].id,clust[j].id) not in distances:
          distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)

        d=distances[(clust[i].id,clust[j].id)]

        if d<closest:
          closest=d
          lowestpair=(i,j)

    # calculate the average of the two clusters
    mergevec=[
    (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
    for i in range(len(clust[0].vec))]

    # create the new cluster
    newcluster=bicluster(mergevec,left=clust[lowestpair[0]],
                         right=clust[lowestpair[1]],
                         distance=closest,id=currentclustid)

    # cluster ids that weren't in the original set are negative
    currentclustid-=1
    del clust[lowestpair[1]]
    del clust[lowestpair[0]]
    clust.append(newcluster)

  return clust[0]

def printclust(clust,labels=None,n=0):
  # indent to make a hierarchy layout
  for i in range(n): print ' ',
  if clust.id<0:
    # negative id means that this is branch
    print '-'
  else:
    # positive id means that this is an endpoint
    if labels==None: print clust.id
    else: print labels[clust.id]

  # now print the right and left branches
  if clust.left!=None: printclust(clust.left,labels=labels,n=n+1)
  if clust.right!=None: printclust(clust.right,labels=labels,n=n+1)

def getheight(clust):
  # Is this an endpoint? Then the height is just 1
  if clust.left==None and clust.right==None: return 1

  # Otherwise the height is the sum of the heights of
  # each branch
  return getheight(clust.left)+getheight(clust.right)

def getdepth(clust):
  # The distance of an endpoint is 0.0
  if clust.left==None and clust.right==None: return 0

  # The distance of a branch is the greater of its two sides
  # plus its own distance
  return max(getdepth(clust.left),getdepth(clust.right))+clust.distance


def drawdendrogram(clust,labels,jpeg='clusters.jpg'):
  # height and width
  h=getheight(clust)*20
  w=1200
  depth=getdepth(clust)

  # width is fixed, so scale distances accordingly
  scaling=float(w-150)/depth

  # Create a new image with a white background
  img=Image.new('RGB',(w,h),(255,255,255))
  draw=ImageDraw.Draw(img)

  draw.line((0,h/2,10,h/2),fill=(255,0,0))

  # Draw the first node
  drawnode(draw,clust,10,(h/2),scaling,labels)
  img.save(jpeg,'JPEG')

def drawnode(draw,clust,x,y,scaling,labels):
  if clust.id<0:
    h1=getheight(clust.left)*20
    h2=getheight(clust.right)*20
    top=y-(h1+h2)/2
    bottom=y+(h1+h2)/2
    # Line length
    ll=clust.distance*scaling
    # Vertical line from this cluster to children
    draw.line((x,top+h1/2,x,bottom-h2/2),fill=(255,0,0))

    # Horizontal line to left item
    draw.line((x,top+h1/2,x+ll,top+h1/2),fill=(255,0,0))

    # Horizontal line to right item
    draw.line((x,bottom-h2/2,x+ll,bottom-h2/2),fill=(255,0,0))

    # Call the function to draw the left and right nodes
    drawnode(draw,clust.left,x+ll,top+h1/2,scaling,labels)
    drawnode(draw,clust.right,x+ll,bottom-h2/2,scaling,labels)
  else:
    # If this is an endpoint, draw the item label
    draw.text((x+5,y-7),labels[clust.id],(0,0,0))

def rotatematrix(data):
  newdata=[]
  for i in range(len(data[0])):
    newrow=[data[j][i] for j in range(len(data))]
    newdata.append(newrow)
  return newdata

import random

def kcluster(rows,distance=pearson,k=4):
  # Determine the minimum and maximum values for each point
  ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
  for i in range(len(rows[0]))]

  # Create k randomly placed centroids
  clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0]
  for i in range(len(rows[0]))] for j in range(k)]

  lastmatches=None
  for t in range(100):
    print 'Iteration %d' % t
    bestmatches=[[] for i in range(k)]

    # Find which centroid is the closest for each row
    for j in range(len(rows)):
      row=rows[j]
      bestmatch=0
      for i in range(k):
        d=distance(clusters[i],row)
        if d<distance(clusters[bestmatch],row): bestmatch=i
      bestmatches[bestmatch].append(j)

    # If the results are the same as last time, this is complete
    if bestmatches==lastmatches: break
    lastmatches=bestmatches

    # Move the centroids to the average of their members
    for i in range(k):
      avgs=[0.0]*len(rows[0])
      if len(bestmatches[i])>0:
        for rowid in bestmatches[i]:
          for m in range(len(rows[rowid])):
            avgs[m]+=rows[rowid][m]
        for j in range(len(avgs)):
          avgs[j]/=len(bestmatches[i])
        clusters[i]=avgs

  return bestmatches

def tanamoto(v1,v2):
  c1,c2,shr=0,0,0

  for i in range(len(v1)):
    if v1[i]!=0: c1+=1 # in v1
    if v2[i]!=0: c2+=1 # in v2
    if v1[i]!=0 and v2[i]!=0: shr+=1 # in both

  return 1.0-(float(shr)/(c1+c2-shr))

def scaledown(data,distance=pearson,rate=0.01):
  n=len(data)

  # The real distances between every pair of items
  realdist=[[distance(data[i],data[j]) for j in range(n)]
             for i in range(0,n)]

  # Randomly initialize the starting points of the locations in 2D
  loc=[[random.random(),random.random()] for i in range(n)]
  fakedist=[[0.0 for j in range(n)] for i in range(n)]

  lasterror=None
  for m in range(0,1000):
    # Find projected distances
    for i in range(n):
      for j in range(n):
        fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
                                 for x in range(len(loc[i]))]))

    # Move points
    grad=[[0.0,0.0] for i in range(n)]

    totalerror=0
    for k in range(n):
      for j in range(n):
        if j==k: continue
        # The error is percent difference between the distances
        errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]

        # Each point needs to be moved away from or towards the other
        # point in proportion to how much error it has
        grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm
        grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm

        # Keep track of the total error
        totalerror+=abs(errorterm)
    print totalerror

    # If the answer got worse by moving the points, we are done
    if lasterror and lasterror<totalerror: break
    lasterror=totalerror

    # Move each of the points by the learning rate times the gradient
    for k in range(n):
      loc[k][0]-=rate*grad[k][0]
      loc[k][1]-=rate*grad[k][1]

  return loc

def draw2d(data,labels,jpeg='mds2d.jpg'):
  img=Image.new('RGB',(2000,2000),(255,255,255))
  draw=ImageDraw.Draw(img)
  for i in range(len(data)):
    x=(data[i][0]+0.5)*1000
    y=(data[i][1]+0.5)*1000
    draw.text((x,y),labels[i],(0,0,0))
  img.save(jpeg,'JPEG')
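Typical use on the blog dataset produced in reply #4, assuming this module is saved as clusters.py and blogdata1.txt is in the working directory:

import clusters

blognames,words,data=clusters.readfile('blogdata1.txt')

# Hierarchical clustering, printed as an indented tree and drawn as a dendrogram
clust=clusters.hcluster(data)
clusters.printclust(clust,labels=blognames)
clusters.drawdendrogram(clust,blognames,jpeg='blogclust.jpg')

# K-means clustering into 10 groups
kclust=clusters.kcluster(data,k=10)
print [blognames[r] for r in kclust[0]]

# Multidimensional scaling to a 2D map of the blogs
coords=clusters.scaledown(data)
clusters.draw2d(coords,blognames,jpeg='blogs2d.jpg')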


#7 redflame, posted 2014-8-29 11:07:33

Chapter 4: Searching and Ranking

import urllib2
import re
from BeautifulSoup import *
from urlparse import urljoin
from pysqlite2 import dbapi2 as sqlite
import nn
mynet=nn.searchnet('nn.db')

# Create a list of words to ignore
ignorewords={'the':1,'of':1,'to':1,'and':1,'a':1,'in':1,'is':1,'it':1}


class crawler:
  # Initialize the crawler with the name of database
  def __init__(self,dbname):
    self.con=sqlite.connect(dbname)

  def __del__(self):
    self.con.close()

  def dbcommit(self):
    self.con.commit()

  # Auxiliary function for getting an entry id and adding
  # it if it's not present
  def getentryid(self,table,field,value,createnew=True):
    cur=self.con.execute(
    "select rowid from %s where %s='%s'" % (table,field,value))
    res=cur.fetchone()
    if res==None:
      cur=self.con.execute(
      "insert into %s (%s) values ('%s')" % (table,field,value))
      return cur.lastrowid
    else:
      return res[0]


  # Index an individual page
  def addtoindex(self,url,soup):
    if self.isindexed(url): return
    print 'Indexing '+url

    # Get the individual words
    text=self.gettextonly(soup)
    words=self.separatewords(text)

    # Get the URL id
    urlid=self.getentryid('urllist','url',url)

    # Link each word to this url
    for i in range(len(words)):
      word=words[i]
      if word in ignorewords: continue
      wordid=self.getentryid('wordlist','word',word)
      self.con.execute("insert into wordlocation(urlid,wordid,location) values (%d,%d,%d)" % (urlid,wordid,i))


  # Extract the text from an HTML page (no tags)
  def gettextonly(self,soup):
    v=soup.string
    if v==None:
      c=soup.contents
      resulttext=''
      for t in c:
        subtext=self.gettextonly(t)
        resulttext+=subtext+'\n'
      return resulttext
    else:
      return v.strip()

  # Separate the words by any non-word character
  def separatewords(self,text):
    splitter=re.compile('\\W*')
    return [s.lower() for s in splitter.split(text) if s!='']


  # Return true if this url is already indexed
  def isindexed(self,url):
    return False

  # Add a link between two pages
  def addlinkref(self,urlFrom,urlTo,linkText):
    words=self.separatewords(linkText)
    fromid=self.getentryid('urllist','url',urlFrom)
    toid=self.getentryid('urllist','url',urlTo)
    if fromid==toid: return
    cur=self.con.execute("insert into link(fromid,toid) values (%d,%d)" % (fromid,toid))
    linkid=cur.lastrowid
    for word in words:
      if word in ignorewords: continue
      wordid=self.getentryid('wordlist','word',word)
      self.con.execute("insert into linkwords(linkid,wordid) values (%d,%d)" % (linkid,wordid))

  # Starting with a list of pages, do a breadth
  # first search to the given depth, indexing pages
  # as we go
  def crawl(self,pages,depth=2):
    for i in range(depth):
      newpages={}
      for page in pages:
        try:
          c=urllib2.urlopen(page)
        except:
          print "Could not open %s" % page
          continue
        try:
          soup=BeautifulSoup(c.read())
          self.addtoindex(page,soup)

          links=soup('a')
          for link in links:
            if ('href' in dict(link.attrs)):
              url=urljoin(page,link['href'])
              if url.find("'")!=-1: continue
              url=url.split('#')[0]  # remove location portion
              if url[0:4]=='http' and not self.isindexed(url):
                newpages[url]=1
              linkText=self.gettextonly(link)
              self.addlinkref(page,url,linkText)

          self.dbcommit()
        except:
          print "Could not parse page %s" % page

      pages=newpages


  # Create the database tables
  def createindextables(self):
    self.con.execute('create table urllist(url)')
    self.con.execute('create table wordlist(word)')
    self.con.execute('create table wordlocation(urlid,wordid,location)')
    self.con.execute('create table link(fromid integer,toid integer)')
    self.con.execute('create table linkwords(wordid,linkid)')
    self.con.execute('create index wordidx on wordlist(word)')
    self.con.execute('create index urlidx on urllist(url)')
    self.con.execute('create index wordurlidx on wordlocation(wordid)')
    self.con.execute('create index urltoidx on link(toid)')
    self.con.execute('create index urlfromidx on link(fromid)')
    self.dbcommit()

  def calculatepagerank(self,iterations=20):
    # clear out the current page rank tables
    self.con.execute('drop table if exists pagerank')
    self.con.execute('create table pagerank(urlid primary key,score)')

    # initialize every url with a page rank of 1
    for (urlid,) in self.con.execute('select rowid from urllist'):
      self.con.execute('insert into pagerank(urlid,score) values (%d,1.0)' % urlid)
    self.dbcommit()

    for i in range(iterations):
      print "Iteration %d" % (i)
      for (urlid,) in self.con.execute('select rowid from urllist'):
        pr=0.15

        # Loop through all the pages that link to this one
        for (linker,) in self.con.execute(
        'select distinct fromid from link where toid=%d' % urlid):
          # Get the page rank of the linker
          linkingpr=self.con.execute(
          'select score from pagerank where urlid=%d' % linker).fetchone()[0]

          # Get the total number of links from the linker
          linkingcount=self.con.execute(
          'select count(*) from link where fromid=%d' % linker).fetchone()[0]
          pr+=0.85*(linkingpr/linkingcount)
        self.con.execute(
        'update pagerank set score=%f where urlid=%d' % (pr,urlid))
      self.dbcommit()

class searcher:
  def __init__(self,dbname):
    self.con=sqlite.connect(dbname)

  def __del__(self):
    self.con.close()

  def getmatchrows(self,q):
    # Strings to build the query
    fieldlist='w0.urlid'
    tablelist=''
    clauselist=''
    wordids=[]

    # Split the words by spaces
    words=q.split(' ')
    tablenumber=0

    for word in words:
      # Get the word ID
      wordrow=self.con.execute(
      "select rowid from wordlist where word='%s'" % word).fetchone()
      if wordrow!=None:
        wordid=wordrow[0]
        wordids.append(wordid)
        if tablenumber>0:
          tablelist+=','
          clauselist+=' and '
          clauselist+='w%d.urlid=w%d.urlid and ' % (tablenumber-1,tablenumber)
        fieldlist+=',w%d.location' % tablenumber
        tablelist+='wordlocation w%d' % tablenumber
        clauselist+='w%d.wordid=%d' % (tablenumber,wordid)
        tablenumber+=1

    # Create the query from the separate parts
    fullquery='select %s from %s where %s' % (fieldlist,tablelist,clauselist)
    print fullquery
    cur=self.con.execute(fullquery)
    rows=[row for row in cur]

    return rows,wordids

  def getscoredlist(self,rows,wordids):
    totalscores=dict([(row[0],0) for row in rows])

    # This is where we'll put our scoring functions
    weights=[(1.0,self.locationscore(rows)),
             (1.0,self.frequencyscore(rows)),
             (1.0,self.pagerankscore(rows)),
             (1.0,self.linktextscore(rows,wordids)),
             (5.0,self.nnscore(rows,wordids))]
    for (weight,scores) in weights:
      for url in totalscores:
        totalscores[url]+=weight*scores[url]

    return totalscores

  def geturlname(self,id):
    return self.con.execute(
    "select url from urllist where rowid=%d" % id).fetchone()[0]

  def query(self,q):
    rows,wordids=self.getmatchrows(q)
    scores=self.getscoredlist(rows,wordids)
    rankedscores=[(score,url) for (url,score) in scores.items()]
    rankedscores.sort()
    rankedscores.reverse()
    for (score,urlid) in rankedscores[0:10]:
      print '%f\t%s' % (score,self.geturlname(urlid))
    return wordids,[r[1] for r in rankedscores[0:10]]

  def normalizescores(self,scores,smallIsBetter=0):
    vsmall=0.00001 # Avoid division by zero errors
    if smallIsBetter:
      minscore=min(scores.values())
      return dict([(u,float(minscore)/max(vsmall,l)) for (u,l) in scores.items()])
    else:
      maxscore=max(scores.values())
      if maxscore==0: maxscore=vsmall
      return dict([(u,float(c)/maxscore) for (u,c) in scores.items()])

  def frequencyscore(self,rows):
    counts=dict([(row[0],0) for row in rows])
    for row in rows: counts[row[0]]+=1
    return self.normalizescores(counts)

  def locationscore(self,rows):
    locations=dict([(row[0],1000000) for row in rows])
    for row in rows:
      loc=sum(row[1:])
      if loc<locations[row[0]]: locations[row[0]]=loc

    return self.normalizescores(locations,smallIsBetter=1)

  def distancescore(self,rows):
    # If there's only one word, everyone wins!
    if len(rows[0])<=2: return dict([(row[0],1.0) for row in rows])

    # Initialize the dictionary with large values
    mindistance=dict([(row[0],1000000) for row in rows])

    for row in rows:
      dist=sum([abs(row[i]-row[i-1]) for i in range(2,len(row))])
      if dist<mindistance[row[0]]: mindistance[row[0]]=dist
    return self.normalizescores(mindistance,smallIsBetter=1)

  def inboundlinkscore(self,rows):
    uniqueurls=dict([(row[0],1) for row in rows])
    inboundcount=dict([(u,self.con.execute('select count(*) from link where toid=%d' % u).fetchone()[0]) for u in uniqueurls])
    return self.normalizescores(inboundcount)

  def linktextscore(self,rows,wordids):
    linkscores=dict([(row[0],0) for row in rows])
    for wordid in wordids:
      cur=self.con.execute('select link.fromid,link.toid from linkwords,link where wordid=%d and linkwords.linkid=link.rowid' % wordid)
      for (fromid,toid) in cur:
        if toid in linkscores:
          pr=self.con.execute('select score from pagerank where urlid=%d' % fromid).fetchone()[0]
          linkscores[toid]+=pr
    maxscore=max(linkscores.values())
    normalizedscores=dict([(u,float(l)/maxscore) for (u,l) in linkscores.items()])
    return normalizedscores

  def pagerankscore(self,rows):
    pageranks=dict([(row[0],self.con.execute('select score from pagerank where urlid=%d' % row[0]).fetchone()[0]) for row in rows])
    maxrank=max(pageranks.values())
    normalizedscores=dict([(u,float(l)/maxrank) for (u,l) in pageranks.items()])
    return normalizedscores

  def nnscore(self,rows,wordids):
    # Get unique URL IDs as an ordered list
    urlids=[urlid for urlid in dict([(row[0],1) for row in rows])]
    nnres=mynet.getresult(wordids,urlids)
    scores=dict([(urlids[i],nnres[i]) for i in range(len(urlids))])
    return self.normalizescores(scores)
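A first run might look like the following, assuming the code above is saved as searchengine.py, the nn module from reply #8 is importable, and the seed URL is reachable (the URL below is only an example; the book crawls a small wiki mirror of its own):

import nn
nn.searchnet('nn.db').maketables()   # only needed once, before the first query

import searchengine

# Build a fresh index
crawler=searchengine.crawler('searchindex.db')
crawler.createindextables()
pagelist=['http://en.wikipedia.org/wiki/Programming_language']   # example seed
crawler.crawl(pagelist)
crawler.calculatepagerank()

# Query it
e=searchengine.searcher('searchindex.db')
e.query('functional programming')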


#8 长线小白龙, posted 2014-8-29 11:53:15

Chapter 4: Searching and Ranking

from math import tanh
from pysqlite2 import dbapi2 as sqlite

def dtanh(y):
    return 1.0-y*y

class searchnet:
    def __init__(self,dbname):
      self.con=sqlite.connect(dbname)

    def __del__(self):
      self.con.close()

    def maketables(self):
      self.con.execute('create table hiddennode(create_key)')
      self.con.execute('create table wordhidden(fromid,toid,strength)')
      self.con.execute('create table hiddenurl(fromid,toid,strength)')
      self.con.commit()

    def getstrength(self,fromid,toid,layer):
      if layer==0: table='wordhidden'
      else: table='hiddenurl'
      res=self.con.execute('select strength from %s where fromid=%d and toid=%d' % (table,fromid,toid)).fetchone()
      if res==None:
          if layer==0: return -0.2
          if layer==1: return 0
      return res[0]

    def setstrength(self,fromid,toid,layer,strength):
      if layer==0: table='wordhidden'
      else: table='hiddenurl'
      res=self.con.execute('select rowid from %s where fromid=%d and toid=%d' % (table,fromid,toid)).fetchone()
      if res==None:
        self.con.execute('insert into %s (fromid,toid,strength) values (%d,%d,%f)' % (table,fromid,toid,strength))
      else:
        rowid=res[0]
        self.con.execute('update %s set strength=%f where rowid=%d' % (table,strength,rowid))

    def generatehiddennode(self,wordids,urls):
      if len(wordids)>3: return None
      # Check if we already created a node for this set of words
      sorted_words=[str(id) for id in wordids]
      sorted_words.sort()
      createkey='_'.join(sorted_words)
      res=self.con.execute(
      "select rowid from hiddennode where create_key='%s'" % createkey).fetchone()

      # If not, create it
      if res==None:
        cur=self.con.execute(
        "insert into hiddennode (create_key) values ('%s')" % createkey)
        hiddenid=cur.lastrowid
        # Put in some default weights
        for wordid in wordids:
          self.setstrength(wordid,hiddenid,0,1.0/len(wordids))
        for urlid in urls:
          self.setstrength(hiddenid,urlid,1,0.1)
        self.con.commit()

    def getallhiddenids(self,wordids,urlids):
      l1={}
      for wordid in wordids:
        cur=self.con.execute(
        'select toid from wordhidden where fromid=%d' % wordid)
        for row in cur: l1[row[0]]=1
      for urlid in urlids:
        cur=self.con.execute(
        'select fromid from hiddenurl where toid=%d' % urlid)
        for row in cur: l1[row[0]]=1
      return l1.keys()

    def setupnetwork(self,wordids,urlids):
        # value lists
        self.wordids=wordids
        self.hiddenids=self.getallhiddenids(wordids,urlids)
        self.urlids=urlids

        # node outputs
        self.ai = [1.0]*len(self.wordids)
        self.ah = [1.0]*len(self.hiddenids)
        self.ao = [1.0]*len(self.urlids)

        # create weights matrix
        self.wi = [[self.getstrength(wordid,hiddenid,0)
                    for hiddenid in self.hiddenids]
                   for wordid in self.wordids]
        self.wo = [[self.getstrength(hiddenid,urlid,1)
                    for urlid in self.urlids]
                   for hiddenid in self.hiddenids]

    def feedforward(self):
        # the only inputs are the query words
        for i in range(len(self.wordids)):
            self.ai[i] = 1.0

        # hidden activations
        for j in range(len(self.hiddenids)):
            sum = 0.0
            for i in range(len(self.wordids)):
                sum = sum + self.ai[i] * self.wi[i][j]
            self.ah[j] = tanh(sum)

        # output activations
        for k in range(len(self.urlids)):
            sum = 0.0
            for j in range(len(self.hiddenids)):
                sum = sum + self.ah[j] * self.wo[j][k]
            self.ao[k] = tanh(sum)

        return self.ao[:]

    def getresult(self,wordids,urlids):
      self.setupnetwork(wordids,urlids)
      return self.feedforward()

    def backPropagate(self, targets, N=0.5):
        # calculate errors for output
        output_deltas = [0.0] * len(self.urlids)
        for k in range(len(self.urlids)):
            error = targets[k]-self.ao[k]
            output_deltas[k] = dtanh(self.ao[k]) * error

        # calculate errors for hidden layer
        hidden_deltas = [0.0] * len(self.hiddenids)
        for j in range(len(self.hiddenids)):
            error = 0.0
            for k in range(len(self.urlids)):
                error = error + output_deltas[k]*self.wo[j][k]
            hidden_deltas[j] = dtanh(self.ah[j]) * error

        # update output weights
        for j in range(len(self.hiddenids)):
            for k in range(len(self.urlids)):
                change = output_deltas[k]*self.ah[j]
                self.wo[j][k] = self.wo[j][k] + N*change

        # update input weights
        for i in range(len(self.wordids)):
            for j in range(len(self.hiddenids)):
                change = hidden_deltas[j]*self.ai[i]
                self.wi[i][j] = self.wi[i][j] + N*change

    def trainquery(self,wordids,urlids,selectedurl):
      # generate a hidden node if necessary
      self.generatehiddennode(wordids,urlids)

      self.setupnetwork(wordids,urlids)
      self.feedforward()
      targets=[0.0]*len(urlids)
      targets[urlids.index(selectedurl)]=1.0
      error = self.backPropagate(targets)
      self.updatedatabase()

    def updatedatabase(self):
      # set them to database values
      for i in range(len(self.wordids)):
          for j in range(len(self.hiddenids)):
              self.setstrength(self.wordids[i],self.hiddenids[j],0,self.wi[i][j])
      for j in range(len(self.hiddenids)):
          for k in range(len(self.urlids)):
              self.setstrength(self.hiddenids[j],self.urlids[k],1,self.wo[j][k])
      self.con.commit()
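The network can be exercised on its own with made-up integer IDs (the IDs below are only illustrative; in the search engine they are rowids from the wordlist and urllist tables):

import nn

mynet=nn.searchnet('nn.db')
mynet.maketables()

# Three query words and three candidate URLs, as arbitrary IDs
wWorld,wRiver,wBank=101,102,103
uWorldBank,uRiver,uEarth=201,202,203

# Pretend the user repeatedly clicked the WorldBank result for the query (world, bank)
for i in range(30):
    mynet.trainquery([wWorld,wBank],[uWorldBank,uRiver,uEarth],uWorldBank)

# After training, the first output (uWorldBank) should be the strongest
print mynet.getresult([wWorld,wBank],[uWorldBank,uRiver,uEarth])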


#9 lhf8059, posted 2014-8-29 18:03:00

Chapter 5: Optimization

import math

people=['Charlie','Augustus','Veruca','Violet','Mike','Joe','Willy','Miranda']

links=[('Augustus', 'Willy'),
       ('Mike', 'Joe'),
       ('Miranda', 'Mike'),
       ('Violet', 'Augustus'),
       ('Miranda', 'Willy'),
       ('Charlie', 'Mike'),
       ('Veruca', 'Joe'),
       ('Miranda', 'Augustus'),
       ('Willy', 'Augustus'),
       ('Joe', 'Charlie'),
       ('Veruca', 'Augustus'),
       ('Miranda', 'Joe')]


def crosscount(v):
  # Convert the number list into a dictionary of person:(x,y)
  loc=dict([(people[i],(v[i*2],v[i*2+1])) for i in range(0,len(people))])
  total=0

  # Loop through every pair of links
  for i in range(len(links)):
    for j in range(i+1,len(links)):

      # Get the locations
      (x1,y1),(x2,y2)=loc[links[i][0]],loc[links[i][1]]
      (x3,y3),(x4,y4)=loc[links[j][0]],loc[links[j][1]]

      den=(y4-y3)*(x2-x1)-(x4-x3)*(y2-y1)

      # den==0 if the lines are parallel
      if den==0: continue

      # Otherwise ua and ub are the fraction of the
      # line where they cross
      ua=((x4-x3)*(y1-y3)-(y4-y3)*(x1-x3))/den
      ub=((x2-x1)*(y1-y3)-(y2-y1)*(x1-x3))/den

      # If the fraction is between 0 and 1 for both lines
      # then they cross each other
      if ua>0 and ua<1 and ub>0 and ub<1:
        total+=1

  for i in range(len(people)):
    for j in range(i+1,len(people)):
      # Get the locations of the two nodes
      (x1,y1),(x2,y2)=loc[people[i]],loc[people[j]]

      # Find the distance between them
      dist=math.sqrt(math.pow(x1-x2,2)+math.pow(y1-y2,2))
      # Penalize any nodes closer than 50 pixels
      if dist<50:
        total+=(1.0-(dist/50.0))

  return total

from PIL import Image,ImageDraw

def drawnetwork(sol):
  # Create the image
  img=Image.new('RGB',(400,400),(255,255,255))
  draw=ImageDraw.Draw(img)

  # Create the position dict
  pos=dict([(people[i],(sol[i*2],sol[i*2+1])) for i in range(0,len(people))])

  for (a,b) in links:
    draw.line((pos[a],pos[b]),fill=(255,0,0))

  for n,p in pos.items():
    draw.text(p,n,(0,0,0))

  img.show()


domain=[(10,370)]*(len(people)*2)
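crosscount is a cost function meant to be minimized by the optimizers in reply #10; a sketch, assuming that code is saved as optimization.py and this post's code as socialnetwork.py (note that optimization.py reads schedule.txt at import time, so that file must be present):

import optimization
import socialnetwork

# domain gives every person an (x,y) pair inside the 400x400 canvas
sol=optimization.annealingoptimize(socialnetwork.domain,
                                   socialnetwork.crosscount,
                                   step=50,cool=0.99)
print socialnetwork.crosscount(sol)
socialnetwork.drawnetwork(sol)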


#10 hsiaoyan, posted 2014-8-29 18:52:30

Chapter 5: Optimization

import time
import random
import math

people = [('Seymour','BOS'),
          ('Franny','DAL'),
          ('Zooey','CAK'),
          ('Walt','MIA'),
          ('Buddy','ORD'),
          ('Les','OMA')]
# Laguardia
destination='LGA'

flights={}
# Load the flight schedule from schedule.txt
for line in file('schedule.txt'):
  origin,dest,depart,arrive,price=line.strip().split(',')
  flights.setdefault((origin,dest),[])

  # Add details to the list of possible flights
  flights[(origin,dest)].append((depart,arrive,int(price)))

def getminutes(t):
  x=time.strptime(t,'%H:%M')
  return x[3]*60+x[4]

def printschedule(r):
  for d in range(len(r)/2):
    name=people[d][0]
    origin=people[d][1]
    out=flights[(origin,destination)][int(r[2*d])]
    ret=flights[(destination,origin)][int(r[2*d+1])]
    print '%10s%10s %5s-%5s $%3s %5s-%5s $%3s' % (name,origin,
                                                  out[0],out[1],out[2],
                                                  ret[0],ret[1],ret[2])

def schedulecost(sol):
  totalprice=0
  latestarrival=0
  earliestdep=24*60

  for d in range(len(sol)/2):
    # Get the inbound and outbound flights
    origin=people[d][1]
    outbound=flights[(origin,destination)][int(sol[2*d])]
    returnf=flights[(destination,origin)][int(sol[2*d+1])]

    # Total price is the price of all outbound and return flights
    totalprice+=outbound[2]
    totalprice+=returnf[2]

    # Track the latest arrival and earliest departure
    if latestarrival<getminutes(outbound[1]): latestarrival=getminutes(outbound[1])
    if earliestdep>getminutes(returnf[0]): earliestdep=getminutes(returnf[0])

  # Every person must wait at the airport until the latest person arrives.
  # They also must arrive at the same time and wait for their flights.
  totalwait=0
  for d in range(len(sol)/2):
    origin=people[d][1]
    outbound=flights[(origin,destination)][int(sol[2*d])]
    returnf=flights[(destination,origin)][int(sol[2*d+1])]
    totalwait+=latestarrival-getminutes(outbound[1])
    totalwait+=getminutes(returnf[0])-earliestdep

  # Does this solution require an extra day of car rental? That'll be $50!
  if latestarrival>earliestdep: totalprice+=50

  return totalprice+totalwait

def randomoptimize(domain,costf):
  best=999999999
  bestr=None
  for i in range(0,1000):
    # Create a random solution
    r=[float(random.randint(domain[i][0],domain[i][1]))
       for i in range(len(domain))]

    # Get the cost
    cost=costf(r)

    # Compare it to the best one so far
    if cost<best:
      best=cost
      bestr=r
  return bestr

def hillclimb(domain,costf):
  # Create a random solution
  sol=[random.randint(domain[i][0],domain[i][1])
      for i in range(len(domain))]
  # Main loop
  while 1:
    # Create list of neighboring solutions
    neighbors=[]

    for j in range(len(domain)):
      # One away in each direction
      if sol[j]>domain[j][0]:
        neighbors.append(sol[0:j]+[sol[j]+1]+sol[j+1:])
      if sol[j]<domain[j][1]:
        neighbors.append(sol[0:j]+[sol[j]-1]+sol[j+1:])

    # See what the best solution amongst the neighbors is
    current=costf(sol)
    best=current
    for j in range(len(neighbors)):
      cost=costf(neighbors[j])
      if cost<best:
        best=cost
        sol=neighbors[j]

    # If there's no improvement, then we've reached the top
    if best==current:
      break
  return sol

def annealingoptimize(domain,costf,T=10000.0,cool=0.95,step=1):
  # Initialize the values randomly
  vec=[float(random.randint(domain[i][0],domain[i][1]))
       for i in range(len(domain))]

  while T>0.1:
    # Choose one of the indices
    i=random.randint(0,len(domain)-1)

    # Choose a direction to change it
    dir=random.randint(-step,step)

    # Create a new list with one of the values changed
    vecb=vec[:]
    vecb[i]+=dir
    if vecb[i]<domain[i][0]: vecb[i]=domain[i][0]
    elif vecb[i]>domain[i][1]: vecb[i]=domain[i][1]

    # Calculate the current cost and the new cost
    ea=costf(vec)
    eb=costf(vecb)
    p=pow(math.e,-(eb-ea)/T)

    # Is it better, or does it make the probability
    # cutoff?
    if (eb<ea or random.random()<p):
      vec=vecb

    # Decrease the temperature
    T=T*cool
  return vec

def geneticoptimize(domain,costf,popsize=50,step=1,
                    mutprob=0.2,elite=0.2,maxiter=100):
  # Mutation Operation
  def mutate(vec):
    i=random.randint(0,len(domain)-1)
    if random.random()<0.5 and vec[i]>domain[i][0]:
      return vec[0:i]+[vec[i]-step]+vec[i+1:]
    elif vec[i]<domain[i][1]:
      return vec[0:i]+[vec[i]+step]+vec[i+1:]
    else:
      return vec

  # Crossover Operation
  def crossover(r1,r2):
    i=random.randint(1,len(domain)-2)
    return r1[0:i]+r2[i:]

  # Build the initial population
  pop=[]
  for i in range(popsize):
    vec=[random.randint(domain[i][0],domain[i][1])
         for i in range(len(domain))]
    pop.append(vec)

  # How many winners from each generation?
  topelite=int(elite*popsize)

  # Main loop
  for i in range(maxiter):
    scores=[(costf(v),v) for v in pop]
    scores.sort()
    ranked=[v for (s,v) in scores]

    # Start with the pure winners
    pop=ranked[0:topelite]

    # Add mutated and bred forms of the winners
    while len(pop)<popsize:
      if random.random()<mutprob:

        # Mutation
        c=random.randint(0,topelite)
        pop.append(mutate(ranked[c]))
      else:

        # Crossover
        c1=random.randint(0,topelite)
        c2=random.randint(0,topelite)
        pop.append(crossover(ranked[c1],ranked[c2]))

    # Print current best score
    print scores[0][0]

  return scores[0][1]
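Typical use on the group-travel problem, assuming this module is saved as optimization.py and schedule.txt (comma-separated origin,dest,depart,arrive,price rows) is in the working directory. Each person needs two entries in the solution vector (outbound and return flight index), so the domain has len(people)*2 entries; the upper bound of 9 assumes ten flights per route, as in the book's sample file:

import optimization

domain=[(0,9)]*(len(optimization.people)*2)

s=optimization.randomoptimize(domain,optimization.schedulecost)
print optimization.schedulecost(s)
optimization.printschedule(s)

s=optimization.annealingoptimize(domain,optimization.schedulecost)
print optimization.schedulecost(s)

s=optimization.geneticoptimize(domain,optimization.schedulecost)
optimization.printschedule(s)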

