Original poster: Lisrelchen

[Blog post] A Beginner’s Guide to Tweet Analytics with Pandas

import pandas as pd

def load_tweets(tweet_file):
    """Load and process a Twitter Analytics data file."""

    # Read tweet data (obtained from Twitter Analytics)
    tweet_df = pd.read_csv(tweet_file)

    # Drop irrelevant columns (positions 13 through 39)
    tweet_df = tweet_df.drop(tweet_df.columns[13:40], axis=1)

    return tweet_df
http://www.kdnuggets.com/2017/03/beginners-guide-tweet-analytics-pandas.html
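
Dropping columns by position breaks quietly if the export layout ever changes. As a hedged alternative, the sketch below drops columns by name prefix instead. The helper name load_tweets_by_name, the "promoted" prefix, and the sample column names are assumptions made for illustration; check them against the header of your own export before relying on them.

```python
import io

import pandas as pd


def load_tweets_by_name(tweet_file, drop_prefix="promoted"):
    """Load a tweet CSV and drop every column whose name starts with drop_prefix.

    Hypothetical helper: the prefix is an assumption about the export's header.
    """
    tweet_df = pd.read_csv(tweet_file)
    unwanted = [col for col in tweet_df.columns if col.lower().startswith(drop_prefix)]
    return tweet_df.drop(columns=unwanted)


# Tiny in-memory stand-in for an export file (made-up data).
sample_csv = io.StringIO(
    "Tweet text,impressions,promoted impressions,promoted engagements\n"
    "hello world,120,-,-\n"
    "pandas is fun,300,-,-\n"
)
sample_df = load_tweets_by_name(sample_csv)
print(list(sample_df.columns))
```

Selecting by name keeps the function working if columns are added or reordered, at the cost of depending on the exact header spelling.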

Keywords: Analytics, beginner, pandas

#2
Lisrelchen posted on 2017-4-16 01:39:00
Basic Tweet Stats

Given the columns shown in the output of running head() on the dataset, and a rough intuition of which tweet metrics are useful, we will grab the following stats:

Retweets - Mean RTs per tweet & top 5 RTed tweets
Likes - Mean likes per tweet & top 5 liked tweets
Impressions - Mean impressions per tweet & top 5 tweets with most impressions

# Total tweets
print('Total tweets this period:', len(tweet_df.index), '\n')

# Retweets: sort descending, then read the first five rows by position
tweet_df = tweet_df.sort_values(by='retweets', ascending=False)
tweet_df = tweet_df.reset_index(drop=True)
print('Mean retweets:', round(tweet_df['retweets'].mean(), 2), '\n')
print('Top 5 RTed tweets:')
print('------------------')
for i in range(5):
    print(tweet_df['Tweet text'].iloc[i], '-', tweet_df['retweets'].iloc[i])
print('\n')

# Likes
tweet_df = tweet_df.sort_values(by='likes', ascending=False)
tweet_df = tweet_df.reset_index(drop=True)
print('Mean likes:', round(tweet_df['likes'].mean(), 2), '\n')
print('Top 5 liked tweets:')
print('-------------------')
for i in range(5):
    print(tweet_df['Tweet text'].iloc[i], '-', tweet_df['likes'].iloc[i])
print('\n')

# Impressions
tweet_df = tweet_df.sort_values(by='impressions', ascending=False)
tweet_df = tweet_df.reset_index(drop=True)
print('Mean impressions:', round(tweet_df['impressions'].mean(), 2), '\n')
print('Top 5 tweets with most impressions:')
print('-----------------------------------')
for i in range(5):
    print(tweet_df['Tweet text'].iloc[i], '-', tweet_df['impressions'].iloc[i])
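
The three sort-and-print passes above repeat the same pattern, so they can be folded into one helper. The following is a minimal sketch rather than the original author's code: it uses pandas' DataFrame.nlargest, and the helper name top_tweets and the sample data are made up for illustration.

```python
import pandas as pd

# Made-up sample data using the column names from the post.
stats_df = pd.DataFrame({
    "Tweet text": ["a", "b", "c", "d", "e", "f"],
    "retweets": [3, 10, 0, 7, 1, 5],
    "likes": [4, 2, 9, 9, 0, 1],
    "impressions": [100, 900, 50, 400, 20, 300],
})


def top_tweets(df, metric, n=5):
    """Return the n rows with the largest value of metric (hypothetical helper)."""
    return df.nlargest(n, metric)[["Tweet text", metric]]


for metric in ["retweets", "likes", "impressions"]:
    print("Mean", metric + ":", round(stats_df[metric].mean(), 2))
    print(top_tweets(stats_df, metric).to_string(index=False))
    print()

top_rt = top_tweets(stats_df, "retweets")
```

Unlike the sort-and-reset approach, nlargest leaves the original DataFrame and its index untouched, and it simply returns fewer rows when the dataset holds fewer than n tweets.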

#3
Lisrelchen posted on 2017-4-16 01:40:14
Top #Hashtags and @Mentions

It's no secret that hashtags play an important role on Twitter, and mentions can also help grow your network and influence. Together they help put the 'social' in social networking, transforming platforms like Twitter from passive experiences to very active ones. With that, getting a handle on the most social aspect of this social network can be a helpful endeavour.

import operator
import string

# Hashtags & mentions
tag_dict = {}
mention_dict = {}

# Translation table that deletes all punctuation (including the leading # or @)
strip_punct = str.maketrans('', '', string.punctuation)

for i in tweet_df.index:
    tweet_text = tweet_df.loc[i, 'Tweet text']
    tweet = tweet_text.lower()
    tweet_tokenized = tweet.split()

    for word in tweet_tokenized:
        # Hashtags - tokenize and build dict of tag counts
        if word[0:1] == '#' and len(word) > 1:
            key = word.translate(strip_punct)
            if key in tag_dict:
                tag_dict[key] += 1
            else:
                tag_dict[key] = 1

        # Mentions - tokenize and build dict of mention counts
        if word[0:1] == '@' and len(word) > 1:
            key = word.translate(strip_punct)
            if key in mention_dict:
                mention_dict[key] += 1
            else:
                mention_dict[key] = 1

# The 10 most popular tags and counts, highest count first
top_tags_sorted = sorted(tag_dict.items(), key=operator.itemgetter(1), reverse=True)[:10]
print('Top 10 hashtags:')
print('----------------')
for tag in top_tags_sorted:
    print(tag[0], '-', str(tag[1]))

# The 10 most popular mentions and counts, highest count first
top_mentions_sorted = sorted(mention_dict.items(), key=operator.itemgetter(1), reverse=True)[:10]
print('\nTop 10 mentions:')
print('----------------')
for mention in top_mentions_sorted:
    print(mention[0], '-', str(mention[1]))
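
The manual dictionary bookkeeping above can also be expressed with collections.Counter and a regular expression. This is a sketch under stated assumptions, not the original method: the helper name count_tokens, the two patterns, and the sample texts are invented for illustration. It also behaves slightly differently, because the \w+ pattern keeps underscores and matches a # or @ anywhere in a word, whereas the loop above strips all punctuation and only looks at the start of each whitespace-separated token.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def count_tokens(texts, pattern):
    """Count case-insensitive matches of pattern across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(match.lower() for match in pattern.findall(text))
    return counts


# Made-up sample tweets.
sample_texts = [
    "Learning #Pandas with @kdnuggets",
    "More #pandas and #python tips, thanks @KDnuggets!",
    "#Python is great",
]
tag_counts = count_tokens(sample_texts, HASHTAG_RE)
mention_counts = count_tokens(sample_texts, MENTION_RE)
print(tag_counts.most_common(10))
print(mention_counts.most_common(10))
```

With a real dataset, the list of sample texts would be replaced by the 'Tweet text' column, and most_common(10) takes the place of the explicit sort and slice.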

#4
Lisrelchen posted on 2017-4-16 01:41:19
Time-series Analysis

Finally, let's have a look at some very basic temporal data. We will check mean impressions for tweets based, independently, on both the hour of day and the day of week on which they were tweeted. I caution (once again) that this is based on very little data, so nothing useful is likely to be gleaned. Given much larger amounts of tweet data, however, entire social media campaigns are planned around this kind of analysis.

While this is based on impressions, it could just as reasonably (and easily) be changed to use engagements, RTs, or whatever else you please. Working in advertising and promoting tweets? Maybe you are more interested in some of those promotion* metrics we hacked off the dataset at the start.

We have to convert the Twitter-supplied date field to a legitimate Python datetime object, bin the data by the hourly slot it falls into, identify days of the week, and then capture this data in a couple of additional columns in the DataFrame, which we will pillage for stats afterward.

import pandas as pd

# Time-series impressions (DOW, HOD, etc.) (0 = Monday ... 6 = Sunday)
gmt_offset = -4

# Create proper datetime column, apply local GMT offset
tweet_df['ts'] = pd.to_datetime(tweet_df['time'])
tweet_df['ts'] = tweet_df.ts + pd.to_timedelta(gmt_offset, unit='h')

# Add hour of day and day of week columns
tweet_df['hod'] = [t.hour for t in tweet_df.ts]
tweet_df['dow'] = [t.dayofweek for t in tweet_df.ts]

hod_dict = {}
hod_count = {}
dow_dict = {}
dow_count = {}
weekday_dict = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}

# Process tweets, collect total impressions and tweet counts per bin
for i in tweet_df.index:
    hod = tweet_df.loc[i, 'hod']
    dow = tweet_df.loc[i, 'dow']
    imp = tweet_df.loc[i, 'impressions']

    if hod in hod_dict:
        hod_dict[hod] += int(imp)
        hod_count[hod] += 1
    else:
        hod_dict[hod] = int(imp)
        hod_count[hod] = 1

    if dow in dow_dict:
        dow_dict[dow] += int(imp)
        dow_count[dow] += 1
    else:
        dow_dict[dow] = int(imp)
        dow_count[dow] = 1

# Integer division (//) keeps the whole-number averages of the Python 2 original
print('Average impressions per tweet by hour tweeted:')
print('----------------------------------------------')
for hod in sorted(hod_dict):
    print(hod, '-', hod + 1, ':', hod_dict[hod] // hod_count[hod], '=>', hod_count[hod], 'tweets')

print('\nAverage impressions per tweet by day of week tweeted:')
print('-----------------------------------------------------')
for dow in sorted(dow_dict):
    print(weekday_dict[dow], ':', dow_dict[dow] // dow_count[dow], '=>', dow_count[dow], 'tweets')
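
The same per-hour and per-day averages can be obtained with a pandas groupby, which also makes it easy to swap impressions for another metric, as the text above suggests. This is a minimal sketch on made-up rows, not the original author's code; the helper name mean_by and the sample values are assumptions, while the 'time' and 'impressions' column names and the fixed -4 hour offset follow the post.

```python
import pandas as pd

# Made-up sample rows; 'time' and 'impressions' follow the column names in the post.
ts_df = pd.DataFrame({
    "time": ["2017-03-06 14:10", "2017-03-06 14:50", "2017-03-07 09:05", "2017-03-12 02:30"],
    "impressions": [100, 300, 50, 80],
})

gmt_offset = -4  # same fixed offset as in the post
ts_df["ts"] = pd.to_datetime(ts_df["time"]) + pd.to_timedelta(gmt_offset, unit="h")
ts_df["hod"] = ts_df["ts"].dt.hour
ts_df["dow"] = ts_df["ts"].dt.day_name()


def mean_by(df, key, metric="impressions"):
    """Mean and tweet count of metric for each value of key (hypothetical helper)."""
    return df.groupby(key)[metric].agg(["mean", "count"])


by_hour = mean_by(ts_df, "hod")
by_day = mean_by(ts_df, "dow")
print(by_hour)
print(by_day)
```

Passing a different column name as metric (for example 'retweets', if present in your data) reuses the same grouping without touching the loop logic.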

#5
auirzxp (verified student) posted on 2017-4-16 01:50:54
Thanks for sharing.

#6
MouJack007 posted on 2017-4-16 03:41:04
Thanks to the original poster for sharing!

#7
MouJack007 posted on 2017-4-16 03:41:28

#8
franky_sas posted on 2017-4-16 13:41:29
