Original poster: oliyiyi

Deep thinking about your data

In the ongoing series of posts about the IMDB dataset from Kaggle, I have so far looked at several of the scraped variables, including the number of faces on movie posters (1, 2), plot keywords (3), and movie rating by title year (4).

In this post, I tackle the variables resulting from a data merge between IMDB and Facebook. These columns have names like "Director Facebook Likes", "Actor 1 Facebook Likes", etc. I didn't investigate exactly how they did this merging. Presumably, they matched the names of the actors and directors to their official fan pages on Facebook, and took the Like counts. (I suppose identifying the right pages is not a trivial task.)

There is clearly a "theory" behind computing these variables: star power. We should expect movies directed by famous directors, or starring famous actors or actresses, to be correlated with bigger box office receipts. The number of Facebook likes is a proxy measure for the concept of "star power" or "fame" or "popularity".

Proxies are necessary when dealing with quantities that are hard or impossible to measure directly. But be careful to choose proxies that describe the underlying quantities properly.

Using Facebook likes as a proxy for star power has a host of problems. For example:

  • Not all actors and directors use social media, or favor social media. If they do, Facebook may not be their chosen platform. In the extreme case, some countries like China block Facebook, so a Chinese actor or actress would obviously not be investing in a Facebook presence.
  • The Facebook like count is a snapshot. What is captured is the count on the day on which the data was compiled. What you want is the Facebook like count in the days or weeks prior to the release of the respective movie!
  • Not only the protagonists but also the fans have preferences for social media platforms. Actors and directors with older followers will have different Facebook statistics than those followed by younger generations.
  • All famous directors or actors have a breakthrough movie. They were nobodies before this movie, and became stars after this movie. The Facebook like count is a summary of the entire career of each person, so it cannot tell the two periods apart.
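The snapshot problem in the list above can be made concrete. The following is a minimal sketch, assuming a dated history of like counts were available (the dataset discussed here has no such history). The `LIKE_HISTORY` table, the `likes_as_of` helper, and all values in it are hypothetical and made up for illustration.

```python
from bisect import bisect_left
from datetime import date

# Hypothetical like-count history: person -> list of (snapshot_date, likes),
# sorted by date. All names and values are made up for illustration.
LIKE_HISTORY = {
    "Actor A": [
        (date(2010, 1, 1), 1_000),
        (date(2013, 1, 1), 50_000),
        (date(2016, 1, 1), 400_000),
    ],
}


def likes_as_of(person, release_date, history=LIKE_HISTORY):
    """Return the latest like count recorded strictly before release_date.

    Returns None when no snapshot predates the release, so that missing
    information is not silently turned into a zero.
    """
    snapshots = history.get(person, [])
    dates = [snapshot_date for snapshot_date, _ in snapshots]
    # bisect_left counts the snapshots dated strictly before release_date.
    idx = bisect_left(dates, release_date)
    if idx == 0:
        return None
    return snapshots[idx - 1][1]
```

With a lookup like this, a movie released in mid-2011 would be scored with the 2010 count rather than the much larger count recorded years later. Returning `None` instead of zero keeps "no information" distinct from "no fans".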

A quick glimpse of the data should give the analyst pause.


[Image: data snippet showing directors and their Facebook like counts]


Christopher Nolan, with 22,000 likes, dwarfs everybody else in this snippet. But James Cameron with zero? Bryan Singer, zero? Sam Mendes, zero? (This could be a data merge error, or it could be a structural problem.)
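A quick audit of suspicious zeros might look like the sketch below. It uses only the four directors and like counts mentioned above. The field names `director_name` and `director_facebook_likes` are assumptions about how the columns are spelled, and `zero_like_report` is a hypothetical helper, not part of any library.

```python
# Rows mirroring the directors named in the text. The field names are
# assumed spellings of the dataset's columns.
rows = [
    {"director_name": "Christopher Nolan", "director_facebook_likes": 22000},
    {"director_name": "James Cameron", "director_facebook_likes": 0},
    {"director_name": "Bryan Singer", "director_facebook_likes": 0},
    {"director_name": "Sam Mendes", "director_facebook_likes": 0},
]


def zero_like_report(rows, name_key, likes_key):
    """Return (sorted names with a zero like count, their share of all names)."""
    all_names = {r[name_key] for r in rows}
    zero_names = sorted({r[name_key] for r in rows if r[likes_key] == 0})
    share = len(zero_names) / len(all_names) if all_names else 0.0
    return zero_names, share


zero_names, zero_share = zero_like_report(
    rows, "director_name", "director_facebook_likes"
)
print(zero_names, zero_share)
```

A high share of zeros among well-known names suggests that zero may be standing in for "no page matched" rather than "no fans", which is worth settling before the column goes into a model.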

Actors are not much more telling either, as this list of the top actors shows:


[Image: list of the top actors by Facebook like count]


Darcy Donavan is 13 times more "famous" than Robin Williams (RIP) according to this metric. Darcy is primarily a TV actress, so that's yet another issue with using Facebook likes to predict movie receipts.

Let's get back to the basic premise. We hypothesize that the Facebook like count of the director and/or the top-billed actors and actresses is predictive of the movie's box office. But as you can see from Robin Williams's various entries, the Facebook like count for a given actor is the same on every row, so if this factor is deemed useful to the model, it will contribute equally to each of Robin's movies throughout his career. When this model is used to predict revenues for early-career movies, it is using information that is out of bounds: the model has in effect learned that Robin Williams would become a superstar in the future.
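One way to detect this kind of leak is to check whether a person's like count ever varies across films released in different years. The sketch below does that on toy rows. The field names `actor_1_name`, `actor_1_facebook_likes`, and `title_year` are assumed column spellings, `constant_across_years` is a hypothetical helper, and the actors and values are made up.

```python
from collections import defaultdict

# Toy rows; the actors, years, and like counts are made up for illustration.
movies = [
    {"actor_1_name": "Actor A", "title_year": 1990, "actor_1_facebook_likes": 5000},
    {"actor_1_name": "Actor A", "title_year": 2010, "actor_1_facebook_likes": 5000},
    {"actor_1_name": "Actor B", "title_year": 2012, "actor_1_facebook_likes": 300},
]


def constant_across_years(rows, name_key, likes_key, year_key):
    """Return people whose like count is identical on films from different years.

    Such a feature is a single snapshot attached to every film, so it carries
    later fame into the rows for a person's early-career films.
    """
    seen = defaultdict(lambda: {"likes": set(), "years": set()})
    for r in rows:
        seen[r[name_key]]["likes"].add(r[likes_key])
        seen[r[name_key]]["years"].add(r[year_key])
    return sorted(
        name
        for name, s in seen.items()
        if len(s["likes"]) == 1 and len(s["years"]) > 1
    )


leaky = constant_across_years(
    movies, "actor_1_name", "actor_1_facebook_likes", "title_year"
)
print(leaky)
```

People with only one film in the data are not flagged, because a single row cannot show whether the value would have changed over time.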

***

The lesson here is that proxies have lives of their own. A whole host of factors drive the value of the proxy metrics. Understanding those factors, and how they muddle the picture of your primary metric, is essential to constructing a meaningful model.

