Original poster: oliyiyi

Deep thinking about your data

oliyiyi posted on 2017-2-4 15:05:01

In the ongoing series of posts about the IMDB dataset from Kaggle, I have so far looked at several of the scraped variables, including the number of faces on movie posters (1, 2), plot keywords (3), and movie rating by title year (4).

In this post, I tackle the variables resulting from a data merge between IMDB and Facebook. These columns have names like "Director Facebook Likes", "Actor 1 Facebook Likes", etc. I didn't investigate exactly how the merge was done. Presumably, the names of the actors and directors were matched to their official fan pages on Facebook, and the Like counts were taken from those pages. (I suppose identifying the right pages is not a trivial task.)

There is clearly a "theory" behind computing these variables: star power. We should expect movies directed by famous directors, or starring famous actors or actresses, to be correlated with bigger box office receipts. The number of Facebook likes is a proxy measure for the concept of "star power", "fame", or "popularity".
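A first sanity check of the star-power hypothesis is to measure how strongly like counts and box office receipts move together. The sketch below is only an illustration: it computes a Spearman rank correlation by hand (ranks are used here because like counts tend to be heavily skewed), and the two short lists are hypothetical toy values, not figures from the dataset.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy values, for illustration only.
toy_likes = [0, 0, 500, 13000, 22000]
toy_gross = [90.0, 15.0, 40.0, 60.0, 120.0]
rho = spearman(toy_likes, toy_gross)
```

A weak or unstable correlation would not by itself disprove the theory, but it would be a hint that the proxy is not measuring what it is supposed to measure.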

Proxies are necessary when dealing with quantities that are hard or impossible to measure directly. But be careful to choose proxies that describe the underlying quantities properly.

Using Facebook likes as a proxy for star power has a host of problems. For example:

Not all actors and directors use social media, or favor social media. If they do, Facebook may not be their chosen platform. In the extreme case, some countries like China block Facebook, so a Chinese actor or actress would obviously not be investing in a Facebook presence.
The Facebook like count is a snapshot. What is captured is the count on the day on which the data was compiled. What you want is the Facebook like count in the days or weeks prior to the release of the respective movie!
Not only the protagonists but also the fans have preferences for social media platforms. Actors and directors with older followers will have different Facebook statistics than those followed by younger generations.
All famous directors or actors have a breakthrough movie. They were nobodies before this movie, and became stars after this movie. The Facebook like count is a summary of the entire career of each person, so it cannot tell a pre-breakthrough movie from a post-breakthrough one.
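To make the snapshot problem concrete, here is a minimal sketch of the feature one would actually want: the like count as of the movie's release date. It assumes a dated history of like counts per person, which the Kaggle dataset does not provide; the function name `likes_as_of` and the history values are hypothetical.

```python
from bisect import bisect_left
from datetime import date


def likes_as_of(history, release_date):
    """Return the latest like count observed strictly before release_date.

    history is a list of (observation_date, like_count) pairs sorted by date.
    Returns None when no observation precedes the release.
    """
    dates = [observed for observed, _ in history]
    idx = bisect_left(dates, release_date)  # first observation on/after release
    if idx == 0:
        return None
    return history[idx - 1][1]


# Hypothetical like-count history for one person, for illustration only.
history = [
    (date(2010, 1, 1), 100),
    (date(2012, 1, 1), 5000),
    (date(2016, 1, 1), 20000),
]

early = likes_as_of(history, date(2011, 6, 1))  # before the jump in popularity
late = likes_as_of(history, date(2017, 6, 1))   # after it
```

With a single snapshot, both movies would be assigned the most recent count, which is exactly the distortion described above.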

A quick glimpse of the data should give the analyst pause.

[Image not preserved: snippet of directors with their Facebook like counts]

Christopher Nolan, with 22,000 likes, dwarfs everybody else in this snippet. But James Cameron with zero? Bryan Singer, zero? Sam Mendes, zero? (This could be a data merge error, or it could be a structural problem.)
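Zeros like these are cheap to flag before any modeling. The sketch below uses only the four counts quoted above; the dictionary layout and the helper names are assumptions made for illustration.

```python
# Like counts as quoted in the post for four directors.
director_likes = {
    "Christopher Nolan": 22000,
    "James Cameron": 0,
    "Bryan Singer": 0,
    "Sam Mendes": 0,
}


def zero_like_names(likes_by_name):
    """Names whose recorded like count is exactly zero, in alphabetical order."""
    return sorted(name for name, likes in likes_by_name.items() if likes == 0)


def zero_share(likes_by_name):
    """Fraction of entries with a zero like count (0.0 for an empty mapping)."""
    if not likes_by_name:
        return 0.0
    zeros = sum(1 for likes in likes_by_name.values() if likes == 0)
    return zeros / len(likes_by_name)


suspects = zero_like_names(director_likes)
share = zero_share(director_likes)
```

A zero attached to a well-known name is more plausibly a missing or failed match than a true count, so it is worth deciding how to treat these entries rather than feeding them to a model as real zeros.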

The actors' counts are not much more telling, as this list of the top actors shows:

[Image not preserved: list of the top actors by Facebook like count]

Darcy Donavan is 13 times more "famous" than Robin Williams (RIP) according to this metric. Darcy is primarily a TV actress, so that's yet another issue with using Facebook likes to predict movie receipts.

Let's get back to the basic premise. We hypothesize that the Facebook like count of the director and/or top-billed actors and actresses is predictive of the movie's box office. But as you can see from Robin Williams's various entries, the Facebook like count for a given actor is invariant across movies. If this factor is deemed useful to the model, it will contribute equally to each of Robin's movies throughout his career. When this model is used to predict revenues for early-career movies, it is using information that is out of bounds: the model has in effect learned that Robin Williams would become a superstar in the future.
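One way to spot this kind of leakage is to check whether a feature ever changes across a person's movies. The following is a rough sketch under stated assumptions: the column names (`actor_1_name`, `title_year`, `actor_1_facebook_likes`) are assumed spellings of the dataset's columns, and the rows are hypothetical placeholders rather than real entries.

```python
from collections import defaultdict


def constant_across_years(rows, name_col, year_col, feature_col):
    """Names that appear in more than one year yet have a single feature value.

    Such a feature cannot reflect how the person's fame evolved over time,
    so using it for early movies leaks information from the future.
    """
    values = defaultdict(set)
    years = defaultdict(set)
    for row in rows:
        values[row[name_col]].add(row[feature_col])
        years[row[name_col]].add(row[year_col])
    return sorted(
        name for name in values
        if len(years[name]) > 1 and len(values[name]) == 1
    )


# Hypothetical rows, for illustration only.
rows = [
    {"actor_1_name": "Actor A", "title_year": 1990, "actor_1_facebook_likes": 5000},
    {"actor_1_name": "Actor A", "title_year": 2010, "actor_1_facebook_likes": 5000},
    {"actor_1_name": "Actor B", "title_year": 2000, "actor_1_facebook_likes": 100},
    {"actor_1_name": "Actor B", "title_year": 2015, "actor_1_facebook_likes": 900},
]

leaky = constant_across_years(
    rows, "actor_1_name", "title_year", "actor_1_facebook_likes"
)
```

If every multi-year name comes back as constant, as the Robin Williams entries suggest, the column is a single snapshot copied onto each movie and should be treated with the caution described above.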

***

The lesson here is that proxies have lives of their own. A whole host of factors drives the value of a proxy metric. Understanding those factors and how they muddle the picture of your primary metric is essential to constructing a meaningful model.