楼主: oliyiyi
1322 3

Doing Statistics with SQL [推广有奖]

版主

泰斗

0%

还不是VIP/贵宾

-

TA的文库  其他...

计量文库

威望
7
论坛币
271951 个
通用积分
31269.3519
学术水平
1435 点
热心指数
1554 点
信用等级
1345 点
经验
383775 点
帖子
9598
精华
66
在线时间
5468 小时
注册时间
2007-5-21
最后登录
2024-4-18

初级学术勋章 初级热心勋章 初级信用勋章 中级信用勋章 中级学术勋章 中级热心勋章 高级热心勋章 高级学术勋章 高级信用勋章 特级热心勋章 特级学术勋章 特级信用勋章

+2 论坛币
k人 参与回答

经管之家送您一份

应届毕业生专属福利!

求职就业群
赵安豆老师微信:zhaoandou666

经管之家联合CDA

送您一个全额奖学金名额~ !

感谢您参与论坛问题回答

经管之家送您两个论坛币!

+2 论坛币

By Jean-Nicholas Hould, JeanNicholasHould.com

I recently wrote an article on why you should learn SQL for data analysis. I received lots of feedback from this post. People wanted to see some concrete examples of using SQL for data analysis. I decided to apply some of my own advice and load up one of these awesome datasets to do some basic data exploration queries on it.

For this post, I am using an open data set. The million songs data sets is a «freely-available collection of audio features and metadata for a million contemporary popular music tracks».

If you want to follow along, here are the steps:

Why Descriptive Statistics?


When I start the analysis of a new data set, I run some basic queries to get a sense of how the data is organized, how the values are distributed. I am trying to understand what I’m dealing with. For numerical data, I will often run some descriptive statistics on the data set: I will measure the central tendency (mean, median, mode) and measure the level of variability of the data. Those measurements are generally a good start to data exploration. They will often lead to new questions that fuel my analysis.

Central Tendency


Mean

The mean is the number you obtain when you sum up a given set of numbers and then divide this sum by the total number in the set. The mean is very sensible to outliers. It can be drastically affected by values that are much higher or lower compared to the rest of the data set.

SELECT CAST(AVG(songs.year) as int) as avg_year FROM songs-- | avg_year |-- |----------|-- | 934      |

  • CAST: Run-time data type conversion between compatible data types. In this case, I’m converting a float to an integer for rounding purpose.
  • AVG: Aggregation function that returns the mean of the input expression value.
  • as avg_year: Temporarily renames a column heading - This is only for readbility purpose, to make the code more human-friendly. I will use this aliasing throughout the post.

Median

The median is the number separating the higher half of a ordered data set from the lower half. The median is sometimes a better measure of a mid point because each data point is weighted equally.

SELECT songs.year as median_yearFROM songs ORDER BY songs.year LIMIT 1 OFFSET (SELECT COUNT(*) FROM songs) / 2-- | median_year |-- |-------------|-- | 0           |

  • ORDER BY: Sorts the resulting data set by one or more column. The ordering can be ascending ASC (default) or descending DESC.
  • COUNT: Aggregation function that returns the # of rows that matches a criteria.
  • LIMIT: Specifies the maximum number of rows that can be returned from the resulting data set.
  • OFFSET: Skip X rows before beginning to return rows. In this specific example, we are skipping 5000 rows (which is equal to the total row countCOUNT(*) divided by 2).

Mode

The mode is the value that appears most often in a set of data.

SELECT     songs.year,    COUNT(*) as countFROM songsGROUP BY songs.yearORDER BY COUNT(*) DESCLIMIT 1-- | year | count |-- |------|-------|-- | 0    | 5320  |

  • GROUP BY: Used with aggregation functions such as COUNT, AVG, etc. Groups the resulting data set by one or more column.
Variability


Min/Max Values

Minimum/Maximum value in a data set.

SELECT     MIN(songs.year) as min_year,    MAX(songs.year) as max_yearFROM    songs-- | min_year | max_year |-- |----------|----------|-- | 0        | 2010     |

  • MIN: Aggregation function that returns the smallest value in a data set.
  • MAX: Aggregation function that returns the largest value in a data set.

Distribution of songs per year

Count of songs released in each years

SELECT     songs.year,    COUNT(*) songs_countFROM songsGROUP BY songs.yearORDER BY songs.year ASC-- | year | song_count |-- |------|------------|-- | 0    | 5320       |-- | 1926 | 2          |-- | 1927 | 3          |-- | ...  | ...        |-- | 2009 | 250        |-- | 2010 | 64         |

Next Steps


The SQL queries in this post are fairly simple. They are not the result of technical gymmstastics. They are just simple measurements helping us understand the data set and that’s what’s great about them.

In this specific data set, we noticed that for more than half of the data set, the year of the song is equal to 0. This means we are either looking at a data set of very old songs or that we are dealing with missing values. The latter is more realistic. If we filter out songs from year 0, our data makes more sense. The songs are ranging from 1926 to 2010 and the median is the year 2001.

With the data partially cleaned up, we can start exploring other columns of our data set and asking ourselves more questions: How many unique artists composed songs per year? How has that evolved through time? Are songs shorter nowadays than before? What’s great about those simple measurements is that they can be reused as our queries and filters get more and more complex. By applying the simple descriptive statistics measurements, we can have a good grasps of the data and we can keep exploring deeper and deeper.


二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

关键词:Statistics statistic Statist Statis Doing collection recently concrete examples features

缺少币币的网友请访问有奖回帖集合
https://bbs.pinggu.org/thread-3990750-1-1.html
沙发
hjtoh 发表于 2016-8-13 21:23:40 来自手机 |只看作者 |坛友微信交流群
oliyiyi 发表于 2016-8-13 20:46
By Jean-Nicholas Hould, JeanNicholasHould.comI recently wrote an article on why you should learn SQL ...
谢谢分享

使用道具

藤椅
h2h2 发表于 2016-8-14 13:57:32 |只看作者 |坛友微信交流群
谢谢分享

使用道具

板凳
derekhsuan 发表于 2016-12-11 03:18:57 |只看作者 |坛友微信交流群
Thanks a lot.

使用道具

您需要登录后才可以回帖 登录 | 我要注册

本版微信群
加好友,备注jltj
拉您入交流群

京ICP备16021002-2号 京B2-20170662号 京公网安备 11010802022788号 论坛法律顾问:王进律师 知识产权保护声明   免责及隐私声明

GMT+8, 2024-4-25 01:20