Doing Statistics with SQL - LATEX论坛 - 经管之家(原人大经济论坛)

1关注
184
粉丝

版主

泰斗

0%

还不是VIP/贵宾

-

TA的文库 其他...

计量文库

0%

威望: 7 级
论坛币: 271951 个
通用积分: 31269.3519
学术水平: 1435 点
热心指数: 1554 点
信用等级: 1345 点
经验: 383775 点
帖子: 9598
精华: 66
在线时间: 5468 小时
注册时间: 2007-5-21
最后登录: 2024-4-18

楼主

oliyiyi 发表于 2016-8-13 20:46:29 |只看作者 |坛友微信交流群|倒序 |AI写论文

是否 +2 论坛币

k人参与回答

经管之家送您一份

应届毕业生专属福利!

求职就业群

赵安豆老师微信：zhaoandou666

经管之家联合CDA

送您一个全额奖学金名额~ !

立即领取

感谢您参与论坛问题回答

经管之家送您两个论坛币！

+2 论坛币

By Jean-Nicholas Hould, JeanNicholasHould.com

I recently wrote an article on why you should learn SQL for data analysis. I received lots of feedback from this post. People wanted to see some concrete examples of using SQL for data analysis. I decided to apply some of my own advice and load up one of these awesome datasets to do some basic data exploration queries on it.

For this post, I am using an open data set. The million songs data sets is a «freely-available collection of audio features and metadata for a million contemporary popular music tracks».

If you want to follow along, here are the steps:

Download the Million Song Data Set. I used the 10K songs subset.
Download a SQLite Client
Open subset_track_metadata.db file with SQLite and start exploring

Why Descriptive Statistics?

When I start the analysis of a new data set, I run some basic queries to get a sense of how the data is organized, how the values are distributed. I am trying to understand what I’m dealing with. For numerical data, I will often run some descriptive statistics on the data set: I will measure the central tendency (mean, median, mode) and measure the level of variability of the data. Those measurements are generally a good start to data exploration. They will often lead to new questions that fuel my analysis.

Central Tendency

Mean

The mean is the number you obtain when you sum up a given set of numbers and then divide this sum by the total number in the set. The mean is very sensible to outliers. It can be drastically affected by values that are much higher or lower compared to the rest of the data set.

SELECT CAST(AVG(songs.year) as int) as avg_year FROM songs-- | avg_year |-- |----------|-- | 934 |

CAST: Run-time data type conversion between compatible data types. In this case, I’m converting a float to an integer for rounding purpose.
AVG: Aggregation function that returns the mean of the input expression value.
as avg_year: Temporarily renames a column heading - This is only for readbility purpose, to make the code more human-friendly. I will use this aliasing throughout the post.

Median

The median is the number separating the higher half of a ordered data set from the lower half. The median is sometimes a better measure of a mid point because each data point is weighted equally.

SELECT songs.year as median_yearFROM songs ORDER BY songs.year LIMIT 1 OFFSET (SELECT COUNT(*) FROM songs) / 2-- | median_year |-- |-------------|-- | 0 |

ORDER BY: Sorts the resulting data set by one or more column. The ordering can be ascending ASC (default) or descending DESC.
COUNT: Aggregation function that returns the # of rows that matches a criteria.
LIMIT: Specifies the maximum number of rows that can be returned from the resulting data set.
OFFSET: Skip X rows before beginning to return rows. In this specific example, we are skipping 5000 rows (which is equal to the total row countCOUNT(*) divided by 2).

Mode

The mode is the value that appears most often in a set of data.

SELECT songs.year, COUNT(*) as countFROM songsGROUP BY songs.yearORDER BY COUNT(*) DESCLIMIT 1-- | year | count |-- |------|-------|-- | 0 | 5320 |

GROUP BY: Used with aggregation functions such as COUNT, AVG, etc. Groups the resulting data set by one or more column.

Variability

Min/Max Values

Minimum/Maximum value in a data set.

SELECT MIN(songs.year) as min_year, MAX(songs.year) as max_yearFROM songs-- | min_year | max_year |-- |----------|----------|-- | 0 | 2010 |

MIN: Aggregation function that returns the smallest value in a data set.
MAX: Aggregation function that returns the largest value in a data set.

Distribution of songs per year

Count of songs released in each years

SELECT songs.year, COUNT(*) songs_countFROM songsGROUP BY songs.yearORDER BY songs.year ASC-- | year | song_count |-- |------|------------|-- | 0 | 5320 |-- | 1926 | 2 |-- | 1927 | 3 |-- | ... | ... |-- | 2009 | 250 |-- | 2010 | 64 |

Next Steps

The SQL queries in this post are fairly simple. They are not the result of technical gymmstastics. They are just simple measurements helping us understand the data set and that’s what’s great about them.

In this specific data set, we noticed that for more than half of the data set, the year of the song is equal to 0. This means we are either looking at a data set of very old songs or that we are dealing with missing values. The latter is more realistic. If we filter out songs from year 0, our data makes more sense. The songs are ranging from 1926 to 2010 and the median is the year 2001.

With the data partially cleaned up, we can start exploring other columns of our data set and asking ourselves more questions: How many unique artists composed songs per year? How has that evolved through time? Are songs shorter nowadays than before? What’s great about those simple measurements is that they can be reused as our queries and filters get more and more complex. By applying the simple descriptive statistics measurements, we can have a good grasps of the data and we can keep exploring deeper and deeper.

扫码加我拉你入群

请注明：姓名-公司-职位

以便审核进群资格，未注明则拒绝

分享0 收藏1 回帖

关键词：Statistics statistic Statist Statis Doing collection recently concrete examples features

Doing Statistics with SQL [推广有奖]

经管之家送您一份

经管之家联合CDA

感谢您参与论坛问题回答

扫码加我拉你入群

相关帖子

初级学术勋章

初级热心勋章

初级信用勋章

中级信用勋章

中级学术勋章

中级热心勋章

高级热心勋章

高级学术勋章

高级信用勋章

特级热心勋章

特级学术勋章

特级信用勋章

本版微信群

Doing Statistics with SQL [推广有奖]

经管之家送您一份

经管之家联合CDA

感谢您参与论坛问题回答

扫码加我 拉你入群

相关帖子

初级学术勋章

初级热心勋章

初级信用勋章

中级信用勋章

中级学术勋章

中级热心勋章

高级热心勋章

高级学术勋章

高级信用勋章

特级热心勋章

特级学术勋章

特级信用勋章

本版微信群

扫码加我拉你入群