楼主: oliyiyi
1615 7

Introduction to Correlation [推广有奖]

版主

已卖:2997份资源

泰斗

1%

还不是VIP/贵宾

-

TA的文库  其他...

计量文库

威望
7
论坛币
27300 个
通用积分
31674.3271
学术水平
1454 点
热心指数
1573 点
信用等级
1364 点
经验
384134 点
帖子
9629
精华
66
在线时间
5508 小时
注册时间
2007-5-21
最后登录
2025-7-8

初级学术勋章 初级热心勋章 初级信用勋章 中级信用勋章 中级学术勋章 中级热心勋章 高级热心勋章 高级学术勋章 高级信用勋章 特级热心勋章 特级学术勋章 特级信用勋章

楼主
oliyiyi 发表于 2017-2-24 16:06:23 |AI写论文

+2 论坛币
k人 参与回答

经管之家送您一份

应届毕业生专属福利!

求职就业群
赵安豆老师微信:zhaoandou666

经管之家联合CDA

送您一个全额奖学金名额~ !

感谢您参与论坛问题回答

经管之家送您两个论坛币!

+2 论坛币

本帖隐藏的内容

By DataScience.comSponsored Post.

Introduction: What Is Correlation and Why Is It Useful?

Correlation is one of the most widely used — and widely misunderstood — statistical concepts. In this overview, we provide the definitions and intuition behind several types of correlation and illustrate how to calculate correlation using the Python pandas library.

The term "correlation" refers to a mutual relationship or association between quantities. In almost any business, it is useful to express one quantity in terms of its relationship with others. For example, sales might increase when the marketing department spends more on TV advertisements, or a customer's average purchase amount on an e-commerce website might depend on a number of factors related to that customer. Often, correlation is the first step to understanding these relationships and subsequently building better business and statistical models.

So, why is correlation a useful metric?

  • Correlation can help in predicting one quantity from another
  • Correlation can (but often does not, as we will see in some examples below) indicate the presence of a causal relationship
  • Correlation is used as a basic quantity and foundation for many other modeling techniques

More formally, correlation is a statistical measure that describes the association between random variables. There are several methods for calculating the correlation coefficient, each measuring different types of strength of association. Below we summarize three of the most widely used methods.

Types of Correlation

Before we go into the details of how correlation is calculated, it is important to introduce the concept of covariance. Covariance is a statistical measure of association between two variables X and Y. First, each variable is centered by subtracting its mean. These centered scores are multiplied together to measure whether the increase in one variable associates with the increase in another. Finally, expected value (E) of the product of these centered scores is calculated as a summary of association. Intuitively, the product of centered scores can be thought of as the area of a rectangle with each point's distance from the mean describing a side of the rectangle:

If both variables tend to move in the same direction, we expect the "average" rectangle connecting each point (X_i, Y_i) to the means (X_bar, Y_bar) to have a large and positive diagonal vector, corresponding to a larger positive product in the equation above. If both variables tend to move in opposite directions, we expect the average rectangle to have a diagonal vector that is large and negative, corresponding to a larger negative product in the equation above. If the variables are unrelated, then the vectors should, on average, cancel out — and the total diagonal vector should have a magnitude near 0, corresponding to a product near 0 in the equation above.

If you are wondering what "expected value" is, it is another way of saying the average, or mean μ, of a random variable. It is also referred to as "expectation." In other words, we can write the following equation to express the same quantity in a different way:

The problem with covariance is that it keeps the scale of the variables X and Y, and therefore can take on any value. This makes interpretation difficult and comparing covariances to each other impossible. For example, Cov(X, Y)  = 5.2 and Cov(Z, Q) = 3.1 tell us that these pairs are positively associated, but it is difficult to tell whether the relationship between X and Y is stronger than Z and Q without looking at the means and distributions of these variables. This is where correlation becomes useful — by standardizing covariance by some measure of variability in the data, it produces a quantity that has intuitive interpretations and consistent scale.

Pearson Correlation Coefficient

Pearson is the most widely used correlation coefficient. Pearson correlation measures the linear association between continuous variables. In other words, this coefficient quantifies the degree to which a relationship between two variables can be described by a line. Remarkably, while correlation can have many interpretations, the same formula developed by Karl Pearson over 120 years ago is still the most widely used today.

In this section, we will introduce several popular formulations and intuitive interpretations for Pearson correlation (referred to as ρ).

The original formula for correlation, developed by Pearson himself, uses raw data and the means of two variables, X and Y:

In this formulation, raw observations are centered by subtracting their means and re-scaled by a measure of standard deviations.

A different way to express the same quantity is in terms of expected values, means μX, μY, and standard deviations σX, σY:

Notice that the numerator of this fraction is identical to the above definition of covariance, since mean and expectation can be used interchangeably. Dividing the covariance between two variables by the product of standard deviations ensures that correlation will always fall between -1 and 1. This makes interpreting the correlation coefficient much easier.

The figure below shows three examples of Pearson correlation. The closer ρ is to 1, the more an increase in one variable associates with an increase in the other. On the other hand, the closer ρ is to -1, the increase in one variable would result in decrease in the other. Note that if X and Y are independent, then ρ is close to 0, but not vice versa! In other words, Pearson correlation can be small even if there is a strong relationship between two variables. We will see shortly how this can be the case.

So, how can we interpret Pearson correlation?

Read the rest on DataScience.com : Introduction to Correlation



二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝


缺少币币的网友请访问有奖回帖集合
https://bbs.pinggu.org/thread-3990750-1-1.html

沙发
h2h2 发表于 2017-2-25 02:28:18
谢谢分享

藤椅
MouJack007 发表于 2017-2-25 05:11:24
谢谢楼主分享!

板凳
MouJack007 发表于 2017-2-25 05:11:44

报纸
jcyang 发表于 2017-2-25 08:48:15
看看,谢谢

地板
nieqiang110 学生认证  发表于 2017-2-25 10:58:45
Introduction to Correlation

7
franky_sas 发表于 2017-2-25 15:57:48

8
kkkm_db 发表于 2017-2-25 17:44:39
谢谢分享

您需要登录后才可以回帖 登录 | 我要注册

本版微信群
加好友,备注jltj
拉您入交流群
GMT+8, 2026-1-27 13:11