Original poster: oliyiyi

Statistical Data Analysis in Python


This tutorial, presented as a set of IPython notebooks, introduces the use of Python for statistical data analysis with data stored as Pandas DataFrame objects.

By Christopher Fonnesbeck, Vanderbilt University School of Medicine.

Editor's note: This tutorial was originally published as course instructional material, and may contain out-of-context references to other courses therein; this takes nothing away from the validity or usefulness of the material.

Description


This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects. Much of the work involved in analyzing data lies in importing, cleaning and transforming it in preparation for analysis. The first half of the course therefore consists of a two-part overview of basic and intermediate Pandas usage that shows how to manipulate datasets in memory effectively. This includes tasks like indexing, alignment, join/merge methods, date/time types, and handling of missing data. Next, we will cover plotting and visualization using Pandas and Matplotlib, focusing on creating effective visual representations of your data while avoiding common pitfalls. Finally, participants will be introduced to methods for statistical data modeling using some of the advanced functions in NumPy, SciPy and Pandas. This will include fitting probability distributions to your data, estimating relationships among variables using linear and non-linear models, and a brief introduction to bootstrapping methods. Each section of the tutorial will involve hands-on manipulation and analysis of sample datasets, to be provided to attendees in advance.

The target audience for the tutorial includes all new Python users, though we recommend that users also attend the NumPy and IPython session in the introductory track.

Student Instructions


Students familiar with Git may simply clone this repository to obtain all the materials (IPython notebooks and data) for the tutorial. Alternatively, you may download a zip file containing the materials. A third option is to view static notebooks by clicking on the titles of each section below.

Outline


Introduction to Pandas

  • Importing data
  • Series and DataFrame objects
  • Indexing, data selection and subsetting
  • Hierarchical indexing
  • Reading and writing files
  • Sorting and ranking
  • Missing data
  • Data summarization
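
The topics in this first section can be sketched in a few lines. The following is only an illustrative sketch, not taken from the tutorial notebooks: the small patient table and all variable names are invented for the example, and it assumes a reasonably recent pandas.

```python
import io

import pandas as pd

# Hypothetical sample data, invented for illustration only.
csv_text = """patient,site,age,weight
1,A,34,70.5
2,A,51,
3,B,29,82.0
4,B,,64.3
"""

# Importing data: read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Series and DataFrame objects: each column of a DataFrame is a Series.
ages = df["age"]

# Indexing, selection and subsetting: boolean mask plus column labels via .loc.
site_a = df.loc[df["site"] == "A", ["patient", "age"]]

# Hierarchical indexing: a two-level (site, patient) index.
indexed = df.set_index(["site", "patient"]).sort_index()
site_b = indexed.loc["B"]

# Sorting and ranking (missing values sort last by default).
by_weight = df.sort_values("weight", ascending=False)
age_rank = df["age"].rank()

# Missing data: count, drop, or fill NaN values.
n_missing = df.isna().sum()
complete = df.dropna()
filled = df.fillna({"age": df["age"].mean()})

# Data summarization.
summary = df[["age", "weight"]].describe()
mean_by_site = df.groupby("site")["weight"].mean()

# Writing files: to_csv returns a string when no path is given.
csv_out = df.to_csv(index=False)
```

Reading from an in-memory string keeps the sketch self-contained; with the tutorial's datasets one would pass a file path to `read_csv` instead.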

Data Wrangling with Pandas

  • Date/time types
  • Merging and joining DataFrame objects
  • Concatenation
  • Reshaping DataFrame objects
  • Pivoting
  • Data transformation
  • Permutation and sampling
  • Data aggregation and GroupBy operations
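
As a rough illustration of the wrangling operations listed above, here is a minimal sketch. The `visits` and `patients` tables, their columns, and the seed are assumptions made up for the example, not the tutorial's data.

```python
import numpy as np
import pandas as pd

# Hypothetical visit and patient tables, invented for illustration only.
visits = pd.DataFrame({
    "patient": [1, 1, 2, 2, 3],
    "date": ["2013-01-05", "2013-02-05", "2013-01-12", "2013-02-12", "2013-01-20"],
    "score": [10.0, 12.0, 8.0, 9.0, 15.0],
})
patients = pd.DataFrame({
    "patient": [1, 2, 3],
    "group": ["treated", "control", "treated"],
})

# Date/time types: parse strings into datetimes, then extract the month.
visits["date"] = pd.to_datetime(visits["date"])
visits["month"] = visits["date"].dt.month

# Merging and joining: attach each patient's group to every visit.
merged = visits.merge(patients, on="patient", how="left")

# Concatenation: append one more row with the same columns.
extra = pd.DataFrame({
    "patient": [3],
    "date": [pd.Timestamp("2013-02-20")],
    "score": [16.0],
    "month": [2],
    "group": ["treated"],
})
combined = pd.concat([merged, extra], ignore_index=True)

# Pivoting: long format to wide (one row per patient, one column per month).
wide = combined.pivot(index="patient", columns="month", values="score")

# Reshaping: wide format back to long.
long = wide.reset_index().melt(id_vars="patient", var_name="month", value_name="score")

# Data transformation: standardize the scores.
combined["score_z"] = (combined["score"] - combined["score"].mean()) / combined["score"].std()

# Permutation and sampling, with a fixed seed for reproducibility.
rng = np.random.default_rng(42)
shuffled = combined.iloc[rng.permutation(len(combined))]
sampled = combined.sample(n=3, random_state=42)

# Data aggregation and GroupBy: mean score per treatment group.
group_means = combined.groupby("group")["score"].mean()
```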

Plotting and Visualization

  • Plotting in Pandas vs Matplotlib
  • Bar plots
  • Histograms
  • Box plots
  • Grouped plots
  • Scatterplots
  • Trellis plots
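
The plot types listed above might look like the following in code. This is a sketch under assumptions: the data are simulated, the `Agg` backend is chosen only so the example runs without a display, and the "trellis" figure is approximated with shared-axis subplots rather than any particular trellis API.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated data, invented for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 50),
    "x": rng.normal(size=150),
})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.5, size=150)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Bar plot through the pandas plotting wrapper.
df.groupby("group")["y"].mean().plot(kind="bar", ax=axes[0, 0], title="Mean y by group")

# Histogram through pandas.
df["x"].plot(kind="hist", bins=20, ax=axes[0, 1], title="Histogram of x")

# Box plot of y, grouped by the 'group' column.
df.boxplot(column="y", by="group", ax=axes[1, 0])

# Scatterplot drawn with Matplotlib directly.
axes[1, 1].scatter(df["x"], df["y"], s=10, alpha=0.6)
axes[1, 1].set_xlabel("x")
axes[1, 1].set_ylabel("y")
fig.tight_layout()

# Trellis-style plot: one panel per group with shared axes.
trellis_fig, trellis_axes = plt.subplots(1, 3, sharex=True, sharey=True, figsize=(9, 3))
for ax, (name, sub) in zip(trellis_axes, df.groupby("group")):
    ax.scatter(sub["x"], sub["y"], s=10)
    ax.set_title("group " + name)

# Render the first figure to PNG bytes in memory instead of writing a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```

The pandas `.plot` methods are thin wrappers that draw onto Matplotlib axes, which is why both styles can share one figure here.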

Statistical Data Modeling

  • Statistical modeling
  • Fitting data to probability distributions
  • Fitting regression models
  • Model selection
  • Bootstrapping
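
The modeling topics above can be illustrated with a short sketch. It is not the tutorial's own code: the simulated data, the choice of a gamma distribution, the AIC formula for Gaussian errors (up to an additive constant), and the helper names `aic` and `bootstrap_ci` are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Fitting data to a probability distribution: maximum-likelihood gamma fit,
# with the location parameter held fixed at zero.
sample = rng.gamma(shape=2.0, scale=3.0, size=1000)
shape_hat, loc_hat, scale_hat = stats.gamma.fit(sample, floc=0)

# Fitting a regression model: ordinary least-squares line.
x = np.linspace(0, 10, 200)
y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=x.size)
slope, intercept = np.polyfit(x, y, deg=1)


def aic(x, y, deg):
    """AIC (up to a constant) of a degree-`deg` polynomial fit with Gaussian errors."""
    coefs = np.polyfit(x, y, deg)
    resid = y - np.polyval(coefs, x)
    n = y.size
    k = deg + 2  # polynomial coefficients plus the error variance
    return n * np.log(np.sum(resid ** 2) / n) + 2 * k


# Model selection: compare candidate polynomial degrees; lower AIC is preferred.
aics = {deg: aic(x, y, deg) for deg in (1, 2, 3)}


def bootstrap_ci(data, n_boot=1000, alpha=0.05, rng=rng):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    n = data.size
    # Resample with replacement: each row of idx is one bootstrap sample.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = data[idx].mean(axis=1)
    return np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])


# Bootstrapping: interval for the mean of the gamma sample.
ci_low, ci_high = bootstrap_ci(sample)
```

The percentile interval is the simplest bootstrap interval; other variants exist and may behave better for skewed statistics.
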

Required Packages

  • Python 2.7 or higher (including Python 3)
  • pandas >= 0.11.1 and its dependencies
  • NumPy >= 1.6.1
  • matplotlib >= 1.0.0
  • pytz
  • IPython >= 0.12
  • pyzmq
  • tornado

Optional: statsmodels, xlrd and openpyxl

For students running the latest version of Mac OS X (10.8), the easiest way to obtain all the packages is to install the Scipy Superpack, which works with the Python 2.7.2 that ships with OS X.

Otherwise, another easy way to install all the necessary packages is to use Continuum Analytics' Anaconda.
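
Once the packages are installed, one way to check them against the minimum versions listed above is a small script like the sketch below. The helper names `version_tuple` and `check` are hypothetical, the version comparison is deliberately simplistic (it only looks at leading numeric components), and `zmq` is the import name of pyzmq.

```python
import importlib

# Minimum versions from the Required Packages list; None means no minimum is stated.
REQUIRED = {
    "pandas": "0.11.1",
    "numpy": "1.6.1",
    "matplotlib": "1.0.0",
    "pytz": None,
    "IPython": "0.12",
    "zmq": None,  # pyzmq is imported as 'zmq'
    "tornado": None,
}


def version_tuple(version):
    """Leading numeric components of a version string, e.g. '1.6.1' -> (1, 6, 1)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check(required=REQUIRED):
    """Return a dict mapping each module name to a short status string."""
    report = {}
    for name, minimum in required.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = "missing"
            continue
        found = getattr(module, "__version__", None)
        if minimum is None or found is None:
            report[name] = "installed"
        elif version_tuple(found) >= version_tuple(minimum):
            report[name] = "ok (%s)" % found
        else:
            report[name] = "too old (%s < %s)" % (found, minimum)
    return report


if __name__ == "__main__":
    for name, status in check().items():
        print("%-12s %s" % (name, status))
```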

Statistical Reading List


The Ecological Detective: Confronting Models with Data, Ray Hilborn and Marc Mangel

Though targeted to ecologists, Mangel and Hilborn identify key methods that scientists can use to build useful and credible models for their data. They don't shy away from the math, but the book is very readable and example-laden.

Data Analysis Using Regression and Multilevel/Hierarchical Models, Andrew Gelman and Jennifer Hill

The go-to reference for applied hierarchical modeling.

The Elements of Statistical Learning, Hastie, Tibshirani and Friedman

A comprehensive machine learning guide for statisticians.

A First Course in Bayesian Statistical Methods, Peter Hoff

An excellent, approachable book to get started with Bayesian methods.

Regression Modeling Strategies, Frank Harrell

Frank Harrell's bag of tricks for regression modeling. I pull this off the shelf every week.

Bio: Christopher Fonnesbeck is an Assistant Professor in the Department of Biostatistics at the Vanderbilt University School of Medicine. He specializes in computational statistics, Bayesian methods, meta-analysis, and applied decision analysis. He originally hails from Vancouver, BC and received his Ph.D. from the University of Georgia.



Reply #1
h2h2, posted 2016-7-20 20:03:36
Thanks for sharing.
