OP: ReneeBK

Instant Data Intensive Apps with pandas How-to


Trent Hauck
May 2013
Manipulate, visualize, and analyze your data with pandas
Book Description
Publication Date: May 23, 2013
In Detail
pandas alleviates a genuinely complex situation among data analytics libraries: many incumbent languages aren't approachable, or are fairly unproductive for general computing tasks, in comparison to Python. With pandas, it's easy to begin working with tabular datasets in a language that's easier to learn and use.
Instant Data Intensive Apps with Pandas How-to starts with pandas' functionalities such as joining datasets, cleaning data, and other data munging tasks. It quickly moves on to building a data reporting tool, which consists of analysis in pandas to determine what's relevant and then presents that relevant data in an easy-to-consume manner.
Instant Data Intensive Apps with Pandas How-to starts with data manipulation and other practical tasks for a fundamental understanding, and through successive recipes you will gain a deeper, more productive understanding of pandas.
Throughout this book the recipes are presented in a structured way. It starts with data transformation techniques, then builds up to more complex examples such as performing statistical analysis and integrating pandas objects with web applications. Other recipes cover visualization and machine learning, among other things.
Instant Data Intensive Apps with Pandas How-to will get the reader up and running quickly with pandas and put the user in a position to move up the learning curve faster.
Approach
Filled with practical, step-by-step instructions and clear explanations for the most important and useful tasks, this book takes a hands-on, recipe-based approach to help readers get to grips with pandas.
Who this book is for
Users of other data analysis tools will find value in seeing tasks they commonly encounter translated to pandas, and users of Python will find an introduction to a very impressive tool in a syntax they already know. In terms of general skills, it is assumed that the reader understands basic data structures such as arrays or lists, and dictionaries or hash maps, and has some familiarity with working at the command line. Installing pandas is not covered, but the online documentation is straightforward. Readers are also encouraged to use IPython to interact and experiment with the code.

Product Details
  • File Size: 594 KB
  • Print Length: 50 pages
  • Publisher: Packt Publishing (May 23, 2013)
  • Sold by: Amazon Digital Services, Inc.
  • Language: English
  • ASIN: B00CZ6Y0QW
  • Text-to-Speech: Enabled


Best answer

tigerwolf (see the full reply below)

**** This content is hidden by the author ****

#2
tigerwolf · posted 2015-1-5 02:28:40


Hidden content in this post:

Instant Data Intensive Apps with Pandas How-to (PacktPub 2013) Trent Hauck.rar (924.45 KB, requires 20 forum coins). This attachment includes:
  • Instant Data Intensive Apps with Pandas How-to (PacktPub 2013) Trent Hauck.pdf




#3
ReneeBK · posted 2015-1-5 02:30:53
Working with files (Simple)

In this recipe we'll introduce the pandas DataFrame by doing some quick exercises, then move on to one of the most fundamental parts of data analysis: getting data in and out of files.

Getting ready

Most of the rest of the book works with data once it's in a pandas data structure, but this recipe is about those structures themselves and getting data in and out of them. Open your interpreter, preferably IPython.


How to do it...
  • Create an incredibly simple DataFrame to start with. A DataFrame can handle lists, NumPy arrays, dicts of strings, and more.

    > import pandas as pd
    # standard convention throughout the book
    > import numpy as np
    > my_df = pd.DataFrame([1, 2, 3])
    > my_df

  • The first example is too simple, and isn't useful. Add some column headers and an index for more information about the DataFrame.

    > cols = ['A', 'B']
    > idx = pd.Index(list('name'), name='a')
    > data = np.random.normal(10, 1, (4, 2))
    > df = pd.DataFrame(data, columns=cols, index=idx)

    # a single column is a Series
    > df.A

  • Create a Panel by passing a dictionary of DataFrames to the constructor.

    # multiple DataFrames form a Panel
    > pan = pd.Panel({'df1': df, 'df2': df})
    > pan
    <class 'pandas.core.panel.Panel'>



  • There are many ways to do I/O with pandas; in this step we will write the DataFrame out to several mediums.

    # write df to different file types
    > df.to_csv('df.csv')
    > df.to_latex('df.tex')   # useful with Pweave
    > df.to_excel('df.xlsx')  # requires extra packages
    > df.to_html('df.html')
    > df.to_string()

    # read df back from the files; the output methods aren't symmetric,
    # so often there's an intermediate step
    > pd.read_csv('df.csv')

    # back and forth with JSON; JSON isn't officially supported, and the
    # reasons why are beyond the scope of this book
    > import json
    > with open('df.json', 'w') as f:
          json.dump(df.to_dict(), f)

    > with open('df.json') as f:
          df_json = json.load(f)
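    As a small illustration of that asymmetry (a sketch, not from the book): to_csv wrote the index out as the first column, so read_csv needs to be told where the index lives, and the dict that came back through JSON can simply be handed to the DataFrame constructor.

    # restore the index that to_csv wrote as column 0
    > df_restored = pd.read_csv('df.csv', index_col=0)

    # rebuild a DataFrame from the dict loaded out of JSON
    > df_from_json = pd.DataFrame(df_json)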




Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.




How it works...

Most of pandas' file input and output is behind-the-scenes orchestration: formatting the output values and then writing those values to a file. There are many options for formatting file output. The to_csv method takes many parameters; some of the more common ones are as follows:

  • sep: It specifies the value used to separate fields in the output file
  • index: It is a Boolean that decides whether or not to write the index
  • na_rep: It specifies what to substitute for NA values

The following snippet writes the DataFrame df to a file called file.tsv, formatted according to the parameters passed to the method.

  > df.to_csv('file.tsv', sep='\t', index=False, na_rep='NULL')


There's more...

In addition to standard file input and output functionalities, pandas has several built-in niceties.

Parsing dates at file read time

Using pandas' sophisticated date parser, a CSV's dates can be read and parsed in a single step, as shown in the following command line:

  # parse_dates=True parses the index column (set here to column 0) as dates
  > df = pd.read_csv('dates.csv', parse_dates=True, index_col=0)

Besides the parsing capabilities, pandas also has a very handy date_range function, which returns a range of dates determined by the inputs. For example, it's very easy to get the months of 2012 in a series. This is shown in the following command line:

  > pd.date_range('2012-01-01', '2012-12-31', freq='M')


Accessing data from a public source

pandas can also read CSV data from the Web. Take a look at the following example, assuming http://www.example.com/data.csv is the URL:

  > url = 'http://www.example.com/data.csv'
  > df = pd.read_csv(url)






#4
ReneeBK · posted 2015-1-5 02:34:26
Slicing pandas objects (Simple)

In this recipe we'll walk through the basics of slicing pandas objects. If you're familiar with array slicing, this will be very familiar, though pandas has a few idiosyncrasies of its own.

Getting ready

Open up your interpreter, and execute the following steps:


How to do it...
  • Create a simple DataFrame to explore the different slicing abilities of pandas.
    > dim = (10, 3)
    > df = pd.DataFrame(np.random.normal(0, 1, dim), columns=['one', 'two', 'three'])


  • Select the first two rows of the column named 'one'.
    > df['one'][:2]

  • Pass an array of column names instead of 'one'.
    > df[['one', 'two']][:2]

  • Use a negative index to navigate backwards through the DataFrame.
    > df[['one', 'two']][-3:-2]

  • Select every fifth row from the DataFrame df.
    > df[::5]

  • Use the head and tail functions to easily select the top and bottom of the DataFrame.
    > df.head(2)
    > df.tail(2)




How it works...

At some level, pandas objects behave similarly to NumPy arrays; they are, after all, abstractions built on top of them. However, because we have more metadata about the data structures, we can use that to our advantage.

After the initial pandas object is created, simple slicing occurs according to the following structure:

  > df[column names][rows]

Here column names is a string (or a list, if selecting multiple columns) and rows is the slice of rows that we wish to use.
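
For instance, putting the two halves of that structure together on the df created above (a quick sketch, not from the book):

  # column selection first, then a row slice
  > df['one'][2:5]              # rows 2 through 4 of column 'one'
  > df[['one', 'three']][::2]   # every second row of two columns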


There's more...

The methods that have already been described are very useful at a higher level, but there are more granular operations available.

Direct index access

The .ix command is an advanced method for selecting and slicing a DataFrame. Taking the sample from the preceding example, df.ix[1:3, ['one', 'two']] = 10 will not only select the specified subset of the data, but also set its value equal to 10. The .xs command has a more explicit interface for working with indexes.
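
As a quick illustration (a sketch in the pandas API of the book's era; later versions deprecate .ix in favor of .loc and .iloc):

  # select rows labeled 1 and 2 of two columns and overwrite them in place
  > df.ix[1:3, ['one', 'two']] = 10

  # .xs pulls the cross-section at a single index label
  > df.xs(1)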


Resetting the index

Often, the index of the DataFrame becomes out of alignment when slicing data. In pandas, the easiest way to reset an index is with the reset_index() method of the DataFrame object.
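
For example (a minimal sketch): after slicing, reset_index moves the old index into a column and installs a fresh zero-based one.

  > sliced = df[2:5]
  > sliced.reset_index()           # the old index becomes a column
  > sliced.reset_index(drop=True)  # or discard it entirely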





#5
nkunku · posted 2015-1-5 03:03:22
Using indexes to manipulate objects (Medium)

Indexes are not advanced because they're difficult; rather, if we want to be expert with pandas, it is important that we use them well. We will discuss hierarchical indexes in the following There's more... section.

Getting ready

A good understanding of indexes in pandas is crucial to quickly moving data around. From a business intelligence perspective, they create a distinction similar to that of metrics and dimensions in an OLAP cube. To illustrate this point, this recipe walks through getting stock data into pandas, combining it, then reindexing it for easy chomping.


How to do it...
  • Use the DataReader object to transfer stock price information into a DataFrame and to explore the basic axes of a Panel.

    > from pandas.io.data import DataReader
    > tickers = ['gs', 'ibm', 'f', 'ba', 'axp']
    > dfs = {}
    > for ticker in tickers:
          dfs[ticker] = DataReader(ticker, "yahoo", '2006-01-01')

    # a yet undiscussed data structure: in the same way that a DataFrame
    # is a collection of Series, a Panel is a collection of DataFrames
    > pan = pd.Panel(dfs)
    > pan

    > pan.items
    Index([axp, ba, f, gs, ibm], dtype=object)

    > pan.minor_axis
    Index([Open, High, Low, Close, Volume, Adj Close], dtype=object)

    > pan.major_axis

  • Use the axis selectors to easily compute different sets of summary statistics.

    > pan.minor_xs('Open').mean()

  • Convert the Panel to a DataFrame.

    > dfs = []
    > for df in pan:
          idx = pan.major_axis
          idx = pd.MultiIndex.from_tuples(zip([df]*len(idx), idx))
          idx.names = ['ticker', 'timestamp']
          dfs.append(pd.DataFrame(pan[df].values, index=idx, columns=pan.minor_axis))

    > df = pd.concat(dfs)
    > df


  • The major axis is sliceable as well.

    # major axis is sliceable as well
    > day_slice = pan.major_axis[1]
    > pan.major_xs(day_slice)[['gs', 'ba']]
  • Perform the analogous operations as in the preceding examples on the newly created DataFrame.

    # selecting from a MultiIndex isn't much different than the Panel
    # (output muted)
    > df.ix['gs':'ibm']
    > df['Open']




How it works...

The previous example was certainly contrived, but when indexing and statistical techniques are incorporated, the power of pandas begins to come through. Statistics will be covered in an upcoming recipe.

pandas' indexes by themselves can be thought of as descriptors of a certain point in the DataFrame. When ticker and timestamp are the only indexes in a DataFrame, then the point is individualized by the ticker, timestamp, and column name. After the point is individualized, it's more convenient for aggregation and analysis.
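
Concretely (a small sketch on the df built above, using the era's .ix accessor): one ticker plus a timestamp and a column name pins down a single value.

  # all rows for one ticker (level 0 of the MultiIndex)
  > df.ix['gs'].head()

  # one column for that ticker; a timestamp then individualizes a point
  > df.ix['gs']['Open'].head()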


There's more...

Indexes show up all over the place in pandas so it's worthwhile to see some other use cases as well.

Advanced header indexes

Hierarchical indexing isn't limited to rows. Headers can also be represented by MultiIndex, as shown in the following command line:

  > header_top = ['Price', 'Price', 'Price', 'Price', 'Volume', 'Price']
  > df.columns = pd.MultiIndex.from_tuples(zip(header_top, df.columns))


Performing aggregate operations with indexes

As a prelude to the following sections, we'll do a single groupby here, since it works with indexes so well.

  > df.groupby(level=['ticker', 'timestamp'])['Volume'].mean()

Because groupby can operate directly on index levels (here the ticker and timestamp levels defined earlier), this computes the mean volume for each ticker and timestamp pair; grouping on the ticker level alone would answer, for each ticker, what the mean volume was over the life of the data.




#6
Lisrelchen · posted 2015-1-5 03:03:33
Working with dates (Medium)

In this recipe we'll talk about working with dates in pandas. Because pandas was initially written for financial time series, it has a lot of out-of-the-box date functionality.

Getting ready

Open up your interpreter and follow the command progression in the following section. Difficult financial analysis was the mother of pandas' creation; therefore, it has many efficient and easy ways of dealing with dates.


How to do it...
  • Let's examine the date_range functionality within pandas.

    > Y2K = pd.date_range('2000-01-01', '2000-12-31')
    > Y2K

    # it is very easy to create a date range of a different frequency
    > Y2K_hourly = pd.date_range('2000-01-01', '2000-12-31', freq='H')
    > Y2K_hourly

  • Create a time series and slice it by passing a range of dates to the Series.

    > Y2K_temp = pd.Series(np.random.normal(75, 10, len(Y2K)), index=Y2K)
    > Y2K_temp.head()

    > Y2K_temp['2000-01-01':'2000-01-02']

    > from datetime import date
    > Y2K_temp[date(2000, 1, 1):date(2000, 1, 2)]

    # pandas has functionality to move into and out of date scopes
    > Y2K_temp.resample('H', fill_method='pad')[:1]




How it works...

The date_range function is defined by dates and frequencies. See the following section for the various frequency designations. The easiest way is to define a start date, end date, and frequency, but there are other ways as well. You can also change the frequency, or resample to a smaller or larger time interval.


There's more...

pandas adds a lot more functionality for handling dates. These are mostly convenience methods, because working with dates is a necessary evil of data analysis.

Alternative date range specification

Time series in pandas don't have to be defined by a start and end date. In pandas, it is possible to represent the time of the Series as an interval of dates with a common period between data points. For example, if we want to create a Series just like Y2K, we can do so as follows:

  > pd.date_range(start='2012-01-01', periods=366, freq='D')
Upsampling and downsampling Series

pandas offers the ability to move up and down the granularity of a time series. For example, given a Series of random numbers s for all the days in 2012, calculating the sum for each month is done with the following command:

  > s.resample('M', how='sum')

In the preceding example, 'M' specifies that we're downsampling to monthly frequency. Upsampling, moving to a finer granularity, is done in a similar way; in that direction, pandas provides functionality for handling the disaggregation, that is, for filling the newly created gaps, in a convenient way.
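
As a quick sketch of the opposite direction (assuming the daily Series s from above): upsampling to hours creates new slots that need a fill rule.

  # upsample daily data to hourly; 'pad' fills each new hour
  # with the most recent daily value
  > s.resample('H', fill_method='pad')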





#7
Lisrelchen · posted 2015-1-5 03:10:26
Modifying data with functions (Simple)

In this recipe we'll walk through the process of applying a function to a DataFrame. This is a simple but very important part of data analysis. Rarely, if ever, will data in its raw form be sufficient for analysis; often, that data needs to be transformed into some other form, and to do that you'll need to apply functions to pandas objects.

Getting ready

Open up your interpreter, and type the following commands successively.


How to do it...
  • Create a simple DataFrame of simulated open and close prices for a year.

    > data = {'Open': np.random.normal(100, 5, 366),
              'Close': np.random.normal(100, 5, 366)}

    > df = pd.DataFrame(data)

    > df

  • Apply element-wise functions.

    > df.apply(np.mean, axis=1).head(3)

    # passing a lambda is a common pattern
    > df.apply(lambda x: (x['Open'] - x['Close']), axis=1).head(3)

    # define a more complex function
    > def percent_change(x):
          return (x['Open'] - x['Close']) / x['Open']
    > df.apply(percent_change, axis=1).head(3)

    # change axis; axis=0 is the default
    > df.apply(np.mean, axis=0)


  • Define a standalone function that takes two arguments: one is the element itself, and the other is an additional argument.

    > def greater_than_x(element, x):
          return element > x

    > df.Open.apply(greater_than_x, args=(100,)).head(3)

    # this can be used in conjunction with the subset capabilities
    > mask = df.Open.apply(greater_than_x, args=(100,))

    > df.Open[mask].head()

    # it's also possible to do a rolling apply, which applies
    # aggregate functions over a certain number of rows;
    # for instance, we can get the five-day moving average
    > pd.rolling_apply(df.Close, 5, np.mean)

    # there are actually several built-in rolling functions
    > pd.rolling_corr(df.Close, df.Open, 5)[:5]
How it works...

pandas sits on top of NumPy; thus, pandas takes advantage of the broadcasting capabilities inherent within NumPy. For example, execute the following script to see the difference broadcasting makes:

  > a = [1, 2, 3]
  > a * 2        # list repetition: [1, 2, 3, 1, 2, 3]
  > b = np.array(a)
  > b * 2        # broadcasting: array([2, 4, 6])

Understanding the underlying NumPy structure is beyond the scope of this book, but it is extremely helpful in the long run.


There's more...

pandas makes additional use of the apply function in place of for loops. Quite often it's necessary to perform more complex operations on one or more entire columns of a DataFrame, where broadcasting won't cut it and looping is too slow.

Other apply options

There are other functions in the apply family. For example, the applymap function operates in a slightly different manner than the apply function: applymap operates on a single value and returns a single value, whereas apply takes an array-like data structure as an input.
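
A quick sketch of the difference (not from the book), using the Open/Close df created above:

  # applymap: the function receives one scalar at a time
  > df.applymap(lambda v: round(v, 1)).head(3)

  # apply: the function receives a whole column (a Series) at once
  > df.apply(lambda col: col.max() - col.min())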


Alternative solutions

Functions can also be applied iteratively; however, this tends to be slow and leads to unnecessarily verbose code.

  # iterating manually...
  > for x in df.Open:
        some_function(x)

  # ...versus the vectorized apply
  > df.Open.apply(some_function)




#8
Lisrelchen · posted 2015-1-5 03:14:11
Combining datasets (Medium)

Given that we have several different types of DataFrames, how can we best join them into one DataFrame for additional use? We'll also talk about merging and appending them in the following There's more... section.

Getting ready

Open up your interpreter and type the following commands successively. Very rarely will an analyst receive data in a single flat file. Quite often, data will need to be either appended to the bottom of a DataFrame or attached to its side. For example, if a set of data comes directly from a normalized database, the analyst will need to combine the tables by joining them on primary and foreign keys.


How to do it...
  • Create two basic DataFrames df1 and df2.

    > rng = pd.date_range('2000-01-01', '2000-01-05')

    > tickers = pd.DataFrame(['MSFT', 'AAPL'], columns=['Ticker'])

    > df1 = pd.DataFrame({'TickerID': [0]*5,
          'Price': np.random.normal(100, 10, 5)}, index=rng)

    > df2 = pd.DataFrame({'TickerID': [1]*5,
          'Price': np.random.normal(100, 10, 5)}, index=rng)


  • The concat method is similar to the UNION command in SQL.

    # select the first and last value from concat
    > pd.concat([df1, df2]).ix[[0, -1]]

  • Merge the two DataFrames into a single DataFrame.

    > pd.merge(df1, df2, left_index=True, right_index=True)

    > pd.merge(df1, tickers, right_index=True, left_on='TickerID')




How it works...

Readers familiar with R will find that joining data in pandas is not much different than in R. We'll cover more on indexes later, but thinking of the default index as a primary key, or of a hierarchical index as a composite key, elucidates the joining process.


There's more...

There are many options that can be supplied to the merge and join methods to modify the DataFrames' behaviour.

Merge and join details

The merge (and join) method takes a how parameter, which is a string specifying the type of database-style join to perform. The possible values are 'left', 'right', 'outer', and 'inner'.
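
As a brief sketch of how the how parameter changes the result (reusing df1 and tickers from above; not from the book):

  # inner keeps only TickerIDs present in both frames (the default)
  > pd.merge(df1, tickers, left_on='TickerID', right_index=True, how='inner')

  # left keeps every row of df1, filling NaN where tickers has no match
  > pd.merge(df1, tickers, left_on='TickerID', right_index=True, how='left')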


Specifying outputs in join

The join method (distinct from the merge method mentioned previously) is an easy way to join DataFrames.

  > df1.join(df2, lsuffix='.1', rsuffix='.2')

Use suffixes to disambiguate columns if the DataFrames have similar column names. join defaults to joining on indexes, but the on parameter can be used to specify a column instead. For example:

  > a_df.join(another_df, on='columnA')


Concatenation

One way to join datasets is to simply stack them on top of each other; this is similar to the UNION command in SQL. Given two DataFrames, One and Two, a concatenation is done in the following way:

  > pd.concat([One, Two])

Note that concat takes a list of DataFrames. While a list may feel like overhead for just two DataFrames, it makes much more sense in the event of 50.





#9
soccy · posted 2015-1-5 04:31:41

There really are very few books on pandas.


