[Exclusive Release] [2016 New Book] Foundations for Analytics with Python: From Non-Programmer to Hacker

If you like this document, you are welcome to subscribe to the [2016 New Books] collection: https://bbs.pinggu.org/forum.php?mod=collection&action=view&ctid=3187

Title: Foundations for Analytics with Python: From Non-Programmer to Hacker
Author: Clinton W. Brownley
Publisher: O’Reilly Media
Pages: 356
Published: 2016
Language: English
Format: pdf
Description:
If you’re like many of Excel’s 750 million users, you want to do more with your data—like repeating similar analyses over hundreds of files, or combining data in many files for analysis at one time. This practical guide shows ambitious non-programmers how to automate and scale the processing and analysis of data in different formats—by using Python.

After author Clinton Brownley takes you through Python basics, you’ll be able to write simple scripts for processing data in spreadsheets as well as databases. You’ll also learn how to use several Python modules for parsing files, grouping data, and producing statistics. No programming experience is necessary.

Free with a reply:

Hidden content of this post:

Foundations for Analytics with Python_ From Non-Programmer to Hacke.pdf (6.39 MB)


#2

benji427

Posted on 2016-8-25 20:01:12

#!/usr/bin/env python3
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Read the data set into a pandas DataFrame
churn = pd.read_csv('churn.csv', sep=',', header=0)
churn.columns = [heading.lower() for heading in
                 churn.columns.str.replace(' ', '_').str.replace("'", "").str.strip('?')]
churn['churn01'] = np.where(churn['churn'] == 'True.', 1., 0.)
print(churn.head())
print(churn.describe())
# Calculate descriptive statistics for grouped data
print(churn.groupby(['churn'])[['day_charge', 'eve_charge', 'night_charge', 'intl_charge', 'account_length', 'custserv_calls']].agg(['count', 'mean', 'std']))
# Specify different statistics for different variables
print(churn.groupby(['churn']).agg({'day_charge' : ['mean', 'std'],
                                    'eve_charge' : ['mean', 'std'],
                                    'night_charge' : ['mean', 'std'],
                                    'intl_charge' : ['mean', 'std'],
                                    'account_length' : ['count', 'min', 'max'],
                                    'custserv_calls' : ['count', 'min', 'max']}))
# Create total_charges, split it into 5 groups, and
# calculate statistics for each of the groups
churn['total_charges'] = churn['day_charge'] + churn['eve_charge'] + \
                         churn['night_charge'] + churn['intl_charge']
factor_cut = pd.cut(churn.total_charges, 5, precision=2)
def get_stats(group):
    return {'min' : group.min(), 'max' : group.max(),
            'count' : group.count(), 'mean' : group.mean(),
            'std' : group.std()}
grouped = churn.custserv_calls.groupby(factor_cut)
print(grouped.apply(get_stats).unstack())
# Split account_length into quantiles and
# calculate statistics for each of the quantiles
factor_qcut = pd.qcut(churn.account_length, [0., 0.25, 0.5, 0.75, 1.])
grouped = churn.custserv_calls.groupby(factor_qcut)
print(grouped.apply(get_stats).unstack())
# Create binary/dummy indicator variables for intl_plan and vmail_plan
# and join them with the churn column in a new DataFrame
intl_dummies = pd.get_dummies(churn['intl_plan'], prefix='intl_plan')
vmail_dummies = pd.get_dummies(churn['vmail_plan'], prefix='vmail_plan')
churn_with_dummies = churn[['churn']].join([intl_dummies, vmail_dummies])
print(churn_with_dummies.head())
# Split total_charges into quartiles, create binary indicator variables
# for each of the quartiles, and add them to the churn DataFrame
qcut_names = ['1st_quartile', '2nd_quartile', '3rd_quartile', '4th_quartile']
total_charges_quartiles = pd.qcut(churn.total_charges, 4, labels=qcut_names)
dummies = pd.get_dummies(total_charges_quartiles, prefix='total_charges')
churn_with_dummies = churn.join(dummies)
print(churn_with_dummies.head())
# Create pivot tables
print(churn.pivot_table(['total_charges'], index=['churn', 'custserv_calls']))
print(churn.pivot_table(['total_charges'], index=['churn'], columns=['custserv_calls']))
print(churn.pivot_table(['total_charges'], index=['custserv_calls'], columns=['churn'],
                        aggfunc='mean', fill_value='NaN', margins=True))
# Fit a logistic regression model
dependent_variable = churn['churn01']
independent_variables = churn[['account_length', 'custserv_calls', 'total_charges']]
independent_variables_with_constant = sm.add_constant(independent_variables, prepend=True)
logit_model = sm.Logit(dependent_variable, independent_variables_with_constant).fit()
#logit_model = smf.glm(output_variable, input_variables, family=sm.families.Binomial()).fit()
print(logit_model.summary())
print("\nQuantities you can extract from the result:\n%s" % dir(logit_model))
print("\nCoefficients:\n%s" % logit_model.params)
print("\nCoefficient Std Errors:\n%s" % logit_model.bse)
#logit_marginal_effects = logit_model.get_margeff(method='dydx', at='overall')
#print(logit_marginal_effects.summary())
print("\ninvlogit(-7.2205 + 0.0012*mean(account_length) + 0.4443*mean(custserv_calls) + 0.0729*mean(total_charges))")
def inverse_logit(model_formula):
    from math import exp
    # Convert log-odds to a probability, expressed as a percentage
    return (1.0 / (1.0 + exp(-model_formula)))*100.0
# .iloc gives positional access to the fitted coefficients (constant first)
at_means = float(logit_model.params.iloc[0]) + \
    float(logit_model.params.iloc[1])*float(churn['account_length'].mean()) + \
    float(logit_model.params.iloc[2])*float(churn['custserv_calls'].mean()) + \
    float(logit_model.params.iloc[3])*float(churn['total_charges'].mean())
print(churn['account_length'].mean())
print(churn['custserv_calls'].mean())
print(churn['total_charges'].mean())
print(at_means)
print("Probability of churn when independent variables are at their mean values: %.2f" % inverse_logit(at_means))
cust_serv_mean = float(logit_model.params.iloc[0]) + \
    float(logit_model.params.iloc[1])*float(churn['account_length'].mean()) + \
    float(logit_model.params.iloc[2])*float(churn['custserv_calls'].mean()) + \
    float(logit_model.params.iloc[3])*float(churn['total_charges'].mean())
cust_serv_mean_minus_one = float(logit_model.params.iloc[0]) + \
    float(logit_model.params.iloc[1])*float(churn['account_length'].mean()) + \
    float(logit_model.params.iloc[2])*float(churn['custserv_calls'].mean()-1.0) + \
    float(logit_model.params.iloc[3])*float(churn['total_charges'].mean())
print(cust_serv_mean)
print(churn['custserv_calls'].mean()-1.0)
print(cust_serv_mean_minus_one)
# Only custserv_calls differs between the two linear predictors above
print("Change in probability of churn when customer service calls change by 1: %.2f" % (inverse_logit(cust_serv_mean) - inverse_logit(cust_serv_mean_minus_one)))
# Predict churn for "new" observations
# .ix was removed from pandas and xrange does not exist in Python 3
new_observations = churn.loc[churn.index.isin(range(10)), independent_variables.columns]
new_observations_with_constant = sm.add_constant(new_observations, prepend=True)
y_predicted = logit_model.predict(new_observations_with_constant)
y_predicted_rounded = [round(score, 2) for score in y_predicted]
print(y_predicted_rounded)
# Fit a logistic regression model
output_variable = churn['churn01']
vars_to_keep = churn[['account_length', 'custserv_calls', 'total_charges']]
inputs_standardized = (vars_to_keep - vars_to_keep.mean()) / vars_to_keep.std()
input_variables = sm.add_constant(inputs_standardized, prepend=False)
logit_model = sm.Logit(output_variable, input_variables).fit()
#logit_model = smf.glm(output_variable, input_variables, family=sm.families.Binomial()).fit()
print(logit_model.summary())
print(logit_model.params)
print(logit_model.bse)
#logit_marginal_effects = logit_model.get_margeff(method='dydx', at='overall')
#print(logit_marginal_effects.summary())
# Predict output value for a new observation based on its mean standardized input values
input_variables = [0., 0., 0., 1.]
predicted_value = logit_model.predict(input_variables)
# predict() returns an array, so format its single element inside the print call
print("Predicted value: %.5f" % predicted_value[0])
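
The marginal-effect step in the script above can be checked by hand without statsmodels. The following is a minimal sketch: the coefficients are copied from the formula string the script prints, while the input values (100.0, 1.5, 60.0) and the helper name `churn_probability` are hypothetical and chosen only for illustration; they are not taken from churn.csv.

```python
from math import exp

def inverse_logit(linear_predictor):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-linear_predictor))

# Coefficients copied from the formula string printed by the script above.
INTERCEPT, B_ACCOUNT, B_CALLS, B_CHARGES = -7.2205, 0.0012, 0.4443, 0.0729

def churn_probability(account_length, custserv_calls, total_charges):
    """Hypothetical helper: evaluate the fitted formula for one customer."""
    log_odds = (INTERCEPT
                + B_ACCOUNT * account_length
                + B_CALLS * custserv_calls
                + B_CHARGES * total_charges)
    return inverse_logit(log_odds)

# Illustrative inputs only; substitute the means printed by the script.
base = churn_probability(100.0, 1.5, 60.0)
one_fewer_call = churn_probability(100.0, 0.5, 60.0)
print("base: %.4f, one fewer call: %.4f, difference: %.4f"
      % (base, one_fewer_call, base - one_fewer_call))
```

Because the coefficient on custserv_calls is positive, lowering the number of calls by one lowers the predicted probability, which is the difference the script reports.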


#3

soccy, posted on 2016-8-25 20:26:35

#!/usr/bin/env python3
import sys
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('selected_columns_all_worksheets')
my_columns = ['Customer Name', 'Sale Amount']
first_worksheet = True
with open_workbook(input_file) as workbook:
    data = [my_columns]
    index_of_cols_to_keep = []
    for worksheet in workbook.sheets():
        # Look up the column positions once, from the first worksheet's header
        if first_worksheet:
            header = worksheet.row_values(0)
            for column_index in range(len(header)):
                if header[column_index] in my_columns:
                    index_of_cols_to_keep.append(column_index)
            first_worksheet = False
        for row_index in range(1, worksheet.nrows):
            row_list = []
            for column_index in index_of_cols_to_keep:
                cell_value = worksheet.cell_value(row_index, column_index)
                cell_type = worksheet.cell_type(row_index, column_index)
                # Cell type 3 is xlrd's XL_CELL_DATE
                if cell_type == 3:
                    date_cell = xldate_as_tuple(cell_value,workbook.datemode)
                    date_cell = date(*date_cell[0:3]).strftime('%m/%d/%Y')
                    row_list.append(date_cell)
                else:
                    row_list.append(cell_value)
            data.append(row_list)
    for list_index, output_list in enumerate(data):
        for element_index, element in enumerate(output_list):
            output_worksheet.write(list_index, element_index, element)
output_workbook.save(output_file)
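
The date branch in the script above relies on Excel storing dates as serial day numbers. As a rough illustration of what that conversion involves, here is a standard-library sketch. It assumes the 1900 date system (xlrd datemode 0) and serial numbers of 61 or more; the function name `excel_serial_to_date` is hypothetical and is not part of xlrd.

```python
from datetime import date, timedelta

def excel_serial_to_date(serial):
    """Convert a 1900-system Excel serial day number to a date.

    Serials below 61 are rejected because Excel counts a nonexistent
    1900-02-29, which makes earlier values ambiguous.
    """
    if serial < 61:
        raise ValueError("serials before 1900-03-01 are ambiguous")
    # Any fractional part (time of day) is dropped by int().
    return date(1899, 12, 30) + timedelta(days=int(serial))

# Format the result the same way the script above does.
print(excel_serial_to_date(41275).strftime('%m/%d/%Y'))  # prints 01/01/2013
```

In the script itself, xldate_as_tuple together with workbook.datemode remains the safer choice, since it also handles the 1904 date system.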


#4

ryuuzt, posted on 2016-8-25 20:45:39

#!/usr/bin/env python3
import sys
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('set_of_worksheets')
my_sheets = [0,1]
threshold = 1900.0
sales_column_index = 3
first_worksheet = True
with open_workbook(input_file) as workbook:
    data = []
    for sheet_index in range(workbook.nsheets):
        if sheet_index in my_sheets:
            worksheet = workbook.sheet_by_index(sheet_index)
            if first_worksheet:
                header_row = worksheet.row_values(0)
                data.append(header_row)
                first_worksheet = False
            for row_index in range(1,worksheet.nrows):
                row_list = []
                sale_amount = worksheet.cell_value(row_index, sales_column_index)
                if sale_amount > threshold:
                    for column_index in range(worksheet.ncols):
                        cell_value = worksheet.cell_value(row_index,column_index)
                        cell_type = worksheet.cell_type(row_index, column_index)
                        if cell_type == 3:
                            date_cell = xldate_as_tuple(cell_value,workbook.datemode)
                            date_cell = date(*date_cell[0:3]).strftime('%m/%d/%Y')
                            row_list.append(date_cell)
                        else:
                            row_list.append(cell_value)
                if row_list:
                    data.append(row_list)
    for list_index, output_list in enumerate(data):
        for element_index, element in enumerate(output_list):
            output_worksheet.write(list_index, element_index, element)
output_workbook.save(output_file)


#5

ryuuzt, posted on 2016-8-25 20:45:39

#!/usr/bin/env python3
import glob
import os
import sys
from xlrd import open_workbook
input_directory = sys.argv[1]
workbook_counter = 0
for input_file in glob.glob(os.path.join(input_directory, '*.xls*')):
    workbook = open_workbook(input_file)
    print('Workbook: {}'.format(os.path.basename(input_file)))
    print('Number of worksheets: {}'.format(workbook.nsheets))
    for worksheet in workbook.sheets():
        print('Worksheet name:', worksheet.name, '\tRows:',
              worksheet.nrows, '\tColumns:', worksheet.ncols)
    workbook_counter += 1
print('Number of Excel workbooks: {}'.format(workbook_counter))
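
One detail worth checking in the script above is what the pattern '*.xls*' actually matches. glob uses the same matching rules as the fnmatch module, so the pattern can be tried on a list of names without touching the disk. The file names below are made up for illustration. Note also that xlrd 2.0 and later read only the .xls format, so with a recent xlrd it may be necessary to narrow the match as the last line does.

```python
import fnmatch

# Hypothetical file names, used only to show what the pattern matches.
names = ['sales_2013.xls', 'sales_2014.xlsx', 'notes.txt', 'macro.xlsm']

# '*.xls*' matches any name containing '.xls', including .xlsx and .xlsm.
matched = fnmatch.filter(names, '*.xls*')
print(matched)

# Keep only the legacy .xls workbooks.
xls_only = [name for name in matched if name.lower().endswith('.xls')]
print(xls_only)
```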


#6

ekscheng, posted on 2016-8-25 22:39:58

#7

steventung, posted on 2016-8-25 22:51:30

#!/usr/bin/env python3
import glob
import os
import sys
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook
input_folder = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('sums_and_averages')
all_data = []
sales_column_index = 3
header = ['workbook', 'worksheet', 'worksheet_total', 'worksheet_average',
          'workbook_total', 'workbook_average']
all_data.append(header)
for input_file in glob.glob(os.path.join(input_folder, '*.xls*')):
    with open_workbook(input_file) as workbook:
        list_of_totals = []
        list_of_numbers = []
        workbook_output = []
        for worksheet in workbook.sheets():
            total_sales = 0
            number_of_sales = 0
            worksheet_list = []
            worksheet_list.append(os.path.basename(input_file))
            worksheet_list.append(worksheet.name)
            for row_index in range(1,worksheet.nrows):
                try:
                    # The currency symbol was lost from the post; '$' is assumed here
                    total_sales += float(str(worksheet.cell_value(row_index,sales_column_index)).strip('$').replace(',',''))
                    number_of_sales += 1.
                except (ValueError, IndexError):
                    # Skip cells that are missing or not numeric
                    pass
            # Guard against worksheets with no numeric sales rows
            average_sales = '%.2f' % (total_sales / number_of_sales if number_of_sales else 0.)
            worksheet_list.append(total_sales)
            worksheet_list.append(float(average_sales))
            list_of_totals.append(total_sales)
            list_of_numbers.append(float(number_of_sales))
            workbook_output.append(worksheet_list)
        workbook_total = sum(list_of_totals)
        workbook_average = sum(list_of_totals)/sum(list_of_numbers) if sum(list_of_numbers) else 0.
        for list_element in workbook_output:
            list_element.append(workbook_total)
            list_element.append(workbook_average)
        all_data.extend(workbook_output)
for list_index, output_list in enumerate(all_data):
    for element_index, element in enumerate(output_list):
        output_worksheet.write(list_index, element_index, element)
output_workbook.save(output_file)
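
The parsing and averaging logic in the script above can be tried on its own, without any workbook. This is a minimal sketch under two assumptions: amounts may arrive as numbers or as strings with a leading '$' and thousands separators, and non-numeric cells should simply be skipped. The helper names `parse_amount` and `total_and_average` are hypothetical.

```python
def parse_amount(cell_value):
    """Return the cell as a float, or None if it is not numeric."""
    try:
        return float(str(cell_value).strip('$').replace(',', ''))
    except ValueError:
        return None

def total_and_average(cells):
    """Sum the numeric cells and average them, ignoring everything else."""
    amounts = [a for a in map(parse_amount, cells) if a is not None]
    total = sum(amounts)
    # Avoid dividing by zero when no cell could be parsed.
    average = total / len(amounts) if amounts else 0.0
    return total, round(average, 2)

print(total_and_average(['$1,200.00', 1350.0, 'n/a', '$600']))  # prints (3150.0, 1050.0)
```

Returning None for unparseable cells keeps the counting explicit, so the average is taken only over the values that contributed to the total.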


#8

挚爱酸模, posted on 2016-8-25 23:20:46

#!/usr/bin/env python3
import sys
from xlrd import open_workbook
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('jan_2013_output')
with open_workbook(input_file) as workbook:
    worksheet = workbook.sheet_by_name('january_2013')
    for row_index in range(worksheet.nrows):
        for column_index in range(worksheet.ncols):
            output_worksheet.write(row_index, column_index, worksheet.cell_value(row_index, column_index))
output_workbook.save(output_file)


#9

smartlife

Posted on 2016-8-25 23:25:37


#10

bbslover

Posted on 2016-8-26 00:40:09

thanks for sharing

