[Exclusive release] [2016 new books] Foundations for Analytics with Python: From Non-Programmer to Hack

Original poster
牛尾巴 posted on 2016-8-25 19:50:56

If you like this document, you are welcome to subscribe to the [2016 new books] collection: https://bbs.pinggu.org/forum.php?mod=collection&action=view&ctid=3187

Title: Foundations for Analytics with Python: From Non-Programmer to Hacker

Author: Clinton W. Brownley

Publisher: O’Reilly Media

Pages: 356
Publication year: 2016

Language: English

Format: pdf
Summary:
If you’re like many of Excel’s 750 million users, you want to do more with your data—like repeating similar analyses over hundreds of files, or combining data in many files for analysis at one time. This practical guide shows ambitious non-programmers how to automate and scale the processing and analysis of data in different formats—by using Python.

After author Clinton Brownley takes you through Python basics, you’ll be able to write simple scripts for processing data in spreadsheets as well as databases. You’ll also learn how to use several Python modules for parsing files, grouping data, and producing statistics. No programming experience is necessary.

Free after replying:

Hidden content of this post

Foundations for Analytics with Python_ From Non-Programmer to Hacke.pdf (6.39 MB)



2
benji427 (employment verified) posted on 2016-8-25 20:01:12
#!/usr/bin/env python3
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Read the data set into a pandas DataFrame
churn = pd.read_csv('churn.csv', sep=',', header=0)

# Normalize the column headings: lowercase, underscores instead of spaces,
# no apostrophes, no trailing question marks
churn.columns = [heading.lower() for heading in
                 churn.columns.str.replace(' ', '_').str.replace("'", "").str.strip('?')]

# Encode the churn flag as 1.0 / 0.0
churn['churn01'] = np.where(churn['churn'] == 'True.', 1., 0.)
print(churn.head())
print(churn.describe())

# Calculate descriptive statistics for grouped data
print(churn.groupby(['churn'])[['day_charge', 'eve_charge', 'night_charge',
                                'intl_charge', 'account_length',
                                'custserv_calls']].agg(['count', 'mean', 'std']))

# Specify different statistics for different variables
print(churn.groupby(['churn']).agg({'day_charge': ['mean', 'std'],
                                    'eve_charge': ['mean', 'std'],
                                    'night_charge': ['mean', 'std'],
                                    'intl_charge': ['mean', 'std'],
                                    'account_length': ['count', 'min', 'max'],
                                    'custserv_calls': ['count', 'min', 'max']}))

# Create total_charges, split it into 5 groups, and
# calculate statistics for each of the groups
churn['total_charges'] = churn['day_charge'] + churn['eve_charge'] + \
    churn['night_charge'] + churn['intl_charge']
factor_cut = pd.cut(churn.total_charges, 5, precision=2)

def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean(),
            'std': group.std()}

grouped = churn.custserv_calls.groupby(factor_cut)
print(grouped.apply(get_stats).unstack())

# Split account_length into quantiles and
# calculate statistics for each of the quantiles
factor_qcut = pd.qcut(churn.account_length, [0., 0.25, 0.5, 0.75, 1.])
grouped = churn.custserv_calls.groupby(factor_qcut)
print(grouped.apply(get_stats).unstack())

# Create binary/dummy indicator variables for intl_plan and vmail_plan
# and join them with the churn column in a new DataFrame
intl_dummies = pd.get_dummies(churn['intl_plan'], prefix='intl_plan')
vmail_dummies = pd.get_dummies(churn['vmail_plan'], prefix='vmail_plan')
churn_with_dummies = churn[['churn']].join([intl_dummies, vmail_dummies])
print(churn_with_dummies.head())

# Split total_charges into quartiles, create binary indicator variables
# for each of the quartiles, and add them to the churn DataFrame
qcut_names = ['1st_quartile', '2nd_quartile', '3rd_quartile', '4th_quartile']
total_charges_quartiles = pd.qcut(churn.total_charges, 4, labels=qcut_names)
dummies = pd.get_dummies(total_charges_quartiles, prefix='total_charges')
churn_with_dummies = churn.join(dummies)
print(churn_with_dummies.head())

# Create pivot tables
print(churn.pivot_table(['total_charges'], index=['churn', 'custserv_calls']))
print(churn.pivot_table(['total_charges'], index=['churn'], columns=['custserv_calls']))
print(churn.pivot_table(['total_charges'], index=['custserv_calls'], columns=['churn'],
                        aggfunc='mean', fill_value='NaN', margins=True))

# Fit a logistic regression model
dependent_variable = churn['churn01']
independent_variables = churn[['account_length', 'custserv_calls', 'total_charges']]
independent_variables_with_constant = sm.add_constant(independent_variables, prepend=True)
logit_model = sm.Logit(dependent_variable, independent_variables_with_constant).fit()
#logit_model = smf.glm(output_variable, input_variables, family=sm.families.Binomial()).fit()
print(logit_model.summary())
print("\nQuantities you can extract from the result:\n%s" % dir(logit_model))
print("\nCoefficients:\n%s" % logit_model.params)
print("\nCoefficient Std Errors:\n%s" % logit_model.bse)
#logit_marginal_effects = logit_model.get_margeff(method='dydx', at='overall')
#print(logit_marginal_effects.summary())

print("\ninvlogit(-7.2205 + 0.0012*mean(account_length) + 0.4443*mean(custserv_calls) + 0.0729*mean(total_charges))")

def inverse_logit(model_formula):
    # Convert a log-odds value to a probability expressed in percent
    from math import exp
    return (1.0 / (1.0 + exp(-model_formula)))*100.0

# params is a labeled Series, so use .iloc for positional access
at_means = float(logit_model.params.iloc[0]) + \
    float(logit_model.params.iloc[1])*float(churn['account_length'].mean()) + \
    float(logit_model.params.iloc[2])*float(churn['custserv_calls'].mean()) + \
    float(logit_model.params.iloc[3])*float(churn['total_charges'].mean())

print(churn['account_length'].mean())
print(churn['custserv_calls'].mean())
print(churn['total_charges'].mean())
print(at_means)
print("Probability of churn when independent variables are at their mean values: %.2f" % inverse_logit(at_means))

cust_serv_mean = float(logit_model.params.iloc[0]) + \
    float(logit_model.params.iloc[1])*float(churn['account_length'].mean()) + \
    float(logit_model.params.iloc[2])*float(churn['custserv_calls'].mean()) + \
    float(logit_model.params.iloc[3])*float(churn['total_charges'].mean())

cust_serv_mean_minus_one = float(logit_model.params.iloc[0]) + \
    float(logit_model.params.iloc[1])*float(churn['account_length'].mean()) + \
    float(logit_model.params.iloc[2])*float(churn['custserv_calls'].mean()-1.0) + \
    float(logit_model.params.iloc[3])*float(churn['total_charges'].mean())

print(cust_serv_mean)
print(churn['custserv_calls'].mean()-1.0)
print(cust_serv_mean_minus_one)
print("Change in probability of churn when customer service calls change by 1: %.2f" % (inverse_logit(cust_serv_mean) - inverse_logit(cust_serv_mean_minus_one)))

# Predict churn for "new" observations (here, the first 10 rows of the data set)
new_observations = churn.loc[churn.index.isin(range(10)), independent_variables.columns]
new_observations_with_constant = sm.add_constant(new_observations, prepend=True)
y_predicted = logit_model.predict(new_observations_with_constant)
y_predicted_rounded = [round(score, 2) for score in y_predicted]
print(y_predicted_rounded)

# Fit a logistic regression model on standardized inputs
output_variable = churn['churn01']
vars_to_keep = churn[['account_length', 'custserv_calls', 'total_charges']]
inputs_standardized = (vars_to_keep - vars_to_keep.mean()) / vars_to_keep.std()
input_variables = sm.add_constant(inputs_standardized, prepend=False)
logit_model = sm.Logit(output_variable, input_variables).fit()
#logit_model = smf.glm(output_variable, input_variables, family=sm.families.Binomial()).fit()
print(logit_model.summary())
print(logit_model.params)
print(logit_model.bse)
#logit_marginal_effects = logit_model.get_margeff(method='dydx', at='overall')
#print(logit_marginal_effects.summary())

# Predict output value for a new observation based on its mean standardized input values
# (three standardized inputs at 0, followed by the constant, which was appended last)
input_variables = [0., 0., 0., 1.]
predicted_value = logit_model.predict(input_variables)
print("Predicted value: %.5f" % predicted_value[0])
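
To make the inverse-logit step in the script above easier to follow, here is a small standalone sketch that needs only the standard library. The coefficients are copied from the string the script prints; the input values in `example_inputs` are hypothetical, picked purely for illustration, and are not the means of the churn data set. The helper name `linear_predictor` is also my own.

```python
from math import exp


def inverse_logit(log_odds):
    """Map a log-odds value to a probability in percent, as the script above does."""
    return (1.0 / (1.0 + exp(-log_odds))) * 100.0


# Coefficients copied from the string printed by the script above.
intercept = -7.2205
coefficients = {
    'account_length': 0.0012,
    'custserv_calls': 0.4443,
    'total_charges': 0.0729,
}

# HYPOTHETICAL input values, chosen only for illustration;
# they are not the means of the churn data set.
example_inputs = {
    'account_length': 100.0,
    'custserv_calls': 2.0,
    'total_charges': 60.0,
}


def linear_predictor(inputs):
    """Intercept plus the coefficient-weighted sum of the inputs."""
    return intercept + sum(coefficients[name] * inputs[name] for name in coefficients)


baseline = inverse_logit(linear_predictor(example_inputs))

# Same customer with one fewer customer service call.
one_fewer_call = dict(example_inputs,
                      custserv_calls=example_inputs['custserv_calls'] - 1.0)
difference = baseline - inverse_logit(linear_predictor(one_fewer_call))

print("Churn probability at the example inputs: %.2f%%" % baseline)
print("Change in probability for one extra service call: %.2f points" % difference)
```

Because the coefficient on custserv_calls is positive, the difference comes out positive: one more call raises the predicted churn probability.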

3
soccy posted on 2016-8-25 20:26:35
#!/usr/bin/env python3
import sys
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook

input_file = sys.argv[1]
output_file = sys.argv[2]

output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('selected_columns_all_worksheets')

my_columns = ['Customer Name', 'Sale Amount']

first_worksheet = True
with open_workbook(input_file) as workbook:
    data = [my_columns]
    index_of_cols_to_keep = []
    for worksheet in workbook.sheets():
        # Locate the wanted columns once, using the first worksheet's header
        if first_worksheet:
            header = worksheet.row_values(0)
            for column_index in range(len(header)):
                if header[column_index] in my_columns:
                    index_of_cols_to_keep.append(column_index)
            first_worksheet = False
        for row_index in range(1, worksheet.nrows):
            row_list = []
            for column_index in index_of_cols_to_keep:
                cell_value = worksheet.cell_value(row_index, column_index)
                cell_type = worksheet.cell_type(row_index, column_index)
                # Cell type 3 is a date cell; format it as mm/dd/yyyy
                if cell_type == 3:
                    date_cell = xldate_as_tuple(cell_value, workbook.datemode)
                    date_cell = date(*date_cell[0:3]).strftime('%m/%d/%Y')
                    row_list.append(date_cell)
                else:
                    row_list.append(cell_value)
            data.append(row_list)

    for list_index, output_list in enumerate(data):
        for element_index, element in enumerate(output_list):
            output_worksheet.write(list_index, element_index, element)

output_workbook.save(output_file)
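
The script above treats cell type 3 as a date and hands the raw number to xldate_as_tuple. As a rough sketch of what that conversion amounts to, the snippet below turns an Excel serial number into a calendar date using only the standard library. It assumes the 1900 date system and is not xlrd's actual implementation; the helper name `excel_serial_to_date` is hypothetical.

```python
from datetime import date, timedelta

# In the 1900 date system, serial 1 is displayed as 1900-01-01, but Excel also
# counts a non-existent 1900-02-29. Using 1899-12-30 as day zero therefore gives
# correct results for serials of 61 (1900-03-01) and above.
EXCEL_1900_EPOCH = date(1899, 12, 30)


def excel_serial_to_date(serial):
    """Convert an Excel 1900-system serial number to a datetime.date.

    Hypothetical helper for illustration; the time-of-day fraction is discarded.
    """
    if serial < 61:
        raise ValueError('serials below 61 are ambiguous in the 1900 date system')
    return EXCEL_1900_EPOCH + timedelta(days=int(serial))


# Format the result the same way the script above does.
print(excel_serial_to_date(41275).strftime('%m/%d/%Y'))  # 01/01/2013
```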

4
ryuuzt posted on 2016-8-25 20:45:39
#!/usr/bin/env python3
import sys
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook

input_file = sys.argv[1]
output_file = sys.argv[2]

output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('set_of_worksheets')

my_sheets = [0, 1]
threshold = 1900.0
sales_column_index = 3

first_worksheet = True
with open_workbook(input_file) as workbook:
    data = []
    for sheet_index in range(workbook.nsheets):
        if sheet_index in my_sheets:
            worksheet = workbook.sheet_by_index(sheet_index)
            # Keep the header row from the first selected worksheet only
            if first_worksheet:
                header_row = worksheet.row_values(0)
                data.append(header_row)
                first_worksheet = False
            for row_index in range(1, worksheet.nrows):
                row_list = []
                sale_amount = worksheet.cell_value(row_index, sales_column_index)
                # Keep only rows whose sale amount exceeds the threshold
                if sale_amount > threshold:
                    for column_index in range(worksheet.ncols):
                        cell_value = worksheet.cell_value(row_index, column_index)
                        cell_type = worksheet.cell_type(row_index, column_index)
                        # Cell type 3 is a date cell; format it as mm/dd/yyyy
                        if cell_type == 3:
                            date_cell = xldate_as_tuple(cell_value, workbook.datemode)
                            date_cell = date(*date_cell[0:3]).strftime('%m/%d/%Y')
                            row_list.append(date_cell)
                        else:
                            row_list.append(cell_value)
                if row_list:
                    data.append(row_list)

    for list_index, output_list in enumerate(data):
        for element_index, element in enumerate(output_list):
            output_worksheet.write(list_index, element_index, element)

output_workbook.save(output_file)
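
For anyone who wants to try the filtering logic of the script above without an Excel file or xlrd installed, here is a minimal sketch on plain Python lists. The in-memory layout (a list of sheets, each a list of rows with the header in row 0), the function name `filter_rows`, and the sample values are all made up for illustration and do not come from the book's workbook.

```python
def filter_rows(worksheets, sheet_indexes, value_column_index, threshold):
    """Return the header of the first selected sheet plus every data row whose
    value in value_column_index exceeds threshold.

    worksheets is a list of sheets, each a list of rows (lists), with the
    header in row 0. This in-memory layout is an assumption for illustration.
    """
    output = []
    header_taken = False
    for sheet_index, rows in enumerate(worksheets):
        if sheet_index not in sheet_indexes:
            continue
        if not header_taken:
            output.append(rows[0])
            header_taken = True
        output.extend(row for row in rows[1:] if row[value_column_index] > threshold)
    return output


# Made-up sample data, not taken from the book's workbook.
sheets = [
    [['Customer', 'Sale Amount'], ['A', 1200.0], ['B', 2500.0]],
    [['Customer', 'Sale Amount'], ['C', 1950.0]],
    [['Customer', 'Sale Amount'], ['D', 9999.0]],
]
print(filter_rows(sheets, sheet_indexes=[0, 1], value_column_index=1, threshold=1900.0))
```

The third sheet is skipped because its index is not in sheet_indexes, which mirrors the role of my_sheets in the script.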

5
ryuuzt posted on 2016-8-25 20:45:39
#!/usr/bin/env python3
import glob
import os
import sys
from xlrd import open_workbook

input_directory = sys.argv[1]

# Report the worksheets, rows, and columns of every workbook in the directory
workbook_counter = 0
for input_file in glob.glob(os.path.join(input_directory, '*.xls*')):
    workbook = open_workbook(input_file)
    print('Workbook: {}'.format(os.path.basename(input_file)))
    print('Number of worksheets: {}'.format(workbook.nsheets))
    for worksheet in workbook.sheets():
        print('Worksheet name:', worksheet.name, '\tRows:',
              worksheet.nrows, '\tColumns:', worksheet.ncols)
    workbook_counter += 1
print('Number of Excel workbooks: {}'.format(workbook_counter))

6
ekscheng posted on 2016-8-25 22:39:58

7
steventung posted on 2016-8-25 22:51:30
#!/usr/bin/env python3
import glob
import os
import sys
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook

input_folder = sys.argv[1]
output_file = sys.argv[2]

output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('sums_and_averages')

all_data = []
sales_column_index = 3

header = ['workbook', 'worksheet', 'worksheet_total', 'worksheet_average',
          'workbook_total', 'workbook_average']
all_data.append(header)

for input_file in glob.glob(os.path.join(input_folder, '*.xls*')):
    with open_workbook(input_file) as workbook:
        list_of_totals = []
        list_of_numbers = []
        workbook_output = []
        for worksheet in workbook.sheets():
            total_sales = 0
            number_of_sales = 0
            worksheet_list = []
            worksheet_list.append(os.path.basename(input_file))
            worksheet_list.append(worksheet.name)
            for row_index in range(1, worksheet.nrows):
                try:
                    # Strip the dollar sign and thousands separators before converting
                    total_sales += float(str(worksheet.cell_value(row_index, sales_column_index)).strip('$').replace(',', ''))
                    number_of_sales += 1.
                except (ValueError, IndexError):
                    # Skip cells that are missing or do not hold a number
                    pass
            average_sales = '%.2f' % (total_sales / number_of_sales)
            worksheet_list.append(total_sales)
            worksheet_list.append(float(average_sales))
            list_of_totals.append(total_sales)
            list_of_numbers.append(float(number_of_sales))
            workbook_output.append(worksheet_list)
        workbook_total = sum(list_of_totals)
        workbook_average = sum(list_of_totals)/sum(list_of_numbers)
        # Append the workbook-level figures to every worksheet row
        for list_element in workbook_output:
            list_element.append(workbook_total)
            list_element.append(workbook_average)
        all_data.extend(workbook_output)

for list_index, output_list in enumerate(all_data):
    for element_index, element in enumerate(output_list):
        output_worksheet.write(list_index, element_index, element)

output_workbook.save(output_file)
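
The part of the script above that cleans each sales cell before adding it up can be tried on its own. The sketch below assumes amounts may arrive either as numbers or as strings like '$1,234.50' (that format is my assumption), and the helper names `parse_amount` and `total_and_average` are hypothetical. Unlike the script, it returns an average of 0.0 instead of dividing by zero when nothing parses.

```python
def parse_amount(cell_value):
    """Convert a cell value such as 1234.5 or '$1,234.50' to a float.

    Returns None when the value cannot be interpreted as a number.
    """
    cleaned = str(cell_value).strip().strip('$').replace(',', '')
    try:
        return float(cleaned)
    except ValueError:
        return None


def total_and_average(cell_values):
    """Sum the parseable values and average over only those values."""
    amounts = [a for a in map(parse_amount, cell_values) if a is not None]
    total = sum(amounts)
    average = total / len(amounts) if amounts else 0.0
    return total, average


print(total_and_average(['$1,000.00', 500, 'n/a']))
```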

8
挚爱酸模 posted on 2016-8-25 23:20:46
#!/usr/bin/env python3
import sys
from xlrd import open_workbook
from xlwt import Workbook

input_file = sys.argv[1]
output_file = sys.argv[2]

output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('jan_2013_output')

# Copy every cell of the 'january_2013' worksheet into the output worksheet
with open_workbook(input_file) as workbook:
    worksheet = workbook.sheet_by_name('january_2013')
    for row_index in range(worksheet.nrows):
        for column_index in range(worksheet.ncols):
            output_worksheet.write(row_index, column_index, worksheet.cell_value(row_index, column_index))
output_workbook.save(output_file)

9
smartlife (employment verified) posted on 2016-8-25 23:25:37

10
bbslover (employment verified) posted on 2016-8-26 00:40:09
thanks for sharing
