Your code has one error.
Line 60 should be:
dict1 = {'secCode':secCode,'secName':secName,'url':url,'title':title,
'publishTime':publishTime}
Also, the way you set the save path is not very flexible, so I modified it a little. Here it is for your reference.
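import random
import time
from pathlib import Path

import requests

# Extract the year (the first run of four digits) from the title string
data_download_pdf['Year'] = data_download_pdf['title'].str.extract('([0-9]{4})')
cwd = Path.cwd()
# file_path = "G:\\深交所年报\\"
# Save into a "深交所年报" (SZSE annual reports) folder under the current working directory
file_path = Path(cwd, '深交所年报')
file_path.mkdir(parents=True, exist_ok=True)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
# Note from the original code: the file path had to end with \\, e.g. to save into an annual-report folder on drive F, first create the folder on F and write the path as F:\\年报\\. With Path this is no longer necessary.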
for each in range(data_download_pdf.shape[0]):
    # each = 1
    # pdf_url = "http://disc.static.szse.cn/download//disc/disk02/finalpage/2019-07-05/dde0ce5e-e2c7-4c09-b6f4-a03ad9d593ee.PDF"
    code = data_download_pdf.at[each, 'secCode']
    # Remove "*" (as in "*ST" names), which is not allowed in Windows file names
    name = data_download_pdf.at[each, 'secName'].replace("*", "")
    year = data_download_pdf.at[each, 'Year']
    print("Start downloading {} (stock code {}), {} annual report".format(name, code, year))
    file_name = "{}{}{}.pdf".format(code, name, year)
    file_full_name = Path(file_path, file_name)
    # file_full_name = 'F:\\1.pdf'
    print(file_full_name)
    pdf_url = data_download_pdf.at[each, 'url']
    rs = requests.get(pdf_url, headers=headers, stream=True)
    # Write the response to disk in 10 KB chunks
    with open(file_full_name, "wb") as fp:
        for chunk in rs.iter_content(chunk_size=10240):
            if chunk:
                fp.write(chunk)
    time.sleep(random.uniform(1, 2))  # throttle the request rate
    print("===================Download complete==========================")
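If it is useful, the file naming logic above could also be pulled out into a small function so it can be checked without downloading anything. This is only a rough sketch: build_report_path is a name I made up, the "unknown" fallback for titles without a year is my own assumption, and it uses the standard re module in place of the pandas str.extract call.

```python
import re
from pathlib import Path


def build_report_path(base_dir, code, name, title):
    """Return the target PDF path for one report (hypothetical helper)."""
    # Take the first run of four digits in the title as the year,
    # mirroring str.extract('([0-9]{4})') in the code above.
    match = re.search(r"[0-9]{4}", title)
    # Assumption: fall back to "unknown" when the title has no year.
    year = match.group(0) if match else "unknown"
    # Remove "*" (as in "*ST" names), which Windows does not allow in file names.
    safe_name = name.replace("*", "")
    return Path(base_dir) / "{}{}{}.pdf".format(code, safe_name, year)


# Example call with made-up values
print(build_report_path("reports", "000001", "*ST Demo", "2018 Annual Report"))
```

Inside the loop you would then call it with file_path, code, the raw secName and the title, instead of building file_name by hand.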
Thanks for sharing your code, I learned a lot from it. Thank you! Happy to exchange ideas more.