Original poster: Lisrelchen

[Python Library] BeautifulScraper


1st floor (OP)
Lisrelchen posted on 2017-7-7 21:51:10

BeautifulScraper

A simple wrapper around BeautifulSoup for HTML parsing and urllib2 for HTTP(S) request/response handling. BeautifulScraper also overrides some of the default handlers in urllib2 in order to:

  • Handle cookies properly
  • Offer full control of included cookies
  • Return the actual response from the server, un-mangled and not reprocessed
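For context, the sketch below shows roughly the plain urllib2 plus cookielib plumbing that BeautifulScraper hides behind a single call; it is an illustration of the underlying standard-library machinery, not code taken from the library itself.

import urllib2
import cookielib

# A CookieJar records Set-Cookie headers from responses and replays them
# on later requests, which is what "handle cookies properly" refers to.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

response = opener.open("https://github.com/adregner/beautifulscraper")
print response.getcode()   # HTTP status as an integer, e.g. 200
print len(jar)             # number of cookies collected so far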
Installation

# pip install beautifulscraper

or

# git clone git://github.com/adregner/beautifulscraper.git
# cd beautifulscraper/
# python setup.py install

Examples

Getting started is brain-dead simple.

>>> from beautifulscraper import BeautifulScraper
>>> scraper = BeautifulScraper()

Start by requesting something.

>>> body = scraper.go("https://github.com/adregner/beautifulscraper")

The response will be a plain BeautifulSoup object. See their documentation for how to use it.

>>> body.select(".repository-meta-content")[0].text
'\n\n            Python web-scraping library that wraps urllib2 and BeautifulSoup\n          \n'
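Since the return value is an ordinary BeautifulSoup object, any of its usual navigation methods apply. A small illustrative sketch (the tags and attributes below are generic examples, not anything specific to this repository page):

>>> links = body.find_all("a")               # every anchor tag on the page
>>> hrefs = [a.get("href") for a in links]   # their link targets
>>> body.title.get_text()                    # the page's <title> text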

The headers from the server's response are accessible.

>>> for header, value in scraper.response_headers.items():
...     print "%s: %s" % (header, value)
...
status: 200 OK
content-length: 36179
set-cookie: _gh_sess=BAh7BzoQX2NzcmZfdG9rZW4iMUNCOWxnbFpVd3EzOENqVk9GTUFXbDlMVUJIbGxsNEVZUFZJNiswRjhwejQ9Og9zZXNzaW9uX2lkIiUyNmQ2ODE5ZDdiZjM3MTA2N2VlZDk3Y2VlMDViYzI2OA%3D%3D--5d31df13d5c0eeb8f3cccb140392124968abc374; path=/; expires=Sat, 01-Jan-2022 00:00:00 GMT; secure; HttpOnly
strict-transport-security: max-age=2592000
connection: close
server: nginx
x-runtime: 98
etag: "1c595b5b6a25eb7f021e68c3476d61da"
cache-control: private, max-age=0, must-revalidate
date: Wed, 31 Oct 2012 02:14:08 GMT
x-frame-options: deny
content-type: text/html; charset=utf-8
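If you only need one header, note that only .items() is demonstrated above, so the safest sketch builds a plain dict from it rather than assuming other mapping methods exist (the value shown matches the headers above):

>>> headers = dict(scraper.response_headers.items())
>>> headers["content-type"]
'text/html; charset=utf-8'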

So is the response code as an integer.

>>> type(scraper.response_code), scraper.response_code
(<type 'int'>, 200)
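Because the status is a plain integer, it is straightforward to guard against error responses before parsing. This sketch uses only the go() method and response_code attribute shown above:

>>> body = scraper.go("https://github.com/adregner/beautifulscraper")
>>> if scraper.response_code != 200:
...     raise RuntimeError("unexpected HTTP status: %d" % scraper.response_code)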

The scraper will keep track of all cookies it sees via the cookielib.CookieJar class. You can read the cookies if you'd like. The Cookie objects are just a collection of properties.

>>> scraper.cookies[0].name
'_gh_sess'
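Each entry is a standard cookielib.Cookie, so attributes such as value, domain, and secure should also be available; those names come from the Python standard library, not from anything BeautifulScraper adds:

>>> for cookie in scraper.cookies:
...     print "%s=%s (domain=%s, secure=%s)" % (
...         cookie.name, cookie.value, cookie.domain, cookie.secure)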

See the pydoc for more information.




2nd floor
MouJack007 posted on 2017-7-7 22:59:18
Thanks for sharing, OP!

3rd floor
MouJack007 posted on 2017-7-7 22:59:46

4th floor
hjtoh posted on 2017-7-8 06:01:23 (from mobile)
Quoting Lisrelchen (2017-7-7 21:51): "BeautifulScraper: Simple wrapper around BeautifulSoup for HTML parsing and urllib2 for HTTP ..."
Very nice!
