Original poster: Lisrelchen

[GitHub] Python Web Scraping - Second Edition


#1 (OP)
Lisrelchen posted on 2017-7-7 21:54:11

Python Web Scraping - Second Edition

Hidden content in this post:

Python-Web-Scraping-Second-Edition-master.zip (61.33 KB)

This is the code repository for Python Web Scraping - Second Edition, published by Packt. It contains all the supporting project files necessary to work through the book from start to finish.

About the Book

The internet contains the most useful set of data ever assembled, largely publicly accessible for free. However, this data is not easily reusable. It is embedded within the structure and style of websites and needs to be carefully extracted. Web scraping is becoming increasingly useful as a means to gather and make sense of the wealth of information available online.

This book is the ultimate guide to using the latest features of Python 3.x to scrape data from websites. In the early chapters, you’ll see how to extract data from static web pages. You’ll learn to use caching with databases and files to save time and manage the load on servers. After covering the basics, you’ll get hands-on practice building a more sophisticated crawler using browsers and concurrent scrapers.

Instructions and Navigation

All of the code is organized into folders. Each folder starts with a number followed by the application name. For example, Chapter02.

The code will look like the following:

from urllib.request import urlopen
from urllib.error import URLError

url = 'http://example.webscraping.com'
try:
    html = urlopen(url).read()
except URLError as e:
    html = None

To help illustrate the crawling examples, we have created a sample website at http://example.webscraping.com. The source code used to generate this website is available at http://bitbucket.org/WebScrapingWithPython/website, which includes instructions on how to host the website yourself if you prefer. We decided to build a custom website for the examples instead of scraping live websites so that we have full control over the environment. This gives us stability: live websites are updated more often than books, and by the time you try a scraping example it may no longer work. A custom website also allows us to craft examples that illustrate specific skills and avoid distractions. Finally, a live website might not appreciate us using it to learn about web scraping and might block our scrapers. Using our own custom website avoids these risks; however, the skills learned in these examples can certainly still be applied to live websites.

Related Products

Suggestions and Feedback

Click here if you have any feedback or suggestions.




#2
Lisrelchen posted on 2017-7-7 21:56:18
import re
import urllib.request
from urllib import robotparser
from urllib.parse import urljoin
from urllib.error import URLError, HTTPError, ContentTooShortError
from chp1.throttle import Throttle


def download(url, user_agent='wswp', num_retries=2, charset='utf-8', proxy=None):
    """ Download a given URL and return the page content
        args:
            url (str): URL
        kwargs:
            user_agent (str): user agent (default: wswp)
            charset (str): charset if website does not include one in headers
            proxy (str): proxy url, ex 'http://IP' (default: None)
            num_retries (int): number of retries if a 5xx error is seen (default: 2)
    """
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        if proxy:
            proxy_support = urllib.request.ProxyHandler({'http': proxy})
            opener = urllib.request.build_opener(proxy_support)
            urllib.request.install_opener(opener)
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors, preserving the other settings
                return download(url, user_agent=user_agent, num_retries=num_retries - 1,
                                charset=charset, proxy=proxy)
    return html


def get_robots_parser(robots_url):
    " Return the robots parser object using the robots_url "
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp


def get_links(html):
    " Return a list of links (using simple regex matching) from the html content "
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp',
                 proxy=None, delay=3, max_depth=4):
    """ Crawl from the given start URL following links matched by link_regex. In the current
        implementation, we do not actually scrape any information.

        args:
            start_url (str): web site to start crawl
            link_regex (str): regex to match for links
        kwargs:
            robots_url (str): url of the site's robots.txt (default: start_url + /robots.txt)
            user_agent (str): user agent (default: wswp)
            proxy (str): proxy url, ex 'http://IP' (default: None)
            delay (int): seconds to throttle between requests to one domain (default: 3)
            max_depth (int): maximum crawl depth (to avoid traps) (default: 4)
    """
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before, and at what depth
    seen = {}
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)
    throttle = Throttle(delay)
    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            depth = seen.get(url, 0)
            if depth == max_depth:
                print('Skipping %s due to depth' % url)
                continue
            throttle.wait(url)
            html = download(url, user_agent=user_agent, proxy=proxy)
            if not html:
                continue
            # TODO: add actual data scraping here
            # filter for links matching our regular expression
            for link in get_links(html):
                if re.match(link_regex, link):
                    abs_link = urljoin(start_url, link)
                    if abs_link not in seen:
                        seen[abs_link] = depth + 1
                        crawl_queue.append(abs_link)
        else:
            print('Blocked by robots.txt:', url)
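
A minimal usage sketch of the crawler above, assuming the chp1 package layout used elsewhere in this thread; the link pattern passed as link_regex is only an illustrative assumption, not part of the original post:

from chp1.advanced_link_crawler import link_crawler

# follow only listing and detail pages of the sample site (pattern is an assumption)
link_crawler('http://example.webscraping.com', '/(index|view)/',
             user_agent='wswp', delay=3, max_depth=2)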

#3
Lisrelchen posted on 2017-7-7 21:57:54
from bs4 import BeautifulSoup
from chp1.advanced_link_crawler import download

url = 'http://example.webscraping.com/view/UnitedKingdom-239'
html = download(url)
soup = BeautifulSoup(html, 'html5lib')

# locate the area row
tr = soup.find(attrs={'id': 'places_area__row'})
td = tr.find(attrs={'class': 'w2p_fw'})  # locate the data
area = td.text  # extract the data
print(area)
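
The same cell can also be located with BeautifulSoup's CSS-selector API; a brief sketch built from the id and class used in the snippet above (requires bs4 >= 4.4 for select_one):

# equivalent lookup via a CSS selector
td = soup.select_one('tr#places_area__row > td.w2p_fw')
print(td.text)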

#4
Lisrelchen posted on 2017-7-7 21:58:52
from bs4 import BeautifulSoup
from chp1.advanced_link_crawler import download

broken_html = '<ul class=country><li>Area<li>Population</ul>'

soup = BeautifulSoup(broken_html, 'html.parser')
fixed_html = soup.prettify()
print(fixed_html)

# still broken, so try a different parser

soup = BeautifulSoup(broken_html, 'html5lib')
fixed_html = soup.prettify()
print(fixed_html)

# now we can try and extract the data from the html

ul = soup.find('ul', attrs={'class': 'country'})
print(ul.find('li'))  # returns just the first match
print(ul.find_all('li'))  # returns all matches
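
If lxml is installed, BeautifulSoup can also use it as the tree builder, which likewise repairs the missing closing tags; a brief sketch, assuming the lxml package is available:

# the same repair using the faster lxml tree builder
soup = BeautifulSoup(broken_html, 'lxml')
print(soup.prettify())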

#5
Lisrelchen posted on 2017-7-7 21:59:47
from lxml.html import fromstring
from chp1.advanced_link_crawler import download

url = 'http://example.webscraping.com/view/UnitedKingdom-239'
html = download(url)

tree = fromstring(html)
td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
area = td.text_content()
print(area)
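
Note that lxml's cssselect() relies on the separate cssselect package (pip install cssselect). A small sketch that reuses the tree above to pull several fields; the extra field names are assumptions for illustration, since only places_area__row appears in the snippet:

# build the selector per row id; skip fields missing from the page
for field in ('area', 'population'):
    cells = tree.cssselect('tr#places_{}__row > td.w2p_fw'.format(field))
    if cells:
        print(field, cells[0].text_content())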

#6
Lisrelchen posted on 2017-7-7 22:00:57
from lxml.html import fromstring
from chp1.advanced_link_crawler import download

url = 'http://example.webscraping.com/view/UnitedKingdom-239'
html = download(url)

tree = fromstring(html)
area = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]/text()')[0]
print(area)
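
For comparison with the cssselect version in #5, the same element can be selected directly with XPath, which needs no extra dependency; a minimal sketch:

# select the element itself rather than its text node, then reuse text_content()
td = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0]
print(td.text_content())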

#7
Rootcn posted on 2017-7-7 22:35:45
Thanks for sharing, OP.

#8
heiyaodai posted on 2017-7-7 22:43:44
Thanks for sharing.

#9
MouJack007 posted on 2017-7-7 23:00:16
Thanks for sharing, OP!

#10
MouJack007 posted on 2017-7-7 23:00:36
