Original poster: Lisrelchen

Introduction to Web and Data Scraping Tutorial

Original post
Lisrelchen posted on 2017-7-7 21:40:05

https://github.com/kjam/python-web-scraping-tutorial


A tutorial-based introduction to web scraping with Python.


Virtual Env

If you'd like to use virtual environments, please follow the instructions below. They are not required for the tutorial but may be helpful.

For more details on virtual environments

If you don't have virtual env wrapper and/or pip:

$ easy_install pip
$ pip install virtualenvwrapper

and read the additional instructions here

$ mkvirtualenv scraper_tutorial
$ pip install -r requirements.txt

LXML and Selenium

You will need both LXML and Selenium to follow this tutorial in its entirety.
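A quick way to confirm that the requirements are in place is to check whether the modules can be imported. The sketch below is only an illustration: the helper name find_missing and the module list are assumptions (bs4 is included because the code later in this thread imports it), not part of the original tutorial.

```python
import importlib.util

# Import names (not pip package names) assumed to be needed for this tutorial.
REQUIRED_MODULES = ['lxml', 'selenium', 'bs4']


def find_missing(modules):
    """Return the module names that cannot be found on the import path."""
    return [name for name in modules
            if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules: %s" % ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported missing, installing from requirements.txt inside the virtual environment should normally resolve it.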

If you are using a Mac, I would highly recommend using Homebrew. It will help make pip install very easy for you to use.

If you are using Windows, it might be worth it to run this within a Linux Virtual Machine. If you are a Windows + Python guru, please follow these installation instructions. I can help as needed but I have not programmed on Windows in more than 5 years.

Please reach out to me if you have any questions on getting the initial requirements set up. Thanks!

Firefox Web Browser

Firefox comes as the default web driver for Selenium. To use Selenium easily, please download and install Firefox.

Using PIP

If you have never used pip before, you will need to run sudo easy_install pip or, with Homebrew, brew install python (which bundles pip). pip is a Python package manager, and I highly recommend using it.

Questions?

/msg kjam on freenode or @kjam on twitter



2
Lisrelchen posted on 2017-7-7 21:41:13
from bs4 import BeautifulSoup

# Read the simple page source, closing the file when done.
with open('data/simple.html') as f:
    simple_page = f.read()

# Parse it with BeautifulSoup so we can walk the family tree.
# Naming the parser explicitly avoids the "no parser specified" warning.
simple_soup = BeautifulSoup(simple_page, 'html.parser')

# Pick the second 'navblock' div as the current element to play around with.
current_elem = simple_soup.find_all('div', {'class': 'navblock'})[1]
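To make "iterating over the family tree" concrete, here is a minimal sketch that uses an inline HTML string instead of data/simple.html. The sample markup, its id and href values are made up for illustration; only the 'navblock' class comes from the snippet above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for data/simple.html.
SAMPLE_HTML = """
<html><body>
<div id="nav">
<div class="navblock"><a href="/home">Home</a></div>
<div class="navblock"><a href="/about">About</a><a href="/team">Team</a></div>
</div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
current_elem = soup.find_all('div', {'class': 'navblock'})[1]

# Up the tree: the enclosing element.
parent_id = current_elem.parent.get('id')

# Down the tree: every link inside the current element.
child_links = [a.get('href') for a in current_elem.find_all('a')]

# Sideways: the closest preceding sibling that is a div.
previous_block = current_elem.find_previous_sibling('div')
previous_text = previous_block.get_text(strip=True)

print(parent_id, child_links, previous_text)
```

find_previous_sibling is given a tag name here because the whitespace between tags is also a sibling node, so the plain previous_sibling attribute may return a text node rather than the neighbouring div.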

3
Lisrelchen posted on 2017-7-7 21:41:42
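import json
from urllib.request import urlopen  # Python 3; Python 2 used `from urllib import urlopen`

# Ask the geolocation service about our own IP address.
# (URL as given in the original post; the service may no longer be available.)
ip_info = urlopen('http://freegeoip.net/json/').read()

# Decode the response bytes and parse the JSON into a dict.
my_ip = json.loads(ip_info.decode('utf-8'))

print("I think you're at: %f lat, %f long and in %s" % (
    my_ip.get('latitude'), my_ip.get('longitude'), my_ip.get('city')))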
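The parsing step can be tried without any network access. The sketch below runs the same json.loads and string formatting against a made-up payload; SAMPLE_RESPONSE and describe_location are hypothetical names, and the payload only assumes the three fields the snippet above reads. It also guards against missing coordinates, which would otherwise make the %f format raise a TypeError.

```python
import json

# Hypothetical payload with the fields the snippet above reads.
SAMPLE_RESPONSE = (
    '{"ip": "203.0.113.7", "city": "Berlin", '
    '"latitude": 52.52, "longitude": 13.405}'
)


def describe_location(raw_json):
    """Parse a geolocation JSON string and build a readable message."""
    info = json.loads(raw_json)
    lat = info.get('latitude')
    lon = info.get('longitude')
    city = info.get('city') or 'an unknown city'
    # %f cannot format None, so handle missing coordinates explicitly.
    if lat is None or lon is None:
        return "I could not work out your coordinates (city: %s)" % city
    return "I think you're at: %f lat, %f long and in %s" % (lat, lon, city)


message = describe_location(SAMPLE_RESPONSE)
print(message)
```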

4
Lisrelchen posted on 2017-7-7 21:42:27
from time import sleep

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Fill in your own account details before running.
MY_EMAIL = ''
MY_PASSWORD = ''
MY_PROFILE_NAME = ''

# Sign in and choose a profile.
# The selectors reflect the Netflix pages at the time of posting and may need updating.
browser = webdriver.Firefox()
browser.get('http://netflix.com')
browser.find_element(By.LINK_TEXT, 'Sign In').click()
email = browser.find_element(By.CSS_SELECTOR, 'input#email')
email.send_keys(MY_EMAIL)
pw = browser.find_element(By.CSS_SELECTOR, 'input#password')
pw.send_keys(MY_PASSWORD, Keys.RETURN)
browser.implicitly_wait(10)  # seconds
browser.find_element(By.LINK_TEXT, MY_PROFILE_NAME).click()
browser.maximize_window()
rows = browser.find_elements(By.CSS_SELECTOR, 'div.mrow')

# Locate the 'Top Picks' row; stop early if the page has none.
top_pix = None
for r in rows:
    if 'Top Picks' in r.text:
        top_pix = r
        break
if top_pix is None:
    raise RuntimeError("No 'Top Picks' row found")

movie_recs = top_pix.find_elements(
    By.CSS_SELECTOR, 'div.agMovieSet div.agMovie')

# Move the mouse down to the first recommendation.
first_movie = movie_recs[0].location
scroll_down = ActionChains(browser).move_by_offset(
    10, first_movie.get('y') - 10)
scroll_down.perform()

movie_dict = {}

for movie in movie_recs:
    movie_link = movie.find_element(By.CSS_SELECTOR, 'a.bobbable')
    # Look up the slider arrow before the try block so the except branch can use it.
    arrow = top_pix.find_element(By.CSS_SELECTOR, 'div.next.sliderButton')
    try:
        # If the movie sits close to the arrow, advance the slider first.
        if arrow.location.get('x') - movie_link.location.get('x') < 80:
            hover = ActionChains(browser).move_to_element(arrow)
            hover.perform()
            sleep(4)
        hover = ActionChains(browser).move_to_element(movie_link)
        hover.perform()
    except Exception as e:
        print(e)
        # Hover failed: advance the slider, move off the arrow, then retry.
        hover = ActionChains(browser).move_to_element(arrow)
        hover.perform()
        sleep(5)
        move_off_arrow = ActionChains(browser).move_by_offset(-450, -130)
        move_off_arrow.perform()
        hover = ActionChains(browser).move_to_element(movie_link)
        hover.perform()
    try:
        # Wait for the pop-up info card, then read its fields.
        movie_info = WebDriverWait(browser, 10).until(
            EC.element_to_be_clickable((By.ID, 'BobMovie')))
        title = movie_info.find_element(By.CLASS_NAME, 'title').text
        link = movie_info.find_element(
            By.CLASS_NAME, 'mdpLink').get_attribute('href')
        desc = movie_info.find_element(
            By.CLASS_NAME, 'bobMovieContent').text.split('\n')[0]
        cast = movie_info.find_element(By.TAG_NAME, 'dd').text
        movie_dict[title] = {'link': link, 'title': title,
                             'desc': desc, 'cast': cast}
    except Exception:
        print("taking too long!")
    # Move the mouse away so the pop-up closes before the next movie.
    scroll_off = ActionChains(browser).move_by_offset(30, -130)
    scroll_off.perform()
    sleep(2)

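5
crossbone254 posted on 2017-7-7 21:43:52
Introduction to Web and Data Scraping Tutorial

6
西门高 posted on 2017-7-7 21:45:39
Thanks for sharing.

7
auirzxp (verified student) posted on 2017-7-7 22:07:57
Notice: the author has been banned or deleted; the content has been hidden automatically.

8
pingguagain posted on 2017-7-7 22:12:53
Taking a look.

9
MouJack007 posted on 2017-7-7 23:03:10
Thanks to the original poster for sharing!

10
MouJack007 posted on 2017-7-7 23:07:00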
