Web Scraping with Python
Table of Contents
Preface ix
Part I. Building Scrapers
1. Your First Web Scraper 3
2. Advanced HTML Parsing 15
3. Writing Web Crawlers 33
4. Web Crawling Models 49
5. Scrapy 67
6. Storing Data 83
Part II. Advanced Scraping
7. Reading Documents 107
Document Encoding 107
Text 108
Text Encoding and the Global Internet 109
CSV 113
Reading CSV Files 113
PDF 115
Microsoft Word and .docx 117
8. Cleaning Your Dirty Data 121
Cleaning in Code 121
Data Normalization 124
Cleaning After the Fact 126
OpenRefine 126
9. Reading and Writing Natural Languages 131
Summarizing Data 132
Markov Models 135
Six Degrees of Wikipedia: Conclusion 139
Natural Language Toolkit 142
Installation and Setup 142
Statistical Analysis with NLTK 143
Lexicographical Analysis with NLTK 145
Additional Resources 149
10. Crawling Through Forms and Logins 151
Python Requests Library 151
Submitting a Basic Form 152
Radio Buttons, Checkboxes, and Other Inputs 154
Submitting Files and Images 155
Handling Logins and Cookies 156
HTTP Basic Access Authentication 157
Other Form Problems 158
11. Scraping JavaScript 161
A Brief Introduction to JavaScript 162
Common JavaScript Libraries 163
Ajax and Dynamic HTML 165
Executing JavaScript in Python with Selenium 166
Additional Selenium Webdrivers 171
Handling Redirects 171
A Final Note on JavaScript 173
12. Crawling Through APIs 175
A Brief Introduction to APIs 175
HTTP Methods and APIs 177
More About API Responses 178
Parsing JSON 179
Undocumented APIs 181
Finding Undocumented APIs 182
Documenting Undocumented APIs 184
Finding and Documenting APIs Automatically 184
Combining APIs with Other Data Sources 187
More About APIs 190
13. Image Processing and Text Recognition 193
Overview of Libraries 194
Pillow 194
Tesseract 195
NumPy 197
Processing Well-Formatted Text 197
Adjusting Images Automatically 200
Scraping Text from Images on Websites 203
Reading CAPTCHAs and Training Tesseract 206
Training Tesseract 207
Retrieving CAPTCHAs and Submitting Solutions 211
14. Avoiding Scraping Traps 215
A Note on Ethics 215
Looking Like a Human 216
Adjust Your Headers 217
Handling Cookies with JavaScript 218
Timing Is Everything 220
Common Form Security Features 221
Hidden Input Field Values 221
Avoiding Honeypots 223
The Human Checklist 224
15. Testing Your Website with Scrapers 227
An Introduction to Testing 227
What Are Unit Tests? 228
Python unittest 228
Testing Wikipedia 230
Testing with Selenium 233
Interacting with the Site 233
unittest or Selenium? 236
16. Web Crawling in Parallel 239
Processes versus Threads 239
Multithreaded Crawling 240
Race Conditions and Queues 242
The threading Module 245
Multiprocess Crawling 247
Multiprocess Crawling 249
Communicating Between Processes 251
Multiprocess Crawling—Another Approach 253
17. Scraping Remotely 255
Why Use Remote Servers? 255
Avoiding IP Address Blocking 256
Portability and Extensibility 257
Tor 257
PySocks 259
Remote Hosting 259
Running from a Website-Hosting Account 260
Running from the Cloud 261
Additional Resources 262
18. The Legalities and Ethics of Web Scraping 263
Trademarks, Copyrights, Patents, Oh My! 263
Copyright Law 264
Trespass to Chattels 266
The Computer Fraud and Abuse Act 268
robots.txt and Terms of Service 269
Three Web Scrapers 272
eBay versus Bidder’s Edge and Trespass to Chattels 272
United States v. Auernheimer and The Computer Fraud and Abuse Act 274
Field v. Google: Copyright and robots.txt 275
Moving Forward 276
Index 279