eBook - ePub

Python Web Scraping - Second Edition

Name: Python Web Scraping - Second Edition
Author: Katharine Jarmul, Richard Lawson

Katharine Jarmul, Richard Lawson

Share book

220 pages
English
ePUB (mobile friendly)
Available on iOS & Android

eBook - ePub

Python Web Scraping - Second Edition

Katharine Jarmul, Richard Lawson

Book details

Book preview

Table of contents

Citations

About This Book

Successfully scrape data from any website with the power of Python 3.xAbout This Book• A hands-on guide to web scraping using Python with solutions to real-world problems• Create a number of different web scrapers in Python to extract information• This book includes practical examples on using the popular and well-maintained libraries in Python for your web scraping needsWho This Book Is ForThis book is aimed at developers who want to use web scraping for legitimate purposes. Prior programming experience with Python would be useful but not essential. Anyone with general knowledge of programming languages should be able to pick up the book and understand the principals involved.What You Will Learn• Extract data from web pages with simple Python programming• Build a concurrent crawler to process web pages in parallel• Follow links to crawl a website• Extract features from the HTML• Cache downloaded HTML for reuse• Compare concurrent models to determine the fastest crawler• Find out how to parse JavaScript-dependent websites• Interact with forms and sessionsIn DetailThe Internet contains the most useful set of data ever assembled, most of which is publicly accessible for free. However, this data is not easily usable. It is embedded within the structure and style of websites and needs to be carefully extracted. Web scraping is becoming increasingly useful as a means to gather and make sense of the wealth of information available online.This book is the ultimate guide to using the latest features of Python 3.x to scrape data from websites. In the early chapters, you'll see how to extract data from static web pages. You'll learn to use caching with databases and files to save time and manage the load on servers. After covering the basics, you'll get hands-on practice building a more sophisticated crawler using browsers, crawlers, and concurrent scrapers.You'll determine when and how to scrape data from a JavaScript-dependent website using PyQt and Selenium. You'll get a better understanding of how to submit forms on complex websites protected by CAPTCHA. You'll find out how to automate these actions with Python packages such as mechanize. You'll also learn how to create class-based scrapers with Scrapy libraries and implement your learning on real websites.By the end of the book, you will have explored testing websites with scrapers, remote scraping, best practices, working with images, and many other relevant topics.Style and approachThis hands-on guide is full of real-life examples and solutions starting simple and then progressively becoming more complex. Each chapter in this book introduces a problem and then provides one or more possible solutions.

Frequently asked questions

How do I cancel my subscription?

Simply head over to the account section in settings and click on “Cancel Subscription” - it’s as simple as that. After you cancel, your membership will stay active for the remainder of the time you’ve paid for. Learn more here.

Can/how do I download books?

At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.

What is the difference between the pricing plans?

Both plans give you full access to the library and all of Perlego’s features. The only differences are the price and subscription period: With the annual plan you’ll save around 30% compared to 12 months on the monthly plan.

What is Perlego?

We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, we’ve got you covered! Learn more here.

Do you support text-to-speech?

Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.

Is Python Web Scraping - Second Edition an online PDF/ePUB?

Yes, you can access Python Web Scraping - Second Edition by Katharine Jarmul, Richard Lawson in PDF and/or ePUB format, as well as other popular books in Informatique & Développement Web. We have over one million books available in our catalogue for you to explore.

Information

Publisher

Packt Publishing

Year

2017

ISBN

9781786464293

Edition

Topic

Informatique

Subtopic

Développement Web

Introduction to Web Scraping

Welcome to the wide world of web scraping! Web scraping is used by many fields to collect data not easily available in other formats. You could be a journalist, working on a new story, or a data scientist extracting a new dataset. Web scraping is a useful tool even for just a casual programmer, if you need to check your latest homework assignments on your university page and have them emailed to you. Whatever your motivation, we hope you are ready to learn!

In this chapter, we will cover the following topics:

Introducing the field of web scraping
Explaining the legal challenges
Explaining Python 3 setup
Performing background research on our target website
Progressively building our own advanced web crawler
Using non-standard libraries to help scrape the Web

When is web scraping useful?

Suppose I have a shop selling shoes and want to keep track of my competitor's prices. I could go to my competitor's website each day and compare each shoe's price with my own; however this will take a lot of time and will not scale well if I sell thousands of shoes or need to check price changes frequently. Or maybe I just want to buy a shoe when it's on sale. I could come back and check the shoe website each day until I get lucky, but the shoe I want might not be on sale for months. These repetitive manual processes could instead be replaced with an automated solution using the web scraping techniques covered in this book.

In an ideal world, web scraping wouldn't be necessary and each website would provide an API to share data in a structured format. Indeed, some websites do provide APIs, but they typically restrict the data that is available and how frequently it can be accessed. Additionally, a website developer might change, remove, or restrict the backend API. In short, we cannot rely on APIs to access the online data we may want. Therefore we need to learn about web scraping techniques.

Is web scraping legal?

Web scraping, and what is legally permissible when web scraping, are still being established despite numerous rulings over the past two decades. If the scraped data is being used for personal and private use, and within fair use of copyright laws, there is usually no problem. However, if the data is going to be republished, if the scraping is aggressive enough to take down the site, or if the content is copyrighted and the scraper violates the terms of service, then there are several legal precedents to note.

In Feist Publications, Inc. v. Rural Telephone Service Co., the United States Supreme Court decided scraping and republishing facts, such as telephone listings, are allowed. A similar case in Australia, Telstra Corporation Limited v. Phone Directories Company Pty Ltd, demonstrated that only data with an identifiable author can be copyrighted. Another scraped content case in the United States, evaluating the reuse of Associated Press stories for an aggregated news product, was ruled a violation of copyright in Associated Press v. Meltwater. A European Union case in Denmark, ofir.dk vs home.dk, concluded that regular crawling and deep linking is permissible.

There have also been several cases in which companies have charged the plaintiff with aggressive scraping and attempted to stop the scraping via a legal order. The most recent case, QVC v. Resultly, ruled that, unless the scraping resulted in private property damage, it could not be considered intentional harm, despite the crawler activity leading to some site stability issues.

These cases suggest that, when the scraped data constitutes public facts (such as business locations and telephone listings), it can be republished following fair use rules. However, if the data is original (such as opinions and reviews or private user data), it most likely cannot be republished for copyright reasons. In any case, when you are scraping data from a website, remember you are their guest and need to behave politely; otherwise, they may ban your IP address or proceed with legal action. This means you should make download requests at a reasonable rate and define a user agent to identify your crawler. You should also take measures to review the Terms of Service of the site and ensure the data you are taking is not considered private or copyrighted.

If you have doubts or questions, it may be worthwhile to consult a media lawyer regarding the precedents in your area of residence.

You can read more about these legal cases at the following sites:

Feist Publications Inc. v. Rural Telephone Service Co.
(http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=US&vol=499&invol=340)
Telstra Corporation Limited v. Phone Directories Company Pvt Ltd
(http://www.austlii.edu.au/au/cases/cth/FCA/2010/44.html)

Associated Press v.Meltwater
(http://www.nysd.uscourts.gov/cases/show.php?db=special&id=279)

ofir.dk vs home.dk
(http://www.bvhd.dk/uploads/tx_mocarticles/S_-_og_Handelsrettens_afg_relse_i_Ofir-sagen.pdf)

QVC v. Resultly
(https://www.paed.uscourts.gov/documents/opinions/16D0129P.pdf)

Python 3

Throughout this second edition of Web Scraping with Python, we will use Python 3. The Python Software Foundation has announced Python 2 will be phased out of development and support in 2020; for this reason, we and many other Pythonistas aim to move development to the support of Python 3, which at the time of this publication is at version 3.6. This book is complaint with Python 3.4+.

If you are familiar with using Python Virtual Environments or Anaconda, you likely already know how to set up Python 3 in a new environment. If you'd like to install Python 3 globally, we recommend searching for your operating system-specific documentation. For my part, I simply use Virtual Environment Wrapper (https://virtualenvwrapper.readthedocs.io/en/latest/) to easily maintain many different environments for different projects and versions of Python. Using either Conda environments or virtual environments is highly recommended, so that you can easily change dependencies based on your project needs without affecting other work you are doing. For beginners, I recommend using Conda as it requires less setup. The Conda introductory documentation (https://conda.io/docs/intro.html) is a good place to start!

From this point forward, all code and commands will assume you have Python 3 properly installed and are working with a Python 3.4+ environment. If you see Import or Syntax errors, please check that you are in the proper environment and look for pesky Python 2.7 file paths in your Traceback.

Background research

Before diving into crawling a website, we should develop an understanding about the scale and structure of our target website. The website itself can help us via the robots.txt and Sitemap files, and there are also external tools available to provide further details such as Google Search and WHOIS.

Checking robots.txt

Most websites define a robots.txt file to let crawlers know of any restrictions when crawling their website. These restrictions are just a suggestion but good web citizens will follow them. The robots.txt file is a valuable resource to check before crawling to minimize the chance of being blocked, and to discover clues about the website's structure. More information about the robots.txt protocol is available at http://www.robotstxt.org. The following code is the content of our example robots.txt, which is available at http://example.webscraping.com/robots.txt:

 # section 1 
User-agent: BadCrawler 
Disallow: / 

# section 2 
User-agent: * 
Crawl-delay: 5 
Disallow: /trap 

# section 3 
Sitemap: http://example.webscraping.com/sitemap.xml

In section 1, the robots.txt file asks a crawler with user agent BadCrawler not to crawl their website, but this is unlikely to help because a malicious crawler would not respect robots.txt anyway. A later example in this chapter will show you how to make your crawler follow robots.txt automatically.

Section 2 specifies a crawl delay of 5 seconds between download requ...