eBook - ePub

Hands-On Web Scraping with Python

Name: Hands-On Web Scraping with Python
Author: Anish Chapagain

Perform advanced scraping operations using various Python libraries and tools such as Selenium, Regex, and others

Anish Chapagain

Partager le livre

350 pages
English
ePUB (adapté aux mobiles)
Disponible sur iOS et Android

eBook - ePub

Hands-On Web Scraping with Python

Perform advanced scraping operations using various Python libraries and tools such as Selenium, Regex, and others

Anish Chapagain

Détails du livre

Aperçu du livre

Table des matières

Citations

À propos de ce livre

Collect and scrape different complexities of data from the modern Web using the latest tools, best practices, and techniques

Key Features

Learn different scraping techniques using a range of Python libraries such as Scrapy and Beautiful Soup
Build scrapers and crawlers to extract relevant information from the web
Automate web scraping operations to bridge the accuracy gap and manage complex business needs

Book Description

Web scraping is an essential technique used in many organizations to gather valuable data from web pages. This book will enable you to delve into web scraping techniques and methodologies.The book will introduce you to the fundamental concepts of web scraping techniques and how they can be applied to multiple sets of web pages. You'll use powerful libraries from the Python ecosystem such as Scrapy, lxml, pyquery, and bs4 to carry out web scraping operations. You will then get up to speed with simple to intermediate scraping operations such as identifying information from web pages and using patterns or attributes to retrieve information. This book adopts a practical approach to web scraping concepts and tools, guiding you through a series of use cases and showing you how to use the best tools and techniques to efficiently scrape web pages. You'll even cover the use of other popular web scraping tools, such as Selenium, Regex, and web-based APIs.By the end of this book, you will have learned how to efficiently scrape the web using different techniques with Python and other popular tools.

What you will learn

Analyze data and information from web pages
Learn how to use browser-based developer tools from the scraping perspective
Use XPath and CSS selectors to identify and explore markup elements
Learn to handle and manage cookies
Explore advanced concepts in handling HTML forms and processing logins
Optimize web securities, data storage, and API use to scrape data
Use Regex with Python to extract data
Deal with complex web entities by using Selenium to find and extract data

Who this book is for

This book is for Python programmers, data analysts, web scraping newbies, and anyone who wants to learn how to perform web scraping from scratch. If you want to begin your journey in applying web scraping techniques to a range of web pages, then this book is what you need! A working knowledge of the Python programming language is expected.

]]>

Foire aux questions

Comment puis-je résilier mon abonnement ?

Il vous suffit de vous rendre dans la section compte dans paramètres et de cliquer sur « Résilier l’abonnement ». C’est aussi simple que cela ! Une fois que vous aurez résilié votre abonnement, il restera actif pour le reste de la période pour laquelle vous avez payé. Découvrez-en plus ici.

Puis-je / comment puis-je télécharger des livres ?

Pour le moment, tous nos livres en format ePub adaptés aux mobiles peuvent être téléchargés via l’application. La plupart de nos PDF sont également disponibles en téléchargement et les autres seront téléchargeables très prochainement. Découvrez-en plus ici.

Quelle est la différence entre les formules tarifaires ?

Les deux abonnements vous donnent un accès complet à la bibliothèque et à toutes les fonctionnalités de Perlego. Les seules différences sont les tarifs ainsi que la période d’abonnement : avec l’abonnement annuel, vous économiserez environ 30 % par rapport à 12 mois d’abonnement mensuel.

Qu’est-ce que Perlego ?

Nous sommes un service d’abonnement à des ouvrages universitaires en ligne, où vous pouvez accéder à toute une bibliothèque pour un prix inférieur à celui d’un seul livre par mois. Avec plus d’un million de livres sur plus de 1 000 sujets, nous avons ce qu’il vous faut ! Découvrez-en plus ici.

Prenez-vous en charge la synthèse vocale ?

Recherchez le symbole Écouter sur votre prochain livre pour voir si vous pouvez l’écouter. L’outil Écouter lit le texte à haute voix pour vous, en surlignant le passage qui est en cours de lecture. Vous pouvez le mettre sur pause, l’accélérer ou le ralentir. Découvrez-en plus ici.

Est-ce que Hands-On Web Scraping with Python est un PDF/ePUB en ligne ?

Oui, vous pouvez accéder à Hands-On Web Scraping with Python par Anish Chapagain en format PDF et/ou ePUB ainsi qu’à d’autres livres populaires dans Informatique et Traitement des données. Nous disposons de plus d’un million d’ouvrages à découvrir dans notre catalogue.

Informations

Éditeur

Packt Publishing

Année

2019

ISBN

9781789536195

Sujet

Informatique

Sous-sujet

Traitement des données

Section 1: Introduction to Web Scraping

In this section, you will be given an overview of web scraping (scraping requirements, the importance of data), web contents (patterns and layouts), Python programming and libraries (the basics and advanced), and data managing techniques (file handling and databases).

This section consists of the following chapter:

Chapter 1, Web Scraping Fundamentals

Web Scraping Fundamentals

In this chapter, we will learn about and explore certain fundamental concepts related to web scraping and web-based technologies, assuming that you have no prior experience of web scraping.

So, to start with, let's begin by asking a number of questions:

Why is there a growing need or demand for data?
How are we going to manage and fulfill the requirement for data with resources from the World Wide Web (WWW)?

Web scraping addresses both these questions, as it provides various tools and technologies that can be deployed to extract data or assist with information retrieval. Whether its web-based structured or unstructured data, we can use the web scraping process to extract data and use it for research, analysis, personal collections, information extraction, knowledge discovery, and many more purposes.

We will learn general techniques that are deployed to find data from the web and explore those techniques in depth using the Python programming language in the chapters ahead.

In this chapter, we will cover the following topics:

Introduction to web scraping
Understanding web development and technologies
Data finding techniques

Introduction to web scraping

Scraping is the process of extracting, copying, screening, or collecting data. Scraping or extracting data from the web (commonly known as websites or web pages, or internet-related resources) is normally termed web scraping.

Web scraping is a process of data extraction from the web that is suitable for certain requirements. Data collection and analysis, and its involvement in information and decision making, plus research-related activities, make the scraping process sensitive for all types of industry.

The popularity of the internet and its resources is causing information domains to evolve every day, which is also causing a growing demand for raw data. Data is the basic requirement in the fields of science, technology, and management. Collected or organized data is processed with varying degrees of logic to obtain information and gain further insights.

Web scraping provides the tools and techniques used to collect data from websites as appropriate for either personal or business-related needs, but with a number of legal considerations.

There are a number of legal factors to consider before performing scraping tasks. Most websites contain pages such as Privacy Policy, About Us, and Terms and Conditions, where legal terms, prohibited content policies, and general information are available. It's a developer's ethical duty to follow those policies before planning any crawling and scraping activities from websites.

Scraping and crawling are both used quite interchangeably throughout the chapters in this book. Crawling, also known as spidering, is a process used to browse through the links on websites and is often used by search engines for indexing purposes, whereas scraping is mostly related to content extraction from websites.

Understanding web development and technologies

A web page is not only a document container. Today's rapid developments in computing and web technologies have transformed the web into a dynamic and real-time source of information.

At our end, we (the users) use web browsers (such as Google Chrome, Firefox Mozilla, Internet Explorer, and Safari) to access information from the web. Web browsers provide various document-based functionalities to users and contain application-level features that are often useful to web developers.

Web pages that users view or explore through their browsers are not only single documents. Various technologies exist that can be used to develop websites or web pages. A web page is a document that contains blocks of HTML tags. Most of the time, it is built with various sub-blocks linked as dependent or independent components from various interlinked technologies, including JavaScript and CSS.

An understanding of the general concepts of web pages and the techniques of web development, along with the technologies found inside web pages, will provide more flexibility and control in the scraping process. A lot of the time, a developer can also employ reverse engineering techniques.

Reverse engineering is an activity that involves breaking down and examining the concepts that were required to build certain products. For more information on reverse engineering, please refer to the GlobalSpec article, How Does Reverse Engineering Work?, available at https://insights.globalspec.com/article/7367/how-does-reverse-engineering-work.

Here, we will introduce and explore a few of the techniques that can help and guide us in the process of data extraction.

HTTP

Hyper Text Transfer Protocol (HTTP) is an application protocol that transfers resources such as HTML documents between a client and a web server. HTTP is a stateless protocol that follows the client-server model. Clients (web browsers) and web servers communicate or exchange information using HTTP Requests and HTTP Responses:

HTTP (client-server communication)

With HTTP requests or HTTP methods, a client or browser submits requests to the server. There are various methods (also known as HTTP request methods) for submitting requests, such as GET, POST, and PUT:

GET: This is a common method for requesting information. It is considered a safe method, as the resource state is not altered. Also, it is used to provide query strings such as http://www.test-domain.com/, requesting information from servers based on the id and display parameters sent with the request.
POST: This is used to make a secure request to a server. The requested resource state can be altered. Data posted or sent to the requested URL is not visible in the URL, but rather transferred with the request body. It's used...