Introduction to Python web scraping

Learn Web scraping using Python with Examples

What is Web Scraping?

Web scraping, or web data extraction, is the process of extracting data from websites over HTTP (HyperText Transfer Protocol) or through a web browser. Web scraping is an automated process: it extracts the web data, and a data analyst or data scientist can then parse the gathered data and distill it into a set of important, high-quality data.

Web scraping in Python can be done using the BeautifulSoup library.

Web Scraping with BeautifulSoup


BeautifulSoup is a Python library which converts incoming documents to Unicode and outgoing documents to UTF-8. BeautifulSoup parses everything you pass to it and builds a parse tree on its own.

Basics while performing Web Scraping

While performing web scraping, we work with HTML tags and use them to fetch the data. Thus, one should have a good understanding of HTML.

HTML code starts with the <html> tag and ends with the </html> tag.

The visible part of an HTML document sits between the <body> and </body> tags.

To make raw HTML data more readable, we need to parse it using one of the available parsers, e.g.:

    1. html5lib
    2. lxml (an XML parser)
    3. html.parser (Python’s built-in HTML parser)
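As a quick illustration, the choice of parser matters mostly for speed and for how tolerant it is of broken markup. A minimal sketch using the built-in html.parser (the HTML fragment is made up for illustration; html5lib and lxml must be installed separately):

```python
from bs4 import BeautifulSoup

# An illustrative fragment with a missing closing </p> tag
broken = "<html><body><p>Unclosed paragraph</body></html>"

# html.parser ships with Python; html5lib and lxml are separate installs
soup = BeautifulSoup(broken, "html.parser")

# The parser repairs the markup and closes the <p> tag for us
print(soup.p)
```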

Security

Web scraping shouldn’t be performed without the permission of the site’s administrator, else it may amount to illegal activity. Data scientists often scrape their own web pages or their company’s data. Care should be taken that only open-source or free sites are used for web scraping.
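One practical way to check what a site permits is its robots.txt file. A minimal sketch using Python’s built-in urllib.robotparser (the policy below is an inline example; in practice it is fetched from the site’s /robots.txt URL):

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt policy; real sites serve this at /robots.txt
robots_txt = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# can_fetch() checks whether a given user agent may request a URL
print(rp.can_fetch("*", "https://example.com/public/page"))
print(rp.can_fetch("*", "https://example.com/private/page"))
```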

Working with BeautifulSoup

Let’s try a simple piece of code using Python 3 and BeautifulSoup. When a web page is visited, the browser sends a request to a web server. Since we are fetching a file from the server, this is a GET request. The server then sends the file back as a response and tells the browser how to render the page. The response typically involves the following types of files:

  1. HTML (HyperText Markup Language) – contains the main content of the page.
  2. CSS (Cascading Style Sheets) – makes the page look nice.
  3. JS (JavaScript) – adds interactivity to the page.


HTML

It is a markup language that tells the web browser how to lay out the content of the page. HTML code is made up of HTML tags; the main parts of an HTML document are the head and the body. e.g.:

<html>
<head>
</head>
<body>
<p>
This is a paragraph of text
</p>
<p>
This is the second paragraph of text
</p>
</body>
</html>

BeautifulSoup can be installed using the following command,

pip install beautifulsoup4

Once the BeautifulSoup library has been installed, it’s time to use it to scrape our data based on the following steps.

First, the necessary libraries need to be imported into the program using the following commands

#import libraries
from bs4 import BeautifulSoup

Requests library

To scrape the page we first need to download it, and this part can be done using Python’s requests library. Requests makes a GET request to the web server, which downloads the HTML content of the page. e.g.:

Here, we request https://dataquestio.github.io/web-scraping-pages/simple.html

import requests
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
print(page)


After running the above code you will get a status code that tells you whether the page was downloaded successfully.

<Response [200]>

A 200 status code means that the page was downloaded successfully. Status codes starting with 2 generally indicate success, whereas those starting with 4 or 5 indicate an error.
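The 2xx/4xx/5xx grouping above can be sketched as a small helper (a minimal illustration, not part of the requests library):

```python
# Classifying HTTP status codes as described above:
# 2xx means success, 4xx and 5xx mean an error.
def status_class(code):
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"

print(status_class(200))  # success
print(status_class(404))  # client error
print(status_class(503))  # server error
```

In practice, requests exposes the same idea directly: page.ok is True for successful responses, and page.raise_for_status() raises an error for 4xx/5xx codes.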

Parsing the page using BeautifulSoup

Next, we parse the downloaded page with BeautifulSoup so that we can fetch the text from the <p> tag.

soup = BeautifulSoup(page.content, 'html.parser')

We can also print the HTML content of the page with the help of the soup object.

print(soup.prettify())
<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>

Finding all instances of tag at once

soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

To get the text without the HTML tags

soup.find('p').get_text()

'Here is some simple content for this page.'
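Putting find_all() and get_text() together, a minimal self-contained sketch (the HTML fragment below is made up for illustration):

```python
from bs4 import BeautifulSoup

# An illustrative HTML fragment with two paragraphs
html = """
<html><body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching tag; get_text() strips the markup
texts = [p.get_text() for p in soup.find_all("p")]
print(texts)  # ['First paragraph.', 'Second paragraph.']
```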

Web Scraping Example

Let’s look into another example where we scrape data from the https://www.apartments.com/chicago-il/ website, as below.

First, the necessary libraries need to be imported

#import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

Saving the URL and declaring the headers as {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'} to avoid a timeout conflict

# Saving the url and declaring the headers to avoid any timeout conflict
url = "https://www.apartments.com/chicago-il/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

The HTML content is requested using a GET request

#Requesting the html content from the webpage
page = requests.get(url, headers=headers)
print(page)

<Response [200]>

Parsing the HTML content

soup = BeautifulSoup(page.content, 'html.parser')

Using prettify() to structure our HTML content in a proper format

# Structuring the HTML file
print(soup.prettify())

<!DOCTYPE html>
<html data-placeholder-focus="false" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport"/>
  <meta content="telephone-no" name="format-detection"/>
  <meta content="email-no" name="format-detection"/>
  <meta content="true" name="HandheldFriendly"/>
  <title>
   Apartments for Rent in Chicago IL | Apartments.com
  </title>
  <link href="/a/8af55f9/favicon.ico" rel="icon"/>
  <link href="https://www.apartments.com/chicago-il/" rel="canonical"/>
  <link href="https://plus.google.com/+apartments.com/" rel="publisher"/>
  <meta content="See all 35,085 apartments in Chicago, IL currently available for rent. Each Apartment.com listing has verified availability, rental rates, photos, floor plans and more." name="description"/>
  <meta content="en" name="language"/>
  ...

Once the HTML content is parsed, we find the first and the last page numbers of the listing so that data can be gathered across all the pages.

After analyzing this content, it is found that the page numbers fall under the div with id "paging", which sits inside the div with id "placardContainer".

# Extracting the first and last page numbers
paging = soup.find("div", {"id": "placardContainer"}).find("div", {"id": "paging"}).find_all("a")
start_page = paging[1].text
last_page = paging[len(paging)-2].text

Looping through the pages to create links from the page numbers

for page_number in range(int(start_page), int(last_page) + 1):

    # Form the URL for each page from the base URL and the page number
    page_url = url + str(page_number) + "/"
    r = requests.get(page_url, headers=headers)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")

Now, we would get the name of the property and its location

# Extracting the Title and the Location
placard_header = soup.find_all("header", {"class":"placardHeader"})

Other details like the rent, the number of beds and the phone number are extracted next

# Extract the Rent, No of Beds and Phone Number under the section class placardContent
placard_content = soup.find_all("section", {"class": "placardContent"})

Getting all the placard_header and placard_content data into a single structure

# Storing the information from each placard in a dictionary
content_list = []
for item_header, item_content in zip(placard_header, placard_content):

    web_content_dict = {}
    web_content_dict["Title"] = item_header.find("a", {"class": "placardTitle js-placardTitle"}).text.replace("\r", "").replace("\n", "")
    web_content_dict["Address"] = item_header.find("div", {"class": "location"}).text
    web_content_dict["Price"] = item_content.find("span", {"class": "altRentDisplay"}).text
    web_content_dict["Beds"] = item_content.find("span", {"class": "unitLabel"}).text
    web_content_dict["Phone"] = item_content.find("div", {"class": "phone"}).find("span").text

    # Storing each dictionary in a list
    content_list.append(web_content_dict)

The content is saved in a DataFrame and written out to a CSV file

# Converting the list into a dataframe
df = pd.DataFrame(content_list)

#Saving it as a csv file
df.to_csv("Output.csv")
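End to end, the last step looks like this in miniature (the listing values below are made up for illustration):

```python
import pandas as pd

# An illustrative stand-in for the scraped dictionaries
content_list = [
    {"Title": "Sample Apartments", "Address": "Chicago, IL",
     "Price": "$1,200", "Beds": "1 Bed", "Phone": "555-0100"},
]

# Each dictionary becomes one row of the DataFrame
df = pd.DataFrame(content_list)

# index=False drops pandas' row-number column from the CSV
df.to_csv("Output.csv", index=False)
print(df.shape)  # (1, 5)
```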

Business examples of Web Scraping

  1. Real estate data gathering is a huge, rapidly growing market in which businesses scrape already-listed properties such as apartments, plots and farmhouses. Many real estate agents use it.
  2. Email address gathering is one of the key parts of B2B and B2C business. Many companies use a person’s email id to make contact and send information about their products. An email id can also be used to get more details about a person; e.g., using someone’s email id we can look up their professional details on LinkedIn.
  3. Social media data is among the most valuable data today. Collecting data from different social media platforms and using it to attract more customers is one of the best business models around: providing quality data to an app company helps it fetch more customers by giving customers what they really want, not merely what the company is offering. e.g.:

A food app can learn a customer’s favorite dish and offer the same dish at a cheaper price with attractive discounts.

Advantages of Web Scraping

  1. It reduces the cost of obtaining information from the web.
  2. Easy to learn and implement.
  3. Data extracted using BeautifulSoup is highly accurate compared to other APIs. Accuracy is very important when data is extracted from websites dealing with sales prices, real estate prices, stock prices or any other kind of financial data, because even a slight change in a number may cause a huge loss to a company.

Disadvantages of Web Scraping

  1. The surrounding HTML tags may cause slight confusion while processing the scraped data.
  2. Some websites don’t allow users to scrape their data, so scraping them may amount to illegal activity.
  3. BeautifulSoup works mostly through HTML tags, which can also be a disadvantage: a scraper written against a previous version of a website’s HTML breaks once the site’s structure changes.

Conclusion

Web-Scraping is an important concept to master these days due to the enormous amount of unstructured data that’s available to us.

An understanding of regular expressions in Python is also crucial for web-scraping tasks – https://www.w3schools.com/python/python_regex.asp

Additional resource – https://www.crummy.com/software/BeautifulSoup/bs4/doc/

