In data science, the data source matters as much as the analysis itself. Sometimes data is collected internally at the administrative end of a company; at other times it must be extracted from outside sources for analytics, and few sources are richer than websites.

Let’s understand the concept of web scraping using Python.

What is Web Scraping?

Websites store huge amounts of data aggregated from various sources, and that data can be extracted using scraping. Scraping is a process where we parse through a web page and collect data along the way. Web scraping in Python is performed using a “web scraper”, also known as a “bot”, “spider”, or “crawler”. A web scraper is a program that sends a request to a web page, downloads the content, collects only the required data from the response, and stores it in a database.

What are the steps involved in web scraping?

Web Scraping using Python is a three-step process.

Step 1 – Sending an HTTP request to the webpage you want to scrape

First, we send an HTTP request to the target URL of the webpage we want to access. The server then responds to the request just as it would for a browser, returning the HTML content of the target webpage. In other words, we get the HTML code of our target website as the response to the request made from Python.

To send web requests in Python, we need to import the requests library.

import requests
response = requests.get('target URL')

Step 2 – Parsing the HTML content

Once we have the HTML content of the webpage, we need a way to parse it. We cannot simply extract the data from the raw code because HTML is nested, so we need an HTML parser to build a nested tree structure out of it. There are many parsers available in Python, such as html5lib.
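As a small illustration, here is how Beautiful Soup builds that nested tree from an inline HTML string (standing in for a real server response). The built-in 'html.parser' backend is used so the snippet runs without extra installs; 'html5lib' can be swapped in if it is installed.

```python
from bs4 import BeautifulSoup

# A tiny HTML document standing in for a real server response
html = """
<html><body>
  <div class="product">
    <h2>Phone A</h2>
    <span class="price">Rs. 9,999</span>
  </div>
</body></html>
"""

# Build the nested parse tree; 'html.parser' is the standard-library backend
soup = BeautifulSoup(html, 'html.parser')

# The tree can now be navigated by tag name, following the nesting
print(soup.div.h2.text)        # Phone A
print(soup.find('span').text)  # Rs. 9,999
```

Because the parser has turned the flat text into a tree, nested tags can be reached by attribute access (`soup.div.h2`) instead of string manipulation.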

Step 3 – Pulling data out of HTML using Beautiful Soup

Now that we have the parsed HTML data, all we need to do is navigate and search it. For this task, we will use another third-party Python library, Beautiful Soup, which is designed for pulling data out of HTML and XML files.

We first find the classes that hold the required data using the Inspect Element feature of Chrome, and then access all the data inside those class blocks using Beautiful Soup.
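A sketch of that class-based lookup on a made-up snippet (the class names here are illustrative, not real Flipkart classes):

```python
from bs4 import BeautifulSoup

# Inline HTML mimicking a listing page; the CSS class names are made up
html = """
<div class="card"><span class="name">Phone A</span><span class="cost">9999</span></div>
<div class="card"><span class="name">Phone B</span><span class="cost">19999</span></div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all(class_=...) returns every tag carrying that CSS class;
# .text keeps only the human-readable content of each tag
names = [tag.text for tag in soup.find_all(class_='name')]
costs = [tag.text for tag in soup.find_all(class_='cost')]

print(names)  # ['Phone A', 'Phone B']
print(costs)  # ['9999', '19999']
```

On a real page, the class names passed to `find_all` are the ones you read off in Inspect Element.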

The following code walks through the process described above.

Scraper for extracting mobile phone name, price, rating, and description from Flipkart.

Importing our required libraries

#Importing the Beautiful Soup Library
from bs4 import BeautifulSoup

#Importing the requests Library
import requests

Sending Request to the URL

# Target URL elided in the original; substitute the Flipkart page to scrape
response = requests.get('target URL')

Parsing the returned HTML data

soup = BeautifulSoup(response.text, 'lxml')

Accessing the required data from the HTML content

# Each field is located by its Flipkart CSS class; .text keeps only the tag text
mname = [tag.text for tag in soup.find_all(class_='_3wU53n')]
mrating = [tag.text for tag in soup.find_all(class_='hGSR34')]
mprice = [tag.text for tag in soup.find_all(class_='_1vC4OE _2rQ-NK')]
mdesc = [tag.text for tag in soup.find_all(class_='vFw0gD')]
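Once the four lists are filled, they can be stitched into one record per phone with zip. A standalone sketch with sample values (a real run would use the lists produced by the scrape above):

```python
# Sample values standing in for scraped results, so this snippet runs on its own
mname = ['Phone A', 'Phone B']
mrating = ['4.5', '4.2']
mprice = ['Rs. 9,999', 'Rs. 19,999']
mdesc = ['64 GB, 4 GB RAM', '128 GB, 6 GB RAM']

# zip pairs the i-th entry of each list into one tuple per phone
records = list(zip(mname, mrating, mprice, mdesc))
for name, rating, price, desc in records:
    print(f'{name} | {rating} | {price} | {desc}')
```

These records can then be written to a CSV file or a database table, completing the download–parse–store loop described earlier.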


