In the growing world of data science, data sources matter: we gather data from them and extract valuable insights in turn. Sometimes data is collected at a company's administrative end; at other times we need to pull data for analytics from external sources, and few sources are richer than websites.
Let's understand the concept of web scraping using Python.
What is Web Scraping?
Websites store huge amounts of data aggregated from various sources, and this data can be extracted through scraping. Scraping is the process of parsing a web page and collecting data along the way. Web scraping in Python is performed using a "web scraper", also known as a "bot", "spider", or "crawler". A web scraper is a program that sends a request to a web page, downloads the content, collects only the required data from the response, and stores it in a database.
What are the steps involved in web scraping?
Web Scraping using Python is a three-step process.
Step 1 – Sending an HTTP request to the webpage you want to scrape
First, we send an HTTP request to the target URL of the webpage we want to access. Then, just as it would for a browser, the server responds to the request by returning the HTML content of the target webpage. In other words, we get the HTML code of our target website as a response to the request made using Python.
To send web requests in Python, we need to import the requests library.

import requests
response = requests.get('target URL')
Step 2 – Parsing the HTML content
Once we have the HTML content of the webpage, we need a way to parse it. We cannot simply extract data from the raw code because HTML is nested, so we need an HTML parser to build a nested structure out of the HTML data. Several parsers are available in Python, such as html5lib and lxml.
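As a minimal sketch of this step, the HTML can be handed to Beautiful Soup together with a parser name. The snippet below uses Python's built-in 'html.parser' and a small stand-in HTML string (both are illustrative choices; 'html5lib' or 'lxml' can be passed instead if installed):

```python
from bs4 import BeautifulSoup

# A small HTML snippet standing in for a downloaded page (illustrative only)
html = "<html><body><h1>Phones</h1><p class='price'>Rs. 9,999</p></body></html>"

# Parse the raw HTML into a navigable, nested tree; 'html.parser' ships with
# Python, while 'html5lib' and 'lxml' are third-party alternatives
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.text)                     # Phones
print(soup.find(class_="price").text)   # Rs. 9,999
```

Once parsed, the nested structure can be navigated by tag name or searched by attributes such as class.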
Step 3 – Pulling data out of HTML using Beautiful Soup
Now that we have the HTML data, all we need to do is navigate and search the parsed tree we collected. For this task we will use another third-party Python library, Beautiful Soup, which is built for pulling data out of HTML and XML files.
We first identify the classes that hold the required data using the Inspect Element feature of Chrome, and then access all the data inside those class blocks using Beautiful Soup.
The following code illustrates the above process.
Scraper for extracting mobile phone name, price, rating, and description from Flipkart.
Importing our required libraries
# Importing the Beautiful Soup library
from bs4 import BeautifulSoup
# Importing the requests library
import requests
Sending Request to the URL
response = requests.get('https://www.flipkart.com/search?q=nokia+smartphones&sid=tyy%2C4io&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_0_10_na_na_pr&otracker1=AS_QueryStore_OrganicAutoSuggest_0_10_na_na_pr&as-pos=0&as-type=RECENT&suggestionId=nokia+smartphones&requestId=675612e2-512b-4d0e-8b75-6bdf91921d7c&as-backfill=on')
Parsing the returned HTML data
soup = BeautifulSoup(response.text, 'lxml')
Accessing the required data from the HTML content
mname, mrating, mprice, mdesc = list(), list(), list(), list()
mobile_name = soup.find_all(class_='_3wU53n')
rating = soup.find_all(class_='hGSR34')
price = soup.find_all(class_='_1vC4OE _2rQ-NK')
description = soup.find_all(class_='vFw0gD')
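The find_all calls above return lists of whole HTML tags, so one more pass is needed to pull the text out and fill the four lists. Below is a minimal, self-contained sketch of that step; the tiny HTML snippet stands in for Flipkart's response, and the class names are taken from the code above (Flipkart's real markup changes over time, so they may no longer match the live site):

```python
from bs4 import BeautifulSoup

# Stand-in HTML using the same class names as the scraper above (illustrative)
html = """
<div class="_3wU53n">Nokia 5.3</div><div class="hGSR34">4.2</div>
<div class="_1vC4OE _2rQ-NK">Rs. 13,999</div><div class="vFw0gD">4 GB RAM</div>
<div class="_3wU53n">Nokia 2.4</div><div class="hGSR34">4.0</div>
<div class="_1vC4OE _2rQ-NK">Rs. 10,499</div><div class="vFw0gD">3 GB RAM</div>
"""
soup = BeautifulSoup(html, "html.parser")

mname, mrating, mprice, mdesc = [], [], [], []
# Walk the four result lists in parallel and keep only the tag text
for name, rate, cost, desc in zip(soup.find_all(class_="_3wU53n"),
                                  soup.find_all(class_="hGSR34"),
                                  soup.find_all(class_="_1vC4OE _2rQ-NK"),
                                  soup.find_all(class_="vFw0gD")):
    mname.append(name.get_text())
    mrating.append(rate.get_text())
    mprice.append(cost.get_text())
    mdesc.append(desc.get_text())

print(mname)   # ['Nokia 5.3', 'Nokia 2.4']
print(mprice)  # ['Rs. 13,999', 'Rs. 10,499']
```

From here the four lists can be written to a CSV file or loaded into a database, completing the scraping pipeline described in the three steps above.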