Web Scraping in Python
This walkthrough is for educational purposes: it shows how to use web scraping to extract data from a website.
Why is web scraping used?
Suppose you want to extract a large amount of data from a website. You could copy it by hand if you have enough time, but if you want it done quickly, several Python libraries make the task much easier by automating it.
What is web scraping?
Web scraping, also called web harvesting or web data extraction, is data scraping applied to websites. It is an automated way to extract large amounts of data from a website you choose.
How do you find out whether scraping a website is allowed?
Some websites restrict scraping while others allow it. To check whether a website allows or disallows it, append /robots.txt to the site's root URL and read the rules listed there.
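As a rough illustration of that check, Python's standard library includes urllib.robotparser, which can read robots.txt rules and report whether a given URL may be fetched. The rules and URLs below are a made-up sample so the example runs offline; they are not taken from any real site. For a live site you would likely call set_url() and read() on the parser instead of parse().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/hotels.html"))   # True
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))  # False
```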
In this tutorial, we will extract hotel data (name, rating, number of reviews, price, and so on) from the Tripadvisor website.
How Do You Scrape Data From A Website?
- Find the URL that you want to scrape
- Inspect the page
- Find the data you want to extract
- Write the code
- Run the code and extract the data
- Store the data in the required format
Libraries Used:
- Requests: Requests is a Python library for sending HTTP requests. It is used here to download the page.
- BeautifulSoup: Beautiful Soup is a Python package for parsing HTML and XML documents. It builds a parse tree that makes the data easy to extract.
- Pandas: Pandas is a library used for data manipulation and analysis. It is used here to organize the scraped data and store it in the desired format.
- lxml: lxml is a Pythonic, mature binding for the libxml2 and libxslt libraries. It provides safe and convenient access to these libraries using the ElementTree API.
Now, let’s get started.
Step 1: Find the URL that you want to scrape
In our case, we will extract data for hotels in Ahmedabad, Gujarat from Tripadvisor, using the link below:
https://www.tripadvisor.com/Hotels-g297608-Ahmedabad_Ahmedabad_District_Gujarat-Hotels.html
Step 2: Inspect the page
Press Ctrl+Shift+I on the web page to open the inspector. This opens the browser's developer tools alongside the page.
Step 3: Find the data you want to extract
- Suppose you want to store the hotel name, price, number of reviews, and rating. Right-click the hotel name; the last option in the context menu is Inspect.
- Click that option.
- You will find the <a> tag in which the name is stored.
- As you can see, it sits inside a <div> tag whose class name is listing_title.
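To make that tag structure concrete, here is a small sketch that parses a hand-written HTML fragment shaped like the listing described above. The fragment, the hotel names, and the href values are invented for illustration; only the listing_title class name comes from inspecting the page, and the live markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Invented fragment mimicking the structure seen in the inspector.
SAMPLE_HTML = """
<div class="listing_title"><a href="/hotel-a">Hotel A</a></div>
<div class="listing_title"><a href="/hotel-b">  Hotel B  </a></div>
"""

page = BeautifulSoup(SAMPLE_HTML, "html.parser")

names = []
for title_div in page.find_all("div", class_="listing_title"):
    link = title_div.find("a")  # the <a> tag nested inside the <div>
    if link is not None:
        names.append(link.get_text(strip=True))

print(names)  # ['Hotel A', 'Hotel B']
```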
Step 4: Write the code
- Create a Python file.
- Import all the necessary libraries:
import requests
from bs4 import BeautifulSoup as soup
import pandas as pd
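- Send a GET request to our URL, check the response, and parse it into the bs_object used in the following steps:
html = requests.get('https://www.tripadvisor.com/Hotels-g297608-Ahmedabad_Ahmedabad_District_Gujarat-Hotels.html')
print(html.status_code)  # 200 means the request succeeded
# Parse the downloaded HTML with the lxml parser
bs_object = soup(html.content, 'lxml')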
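Some sites reject requests that lack a browser-like User-Agent, or respond slowly, so a slightly more defensive fetch can help if the status code is not 200. The sketch below is one possible approach and is not part of the original script: the fetch_html helper name, the header value, and the 10-second timeout are all assumptions chosen for illustration.

```python
import requests

# Assumed header value; any realistic browser User-Agent string could be used.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)"}


def fetch_html(url, session=None, timeout=10):
    """Hypothetical helper: return the page body, raising on HTTP errors."""
    session = session or requests.Session()
    response = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

The returned text could then be passed to BeautifulSoup in place of html.content.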
- Write the code that fetches each hotel name from the web page:
hotel = []
for name in bs_object.find_all('div', {'class': 'listing_title'}):
    hotel.append(name.text.strip())
print(hotel)
- Similarly, find the class names for the ratings, number of reviews, and price through the inspector, and write the Python code:
ratings = []
for rating in bs_object.find_all('a', {'class': 'ui_bubble_rating'}):
    ratings.append(rating['alt'])
print(ratings)

reviews = []
for review in bs_object.find_all('a', {'class': 'review_count'}):
    reviews.append(review.text.strip())
print(reviews)

price = []
for p in bs_object.find_all('div', {'class': 'price-wrap'}):
    # Drop the rupee sign and surrounding whitespace
    price.append(p.text.replace('₹', '').strip())
print(price)
- The data is now stored in lists, so it is time to organize it. This is done with the help of the pandas library:
d1 = {'Hotel':hotel,'Ratings':ratings,'No_of_Reviews':reviews,'Price':price}
df = pd.DataFrame.from_dict(d1)
print(df)
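One caveat with building four independent lists is that pandas raises a ValueError when the lists differ in length, which can happen if any hotel is missing a price or a review count. A possible alternative is to walk each listing container and collect one record per hotel. In the sketch below, the HTML fragment and the "listing" wrapper class are hypothetical; only listing_title, review_count, and price-wrap come from the code above.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented markup; the "listing" wrapper class is an assumption.
SAMPLE_HTML = """
<div class="listing">
  <div class="listing_title"><a>Hotel A</a></div>
  <a class="review_count">120 reviews</a>
  <div class="price-wrap">₹2,500</div>
</div>
<div class="listing">
  <div class="listing_title"><a>Hotel B</a></div>
  <a class="review_count">45 reviews</a>
</div>
"""


def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first matching descendant, or None."""
    node = parent.find(tag, class_=css_class)
    return node.get_text(strip=True) if node is not None else None


page = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = []
for card in page.find_all("div", class_="listing"):
    records.append({
        "Hotel": text_or_none(card, "div", "listing_title"),
        "No_of_Reviews": text_or_none(card, "a", "review_count"),
        "Price": text_or_none(card, "div", "price-wrap"),
    })

# A missing field stays empty in its own row instead of shifting later rows.
hotels_df = pd.DataFrame(records)
print(hotels_df)
```

Because every record has the same keys, each row stays aligned with its hotel even when a field is absent.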
Step 5: Run the code and extract the data
- When we run our Python file, the final output looks like this:
Step 6: Store the data in the required format
df.to_csv('hotels.csv', index=False, encoding='utf-8')
- This stores the data in CSV format, in a file named hotels.csv.
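The scraped values are plain strings, so the CSV will contain text such as a review count followed by the word "reviews". If numeric values are needed for sorting or analysis, the lists can be converted before calling to_csv. The sketch below uses only the standard library; the first_number helper is hypothetical, and the sample strings are assumed formats, since the real page text may differ.

```python
import re


def first_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Remove thousands separators before converting
    return float(match.group().replace(",", ""))


print(first_number("1,204 reviews"))     # 1204.0
print(first_number("₹ 2,500"))           # 2500.0
print(first_number("4.5 of 5 bubbles"))  # 4.5
print(first_number("No price"))          # None
```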
You can find the source code on my GitHub: https://github.com/dhruvshah1105/Data-Science-Practicals/blob/development/Practical-1/web-scrapping.py
I hope this helped you learn how to scrape Tripadvisor hotel data.