How to Scrape Data from a Website Using Python

Image source: Edureka
Web scraping, web harvesting, or web data extraction is the process of extracting data from websites. There are different ways to scrape a website, such as online services, APIs, or writing your own code. In this article, we'll see how to implement web scraping with Python, using one of the websites I have built.
I will skip the installation of Python in this tutorial.
Using your preferred text editor, create a Python file and name it whatever you want; I'll name mine scrapper.py. We'll import all the libraries we'll need to build our scraper. A library is a collection of precompiled routines that a program can use. The routines, sometimes called modules, are stored in object format.
# import libraries
import requests
import csv
from bs4 import BeautifulSoup
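The requests and bs4 libraries don't ship with Python, so if you don't have them yet, they can usually be installed with pip:

pip install requests beautifulsoup4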
Now let's get the URL of the website we want to scrape data from. In this case, we'll use https://windhoeknamibia.github.io
Using our URL, we'll now use the requests library to fetch data from the website.
# import libraries
import requests
import csv
from bs4 import BeautifulSoup

# url to scrape
url = "https://windhoeknamibia.github.io/"

# send an http request
resp = requests.get(url)

# prints the html of the website
print(resp.text)
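Before parsing anything, it's worth checking that the request actually succeeded. A minimal sanity check, reusing the resp object from above:

# stop early if the server returned anything other than 200 OK
if resp.status_code != 200:
    print(f"Request failed with status code {resp.status_code}")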
Next, create a soup object and use it to get the title of the website.
# create a soup object
soup = BeautifulSoup(resp.content, 'html.parser')

# get the title of the website
title = soup.find(id="title")
print(title)
print(title.string)
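Note that find(id="title") works here because this particular page happens to have an element with the id "title"; that's an assumption about how the site is built. On most websites you could grab the <title> tag directly instead:

# the <title> tag works on almost any page
print(soup.title.string)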
Now let's use our soup object to find the places!
# get all <h4> elements with the class "place-name"
places_obj = soup.find_all("h4", {"class": "place-name"})
print(places_obj)
Write all the place names to a csv file.
# create a list of place names, starting with a header row
list_of_places = [["Place Names"]]

# loop through the places object and append each name to your list of places
for place in places_obj:
    list_of_places.append([place.string])

print(list_of_places)

# create a new csv file
with open('places.csv', 'w', newline='') as csv_file:
    writerObj = csv.writer(csv_file)  # create a csv writer object
    writerObj.writerows(list_of_places)  # writerows writes the data into your csv file
# the csv file will be displayed in your workspace
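To double-check what was written, you can read the file straight back with csv.reader (just a quick sanity check, reusing the csv import from above):

# read the csv file back and print each row
with open('places.csv', newline='') as csv_file:
    for row in csv.reader(csv_file):
        print(row)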
Let's use the soup object to help us get all the image src links.
# all images are inside a div with the class "whk-place"
img_obj = soup.find_all('div', {'class': 'whk-place'})
Now let's print out all the image links.
imgLinks = []
for link in img_obj:
    imgLinks.append(link.find('img').get('src'))  # get the src attribute of each image
print(imgLinks)
# prints a list of image links
I hope this article helped you understand web scraping and how to use Python libraries to scrape websites. You can keep practising with more examples on different websites.
Be careful not to scrape data from websites that do not give you permission to do so.
To know whether a website allows web scraping, you can look at its “robots.txt” file. You can find this file by appending “/robots.txt” to the root URL of the site you want to scrape.
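If you'd rather check this from Python, the standard library's urllib.robotparser can read the file for you; here's a minimal sketch using the same site as above:

from urllib.robotparser import RobotFileParser

# download and parse the site's robots.txt
parser = RobotFileParser("https://windhoeknamibia.github.io/robots.txt")
parser.read()

# check whether a generic crawler ("*") may fetch the home page
print(parser.can_fetch("*", "https://windhoeknamibia.github.io/"))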
To do!
Try to write all the image src links into your csv file.
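If you get stuck, here is one possible approach (just a sketch, reusing the imgLinks list from above; the filename images.csv is my own choice):

# one way to do the exercise: write each image link as its own row
rows = [["Image Links"]] + [[link] for link in imgLinks]
with open('images.csv', 'w', newline='') as csv_file:  # images.csv is an example name
    csv.writer(csv_file).writerows(rows)

Happy coding!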