Saturday, 13 October 2018

Web Scraping For-Sale Property Data Using BeautifulSoup and Mapping Geographical Data


Initially, I was going to write a program to scrape property listings on www.realestate.com.au and map them out. However, I discovered two challenges that were difficult to overcome. The first is that the website frequently includes text like "Auction", "10am", "Contact Agent" and other random strings alongside the address text. The second is that latitude and longitude information might be unavailable for a particular address. I decided to skip these error-causing property addresses and only map those with easily extractable addresses and geo-coordinates.

Below is a screenshot of a map showing some properties for sale in Eastwood, NSW 2122, Australia:

What you need to write this program:
  • Python 3 installed on your computer.
  • The geopy and BeautifulSoup (bs4) libraries installed in your Python environment.
  • A text editor/notebook to write the code: preferably Jupyter Notebook or Atom.
    • Note: A few quick Google searches will show you how to install these. Depending on the type of Python 3 installation you have, the way to install geopy and BeautifulSoup may differ.
When writing a program where there is no clear pathway to the result, it is important to break the program down into a series of objectives that are then put together as the final program. An example of this is web scraping: each webpage has its own HTML structure, so we have to look at the actual code to determine our approach to extracting the data.

Here are the steps I have set out for myself:

1) Find out the URL style:

".../in-SUBURB,+nsw+POSTCODE/list-1". The parts we need to fill in are "SUBURB" and "POSTCODE".

2) Find out the address tag information.

3) Extract the address tag information.

4) Find out the room, bathroom and garage tag information.

5) Extract the rooms information.

6) Use geopy and Nominatim to find the latitude and longitude of each property based on its address (a small sketch of this follows the list).

7) Map the properties using markers.

8) Add the address and rooms information to the markers.
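As a quick taste of step 6, here is a minimal geopy/Nominatim sketch; the address below is a made-up example:

#Geocode a single address with Nominatim
from geopy.geocoders import Nominatim

nom = Nominatim(user_agent="my-application")
#geocode() returns None when Nominatim cannot find the address
location = nom.geocode("1 Example St, Eastwood NSW 2122")
if location is not None:
    print(location.latitude, location.longitude)
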
You will notice that the web scraping part takes up the majority of the steps, and indeed I spent more time on this part than on the mapping code! Describing every detail of this programming project would be too long for most readers, so I am going to describe the three distinct ways I used to scrape data.

Extracting information by searching HTML tags:

The first step in web scraping a particular webpage, after loading it, is to look at the HTML code and work out its structure. Typically, a webpage has many different objects with different names and classes. In this case, I had to look for an article object with the class "results-card residential-card":
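Here is a minimal, self-contained sketch of that tag search; the HTML string is made up to stand in for the real page:

from bs4 import BeautifulSoup

#A tiny stand-in for the real page's HTML
html = '<article class="results-card residential-card" aria-label="1 Example St, Eastwood">3 Beds</article>'
soup = BeautifulSoup(html, 'html.parser')

#find_all returns every <article> tag with the given class
cards = soup.find_all('article', {'class': 'results-card residential-card'})
print(cards[0]['aria-label'])   #prints: 1 Example St, Eastwood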


Extracting information by using the .text method:

A different way to extract information is to look at the displayed text by using the .text method:
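For example, again with a made-up snippet of HTML:

from bs4 import BeautifulSoup

html = '<article><span>3</span> Beds <span>2</span> Baths <span>1</span> Car</article>'
soup = BeautifulSoup(html, 'html.parser')

#.text strips away the tags and returns only the displayed text
print(soup.find('article').text)   #prints: 3 Beds 2 Baths 1 Car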



Extracting information by converting to string:

If searching for tags is too difficult, you can convert the soup object to a string and use split methods to find the information you want:
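For example, this pulls an address out of a made-up tag with two split() calls:

from bs4 import BeautifulSoup

html = '<article aria-label="1 Example St, Eastwood" class="results-card">...</article>'
soup = BeautifulSoup(html, 'html.parser')

#Everything between 'aria-label="' and the next '"' is the address
address = str(soup).split('aria-label="')[1].split('"')[0]
print(address)   #prints: 1 Example St, Eastwood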

 

 
For your convenience, the full Python code is below:
#Import libraries
import requests
from bs4 import BeautifulSoup

#import Nominatim from geopy
from geopy.geocoders import Nominatim

nom = Nominatim(user_agent="my-application",scheme='http')


base_url = 'https://www.realestate.com.au/buy/'

#Ask user for the two inputs
suburb = input('Enter Suburb in NSW:')
postcode = input('Enter Postcode:')

#Create the full URL of the page to scrape.
full_url = base_url+'in-'+suburb+',+nsw+'+postcode+'/list-1'

#Load page using BeautifulSoup
r = requests.get(full_url)
c = r.content
soup = BeautifulSoup(c, 'html.parser')

#Extracting addresses and putting them into a list.
#Each listing is an <article> tag; its aria-label attribute holds the address.
cards = soup.find_all('article', {'class':'results-card residential-card '})

addresses = []
top = range(len(cards))
for i in top:
    addresses.append(str(cards[i]).split('aria-label="')[1].split('"')[0]+' '+postcode)

#Extracting the displayed property information and putting it into a list:
DwellingInfo = []
for i in top:
    DwellingInfo.append(cards[i].text)

#Extract the web link for each listing
links = []
for i in top:
    linktag = cards[i].find_all('a', {'class':'details-link residential-card__details-button'})
    linktag = str(linktag).split('>')
    links.append(linktag[0].split('/')[-1].split('"')[0])
   
#Getting the geo-coordinates.
#First build cleaned-up versions of the addresses as a fallback for the
#geocoder: drop unit numbers ("3/45 Street" -> "45 Street") and
#street-number ranges ("12-14 Street" -> "14 Street").
addresses_clean = []
for i in top:
    addresses_clean.append(addresses[i].split('/')[-1].split('-')[-1])

#latitude and longitude lists
latitudes = []
longitudes = []

for i in top:
    try:
        #Geocode once and reuse the result for both coordinates
        location = nom.geocode(addresses[i])
        latitudes.append(location.latitude)
        longitudes.append(location.longitude)
    except Exception:
        try:
            location = nom.geocode(addresses_clean[i])
            latitudes.append(location.latitude)
            longitudes.append(location.longitude)
        except Exception:
            #Fall back to a fixed point off the NSW coast so the lists stay aligned
            latitudes.append(-33.8)
            longitudes.append(155)

   
#Importing folium
import folium

#Creating the map, centred on the first property
property_map = folium.Map(location=[latitudes[0], longitudes[0]], tiles='Stamen Toner', zoom_start=14)

#Create a FeatureGroup to be added to the map
featuregroup1 = folium.FeatureGroup(name='Properties')

#Create the markers
for lat, lon, info, link in zip(latitudes, longitudes, DwellingInfo, links):
    featuregroup1.add_child(
        folium.CircleMarker(radius=7, location=[lat, lon],
                            popup='DESCRIPTION: %s ||LINK: https://realestate.com.au/%s' % (info, link),
                            color='red'))

#Add the feature group to the map
property_map.add_child(featuregroup1)

#Save the map
property_map.save('PropertySearch.html')
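
Open the saved PropertySearch.html file in a web browser to explore the interactive map.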
