Data Econ: Modelling Melbourne Property Prices with Light GBM

Thursday, 4 October 2018

This post continues Part 1, where I visualised Melbourne property price data (link). This time I used the full dataset, with prices and property data from January 2016 to August 2018, to build a model that predicts property prices in Melbourne.
Here is what the data look like:

The dataset contains 27,247 property transactions with 21 features, including price. In my model, I dropped "Lattitude" and "Longtitude" [sic] because so many of their values are missing. I created unique street names by extracting the street from each address and combining it with the suburb name. I also filled in missing values in some essential features with their means.
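The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a three-row stand-in table, not the notebook's actual code: apart from "Lattitude" and "Longtitude", the column names (`Suburb`, `Address`, `Landsize`, `BuildingArea`, `Street`) and the sample rows are assumptions made for the example.

```python
import pandas as pd

# Tiny stand-in for the real dataset. Column names other than
# "Lattitude"/"Longtitude" are assumed and may differ from the notebook.
df = pd.DataFrame({
    "Suburb": ["Abbotsford", "Abbotsford", "Richmond"],
    "Address": ["85 Turner St", "25 Bloomburg St", "4/12 Church St"],
    "Rooms": [2, 2, 3],
    "Price": [1480000.0, 1035000.0, 1200000.0],
    "Landsize": [202.0, None, 120.0],
    "BuildingArea": [None, 79.0, 105.0],
    "Lattitude": [-37.7996, None, None],
    "Longtitude": [144.9984, None, None],
})

# Drop the sparse coordinate columns.
df = df.drop(columns=["Lattitude", "Longtitude"])

# Strip the leading house/unit number, then append the suburb so that
# streets with the same name in different suburbs stay distinct.
street = df["Address"].str.replace(r"^\S+\s+", "", regex=True)
df["Street"] = street + ", " + df["Suburb"]

# Fill missing numeric values with the column mean.
for col in ["Landsize", "BuildingArea"]:
    df[col] = df[col].fillna(df[col].mean())

print(df[["Street", "Landsize", "BuildingArea"]])
```

Combining street and suburb matters because common street names repeat across suburbs, and a bare street name would otherwise merge unrelated locations into one category.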

I split the data into training and testing sets and modelled the prices with LightGBM, a gradient boosting framework. Here is a good tutorial on the algorithm - Link. LightGBM is released by Microsoft and is available in both Python and R. This is what the predictions look like against the actual prices in the testing set:

I also plotted the features according to their importance/prediction power:

The most important features affecting house prices in Melbourne are:

Distance from CBD

Postcode

Dwelling type

Number of rooms

Distance from CBD is by far the most important feature. Unlike the visual analysis in Part 1, the model here does not use latitude and longitude. Instead, it uses features that are related to location: Postcode, Suburb and Council Area.
For me, the most surprising part is that the number of rooms in a property has a higher impact on the price than land size or built area.

To see how I did everything step by step, please refer to my Jupyter Notebook on Kaggle: https://www.kaggle.com/wlsamchen/melbourne-house-price-modelling-light-gbm
