Predicting Residential Property Prices

Dean De Cock’s (2011) data set describes residential properties sold in Ames, Iowa between 2006 and 2010. It contains 2,930 observations and 80 features, comprising 46 categorical variables (23 nominal and 23 ordinal), 14 discrete variables and 20 continuous variables. The covariates quantify physical attributes of a property that a potential buyer would typically consider.
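One way to reproduce this variable-type tally is to split the columns by dtype with pandas. The tiny frame below is a hypothetical stand-in (the column names mimic the Ames data but the values are invented); the real file would be read from De Cock's published data.

```python
import pandas as pd

# Hypothetical miniature of the Ames data, for illustration only:
# two categorical columns (one nominal, one ordinal) and three numeric ones.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "CollgCr"],   # nominal
    "Kitchen Qual": ["TA", "Gd"],           # ordinal, stored as object
    "TotRms AbvGrd": [6, 7],                # discrete
    "Gr Liv Area": [1262.0, 1710.0],        # continuous
    "SalePrice": [181500, 223500],
})

# Categorical columns arrive as object dtype; discrete and continuous
# features are both numeric and would be separated by inspection.
categorical = df.select_dtypes(include="object").columns
numeric = df.select_dtypes(include="number").columns
print(len(categorical), len(numeric))  # → 2 3
```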

Regression Modelling

Summary

The complexity of the data lends itself to comprehensive machine learning. Missing values, skewness, multicollinearity and outliers are screened, diagnosed and treated in readiness for sale price predictions on 500 samples from the data set. A series of regression models is built and compared to identify the optimal simple, intermediate and complex algorithms for predicting sale price.
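The screening steps above can be sketched as a preprocessing pipeline. This is a minimal illustration on synthetic numeric features, not the report's actual workflow: the real pipeline would also encode the categorical variables and treat outliers and multicollinearity explicitly. A log transform is one common way to counter the skewness in sale prices.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Synthetic stand-in data: 100 rows, 5 numeric features, ~5% missing,
# and a right-skewed target mimicking sale prices.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[rng.random(X.shape) < 0.05] = np.nan
y = np.expm1(rng.normal(12, 0.4, size=100))

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # stabilise coefficients
    ("model", Ridge(alpha=1.0)),
])
pipe.fit(X, np.log1p(y))            # fit on log scale to reduce skew
preds = np.expm1(pipe.predict(X))   # back-transform to dollar scale
print(preds.shape)  # → (100,)
```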

Actual vs Predicted Sale Price by Regression Model

The most accurate predictions are produced by the Ridge Regression model. This complex model has the lowest Mean Absolute Error, although it is likely more sensitive to fitting noise and its Adjusted R-Squared value is marginally lower than that of the intermediate model. The difference in Mean Absolute Error between the Ridge Regression model and the Huber Loss Function is approximately $2,600. Even so, the Huber Loss Function's Adjusted R-Squared value indicates a good overall fit, satisfactorily balancing complexity and explanatory power. As the data set contains only a modest number of highly priced houses and is heavily skewed towards lower-priced residential properties, sale prices at the upper end tend to be underestimated. The simple model, also a Huber Loss Function, performs noticeably worse on both metrics. Despite its simplicity, the model is likely to underfit unseen data, as patterns may not be captured with so few predictors.
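The comparison above can be sketched with scikit-learn's Ridge and Huber estimators, scored on Mean Absolute Error and Adjusted R-Squared. The data here is synthetic and the feature counts are invented; the report's actual comparison uses the prepared Ames features and a 500-sample hold-out set.

```python
import numpy as np
from sklearn.linear_model import Ridge, HuberRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the Ames features.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=500, random_state=0)

def adjusted_r2(r2, n, p):
    # Adjusted R^2 penalises added predictors: 1 - (1 - R^2)(n - 1)/(n - p - 1)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

results = {}
for name, model in [("Ridge", Ridge(alpha=1.0)), ("Huber", HuberRegressor())]:
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    mae = mean_absolute_error(y_te, pred)
    adj = adjusted_r2(r2_score(y_te, pred), len(y_te), X_te.shape[1])
    results[name] = (mae, adj)
    print(f"{name}: MAE={mae:.3f}, adjusted R2={adj:.3f}")
```

Huber loss is quadratic for small residuals and linear for large ones, which tempers the influence of outlying sale prices relative to a squared-error fit.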