Predicting Residential Property Prices
Dean De Cock’s (2011) data set describes residential properties sold in Ames, Iowa between 2006 and 2010. It contains 2,930 observations and 80 features: 46 categorical variables (23 nominal, 23 ordinal), 14 discrete variables and 20 continuous variables. The covariates quantify physical attributes of a property that a potential buyer would typically consider.
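As a sketch of this first screening step, the variable groups can be separated by dtype in pandas. The miniature DataFrame and column names below are illustrative stand-ins, not the actual Ames schema:

```python
import pandas as pd

# Hypothetical miniature of the data: one column standing in for each of
# the nominal, ordinal, discrete and continuous variable groups.
sample = pd.DataFrame({
    "Neighborhood": ["NAmes", "Edwards", "NAmes"],            # nominal
    "Kitchen_Qual": pd.Categorical(
        ["TA", "Gd", "Ex"],
        categories=["Po", "Fa", "TA", "Gd", "Ex"],
        ordered=True),                                        # ordinal
    "Full_Bath": [1, 2, 2],                                   # discrete
    "Gr_Liv_Area": [1262.0, 1710.0, 1786.0],                  # continuous
    "SalePrice": [181500, 208500, 223500],                    # target
})

# Split columns into broad type groups as a first screening pass.
nominal = sample.select_dtypes(include="object").columns.tolist()
ordinal = [c for c in sample.columns
           if isinstance(sample[c].dtype, pd.CategoricalDtype)
           and sample[c].dtype.ordered]
numeric = sample.select_dtypes(include="number").columns.tolist()

print(nominal)  # → ['Neighborhood']
print(ordinal)  # → ['Kitchen_Qual']
print(numeric)  # → ['Full_Bath', 'Gr_Liv_Area', 'SalePrice']
```

In practice the discrete/continuous split within the numeric group is made from the data dictionary rather than from dtypes alone.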
Regression Modelling
Summary
The complexity of the data lends itself to comprehensive machine learning. Missing values, skewness, multicollinearity and outliers are screened, diagnosed and resolved in readiness for sale price predictions on a hold-out set of 500 observations drawn from the dataset. A series of regression models is built and compared to identify the simple, intermediate and complex algorithms that best predict sale price.
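The four screening steps can be sketched minimally as follows. The synthetic data and column names are illustrative, and the chosen remedies (median imputation, log transform, dropping a collinear column, IQR clipping) are common defaults rather than the report’s exact choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# Hypothetical stand-in for the raw features (names are illustrative).
area = rng.lognormal(mean=7.2, sigma=0.4, size=n)         # right-skewed
df = pd.DataFrame({
    "Gr_Liv_Area": area,
    "Tot_Rms": area / 120 + rng.normal(0, 0.5, n),        # collinear with area
    "Lot_Frontage": rng.normal(70, 10, n),
})
df.loc[rng.choice(n, 20, replace=False), "Lot_Frontage"] = np.nan  # missingness
df.loc[0, "Gr_Liv_Area"] *= 10                            # an outlier

# 1. Missing values: impute with the column median.
df["Lot_Frontage"] = df["Lot_Frontage"].fillna(df["Lot_Frontage"].median())

# 2. Skewness: log-transform heavily right-skewed columns.
if df["Gr_Liv_Area"].skew() > 0.75:
    df["Gr_Liv_Area"] = np.log1p(df["Gr_Liv_Area"])

# 3. Multicollinearity: drop one column of any highly correlated pair.
if df.corr().abs().loc["Gr_Liv_Area", "Tot_Rms"] > 0.8:
    df = df.drop(columns=["Tot_Rms"])

# 4. Outliers: clip remaining numeric columns to the 1.5*IQR fences.
q1, q3 = df["Gr_Liv_Area"].quantile([0.25, 0.75])
iqr = q3 - q1
df["Gr_Liv_Area"] = df["Gr_Liv_Area"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(df.isna().sum().sum())  # → 0
```

More thorough diagnostics (e.g. variance inflation factors for multicollinearity) follow the same pattern.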
Actual vs Predicted Sale Price by Regression Model
The most accurate predictions are computed by the Ridge Regression model. This complex model has the lowest Mean Absolute Error ($14,864.57), although its flexibility makes it more susceptible to overfitting, and its Adjusted R-Squared is marginally lower than that of the Huber loss model. The intermediate model, fitted with the Huber loss, satisfactorily balances complexity and explanatory power and is indicative of a good overall fit. The simple model, also fitted with the Huber loss, performs noticeably worse on both metrics. Despite its simplicity, it is likely to underfit unseen data, as too few predictors are available to capture the patterns in the data.
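The three-tier comparison above can be sketched as follows, assuming the simple and intermediate models use the Huber loss and the complex model is Ridge Regression. The synthetic data, feature counts and hyperparameters are illustrative stand-ins, not the report’s actual setup:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import HuberRegressor, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features (500 samples, as in the text).
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def adjusted_r2(r2, n, p):
    # Adjusted R-squared penalises the fit for the number of predictors p.
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# simple: Huber loss on a small feature subset; intermediate: Huber loss on
# all features; complex: Ridge on all features.
models = {
    "simple (Huber, 3 features)": (HuberRegressor(max_iter=1000),
                                   X_train[:, :3], X_test[:, :3]),
    "intermediate (Huber, all features)": (HuberRegressor(max_iter=1000),
                                           X_train, X_test),
    "complex (Ridge, all features)": (Ridge(alpha=1.0), X_train, X_test),
}

results = {}
for name, (model, Xtr, Xte) in models.items():
    model.fit(Xtr, y_train)
    pred = model.predict(Xte)
    mae = mean_absolute_error(y_test, pred)
    adj = adjusted_r2(r2_score(y_test, pred), len(y_test), Xtr.shape[1])
    results[name] = (mae, adj)
    print(f"{name}: MAE={mae:.1f}, adjusted R2={adj:.3f}")
```

On this synthetic data the feature-starved simple model shows the same underfitting pattern as in the report: a markedly higher MAE and lower adjusted R-squared than the full-feature models.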