Predict the value of single-family homes using a machine learning regression model.

Robust-Analytics/zillow-project

Zillow Project

Robust Analytics™

Anthony Straine
Junior Data Scientist

Christopher Ortiz
Junior Data Scientist

Description

Predict the market value of single-unit properties using data on properties that were sold in May and June of 2017.

Summary

Working together, we found for our MVP that the features that appear to drive home value, as measured by taxvaluedollarcnt, are bathroomcnt, bedroomcnt, and calculatedfinishedsquarefeet. We arrived at these through an iterative, manual process of feature selection: a Pearson's r correlation test selected the top two features, bathroomcnt and calculatedfinishedsquarefeet, and industry knowledge led us to also include bedroomcnt and homes having more than 2 bathrooms.
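The correlation step described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the data is randomly generated (not Zillow records), and the use of `scipy.stats.pearsonr` and the names `df` and `ranking` are assumptions. Only the column names come from this README.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (NOT real Zillow values); column names follow the data dictionary.
rng = np.random.default_rng(42)
n = 500
sqft = rng.normal(1800, 500, n).clip(400)
bedrooms = np.round(sqft / 600 + rng.normal(0, 0.7, n)).clip(1)
bathrooms = np.round(sqft / 800 + rng.normal(0, 0.5, n)).clip(1)
value = 150 * sqft + 20000 * bathrooms + rng.normal(0, 100000, n)

df = pd.DataFrame({
    "bathroomcnt": bathrooms,
    "bedroomcnt": bedrooms,
    "calculatedfinishedsquarefeet": sqft,
    "taxvaluedollarcnt": value,
})

target = "taxvaluedollarcnt"
features = [c for c in df.columns if c != target]

# Pearson's r (and its p-value) between each candidate feature and the target.
rows = []
for col in features:
    r, p = stats.pearsonr(df[col], df[target])
    rows.append({"feature": col, "r": r, "p_value": p})

# Rank features by the absolute strength of the linear relationship.
ranking = (
    pd.DataFrame(rows)
    .sort_values("r", key=lambda s: s.abs(), ascending=False)
    .reset_index(drop=True)
)
print(ranking)
```

Ranking by the absolute value of r treats strong negative and strong positive relationships alike. Pearson's r only captures linear association, so a feature with a low r is not necessarily uninformative.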

After testing a few models, we found that a polynomial model performed the best. Columns used in our model (taxvaluedollarcnt is the target, the rest are features):

  • bathroomcnt
  • bedroomcnt
  • calculatedfinishedsquarefeet
  • taxvaluedollarcnt

These features explain 38% of the variance in taxvaluedollarcnt (R² = 0.38). For our next iteration, we will look at additional features while controlling for outliers.
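As a rough sketch of what fitting a polynomial model and reading off the explained variance can look like with scikit-learn: the example below uses randomly generated data (not Zillow records), and the polynomial degree of 2, the 80/20 train/test split, and all variable names are assumptions, since this README does not state them. The printed score will not match the 38% reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (NOT real Zillow values).
rng = np.random.default_rng(0)
n = 1000
sqft = rng.uniform(600, 4000, n)                  # calculatedfinishedsquarefeet
bathrooms = rng.integers(1, 5, n).astype(float)   # bathroomcnt
bedrooms = rng.integers(1, 6, n).astype(float)    # bedroomcnt
X = np.column_stack([bathrooms, bedrooms, sqft])
y = 0.03 * sqft**2 + 15000 * bathrooms + rng.normal(0, 80000, n)  # taxvaluedollarcnt

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Degree 2 is an assumption: expand the features into polynomial terms,
# then fit ordinary least squares on the expanded features.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(X_train, y_train)

# R^2 is the share of variance in the target explained by the model.
r2 = r2_score(y_test, model.predict(X_test))
print(f"R^2 on held-out data: {r2:.2f}")
```

Scoring on held-out rows rather than the training rows gives a more honest estimate, because polynomial terms make it easier to overfit.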

Data Dictionary

| Feature | Definition | Data Type |
|---|---|---|
| id | Row index number, range: 0 - 2985216 | int64 |
| parcelid | Unique numeric id assigned to each property: 10711725 - 169601949 | int64 |
| bathroomcnt | Number of bathrooms a property has: 0 - 32 | float64 |
| bedroomcnt | Number of bedrooms a property has: 0 - 25 | float64 |
| calculatedfinishedsquarefeet | Number of square feet of the property: 1 - 952576 | float64 |
| fips | (FIPS) Five-digit number of which the first two digits are the FIPS code of the state to which the county belongs. The leading 0 is removed from the data: 6037=Los Angeles County, 6059=Orange County, 6111=Ventura County | float64 |
| lotsizesquarefeet | The land the property occupies in square feet: 100 - 371000512 | float64 |
| propertylandusetypeid | Unique numeric id that identifies what the land is used for: 261=Single Family Residential, 262=Rural Residence, 273=Bungalow | float64 |
| roomcnt | Total number of rooms in the principal residence | float64 |
| yearbuilt | Year the property was built | float64 |
| transactiondate | The most recent date the property was sold: yyyy-mm-dd | object |

| Target | Definition | Data Type |
|---|---|---|
| taxamount | The total property tax assessed for that assessment year | float64 |
| taxvaluedollarcnt | The total tax assessed value of the parcel | float64 |
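The codes in the data dictionary can be used to narrow the data to the properties named in the Description. The sketch below is a hedged illustration on a tiny made-up sample (not real records): treating the three listed propertylandusetypeid values as the "single unit" set is an assumption, as are the placeholder code 999 and the variable names.

```python
import pandas as pd

# Tiny made-up sample (NOT real records); column names and codes follow the data dictionary.
# 999.0 is a placeholder for "any other land use code", not a real Zillow code.
df = pd.DataFrame({
    "parcelid": [1, 2, 3, 4, 5],
    "propertylandusetypeid": [261.0, 261.0, 262.0, 261.0, 999.0],
    "transactiondate": ["2017-05-15", "2017-07-01", "2017-06-10", "2017-06-30", "2017-05-20"],
    "fips": [6037.0, 6059.0, 6111.0, 6059.0, 6037.0],
})

# transactiondate is stored as object (string), so parse it before comparing dates.
df["transactiondate"] = pd.to_datetime(df["transactiondate"], format="%Y-%m-%d")

# Assumption: the three ids listed in the data dictionary define "single unit".
single_unit_ids = [261, 262, 273]
is_single_unit = df["propertylandusetypeid"].isin(single_unit_ids)

# Keep sales from May 1 through June 30, 2017 (both bounds inclusive).
in_window = df["transactiondate"].between(
    pd.Timestamp("2017-05-01"), pd.Timestamp("2017-06-30")
)

subset = df[is_single_unit & in_window].copy()

# fips is float64 with the leading 0 dropped, so cast to int before mapping to names.
county_names = {6037: "Los Angeles County", 6059: "Orange County", 6111: "Ventura County"}
subset["county"] = subset["fips"].astype(int).map(county_names)
print(subset)
```

Parcel 2 is dropped because it sold in July, and parcel 5 because its land use code is outside the assumed set.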

Project Organization

├── README.md           <- The top-level README for developers using this project.
│
│
├── mvp.ipynb           <- The main notebook for the project
│
│
├── acquire.py          <- The script to download or generate data
│
├── prepare.py          <- The script for preparing the raw data
│
├── wrangle.py          <- The script for preparing the raw data for exploration
│
├── model.py            <- The script for preprocessing, modeling, and interpreting

Requirements

  • numpy >= 1.18.1
  • pandas >= 1.1.2
  • scipy >= 1.4.1
  • scikit-learn (sklearn) >= 0.23.2
  • matplotlib >= 3.3.1
  • seaborn >= 0.11.0

Setup

  1. Download a zip file of the repository here

  2. Clone this repository using:

$ git clone git@github.com:Robust-Analytics/zillow-project.git

To load the data in a Jupyter notebook, use the following code:

import pandas as pd
df = pd.read_csv('zillow.csv')

Acknowledgements

Contact

How to reach Anthony

How to reach Chris