This repo contains the results of the EDA project in the neuefische Data Science, Machine Learning & AI Bootcamp. It consists of 2 notebooks:
- The EDA notebook itself, containing a classical EDA and a client-focused EDA:
- A presentation notebook that was used to generate the corresponding Jupyter slides for the stakeholder meeting:
There are 3 interesting data insights that might be contrary to common views:
- More rooms do mean a higher price, but the relationship is not as strong as one might expect.
- Older houses are not generally cheaper. The correlation between age and price is almost zero.
- Surprisingly, just like agricultural products, house prices exhibit seasonality effects. We recommend buying in February and avoiding April. We also recommend buying in the middle of the month and avoiding the beginning.
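The first two insights rest on a correlation coefficient, and the third on comparing prices across calendar periods. The following is a minimal, standard-library sketch of both ideas, not the code used in the notebooks. The function names `pearson` and `median_price_by` are hypothetical, and the sample values are made up purely for illustration.

```python
from collections import defaultdict
from datetime import date
from math import sqrt
from statistics import median


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def median_price_by(sales, key):
    """Group (sale_date, price) pairs by key(sale_date) and take the median price."""
    groups = defaultdict(list)
    for sale_date, price in sales:
        groups[key(sale_date)].append(price)
    return {period: median(prices) for period, prices in sorted(groups.items())}


# Made-up toy values, NOT taken from the project data.
bedrooms = [2, 3, 3, 4, 5]
prices = [300_000, 280_000, 450_000, 400_000, 420_000]
print(f"bedrooms vs. price: r = {pearson(bedrooms, prices):.2f}")

toy_sales = [
    (date(2014, 2, 14), 410_000),
    (date(2014, 2, 3), 395_000),
    (date(2014, 4, 2), 470_000),
    (date(2014, 4, 16), 455_000),
]
print(median_price_by(toy_sales, key=lambda d: d.month))  # by month
print(median_price_by(toy_sales, key=lambda d: d.day))    # by day of month
```

Passing the grouping rule as `key` lets the same helper cover both the month-of-year and the day-of-month comparison.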
Based on our client's needs, we recommend low-fluctuation neighborhoods. The plot below shows all zipcode areas, ranked according to their fluctuation. Our client should pick from the neighborhoods on the left-hand side.
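A ranking like this can be sketched in a few lines. The README does not say how fluctuation is measured, so the example below assumes the coefficient of variation (standard deviation divided by mean) of sale prices per zipcode. That definition, the function name `rank_by_fluctuation`, and the toy values are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def rank_by_fluctuation(sales):
    """Return (zipcode, fluctuation) pairs, lowest fluctuation first.

    ASSUMPTION: fluctuation is the coefficient of variation of sale prices
    within a zipcode; the notebooks may use a different measure.
    """
    prices = defaultdict(list)
    for zipcode, price in sales:
        prices[zipcode].append(price)
    scores = {z: pstdev(p) / mean(p) for z, p in prices.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Made-up toy values with placeholder zipcode labels, NOT project data.
toy_sales = [
    ("zip_A", 300_000), ("zip_A", 310_000),
    ("zip_B", 200_000), ("zip_B", 400_000),
]
for zipcode, score in rank_by_fluctuation(toy_sales):
    print(zipcode, round(score, 3))
```

Dividing by the mean keeps expensive and inexpensive areas comparable, which a raw standard deviation would not.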
Instead of specific buying recommendations, we decided to propose the following methodology to our (fictional) client:
- Start with the most affordable house with at least 3 bedrooms and 2 bathrooms.
- Ask yourself: would you be willing to pay more for a neighborhood with lower fluctuation?
The first five results of this procedure are shown in the table below. The least expensive option resulting from this procedure is a house with ID 15796 in Rainier Beach with 5 bedrooms for 133,000 USD. Notice that improving on the neighborhood can mean compromising on other aspects.
| house_id | price (USD) | bedrooms | bathrooms | sqft_living |
|---|---|---|---|---|
| 7129304540 | 133000 | 5 | 2 | 1430 |
| 1823049182 | 147400 | 3 | 2 | 1080 |
| 2976800749 | 150000 | 4 | 2 | 1460 |
| 3356403304 | 154000 | 3 | 3 | 1530 |
| 7129300595 | 158000 | 3 | 2 | 1090 |
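The first step of the procedure (filter by minimum bedrooms and bathrooms, then sort by price) can be sketched as follows. This is an illustrative reconstruction, not the notebook code: the `House` type and `affordable_candidates` function are hypothetical names, and the input is just the five rows from the table above, deliberately listed out of order.

```python
from typing import NamedTuple


class House(NamedTuple):
    house_id: int
    price: float
    bedrooms: int
    bathrooms: float
    sqft_living: int


def affordable_candidates(houses, min_bedrooms=3, min_bathrooms=2):
    """Keep houses meeting the room requirements, cheapest first."""
    eligible = [
        h for h in houses
        if h.bedrooms >= min_bedrooms and h.bathrooms >= min_bathrooms
    ]
    return sorted(eligible, key=lambda h: h.price)


# The five rows from the table above, shuffled so the sort is visible.
houses = [
    House(3356403304, 154000, 3, 3, 1530),
    House(7129304540, 133000, 5, 2, 1430),
    House(7129300595, 158000, 3, 2, 1090),
    House(1823049182, 147400, 3, 2, 1080),
    House(2976800749, 150000, 4, 2, 1460),
]

for house in affordable_candidates(houses):
    print(house.house_id, house.price, house.bedrooms, house.bathrooms)
```

The second step, trading price against neighborhood fluctuation, would then walk down this list and compare each candidate's zipcode, which is not part of the table and is therefore left out here.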
This repo contains a requirements.txt file with a list of all the packages and dependencies you will need.
Before you can start with Plotly in JupyterLab, you have to install Node.js (if you haven't done so already). Check the Node version by running the following command:
node -v
If you haven't installed it yet, begin at Step_1. Otherwise, proceed to Step_2.
Step_1 (macOS):
Update Homebrew and install Node with the following commands:
brew update
brew install node
Step_2 (macOS):
Create the virtual environment and install the required packages with the following commands:
pyenv local 3.11.3
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
Step_1 (Windows):
Update Chocolatey and install Node with the following commands:
choco upgrade chocolatey
choco install nodejs
Step_2 (Windows):
Create the virtual environment and install the required packages with the following commands.
For PowerShell CLI:
pyenv local 3.11.3
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install --upgrade pip
pip install -r requirements.txt
For Git-Bash CLI:
pyenv local 3.11.3
python -m venv .venv
source .venv/Scripts/activate
pip install --upgrade pip
pip install -r requirements.txt
Note:
If you encounter an error when trying to run pip install --upgrade pip, try the following command instead:
python.exe -m pip install --upgrade pip
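After either setup path, a quick sanity check can confirm that the interpreter, the virtual environment, and Node are in place. The script below is a small optional sketch and not part of the repo. The function name `environment_report` is hypothetical, and treating any Python 3.11 or newer as acceptable is an assumption (the commands above pin 3.11.3).

```python
import shutil
import sys


def environment_report(required=(3, 11)):
    """Report whether the basic prerequisites from this README are met.

    ASSUMPTION: any interpreter at or above `required` (major, minor) is fine.
    """
    return {
        # Interpreter version, compared as a (major, minor) tuple.
        "python_ok": sys.version_info[:2] >= required,
        # Inside a venv, sys.prefix differs from sys.base_prefix.
        "in_venv": sys.prefix != sys.base_prefix,
        # Node must be on PATH for the Plotly / JupyterLab setup.
        "node_found": shutil.which("node") is not None,
    }


if __name__ == "__main__":
    for check, passed in environment_report().items():
        print(f"{check}: {'ok' if passed else 'MISSING'}")
```

Run it with the activated environment's `python`. If `in_venv` is reported as missing, the activation command for your shell was most likely skipped.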