King County

House Price

Build a housing price prediction model using Python.

Business Requirement

We have housing sales data for King County covering 2014-2015. Prices vary widely: the gap between the highest and lowest sale prices is $7,625,000.

This analysis is meant to help homebuyers understand how prices differ across areas of King County and which factors affect housing prices most, so that they can better assess the price of a house they are interested in purchasing.

Skills

Data wrangling and merging

Geographic Visualization in Python

Machine Learning: Regression

Machine Learning: Clustering

Sourcing & Analyzing Time Series Data

Select and build a model

Tools

PowerPoint

Python

Tableau

Excel

Analysis Process

  1. Clean and integrate the dataset.

  2. Geographic Factors Analysis: analyze the differences in housing prices across zip code areas.

  3. Exploratory Data Analysis: analyze the correlation between each variable and price.

  4. Select and build a model to reasonably predict housing prices.
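The first step of the process above could look roughly like the following. This is a minimal sketch, not the project's actual cleaning code: the column names (`id`, `date`, `price`, `zipcode`), the date format, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd

# Tiny made-up sample; the real sales file and its column names are assumed.
raw = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "date": ["2014-05-02", "2014-05-02", "2015-01-15", "2014-11-30"],
    "price": [300000.0, 300000.0, None, 550000.0],
    "zipcode": [98002, 98002, 98004, 98039],
})


def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows and rows without a price, then fix dtypes."""
    out = df.drop_duplicates().dropna(subset=["price"]).copy()
    out["date"] = pd.to_datetime(out["date"])
    # Zip codes are labels, not quantities, so store them as strings.
    out["zipcode"] = out["zipcode"].astype(str)
    return out.reset_index(drop=True)


clean = clean_sales(raw)
print(clean)
```

Storing zip codes as strings keeps later group-by and map steps from treating them as numeric values.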

Analysis & Visualization

Geographic Factors Analysis

Using Tableau's map feature, we display the median housing price in each zip code area and pair the map with a bar chart that shows the specific figures. This makes it easy to see which areas have high housing prices and which have low ones.

Note: This process can also be completed using Python. Please refer to the link. [Code]
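As a rough illustration of the Python route, the median price per zip code can be computed with a pandas group-by. The column names `zipcode` and `price` are assumptions, and the prices below are made-up numbers, not values from the dataset.

```python
import pandas as pd

# Made-up prices for illustration only; column names are assumed.
sales = pd.DataFrame({
    "zipcode": ["98039", "98039", "98004", "98004", "98168", "98168", "98168"],
    "price": [1_800_000, 2_200_000, 1_100_000, 1_500_000, 210_000, 250_000, 230_000],
})

# Median sale price per zip code, most expensive first.
median_by_zip = (
    sales.groupby("zipcode")["price"]
    .median()
    .sort_values(ascending=False)
)
print(median_by_zip)
```

The resulting series is the table a map or bar chart would be drawn from, whether in Tableau or in a Python mapping library.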

Exploratory Data Analysis

We use Python to organize the data and Tableau to visualize it. We create a correlation matrix of the variables and housing price, then select the three variables with the highest correlation coefficients with price for further analysis.
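The correlation step can be sketched in pandas as follows. The data here is synthetic and generated only so the example runs; the column names (`sqft_living`, `grade`, `bathrooms`, `yr_built`, `price`) are assumptions about how the dataset is laid out.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration; not the King County dataset.
rng = np.random.default_rng(0)
n = 200
sqft_living = rng.uniform(500, 4000, n)
df = pd.DataFrame({
    "sqft_living": sqft_living,
    "grade": np.round(sqft_living / 400 + rng.normal(0, 1, n)),
    "bathrooms": np.round(sqft_living / 1000 + rng.normal(0, 0.5, n)),
    "yr_built": rng.integers(1900, 2015, n),
    "price": 250 * sqft_living + rng.normal(0, 50_000, n),
})

# Pearson correlation matrix of all numeric columns.
corr_matrix = df.corr()

# Correlation of every other variable with price, highest first.
price_corr = corr_matrix["price"].drop("price").sort_values(ascending=False)
top3 = price_corr.head(3)
print(top3)
```

Sorting by the raw coefficient follows the description above; sorting by absolute value instead would also surface strong negative correlations.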


For the continuous variable, living square footage (living-SQFT), we create a scatter plot and fit a linear regression to analyze its relationship with price. We also add zip code as a filter so the relationship can be examined within specific zip code areas.
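The regression behind the scatter plot, with and without a zip code filter, might be sketched like this. The numbers are made up and the column names are assumed; `fit_line` is a hypothetical helper, not part of the original project.

```python
import numpy as np
import pandas as pd

# Made-up sales in two zip codes; values are illustrative only.
sales = pd.DataFrame({
    "zipcode": ["98004"] * 4 + ["98168"] * 4,
    "sqft_living": [1000, 2000, 3000, 4000, 800, 1200, 1600, 2000],
    "price": [450_000, 720_000, 1_050_000, 1_300_000,
              170_000, 215_000, 240_000, 290_000],
})


def fit_line(df: pd.DataFrame) -> tuple[float, float, float]:
    """Least-squares line price = slope * sqft + intercept, plus Pearson r."""
    slope, intercept = np.polyfit(df["sqft_living"], df["price"], deg=1)
    r = np.corrcoef(df["sqft_living"], df["price"])[0, 1]
    return float(slope), float(intercept), float(r)


# Fit on all sales, then on each zip code separately (the "filter").
overall = fit_line(sales)
by_zip = {z: fit_line(group) for z, group in sales.groupby("zipcode")}

print("all zip codes:", overall)
for z, result in by_zip.items():
    print(z, result)
```

In this toy data the per-zip fits are tighter than the pooled fit because the two areas have different price levels per square foot.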

For the discrete variables, such as the number of bathrooms and grade, we use boxplots to analyze the variation in housing price distribution across different categories.
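The numbers a boxplot displays (quartiles and interquartile range per category) can be computed directly, as in this sketch. The prices are made up and the column names `bathrooms` and `price` are assumptions.

```python
import pandas as pd

# Made-up prices grouped by bathroom count; for illustration only.
sales = pd.DataFrame({
    "bathrooms": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "price": [200_000, 250_000, 300_000, 350_000,
              300_000, 400_000, 500_000, 600_000,
              400_000, 600_000, 800_000, 1_000_000],
})

# Quartiles of price within each bathroom category.
summary = (
    sales.groupby("bathrooms")["price"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
summary.columns = ["q1", "median", "q3"]

# The interquartile range is the height of the box in a boxplot.
summary["iqr"] = summary["q3"] - summary["q1"]
print(summary)
```

With matplotlib installed, `sales.boxplot(column="price", by="bathrooms")` draws the corresponding plot; the same pattern applies to grade.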

Select and build a model

We use Python to test candidate models and select one. We determined that Random Forest is the most suitable model.
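One common way to compare candidate models is cross-validation with scikit-learn, sketched below. This is an assumption-laden illustration: the source does not state which models were compared or which metric was used, so the linear regression baseline, the R² score, the three features, and the synthetic data are all stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic features and prices; not the King County dataset.
rng = np.random.default_rng(42)
n = 300
sqft = rng.uniform(500, 4000, n)
grade = rng.integers(3, 13, n)
bathrooms = rng.integers(1, 5, n)
X = np.column_stack([sqft, grade, bathrooms])
y = (150 * sqft * (1 + 0.1 * (grade - 7))
     + 15_000 * bathrooms
     + rng.normal(0, 30_000, n))

# Candidate models (assumed; the original comparison set is not given).
candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean R^2 over 3 shuffled folds for each candidate.
cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

Which model scores best depends on the data; the sketch only shows the selection procedure, not the project's result.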

Visualize the model results using Tableau.

Conclusion

1. There are significant differences in housing prices across zip code areas in King County. The most expensive areas are 98039, 98004, and 98040, around the lake, while the cheapest areas are 98168, 98002, and 98032, in the southwest corner.

2. Housing prices have a strong positive correlation with living square footage (living-SQFT). In specific zip code areas, this correlation becomes even stronger.

3. Housing prices are positively correlated with the number of bathrooms, and as the number of bathrooms increases, the price variability also becomes larger.

4. Housing prices are positively correlated with grade, and as the grade increases, the price variability also becomes larger.

5. Based on the above factors, we have developed a housing price prediction model that can serve as a reference for homebuyers.