Modeling Real Estate Price (Ames, IA)

Understanding Real Estate Pricing Through Data

Try the Predictor

The intention of this project is to transform and utilize the Ames Housing Dataset to understand how different property features affect Sale Price.

You can find the source code and project files on GitHub:

View GitHub Repository

The Ames Housing dataset contains 2,930 observations of residential properties sold in Ames, IA between 2006 and 2010.

Data Structure


To anchor this project, we can make the following assumption using domain knowledge:

This can be translated into the following Null Hypotheses:


Data Treatment & Cleaning


1. Calculating Growth Rate Using Median Sale Price (2006–2024)

The growth rate of property values was calculated using the compound annual growth rate (CAGR) formula:
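In standard form, where V_2006 and V_2024 are the median sale prices in those years and n = 18 is the number of years between them:

CAGR = (V_2024 / V_2006)^(1/n) - 1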

2. Compounding Sale Price Using Growth Rate

Using the calculated growth rate, the future value of sale prices was forecasted using the formula:
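Applied per property, assuming the adjustment horizon runs from each property's sale year to 2024:

AdjustedSalePrice = SalePrice × (1 + CAGR)^(2024 - Yr Sold)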

3. Cleaning Data
4. Encoding Features
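Since steps 3 and 4 are listed only by name, here is a minimal pandas sketch of how steps 1–4 might look end to end. The file name, the placeholder medians, and the imputation and encoding choices are illustrative assumptions, not the project's actual code:

import pandas as pd

# Load the Ames Housing data (file name is illustrative)
df = pd.read_csv('ames_housing.csv')

# 1. Growth rate: CAGR from the 2006 and 2024 median sale prices
median_2006, median_2024 = 160_000, 280_000   # placeholder medians
cagr = (median_2024 / median_2006) ** (1 / 18) - 1

# 2. Compound each sale price forward to 2024 dollars
df['AdjustedSalePrice'] = df['SalePrice'] * (1 + cagr) ** (2024 - df['Yr Sold'])

# 3. Cleaning: fill numeric gaps with column medians (one plausible choice)
num_cols = df.select_dtypes('number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# 4. Encoding: one-hot encode the remaining categorical columns
df = pd.get_dummies(df, drop_first=True)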

Understanding the Dataset

To gain a contextual understanding of the dataset and sale price, let's first compare the summary statistics of the original sale price and the adjusted sale price.

Descriptive Statistics for Original Sale Price
Descriptive Statistics for Adjusted Sale Price

Observing these two charts, we can identify significant outliers at both ends of the spectrum. However, typical property prices range between $130-200k for the original sale price and $200-300k for the adjusted sale price.

From this information alone, we can reasonably assume that Ames properties fall into the affordable housing category. Note that from this point forward, any mention of sale price refers to the adjusted sale price.

Let's also visualize the distribution of the sale price to see if we can glean any additional insights:

Distribution of Sale Price

Based on the distribution curve, we can observe:

Now let's visualize the relationship between TotalSQFT and Sale Price:

Scatter Plot of Total SQFT vs Sale Price

Based on the scatter plot above, we can observe:

Next, let's take a look at the relationship between neighborhoods and sale price:

Box Plot of Neighborhood and Sale Price

Based on the boxplot above, we can observe:

Moving on, let's visualize the relationship between overall quality and sale price:

Box Plot of Overall Quality and Sale Price

Based on the boxplot above, we can observe:

Moving on to Year Built:

Box Plot of Year Built and Sale Price

Based on the box plot above, we can observe:

Lastly, let's visualize the relationship between garage capacity and sale price:

Scatter Plot of Garage Cars vs Sale Price

Based on the scatter plot above, we can observe:


Data Modeling

The goal of this section is to go beyond simple visualizations and statistical tests to uncover the nuanced relationships between the independent variables and the dependent variable (sale price). This dataset includes many correlated variables (e.g., Garage Cond, Garage Area, Garage Cars, Garage Qual), which can introduce multicollinearity in regression models. So the first task on the agenda is to handle multicollinearity.

This is the original list of numerical variables:

Lot Frontage      float64
Lot Area          int64
Overall Qual      int64
Overall Cond      int64
Year Built        int64
Year Remod/Add    int64
Mas Vnr Area      float64
BsmtFin SF 1      float64
BsmtFin SF 2      float64
Bsmt Unf SF       float64
Total Bsmt SF     float64
1st Flr SF        int64
2nd Flr SF        int64
Low Qual Fin SF   int64
Gr Liv Area       int64
Bsmt Full Bath    float64
Bsmt Half Bath    float64
Full Bath         int64
Half Bath         int64
Bedroom AbvGr     int64
Kitchen AbvGr     int64
TotRms AbvGrd     int64
Fireplaces        int64
Garage Yr Blt     float64
Garage Cars       float64
Garage Area       float64
Wood Deck SF      int64
Open Porch SF     int64
Enclosed Porch    int64
3Ssn Porch        int64
Screen Porch      int64
Pool Area         int64
Misc Val          int64
Mo Sold           int64
Yr Sold           int64
SalePrice         int64
        

However, if you recall, we engineered two unique variables from their counterparts: TotalSQFT and TotalBaths. With these engineered variables in place, we can remove the following source variables (a sketch of one plausible construction follows the list below):

BsmtFin SF 1      float64
BsmtFin SF 2      float64
Bsmt Unf SF       float64
Total Bsmt SF     float64
1st Flr SF        int64
2nd Flr SF        int64
Low Qual Fin SF   int64
Gr Liv Area       int64
Bsmt Full Bath    float64
Bsmt Half Bath    float64
Full Bath         int64
Half Bath         int64
        
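For reference, here is a plausible construction of the two engineered variables; the exact components and the half-bath weighting are assumptions, not the project's documented definitions:

# TotalSQFT: above-ground living area plus total basement area (assumed)
df['TotalSQFT'] = df['Gr Liv Area'] + df['Total Bsmt SF']

# TotalBaths: full baths count as 1, half baths as 0.5 (assumed convention)
df['TotalBaths'] = (df['Full Bath'] + df['Bsmt Full Bath']
                    + 0.5 * (df['Half Bath'] + df['Bsmt Half Bath']))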

After removal, we are left with:

Bedroom AbvGr     int64
Kitchen AbvGr     int64
TotRms AbvGrd     int64
Fireplaces        int64
Garage Yr Blt     float64
Garage Cars       float64
Garage Area       float64
Wood Deck SF      int64
Open Porch SF     int64
Enclosed Porch    int64
3Ssn Porch        int64
Screen Porch      int64
Pool Area         int64
Misc Val          int64
Mo Sold           int64
Yr Sold           int64
SalePrice         int64
Year Remod/Add    int64
Mas Vnr Area      float64
Lot Frontage      float64
Lot Area          int64
Overall Qual      int64
Overall Cond      int64
        

Based on the remaining numerical variables, only one pair exhibits multicollinearity (corr > 0.7): Garage Cars and Garage Area. As a solution, we can drop the Garage Area variable from our variable list.
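As a quick sketch, high-correlation pairs can be flagged like this, assuming df_num holds the remaining numerical variables:

import numpy as np

corr = df_num.corr().abs()
# keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs > 0.7])   # flags the Garage Cars / Garage Area pair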

Now that we've handled multicollinearity, let's start with a basic regression model to establish a baseline:
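The baseline was presumably fit along these lines; this is a statsmodels sketch under that assumption, not the project's exact code:

import statsmodels.api as sm

y = df['AdjustedSalePrice']
X = sm.add_constant(df.drop(columns=['AdjustedSalePrice', 'SalePrice']).astype(float))
baseline = sm.OLS(y, X).fit()
print(baseline.summary())   # R^2, F-statistic, per-coefficient p-values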

Regression Result

Dependent Variable: The model is designed to predict AdjustedSalePrice

R-squared (0.930): This metric quantifies the proportion of variability in AdjustedSalePrice explained by the predictors in the model. A value of 93% indicates the model provides an excellent fit to the data. However, this high value warrants scrutiny for potential overfitting, especially with a large number of predictors.

Adjusted R-squared (0.924): This adjusted metric accounts for the inclusion of 191 predictors, penalizing the addition of variables that do not significantly improve the model's explanatory power. The minimal difference between R-squared and Adjusted R-squared suggests that most predictors contribute meaningful information, though some may still be redundant or insignificant.

F-statistic (148.0) and Prob (F-statistic = 0.00):

  • The F-statistic tests the joint null hypothesis that all regression coefficients (excluding the intercept) are zero.
  • A high value of 148 and a p-value of 0.00 strongly reject this null hypothesis, indicating that at least one predictor significantly explains variation in AdjustedSalePrice.
  • The magnitude of the F-statistic reflects the collective explanatory strength of the model.

Number of Observations (2326): The sample size provides robust statistical power, reducing the likelihood of Type II errors. However, with 191 predictors, the ratio of predictors to observations (approximately 1:12) should be monitored to avoid overparameterization.

Degrees of Freedom (Df):

  • Df Residuals (2134): Indicates the number of independent observations remaining after estimating 192 parameters (191 predictors plus the intercept). A large number of residual degrees of freedom ensures stable parameter estimates.
  • Df Model (191): Represents the number of predictors, reflecting the model's complexity. With this many predictors, multicollinearity and overfitting risks must be addressed.

The OLS regression model provides a strong explanation of the variability in AdjustedSalePrice, with an R^2 of 93% and an adjusted R^2 of 92.4%, indicating excellent predictive power while accounting for the large number of predictors. The model is statistically significant overall, as evidenced by an F-statistic of 148 and a corresponding p-value of 0.00, which strongly rejects the null hypothesis that all predictors have no effect on AdjustedSalePrice. However, the complexity of the model, with 191 predictors and a predictor-to-observation ratio of approximately 1:12, warrants careful evaluation of multicollinearity and overfitting.

To refine this model, let's build the model again, this time removing insignificant variables (those with p-values greater than 0.05).

Based on the results of the initial regression model, the following variables have p-values > 0.05 (and will therefore be dropped):

Lot Shape
Utilities
Land Slope
Year Remod/Add
Exter Cond
Bsmt Cond
BsmtFin Type 2
Electrical
Kitchen AbvGr
Fireplace Qu
Garage Yr Blt
Garage Finish
Garage Qual
Garage Cond
Paved Drive
Open Porch SF
Enclosed Porch
3Ssn Porch
Fence
Misc Val
Mo Sold
MS Zoning
Street
Alley
Roof Style
Heating
Misc Feature
        
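One way this filter might be implemented, assuming baseline is the fitted model from earlier (in practice, all dummy columns belonging to a dropped categorical feature would be removed together):

# drop predictors whose coefficients are not significant at the 5% level
pvals = baseline.pvalues.drop('const')
to_drop = pvals[pvals > 0.05].index
refined = sm.OLS(y, X.drop(columns=to_drop)).fit()
print(refined.summary())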

After dropping the insignificant variables and re-running the model, this is the result:

Refined Regression Result

After removing the redundant variables, the predictive power of the model remains largely unaffected, as evidenced by the negligible difference between the R^2 and Adjusted R^2. This demonstrates that the excluded variables contributed little to explaining the variance in the dependent variable, reaffirming the validity of the remaining predictors. Additionally, the reduction in the number of parameters from 191 to 150 has significantly simplified the model, improving its interpretability and reducing the risk of overfitting. By focusing only on the most meaningful variables, the model strikes a balance between simplicity and statistical robustness, ensuring it remains both computationally efficient and generalizable to new data.

While reducing the model parameters to 150 is a step in the right direction, it is still relatively high for practical use, particularly if the goal is to serialize the model for deployment in real-world applications, such as predicting property prices using Zillow data. A simpler, more parsimonious model would streamline predictions and ensure ease of implementation in systems with limited computational resources. However, this simplification often comes with a trade-off—sacrificing some predictive power for practicality. Therefore, it is crucial to evaluate priorities: whether to prioritize a complex yet highly accurate model or a more practical, lightweight model that balances efficiency and usability.

My next step in simplifying the model involves removing all categorical variables from the regression. Based on my observations, the majority of categorical variables are statistically insignificant, with only a few showing meaningful relationships with the dependent variable. Retaining these variables adds unnecessary complexity without substantially improving the model's predictive power. By excluding them, the model can focus solely on the numerical predictors that are more relevant and impactful, further simplifying its structure while maintaining its validity for practical applications. This approach will also streamline model serialization and deployment for real-world use cases, such as property price prediction.

After dropping all categorical variables and the remaining insignificant numerical variables, this is the result of the regression:

Refined Numerical Regression Result

We successfully reduced the number of predictors from 191 to just 17, significantly simplifying the model's structure. The R^2 value is now 87.1%, with an Adjusted R^2 of 87%, indicating a slight reduction in predictive power. However, considering the drastic reduction in complexity, this sacrifice is relatively minor and well worth the improved practicality. This streamlined model is far more suitable for real-world applications, such as predicting property prices. The next step is to assess whether these remaining predictors correspond to information readily available for properties listed on Zillow, ensuring the model's feasibility for deployment.

Using this Zillow listing as an example, the following table highlights how the property details from Zillow are mapped to the regression model variables:

Model Variable    Zillow Information   Notes
Year Built        1975                 Directly listed in Zillow.
Bedroom AbvGr     3                    Listed under Bedrooms.
Kitchen AbvGr     1                    Assumed to be 1 based on data.
TotalSQFT         2,050 sqft           Sum of above and below ground living areas.
TotalBaths        2.0                  Calculated from full and 3/4 bathrooms.
TotRms AbvGrd     6                    Counted based on listed main-level rooms.
Fireplaces        0                    Assumed absent as not mentioned.
Garage Yr Blt     1975                 Assumed to match Year Built.
Garage Cars       1                    Assumed based on garage presence.
Wood Deck SF      Unknown              Listed as a feature but no size provided.
Screen Porch      0                    Assumed absent as not mentioned.
Yr Sold           2025                 Based on listing year.
Year Remod/Add    1975                 Assumed no remodeling occurred.
Mas Vnr Area      0                    Assumed no masonry veneer.
Lot Frontage      Unknown              Not listed; may need estimation.
Lot Area          10,018 sqft          Directly listed in Zillow.
Overall Qual      Unknown              Subjective; requires estimation.

Based on the mapping above, it appears that only Lot Frontage and Wood Deck SF should be considered for removal, as they lack a clear assumption or direct mapping from available property data (e.g., Zillow). While they could be imputed with median values to retain consistency in the dataset, doing so might introduce noise or reduce model interpretability. Instead, excluding Lot Frontage and Wood Deck SF simplifies the model further while sacrificing only a small amount of predictive power (R^2 drops by just 0.004), ensuring that the model remains practical and easy to deploy with real-world property data.

Now that we've settled on the significant predictors, let's experiment with various predictive models to see which yields the highest R^2 before we serialize the model.
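Here is a sketch of how such a comparison might be run; the candidate models besides XGBoost, and X_final (the matrix of the final predictors), are assumptions for illustration:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

X_train, X_test, y_train, y_test = train_test_split(
    X_final, y, test_size=0.2, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, r2_score(y_test, model.predict(X_test)))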

Model Comparison Results

Given that XGBoost outperformed other models, the next step is to serialize the XGBoost model using joblib. Once serialized, we can turn it into an API that allows you to input property details, such as those from Zillow, and predict property prices seamlessly. This approach ensures that the model is both portable and easily deployable for real-world use.
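Serialization itself is a one-liner with joblib; the file name and new_listing_features are illustrative placeholders:

import joblib

joblib.dump(models['XGBoost'], 'ames_xgb_model.joblib')

# later, e.g. inside the prediction API
model = joblib.load('ames_xgb_model.joblib')
# new_listing_features: a DataFrame with the same columns as X_final
predicted_price = model.predict(new_listing_features)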

But let's step back and ask: what makes this process worth our time? What value can we truly extract from this dataset?

Despite the challenges of predicting current property prices due to limited data, this dataset offers invaluable insights into property valuation. By examining key features, we can form a practical, value-oriented strategy for identifying properties with strong investment potential.

Here's how each variable contributes to the story:

  • Bedrooms Above Ground, Kitchens Above Ground, Total Rooms Above Ground: These variables provide insight into a home's livable space. More rooms, especially in well-designed layouts, often attract higher sale prices due to greater functional appeal for families.
  • Fireplaces: Fireplaces add a traditional and aesthetic value to properties. Homes with this feature can stand out in the market, particularly in colder climates.
  • Garage (Year Built, Car Capacity, Area): The presence and quality of a garage significantly impact the property's value. Larger garages with multiple car capacity are appealing in suburban and high-income areas where vehicle ownership is common.
  • Wood Deck SF, Open Porch SF, Enclosed Porch, 3-Season Porch, Screen Porch: Outdoor living spaces, such as decks and porches, contribute to a property's recreational appeal. Well-maintained outdoor areas often increase buyer interest and perceived home value.
  • Pool Area: Pools are luxury features that can substantially increase property value in warm climates but might not add as much value in colder regions where maintenance costs deter potential buyers.
  • Miscellaneous Value: This represents other unique property features, such as sheds, fences, or custom landscaping, which may influence sale price depending on buyer priorities.
  • Month Sold, Year Sold: Seasonality plays a key role in real estate markets. Properties sold during peak seasons (e.g., spring and summer) tend to fetch higher prices due to increased demand.
  • Year Remodeled/Added: Recent renovations increase property appeal and marketability. Modern upgrades often translate into higher sale prices, particularly for older homes.
  • Lot Frontage and Lot Area: The size and frontage of a property affect both development potential and perceived spaciousness. Larger lots typically command higher prices, especially in desirable neighborhoods.
  • Overall Quality and Condition: These comprehensive measures assess a home's construction quality and current state. Properties with high-quality materials and excellent maintenance are more likely to sell at a premium.

By using these variables, investors and real estate professionals can create a structured approach to property selection. For instance, homes with favorable combinations of large lot sizes, recent renovations, and multiple amenities—such as garages and porches—are statistically likely to offer better resale value. This allows us to strategically prioritize which properties to renovate, acquire, or flip for maximum profitability.

Although precise predictions remain challenging due to dataset limitations, the ability to extract meaningful patterns from these variables helps guide informed decision-making in real estate investment, making this data analysis a valuable tool.

Limitations of Domain Knowledge in Implementation

While identifying statistically relevant features is crucial, the effective application of this information requires domain-specific knowledge. Real estate markets vary widely across regions, and the significance of certain features can differ depending on local buyer preferences, economic conditions, and regulatory factors. For instance, the value added by a pool or outdoor living area may be substantial in warmer climates but negligible in colder areas where these features are rarely used.

Additionally, factors such as zoning laws, school districts, and proximity to amenities are not captured within this dataset but are often critical to a property's value. Without an understanding of these domain-specific influences, our analysis risks misinterpreting or oversimplifying the data's implications.

Therefore, the key takeaway is that data analysis alone is not enough; success depends on combining statistical insights with expert knowledge of the real estate market. By marrying these two perspectives, investors can make better-informed decisions, minimizing risk and maximizing the potential for high returns.