Predicting housing prices

XGBoost Regressor predictions

Using the Kaggle housing dataset, I practiced applying machine learning to predict housing prices.


Kaggle notebook

Tableau dashboard


Here is an outline of the steps I took to perform the analysis:

  1. Data Exploration: Here are some of the graphs I found helpful while exploring the data.

    1. Histogram of Home Sales Prices

       import seaborn as sns
       import matplotlib.pyplot as plt

       sns.displot(df_data['SalePrice'],
                   bins=50,
                   aspect=2,
                   kde=True,
                   color='darkblue')

       plt.title(f'Home Sales Price. Average: ${df_data.SalePrice.mean():,.0f}')
       plt.xlabel('Price ($)')
       plt.ylabel('# of Homes')

       plt.show()
      
    2. Correlation heat map - This helps spot correlations between features and the target variable SalePrice.
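
       A quick sketch of one way to draw this heat map (pairwise correlations over the numeric columns only; df_data is assumed to be loaded as above):

       sns.heatmap(df_data.select_dtypes(include=['number']).corr(),
                   cmap='coolwarm',
                   center=0)
       plt.title('Feature Correlations')
       plt.show()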

    3. Average Sale Prices over time ⬆️
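
       A rough sketch of the trend plot, assuming the dataset's YrSold column holds the year each home was sold:

       # Average sale price for each year in the data
       df_data.groupby('YrSold')['SalePrice'].mean().plot(marker='o')
       plt.title('Average Sale Price by Year Sold')
       plt.ylabel('Price ($)')
       plt.show()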

    4. 📊 I also created a Tableau dashboard to help me visualize the data. This wasn't strictly necessary, but in the past I have used dashboards like this to spot correlations or relationships.

  2. Data Cleaning

    1. Created the X (predictor) and y (target) variables.
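
       A minimal sketch, assuming df_data holds the training data and SalePrice is the target column:

       X = df_data.drop('SalePrice', axis=1)
       y = df_data['SalePrice']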

    2. Train/test split
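
       For example, holding out part of the data for validation (the 80/20 split and random_state are assumptions; the X_train / X_valid names match the modeling code below):

       from sklearn.model_selection import train_test_split

       # Hold out 20% of the rows as a validation set
       X_train, X_valid, y_train, y_valid = train_test_split(
           X, y, test_size=0.2, random_state=0)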

    3. Handle missing data and transform columns

       # Handle missing data
       from sklearn.impute import SimpleImputer
       from sklearn.pipeline import Pipeline
       from sklearn.preprocessing import StandardScaler, OneHotEncoder
       from sklearn.compose import ColumnTransformer

       numeric_cols = X.select_dtypes(include=['number']).columns
       categorical_cols = X.select_dtypes(exclude=['number']).columns

       numeric_imputer = SimpleImputer(strategy='mean')
       categorical_imputer = SimpleImputer(strategy='most_frequent')

       # Create transformers for preprocessing
       numeric_transformer = Pipeline(steps=[
           ('imputer', numeric_imputer),
           ('scaler', StandardScaler())
       ])

       categorical_transformer = Pipeline(steps=[
           ('imputer', categorical_imputer),
           ('encoder', OneHotEncoder(handle_unknown='ignore'))
       ])

       # Use ColumnTransformer to apply each transformer to the appropriate columns
       preprocessor = ColumnTransformer(
           transformers=[
               ('num', numeric_transformer, numeric_cols),
               ('cat', categorical_transformer, categorical_cols)
           ])
      
  3. XGBoost model and predictions

     from xgboost import XGBRegressor
     from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

     # Create an XGBoost regressor
     xgb = XGBRegressor(n_estimators=500, learning_rate=0.04)

     # Bundle preprocessing and modeling code in a pipeline
     my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                   ('model', xgb)
                                  ])

     # Preprocess the training data and fit the model
     my_pipeline.fit(X_train, y_train)

     # Preprocess the validation data and get predictions
     preds = my_pipeline.predict(X_valid)

     # Evaluate the model
     mae = mean_absolute_error(y_valid, preds)
     mse = mean_squared_error(y_valid, preds)
     r2 = r2_score(y_valid, preds)
     print(f"Mean Absolute Error (MAE): {mae:.2f}")
     print(f"Mean Squared Error (MSE): {mse:.2f}")
     print(f"R-squared (R2): {r2:.2f}")
    
  4. Hyper-parameter tuning

    1. Setting up parameters to tune
param_tuning = {
    'model__learning_rate': [0.01, 0.1, 0.05],
    'model__max_depth': [3, 5, 7, 10],
    'model__min_child_weight': [1, 3, 5],
    'model__subsample': [0.5, 0.7],
    'model__colsample_bytree': [0.5, 0.7],
    'model__n_estimators': [100, 200, 500, 1000],
    'model__objective': ['reg:squarederror']
}
    2. Grid Search
from sklearn.model_selection import GridSearchCV

xgb_model = XGBRegressor()
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', xgb_model)
                             ])

xgb_cv = GridSearchCV(estimator=my_pipeline,
                      param_grid=param_tuning,
                      scoring='neg_mean_absolute_error',  # MAE
                      #scoring='neg_mean_squared_error',  # MSE
                      cv=5)

xgb_cv.fit(X_train, y_train)
print("Best Score: ", xgb_cv.best_score_)   # negated MAE, so closer to 0 is better
print("Best Params: ", xgb_cv.best_params_)

Hyper-parameter tuning helped decrease the model error 🎉
