# Predicting housing prices

Using the Kaggle housing [dataset](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data), I practiced using machine learning to predict housing prices.

---

### 🔗 Links

Kaggle [notebook](https://www.kaggle.com/code/glenn23/housing-prices/notebook)

Tableau [dashboard](https://public.tableau.com/app/profile/donald.tucker4155/viz/HousingPrices_16960239787650/Sheet12)

---

Here is an outline of the steps I took to perform the analysis:

1. Data Exploration: Here are some of the helpful graphs created while exploring the data.
    
    1. Histogram of Home Sales Prices
        
        ```python
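        import matplotlib.pyplot as plt
        import seaborn as sns

        # df_data holds the housing data, including the SalePrice column
        sns.displot(df_data['SalePrice'],
                    bins=50,
                    aspect=2,
                    kde=True,
                    color='darkblue')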
        
        plt.title(f'Home Sales Price. Average: ${(df_data.SalePrice.mean()):,.0f}')
        plt.xlabel('Price ($)')
        plt.ylabel('# of Homes')
        
        plt.show()
        ```
        
    2. ![](https://cdn.hashnode.com/res/hashnode/image/upload/v1704131992758/c0a8d540-4cad-46bd-ac79-a06b5117efd4.png align="center")
        
        Correlation heat map: this helps spot correlations between the features and the target variable `SalePrice`.
        
    3. ![](https://cdn.hashnode.com/res/hashnode/image/upload/v1704132054288/4bd4c7a3-1484-44ab-a9a0-6fdc0d6c59da.png align="center")
        
        Average `SalePrice` over time ⬆️
        
    4. ![](https://cdn.hashnode.com/res/hashnode/image/upload/v1704132120634/e5f33f80-6ed7-41d9-b768-71f7a0d4225e.png align="center")
        
        📊 I also created a Tableau [dashboard](https://public.tableau.com/app/profile/donald.tucker4155/viz/HousingPrices_16960239787650/Sheet12) to help me visualize the data. This wasn't strictly necessary, but I have found dashboards useful in the past for spotting correlations and other relationships.
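
The correlation heat map mentioned above is drawn from a correlation matrix. Here is a minimal sketch of how such a matrix can be computed and ranked against the target. The sample rows and the `LivingArea`, `HouseAge`, and `Neighborhood` columns are made up for illustration; only the `SalePrice` name comes from the notebook code.

```python
import pandas as pd

# Tiny made-up sample; only the SalePrice column name comes from the post.
df_data = pd.DataFrame({
    "SalePrice": [100_000, 150_000, 200_000, 250_000, 300_000],
    "LivingArea": [800, 1_200, 1_500, 2_100, 2_400],  # hypothetical feature
    "HouseAge": [50, 35, 30, 12, 5],                   # hypothetical feature
    "Neighborhood": ["A", "B", "A", "C", "B"],         # non-numeric, skipped below
})

# Pearson correlation is only defined for numeric columns.
corr_matrix = df_data.select_dtypes(include="number").corr()

# Rank features by the strength of their correlation with the target.
target_corr = (
    corr_matrix["SalePrice"]
    .drop("SalePrice")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Passing `corr_matrix` to `sns.heatmap(corr_matrix, annot=True)` would then render it as a heat map.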
        
2. Data Cleaning
    
    1. Created `X` (the predictor variables) and `y` (the target variable).
        
    2. Split the data into training and validation sets
        
    3. Handle missing data and transform columns
        
        ```python
        from sklearn.compose import ColumnTransformer
        from sklearn.impute import SimpleImputer
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import OneHotEncoder, StandardScaler

        # Split columns by dtype so each group gets its own preprocessing
        numeric_cols = X.select_dtypes(include=['number']).columns
        categorical_cols = X.select_dtypes(exclude=['number']).columns

        # Handle missing data
        numeric_imputer = SimpleImputer(strategy='mean')
        categorical_imputer = SimpleImputer(strategy='most_frequent')

        # Create transformers for preprocessing
        numeric_transformer = Pipeline(steps=[
            ('imputer', numeric_imputer),
            ('scaler', StandardScaler())
        ])

        categorical_transformer = Pipeline(steps=[
            ('imputer', categorical_imputer),
            ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ])

        # Use ColumnTransformer to apply transformers to the appropriate columns
        preprocessor = ColumnTransformer(
            transformers=[
                ('num', numeric_transformer, numeric_cols),
                ('cat', categorical_transformer, categorical_cols)
            ])
        ```
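
The first two cleaning steps listed above have no code in this post. Here is a minimal sketch of what they might look like, using scikit-learn's `train_test_split`. The sample rows, the `LivingArea` and `Neighborhood` columns, the 20% validation size, and the random seed are all assumptions for illustration; the variable names match those used in the modeling code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up frame; only the SalePrice column name comes from the post.
df_data = pd.DataFrame({
    "LivingArea": [800, 1_200, 1_500, 2_100, 2_400, 1_000, 1_800, 2_000, 900, 1_600],
    "Neighborhood": ["A", "B", "A", "C", "B", "A", "C", "B", "A", "C"],
    "SalePrice": [100_000, 150_000, 200_000, 250_000, 300_000,
                  120_000, 230_000, 260_000, 110_000, 210_000],
})

# Separate the target from the predictors.
y = df_data["SalePrice"]
X = df_data.drop(columns=["SalePrice"])

# Hold out 20% of the rows for validation (assumed split size and seed).
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_valid.shape)  # (8, 2) (2, 2)
```

Splitting before any imputer or scaler is fitted means the preprocessing statistics are learned from the training rows only, which is what happens when the pipeline is later fitted on `X_train`.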
        
3. XGB model and Predictions
    
    ```python
    from xgboost import XGBRegressor

    # Create an XGBoost regressor
    xgb = XGBRegressor(n_estimators=500, learning_rate=0.04)
    ```
    
    ```python
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    # Bundle preprocessing and modeling code in a pipeline
    my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                  ('model', xgb)
                                 ])
    
    # Preprocessing of training data, fit model 
    my_pipeline.fit(X_train, y_train)
    
    # Preprocessing of validation data, get predictions
    preds = my_pipeline.predict(X_valid)
    
    # Evaluate the model
    mae = mean_absolute_error(y_valid, preds)
    mse = mean_squared_error(y_valid, preds)
    r2 = r2_score(y_valid, preds)
    print(f"Mean Absolute Error (MAE): {mae:.2f}")
    print(f"Mean Squared Error (MSE): {mse:.2f}")
    print(f"R-squared (R2): {r2:.2f}")
    ```
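
A note on reading these metrics: MSE is expressed in squared dollars, so its square root (RMSE) is often easier to compare with MAE. The sketch below uses made-up prices and predictions purely to show the arithmetic; RMSE is an addition here, not something the notebook reports.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up prices and predictions, purely for illustration.
y_valid = np.array([200_000, 150_000, 320_000, 275_000])
preds = np.array([210_000, 140_000, 300_000, 280_000])

mae = mean_absolute_error(y_valid, preds)
mse = mean_squared_error(y_valid, preds)
rmse = np.sqrt(mse)  # back in dollars, so comparable to MAE
r2 = r2_score(y_valid, preds)

print(f"MAE:  {mae:,.0f}")   # MAE:  11,250
print(f"RMSE: {rmse:,.0f}")  # RMSE: 12,500
print(f"R2:   {r2:.2f}")
```

RMSE is never smaller than MAE, and a large gap between the two suggests a few big misses are dominating the error.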
    
4. Hyper-parameter tuning
    
* Setting up parameters to tune
        

```python
param_tuning = {
    'model__learning_rate': [0.01, 0.1, 0.05],
    'model__max_depth': [3, 5, 7, 10],
    'model__min_child_weight': [1, 3, 5],
    'model__subsample': [0.5, 0.7],
    'model__colsample_bytree': [0.5, 0.7],
    'model__n_estimators': [100, 200, 500, 1000],
    'model__objective': ['reg:squarederror']
}
```
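
Before running an exhaustive search over a grid like this, it helps to count how many models it implies. The sketch below uses scikit-learn's `ParameterGrid` to count the combinations, and assumes 5-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

param_tuning = {
    'model__learning_rate': [0.01, 0.1, 0.05],
    'model__max_depth': [3, 5, 7, 10],
    'model__min_child_weight': [1, 3, 5],
    'model__subsample': [0.5, 0.7],
    'model__colsample_bytree': [0.5, 0.7],
    'model__n_estimators': [100, 200, 500, 1000],
    'model__objective': ['reg:squarederror']
}

# Every combination of the listed values is one candidate model.
n_candidates = len(ParameterGrid(param_tuning))

# Each candidate is fitted once per cross-validation fold (assumed 5 folds).
n_folds = 5
print(n_candidates, n_candidates * n_folds)  # 576 2880
```

That is 2,880 pipeline fits, plus a final refit of the best candidate when `refit=True` (the default), so a full grid search here can take a while.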

* Grid Search
    

```python
from sklearn.model_selection import GridSearchCV

xgb_model = XGBRegressor()
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', xgb_model)
                             ])

xgb_cv = GridSearchCV(estimator=my_pipeline,
                      param_grid=param_tuning,
                      scoring='neg_mean_absolute_error',  # MAE
                      # scoring='neg_mean_squared_error',  # MSE
                      cv=5)

xgb_cv.fit(X_train, y_train)
# best_score_ is the negated MAE, so values closer to zero are better
print("Best Score: ", xgb_cv.best_score_)
print("Best Params: ", xgb_cv.best_params_)
```

Hyper-parameter tuning reduced the model's error relative to the initial XGBoost settings 🎉
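
One way to check an improvement like this is to score the tuned pipeline on the same held-out validation set as the baseline. With `scoring='neg_mean_absolute_error'`, `best_score_` is the negated cross-validated MAE, so its sign has to be flipped before comparing. The sketch below is a self-contained illustration, not the notebook's code: it swaps in synthetic data and a `Ridge` model in place of `XGBRegressor` so it runs quickly, and the `alpha` grid is an arbitrary assumption.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the post uses the Kaggle housing data instead.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Ridge is a lightweight stand-in for XGBRegressor.
pipeline = Pipeline(steps=[("scaler", StandardScaler()), ("model", Ridge())])

search = GridSearchCV(
    estimator=pipeline,
    param_grid={"model__alpha": [0.1, 1.0, 10.0]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)

# best_score_ is the negated MAE averaged over the folds, so flip the sign.
cv_mae = -search.best_score_

# With the default refit=True, best_estimator_ has been refit on all of X_train.
valid_mae = mean_absolute_error(y_valid, search.best_estimator_.predict(X_valid))

print(f"Cross-validated MAE: {cv_mae:.2f}")
print(f"Validation MAE: {valid_mae:.2f}")
```

In the housing pipeline, the equivalent comparison would use `xgb_cv.best_estimator_.predict(X_valid)` against the MAE printed for the untuned model.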
