Home Credit Risk Model (Kaggle Silver Medal)

Executive Summary

This report details our approach to the Home Credit - Credit Risk Model Stability competition, which asked participants to predict customer loan defaults with a focus on model stability over time. We earned a Kaggle silver medal with an ensemble of LightGBM and CatBoost classifiers built on feature engineering tailored to financial data. Our methodology emphasized temporal stability, using week-grouped cross-validation and feature importance analysis to build a model whose performance holds up across time periods.

Dataset:

https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability/data

Competition Overview

The Home Credit - Credit Risk Model Stability competition challenged participants to predict which customers were more likely to default on loans. The key difference from a standard classification task was the emphasis on models that remain stable over time. The evaluation metric rewarded both predictive power and consistency: it is computed from the Gini score of each week of predictions and penalizes scores that decline or fluctuate across weeks.
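The structure of the metric can be sketched in a few lines. This is a minimal sketch, not the official scoring code: the weights `88.0` and `0.5` follow the formula published on the competition page and should be checked against it, and the column names `WEEK_NUM`, `target`, and `score` are assumptions about how the predictions are stored.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def gini_stability(df: pd.DataFrame,
                   w_falling_rate: float = 88.0,
                   w_resid_std: float = 0.5) -> float:
    """Stability score built from per-week Gini values (illustrative sketch)."""
    # One Gini value (2 * AUC - 1) per week; groupby returns weeks in sorted order.
    weekly_gini = np.array([
        2.0 * roc_auc_score(group["target"], group["score"]) - 1.0
        for _, group in df.groupby("WEEK_NUM")
    ])

    # Fit a straight line through the weekly Gini values.
    x = np.arange(len(weekly_gini))
    slope, intercept = np.polyfit(x, weekly_gini, 1)
    residuals = weekly_gini - (slope * x + intercept)

    # Mean Gini, penalized for a falling trend and for week-to-week scatter.
    return float(
        weekly_gini.mean()
        + w_falling_rate * min(0.0, slope)
        - w_resid_std * residuals.std()
    )
```

Only a negative slope is penalized, so a model that improves over time gains nothing beyond its mean Gini, while one that degrades is punished heavily.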

Data Understanding

The competition provided an extensive dataset containing various aspects of customer information:

Base application data

Credit bureau information

Previous application history

Tax registry data

Deposit and debit card information

The data was organized into various files with different depths:

Depth 0: Static features tied directly to a case_id

Depth 1: Historical records for each case_id, indexed by num_group1

Depth 2: Historical records for each case_id, indexed by both num_group1 and num_group2

Each feature name carries a suffix indicating the transformation applied to it:

P: Days past due transformation

M: Masking categories

A: Amount transformation

D: Date transformation

T: Unspecified transformation

L: Unspecified transformation

Methodology

Feature Engineering

Our feature engineering focused on aggregating the raw data into features that would remain stable over time.
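The original aggregation code is not included in this write-up. A minimal sketch of the idea, assuming pandas and a table with several rows per `case_id`: each numeric column is collapsed to one value per case. The function name `aggregate_by_case` and the choice of `max` and `mean` are illustrative assumptions; the `mean_` prefix mirrors feature names such as `mean_residualamount_856A` reported later in this document.

```python
import pandas as pd


def aggregate_by_case(df: pd.DataFrame, key: str = "case_id") -> pd.DataFrame:
    """Collapse a table with many rows per case into one row per case."""
    # Aggregate every numeric column except the grouping key itself.
    numeric_cols = [c for c in df.select_dtypes("number").columns if c != key]
    agg = df.groupby(key)[numeric_cols].agg(["max", "mean"])

    # Flatten the (column, statistic) MultiIndex into names like "mean_amount_A".
    agg.columns = [f"{stat}_{col}" for col, stat in agg.columns]
    return agg.reset_index()
```

The resulting frame can be joined back onto the base table on `case_id`.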

We also handled date features separately, converting them to relative distances.
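One way to express dates as relative distances is sketched below. It assumes pandas, assumes that date columns can be recognized by the `D` suffix described earlier, and assumes the reference point is a `date_decision` column; the write-up does not state which reference date was actually used, and the helper name is hypothetical.

```python
import pandas as pd


def dates_to_relative_days(df: pd.DataFrame,
                           ref_col: str = "date_decision") -> pd.DataFrame:
    """Replace each date column (suffix 'D') with its offset in days from ref_col."""
    out = df.copy()
    ref = pd.to_datetime(out[ref_col])  # assumed reference date
    for col in out.columns:
        if col.endswith("D") and col != ref_col:
            # Negative values mean the event happened before the reference date.
            out[col] = (pd.to_datetime(out[col]) - ref).dt.days
    return out
```

Relative offsets avoid feeding absolute calendar dates to the model, which would otherwise let it memorize the training period.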

Additionally, we applied memory optimization techniques to handle the large dataset efficiently.
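The write-up does not say which techniques were used. A common one, shown here as an assumption rather than as our exact code, is to downcast numeric columns to the smallest dtype that can hold their values:

```python
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Shrink numeric columns to the smallest dtype that still fits the data."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # e.g. int64 -> int8 when all values fit in [-128, 127]
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32; halves memory at the cost of some precision
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out
```

Comparing `df.memory_usage(deep=True).sum()` before and after shows the saving on a given table.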

Model Development

We cross-validated with StratifiedGroupKFold, using weeks as the grouping variable, so that every validation fold consisted of weeks unseen during training and the scores reflected the model's stability over time.

Ensembling Approach

To improve stability and performance, we built a voting ensemble combining CatBoost and LightGBM models.
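The voting step can be sketched as a thin wrapper that averages predicted probabilities. This sketch assumes soft voting with equal weights over already-fitted estimators; the class name `VotingModel` is illustrative, and any object exposing `predict_proba` (such as a fitted `LGBMClassifier` or `CatBoostClassifier`) can be passed in.

```python
import numpy as np


class VotingModel:
    """Soft-voting wrapper that averages predict_proba over fitted estimators."""

    def __init__(self, estimators):
        # Estimators are expected to be fitted before they are passed in.
        self.estimators = list(estimators)

    def predict_proba(self, X):
        # Stack each model's class probabilities and take the unweighted mean.
        probs = [est.predict_proba(X) for est in self.estimators]
        return np.mean(probs, axis=0)
```

In a cross-validated setup, the fold models of both algorithms can all be placed in one `VotingModel`, so the final prediction averages over algorithms and folds at once.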

Performance Optimization

After analyzing feature importance, we identified critical features that impact loan default prediction:

Interest rate (eir_270L)

Credit price (price_1097A)

Residual amount for active contracts (mean_residualamount_856A)

Total number of loan payments made by the client (pmtnum_254L)

Start date of closed credit contracts (mean_dateofcredstart_739D)

We also applied a targeted optimization to raise our stability score, adjusting predictions based on the most important date feature.

Key Findings

Feature stability is critical for model performance over time

Temporal features, especially those related to credit history dates, provide strong signals

Financial indicators like interest rates and residual amounts are highly predictive

Ensemble models combining multiple algorithms provide more stable predictions

Understanding the evaluation metric's design is crucial: in this case, it rewards both accuracy and stability

Conclusion

Our approach to the Home Credit - Credit Risk Model Stability competition balanced prediction accuracy with temporal stability. Feature engineering, memory-efficient data handling, multi-model ensembling, and week-grouped cross-validation together produced a loan default model whose performance holds up over time.

The model's strength lies in identifying key financial indicators that remain predictive across different time periods, a property that matters for credit risk assessment systems in real-world use.