Home Credit Risk Model (Kaggle Silver Medal)


Executive Summary

This report details our approach to the Home Credit - Credit Risk Model Stability competition, which asked participants to predict customer loan defaults with a focus on model stability over time. We earned a silver medal with an ensemble of LightGBM and CatBoost classifiers built on feature engineering tailored to financial data. Our methodology emphasized temporal stability, using week-grouped cross-validation and feature importance analysis to build a model whose performance holds up across time periods.

Competition Overview

The Home Credit - Credit Risk Model Stability competition challenged participants to predict which customers were more likely to default on loans. The key differentiator from standard classification tasks was the emphasis on creating models that maintain stability over time. The evaluation metric rewarded both prediction accuracy and consistent performance across different time periods, using a custom stability metric calculated from Gini coefficients across weeks.
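To make the metric concrete, here is a minimal sketch of a Gini-based stability score as we understand the competition's published definition: the mean weekly Gini, plus a penalty when the linear trend of weekly Gini is falling, minus a penalty for the spread of the residuals around that trend. The column names WEEK_NUM, target and score and the weights 88.0 and -0.5 are taken from our reading of the competition description and should be treated as assumptions, not as the official scoring code.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def gini_stability(df, w_falling_rate=88.0, w_resid_std=-0.5):
    """Stability score computed from per-week Gini values.

    Assumes df has the columns WEEK_NUM, target (0/1) and score, and that
    every week contains both classes (AUC is undefined otherwise).
    """
    # Gini = 2 * AUC - 1, computed separately for each week (sorted by week).
    weekly_gini = np.array([
        2.0 * roc_auc_score(group["target"], group["score"]) - 1.0
        for _, group in df.groupby("WEEK_NUM")
    ])

    # Fit a straight line through the weekly Gini values.
    x = np.arange(len(weekly_gini))
    slope, intercept = np.polyfit(x, weekly_gini, 1)
    residuals = weekly_gini - (slope * x + intercept)

    # Only a falling trend is penalised; scatter around the trend always is.
    return (
        weekly_gini.mean()
        + w_falling_rate * min(0.0, slope)
        + w_resid_std * residuals.std()
    )
```

Because the slope penalty is so heavily weighted, a model whose weekly Gini drifts downward can score far worse than its average Gini alone would suggest.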

Data Understanding

The competition provided an extensive dataset containing various aspects of customer information:
  • Base application data
  • Credit bureau information
  • Previous application history
  • Tax registry data
  • Deposit and debit card information
The data was organized into files of different depths:
  • Depth 0: Static features tied directly to a single case_id
  • Depth 1: Historical records for each case_id, indexed by num_group1
  • Depth 2: Historical records for each case_id, indexed by both num_group1 and num_group2
Each feature name ends in a suffix letter indicating the transformation group applied to it:
  • P: Days past due transformation
  • M: Masking categories
  • A: Amount transformation
  • D: Date transformation
  • T: Unspecified transformation
  • L: Unspecified transformation
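The suffix convention makes it possible to assign column types mechanically. The sketch below shows one way this could be done in pandas; the function name cast_by_suffix, the KEY_COLUMNS set and the choice of target dtypes are our illustrative assumptions, not necessarily what the original pipeline used.

```python
import pandas as pd

# Identifier and bookkeeping columns that must not be interpreted by suffix
# (WEEK_NUM, for example, ends in "M" but is not a masked category).
KEY_COLUMNS = {"case_id", "WEEK_NUM", "num_group1", "num_group2",
               "target", "date_decision"}


def cast_by_suffix(df):
    """Return a copy of df with dtypes chosen from each column's suffix letter."""
    out = df.copy()
    for col in out.columns:
        if col in KEY_COLUMNS:
            continue
        suffix = col[-1]
        if suffix in ("P", "A"):
            # Days-past-due and amount features are numeric.
            out[col] = pd.to_numeric(out[col], errors="coerce").astype("float64")
        elif suffix == "M":
            # Masked categories are kept as strings.
            out[col] = out[col].astype("string")
        elif suffix == "D":
            # Date features are parsed; unparseable values become NaT.
            out[col] = pd.to_datetime(out[col], errors="coerce")
        # "T" and "L" columns are unspecified, so they are left untouched.
    return out
```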

Methodology

Feature Engineering

Our feature engineering approach focused on creating meaningful aggregations from the raw data that would remain stable over time.
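As an illustration of this kind of aggregation, the sketch below collapses a table with several historical rows per case into one row per case. It assumes the competition's case_id key and num_group index columns; the helper name aggregate_by_case and the choice of max and mean as statistics are ours. The stat_column naming matches feature names such as mean_residualamount_856A.

```python
import pandas as pd


def aggregate_by_case(df, key="case_id", skip=("num_group1", "num_group2")):
    """Collapse many historical rows per case into one row of summary features."""
    value_cols = [
        col for col in df.select_dtypes("number").columns
        if col != key and col not in skip
    ]
    agg = df.groupby(key)[value_cols].agg(["max", "mean"])
    # Flatten the (column, statistic) MultiIndex into names like
    # "mean_residualamount_856A".
    agg.columns = [f"{stat}_{col}" for col, stat in agg.columns]
    return agg.reset_index()
```

The resulting frame has one row per case_id and can be left-joined onto the base application table.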
We also implemented special handling for date features, converting them to relative distances.
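One plausible reading of "relative distances" is the number of days between each date feature and the application's decision date, which removes the absolute calendar position that would otherwise drift between train and test periods. The sketch below assumes a date_decision reference column and "D"-suffixed date columns; the function name is hypothetical.

```python
import pandas as pd


def dates_to_relative_days(df, reference="date_decision"):
    """Replace every 'D'-suffixed date column with days relative to `reference`."""
    out = df.copy()
    ref = pd.to_datetime(out[reference])
    for col in out.columns:
        if col != reference and col.endswith("D"):
            delta = pd.to_datetime(out[col], errors="coerce") - ref
            # Negative values lie before the decision date; missing dates stay NaN.
            out[col] = delta.dt.days.astype("float32")
    return out
```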
Additionally, we implemented memory optimization techniques to handle the large dataset efficiently.
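A common form of this optimization is downcasting each column to the smallest dtype that can hold its values. The sketch below is a generic pandas version under that assumption, not the original helper; note that downcasting floats to float32 trades some precision for memory.

```python
import numpy as np
import pandas as pd


def reduce_memory(df):
    """Return a copy of df with numeric columns downcast and text made categorical."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif (pd.api.types.is_object_dtype(out[col])
              or pd.api.types.is_string_dtype(out[col])):
            # Repeated strings are stored once as category levels.
            out[col] = out[col].astype("category")
    return out
```

Comparing df.memory_usage(deep=True).sum() before and after the call shows how much was saved.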

Model Development

We employed a rigorous cross-validation approach using StratifiedGroupKFold with weeks as the grouping variable, so that no week contributes rows to both the training and validation side of a fold.

Ensembling Approach

To improve stability and performance, we created a voting ensemble combining both CatBoost and LightGBM models.
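A soft-voting ensemble of this kind can be as simple as averaging the members' predicted probabilities. In the sketch below the class name VotingModel is our choice, and two scikit-learn classifiers stand in for the fitted LightGBM and CatBoost models so that the example runs without those libraries; any fitted estimator exposing predict_proba could be substituted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


class VotingModel:
    """Soft-voting wrapper: averages predict_proba over already-fitted members."""

    def __init__(self, estimators):
        self.estimators = list(estimators)

    def predict_proba(self, X):
        probas = [est.predict_proba(X) for est in self.estimators]
        return np.mean(probas, axis=0)


# Synthetic data and stand-in members (assumptions for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

members = [
    LogisticRegression().fit(X, y),
    DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y),
]
ensemble = VotingModel(members)
default_probability = ensemble.predict_proba(X)[:, 1]  # column 1 = positive class
```

In a cross-validated setup, the members would typically be the per-fold models of both algorithms, all averaged with equal weight.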

Performance Optimization

After analyzing feature importance, we identified critical features that impact loan default prediction:
  1. Interest rate (eir_270L)
  2. Credit price (price_1097A)
  3. Residual amount for active contracts (mean_residualamount_856A)
  4. Total loan payments made by clients (pmtnum_254L)
  5. Start date of closed credit contracts (mean_dateofcredstart_739D)
We also implemented a post-processing step aimed at the stability score, adjusting predictions based on the most important date feature.
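The exact adjustment rule is not preserved in this write-up, so the following is purely a hypothetical illustration of what a date-conditioned adjustment could look like: scores of rows whose date feature falls below a threshold are shifted down and clipped to the valid probability range. The function name, the threshold, the shift size and the direction of the shift are all assumptions.

```python
import numpy as np


def shift_scores_by_date(scores, date_feature, threshold, shift):
    """Lower the scores of rows whose date feature is below `threshold`.

    Hypothetical rule for illustration only; rows with a missing date feature
    are left unchanged because NaN < threshold evaluates to False.
    """
    scores = np.asarray(scores, dtype=float)
    dates = np.asarray(date_feature, dtype=float)
    mask = dates < threshold
    adjusted = scores.copy()
    adjusted[mask] = np.clip(adjusted[mask] - shift, 0.0, 1.0)
    return adjusted
```

Any such adjustment should be checked against the stability metric on held-out weeks, since it changes the ranking of predictions within each week.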
 
notion image

Key Findings

  1. Feature stability is critical for model performance over time
  2. Temporal features, especially those related to credit history dates, provide strong signals
  3. Financial indicators like interest rates and residual amounts are highly predictive
  4. Ensemble models combining multiple algorithms provide more stable predictions
  5. Understanding the evaluation metric's design is crucial; in this case, that meant optimizing for both accuracy and stability

Conclusion

Our approach to the Home Credit - Credit Risk Model Stability competition successfully balanced prediction accuracy with temporal stability. By using a combination of rigorous feature engineering, efficient memory management, multi-model ensembling, and strategic cross-validation, we created a robust solution for predicting loan defaults that maintains its performance over time.
The model's strength lies in its ability to identify key financial indicators that remain predictive across different time periods, making it suitable for real-world deployment in credit risk assessment systems.

© Kai Zhang 2024-2025