This project analyzes loan default patterns using machine learning on historical financial data. The goal was to identify high-risk borrowers early, highlight drivers of default, and deliver actionable insights that financial institutions could use to mitigate risk.
As the Data Engineer, I led this project end to end: data preprocessing, feature engineering, model selection, evaluation, and recommendations.
I combined technical expertise in Python, SQL, and Jupyter with data visualization in Tableau to balance predictive accuracy with interpretability.
Cleaned, preprocessed, and engineered features from financial datasets
Built and validated predictive models using machine learning algorithms
Translated technical findings into business recommendations for risk mitigation
The dataset contained far fewer defaults than non-defaults, requiring specialized techniques to ensure the model could effectively identify high-risk cases.
Complex relationships between financial variables and noise in payment history data affected model stability and interpretability.
The model needed both high recall (catching defaults) and high precision (avoiding false positives) to balance risk detection with customer experience.
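To make the recall versus precision tension concrete, here is a minimal, dependency-free sketch. The labels and scores are hypothetical illustration values, not project data, and `precision_recall_f1` is a helper name introduced only for this example.

```python
def precision_recall_f1(y_true, y_prob, threshold):
    """Return (precision, recall, f1) for the positive class (1 = default)."""
    tp = fp = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1   # default correctly flagged
        elif predicted == 1 and truth == 0:
            fp += 1   # good borrower wrongly flagged
        elif predicted == 0 and truth == 1:
            fn += 1   # default missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels (1 = default) and predicted default probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90]

for threshold in (0.3, 0.5, 0.7):
    p, r, f = precision_recall_f1(y_true, y_prob, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

On this toy data, a low threshold catches every default but flags many good borrowers, while a high threshold removes the false positives at the cost of a missed default.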
Cleaned missing and inconsistent payment data, handling null values and standardizing financial metrics across the dataset.
Applied SMOTE (Synthetic Minority Over-sampling Technique) to address class imbalance and improve the model's ability to detect defaults.
Engineered new features including repayment history flags and credit utilization ratios to capture risk patterns.
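The oversampling step can be illustrated with a simplified, standard-library sketch of the core SMOTE idea: each synthetic row is an interpolation between a minority-class row and one of its nearest minority neighbours. This is not the project's actual code, which would more plausibly rely on a library implementation. The function name `smote_like_oversample`, the feature columns, and the sample values are all assumptions made for illustration.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority (default) rows: [credit_utilization, months_delayed].
minority = [[0.90, 2.0], [0.85, 3.0], [0.95, 1.0], [0.80, 2.0]]
new_rows = smote_like_oversample(minority, n_new=6)
print(len(new_rows), "synthetic default rows generated")
```

Oversampling of this kind is normally applied to the training split only, so that evaluation data contains no synthetic rows.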
I trained multiple models including Logistic Regression and Random Forest. Logistic Regression offered interpretability, while Random Forest delivered higher predictive power.
The Random Forest model outperformed the others overall, striking the best balance between recall and precision for deployment.
Logistic Regression provided valuable interpretability for stakeholders, ensuring trust in model decisions.
Both models achieved strong F1 scores of 0.94, demonstrating effective handling of the precision-recall tradeoff.
Improved detection of high-risk borrowers by boosting recall through synthetic minority oversampling techniques.
Random Forest was chosen for deployment; Logistic Regression was kept for explainability and stakeholder confidence.
Next steps: fine-tune hyperparameters and test XGBoost for further gains in the balance between precision and recall.
Flag borrowers with early payment delays (PAY_0, PAY_2) for proactive outreach and support programs.
Action: Implement automated alert system for payment delays
Apply stricter checks for borrowers with low limits and repayment issues to prevent over-extension.
Action: Deploy dynamic credit limit adjustment based on risk scores
Real-time alert system for repeated late payments to enable immediate intervention strategies.
Action: Build dashboard for payment pattern monitoring
Dynamic loan rates tied to default probability scores to better price risk and improve profitability.
Action: Integrate risk scoring into pricing algorithms
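Two of the recommendations above, early-delay flagging and risk-based pricing, can be sketched as simple rules. This is a hedged illustration only. It assumes that positive values in `PAY_0` and `PAY_2` count months of payment delay and that zero or negative values mean no delay. The linear premium, the `max_premium` cap, the `p_default` field, and the borrower records are all hypothetical.

```python
def has_early_delay(borrower, min_months=1):
    """True when either recent repayment-status field shows a delay of at
    least min_months (assumes positive values are months of delay)."""
    return borrower["PAY_0"] >= min_months or borrower["PAY_2"] >= min_months


def risk_adjusted_rate(base_rate, default_probability, max_premium=0.10):
    """Add a premium that grows linearly with predicted default probability.
    The linear form and the cap are illustrative assumptions."""
    if not 0.0 <= default_probability <= 1.0:
        raise ValueError("default_probability must be between 0 and 1")
    return base_rate + max_premium * default_probability


# Hypothetical borrowers with model-predicted default probabilities.
borrowers = [
    {"id": "A", "PAY_0": 0, "PAY_2": 0, "p_default": 0.05},
    {"id": "B", "PAY_0": 2, "PAY_2": 0, "p_default": 0.40},
    {"id": "C", "PAY_0": -1, "PAY_2": 1, "p_default": 0.20},
]

flagged = [b["id"] for b in borrowers if has_early_delay(b)]
rates = {b["id"]: risk_adjusted_rate(0.08, b["p_default"]) for b in borrowers}

print("flag for outreach:", flagged)
for borrower_id, rate in rates.items():
    print(f"{borrower_id}: {rate:.3f}")
```

A production pricing rule would also need to account for regulatory and fairness constraints, which this sketch does not attempt to model.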
This project demonstrated the power of machine learning to reduce financial risk and enhance lending practices. By blending technical rigor with business strategy, I delivered a predictive framework that not only identified risk but also informed actionable strategies to reduce defaults.
By combining machine learning expertise with financial domain knowledge, this project transformed raw data into a strategic asset for risk management and business growth.