Problem Statement¶
Context¶
AllLife Bank is a mid-sized, fast-growing US-based financial institution that offers a range of retail banking services, including savings and checking accounts, fixed deposits, and personal loans. The bank’s business model is centered on building long-term customer relationships, expanding its retail footprint, and growing its loan portfolio to drive sustainable profitability through interest income.
It currently relies on a large base of liability customers (depositors) but faces a significant under-representation of asset customers (borrowers). To drive profitability through interest income, the bank must aggressively expand its loan portfolio by converting existing depositors into personal loan customers.
Last year’s pilot campaign achieved a 9% conversion rate, validating the potential of this strategy. However, to optimize marketing spend and improve efficiency, the retail marketing department requires a more data-driven approach. Enhancing the success ratio of these campaigns is critical for sustainable growth and maximizing customer lifetime value.
Objective¶
The objective is to develop a predictive classification model that identifies patterns and key factors driving personal loan adoption among existing liability customers. By uncovering the demographic and behavioral drivers of loan conversion, the goal is to enable targeted segmentation and more precise marketing interventions that improve campaign conversion rates, optimize marketing spend, and enhance overall profitability through higher-quality loan portfolio growth.
Data Dictionary¶
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIPCode: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level (1: Undergrad; 2: Graduate; 3: Advanced/Professional)
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
Importing necessary libraries¶
# Installing the libraries with the specified version.
%pip install numpy==2.0.2 pandas==2.2.2 matplotlib==3.10.0 seaborn==0.13.2 scikit-learn==1.6.1 sklearn-pandas==2.2.0 -q --user
Note: you may need to restart the kernel to use updated packages.
Note:
After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.
On executing the above line of code, you might see a warning regarding package dependencies. This error message can be ignored as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in this notebook.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
)
# to suppress unnecessary warnings
import warnings
warnings.filterwarnings("ignore")
Loading the dataset¶
Loan = pd.read_csv("Loan_Modelling.csv") ##read the data
# copying data to another variable to avoid any changes to original data
data = Loan.copy()
Data Overview¶
- Observations
- Sanity checks
Shape of the Data¶
data.shape
(5000, 14)
data.head()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
data.tail()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
Understanding the data types¶
data.dtypes
ID                      int64
Age                     int64
Experience              int64
Income                  int64
ZIPCode                 int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage                int64
Personal_Loan           int64
Securities_Account      int64
CD_Account              int64
Online                  int64
CreditCard              int64
dtype: object
Checking Statistical Summary¶
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
Clean up¶
data = data.drop(['ID'], axis=1) ##drop a column from the dataframe
Data Preprocessing¶
Checking for Anomalous Values¶
data["Experience"].unique()
array([ 1, 19, 15, 9, 8, 13, 27, 24, 10, 39, 5, 23, 32, 41, 30, 14, 18,
21, 28, 31, 11, 16, 20, 35, 6, 25, 7, 12, 26, 37, 17, 2, 36, 29,
3, 22, -1, 34, 0, 38, 40, 33, 4, -2, 42, -3, 43])
# checking for experience <0
data[data["Experience"] < 0]["Experience"].unique()
array([-1, -2, -3])
# Correcting the negative experience values (treated as sign errors)
# A single dict-based replace avoids chained inplace operations on a column,
# which trigger FutureWarning in pandas 2.x
data["Experience"] = data["Experience"].replace({-1: 1, -2: 2, -3: 3})
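Since the only negative entries are -1, -2, and -3, and the fix maps each to its positive counterpart, taking the absolute value is an equivalent one-liner. A minimal sketch on a hypothetical toy Series (in the notebook, the real `data["Experience"]` column would be used):

```python
import pandas as pd

# Hypothetical stand-in for data["Experience"]
exp = pd.Series([1, 19, -1, -2, 0, -3])

# Mapping-based fix, as used in the notebook
fixed_map = exp.replace({-1: 1, -2: 2, -3: 3})

# Absolute-value fix; equivalent because the only negatives are -1, -2, -3
fixed_abs = exp.abs()

print(fixed_map.tolist())           # [1, 19, 1, 2, 0, 3]
print(fixed_map.equals(fixed_abs))  # True
```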
data["Education"].unique()
array([1, 2, 3])
Feature Engineering¶
# checking the number of uniques in the zip code
data["ZIPCode"].nunique()
467
data["ZIPCode"] = data["ZIPCode"].astype(str)
print(
"Number of unique values if we take first two digits of ZIPCode: ",
data["ZIPCode"].str[0:2].nunique(),
)
data["ZIPCode"] = data["ZIPCode"].str[0:2]
data["ZIPCode"] = data["ZIPCode"].astype("category")
Number of unique values if we take first two digits of ZIPCode: 7
## Converting the data type of categorical features to 'category'
cat_cols = [
"Education",
"Personal_Loan",
"Securities_Account",
"CD_Account",
"Online",
"CreditCard",
"ZIPCode",
]
data[cat_cols] = data[cat_cols].astype("category")
Exploratory Data Analysis¶
EDA is a critical step in any data project used to investigate and understand the data before model construction.
The following questions serve as a starting point to help you approach the analysis and generate initial insights:
Questions:
What is the distribution of the mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
- From data.describe(): mean = 56.5K, median = 0, 25th percentile = 0, 75th percentile = 101K, max = 635K.
- Pattern: heavily right-skewed. More than half of the customers have no mortgage at all (value = 0), while a smaller segment has mortgages, with a few high outliers near 635K. This bimodal-like split (zero vs. non-zero) is the dominant pattern.

How many customers have credit cards?
- From the stats: CreditCard mean = 0.294 with 5,000 total customers, so ~1,470 customers (29.4%) have a credit card from another bank; about 70.6% do not.

What are the attributes that have a strong correlation with the target attribute (personal loan)?
- Income and CCAvg show the strongest positive correlations with Personal_Loan; see the other observations in the correlation observation cell below.

How does a customer's interest in purchasing a loan vary with their education?
- Undergraduates accept loans at less than a third the rate of Graduate/Advanced customers. The biggest jump is between Undergrad and Graduate: once a customer has a graduate degree, their likelihood of taking a loan nearly triples. Graduate and Advanced/Professional customers should be the primary targets for loan marketing campaigns.

How does a customer's interest in purchasing a loan vary with their age?
- Age has virtually no effect on loan interest. Acceptance rates are nearly flat across all age groups (8.7%–10.8%), and the mean age of acceptors (45.1) vs. non-acceptors (45.4) is essentially identical. This aligns with the near-zero correlation (r = -0.008) from the heatmap. Age should not be used as a segmentation variable for loan campaigns.
[IMPORTANT] Beyond the basics: additional observations are provided near each code block below.
Univariate Analysis¶
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2,
sharex=True,
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
)
sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color="violet")
if bins:
sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, color="steelblue")
else:
sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, color="steelblue")
ax_hist2.axvline(data[feature].mean(), color="green", linestyle="--")
ax_hist2.axvline(data[feature].median(), color="black", linestyle="-")
plt.show()
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
histogram_boxplot(data, "Age")
Observations on Age¶
- Age ranges from 23 to 67, with a mean (~45) and median (~45) that are nearly equal, suggesting a roughly symmetric distribution.
- The boxplot shows no extreme outliers; the distribution is fairly uniform across middle-aged customers.
- The bulk of customers fall between 35 and 55 years old — a working-age, financially active demographic.
histogram_boxplot(data, "Mortgage")
Observations on Mortgage¶
- The mortgage distribution is heavily right-skewed: the median is 0 (more than 50% of customers have no mortgage).
- The mean (~56.5K) is much higher than the median due to a long right tail with values up to 635K.
- There are visible outliers on the high end — a small group of customers with very large mortgages.
- The zero-inflated nature of this feature (most values = 0) suggests it may have limited predictive power for the majority of customers.
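The zero-inflation claim is easy to quantify directly. A minimal sketch, using a hypothetical mini-frame in place of the notebook's full `data` (with the real column, the share comes out just above 50%, matching the median of 0):

```python
import pandas as pd

# Hypothetical stand-in for data["Mortgage"]
sample = pd.DataFrame({"Mortgage": [0, 0, 0, 101, 635, 0, 0, 155]})

# Fraction of customers with no mortgage at all
zero_share = (sample["Mortgage"] == 0).mean()
print(f"{zero_share:.1%} of customers have no mortgage")  # 62.5% on this toy sample
```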
histogram_boxplot(data, "Experience")
Observations on Experience¶
- Professional experience ranges from 0 to 43 years (after correcting the negative values).
- The distribution is fairly uniform/flat, meaning customers are spread across all experience levels.
- Mean and median are both around 20 years, with no significant skew.
- Note: Experience is nearly perfectly correlated with Age, making it largely redundant as a predictor.
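The redundancy claim can be verified with a Pearson correlation. A minimal sketch on hypothetical age/experience pairs (in the notebook, run it on the real `Age` and `Experience` columns, where r comes out near 0.99):

```python
import pandas as pd

# Hypothetical pairs: experience roughly tracks age minus ~25 years of schooling
df = pd.DataFrame({
    "Age":        [25, 45, 39, 35, 60, 51, 28],
    "Experience": [ 1, 19, 15,  9, 35, 26,  4],
})

# Pearson correlation between the two columns
r = df["Age"].corr(df["Experience"])
print(round(r, 3))  # very close to 1.0
```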
histogram_boxplot(data, "Income") ## Complete the code to create histogram_boxplot for Income
Observations on Income¶
- Income is right-skewed: most customers earn between 39K–98K, but the tail extends to 224K.
- The mean (~73.8K) is notably higher than the median (~64K), confirming the skew.
- Several high-income outliers are visible in the boxplot.
- Income is expected to be one of the strongest predictors of personal loan acceptance.
histogram_boxplot(data, "CCAvg") ## Complete the code to create histogram_boxplot for CCAvg
Observations on CCAvg¶
- Monthly credit card spend (CCAvg) is right-skewed, with most customers spending between 0.7K–2.5K/month.
- The mean (~1.94K) exceeds the median (~1.5K), driven by a few high spenders (max = 10K).
- High-spend outliers are visible; most customers cluster at the lower end of the scale.
- Higher CCAvg likely signals greater financial activity and correlates with personal loan uptake.
labeled_barplot(data, "Family", perc=True)
Observations on Family¶
- Family size is evenly distributed across 1–4 members, each accounting for roughly 25% of customers.
- No dominant family size exists — the bank's customer base is demographically diverse.
- Family size may influence financial needs; larger families might be more inclined to seek personal loans.
labeled_barplot(data, "CreditCard", perc=True)
Observations on CreditCard¶
- Approximately 29.4% (~1,470) of customers hold a credit card from another bank.
- The majority (70.6%) do not have an external credit card.
- This attribute may indicate competitive banking relationships and could influence loan targeting strategies.
labeled_barplot(data, "Education", perc=True)
Observations on Education¶
- Education is distributed across three levels: Undergrad (~42%), Graduate (~28%), and Advanced/Professional (~30%).
- Undergraduates are the largest group, but advanced-degree holders represent a significant and valuable segment.
- Education level is expected to correlate with income and personal loan interest.
labeled_barplot(data, "Securities_Account", perc=True)
Observations on Securities_Account¶
- Only about 10.4% of customers hold a securities account with the bank.
- The vast majority (89.6%) do not — securities account ownership is relatively rare in this customer base.
labeled_barplot(data, "CD_Account", perc=True)
Observations on CD_Account¶
- Only 6% of customers have a CD (Certificate of Deposit) account — the rarest product in the dataset.
- Despite its low prevalence, CD account ownership shows a strong association with personal loan acceptance (explored in bivariate analysis).
labeled_barplot(data, "Online", perc=True)
Observations on Online¶
- About 59.7% of customers use online banking, while 40.3% do not.
- The majority of the bank's customers are digitally engaged, which is relevant for designing targeted online marketing campaigns.
labeled_barplot(data, "ZIPCode", perc=True)
Observation on ZIPCode¶
- After truncating to the first two digits, there are 7 ZIP code groups, with the majority of customers concentrated in a few regions.
- The geographic concentration suggests the bank has a strong regional presence, which could inform location-based targeting.
Bivariate Analysis¶
Correlation Heatmap¶
# Correlation heatmap for numeric features
plt.figure(figsize=(12, 8))
numeric_data = data.select_dtypes(include=['float64', 'int64'])
corr_matrix = numeric_data.corr()
sns.heatmap(
corr_matrix,
annot=True,
fmt='.2f',
cmap='coolwarm',
center=0,
linewidths=0.5
)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()
Personal Loan Acceptance Rate by Education¶
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Count plot
sns.countplot(
data=data,
x='Education',
hue='Personal_Loan',
palette='Set2',
ax=axes[0]
)
axes[0].set_title('Personal Loan Acceptance by Education Level')
axes[0].set_xlabel('Education (1=Undergrad, 2=Graduate, 3=Advanced)')
axes[0].legend(title='Personal Loan', labels=['No', 'Yes'])
# Acceptance rate
edu_loan_rate = (
data.groupby('Education')['Personal_Loan']
.apply(lambda x: x.astype(int).mean() * 100)
.reset_index()
)
edu_loan_rate.columns = ['Education', 'Acceptance Rate (%)']
sns.barplot(
data=edu_loan_rate,
x='Education',
y='Acceptance Rate (%)',
palette='Set2',
ax=axes[1]
)
axes[1].set_title('Loan Acceptance Rate by Education Level')
axes[1].set_xlabel('Education (1=Undergrad, 2=Graduate, 3=Advanced)')
for p in axes[1].patches:
axes[1].annotate(f'{p.get_height():.1f}%', (p.get_x() + p.get_width()/2, p.get_height()),
ha='center', va='bottom')
plt.tight_layout()
plt.show()
print(edu_loan_rate.to_string(index=False))
Education Acceptance Rate (%)
1 4.437023
2 12.972202
3 13.657562
Observations — Loan Acceptance by Education:
| Education Level | Total Customers | Accepted Loan | Acceptance Rate |
|---|---|---|---|
| Undergrad (1) | 2,096 | 93 | 4.44% |
| Graduate (2) | 1,403 | 182 | 12.97% |
| Advanced/Professional (3) | 1,501 | 205 | 13.66% |
- Undergraduates have a significantly lower loan acceptance rate (4.44%) compared to Graduate and Advanced degree holders.
- Graduate and Advanced/Professional customers accept loans at nearly 3x the rate of undergraduates.
- The jump in acceptance is most dramatic between Undergrad and Graduate — suggesting that education level (likely as a proxy for income) is a meaningful segmentation variable.
- Marketing implication: Campaigns should be heavily weighted toward Graduate and Advanced/Professional customers for higher conversion rates.
Personal Loan Acceptance vs. Age¶
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Box plot of Age by Personal Loan
sns.boxplot(
data=data,
x='Personal_Loan',
y='Age',
palette='Set2',
ax=axes[0]
)
axes[0].set_title('Age Distribution by Personal Loan Acceptance')
axes[0].set_xticklabels(['No Loan', 'Accepted Loan'])
# Acceptance rate by age group
data['Age_Group'] = pd.cut(
data['Age'],
bins=[20, 30, 40, 50, 60, 70],
labels=['20s', '30s', '40s', '50s', '60s']
)
age_loan_rate = (
data.groupby('Age_Group', observed=True)['Personal_Loan']
.apply(lambda x: x.astype(int).mean() * 100)
.reset_index()
)
age_loan_rate.columns = ['Age_Group', 'Acceptance Rate (%)']
sns.barplot(
data=age_loan_rate,
x='Age_Group',
y='Acceptance Rate (%)',
palette='Set2',
ax=axes[1]
)
axes[1].set_title('Loan Acceptance Rate by Age Group')
for p in axes[1].patches:
axes[1].annotate(f'{p.get_height():.1f}%', (p.get_x() + p.get_width()/2, p.get_height()),
ha='center', va='bottom')
plt.tight_layout()
plt.show()
# Drop temporary column
data.drop(columns=['Age_Group'], inplace=True)
Observations — Loan Acceptance by Age:
| Age Group | Total Customers | Accepted Loan | Acceptance Rate |
|---|---|---|---|
| 20s | 624 | 66 | 10.58% |
| 30s | 1,236 | 118 | 9.55% |
| 40s | 1,270 | 122 | 9.61% |
| 50s | 1,323 | 115 | 8.69% |
| 60s | 547 | 59 | 10.79% |
- Acceptance rates are remarkably consistent across all age groups, ranging only from ~8.7% to ~10.8%.
- The mean and median age of customers who accepted a loan (45.1 / 45.0) are virtually identical to those who did not (45.4 / 45.0).
- Age is not a meaningful predictor of personal loan interest; this is confirmed by its near-zero correlation (r = -0.008) with Personal_Loan.
- Marketing campaigns should not be segmented by age; instead, focus on higher-signal attributes such as Income, CCAvg, and CD_Account.
Personal Loan Acceptance vs. Income¶
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.boxplot(
data=data,
x='Personal_Loan',
y='Income',
palette='Set2',
ax=axes[0]
)
axes[0].set_title('Income Distribution by Personal Loan Acceptance')
axes[0].set_xticklabels(['No Loan', 'Accepted Loan'])
sns.kdeplot(
data=data[data['Personal_Loan'].astype(int) == 0],
x='Income',
label='No Loan',
ax=axes[1],
fill=True,
alpha=0.4
)
sns.kdeplot(
data=data[data['Personal_Loan'].astype(int) == 1],
x='Income',
label='Accepted Loan',
ax=axes[1],
fill=True,
alpha=0.4
)
axes[1].set_title('Income Density by Personal Loan Acceptance')
axes[1].legend()
plt.tight_layout()
plt.show()
Observations:
- Customers who accepted a personal loan have significantly higher median income than those who did not.
- The income distribution for loan acceptors is shifted clearly to the right, confirming Income as one of the strongest predictors of loan acceptance.
Personal Loan Acceptance vs. CCAvg¶
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.boxplot(
data=data,
x='Personal_Loan',
y='CCAvg',
palette='Set2',
ax=axes[0]
)
axes[0].set_title('CCAvg Distribution by Personal Loan Acceptance')
axes[0].set_xticklabels(['No Loan', 'Accepted Loan'])
sns.kdeplot(
data=data[data['Personal_Loan'].astype(int) == 0],
x='CCAvg',
label='No Loan',
ax=axes[1],
fill=True,
alpha=0.4
)
sns.kdeplot(
data=data[data['Personal_Loan'].astype(int) == 1],
x='CCAvg',
label='Accepted Loan',
ax=axes[1],
fill=True,
alpha=0.4
)
axes[1].set_title('CCAvg Density by Personal Loan Acceptance')
axes[1].legend()
plt.tight_layout()
plt.show()
Observations:
- Customers who accepted a personal loan have a higher average monthly credit card spend (CCAvg).
- This suggests that higher spenders are more likely to take on additional credit products.
Personal Loan Acceptance vs. CD Account¶
cd_loan = data.groupby('CD_Account')['Personal_Loan'].apply(
lambda x: x.astype(int).mean() * 100
).reset_index()
cd_loan.columns = ['CD_Account', 'Acceptance Rate (%)']
sns.barplot(data=cd_loan, x='CD_Account', y='Acceptance Rate (%)', palette='Set2')
plt.title('Loan Acceptance Rate by CD Account Ownership')
plt.xticks([0, 1], ['No CD Account', 'Has CD Account'])
for p in plt.gca().patches:
plt.gca().annotate(f'{p.get_height():.1f}%',
(p.get_x() + p.get_width()/2, p.get_height()),
ha='center', va='bottom')
plt.show()
print(cd_loan.to_string(index=False))
CD_Account Acceptance Rate (%)
0 7.237122
1 46.357616
Observations:
- Customers with a CD (Certificate of Deposit) account have a dramatically higher loan acceptance rate.
- This is one of the strongest binary predictors of personal loan acceptance, likely because CD account holders are more financially engaged with the bank.
Overall Observations — Correlation with Personal Loan:
| Attribute | Correlation | Strength |
|---|---|---|
| Income | 0.50 | Strong positive |
| CCAvg | 0.37 | Moderate positive |
| CD_Account | 0.32 | Moderate positive |
| Mortgage | 0.14 | Weak positive |
| Education | 0.14 | Weak positive |
| Family | 0.06 | Negligible |
| Securities_Account | 0.02 | Negligible |
| Online | ~0.006 | No correlation |
| CreditCard | ~0.003 | No correlation |
| Age / Experience | ~-0.008 | No correlation |
- Income is the strongest predictor (r = 0.50): higher earners are significantly more likely to accept a personal loan.
- CCAvg (r = 0.37) reflects financial activity; heavy credit card spenders are more likely to take loans.
- CD_Account (r = 0.32) is a strong binary signal; CD holders are much more engaged banking customers.
- Age and Experience show virtually no correlation with loan acceptance, despite being highly correlated with each other.
- Online, CreditCard (external), and ZIPCode have negligible predictive value for loan acceptance.
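Per-feature correlations with the target can also be pulled out directly with `corrwith`, rather than read off the heatmap. A minimal sketch on a small synthetic sample (the column names and the income-driven relationship are assumptions standing in for the notebook's numeric columns and `Personal_Loan`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Synthetic stand-ins: loan acceptance here is driven mostly by income
income = rng.normal(74, 46, n)
ccavg = np.clip(income / 40 + rng.normal(0, 1, n), 0, None)
loan = (income + rng.normal(0, 60, n) > 150).astype(int)

df = pd.DataFrame({"Income": income, "CCAvg": ccavg,
                   "Age": rng.integers(23, 68, n)})

# Correlation of every numeric feature with the target, strongest first
corrs = df.corrwith(pd.Series(loan)).sort_values(ascending=False)
print(corrs)
```

On the real data, the same call (on `data.select_dtypes("number")` against `data["Personal_Loan"].astype(int)`) reproduces the r values tabulated above.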
Data Preprocessing - cont'd¶
Outlier Detection¶
Q1 = data.select_dtypes(include=["float64", "int64"]).quantile(0.25)
Q3 = data.select_dtypes(include=["float64", "int64"]).quantile(0.75)
IQR = Q3 - Q1  # Interquartile Range (75th percentile - 25th percentile)
lower = (
Q1 - 1.5 * IQR
) # Finding lower and upper bounds for all values. All values outside these bounds are outliers
upper = Q3 + 1.5 * IQR
(
(data.select_dtypes(include=["float64", "int64"]) < lower)
| (data.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(data) * 100
Age           0.00
Experience    0.00
Income        1.92
Family        0.00
CCAvg         6.48
Mortgage      5.82
dtype: float64
Data Preparation for Modeling¶
# dropping Experience as it is perfectly correlated with Age
X = data.drop(["Personal_Loan", "Experience"], axis=1)
Y = data["Personal_Loan"]
X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)
X = X.astype(float)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3500, 17)
Shape of test set :  (1500, 17)
Percentage of classes in training set:
Personal_Loan
0    0.905429
1    0.094571
Name: proportion, dtype: float64
Percentage of classes in test set:
Personal_Loan
0    0.900667
1    0.099333
Name: proportion, dtype: float64
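The positive-class share differs slightly between train (~9.46%) and test (~9.93%). Passing `stratify=Y` to `train_test_split` would keep the shares identical, which matters for an imbalanced target like this one. A minimal sketch on synthetic labels with a similar 90/10 imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic 90/10 imbalanced labels, mirroring the ~9.6% positive rate
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# With stratify, both splits carry exactly the same positive rate
print(y_tr.mean(), y_te.mean())  # 0.1 0.1
```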
Model Building¶
Model Evaluation Criterion¶
We don't just default to accuracy; we prioritize Recall.
- A false negative predicts that a customer won't take the loan when they actually would have.
- Each false negative is therefore a missed sales opportunity.
- So it's more costly to miss a real positive than to incorrectly flag someone.
Recall is supported by F1, Precision, and Accuracy for a full picture.
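Recall can be read straight off the confusion matrix as TP / (TP + FN). A minimal worked example with hypothetical labels, checked against sklearn's own metric:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical ground truth and predictions: 2 of 4 actual positives are caught
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# sklearn orders the flattened 2x2 matrix as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)  # 2 / (2 + 2) = 0.5

print(recall, recall_score(y_true, y_pred))  # 0.5 0.5
```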
Model Building¶
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
Decision Tree (sklearn default)¶
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
Checking model performance on training data¶
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
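Perfect training scores (all metrics at 1.0) are expected from an unpruned decision tree: it keeps splitting until every training sample is memorized, which signals overfitting rather than genuine skill, so test-set performance is the number that matters. A minimal sketch on synthetic data (an illustration, not the notebook's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem with continuous features
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

# Fully grown (unpruned) tree, matching the sklearn defaults used above
clf = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)

print(clf.score(X_tr, y_tr))  # 1.0: the tree memorizes the training set
print(clf.score(X_te, y_te))  # lower: the generalization gap reveals overfitting
```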
Visualizing the Decision Tree¶
feature_names = list(X_train.columns)
print(feature_names)
['Age', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZIPCode_91', 'ZIPCode_92', 'ZIPCode_93', 'ZIPCode_94', 'ZIPCode_95', 'ZIPCode_96', 'Education_2', 'Education_3']
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50 | |--- CCAvg <= 2.95 | | |--- Income <= 106.50 | | | |--- weights: [2553.00, 0.00] class: 0 | | |--- Income > 106.50 | | | |--- Family <= 3.50 | | | | |--- ZIPCode_93 <= 0.50 | | | | | |--- Age <= 28.50 | | | | | | |--- Education_2 <= 0.50 | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | |--- Education_2 > 0.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | |--- Age > 28.50 | | | | | | |--- CCAvg <= 2.20 | | | | | | | |--- weights: [48.00, 0.00] class: 0 | | | | | | |--- CCAvg > 2.20 | | | | | | | |--- Education_3 <= 0.50 | | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | | | | |--- Education_3 > 0.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- ZIPCode_93 > 0.50 | | | | | |--- Age <= 37.50 | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | |--- Age > 37.50 | | | | | | |--- Income <= 112.00 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Income > 112.00 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | |--- Family > 3.50 | | | | |--- Age <= 32.50 | | | | | |--- CCAvg <= 2.40 | | | | | | |--- weights: [12.00, 0.00] class: 0 | | | | | |--- CCAvg > 2.40 | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Age > 32.50 | | | | | |--- Age <= 60.00 | | | | | | |--- weights: [0.00, 6.00] class: 1 | | | | | |--- Age > 60.00 | | | | | | |--- weights: [4.00, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- Income <= 92.50 | | | |--- CD_Account <= 0.50 | | | | |--- Age <= 26.50 | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Age > 26.50 | | | | | |--- CCAvg <= 3.55 | | | | | | |--- CCAvg <= 3.35 | | | | | | | |--- Age <= 37.50 | | | | | | | | |--- Age <= 33.50 | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | | |--- Age > 33.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Age > 37.50 | | | | | | | | |--- Income <= 82.50 | | | | | | | | | |--- weights: [23.00, 0.00] class: 0 | | | | | | | | |--- Income > 82.50 | 
| | | | | | | | |--- Income <= 83.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- Income > 83.50 | | | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | |--- CCAvg > 3.35 | | | | | | | |--- Family <= 3.00 | | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | | | |--- Family > 3.00 | | | | | | | | |--- weights: [9.00, 0.00] class: 0 | | | | | |--- CCAvg > 3.55 | | | | | | |--- Income <= 81.50 | | | | | | | |--- weights: [43.00, 0.00] class: 0 | | | | | | |--- Income > 81.50 | | | | | | | |--- Education_2 <= 0.50 | | | | | | | | |--- Mortgage <= 93.50 | | | | | | | | | |--- weights: [26.00, 0.00] class: 0 | | | | | | | | |--- Mortgage > 93.50 | | | | | | | | | |--- Mortgage <= 104.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- Mortgage > 104.50 | | | | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | | | |--- Education_2 > 0.50 | | | | | | | | |--- ZIPCode_91 <= 0.50 | | | | | | | | | |--- Family <= 3.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- Family > 3.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | |--- ZIPCode_91 > 0.50 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | |--- CD_Account > 0.50 | | | | |--- weights: [0.00, 5.00] class: 1 | | |--- Income > 92.50 | | | |--- Family <= 2.50 | | | | |--- Education_2 <= 0.50 | | | | | |--- Education_3 <= 0.50 | | | | | | |--- CD_Account <= 0.50 | | | | | | | |--- Age <= 56.50 | | | | | | | | |--- weights: [27.00, 0.00] class: 0 | | | | | | | |--- Age > 56.50 | | | | | | | | |--- Online <= 0.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- Online > 0.50 | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | |--- CD_Account > 0.50 | | | | | | | |--- Securities_Account <= 0.50 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- Securities_Account > 0.50 | | | | | | | | |--- weights: [0.00, 2.00] 
class: 1 | | | | | |--- Education_3 > 0.50 | | | | | | |--- ZIPCode_94 <= 0.50 | | | | | | | |--- Income <= 107.00 | | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | | | | |--- Income > 107.00 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | |--- ZIPCode_94 > 0.50 | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | |--- Education_2 > 0.50 | | | | | |--- weights: [0.00, 4.00] class: 1 | | | |--- Family > 2.50 | | | | |--- Age <= 57.50 | | | | | |--- CCAvg <= 4.85 | | | | | | |--- weights: [0.00, 17.00] class: 1 | | | | | |--- CCAvg > 4.85 | | | | | | |--- CCAvg <= 4.95 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- CCAvg > 4.95 | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | |--- Age > 57.50 | | | | | |--- ZIPCode_93 <= 0.50 | | | | | | |--- ZIPCode_94 <= 0.50 | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | |--- ZIPCode_94 > 0.50 | | | | | | | |--- Age <= 59.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Age > 59.50 | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | |--- ZIPCode_93 > 0.50 | | | | | | |--- weights: [0.00, 2.00] class: 1 |--- Income > 116.50 | |--- Family <= 2.50 | | |--- Education_3 <= 0.50 | | | |--- Education_2 <= 0.50 | | | | |--- weights: [375.00, 0.00] class: 0 | | | |--- Education_2 > 0.50 | | | | |--- weights: [0.00, 53.00] class: 1 | | |--- Education_3 > 0.50 | | | |--- weights: [0.00, 62.00] class: 1 | |--- Family > 2.50 | | |--- weights: [0.00, 154.00] class: 1
# Importance of features in the tree: computed as the (normalized) total reduction
# of the criterion brought by that feature (also known as Gini importance)
print(
pd.DataFrame(
model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.308098
Family              0.259255
Education_2         0.166192
Education_3         0.147127
CCAvg               0.048798
Age                 0.033150
CD_Account          0.017273
ZIPCode_94          0.007183
ZIPCode_93          0.004682
Mortgage            0.003236
Securities_Account  0.002224
Online              0.002224
ZIPCode_91          0.000556
CreditCard          0.000000
ZIPCode_92          0.000000
ZIPCode_96          0.000000
ZIPCode_95          0.000000
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
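Impurity-based importances like the ones above are normalized to sum to 1, and they can be biased toward features with many split points. A minimal sketch on synthetic data (the feature matrix here is a hypothetical stand-in, not the bank data) confirms the normalization and shows permutation importance as a less biased cross-check:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the bank data, for illustration only
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
tree_clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Impurity-based (Gini) importances sum to 1 when the tree has at least one split
gini_imp = tree_clf.feature_importances_
print(gini_imp.sum())

# Permutation importance: the drop in score when a single column is shuffled
perm = permutation_importance(tree_clf, X, y, n_repeats=5, random_state=0)
print(perm.importances_mean)
```

Ranking features by both measures and comparing the orderings is a quick sanity check before acting on the importance plot.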
Checking model performance on test data¶
confusion_matrix_sklearn(model, X_test, y_test)
decision_tree_perf_test = model_performance_classification_sklearn(model, X_test, y_test)
decision_tree_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.986 | 0.932886 | 0.926667 | 0.929766 |
Model Performance Improvement¶
Pre-pruning¶
Note: The parameters below are a sample set; feel free to update them and try other combinations.
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 7, 2)
max_leaf_nodes_values = [50, 75, 150, 250]
min_samples_split_values = [10, 30, 50, 70]
# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0
# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:
            # Initialize the tree with the current set of parameters
            estimator = DecisionTreeClassifier(
                max_depth=max_depth,
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                class_weight='balanced',
                random_state=42
            )
            # Fit the model to the training data
            estimator.fit(X_train, y_train)
            # Make predictions on the training and test sets
            y_train_pred = estimator.predict(X_train)
            y_test_pred = estimator.predict(X_test)
            # Calculate recall scores for the training and test sets
            train_recall_score = recall_score(y_train, y_train_pred)
            test_recall_score = recall_score(y_test, y_test_pred)
            # Absolute difference between training and test recall
            score_diff = abs(train_recall_score - test_recall_score)
            # Keep the current model only if it both narrows the train-test
            # recall gap and improves test recall
            if score_diff < best_score_diff and test_recall_score > best_test_score:
                best_score_diff = score_diff
                best_test_score = test_recall_score
                best_estimator = estimator
# Print the best parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test recall score: {best_test_score}")
Best parameters found:
Max depth: 2
Max leaf nodes: 50
Min samples split: 10
Best test recall score: 1.0
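The hand-rolled triple loop above selects hyperparameters using the test set, which leaks test information into model choice. A common alternative, sketched here on synthetic data (the variables are placeholders, not the notebook's `X_train`/`y_train`), is `GridSearchCV` with cross-validated recall, reserving the test set for a single final check:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical imbalanced stand-in for the bank data
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Same pre-pruning grid as above, scored by cross-validated recall
param_grid = {
    "max_depth": np.arange(2, 7, 2),
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid, scoring="recall", cv=5,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_)
```

`grid.best_estimator_` is refit on the full training split, so it can be evaluated once on the held-out test data.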
# Fit the best algorithm to the data
estimator = best_estimator
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=np.int64(2),
                       max_leaf_nodes=50, min_samples_split=10,
                       random_state=42)
Checking performance on training data
confusion_matrix_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_train = model_performance_classification_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.790286 | 1.0 | 0.310798 | 0.474212 |
Visualizing the Decision Tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# Add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1344.67, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- weights: [64.61, 79.31] class: 1
|--- Income > 92.50
|   |--- Family <= 2.50
|   |   |--- weights: [298.20, 697.89] class: 1
|   |--- Family > 2.50
|   |   |--- weights: [42.52, 972.81] class: 1
print(
pd.DataFrame(
estimator.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.876529
CCAvg               0.066940
Family              0.056531
Age                 0.000000
Mortgage            0.000000
Securities_Account  0.000000
CD_Account          0.000000
Online              0.000000
CreditCard          0.000000
ZIPCode_91          0.000000
ZIPCode_92          0.000000
ZIPCode_93          0.000000
ZIPCode_94          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
Education_2         0.000000
Education_3         0.000000
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Checking performance on test data
confusion_matrix_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test = model_performance_classification_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.779333 | 1.0 | 0.310417 | 0.473768 |
Post-pruning¶
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000186 | 0.001114 |
| 2 | 0.000214 | 0.001542 |
| 3 | 0.000242 | 0.002750 |
| 4 | 0.000250 | 0.003250 |
| 5 | 0.000268 | 0.004324 |
| 6 | 0.000272 | 0.004868 |
| 7 | 0.000276 | 0.005420 |
| 8 | 0.000381 | 0.005801 |
| 9 | 0.000527 | 0.006329 |
| 10 | 0.000625 | 0.006954 |
| 11 | 0.000700 | 0.007654 |
| 12 | 0.000769 | 0.010731 |
| 13 | 0.000882 | 0.014260 |
| 14 | 0.000889 | 0.015149 |
| 15 | 0.001026 | 0.017200 |
| 16 | 0.001305 | 0.018505 |
| 17 | 0.001647 | 0.020153 |
| 18 | 0.002333 | 0.022486 |
| 19 | 0.002407 | 0.024893 |
| 20 | 0.003294 | 0.028187 |
| 21 | 0.006473 | 0.034659 |
| 22 | 0.025146 | 0.084951 |
| 23 | 0.039216 | 0.124167 |
| 24 | 0.047088 | 0.171255 |
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.04708834100596766
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
Recall vs alpha for training and testing sets
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=np.float64(0.0), random_state=1)
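`np.argmax(recall_test)` ties at alpha = 0, i.e. the unpruned tree. A more robust selection rule, sketched below on synthetic stand-in data (variable names are placeholders), is a one-standard-error-style choice: take the largest alpha whose test recall stays within a tolerance of the best, which tends to yield a smaller tree:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical imbalanced stand-in for the bank data
X, y = make_classification(n_samples=600, weights=[0.85, 0.15], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Pruning path; drop the last alpha, which prunes the tree to a single node
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)
alphas = path.ccp_alphas[:-1]

trees = [DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_tr, y_tr)
         for a in alphas]
recalls = np.array([recall_score(y_te, t.predict(X_te)) for t in trees])

# Largest alpha whose recall is within 0.02 of the best -> simplest such tree
ok = recalls >= recalls.max() - 0.02
best_idx = np.where(ok)[0][-1]
print(alphas[best_idx], trees[best_idx].tree_.node_count)
```

The 0.02 tolerance is an assumption for illustration; in practice it can be tuned, or replaced by the standard error of the cross-validated recall.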
# The recall-maximizing alpha is 0.0 (the unpruned tree), which overfits.
# Instead, choose a slightly larger alpha from the recall-vs-alpha plot and
# reweight classes to favor the minority (loan-accepting) class.
estimator_2 = DecisionTreeClassifier(
    ccp_alpha=0.000272, class_weight={0: 0.15, 1: 0.85}, random_state=1
)
estimator_2.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.000272, class_weight={0: 0.15, 1: 0.85},
                       random_state=1)
Checking performance on training data
confusion_matrix_sklearn(estimator_2, X_train, y_train)
decision_tree_tune_post_train = model_performance_classification_sklearn(estimator_2, X_train, y_train)
decision_tree_tune_post_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.999143 | 1.0 | 0.991018 | 0.995489 |
Visualizing the Decision Tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator_2,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# Add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
print(tree.export_text(estimator_2, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50 | |--- CCAvg <= 2.95 | | |--- weights: [374.10, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- CCAvg <= 3.95 | | | | |--- Income <= 81.50 | | | | | |--- Age <= 36.50 | | | | | | |--- Family <= 3.50 | | | | | | | |--- CCAvg <= 3.50 | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | |--- CCAvg > 3.50 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- Family > 3.50 | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | |--- Age > 36.50 | | | | | | |--- ZIPCode_91 <= 0.50 | | | | | | | |--- weights: [6.15, 0.00] class: 0 | | | | | | |--- ZIPCode_91 > 0.50 | | | | | | | |--- Education_3 <= 0.50 | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- Education_3 > 0.50 | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | |--- Income > 81.50 | | | | | |--- Mortgage <= 152.00 | | | | | | |--- Securities_Account <= 0.50 | | | | | | | |--- CCAvg <= 3.05 | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | |--- CCAvg > 3.05 | | | | | | | | |--- CCAvg <= 3.85 | | | | | | | | | |--- ZIPCode_91 <= 0.50 | | | | | | | | | | |--- CreditCard <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- CreditCard > 0.50 | | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | |--- ZIPCode_91 > 0.50 | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | |--- CCAvg > 3.85 | | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | |--- Securities_Account > 0.50 | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | |--- Mortgage > 152.00 | | | | | | |--- weights: [1.05, 0.00] class: 0 | | | |--- CCAvg > 3.95 | | | | |--- weights: [6.75, 0.00] class: 0 | | |--- CD_Account > 0.50 | | | |--- CCAvg <= 4.50 | | | | |--- weights: [0.00, 6.80] class: 1 | | | |--- CCAvg > 4.50 | | | | |--- weights: [0.15, 0.00] class: 0 |--- Income > 98.50 | |--- Family <= 2.50 | | |--- Education_3 <= 0.50 | | | |--- 
Education_2 <= 0.50 | | | | |--- Income <= 100.00 | | | | | |--- CCAvg <= 4.20 | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | |--- CCAvg > 4.20 | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | |--- Income > 100.00 | | | | | |--- Income <= 103.50 | | | | | | |--- Securities_Account <= 0.50 | | | | | | | |--- weights: [2.10, 0.00] class: 0 | | | | | | |--- Securities_Account > 0.50 | | | | | | | |--- CreditCard <= 0.50 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- CreditCard > 0.50 | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | |--- Income > 103.50 | | | | | | |--- weights: [64.95, 0.00] class: 0 | | | |--- Education_2 > 0.50 | | | | |--- Income <= 110.00 | | | | | |--- weights: [1.80, 0.00] class: 0 | | | | |--- Income > 110.00 | | | | | |--- Income <= 116.50 | | | | | | |--- Mortgage <= 141.50 | | | | | | | |--- CreditCard <= 0.50 | | | | | | | | |--- Age <= 48.50 | | | | | | | | | |--- weights: [0.15, 2.55] class: 1 | | | | | | | | |--- Age > 48.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- CreditCard > 0.50 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- Mortgage > 141.50 | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | |--- Income > 116.50 | | | | | | |--- weights: [0.00, 45.05] class: 1 | | |--- Education_3 > 0.50 | | | |--- Income <= 116.50 | | | | |--- CCAvg <= 1.10 | | | | | |--- weights: [1.95, 0.00] class: 0 | | | | |--- CCAvg > 1.10 | | | | | |--- Age <= 41.50 | | | | | | |--- ZIPCode_94 <= 0.50 | | | | | | | |--- weights: [1.20, 0.00] class: 0 | | | | | | |--- ZIPCode_94 > 0.50 | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | |--- Age > 41.50 | | | | | | |--- Income <= 100.00 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- Income > 100.00 | | | | | | | |--- weights: [0.15, 5.10] class: 1 | | | |--- Income > 116.50 | | | | |--- weights: [0.00, 52.70] class: 1 | |--- Family > 2.50 | | |--- Income <= 113.50 | 
| | |--- CCAvg <= 2.75 | | | | |--- Income <= 106.50 | | | | | |--- weights: [3.90, 0.00] class: 0 | | | | |--- Income > 106.50 | | | | | |--- Age <= 28.50 | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | |--- Age > 28.50 | | | | | | |--- Family <= 3.50 | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | |--- Family > 3.50 | | | | | | | |--- Age <= 60.00 | | | | | | | | |--- Age <= 35.00 | | | | | | | | | |--- Education_2 <= 0.50 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | |--- Education_2 > 0.50 | | | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | | |--- Age > 35.00 | | | | | | | | | |--- weights: [0.00, 4.25] class: 1 | | | | | | | |--- Age > 60.00 | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | |--- CCAvg > 2.75 | | | | |--- Age <= 57.00 | | | | | |--- weights: [0.15, 11.90] class: 1 | | | | |--- Age > 57.00 | | | | | |--- weights: [0.75, 0.00] class: 0 | | |--- Income > 113.50 | | | |--- Age <= 66.00 | | | | |--- Income <= 116.50 | | | | | |--- CCAvg <= 2.50 | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | |--- CCAvg > 2.50 | | | | | | |--- Age <= 60.50 | | | | | | | |--- weights: [0.00, 5.10] class: 1 | | | | | | |--- Age > 60.50 | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | |--- Income > 116.50 | | | | | |--- weights: [0.00, 130.90] class: 1 | | | |--- Age > 66.00 | | | | |--- weights: [0.15, 0.00] class: 0
print(
pd.DataFrame(
estimator_2.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.598659
Education_2         0.138693
CCAvg               0.078911
Education_3         0.067460
Family              0.066408
Age                 0.018238
CD_Account          0.011027
Mortgage            0.005053
Securities_Account  0.004728
ZIPCode_94          0.003990
ZIPCode_91          0.003596
CreditCard          0.002434
ZIPCode_92          0.000804
Online              0.000000
ZIPCode_93          0.000000
ZIPCode_96          0.000000
ZIPCode_95          0.000000
importances = estimator_2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Checking performance on test data
confusion_matrix_sklearn(estimator_2, X_test, y_test)
decision_tree_tune_post_test = model_performance_classification_sklearn(estimator_2, X_test, y_test)
decision_tree_tune_post_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.98 | 0.85906 | 0.934307 | 0.895105 |
Model Performance Comparison and Final Model Selection¶
# Create a dataframe to compare the performance of the models
models_train_comp_df = pd.concat(
[decision_tree_perf_train.T, decision_tree_tune_perf_train.T, decision_tree_tune_post_train.T], axis=1,
)
models_train_comp_df.columns = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning)", "Decision Tree (Post-Pruning)"]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree (sklearn default) | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 1.0 | 0.790286 | 0.999143 |
| Recall | 1.0 | 1.000000 | 1.000000 |
| Precision | 1.0 | 0.310798 | 0.991018 |
| F1 | 1.0 | 0.474212 | 0.995489 |
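Training metrics alone reward overfitting, so the same concat-and-transpose pattern should be applied to the test results. A sketch below rebuilds the test comparison from one-row stand-in DataFrames (values copied from the test outputs shown earlier, since the notebook's performance helpers are not reproduced here):

```python
import pandas as pd

# Stand-ins for decision_tree_perf_test, decision_tree_tune_perf_test,
# and decision_tree_tune_post_test, using values from the notebook output
default_test = pd.DataFrame({"Accuracy": [0.986], "Recall": [0.932886],
                             "Precision": [0.926667], "F1": [0.929766]})
pre_test = pd.DataFrame({"Accuracy": [0.779333], "Recall": [1.0],
                         "Precision": [0.310417], "F1": [0.473768]})
post_test = pd.DataFrame({"Accuracy": [0.98], "Recall": [0.85906],
                          "Precision": [0.934307], "F1": [0.895105]})

# Same pattern as the training comparison: transpose and concatenate
models_test_comp_df = pd.concat([default_test.T, pre_test.T, post_test.T], axis=1)
models_test_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test performance comparison:")
print(models_test_comp_df)
```

Viewed side by side, the test numbers show the trade-off directly: pre-pruning buys perfect recall at the cost of precision, while post-pruning gives up some recall for far better precision.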
Observations¶
- The default (unpruned) tree scores a perfect 1.00 on every training metric, the classic signature of overfitting; it serves as a baseline only.
- The pre-pruned tree keeps training recall at 1.00, but its precision collapses to about 0.31, so roughly two of every three customers it flags would not actually convert.
- The post-pruned tree retains perfect training recall while holding precision above 0.99, making it the best-balanced of the three on the training data.
Actionable Insights and Business Recommendations¶
Recommendations to the Bank¶
1. Target High-Income, High-Spending Customers First¶
Feature importance plots consistently show Income and CCAvg (credit card average spending) as the top predictors of loan acceptance. This tells the bank:
- Customers with higher disposable income and active spending habits are most likely to accept
- Marketing spend should be concentrated on this segment rather than broad outreach
- Personalized loan offers tied to spending patterns will have higher conversion rates
2. Education Level is a Strong Signal — Tailor the Message¶
Education is a significant feature. Higher-educated customers tend to accept more. The recommendation:
- Offer more sophisticated loan products (investment-linked, career development loans) to graduate-level customers
- Use simpler, benefit-focused messaging for lower education segments
3. Don't Rely on Age Alone — Use Income as a Proxy¶
Experience was dropped because it is almost perfectly correlated with Age, so it adds no extra information once Age is known; Age itself also ranks low in the final trees' feature importances. The bank should:
- Avoid age-based segmentation strategies in isolation
- Focus on financial profile over demographic profile
4. Use the Model for Proactive, Not Reactive, Outreach¶
The model is optimized for high Recall — catching nearly every customer likely to accept. This means:
- Use it to build a pre-approved offer list and reach out proactively
- Customers flagged by the model should receive targeted campaigns before they look elsewhere
- The cost of a false positive (contacting someone who declines) is low compared to missing a genuine opportunity
5. Watch for Imbalance in the Customer Base¶
The model required class_weight corrections because loan acceptors are a minority class. This suggests:
- The bank's current conversion rate is low — there is significant untapped potential
- Even a modest improvement in targeting precision translates to meaningful revenue gain
- The bank should periodically retrain the model as customer behavior shifts
6. Combine Model Predictions with Relationship Data¶
The model uses transactional and demographic features but does not capture customer satisfaction, recent life events, or prior interactions. Recommendation: Use the model as a first filter, then layer in relationship manager insights for high-value flagged customers.
Summary¶
| Recommendation | Basis |
|---|---|
| Target high-income, high-spending customers | Top feature importances |
| Personalize by education level | Education is a key split feature |
| Don't use age in isolation | Age/Experience redundancy |
| Build proactive outreach lists | High-recall model design |
| Retrain periodically | Class imbalance reflects evolving behavior |
| Augment with relationship data | Model's feature limitations |