Problem Statement¶

Context¶

AllLife Bank is a mid-sized, fast-growing US-based financial institution that offers a range of retail banking services, including savings and checking accounts, fixed deposits, and personal loans. The bank’s business model is centered on building long-term customer relationships, expanding its retail footprint, and growing its loan portfolio to drive sustainable profitability through interest income.

The bank currently relies on a large base of liability customers (depositors) but faces a significant under-representation of asset customers (borrowers). To drive profitability through interest income, the bank must aggressively expand its loan portfolio by converting existing depositors into personal loan customers.

Last year’s pilot campaign achieved a 9% conversion rate, validating the potential of this strategy. However, to optimize marketing spend and improve efficiency, the retail marketing department requires a more data-driven approach. Enhancing the success ratio of these campaigns is critical for sustainable growth and maximizing customer lifetime value.

Objective¶

The objective is to develop a predictive classification model that identifies patterns and key factors driving personal loan adoption among existing liability customers. By uncovering the demographic and behavioral drivers of loan conversion, the goal is to enable targeted segmentation and more precise marketing interventions that improve campaign conversion rates, optimize marketing spend, and enhance overall profitability through higher-quality loan portfolio growth.

Data Dictionary¶

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: Number of years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIPCode: Home address ZIP code
  • Family: Family size of the customer
  • CCAvg: Average monthly spending on credit cards (in thousand dollars)
  • Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage: Value of house mortgage, if any (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

Importing necessary libraries¶

In [1]:
# Installing the libraries with the specified version.
%pip install numpy==2.0.2 pandas==2.2.2 matplotlib==3.10.0 seaborn==0.13.2 scikit-learn==1.6.1 sklearn-pandas==2.2.0 -q --user
Note: you may need to restart the kernel to use updated packages.

Note:

  1. After running the above cell, restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), then write the relevant code for the project from the next cell onward and run all subsequent cells sequentially.

  2. On executing the above line of code, you might see a warning regarding package dependencies. This warning can be safely ignored, as the code above ensures that all necessary libraries and their dependencies are installed to successfully execute the rest of the notebook.

In [2]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Library to split data
from sklearn.model_selection import train_test_split

# To build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
)

# to suppress unnecessary warnings
import warnings
warnings.filterwarnings("ignore")

Loading the dataset¶

In [3]:
Loan = pd.read_csv("Loan_Modelling.csv")  # read the data
# copying data to another variable to avoid any changes to original data
data = Loan.copy()

Data Overview¶

  • Observations
  • Sanity checks

Shape of the Data¶

In [4]:
data.shape
Out[4]:
(5000, 14)
In [6]:
data.head()
Out[6]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [5]:
data.tail()
Out[5]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1

Understanding the data types¶

In [7]:
data.dtypes
Out[7]:
ID                      int64
Age                     int64
Experience              int64
Income                  int64
ZIPCode                 int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage                int64
Personal_Loan           int64
Securities_Account      int64
CD_Account              int64
Online                  int64
CreditCard              int64
dtype: object

Checking Statistical Summary¶

In [8]:
data.describe().T
Out[8]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0

Clean up¶

In [9]:
data = data.drop(['ID'], axis=1)  # drop the ID column, which carries no predictive information

Data Preprocessing¶

Checking for Anomalous Values¶

In [10]:
data["Experience"].unique()
Out[10]:
array([ 1, 19, 15,  9,  8, 13, 27, 24, 10, 39,  5, 23, 32, 41, 30, 14, 18,
       21, 28, 31, 11, 16, 20, 35,  6, 25,  7, 12, 26, 37, 17,  2, 36, 29,
        3, 22, -1, 34,  0, 38, 40, 33,  4, -2, 42, -3, 43])
In [11]:
# checking for experience <0
data[data["Experience"] < 0]["Experience"].unique()
Out[11]:
array([-1, -2, -3])
In [12]:
# Correcting the negative experience values in one step
# (column-level inplace replace is deprecated in recent pandas)
data["Experience"] = data["Experience"].replace({-1: 1, -2: 2, -3: 3})
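Since each negative Experience value is corrected to its positive counterpart, the same fix can be expressed in one vectorized step with `abs()`. A minimal sketch on a toy column (not the bank data):

```python
import pandas as pd

# Toy Experience column with the same sign-entry errors seen in the data
exp = pd.Series([1, 19, -1, -2, 15, -3, 0])

# Equivalent one-step fix: negative entries become their positive counterparts
exp_fixed = exp.abs()

print(exp_fixed.tolist())  # [1, 19, 1, 2, 15, 3, 0]
```

This assumes the negatives are sign-entry errors, which the per-value replacement above assumes as well.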
In [13]:
data["Education"].unique()
Out[13]:
array([1, 2, 3])

Feature Engineering¶

In [14]:
# checking the number of uniques in the zip code
data["ZIPCode"].nunique()
Out[14]:
467
In [15]:
data["ZIPCode"] = data["ZIPCode"].astype(str)
print(
    "Number of unique values if we take first two digits of ZIPCode: ",
    data["ZIPCode"].str[0:2].nunique(),
)
data["ZIPCode"] = data["ZIPCode"].str[0:2]

data["ZIPCode"] = data["ZIPCode"].astype("category")
Number of unique values if we take first two digits of ZIPCode:  7
In [16]:
## Converting the data type of categorical features to 'category'
cat_cols = [
    "Education",
    "Personal_Loan",
    "Securities_Account",
    "CD_Account",
    "Online",
    "CreditCard",
    "ZIPCode",
]
data[cat_cols] = data[cat_cols].astype("category")
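Beyond signaling that these columns are categorical, the `category` dtype also stores integer codes instead of full values, which typically shrinks memory use. A small sketch on a toy flag column (hypothetical data, not the bank dataset):

```python
import pandas as pd

# Toy binary flag column, as it would be stored after reading the CSV (int64)
flags = pd.Series([0, 1, 0, 0, 1] * 1000)

as_int = flags.memory_usage(deep=True)                      # 8 bytes per value
as_cat = flags.astype("category").memory_usage(deep=True)   # small integer codes

print(as_int, as_cat)  # the category version is substantially smaller
```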

Exploratory Data Analysis¶

EDA is a critical step in any data project: it is used to investigate and understand the data before model construction.

The following questions serve as a starting point to help you approach the analysis and generate initial insights:

Questions:

  1. What is the distribution of the mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
    From data.describe(): Mean = 56.5K, Median = 0 (the 50th percentile is 0), 25th percentile = 0, 75th percentile = 101K, Max = 635K.

    Pattern: Heavily right-skewed. The majority of customers (50%+) have no mortgage at all (value = 0). A smaller segment has mortgages, with a few high outliers near 635K. This bimodal-like split (zero vs. non-zero) is the dominant pattern.

  2. How many customers have credit cards?
    From the stats: CreditCard mean = 0.294 with 5,000 total customers. ~1,470 customers (29.4%) have a credit card from another bank. About 70.6% do not.

  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?

    • Income and CCAvg show the strongest positive correlation with Personal_Loan.
    • See additional observations in the correlation observations cell below.
  4. How does a customer's interest in purchasing a loan vary with their education?
    Undergraduates accept loans at less than a third the rate of Graduate/Advanced customers. The biggest jump is between Undergrad and Graduate — once a customer has a graduate degree, their likelihood of taking a loan nearly triples. Graduate and Advanced/Professional customers should be the primary targets for loan marketing campaigns.

  5. How does a customer's interest in purchasing a loan vary with their age?
    Age has virtually no effect on loan interest. Acceptance rates are nearly flat across all age groups (8.7%–10.8%), and the mean age of acceptors (45.1) vs. non-acceptors (45.4) is essentially identical. This aligns with the near-zero correlation (r = -0.008) from the heatmap. Age should not be used as a segmentation variable for loan campaigns.

Beyond these questions, additional observations are noted next to each code block below.

Univariate Analysis¶

In [17]:
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color="violet")
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, color="steelblue")
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, color="steelblue")
    ax_hist2.axvline(data[feature].mean(), color="green", linestyle="--")
    ax_hist2.axvline(data[feature].median(), color="black", linestyle="-")
    plt.show()
In [18]:
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar center
        y = p.get_height()  # height of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [19]:
histogram_boxplot(data, "Age")

Observations on Age¶

  • Age ranges from 23 to 67, with a mean (~45) and median (~45) that are nearly equal, suggesting a roughly symmetric distribution.
  • The boxplot shows no extreme outliers; the distribution is fairly uniform across middle-aged customers.
  • The bulk of customers fall between 35 and 55 years old — a working-age, financially active demographic.
In [20]:
histogram_boxplot(data, "Mortgage")

Observations on Mortgage¶

  • The mortgage distribution is heavily right-skewed: the median is 0 (more than 50% of customers have no mortgage).
  • The mean (~56.5K) is much higher than the median due to a long right tail with values up to 635K.
  • There are visible outliers on the high end — a small group of customers with very large mortgages.
  • The zero-inflated nature of this feature (most values = 0) suggests it may have limited predictive power for the majority of customers.
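The zero-inflation and skew claims above are easy to verify directly. A sketch on a toy mortgage column mimicking the shape described (hypothetical values, not the bank data):

```python
import pandas as pd

# Toy mortgage values: mostly zero, with a long right tail up to 635
mortgage = pd.Series([0, 0, 0, 0, 0, 0, 90, 101, 155, 635])

zero_share = (mortgage == 0).mean()  # fraction of customers with no mortgage
skew = mortgage.skew()               # positive value => right-skewed

print(f"{zero_share:.0%} zero, skew = {skew:.2f}")
```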
In [21]:
histogram_boxplot(data, "Experience")

Observations on Experience¶

  • Professional experience ranges from 0 to 43 years (after correcting the negative values).
  • The distribution is fairly uniform/flat, meaning customers are spread across all experience levels.
  • Mean and median are both around 20 years, with no significant skew.
  • Note: Experience is nearly perfectly correlated with Age, making it largely redundant as a predictor.
In [22]:
histogram_boxplot(data, "Income")

Observations on Income¶

  • Income is right-skewed: most customers earn between 39K–98K, but the tail extends to 224K.
  • The mean (~73.8K) is notably higher than the median (~64K), confirming the skew.
  • Several high-income outliers are visible in the boxplot.
  • Income is expected to be one of the strongest predictors of personal loan acceptance.
In [23]:
histogram_boxplot(data, "CCAvg")

Observations on CCAvg¶

  • Monthly credit card spend (CCAvg) is right-skewed, with most customers spending between 0.7K–2.5K/month.
  • The mean (~1.94K) exceeds the median (~1.5K), driven by a few high spenders (max = 10K).
  • High-spend outliers are visible; most customers cluster at the lower end of the scale.
  • Higher CCAvg likely signals greater financial activity and correlates with personal loan uptake.
In [24]:
labeled_barplot(data, "Family", perc=True)

Observations on Family¶

  • Family size is evenly distributed across 1–4 members, each accounting for roughly 25% of customers.
  • No dominant family size exists — the bank's customer base is demographically diverse.
  • Family size may influence financial needs; larger families might be more inclined to seek personal loans.
In [25]:
labeled_barplot(data, "CreditCard", perc=True)

Observations on CreditCard¶

  • Approximately 29.4% (~1,470) of customers hold a credit card from another bank.
  • The majority (70.6%) do not have an external credit card.
  • This attribute may indicate competitive banking relationships and could influence loan targeting strategies.
In [26]:
labeled_barplot(data, "Education", perc=True)

Observations on Education¶

  • Education is distributed across three levels: Undergrad (~42%), Graduate (~29%), and Advanced/Professional (~29%).
  • Undergraduates are the largest group, but advanced-degree holders represent a significant and valuable segment.
  • Education level is expected to correlate with income and personal loan interest.
In [27]:
labeled_barplot(data, "Securities_Account", perc=True)

Observations on Securities_Account¶

  • Only about 10.4% of customers hold a securities account with the bank.
  • The vast majority (89.6%) do not — securities account ownership is relatively rare in this customer base.
In [28]:
labeled_barplot(data, "CD_Account", perc=True)

Observations on CD_Account¶

  • Only 6% of customers have a CD (Certificate of Deposit) account — the rarest product in the dataset.
  • Despite its low prevalence, CD account ownership shows a strong association with personal loan acceptance (explored in bivariate analysis).
In [29]:
labeled_barplot(data, "Online", perc=True)

Observations on Online¶

  • About 59.7% of customers use online banking, while 40.3% do not.
  • The majority of the bank's customers are digitally engaged, which is relevant for designing targeted online marketing campaigns.
In [30]:
labeled_barplot(data, "ZIPCode", perc=True)

Observation on ZIPCode¶

  • After truncating to the first two digits, there are 7 ZIP code groups, with the majority of customers concentrated in a few regions.
  • The geographic concentration suggests the bank has a strong regional presence, which could inform location-based targeting.

Bivariate Analysis¶

Correlation Heatmap¶

In [31]:
# Correlation heatmap for numeric features
plt.figure(figsize=(12, 8))
numeric_data = data.select_dtypes(include=['float64', 'int64'])
corr_matrix = numeric_data.corr()
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt='.2f',
    cmap='coolwarm',
    center=0,
    linewidths=0.5
)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()
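To back the heatmap with exact numbers, `corrwith` ranks every numeric column against the target in one line. A sketch on toy data (in the real notebook, Personal_Loan would first need casting back from category to int):

```python
import pandas as pd

# Toy frame standing in for the numeric features plus the 0/1 target
df = pd.DataFrame({
    "Income": [49, 34, 11, 100, 180, 160, 20, 95],
    "CCAvg":  [1.6, 1.5, 1.0, 2.7, 8.9, 6.1, 0.4, 2.2],
    "Loan":   [0, 0, 0, 0, 1, 1, 0, 1],
})

# Pearson correlation of every feature column with the target, strongest first
ranking = df.drop(columns="Loan").corrwith(df["Loan"]).sort_values(ascending=False)
print(ranking)
```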

Personal Loan Acceptance Rate by Education¶

In [32]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
sns.countplot(
    data=data,
    x='Education',
    hue='Personal_Loan',
    palette='Set2',
    ax=axes[0]
)
axes[0].set_title('Personal Loan Acceptance by Education Level')
axes[0].set_xlabel('Education (1=Undergrad, 2=Graduate, 3=Advanced)')
axes[0].legend(title='Personal Loan', labels=['No', 'Yes'])

# Acceptance rate
edu_loan_rate = (
    data.groupby('Education')['Personal_Loan']
    .apply(lambda x: x.astype(int).mean() * 100)
    .reset_index()
)
edu_loan_rate.columns = ['Education', 'Acceptance Rate (%)']
sns.barplot(
    data=edu_loan_rate,
    x='Education',
    y='Acceptance Rate (%)',
    palette='Set2',
    ax=axes[1]
)
axes[1].set_title('Loan Acceptance Rate by Education Level')
axes[1].set_xlabel('Education (1=Undergrad, 2=Graduate, 3=Advanced)')
for p in axes[1].patches:
    axes[1].annotate(f'{p.get_height():.1f}%', (p.get_x() + p.get_width()/2, p.get_height()),
                     ha='center', va='bottom')
plt.tight_layout()
plt.show()

print(edu_loan_rate.to_string(index=False))
Education  Acceptance Rate (%)
        1             4.437023
        2            12.972202
        3            13.657562

Observations — Loan Acceptance by Education:

Education Level             Total Customers   Accepted Loan   Acceptance Rate
Undergrad (1)                         2,096              93             4.44%
Graduate (2)                          1,403             182            12.97%
Advanced/Professional (3)             1,501             205            13.66%
  • Undergraduates have a significantly lower loan acceptance rate (4.44%) compared to Graduate and Advanced degree holders.
  • Graduate and Advanced/Professional customers accept loans at nearly 3x the rate of undergraduates.
  • The jump in acceptance is most dramatic between Undergrad and Graduate — suggesting that education level (likely as a proxy for income) is a meaningful segmentation variable.
  • Marketing implication: Campaigns should be heavily weighted toward Graduate and Advanced/Professional customers for higher conversion rates.

Personal Loan Acceptance vs. Age¶

In [33]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot of Age by Personal Loan
sns.boxplot(
    data=data,
    x='Personal_Loan',
    y='Age',
    palette='Set2',
    ax=axes[0]
)
axes[0].set_title('Age Distribution by Personal Loan Acceptance')
axes[0].set_xticklabels(['No Loan', 'Accepted Loan'])

# Acceptance rate by age group
data['Age_Group'] = pd.cut(
    data['Age'],
    bins=[20, 30, 40, 50, 60, 70],
    labels=['20s', '30s', '40s', '50s', '60s']
)
age_loan_rate = (
    data.groupby('Age_Group', observed=True)['Personal_Loan']
    .apply(lambda x: x.astype(int).mean() * 100)
    .reset_index()
)
age_loan_rate.columns = ['Age_Group', 'Acceptance Rate (%)']
sns.barplot(
    data=age_loan_rate,
    x='Age_Group',
    y='Acceptance Rate (%)',
    palette='Set2',
    ax=axes[1]
)
axes[1].set_title('Loan Acceptance Rate by Age Group')
for p in axes[1].patches:
    axes[1].annotate(f'{p.get_height():.1f}%', (p.get_x() + p.get_width()/2, p.get_height()),
                     ha='center', va='bottom')
plt.tight_layout()
plt.show()

# Drop temporary column
data.drop(columns=['Age_Group'], inplace=True)

Observations — Loan Acceptance by Age:

Age Group   Total Customers   Accepted Loan   Acceptance Rate
20s                     624              66            10.58%
30s                   1,236             118             9.55%
40s                   1,270             122             9.61%
50s                   1,323             115             8.69%
60s                     547              59            10.79%
  • Acceptance rates are remarkably consistent across all age groups, ranging only from ~8.7% to ~10.8%.
  • The mean and median age of customers who accepted a loan (45.1 / 45.0) are virtually identical to those who did not (45.4 / 45.0).
  • Age is not a meaningful predictor of personal loan interest — this is confirmed by its near-zero correlation (r = -0.008) with Personal_Loan.
  • Marketing campaigns should not be segmented by age; instead, focus on higher-signal attributes like Income, CCAvg, and CD_Account.

Personal Loan Acceptance vs. Income¶

In [34]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.boxplot(
    data=data,
    x='Personal_Loan',
    y='Income',
    palette='Set2',
    ax=axes[0]
)
axes[0].set_title('Income Distribution by Personal Loan Acceptance')
axes[0].set_xticklabels(['No Loan', 'Accepted Loan'])

sns.kdeplot(
    data=data[data['Personal_Loan'].astype(int) == 0],
    x='Income',
    label='No Loan',
    ax=axes[1],
    fill=True,
    alpha=0.4
)
sns.kdeplot(
    data=data[data['Personal_Loan'].astype(int) == 1],
    x='Income',
    label='Accepted Loan',
    ax=axes[1],
    fill=True,
    alpha=0.4
)
axes[1].set_title('Income Density by Personal Loan Acceptance')
axes[1].legend()
plt.tight_layout()
plt.show()

Observations:

  • Customers who accepted a personal loan have significantly higher median income than those who did not.
  • The income distribution for loan acceptors is shifted clearly to the right, confirming Income as one of the strongest predictors of loan acceptance.

Personal Loan Acceptance vs. CCAvg¶

In [35]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.boxplot(
    data=data,
    x='Personal_Loan',
    y='CCAvg',
    palette='Set2',
    ax=axes[0]
)
axes[0].set_title('CCAvg Distribution by Personal Loan Acceptance')
axes[0].set_xticklabels(['No Loan', 'Accepted Loan'])

sns.kdeplot(
    data=data[data['Personal_Loan'].astype(int) == 0],
    x='CCAvg',
    label='No Loan',
    ax=axes[1],
    fill=True,
    alpha=0.4
)
sns.kdeplot(
    data=data[data['Personal_Loan'].astype(int) == 1],
    x='CCAvg',
    label='Accepted Loan',
    ax=axes[1],
    fill=True,
    alpha=0.4
)
axes[1].set_title('CCAvg Density by Personal Loan Acceptance')
axes[1].legend()
plt.tight_layout()
plt.show()

Observations:

  • Customers who accepted a personal loan have a higher average monthly credit card spend (CCAvg).
  • This suggests that higher spenders are more likely to take on additional credit products.

Personal Loan Acceptance vs. CD Account¶

In [36]:
cd_loan = data.groupby('CD_Account')['Personal_Loan'].apply(
    lambda x: x.astype(int).mean() * 100
).reset_index()
cd_loan.columns = ['CD_Account', 'Acceptance Rate (%)']

sns.barplot(data=cd_loan, x='CD_Account', y='Acceptance Rate (%)', palette='Set2')
plt.title('Loan Acceptance Rate by CD Account Ownership')
plt.xticks([0, 1], ['No CD Account', 'Has CD Account'])
for p in plt.gca().patches:
    plt.gca().annotate(f'{p.get_height():.1f}%',
                       (p.get_x() + p.get_width()/2, p.get_height()),
                       ha='center', va='bottom')
plt.show()

print(cd_loan.to_string(index=False))
CD_Account  Acceptance Rate (%)
         0             7.237122
         1            46.357616

Observations:

  • Customers with a CD (Certificate of Deposit) account have a dramatically higher loan acceptance rate.
  • This is one of the strongest binary predictors of personal loan acceptance, likely because CD account holders are more financially engaged with the bank.

Overall Observations — Correlation with Personal Loan:

Attribute            Correlation   Strength
Income                      0.50   Strong positive
CCAvg                       0.37   Moderate positive
CD_Account                  0.32   Moderate positive
Mortgage                    0.14   Weak positive
Education                   0.14   Weak positive
Family                      0.06   Negligible
Securities_Account          0.02   Negligible
Online                    ~0.006   No correlation
CreditCard                ~0.003   No correlation
Age / Experience         ~-0.008   No correlation
  • Income is the strongest predictor (r = 0.50) — higher earners are significantly more likely to accept a personal loan.
  • CCAvg (r = 0.37) reflects financial activity; heavy credit card spenders are more likely to take loans.
  • CD_Account (r = 0.32) is a strong binary signal — CD holders are much more engaged banking customers.
  • Age and Experience show virtually no correlation with loan acceptance, despite being highly correlated with each other.
  • Online, CreditCard (external), and ZIPCode have negligible predictive value for loan acceptance.

Data Preprocessing - cont'd¶

Outlier Detection¶

In [38]:
Q1 = data.select_dtypes(include=["float64", "int64"]).quantile(0.25)
Q3 = data.select_dtypes(include=["float64", "int64"]).quantile(0.75)

IQR = Q3 - Q1  # Interquartile Range (75th percentile - 25th percentile)

lower = (
    Q1 - 1.5 * IQR
)  # Finding lower and upper bounds for all values. All values outside these bounds are outliers
upper = Q3 + 1.5 * IQR
In [39]:
(
    (data.select_dtypes(include=["float64", "int64"]) < lower)
    | (data.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(data) * 100
Out[39]:
Age           0.00
Experience    0.00
Income        1.92
Family        0.00
CCAvg         6.48
Mortgage      5.82
dtype: float64
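The percentages above quantify the outliers, but the notebook leaves them untreated, which is reasonable for tree-based models. If capping were ever desired, a standard IQR clip could look like this (a sketch on a toy column, not the bank data):

```python
import pandas as pd

# Toy right-skewed column, similar in shape to CCAvg
s = pd.Series([0.3, 0.7, 1.5, 2.5, 3.0, 10.0])

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1

# Cap values outside the 1.5*IQR whiskers instead of dropping rows
capped = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(capped.max())  # the 10.0 outlier is pulled down to the upper whisker
```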

Data Preparation for Modeling¶

In [40]:
# dropping Experience as it is perfectly correlated with Age
X = data.drop(["Personal_Loan", "Experience"], axis=1)
Y = data["Personal_Loan"]

X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)

X = X.astype(float)

# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)
In [41]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3500, 17)
Shape of test set :  (1500, 17)
Percentage of classes in training set:
Personal_Loan
0    0.905429
1    0.094571
Name: proportion, dtype: float64
Percentage of classes in test set:
Personal_Loan
0    0.900667
1    0.099333
Name: proportion, dtype: float64

Model Building¶

Model Evaluation Criterion¶

We don't just default to accuracy; we prioritize Recall.

  • A false negative predicts that a customer won't take the loan when they actually would have.
  • Each false negative is therefore a missed sales opportunity.
  • So it is more costly to miss a real positive than to incorrectly flag someone.

Recall is supported by F1, Precision, and Accuracy for a full picture.
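A toy example makes the point concrete: with a ~10% positive rate like this dataset, a naive model that predicts "no loan" for everyone still looks accurate while having zero recall.

```python
from sklearn.metrics import accuracy_score, recall_score

# 100 customers, 10 of whom would actually take the loan
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # naive model: predict "no loan" for everyone

print(accuracy_score(y_true, y_pred))  # 0.9 -- looks good
print(recall_score(y_true, y_pred))    # 0.0 -- misses every real buyer
```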

Model Building¶

In [42]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [43]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Decision Tree (sklearn default)¶

In [44]:
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)
Out[44]:
DecisionTreeClassifier(random_state=1)

Checking model performance on training data¶

In [45]:
confusion_matrix_sklearn(model, X_train, y_train)
In [46]:
decision_tree_perf_train = model_performance_classification_sklearn(
    model, X_train, y_train
)
decision_tree_perf_train
Out[46]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
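Perfect scores on training data are the signature of an unpruned tree memorizing the training set, not of a good model. A minimal sketch on synthetic data (via `make_classification`, not the bank dataset) reproduces the pattern:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced classification problem (~90% negative class)
X, y = make_classification(n_samples=500, weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Default (unpruned) tree grows until every training sample is classified
tree_full = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)

print(tree_full.score(X_tr, y_tr))  # 1.0: the tree fits training data perfectly
print(tree_full.score(X_te, y_te))  # typically lower on unseen data
```

This is why test-set performance, reported later, is the number that matters.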

Visualizing the Decision Tree¶

In [47]:
feature_names = list(X_train.columns)
print(feature_names)
['Age', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZIPCode_91', 'ZIPCode_92', 'ZIPCode_93', 'ZIPCode_94', 'ZIPCode_95', 'ZIPCode_96', 'Education_2', 'Education_3']
In [48]:
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [49]:
# Text report showing the rules of a decision tree -

print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- Family <= 3.50
|   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |--- CCAvg <= 2.20
|   |   |   |   |   |   |   |--- weights: [48.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  2.20
|   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- ZIPCode_93 >  0.50
|   |   |   |   |   |--- Age <= 37.50
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Age >  37.50
|   |   |   |   |   |   |--- Income <= 112.00
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Income >  112.00
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- Family >  3.50
|   |   |   |   |--- Age <= 32.50
|   |   |   |   |   |--- CCAvg <= 2.40
|   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.40
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  32.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age >  60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  26.50
|   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |--- CCAvg <= 3.35
|   |   |   |   |   |   |   |--- Age <= 37.50
|   |   |   |   |   |   |   |   |--- Age <= 33.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  33.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  37.50
|   |   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |   |--- weights: [23.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Income >  82.50
|   |   |   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Income >  83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  3.35
|   |   |   |   |   |   |   |--- Family <= 3.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- Family >  3.00
|   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.55
|   |   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |   |--- weights: [43.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income >  81.50
|   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |--- Mortgage <= 93.50
|   |   |   |   |   |   |   |   |   |--- weights: [26.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Mortgage >  93.50
|   |   |   |   |   |   |   |   |   |--- Mortgage <= 104.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Mortgage >  104.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |   |--- ZIPCode_91 <= 0.50
|   |   |   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ZIPCode_91 >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- CD_Account >  0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income >  92.50
|   |   |   |--- Family <= 2.50
|   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |--- Age <= 56.50
|   |   |   |   |   |   |   |   |--- weights: [27.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  56.50
|   |   |   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Securities_Account >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |--- ZIPCode_94 <= 0.50
|   |   |   |   |   |   |   |--- Income <= 107.00
|   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income >  107.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- ZIPCode_94 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |--- Family >  2.50
|   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |--- CCAvg <= 4.85
|   |   |   |   |   |   |--- weights: [0.00, 17.00] class: 1
|   |   |   |   |   |--- CCAvg >  4.85
|   |   |   |   |   |   |--- CCAvg <= 4.95
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  4.95
|   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Age >  57.50
|   |   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |   |--- ZIPCode_94 <= 0.50
|   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_94 >  0.50
|   |   |   |   |   |   |   |--- Age <= 59.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  59.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode_93 >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|--- Income >  116.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- weights: [0.00, 53.00] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- weights: [0.00, 62.00] class: 1
|   |--- Family >  2.50
|   |   |--- weights: [0.00, 154.00] class: 1
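The top-level splits above already suggest a compact targeting heuristic: customers with Income above 116.5 (thousand dollars) land in nearly pure class-1 leaves whenever Family exceeds 2.5 or the customer holds a graduate or advanced degree. Restated as an illustrative sketch (a hand-written rule of thumb, not the fitted model):

```python
def likely_acceptor(income, family, education):
    """Heuristic distilled from the tree's dominant splits.

    education: 1 = Undergrad, 2 = Graduate, 3 = Advanced/Professional.
    """
    if income <= 116.5:
        return False  # the bulk of low/mid-income leaves are class 0
    if family > 2.5:
        return True  # leaf weights [0, 154] -> class 1
    return education >= 2  # Education_2 / Education_3 leaves are class 1
```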

In [50]:
# Feature importance in tree building: the importance of a feature is the
# (normalized) total reduction of the criterion brought by that feature
# (also known as Gini importance)

print(
    pd.DataFrame(
        model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.308098
Family              0.259255
Education_2         0.166192
Education_3         0.147127
CCAvg               0.048798
Age                 0.033150
CD_Account          0.017273
ZIPCode_94          0.007183
ZIPCode_93          0.004682
Mortgage            0.003236
Securities_Account  0.002224
Online              0.002224
ZIPCode_91          0.000556
CreditCard          0.000000
ZIPCode_92          0.000000
ZIPCode_96          0.000000
ZIPCode_95          0.000000
In [51]:
importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Checking model performance on test data¶

In [52]:
confusion_matrix_sklearn(model, X_test, y_test)
In [53]:
decision_tree_perf_test = model_performance_classification_sklearn(model, X_test, y_test)
decision_tree_perf_test
Out[53]:
Accuracy Recall Precision F1
0 0.986 0.932886 0.926667 0.929766

Model Performance Improvement¶

Pre-pruning¶

Note: The parameters below are a sample set; feel free to experiment with other combinations.

In [55]:
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 7, 2)
max_leaf_nodes_values = [50, 75, 150, 250]
min_samples_split_values = [10, 30, 50, 70]

# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0

# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:

            # Initialize the tree with the current set of parameters
            estimator = DecisionTreeClassifier(
                max_depth=max_depth,
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                class_weight='balanced',
                random_state=42
            )

            # Fit the model to the training data
            estimator.fit(X_train, y_train)

            # Make predictions on the training and test sets
            y_train_pred = estimator.predict(X_train)
            y_test_pred = estimator.predict(X_test)

            # Calculate recall scores for training and test sets
            train_recall_score = recall_score(y_train, y_train_pred)
            test_recall_score = recall_score(y_test, y_test_pred)

            # Calculate the absolute difference between training and test recall scores
            score_diff = abs(train_recall_score - test_recall_score)

            # Keep the current estimator only if it both generalizes better
            # (smaller train/test recall gap) and improves on the best test recall
            if (score_diff < best_score_diff) and (test_recall_score > best_test_score):
                best_score_diff = score_diff
                best_test_score = test_recall_score
                best_estimator = estimator

# Print the best parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test recall score: {best_test_score}")
Best parameters found:
Max depth: 2
Max leaf nodes: 50
Min samples split: 10
Best test recall score: 1.0
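The manual triple loop can also be expressed with sklearn's GridSearchCV, which cross-validates on the training data rather than selecting against the test set. A sketch under the same parameter grid (grid.fit(X_train, y_train) would be the notebook call):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": np.arange(2, 7, 2),
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}

grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="recall",  # optimize the same metric as the loop above
    cv=5,
)
# grid.fit(X_train, y_train)
# grid.best_params_, grid.best_estimator_
```

Selecting hyperparameters by cross-validated recall keeps the test set as a genuinely held-out check on the final model.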
In [56]:
# Fit the best algorithm to the data.
estimator = best_estimator
estimator.fit(X_train, y_train)
Out[56]:
DecisionTreeClassifier(class_weight='balanced', max_depth=np.int64(2),
                       max_leaf_nodes=50, min_samples_split=10,
                       random_state=42)

Checking performance on training data

In [57]:
confusion_matrix_sklearn(estimator, X_train, y_train) 
In [58]:
decision_tree_tune_perf_train = model_performance_classification_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_train
Out[58]:
Accuracy Recall Precision F1
0 0.790286 1.0 0.310798 0.474212

Visualizing the Decision Tree

In [59]:
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [60]:
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1344.67, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- weights: [64.61, 79.31] class: 1
|--- Income >  92.50
|   |--- Family <= 2.50
|   |   |--- weights: [298.20, 697.89] class: 1
|   |--- Family >  2.50
|   |   |--- weights: [42.52, 972.81] class: 1
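The depth-2 pre-pruned tree collapses to very simple rules: below the income threshold, only high card spenders are predicted to accept; above it, both Family branches predict class 1 under balanced class weights, so the Family split does not change the prediction. Restated as a sketch:

```python
def pruned_tree_predict(income, ccavg):
    """Mirror of the depth-2 pre-pruned tree above (class_weight='balanced')."""
    if income <= 92.5:
        return 1 if ccavg > 2.95 else 0
    # Income > 92.5: both Family branches lean class 1 under balanced weights
    return 1
```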

In [61]:
print(
    pd.DataFrame(
        estimator.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.876529
CCAvg               0.066940
Family              0.056531
Age                 0.000000
Mortgage            0.000000
Securities_Account  0.000000
CD_Account          0.000000
Online              0.000000
CreditCard          0.000000
ZIPCode_91          0.000000
ZIPCode_92          0.000000
ZIPCode_93          0.000000
ZIPCode_94          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
Education_2         0.000000
Education_3         0.000000
In [62]:
importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Checking performance on test data

In [63]:
confusion_matrix_sklearn(estimator, X_test, y_test)  
In [64]:
decision_tree_tune_perf_test = model_performance_classification_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test
Out[64]:
Accuracy Recall Precision F1
0 0.779333 1.0 0.310417 0.473768

Post-pruning¶

In [65]:
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
In [66]:
pd.DataFrame(path)
Out[66]:
ccp_alphas impurities
0 0.000000 0.000000
1 0.000186 0.001114
2 0.000214 0.001542
3 0.000242 0.002750
4 0.000250 0.003250
5 0.000268 0.004324
6 0.000272 0.004868
7 0.000276 0.005420
8 0.000381 0.005801
9 0.000527 0.006329
10 0.000625 0.006954
11 0.000700 0.007654
12 0.000769 0.010731
13 0.000882 0.014260
14 0.000889 0.015149
15 0.001026 0.017200
16 0.001305 0.018505
17 0.001647 0.020153
18 0.002333 0.022486
19 0.002407 0.024893
20 0.003294 0.028187
21 0.006473 0.034659
22 0.025146 0.084951
23 0.039216 0.124167
24 0.047088 0.171255
In [67]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Next, we train a decision tree using effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

In [69]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.04708834100596766
In [ ]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

Recall vs alpha for training and testing sets

In [70]:
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)

recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
In [71]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
In [ ]:
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=np.float64(0.0), random_state=1)
In [77]:
estimator_2 = DecisionTreeClassifier(
    ccp_alpha=0.000272, class_weight={0: 0.15, 1: 0.85}, random_state=1  
)
estimator_2.fit(X_train, y_train)
Out[77]:
DecisionTreeClassifier(ccp_alpha=0.000272, class_weight={0: 0.15, 1: 0.85},
                       random_state=1)

Checking performance on training data

In [78]:
confusion_matrix_sklearn(estimator_2, X_train, y_train)
In [79]:
decision_tree_tune_post_train = model_performance_classification_sklearn(estimator_2, X_train, y_train) 
decision_tree_tune_post_train
Out[79]:
Accuracy Recall Precision F1
0 0.999143 1.0 0.991018 0.995489

Visualizing the Decision Tree

In [80]:
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator_2,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [81]:
print(tree.export_text(estimator_2, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [374.10, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |   |   |   |--- CCAvg >  3.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- Age >  36.50
|   |   |   |   |   |   |--- ZIPCode_91 <= 0.50
|   |   |   |   |   |   |   |--- weights: [6.15, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_91 >  0.50
|   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |--- Income >  81.50
|   |   |   |   |   |--- Mortgage <= 152.00
|   |   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg >  3.05
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.85
|   |   |   |   |   |   |   |   |   |--- ZIPCode_91 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- CreditCard <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |   |--- CreditCard >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- ZIPCode_91 >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg >  3.85
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |   |--- Securities_Account >  0.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- Mortgage >  152.00
|   |   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |--- CCAvg >  3.95
|   |   |   |   |--- weights: [6.75, 0.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- CCAvg <= 4.50
|   |   |   |   |--- weights: [0.00, 6.80] class: 1
|   |   |   |--- CCAvg >  4.50
|   |   |   |   |--- weights: [0.15, 0.00] class: 0
|--- Income >  98.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |--- CCAvg <= 4.20
|   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  4.20
|   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |--- Income >  100.00
|   |   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |   |--- weights: [2.10, 0.00] class: 0
|   |   |   |   |   |   |--- Securities_Account >  0.50
|   |   |   |   |   |   |   |--- CreditCard <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |--- CreditCard >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |--- Income >  103.50
|   |   |   |   |   |   |--- weights: [64.95, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- Income <= 110.00
|   |   |   |   |   |--- weights: [1.80, 0.00] class: 0
|   |   |   |   |--- Income >  110.00
|   |   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |   |--- Mortgage <= 141.50
|   |   |   |   |   |   |   |--- CreditCard <= 0.50
|   |   |   |   |   |   |   |   |--- Age <= 48.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 2.55] class: 1
|   |   |   |   |   |   |   |   |--- Age >  48.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |--- CreditCard >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- Mortgage >  141.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- Income >  116.50
|   |   |   |   |   |   |--- weights: [0.00, 45.05] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- weights: [1.95, 0.00] class: 0
|   |   |   |   |--- CCAvg >  1.10
|   |   |   |   |   |--- Age <= 41.50
|   |   |   |   |   |   |--- ZIPCode_94 <= 0.50
|   |   |   |   |   |   |   |--- weights: [1.20, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_94 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |   |--- Age >  41.50
|   |   |   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- Income >  100.00
|   |   |   |   |   |   |   |--- weights: [0.15, 5.10] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 52.70] class: 1
|   |--- Family >  2.50
|   |   |--- Income <= 113.50
|   |   |   |--- CCAvg <= 2.75
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [3.90, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |--- weights: [0.90, 0.00] class: 0
|   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |   |   |--- Age <= 35.00
|   |   |   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  35.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.25] class: 1
|   |   |   |   |   |   |   |--- Age >  60.00
|   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |--- CCAvg >  2.75
|   |   |   |   |--- Age <= 57.00
|   |   |   |   |   |--- weights: [0.15, 11.90] class: 1
|   |   |   |   |--- Age >  57.00
|   |   |   |   |   |--- weights: [0.75, 0.00] class: 0
|   |   |--- Income >  113.50
|   |   |   |--- Age <= 66.00
|   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |--- CCAvg <= 2.50
|   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.50
|   |   |   |   |   |   |--- Age <= 60.50
|   |   |   |   |   |   |   |--- weights: [0.00, 5.10] class: 1
|   |   |   |   |   |   |--- Age >  60.50
|   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |--- Income >  116.50
|   |   |   |   |   |--- weights: [0.00, 130.90] class: 1
|   |   |   |--- Age >  66.00
|   |   |   |   |--- weights: [0.15, 0.00] class: 0

In [82]:
print(
    pd.DataFrame(
        estimator_2.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.598659
Education_2         0.138693
CCAvg               0.078911
Education_3         0.067460
Family              0.066408
Age                 0.018238
CD_Account          0.011027
Mortgage            0.005053
Securities_Account  0.004728
ZIPCode_94          0.003990
ZIPCode_91          0.003596
CreditCard          0.002434
ZIPCode_92          0.000804
Online              0.000000
ZIPCode_93          0.000000
ZIPCode_96          0.000000
ZIPCode_95          0.000000
In [83]:
importances = estimator_2.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Checking performance on test data

In [84]:
confusion_matrix_sklearn(estimator_2, X_test, y_test) 
In [85]:
decision_tree_tune_post_test = model_performance_classification_sklearn(estimator_2, X_test, y_test)
decision_tree_tune_post_test
Out[85]:
Accuracy Recall Precision F1
0 0.98 0.85906 0.934307 0.895105

Model Performance Comparison and Final Model Selection¶

In [ ]:
# Create a dataframe to compare the performance of the models

models_train_comp_df = pd.concat(
    [decision_tree_perf_train.T, decision_tree_tune_perf_train.T, decision_tree_tune_post_train.T], axis=1,
)
models_train_comp_df.columns = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning)", "Decision Tree (Post-Pruning)"]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[ ]:
Decision Tree (sklearn default) Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 1.0 0.790286 0.999143
Recall 1.0 1.000000 1.000000
Precision 1.0 0.310798 0.991018
F1 1.0 0.474212 0.995489
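The same side-by-side comparison is worth repeating on the test sets, where the trade-off between the three models actually matters. A sketch, reusing the test performance dataframes computed earlier:

```python
import pandas as pd


def compare_test_performance(default_df, prepruned_df, postpruned_df):
    """Concatenate per-model test metrics (one-row DataFrames) side by side."""
    comp = pd.concat([default_df.T, prepruned_df.T, postpruned_df.T], axis=1)
    comp.columns = [
        "Decision Tree (sklearn default)",
        "Decision Tree (Pre-Pruning)",
        "Decision Tree (Post-Pruning)",
    ]
    return comp


# In the notebook:
# compare_test_performance(decision_tree_perf_test,
#                          decision_tree_tune_perf_test,
#                          decision_tree_tune_post_test)
```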

Observations¶

The default (unpruned) tree fits the training data perfectly (all metrics 1.0), a textbook sign of overfitting, and serves as a baseline only. Pre-pruning with balanced class weights pushes recall to 1.0 on both train and test sets, but precision collapses to roughly 0.31, so about two of every three flagged customers would decline. Post-pruning strikes the better balance on test data (recall ~0.86, precision ~0.93, F1 ~0.90). If missing a potential borrower is far costlier than a wasted contact, the pre-pruned model is preferable; otherwise the post-pruned model is the stronger choice.

Actionable Insights and Business Recommendations¶

Recommendations to the Bank¶

1. Target High-Income, High-Spending Customers First¶

Feature importance plots consistently show Income and CCAvg (credit card average spending) as the top predictors of loan acceptance. This tells the bank:

  • Customers with higher disposable income and active spending habits are most likely to accept
  • Marketing spend should be concentrated on this segment rather than broad outreach
  • Personalized loan offers tied to spending patterns will have higher conversion rates
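As a sketch, such a segment can be pulled directly from the customer table (column names as in the notebook; the thresholds are illustrative, roughly matching the tree's top splits):

```python
import pandas as pd


def high_potential_segment(df, income_cut=116.5, ccavg_cut=2.95):
    """Customers above both the income and card-spend thresholds (illustrative cutoffs)."""
    return df[(df["Income"] > income_cut) & (df["CCAvg"] > ccavg_cut)]


customers = pd.DataFrame({"Income": [80, 130, 150], "CCAvg": [1.2, 3.5, 0.5]})
print(high_potential_segment(customers))  # only the Income=130, CCAvg=3.5 row qualifies
```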

2. Education Level is a Strong Signal — Tailor the Message¶

Education is a significant feature. Higher-educated customers tend to accept more. The recommendation:

  • Offer more sophisticated loan products (investment-linked, career development loans) to graduate-level customers
  • Use simpler, benefit-focused messaging for lower education segments

3. Don't Rely on Age Alone — Use Income as a Proxy¶

Experience was dropped because it perfectly overlapped with Age — meaning age adds no extra information once income is known. The bank should:

  • Avoid age-based segmentation strategies in isolation
  • Focus on financial profile over demographic profile

4. Use the Model for Proactive, Not Reactive, Outreach¶

The model is optimized for high Recall — catching nearly every customer likely to accept. This means:

  • Use it to build a pre-approved offer list and reach out proactively
  • Customers flagged by the model should receive targeted campaigns before they look elsewhere
  • The cost of a false positive (contacting someone who declines) is low compared to missing a genuine opportunity
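This asymmetry can be made concrete with a back-of-the-envelope expected-value calculation (all monetary figures below are illustrative assumptions, not bank data):

```python
def campaign_value(n_contacted, precision, profit_per_loan=1000.0, cost_per_contact=5.0):
    """Expected net value of contacting n customers flagged by the model.

    precision: fraction of contacted customers who actually accept.
    profit_per_loan and cost_per_contact are hypothetical figures.
    """
    conversions = n_contacted * precision
    return conversions * profit_per_loan - n_contacted * cost_per_contact


# Even at ~31% precision (the pre-pruned model), flagged outreach pays for itself:
print(campaign_value(n_contacted=100, precision=0.31))  # → 30500.0
```

Under these assumptions, each additional true acceptor is worth two hundred times the cost of one wasted contact, which is why a recall-first model is defensible here.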

5. Watch for Imbalance in the Customer Base¶

The model required class_weight corrections because loan acceptors are a minority class. This suggests:

  • The bank's current conversion rate is low — there is significant untapped potential
  • Even a modest improvement in targeting precision translates to meaningful revenue gain
  • The bank should periodically retrain the model as customer behavior shifts

6. Combine Model Predictions with Relationship Data¶

The model uses transactional and demographic features but does not capture customer satisfaction, recent life events, or prior interactions. Recommendation: Use the model as a first filter, then layer in relationship manager insights for high-value flagged customers.


Summary¶

| Recommendation | Basis |
| --- | --- |
| Target high-income, high-spending customers | Top feature importances |
| Personalize by education level | Education is a key split feature |
| Don't use age in isolation | Age/Experience redundancy |
| Build proactive outreach lists | High-recall model design |
| Retrain periodically | Class imbalance reflects evolving behavior |
| Augment with relationship data | Model's feature limitations |