Final Project: Customer Churn Data Visualization, Analysis, and Prediction - 51CTO.COM

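Today, 云朵君 walks through a final course project with you.

Highlights of this article:

A complete project workflow: data preprocessing, feature engineering, modeling, and prediction
A machine learning pipeline built with Pipeline
Hyperparameter optimization with Optuna
Complete data and complete code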

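Background

Predicting customer churn is a common industry use case for machine learning, especially in finance and subscription services.

Churn rate is the share of users who leave a provider over a given period. The term can also refer to employees leaving a company (employee turnover).

Bank customer churn (also known as customer attrition) therefore refers to customers who stop doing business with a bank or switch to another bank.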

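Data

Data dictionary:

Customer ID: unique identifier for each customer
Surname: the customer's surname
Credit Score: numeric value representing the customer's credit score
Geography: the country or region where the customer lives
Gender: the customer's gender
Age: the customer's age
Tenure: number of years the customer has been with the bank
Balance: the customer's account balance
NumOfProducts: number of bank products the customer uses (e.g. savings account, credit card)
HasCrCard: whether the customer has a credit card
IsActiveMember: whether the customer is an active member
EstimatedSalary: the customer's estimated salary
Exited: whether the customer churned (target variable)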

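Goal

This is a classic binary classification problem.

[Figure]

In a binary problem, you must predict whether an example belongs to a particular class, usually the positive class (1) or the negative class (0). In this case, churn is the positive class.

Predicting y = 0 or y = 1 for a new example is a common task, but in many cases you need to output a probability instead. This matters especially in medical applications, where positive predictions across different options must be ranked to make the best decision (for example, model #1 predicts 0.9 and model #2 predicts 0.8).

The most common metric for evaluating a binary classifier is the area under the ROC curve (ROC-AUC), computed between the predicted probabilities and the observed targets.

The ROC curve is a plot used to evaluate a binary classifier's performance and to compare multiple classifiers. Here are some examples.

[Figure: example ROC curves]

Ideally, the ROC curve of a well-performing classifier climbs in true positive rate (recall) while the false positive rate is still low. An ROC-AUC between 0.9 and 1 is therefore very good.

A bad classifier is one whose curve is close or identical to the diagonal of the plot, which represents the performance of a purely random classifier.

If the classes are balanced, you can read a higher AUC as the model assigning higher probabilities to true positives. If positives are rare, however, AUC tends to start out high, and further gains may say little about how well the rare class is predicted. Average precision is a more useful metric in that case.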
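To make the two metrics concrete, here is a minimal pure-Python sketch. The labels and scores are made-up toy values (an assumption for illustration, not taken from the dataset). ROC-AUC is computed as the fraction of (positive, negative) pairs that the scores rank correctly, and average precision as the mean of the precision measured at the rank of each positive.

```python
def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Mean of precision@k, taken at the rank k of every positive example."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / hits


# Toy, imbalanced example: 8 negatives, 2 positives (illustrative values only)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.90, 0.80, 0.40]

auc = roc_auc(y_true, y_score)
ap = average_precision(y_true, y_score)
print(f"ROC-AUC: {auc:.3f}, average precision: {ap:.3f}")
```

On this toy data a single highly scored negative barely moves the AUC, while average precision drops much more, which is the behavior described above for rare positives.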

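Loading the Data

We load the provided generated data, along with the original dataset that the deep learning model was trained on.

import pandas as pd

train = pd.read_csv("./data/train.csv")  # Data: reply 240720 in the backend of the WeChat official account 数据STUDIO to obtain it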
original = pd.read_csv("./data/Churn_Modelling.csv")
test = pd.read_csv("./data/test.csv")

train.drop(columns=["id"], inplace=True)
test.drop(columns=["id"], inplace=True)
original.drop(columns=["RowNumber"], inplace=True)

train = pd.concat([train, original.dropna()], axis=0)
train.reset_index(inplace=True, drop=True)

target_col = "Exited"

Exploratory Data Analysis

We have about 175k data points to work with.

train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 175030 entries, 0 to 175030
Data columns (total 13 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   CustomerId       175030 non-null  int32  
 1   Surname          175030 non-null  object 
 2   CreditScore      175030 non-null  int16  
 3   Geography        175030 non-null  object 
 4   Gender           175030 non-null  object 
 5   Age              175030 non-null  float16
 6   Tenure           175030 non-null  int8   
 7   Balance          175030 non-null  float32
 8   NumOfProducts    175030 non-null  int8   
 9   HasCrCard        175030 non-null  float16
 10  IsActiveMember   175030 non-null  float16
 11  EstimatedSalary  175030 non-null  float32
 12  Exited           175030 non-null  int8   
dtypes: float16(3), float32(2), int16(1), int32(1), int8(3), object(3)
memory usage: 9.2+ MB

We reduce the dataset's memory footprint so that feature engineering and modeling use less memory.

train = reduce_mem_usage(train)

Mem. usage decreased to  9.18 Mb (50.9% reduction)
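The article does not show reduce_mem_usage itself. As a rough sketch of the idea, and assuming nothing about the author's actual implementation, a helper like the hypothetical reduce_mem_usage_sketch below downcasts each numeric column to the smallest dtype that pandas considers safe. Note that pd.to_numeric never goes below float32, so unlike the output above it will not produce float16 columns.

```python
import numpy as np
import pandas as pd


def reduce_mem_usage_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: downcast numeric columns, leave other columns as-is."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # e.g. int64 -> int8 when all values fit
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # e.g. float64 -> float32 when values are preserved
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy frame (illustrative values only)
demo = pd.DataFrame({
    "Tenure": np.array([1, 5, 10], dtype="int64"),
    "Balance": np.array([0.0, 1250.5, 98000.25], dtype="float64"),
    "Geography": ["France", "Spain", "Germany"],
})
small = reduce_mem_usage_sketch(demo)

print(small.dtypes)
print(demo.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```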

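print_missing_table is a handy helper that uses prettytable to print the percentage of missing values in each dataset.

Our data appears to have no missing values.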

print_missing_table(train, test, target_col)

+-----------------+-----------+-----------------+----------------+
|     Feature     | Data Type | Train Missing % | Test Missing % |
+-----------------+-----------+-----------------+----------------+
|    CustomerId   |   int64   |       0.0       |      0.0       |
|     Surname     |   object  |       0.0       |      0.0       |
|   CreditScore   |   int64   |       0.0       |      0.0       |
|    Geography    |   object  |       0.0       |      0.0       |
|      Gender     |   object  |       0.0       |      0.0       |
|       Age       |  float64  |       0.0       |      0.0       |
|      Tenure     |   int64   |       0.0       |      0.0       |
|     Balance     |  float64  |       0.0       |      0.0       |
|  NumOfProducts  |   int64   |       0.0       |      0.0       |
|    HasCrCard    |  float64  |       0.0       |      0.0       |
|  IsActiveMember |  float64  |       0.0       |      0.0       |
| EstimatedSalary |  float64  |       0.0       |      0.0       |
|      Exited     |   int64   |       0.0       |       NA       |
+-----------------+-----------+-----------------+----------------+

Here is our data.

train.head()

[Figure: first rows of train]

For simplicity, we split the columns into categorical and continuous ones.

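import numpy as np

# Number of unique values per column
unique_counts = train.nunique()

# Threshold separating continuous from categorical variables
threshold = 12

# Only numeric columns can count as continuous variables
numeric_cols = train.select_dtypes(include=[np.number]).columns.tolist()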

continuous_vars = unique_counts[(unique_counts > threshold) & unique_counts.index.isin(numeric_cols)].index.tolist()
categorical_vars = unique_counts[(unique_counts <= threshold) | ~unique_counts.index.isin(numeric_cols)].index.tolist()

target_col = 'Exited'
id_col = ['id', 'CustomerId']

if target_col in categorical_vars:
    categorical_vars.remove(target_col)

for col in id_col:
    if col in continuous_vars:
        continuous_vars.remove(col)

print(f"Categorical Variables: {categorical_vars}")
print(f"Continuous/Numerical Variables: {continuous_vars}")

Categorical Variables: ['Surname', 'Geography', 'Gender', 'Tenure', 'NumOfProducts', 'HasCrCard', 'IsActiveMember']
Continuous/Numerical Variables: ['CreditScore', 'Age', 'Balance', 'EstimatedSalary']

Plot the target.

plot_categorical(train, column_name='Exited')

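[Figure: distribution of Exited]

There is a clear class imbalance here: only about 20% of the data belongs to the positive class, Exited = 1.

Next, we look at how the continuous variables interact with the target column.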

plot_violin_plots(train, continuous_vars, target_col)

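[Figure: violin plots of the continuous variables by Exited]

The median Age of customers who exited (1) appears higher than that of customers who stayed (0). This gap suggests that Age may be a relevant predictor of churn.

The Balance distribution shows that customers who stayed (0) are heavily concentrated around 0, whereas customers who exited (1) have a higher median balance.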

plot_histograms(train, continuous_vars, target_col)

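[Figure: histograms of the continuous variables by Exited]

We observe a large number of young customers who did not exit, while the distribution of exited customers skews older. This is especially clear at the peak around age 50, where the orange line rises above the blue line.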

plot_correlation_heatmap(train, continuous_vars, target_col)

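[Figure: correlation heatmap]

Age shows the strongest positive correlation with Exited (0.3366), which supports the finding that age is an important factor in predicting churn.

Balance is also positively correlated with Exited (0.1284), suggesting that customers with higher balances are more likely to leave.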
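The plot_correlation_heatmap helper is not shown, so the following is only a sketch under the assumption that the heatmap reports Pearson correlation. It shows how a correlation between a continuous feature and a 0/1 target is computed; the ages and labels are made-up toy values, not rows from the dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Toy values: older customers exit more often (illustrative only)
age = [25, 30, 35, 40, 45, 50, 55, 60]
exited = [0, 0, 0, 0, 1, 0, 1, 1]

r = pearson(age, exited)
print(f"Pearson correlation between age and exited: {r:.3f}")
```

A positive value means the feature tends to be larger for the positive class. It does not by itself say how well the feature separates the classes.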

plot_pairplot(train, continuous_vars, target_col)

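[Figure: pair plot of the continuous variables by Exited]

There is no clear separation between the two classes, which suggests that no single variable is enough to tell exited customers from those who stayed.

Feature Engineering

Now it is time to create some features. Whenever you are given data, you can create additional features to improve the model's predictive power. It is like squeezing every bit of insight out of the data.

To do this, we will build a pipeline, which is a sequence of objects that operate on a set of data. The operations can include:

Exploring relationships
Transforming features
Handling missing values
Creating new features
Selecting and fitting a model
Predicting on unseen data

Here is a simple example of a transformer whose only job is to drop columns.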

from sklearn.base import BaseEstimator, TransformerMixin


class DropColumn(BaseEstimator, TransformerMixin):
    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop(self.cols, axis=1)
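As a quick usage sketch, the transformer can be tried on a small frame before it goes into a pipeline. The class is repeated here only so that the snippet runs on its own, and the toy DataFrame is an assumption for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DropColumn(BaseEstimator, TransformerMixin):
    """Same transformer as above, repeated so the snippet is self-contained."""

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X.drop(self.cols, axis=1)


# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "CustomerId": [1, 2],
    "Surname": ["A", "B"],
    "Age": [40.0, 52.0],
})

dropped = DropColumn(cols=["CustomerId", "Surname"]).fit_transform(toy)
print(list(dropped.columns))
```

Because transform returns a new frame from X.drop, the input DataFrame is left untouched.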

Another transformer performs scaling, optional PCA, and k-means clustering in a single step.

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


class KMeansClusterer(BaseEstimator, TransformerMixin):
    def __init__(self, features, n_clusters=20, random_state=0, n_components=None):
        self.features = features
        self.n_clusters = n_clusters
        self.random_state = random_state
        self.n_components = n_components
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=50, random_state=random_state)
        self.scaler = StandardScaler()
        self.pca = PCA(n_components=n_components)
        
    def fit(self, X, y=None):
        X_scaled = self.scaler.fit_transform(X.loc[:, self.features])
        if self.n_components is not None:
            X_scaled = self.pca.fit_transform(X_scaled)
        self.kmeans.fit(X_scaled)
        
        return self
    
    def transform(self, X):
        X_scaled = self.scaler.transform(X.loc[:, self.features])
        
        # check for NaN and replace with zero
        if np.isnan(X_scaled).any():
            X_scaled = np.nan_to_num(X_scaled)
            
        if self.n_components is not None:
            X_scaled = self.pca.transform(X_scaled)
            
        X_new = pd.DataFrame()
        X_new["Cluster"] = self.kmeans.predict(X_scaled)
        
        X_copy = X.copy()
        
        # convert to dense format
        X_new["Cluster"] = X_new["Cluster"].values
        
        return pd.concat([X_copy.reset_index(drop=True), X_new.reset_index(drop=True)], axis=1)
    
clusterer_with_pca = KMeansClusterer(features=["CustomerId","EstimatedSalary","Balance"], n_clusters=10, random_state=123, n_components=3)
clusterer_with_pca.fit_transform(train)

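[Figure: train with the added Cluster column]

Once all the transformers needed to build new features and perform the transformations are defined, the pipeline can be assembled. Due to space limits, the code for all of the transformers is not shown here; the complete code can be obtained for free by replying 240720 in the backend of the WeChat official account 数据STUDIO.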
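Since the custom transformers are not listed in this article, here is a hedged sketch of what a stateless feature-creating transformer can look like. The class name BalanceSalaryRatioSketch and the output column Balance_Salary_Ratio are hypothetical and are not taken from the author's code.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class BalanceSalaryRatioSketch(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: adds Balance / EstimatedSalary as a new column."""

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        X["Balance_Salary_Ratio"] = X["Balance"] / X["EstimatedSalary"]
        return X


# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "Balance": [0.0, 50000.0],
    "EstimatedSalary": [100000.0, 25000.0],
})

out = BalanceSalaryRatioSketch().fit_transform(toy)
print(out)
```

Returning the whole DataFrame with the new column appended lets later pipeline steps keep using the original columns.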

For encoding, you need a column transformer. We set its output to pandas.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

preprocessing_pipeline = Pipeline([
    ('kmeans', KMeansClusterer(features=["CustomerId", "EstimatedSalary", "Balance"], n_clusters=10, random_state=123, n_components=3)),
    ('surname_tfid', TFIDFTransformer(column="Surname", max_features=1000, n_components=5)),
    ('age_binning', VariableBinning(n_bins=5, column_name="Age")),
    ('salary_binning', VariableBinning(n_bins=10, column_name="EstimatedSalary")),
    ('balance_salary_ratio', BalanceSalaryRatioTransformer()),
    ('geo_gender', GeoGenderTransformer()),
    ('total_products', TotalProductsTransformer()),  # custom transformer from the full code, not defined above
    ('tp_gender', TpGenderTransformer()),
    ('is_senior', IsSeniorTransformer()),
    ('quality_of_balance', QualityOfBalanceTransformer()),
    ('credit_score_tier', CreditScoreTierTransformer()),
    ('is_active_by_credit_card', IsActiveByCreditCardTransformer()),
    ('products_per_tenure', ProductsPerTenureTransformer()),
    ('customer_status', CustomerStatusTransformer()),
    ('drop', DropColumn(cols=['CustomerId', 'Surname'])),
    ('prep', ColumnTransformer([
        ('encode', OneHotEncoder(handle_unknown='ignore', sparse_output=False), 
         ['Gender', 'Geography', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'Geo_Gender', 'Tp_Gender']),
        ],
        remainder='passthrough').set_output(transform='pandas')),
])

preprocessing_pipeline

[Figure: pipeline diagram]

Apply this pipeline to our training dataset.

df_train = preprocessing_pipeline.fit_transform(train.drop(['Exited'], axis=1))
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175030 entries, 0 to 175029
Data columns (total 49 columns):
 #   Column                             Non-Null Count   Dtype   
---  ------                             --------------   -----   
 0   encode__Gender_Female              175030 non-null  float64 
 1   encode__Gender_Male                175030 non-null  float64 
 2   encode__Geography_France           175030 non-null  float64 
 3   encode__Geography_Germany          175030 non-null  float64 
 4   encode__Geography_Spain            175030 non-null  float64 
 5   encode__NumOfProducts_1            175030 non-null  float64 
 6   encode__NumOfProducts_2            175030 non-null  float64 
 7   encode__NumOfProducts_3            175030 non-null  float64 
 8   encode__NumOfProducts_4            175030 non-null  float64 
 9   encode__HasCrCard_0.0              175030 non-null  float64 
 10  encode__HasCrCard_1.0              175030 non-null  float64 
 11  encode__IsActiveMember_0.0         175030 non-null  float64 
 12  encode__IsActiveMember_1.0         175030 non-null  float64 
 13  encode__Geo_Gender_France_Female   175030 non-null  float64 
 14  encode__Geo_Gender_France_Male     175030 non-null  float64 
 15  encode__Geo_Gender_Germany_Female  175030 non-null  float64 
 16  encode__Geo_Gender_Germany_Male    175030 non-null  float64 
 17  encode__Geo_Gender_Spain_Female    175030 non-null  float64 
 18  encode__Geo_Gender_Spain_Male      175030 non-null  float64 
 19  encode__Tp_Gender_1.0Female        175030 non-null  float64 
 20  encode__Tp_Gender_1.0Male          175030 non-null  float64 
 21  encode__Tp_Gender_2.0Female        175030 non-null  float64 
 22  encode__Tp_Gender_2.0Male          175030 non-null  float64 
 23  encode__Tp_Gender_3.0Female        175030 non-null  float64 
 24  encode__Tp_Gender_3.0Male          175030 non-null  float64 
 25  encode__Tp_Gender_4.0Female        175030 non-null  float64 
 26  encode__Tp_Gender_4.0Male          175030 non-null  float64 
 27  encode__Tp_Gender_5.0Female        175030 non-null  float64 
 28  encode__Tp_Gender_5.0Male          175030 non-null  float64 
 29  remainder__CreditScore             175030 non-null  int16   
 30  remainder__Age                     175030 non-null  float32 
 31  remainder__Tenure                  175030 non-null  int8    
 32  remainder__Balance                 175030 non-null  float32 
 33  remainder__EstimatedSalary         175030 non-null  float32 
 34  remainder__Cluster                 175030 non-null  int32   
 35  remainder__Surname_tfidf_0         175030 non-null  float64 
 36  remainder__Surname_tfidf_1         175030 non-null  float64 
 37  remainder__Surname_tfidf_2         175030 non-null  float64 
 38  remainder__Surname_tfidf_3         175030 non-null  float64 
 39  remainder__Surname_tfidf_4         175030 non-null  float64 
 40  remainder__QCut5_Age               175030 non-null  int64   
 41  remainder__QCut10_EstimatedSalary  175030 non-null  int64   
 42  remainder__Total_Products_Used     175030 non-null  float16 
 43  remainder__IsSenior                175030 non-null  int64   
 44  remainder__QualityOfBalance        175030 non-null  category
 45  remainder__CreditScoreTier         175030 non-null  category
 46  remainder__IsActive_by_CreditCard  175030 non-null  float16 
 47  remainder__Products_Per_Tenure     175030 non-null  float64 
 48  remainder__Customer_Status         175030 non-null  int64   
dtypes: category(2), float16(2), float32(3), float64(35), int16(1), int32(1), int64(4), int8(1)
memory usage: 56.3 MB

We went from 12 input features to 49!

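GBT Classifiers

We will train the following boosting models: XGBoost, CatBoost, and LightGBM.

We use Optuna to find the best hyperparameters for the CatBoost classifier. I set n_trials=10 so that it finishes faster; if you have plenty of time, you can set it higher (the larger the value, the longer it takes).

The complete code for building the remaining models can be obtained for free by replying 240720 in the backend of the WeChat official account 数据STUDIO.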

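import warnings

import optuna
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# modelling_pipeline, X, and y are not defined in this excerpt; they come from the full code

# Filter warnings (FutureWarning)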
warnings.filterwarnings("ignore", 
                        category=FutureWarning, 
                        module="sklearn.utils.validation")

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

def objective(trial):
    params = {
        'iterations': trial.suggest_int('iterations', 500, 1000),
        'depth': trial.suggest_int('depth', 10, 16),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 2, 20),
        'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.2, log=True),
    }
    
    cb_model = CatBoostClassifier(**params, random_state=42, grow_policy='Lossguide', verbose=0)
    cb_pipeline = make_pipeline(modelling_pipeline, cb_model)
    

    cv = abs(cross_val_score(cb_pipeline, X, y, cv=skf, scoring='roc_auc').mean())
    return cv

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)

best_params_cb = study.best_params
print("Best Hyperparameters for CatBoost:", best_params_cb)

Once it finishes, you can pass the best parameters to the classifier.

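# Same grow policy as in the tuning objective, since min_data_in_leaf was tuned under it
cb_model = CatBoostClassifier(**best_params_cb, random_state=42, grow_policy='Lossguide', verbose=0)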
cb_pipeline_optimized = make_pipeline(modelling_pipeline, cb_model)

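We run K-fold cross-validation once more to check the AUC score.

from sklearn.metrics import f1_score, log_loss, roc_auc_score
from sklearn.model_selection import StratifiedKFold

n_splits = 10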
stratkf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

cv_results = []

for fold, (train_idx, val_idx) in enumerate(stratkf.split(X, y)):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    
    cb_pipeline_optimized.fit(X_train, y_train)

    y_val_pred_prob = cb_pipeline_optimized.predict_proba(X_val)[:, 1]
    y_pred = cb_pipeline_optimized.predict(X_val)
    f1 = f1_score(y_val, y_pred, average='weighted')

    # Evaluating the model
    logloss = log_loss(y_val, y_val_pred_prob)
    roc_auc = roc_auc_score(y_val, y_val_pred_prob)
    print(f'Fold {fold + 1}, AUC-Score on Validation Set: {roc_auc}')
    print(f'Fold {fold + 1}, F1 Score on Validation Set: {f1}')
    print(f'Fold {fold + 1}, Log Loss Score on Validation Set: {logloss}')
    print('-'*70)

    cv_results.append(logloss)

average_cv_result = sum(cv_results) / n_splits
print(f'\nAverage Logarithmic Loss across {n_splits} folds: {average_cv_result}')
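The loop above averages the log loss across folds. For reference, here is a minimal pure-Python sketch of binary log loss, which is the mean negative log-likelihood of the true labels under the predicted probabilities. The labels and probabilities are made-up toy values, and the clipping constant eps is an assumption added to avoid log(0).

```python
from math import log


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] with probabilities clipped to (eps, 1-eps)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)


# Toy values (illustrative only): both models rank every example correctly
y_val_toy = [0, 0, 1, 1]
confident = [0.1, 0.2, 0.8, 0.9]
hesitant = [0.4, 0.4, 0.6, 0.6]

loss_confident = binary_log_loss(y_val_toy, confident)
loss_hesitant = binary_log_loss(y_val_toy, hesitant)
print(f"confident: {loss_confident:.4f}, hesitant: {loss_hesitant:.4f}")
```

Both toy models would reach the same ROC-AUC because they order the examples identically, yet log loss rewards the one whose probabilities are closer to the true labels. That is why it is worth tracking alongside AUC.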

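We can check the model's performance with a confusion matrix.

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)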
cb_pipeline_optimized.fit(X = X_train,
                y = y_train)
predictions_cb = cb_pipeline_optimized.predict(X_val)
cm_cb = confusion_matrix(y_val, predictions_cb)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_cb, display_labels=['Not Churn', 'Churn'])
disp.plot()
plt.show()

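[Figure: confusion matrix of the CatBoost pipeline]

Our model has a high true negative rate, meaning it is better at identifying customers who will not churn than those who will. A sizable share of the customers the model predicts will stay actually churn, which is an area for improvement. Reducing false negatives would help the company intervene more effectively to retain customers.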
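To put numbers on statements like these, the usual rates can be read straight off a 2x2 confusion matrix. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predictions, so the matrix is [[TN, FP], [FN, TP]]). The counts are made up for illustration and are not the ones from the figure.

```python
def rates_from_confusion(cm):
    """Derive common rates from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    return {
        "recall": tp / (tp + fn),               # share of churners that were caught
        "precision": tp / (tp + fp),            # share of predicted churners that churned
        "specificity": tn / (tn + fp),          # true negative rate
        "false_negative_rate": fn / (tp + fn),  # share of churners that were missed
    }


# Made-up counts (illustrative only)
cm_toy = [[900, 50],
          [120, 130]]

rates = rates_from_confusion(cm_toy)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

With these toy counts the true negative rate is high while almost half of the churners are missed, which is the same pattern described above.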

We can also look at the CatBoost classifier's feature importances.

cb_feature_importance = cb_pipeline_optimized.named_steps['catboostclassifier'].feature_importances_
sorted_idx = np.argsort(cb_feature_importance)
fig = plt.figure(figsize=(18, 16))
plt.barh(range(len(sorted_idx)), cb_feature_importance[sorted_idx], align='center')
plt.yticks(range(len(sorted_idx)), np.array(train_X.columns)[sorted_idx])
plt.title('CB_Feature Importance')
plt.show()

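[Figure: CatBoost feature importance]

The chart shows that the most important features for the model's predictions include age, credit score, estimated salary, and cluster. The TF-IDF features of the surname also appear to carry some importance, although this may lead to overfitting to those names.

Ensemble Learning

We now have different models with varying performance. With ensemble learning, these models can be combined to reach higher performance!

Here we use a voting classifier with "soft" voting, which predicts the class label as the argmax of the sum of the predicted probabilities.

The weights are numbers that tell the classifier how much importance to give each model's class probabilities before averaging. They can also be tuned with GridSearch or Optuna.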
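As a minimal sketch of what soft voting does for a single sample, the snippet below averages per-model class probabilities with weights and takes the argmax. The probabilities are made-up toy values, while the weights [0.4, 0.4, 0.2] mirror the ones used in this article for XGBoost, LightGBM, and CatBoost.

```python
def soft_vote(probas, weights):
    """Weighted average of per-model class probabilities, then argmax.

    probas: one [p_class0, p_class1] list per model, for a single sample.
    """
    total = sum(weights)
    n_classes = len(probas[0])
    avg = [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: avg[c])
    return avg, label


# Toy probabilities for one customer from three models (illustrative only)
probas = [
    [0.45, 0.55],  # xgb
    [0.70, 0.30],  # lgb
    [0.30, 0.70],  # cb
]

avg_w, label_w = soft_vote(probas, [0.4, 0.4, 0.2])
avg_eq, label_eq = soft_vote(probas, [1, 1, 1])
print("weighted:", avg_w, "->", label_w)
print("equal:   ", avg_eq, "->", label_eq)
```

In this toy case the weighted vote and the equal-weight vote disagree, which shows why the weights are worth tuning rather than guessing.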

from sklearn.ensemble import VotingClassifier

ensemble_model = VotingClassifier(estimators=[
    ('xgb', xgb_pipeline_optimized),
    ('lgb', lgb_pipeline_optimized),
    ('cb', cb_pipeline_optimized)
], voting='soft', weights = [0.4,0.4,0.2])

ensemble_model

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

ensemble_model.fit(X = X_train, y = y_train)

predictions_ensemble = ensemble_model.predict(X_val)

cm_ensemble = confusion_matrix(y_val, predictions_ensemble)

disp = ConfusionMatrixDisplay(confusion_matrix=cm_ensemble, display_labels=['Not Churn', 'Churn'])
disp.plot()
plt.show()

Note that our false negatives decreased slightly, while true positives increased.