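As data scientists, we often use Jupyter Notebooks for data exploration and model development. At this stage the focus is on validating ideas quickly and proving a concept. Once the model is ready, however, it has to be deployed to production, and that is when code quality becomes especially important.
Production code must be robust, readable, and easy to maintain. Unfortunately, the prototype code data scientists write usually falls short of these requirements. As a machine learning engineer, my job is to make sure code moves smoothly from proof of concept to production.
Writing clean code is therefore essential for developing efficiently and keeping maintenance costs down. In this article I share some Python techniques and best practices, and use short code examples to show how to make code more readable and maintainable.
I hope this article offers useful insights to Python enthusiasts, and in particular that it encourages more data scientists to care about code quality: high-quality code not only helps during development, it also helps models make it into production.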
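Meaningful names
Many developers do not follow the best practice of giving variables and functions meaningful names, and the readability and maintainability of their code suffer as a result.
Naming is central to code quality. A good name states what the code does, removes the need for many comments and explanations, and keeps the code tidy. A descriptive name makes a function's purpose clear at a glance.
Take a common machine learning task: loading a dataset and splitting it into a training set and a test set. With meaningful function names such as load_dataset() and split_into_train_test(), the purpose of each function is obvious without reading any comments.
Readable code is faster for other developers to understand, and it also saves you effort when you maintain it later. It is worth building good naming habits and writing code that is simple and direct.
Here is a typical version of that load-and-split example: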
import pandas as pd
from sklearn.model_selection import train_test_split


def load_and_split(d):
    df = pd.read_csv(d)
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size=0.2,
                                                        random_state=42)
    return X_train, X_test, y_train, y_test
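Most people who work in data science know the field's concepts and conventions, such as X and y. But for someone new to the field, is d a good name for the path to a CSV file? And are X for the features and y for the target good names? A version with more meaningful names makes the point: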
import pandas as pd
from sklearn.model_selection import train_test_split


def load_data_and_split_into_train_test(dataset_path):
    data_frame = pd.read_csv(dataset_path)
    features = data_frame.iloc[:, :-1]
    target = data_frame.iloc[:, -1]
    features_train, features_test, target_train, target_test = train_test_split(
        features, target, test_size=0.2, random_state=42)
    return features_train, features_test, target_train, target_test
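This is much easier to follow. Even without any experience of pandas or train_test_split, you can see that the function loads data from a CSV file (at the path given in dataset_path), extracts the features and the target from the data frame, and then produces the features and targets for the training and test sets.
These changes make the code easier to read and understand, especially for people who are not familiar with the conventions of machine learning code, where features are usually called X and the target y.
Do not overdo it, though: names that are longer than they need to be add no information.
Consider another snippet: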
import pandas as pd
from sklearn.model_selection import train_test_split


def load_data_from_csv_and_split_into_training_and_testing_sets(dataset_path_csv):
    data_frame_from_csv = pd.read_csv(dataset_path_csv)
    features_columns_data_frame = data_frame_from_csv.iloc[:, :-1]
    target_column_data_frame = data_frame_from_csv.iloc[:, -1]
    features_columns_data_frame_for_training, features_columns_data_frame_for_testing, target_column_data_frame_for_training, target_column_data_frame_for_testing = train_test_split(features_columns_data_frame, target_column_data_frame, test_size=0.2, random_state=42)
    return features_columns_data_frame_for_training, features_columns_data_frame_for_testing, target_column_data_frame_for_training, target_column_data_frame_for_testing
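This version feels overloaded: the longer names add no information and only distract the reader. Aim for names that balance descriptiveness and brevity. Whether the name needs to say that the dataset path points to a CSV file depends on the context and the actual needs of the code.
Functions
Functions should be sized and scoped appropriately. Keep them short, no more than about 20 lines, and move larger blocks into new functions. More importantly, a function should do one thing only, not several. If another task needs doing, it belongs in another function. For example: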
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_clean_feature_engineer_and_split(data_path):
    # Load data
    df = pd.read_csv(data_path)

    # Clean data
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]

    # Feature engineering
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18

    # Data preprocessing
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

    # Split data
    features = df.drop('Survived', axis=1)
    target = df['Survived']
    features_train, features_test, target_train, target_test = train_test_split(
        features, target, test_size=0.2, random_state=42)
    return features_train, features_test, target_train, target_test
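Did you spot where this breaks the rules above?
The function is not long, but it clearly violates the rule that a function should do only one thing. The comments are also a hint that each block could live in its own function, at which point the one-line comments would not be needed at all (more on this in the next section).
A refactored version: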
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_data(data_path):
    return pd.read_csv(data_path)


def clean_data(df):
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df


def feature_engineering(df):
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df


def preprocess_features(df):
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df


def split_data(df, target_name='Survived'):
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)


if __name__ == "__main__":
    data_path = 'data.csv'
    df = load_data(data_path)
    df = clean_data(df)
    df = feature_engineering(df)
    df = preprocess_features(df)
    X_train, X_test, y_train, y_test = split_data(df)
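In the refactored snippet each function does one thing, which makes the code easier to read. Testing becomes easier too, because each function can be tested independently of the others.
Even the comments are no longer needed, because the function names now do the job of the comments.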
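As a rough sketch of what testing a single function in isolation can look like, the snippet below exercises a variant of clean_data with a tiny hand-built DataFrame. The test name and the sample values are made up for illustration, and this variant returns a new DataFrame instead of modifying its input, which is my assumption rather than the version shown above.

```python
import pandas as pd


def clean_data(df):
    # Variant of the article's function: dropna() without inplace=True
    # returns a new DataFrame, so the caller's data stays untouched.
    df = df.dropna()
    return df[df['Age'] > 0]


def test_clean_data_drops_missing_and_non_positive_ages():
    # Hypothetical sample: one valid row, one missing age, one negative age.
    raw = pd.DataFrame({'Age': [30.0, None, -5.0],
                        'Fare': [7.25, 8.05, 9.5]})
    cleaned = clean_data(raw)
    assert list(cleaned['Age']) == [30.0]
    assert len(raw) == 3  # the input DataFrame was not modified


test_clean_data_drops_missing_and_non_positive_ages()
```

No file on disk, no scaler, and no train/test split are involved, which is only possible because cleaning lives in its own function.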
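Comments
Comments are sometimes useful, but sometimes they are just a sign of bad code.
The proper use of comments is to compensate for our failure to express ourselves in code.
When you feel the need to add a comment, ask whether it is really necessary, or whether the code could move into a new function whose name makes clear what is happening, so that the comment is not needed.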
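One case where a comment can still earn its place is when it explains why the code does something, which a name alone cannot express. A minimal sketch, assuming a hypothetical dataset in which 0 and negative ages are placeholders for unknown values (the function name is also made up):

```python
def remove_invalid_ages(ages):
    """Return only the ages that are strictly positive."""
    # Why, not what: in this hypothetical dataset, 0 and negative values
    # are placeholders for "unknown", so they are not real ages.
    return [age for age in ages if age > 0]


print(remove_invalid_ages([34, 0, -1, 7]))  # prints [34, 7]
```

The function name already says what happens; the comment only records the reason behind it.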
Let's revisit the code example from the Functions section:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_clean_feature_engineer_and_split(data_path):
    # Load data
    df = pd.read_csv(data_path)

    # Clean data
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]

    # Feature engineering
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18

    # Data preprocessing
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

    # Split data
    features = df.drop('Survived', axis=1)
    target = df['Survived']
    features_train, features_test, target_train, target_test = train_test_split(
        features, target, test_size=0.2, random_state=42)
    return features_train, features_test, target_train, target_test
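The comments describe what each block does, but here they are really just an indicator of bad code. Following the advice from the previous section, move each block into its own function with a descriptive name. That improves readability and reduces the need for comments.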
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_data(data_path):
    return pd.read_csv(data_path)


def clean_data(df):
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df


def feature_engineering(df):
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df


def preprocess_features(df):
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df


def split_data(df, target_name='Survived'):
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)


if __name__ == "__main__":
    data_path = 'data.csv'
    df = load_data(data_path)
    df = clean_data(df)
    df = feature_engineering(df)
    df = preprocess_features(df)
    X_train, X_test, y_train, y_test = split_data(df)
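The code now reads like a coherent story, and it is clear what happens without any comments. One piece is still missing: docstrings. Docstrings are a Python standard intended to make code readable and understandable. Every function in production code should have a docstring describing its intent, its input parameters, and its return value. Tools such as Sphinx can use these docstrings directly to generate documentation for the code.
Adding docstrings to the snippet above: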
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_data(data_path):
    """
    Load data from a CSV file into a pandas DataFrame.

    Args:
        data_path (str): The file path to the dataset.

    Returns:
        DataFrame: The loaded dataset.
    """
    return pd.read_csv(data_path)


def clean_data(df):
    """
    Clean the DataFrame by removing rows with missing values and
    filtering out non-positive ages.

    Args:
        df (DataFrame): The input dataset.

    Returns:
        DataFrame: The cleaned dataset.
    """
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df


def feature_engineering(df):
    """
    Perform feature engineering on the DataFrame, including age
    grouping and adult identification.

    Args:
        df (DataFrame): The input dataset.

    Returns:
        DataFrame: The dataset with new features added.
    """
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df


def preprocess_features(df):
    """
    Preprocess features by standardizing the 'Age' and 'Fare'
    columns using StandardScaler.

    Args:
        df (DataFrame): The input dataset.

    Returns:
        DataFrame: The dataset with standardized features.
    """
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df


def split_data(df, target_name='Survived'):
    """
    Split the dataset into training and testing sets.

    Args:
        df (DataFrame): The input dataset.
        target_name (str): The name of the target variable column.

    Returns:
        tuple: Contains the training features, testing features,
            training target, and testing target datasets.
    """
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)


if __name__ == "__main__":
    data_path = 'data.csv'
    df = load_data(data_path)
    df = clean_data(df)
    df = feature_engineering(df)
    df = preprocess_features(df)
    X_train, X_test, y_train, y_test = split_data(df)
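IDEs such as VSCode usually offer docstring extensions that generate a docstring template automatically when you start a multi-line string below a function definition.
This helps you get the format you have chosen right quickly.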
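Docstrings are not only for documentation generators: they are also available at runtime, which is what help() and many IDE tooltips rely on. A small sketch using the standard library's inspect module; the function last_column_name is a made-up example, not part of the pipeline above.

```python
import inspect


def last_column_name(column_names):
    """
    Return the name of the last column, used here as the target.

    Args:
        column_names (list[str]): Column names in dataset order.

    Returns:
        str: The name of the last column.
    """
    return column_names[-1]


# inspect.getdoc removes the common indentation and surrounding blank
# lines, so the first line is the one-sentence summary.
summary_line = inspect.getdoc(last_column_name).splitlines()[0]
print(summary_line)
```

Calling help(last_column_name) in an interactive session shows the same text in full.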
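Formatting
Formatting is a key concept.
Code is read far more often than it is written, so do not make people read code that is inconsistent and hard to understand.
Python has the PEP 8 style guide[1], which can be used to improve the readability of code.
Important rules in the style guide include:
- Use four spaces per indentation level
- Limit lines to 79 characters
- Avoid extraneous whitespace in certain situations (for example, immediately inside brackets, or between a trailing comma and a closing bracket)
Keep in mind, though, that formatting rules exist to improve readability. Sometimes applying a rule strictly makes no sense and makes the code harder to read. In those cases the rule can be ignored.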
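To make the line-length rule concrete, here is a small helper that reports lines longer than the limit. It is only an illustration of the rule (the helper name and the sample text are made up); in practice a linter does this check for you.

```python
MAX_LINE_LENGTH = 79  # the limit PEP 8 suggests for lines of code


def find_long_lines(source, max_length=MAX_LINE_LENGTH):
    """Return (line_number, length) pairs for lines exceeding max_length."""
    return [
        (number, len(line))
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > max_length
    ]


# Hypothetical two-line source: the second line is deliberately too long.
sample = "short = 1\n" + "long_name = " + "'x' + " * 15 + "'x'\n"
print(find_long_lines(sample))
```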
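Other important formatting rules from the book Clean Code include:
- Keep files to a reasonable size (roughly 200 to 500 lines) so they are easier to understand
- Use blank lines to separate different concepts (for example, between the block that initializes an ML model and the block that runs training)
- Define caller functions above callee functions to create a natural reading flow
So agree with your team on which rules to follow, and stick to them. IDE extensions can help you comply with the guidelines; VSCode, for example, offers several. You can also use Python packages such as Pylint[2] and autopep8[3] to check and format your Python scripts. Pylint is a static code analyzer that also gives your code a score, while autopep8 formats code automatically to conform to PEP 8.
Let's use the earlier snippet to see how this works.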
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_data(data_path):
    return pd.read_csv(data_path)

def clean_data(df):
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df

def feature_engineering(df):
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df

def preprocess_features(df):
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df

def split_data(df, target_name='Survived'):
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)

if __name__ == "__main__":
  data_path = 'data.csv'
  df = load_data(data_path)
  df = clean_data(df)
  df = feature_engineering(df)
  df = preprocess_features(df)
  X_train, X_test, y_train, y_test = split_data(df)
Save it to a file named train.py and run Pylint to check the score of the snippet:
pylint train.py
Output:
************* Module train
train.py:29:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:30:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:31:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:32:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:33:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:34:0: C0304: Final newline missing (missing-final-newline)
train.py:34:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:1:0: C0114: Missing module docstring (missing-module-docstring)
train.py:5:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:5:14: W0621: Redefining name 'data_path' from outer scope (line 29) (redefined-outer-name)
train.py:8:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:8:15: W0621: Redefining name 'df' from outer scope (line 30) (redefined-outer-name)
train.py:13:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:13:24: W0621: Redefining name 'df' from outer scope (line 30) (redefined-outer-name)
train.py:18:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:18:24: W0621: Redefining name 'df' from outer scope (line 30) (redefined-outer-name)
train.py:23:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:23:15: W0621: Redefining name 'df' from outer scope (line 30) (redefined-outer-name)
train.py:29:2: C0103: Constant name "data_path" doesn't conform to UPPER_CASE naming style (invalid-name)
------------------------------------------------------------------
Your code has been rated at 3.21/10
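That is only 3.21 out of 10.
You can either fix these issues by hand and run Pylint again, or use the autopep8 package to resolve some of them automatically. Here we take the second approach.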
autopep8 --in-place --aggressive --aggressive train.py
The train.py script now looks like this:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_data(data_path):
    return pd.read_csv(data_path)


def clean_data(df):
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df


def feature_engineering(df):
    df['AgeGroup'] = pd.cut(
        df['Age'], bins=[
            0, 18, 65, 99], labels=[
            'child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df


def preprocess_features(df):
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df


def split_data(df, target_name='Survived'):
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)


if __name__ == "__main__":
    data_path = 'data.csv'
    df = load_data(data_path)
    df = clean_data(df)
    df = feature_engineering(df)
    df = preprocess_features(df)
    X_train, X_test, y_train, y_test = split_data(df)
Running Pylint again gives 5.71 out of 10. Most of the remaining messages are about missing docstrings:
************* Module train
train.py:1:0: C0114: Missing module docstring (missing-module-docstring)
train.py:6:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:6:14: W0621: Redefining name 'data_path' from outer scope (line 38) (redefined-outer-name)
train.py:10:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:10:15: W0621: Redefining name 'df' from outer scope (line 39) (redefined-outer-name)
train.py:16:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:16:24: W0621: Redefining name 'df' from outer scope (line 39) (redefined-outer-name)
train.py:25:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:25:24: W0621: Redefining name 'df' from outer scope (line 39) (redefined-outer-name)
train.py:31:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:31:15: W0621: Redefining name 'df' from outer scope (line 39) (redefined-outer-name)
train.py:38:4: C0103: Constant name "data_path" doesn't conform to UPPER_CASE naming style (invalid-name)
------------------------------------------------------------------
Your code has been rated at 5.71/10 (previous run: 3.21/10, +2.50)
I then added the docstrings and fixed the last remaining issues.
The final code looks like this:
"""
This script aims at providing an end-to-end training pipeline.

Author: Patrick
Date: 2/14/2024
"""
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_data(data_path):
    """
    Load dataset from a specified CSV file into a pandas DataFrame.

    Args:
        data_path (str): The file path to the dataset.

    Returns:
        DataFrame: The loaded dataset.
    """
    return pd.read_csv(data_path)


def clean_data(df):
    """
    Clean the input DataFrame by removing rows with
    missing values and filtering out entries with non-positive ages.

    Args:
        df (DataFrame): The input dataset.

    Returns:
        DataFrame: The cleaned dataset.
    """
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df


def feature_engineering(df):
    """
    Perform feature engineering on the DataFrame,
    including creating age groups and determining if the individual is an adult.

    Args:
        df (DataFrame): The input dataset.

    Returns:
        DataFrame: The dataset with new features added.
    """
    df['AgeGroup'] = pd.cut(
        df['Age'], bins=[
            0, 18, 65, 99], labels=[
            'child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df


def preprocess_features(df):
    """
    Preprocess the 'Age' and 'Fare' features of the
    DataFrame using StandardScaler to standardize the features.

    Args:
        df (DataFrame): The input dataset.

    Returns:
        DataFrame: The dataset with standardized features.
    """
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df


def split_data(df, target_name='Survived'):
    """
    Split the DataFrame into training and testing sets.

    Args:
        df (DataFrame): The dataset to split.
        target_name (str, optional): The name of the target variable column.
            Defaults to 'Survived'.

    Returns:
        tuple: The training and testing features and target datasets.
    """
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)


if __name__ == "__main__":
    data = load_data("data.csv")
    data = clean_data(data)
    data = feature_engineering(data)
    data = preprocess_features(data)
    X_train, X_test, y_train, y_test = split_data(data)
Pylint now returns a score of 10:
pylint train.py
-------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 7.50/10, +2.50)
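This shows how useful Pylint is: it helps you tidy up your code and bring it in line with PEP 8 quickly.
Error handling
Error handling is another key concept. It makes sure your code does not crash or produce wrong results when something unexpected happens.
Suppose you have deployed a model behind an API and users can send data to it. Users may send bad data, and if your application crashes, it leaves a bad impression and they may blame your application for being poorly built.
It is much better if users get a specific error code and a message that tells them clearly what they did wrong. That is what exceptions in Python are for.
For example, a user uploads a CSV file to your application, the application loads it into a pandas DataFrame and passes the data to the model for prediction. You might have a function like this: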
import pandas as pd


def load_data(data_path):
    """
    Load dataset from a specified CSV file into a pandas DataFrame.

    Args:
        data_path (str): The file path to the dataset.

    Returns:
        DataFrame: The loaded dataset.
    """
    return pd.read_csv(data_path)
So far, so good. But what happens if the user does not provide a CSV file?
Your program crashes with the following error message:
FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'
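Because you are running an API, it simply responds to the user with an HTTP 500 code, telling them there was an "Internal Server Error". Users may blame your application, because they cannot tell whether they caused the error themselves. A better approach is to add a try-except block and catch FileNotFoundError so the case is handled properly.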
import logging

import pandas as pd


def load_data(data_path):
    """
    Load dataset from a specified CSV file into a pandas DataFrame.

    Args:
        data_path (str): The file path to the dataset.

    Returns:
        DataFrame: The loaded dataset.
    """
    try:
        return pd.read_csv(data_path)
    except FileNotFoundError:
        logging.error("The file at path %s does not exist. Please ensure that you have uploaded the file properly.", data_path)
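So far we only log the error message. A better practice is to define a custom exception and handle it in the API, so that a specific error code can be returned to the user.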
import logging

import pandas as pd


class DataLoadError(Exception):
    """Exception raised when the data cannot be loaded."""

    def __init__(self, message="Data could not be loaded"):
        self.message = message
        super().__init__(self.message)


def load_data(data_path):
    """
    Load dataset from a specified CSV file into a pandas DataFrame.

    Args:
        data_path (str): The file path to the dataset.

    Returns:
        DataFrame: The loaded dataset.
    """
    try:
        return pd.read_csv(data_path)
    except FileNotFoundError as error:
        logging.error("The file at path %s does not exist. Please ensure that you have uploaded the file properly.", data_path)
        # Chain the original exception so the traceback keeps the root cause.
        raise DataLoadError(f"The file at path {data_path} does not exist. Please ensure that you have uploaded the file properly.") from error
Then, in the main function of the API:
try:
    df = load_data('path/to/data.csv')
    # Further processing and model prediction
except DataLoadError as e:
    # Return a response to the user with the error message
    # For example: return Response({"error": str(e)}, status=400)
    pass  # placeholder so the except block is valid Python
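The user receives a 400 error code (Bad Request) together with a message explaining what went wrong.
Now they know what to do, and they no longer blame the application for not working properly.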
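To show the whole flow end to end without tying it to a particular web framework, here is a minimal sketch. The handle_request function and its (body, status_code) return value are my own stand-ins for a real API endpoint, and load_data reads plain text with open() instead of pandas so the snippet has no dependencies.

```python
import logging


class DataLoadError(Exception):
    """Raised when the data cannot be loaded."""


def load_data(data_path):
    """Read a file's text, raising DataLoadError if the file is missing."""
    try:
        with open(data_path, encoding="utf-8") as file:
            return file.read()
    except FileNotFoundError as error:
        logging.error("The file at path %s does not exist.", data_path)
        raise DataLoadError(
            f"The file at path {data_path} does not exist."
        ) from error


def handle_request(data_path):
    """Hypothetical endpoint: returns a (body, status_code) pair."""
    try:
        content = load_data(data_path)
    except DataLoadError as error:
        # Client-side problem: answer with 400 and a helpful message.
        return {"error": str(error)}, 400
    return {"characters": len(content)}, 200


body, status = handle_request("missing_file.csv")
print(status, body)
```

The low-level function decides what went wrong, and the endpoint decides which status code that maps to, so each layer keeps a single responsibility.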
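Object-oriented programming
Object-oriented programming (OOP) is an important programming paradigm in Python that even beginners should be familiar with. So what is OOP?
Object-oriented programming is a way of programming that bundles data and behavior into single objects, giving a program a clear structure.
The main benefits of OOP are:
- Encapsulation hides internal details and makes code more modular.
- Inheritance allows code reuse and speeds up development.
- Complex problems are broken down into small objects that can be tackled one at a time.
- Code becomes more readable and maintainable.
OOP has other advantages as well, but these are the most important ones.
Now a simple example: a class named TrainingPipeline with a few basic methods: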
from abc import ABC, abstractmethod


class TrainingPipeline(ABC):
    def __init__(self, data_path, target_name):
        """
        Initialize the TrainingPipeline.

        Args:
            data_path (str): The file path to the dataset.
            target_name (str): Name of the target column.
        """
        self.data_path = data_path
        self.target_name = target_name
        self.data = None
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None

    @abstractmethod
    def load_data(self):
        """Load dataset from data path."""
        pass

    @abstractmethod
    def clean_data(self):
        """Clean the data."""
        pass

    @abstractmethod
    def feature_engineering(self):
        """Perform feature engineering."""
        pass

    @abstractmethod
    def preprocess_features(self):
        """Preprocess features."""
        pass

    @abstractmethod
    def split_data(self):
        """Split data into training and testing sets."""
        pass

    def run(self):
        """Run the training pipeline."""
        self.load_data()
        self.clean_data()
        self.feature_engineering()
        self.preprocess_features()
        self.split_data()
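This is an abstract base class. It defines the abstract methods that every class derived from it must implement.
That is very useful for defining a blueprint or template that all subclasses have to follow.
Here is an example subclass: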
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


class ChurnPredictionTrainPipeline(TrainingPipeline):
    def load_data(self):
        """Load dataset from data path."""
        self.data = pd.read_csv(self.data_path)

    def clean_data(self):
        """Clean the data."""
        self.data.dropna(inplace=True)

    def feature_engineering(self):
        """Perform feature engineering."""
        categorical_cols = self.data.select_dtypes(include=['object', 'category']).columns
        self.data = pd.get_dummies(self.data, columns=categorical_cols, drop_first=True)

    def preprocess_features(self):
        """Preprocess features."""
        numerical_cols = self.data.select_dtypes(include=['int64', 'float64']).columns
        scaler = StandardScaler()
        self.data[numerical_cols] = scaler.fit_transform(self.data[numerical_cols])

    def split_data(self):
        """Split data into training and testing sets."""
        features = self.data.drop(self.target_name, axis=1)
        target = self.data[self.target_name]
        # Store the splits in the attributes declared by the base class.
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            features, target, test_size=0.2, random_state=42)
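The benefit is that you can build an application that calls the training pipeline's methods automatically, and you can create different training pipeline classes. They always remain compatible, because they have to follow the blueprint defined in the abstract base class.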
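Here is a stripped-down sketch of that idea. The base class below is a trimmed stand-in with only two abstract steps, and DummyPipeline and run_pipeline are hypothetical names introduced for illustration. It shows that the abstract class itself cannot be instantiated, and that a caller written against the base class works with any subclass.

```python
from abc import ABC, abstractmethod


class TrainingPipeline(ABC):
    """Trimmed-down stand-in for the base class shown earlier."""

    def __init__(self):
        self.steps_run = []

    @abstractmethod
    def load_data(self):
        """Load the dataset."""

    @abstractmethod
    def split_data(self):
        """Split the dataset."""

    def run(self):
        """Run every step in a fixed order."""
        self.load_data()
        self.split_data()


class DummyPipeline(TrainingPipeline):
    """Hypothetical subclass that only records which steps ran."""

    def load_data(self):
        self.steps_run.append("load_data")

    def split_data(self):
        self.steps_run.append("split_data")


def run_pipeline(pipeline: TrainingPipeline):
    """Works for any subclass, because the base class fixes the interface."""
    pipeline.run()
    return pipeline.steps_run


try:
    TrainingPipeline()  # abstract classes cannot be instantiated
except TypeError as error:
    print(f"TypeError: {error}")

print(run_pipeline(DummyPipeline()))  # prints ['load_data', 'split_data']
```

A subclass that forgets to implement one of the abstract methods fails in the same way at instantiation time, long before any training starts.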
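Testing
Testing can make or break a project.
Writing tests does add some development time, but in the long run it greatly improves code quality, maintainability, and reliability.
Skipping tests may speed up development in the short term, but in the long run the lack of tests carries a heavy cost:
- Once the codebase grows, any small change can break something unexpectedly
- New releases need lots of fixes, which gives customers a poor experience
- Developers become afraid to change the codebase, and new features are held up
Following the principles of test-driven development (TDD) is therefore important for code quality and development efficiency. The three core rules of TDD are:
- Do not write any production code until you have written a failing unit test
- Do not write more of a unit test than is sufficient to fail
- Do not write more production code than is sufficient to pass the currently failing test
This test-first approach pushes developers to think about the design of their code first.
Python has excellent testing frameworks such as unittest and pytest; pytest is the easier of the two to use thanks to its concise syntax.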
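The three rules are easier to picture with a tiny example. In the sketch below the test is written first, and age_group is then given just enough logic to make it pass. Both names are hypothetical; the boundaries loosely follow the age bins used earlier in the article (up to 18 is a child, up to 65 an adult), but that mapping is my assumption.

```python
def test_age_group_boundaries():
    # Step 1 (red): written before age_group exists, so it fails at first.
    assert age_group(10) == "child"
    assert age_group(18) == "child"
    assert age_group(19) == "adult"
    assert age_group(70) == "senior"


def age_group(age):
    """Step 2 (green): just enough code to make the test above pass."""
    if age <= 18:
        return "child"
    if age <= 65:
        return "adult"
    return "senior"


# With pytest the test would be discovered automatically; calling it
# directly keeps this sketch runnable as a plain script.
test_age_group_boundaries()
```

Handling of negative or missing ages is deliberately absent: under the rules above, it would only be added once a failing test asks for it.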
Take another look at the ChurnPredictionTrainPipeline class from the previous section:
import pandas as pd
from sklearn.preprocessing import StandardScaler


class ChurnPredictionTrainPipeline(TrainingPipeline):
    def load_data(self):
        """Load dataset from data path."""
        self.data = pd.read_csv(self.data_path)

    ...
Add unit tests for loading the data with pytest:
from unittest.mock import patch

import pandas as pd
import pytest

# Assumption: the pipeline class is importable from the churn_library module.
from churn_library import ChurnPredictionTrainPipeline


@pytest.fixture
def path():
    """
    Return the path to the test csv data file.
    """
    return r"./data/bank_data.csv"


def test_import_data_returns_dataframe(path):
    """
    Test that import data can load the CSV file into a pandas dataframe.
    """
    churn_predictor = ChurnPredictionTrainPipeline(path, "Churn")
    churn_predictor.load_data()
    assert isinstance(churn_predictor.data, pd.DataFrame)


def test_import_data_raises_exception():
    """
    Test that exception of "FileNotFoundError" gets raised in case the CSV
    file does not exist.
    """
    with pytest.raises(FileNotFoundError):
        churn_predictor = ChurnPredictionTrainPipeline("non_existent_file.csv",
                                                       "Churn")
        churn_predictor.load_data()


def test_import_data_reads_csv(path):
    """
    Test that the pandas.read_csv function gets called.
    """
    with patch("pandas.read_csv") as mock_csv:
        churn_predictor = ChurnPredictionTrainPipeline(path, "Churn")
        churn_predictor.load_data()
        mock_csv.assert_called_once_with(path)
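These unit tests check:
- That the CSV file can be loaded into a pandas DataFrame.
- That a FileNotFoundError is raised when the CSV file does not exist.
- That the pandas read_csv function is called.
This is not strictly TDD, because I had already written the code before adding the unit tests. Ideally, you would write these unit tests before even implementing the load_data function.
Conclusion
The four rules of simple design aim to make code cleaner, more readable, and easier to maintain. The four rules are:
- Run all the tests (most important)
- Eliminate duplicated code
- Express the programmer's intent
- Minimize the number of classes and methods (least important)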
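The last three rules are about refactoring. Do not aim for perfection when you first write the code: start with simple or even "ugly" code, and once it runs, refactor it to follow the rules above until it becomes elegant.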
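In short, I recommend an "implement first, refactor later" way of working: get the code running and the feature working, then refactor repeatedly and move step by step toward the four rules of simple design.
Writing clean code is essential to the success of a software project, but it takes discipline and continuous practice. As data scientists, we tend to focus on running code in Jupyter Notebooks, finding a good model, and getting good metrics, and we neglect how clean the code is. Yet writing clean code is a required skill for data scientists too, because it helps models reach production faster.
Whenever you write code that will be reused, stick to writing clean code. Start simple and polish the code over several iterations. Never forget to write unit tests for your functions, so that they keep working and extending the code later does not cause major problems.
Sticking to principles such as eliminating duplication and expressing intent helps you move away from the "never change a running system" mindset. I am still learning these principles and applying them in my daily work. They really do help, but mastering them takes a long time and sustained effort.
Finally, automate as much as you can, and use the extensions your IDE offers to help you follow clean code rules and work more efficiently.
References
[1] PEP 8 style guide: https://peps.python.org/pep-0008/
[2] Pylint: https://pylint.readthedocs.io/en/stable/
[3] autopep8: https://pypi.org/project/autopep8/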