Seven Basic Algorithms for Getting Started Quickly with Machine Learning in Python

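Machine learning, a technique that lets computers learn automatically from data, has developed rapidly in recent years. This article introduces several basic machine learning algorithms and shows how to apply each one with a Python code example.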

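1. What is machine learning

Machine learning is a technique that lets a computer learn from data and make predictions or decisions without being explicitly programmed for the task. Its core is building a model and training it on data, usually a large amount, so that it can accurately predict outcomes for data it has not seen.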

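2. Why Python

Python is simple and easy to learn, and it has powerful scientific computing libraries such as NumPy, Pandas, and Scikit-learn. These libraries provide a large set of functions and tools for processing data, training models, and evaluating performance.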

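3. Linear regression

Linear regression is one of the simplest machine learning algorithms. It assumes a linear relationship between the dependent variable y and the independent variable x, that is, y = ax + b. The goal is to find the best-fit line: the one that minimizes the sum of squared vertical distances (residuals) between the observed points and the line.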
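To make the least-squares criterion concrete, here is a minimal sketch for the single-variable case using only the standard library. The helper name fit_line and the sample data are illustrative assumptions, not part of the original article; the formulas are the standard closed-form solution for a = cov(x, y) / var(x) and b = mean(y) - a * mean(x).

```python
import random


def fit_line(xs, ys):
    """Return (a, b) minimizing sum((y - (a*x + b))**2) over the given points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    b = y_mean - a * x_mean
    return a, b


# Noise-free points on y = 3x + 4 are recovered up to rounding error.
a_exact, b_exact = fit_line([0.0, 1.0, 2.0, 3.0], [4.0, 7.0, 10.0, 13.0])
print("exact data:", a_exact, b_exact)

# Noisy data generated the same way as the article's dataset: y = 4 + 3x + noise.
random.seed(0)
xs = [2 * random.random() for _ in range(100)]
ys = [4 + 3 * x + random.gauss(0, 1) for x in xs]
a, b = fit_line(xs, ys)
print("noisy data:", a, b)
```

On the noisy data the estimates should land near the true slope 3 and intercept 4 without matching them exactly, which is the same behavior to expect from the scikit-learn model.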

Code example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Create a synthetic dataset: y = 4 + 3x + Gaussian noise
np.random.seed(42)  # fixed seed so the generated data is reproducible
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the linear regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Visualize the test points and the fitted line
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()

# Print the coefficient and the intercept
print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)

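Output: running the code above produces a scatter plot in which the blue points are the actual values and the red line is the prediction. The console also prints the model's coefficient and intercept, which should be close to the 3 and 4 used to generate the data.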

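4. Logistic regression

Logistic regression is mainly used for binary classification. It passes a linear combination of the features through the sigmoid function, which maps it into the interval (0, 1), and interprets the result as the probability that the event occurs. Training maximizes the likelihood function, that is, it finds the parameters under which the observed training labels are most probable.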
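The two ideas in this paragraph, the sigmoid mapping and the likelihood objective, can be sketched in a few lines of standard-library Python. The function names and the four-point toy dataset are assumptions made for illustration; the sketch only evaluates the log-likelihood for two candidate weights and does not perform the optimization that scikit-learn carries out.

```python
import math


def sigmoid(z):
    """Logistic function; for moderate z the output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(w, b, xs, ys):
    """Bernoulli log-likelihood of labels ys (0 or 1) given 1-D inputs xs."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)  # predicted probability that y == 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


# Toy data: negative inputs belong to class 0, positive inputs to class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

ll_good = log_likelihood(2.0, 0.0, xs, ys)   # weight that agrees with the labels
ll_bad = log_likelihood(-2.0, 0.0, xs, ys)   # weight with the wrong sign

print("sigmoid(0) =", sigmoid(0.0))
print("log-likelihood with w = 2: ", ll_good)
print("log-likelihood with w = -2:", ll_bad)
```

The weight whose sign agrees with the labels has the higher log-likelihood, which is the quantity training tries to push up.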

Code example:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Create a dataset with two features and two classes
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Visualize the decision regions over axis = [x_min, x_max, y_min, y_max]
def plot_decision_boundary(model, axis):
    x0, x1 = np.meshgrid(
        np.linspace(axis[0], axis[1], int((axis[1]-axis[0])*100)),
        np.linspace(axis[2], axis[3], int((axis[3]-axis[2])*100)),
    )
    # Predict the class of every grid point, then reshape back to the grid
    X_new = np.c_[x0.ravel(), x1.ravel()]
    y_predict = model.predict(X_new)
    zz = y_predict.reshape(x0.shape)
    custom_cmap = ListedColormap(['#EF9A9A','#FFF59D','#90CAF9'])
    plt.contourf(x0, x1, zz, cmap=custom_cmap)

plot_decision_boundary(model, axis=[-3, 3, -3, 3])
plt.scatter(X[y==0, 0], X[y==0, 1])
plt.scatter(X[y==1, 0], X[y==1, 1])
plt.show()

# Print the accuracy on the test set
print("Accuracy:", accuracy_score(y_test, y_pred))

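Output: running the code above produces a decision-boundary plot that shows how the logistic regression model separates the two classes. The console also prints the model's accuracy on the test set.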

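5. Decision tree

A decision tree is a tree-structured algorithm for classification and regression. It recursively partitions the dataset to build the tree. Each internal node represents a test on one attribute, each branch represents an outcome of that test, and each leaf node represents a class label or a numeric value.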
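A single step of that recursive partitioning can be sketched as follows: for one numeric feature, try each midpoint between neighboring values as a threshold and keep the one with the lowest weighted Gini impurity. Gini impurity is the default split criterion of scikit-learn's DecisionTreeClassifier; the helper names and the toy data here are illustrative assumptions, and a real tree repeats this search over all features and then recurses into each side.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure node, larger for a more mixed node."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(feature, labels):
    """Return (threshold, weighted_gini) for the best split 'feature <= threshold'.

    Returns (None, gini(labels)) when no split improves on the unsplit node.
    """
    values = sorted(set(feature))
    best_t, best_score = None, gini(labels)
    # Candidate thresholds are the midpoints between consecutive distinct values
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(feature, labels) if x <= t]
        right = [y for x, y in zip(feature, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


feature = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
threshold, score = best_threshold(feature, labels)
print("best threshold:", threshold, "weighted Gini:", score)
```

For this toy data the best split falls in the gap between the two groups and leaves both sides pure.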

Code example:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Load the dataset
data = load_iris()
X = data.data[:, :2]  # use only the first two features
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the decision tree model; random_state makes tie-breaking reproducible
model = DecisionTreeClassifier(max_depth=3, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Visualize the decision tree
# class_names is passed as a list, since some scikit-learn versions reject a NumPy array here
plt.figure(figsize=(15, 10))
plot_tree(model, filled=True, feature_names=data.feature_names[:2],
          class_names=list(data.target_names))
plt.show()

# Print the accuracy on the test set
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

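Output: running the code above produces a visualization of the decision tree that shows how it classifies samples based on their features. The console also prints the model's accuracy on the test set.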

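6. Support vector machine (SVM)

A support vector machine is a classification and regression method based on the principle of margin maximization. It looks for the hyperplane that separates the two classes with the largest possible margin. For problems that are not linearly separable, a kernel function implicitly maps the data into a higher-dimensional space in which a suitable hyperplane can be found.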
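To illustrate what "largest margin" means, the sketch below measures the geometric margin of two hand-picked separating lines on a toy dataset. It does not train an SVM; the function name, the data, and the two candidate hyperplanes are assumptions chosen for illustration. Labels are written as +1 and -1, the usual convention in the SVM formulation.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A positive value means every point lies on its correct side; the
    hyperplane an SVM looks for is the one that makes this value largest.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Two classes separated along the first coordinate (x = 0 versus x = 2).
points = [(0.0, 0.0), (0.0, 1.0), (2.0, 0.0), (2.0, 1.0)]
labels = [-1, -1, 1, 1]

centered = geometric_margin((1.0, 0.0), -1.0, points, labels)  # the line x = 1
shifted = geometric_margin((1.0, 0.0), -0.5, points, labels)   # the line x = 0.5

print("margin of the line x = 1:  ", centered)
print("margin of the line x = 0.5:", shifted)
```

Both lines classify every point correctly, but the centered one keeps the closest points twice as far away, so it is the one margin maximization prefers.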

Code example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Create a dataset with two clusters
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the SVM model with a linear kernel
model = SVC(kernel='linear')

# Train the model
model.fit(X_train, y_train)

# Visualize the decision boundary and the margins
def plot_svm_boundary(model, X, y, padding=1.0):
    # Build the grid from the data range so that all points fall inside it
    x_min, x_max = X[:, 0].min() - padding, X[:, 0].max() + padding
    y_min, y_max = X[:, 1].min() - padding, X[:, 1].max() + padding
    x0, x1 = np.meshgrid(
        np.linspace(x_min, x_max, 300),
        np.linspace(y_min, y_max, 300),
    )
    X_new = np.c_[x0.ravel(), x1.ravel()]
    # Signed score of each grid point: 0 is the boundary, -1 and +1 are the margins
    scores = model.decision_function(X_new).reshape(x0.shape)
    plt.contour(x0, x1, scores, colors='k', levels=[-1, 0, 1], linestyles=['--', '-', '--'])
    plt.scatter(X[y==0, 0], X[y==0, 1])
    plt.scatter(X[y==1, 0], X[y==1, 1])

plot_svm_boundary(model, X, y)
plt.show()

# Print the accuracy on the test set
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

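Output: running the code above produces a decision-boundary plot that shows how the SVM model separates the two classes, with the solid line marking the boundary and the dashed lines marking the margins. The console also prints the model's accuracy on the test set.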

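7. K-nearest neighbors (KNN)

K-nearest neighbors is an instance-based learning method. Given a test sample, KNN finds the K closest neighbors in the training set and predicts the sample's label from the labels of those neighbors, usually by majority vote. Euclidean distance is the usual distance metric.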
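The procedure described above is short enough to write out directly. The following is a minimal standard-library sketch that assumes majority voting and Euclidean distance; the function name knn_predict and the toy points are illustrative, and math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest neighbors."""
    # Pair every training point's Euclidean distance to the query with its label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and return the most common one
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = [0, 0, 0, 1, 1, 1]

pred_near_origin = knn_predict(train_points, train_labels, (0.5, 0.5))
pred_far_corner = knn_predict(train_points, train_labels, (5.5, 5.5))
print("prediction near the origin:", pred_near_origin)
print("prediction near (5, 5):", pred_far_corner)
```

There is no training step beyond storing the data, so all the work happens at prediction time; scikit-learn's KNeighborsClassifier can speed up the neighbor search with tree-based indexes.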

Code example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Create a dataset with two features and two classes
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the KNN model with K = 3
model = KNeighborsClassifier(n_neighbors=3)

# Train the model
model.fit(X_train, y_train)

# Visualize the decision regions over axis = [x_min, x_max, y_min, y_max]
def plot_knn_boundary(model, axis):
    x0, x1 = np.meshgrid(
        np.linspace(axis[0], axis[1], int((axis[1]-axis[0])*100)),
        np.linspace(axis[2], axis[3], int((axis[3]-axis[2])*100)),
    )
    # Predict the class of every grid point, then reshape back to the grid
    X_new = np.c_[x0.ravel(), x1.ravel()]
    y_predict = model.predict(X_new).reshape(x0.shape)
    plt.contourf(x0, x1, y_predict, cmap=plt.cm.Paired, alpha=0.8)
    plt.scatter(X[y==0, 0], X[y==0, 1])
    plt.scatter(X[y==1, 0], X[y==1, 1])

plot_knn_boundary(model, axis=[-3, 3, -3, 3])
plt.show()

# Print the accuracy on the test set
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

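Output: running the code above produces a decision-boundary plot that shows how the KNN model separates the two classes. The console also prints the model's accuracy on the test set.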

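Case study: handwritten digit recognition

Handwritten digit recognition is a classic machine learning problem that can be used to check how well various algorithms work. The MNIST dataset contains 70,000 images of handwritten digits, each 28x28 pixels; by convention 60,000 are used for training and 10,000 for testing. The code below instead makes a random 80/20 split with train_test_split.

Code example: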

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the MNIST dataset (downloaded on first use)
# as_frame=False returns NumPy arrays, so rows can be indexed and reshaped below
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist['data'], mnist['target']

# Scale pixel values from [0, 255] to [0, 1], which helps the solver converge
X = X / 255.0

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the logistic regression model
model = LogisticRegression(max_iter=1000)

# Train the model (this can take several minutes on the full dataset)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Print the accuracy on the test set
print("Accuracy:", accuracy_score(y_test, y_pred))

# Visualize one test sample together with its prediction
some_digit = X_test[0]
some_digit_image = some_digit.reshape(28, 28)

plt.imshow(some_digit_image, cmap=plt.cm.binary)
plt.axis("off")
plt.show()

print("Predicted:", model.predict([some_digit]))
print("Actual:", y_test[0])

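Output: running the code above prints the model's accuracy on the test set and displays one test sample together with its predicted and actual labels.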

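Summary

This article introduced several commonly used machine learning algorithms: linear regression, logistic regression, decision trees, support vector machines, and K-nearest neighbors, and showed how to apply each one with a Python code example. The handwritten digit recognition case study then applied logistic regression to a real dataset. We hope readers come away with a better understanding of machine learning and the ability to put it into practice.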