机器学习如何训练最终模型

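For readers who are new to machine learning, or who are moving into it from another field, "how to train a final model" is a classic question. Dr. Jason Brownlee wrote an article that answers it (original: http://machinelearningmastery.com/train-final-machine-learning-model/). OPEN01 (开数科技) has prepared the Chinese translation shown alongside the original text below, in the hope that it helps readers who are still learning.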

Original author: Dr. Jason Brownlee

Chinese translation: R.

Invited reviewer: Dr. Xu.Tang

Source: OPEN01 (开数科技; WeChat public account: open01tech)

How to Train a Final Machine Learning Model

机器学习如何训练最终模型

The machine learning model that we use to make predictions on new data is called the final model.

机器学习过程中，我们用来对新数据进行预测的模型被称为最终模型。

There can be confusion in applied machine learning about how to train a final model.

而对于如何训练最终模型，初学者可能会产生疑问或困惑。

This confusion is seen with beginners to the field who ask questions such as:

例如，初学者可能会提出以下问题：

• How do I predict with cross validation?

· 我应该如何通过交叉验证进行预测?

• Which model do I choose from cross-validation?

· 根据交叉验证我应该选择哪个模型?

• Do I use the model after preparing it on the training dataset?

· 我应该使用在训练集上建立的模型吗?

This post will clear up the confusion.

本文的目的在于解答这些问题。

In this post, you will discover how to finalize your machine learning model in order to make predictions on new data.

通过本文，你将会了解如何最终选定你的机器学习模型，从而对新的数据进行预测。

Let’s get started.

让我们开始吧。

What is a Final Model?

什么是“最终模型”?

A final machine learning model is a model that you use to make predictions on new data.

在机器学习中，“最终模型”是指用来预测新数据的模型。

That is, given new examples of input data, you want to use the model to predict the expected output. This may be a classification (assign a label) or a regression (a real value).

也就是说，给定新的输入数据样例，你希望用这个模型预测出期望的输出结果。这可能是一个分类问题(分配类别标签)，也可能是一个回归问题(预测一个实数值)。

For example, whether the photo is a picture of a dog or a cat, or the estimated number of sales for tomorrow.

比如我们可以通过模型，去判断某个照片中是汪还是咪，又或者可以去预估明天的销售额。

The goal of your machine learning project is to arrive at a final model that performs the best, where “best” is defined by:

进行机器学习的目的是训练一个“最好”的最终模型。这里“最好”是由以下因素决定的：

• Data: the historical data that you have available.

· 数据：可用的历史数据。

• Time: the time you have to spend on the project.

· 时间：你可以投入到项目中的时间。

• Procedure: the data preparation steps, algorithm or algorithms, and the chosen algorithm configurations.

· 过程：数据准备步骤、算法或算法集，以及如何配置这些算法。

In your project, you gather the data, spend the time you have, and discover the data preparation procedures, algorithm to use, and how to configure it.

在项目中，你需要收集数据、投入可用的时间，并确定数据预处理流程、要使用的算法以及算法的配置方式。

The final model is the pinnacle of this process, the end you seek in order to start actually making predictions.

“最终模型”则是整个过程的终点，通过它你可以开始对实际数据进行预测。

The Purpose of Train/Test Sets

使用训练/测试数据集的目的

Why do we use train and test sets?

为什么要使用训练/测试数据集?

Creating a train and test split of your dataset is one method to quickly evaluate the performance of an algorithm on your problem.

通过把数据分割成训练集/测试集，能够快速地评估你的算法的性能如何。

The training dataset is used to prepare a model, to train it.

训练数据集是用来形成、并训练模型的。

We pretend the test dataset is new data where the output values are withheld from the algorithm. We gather predictions from the trained model on the inputs from the test dataset and compare them to the withheld output values of the test set.

对于测试数据，我们假设测试数据集是新的数据，在模型训练过程中隐藏已知的输出值(事实上我们是知道输出值的)。基于测试数据的输入和在训练数据上构建的模型，我们可以预测测试数据上的输出值并将它们与真实输出进行比较。

Comparing the predictions and withheld outputs on the test dataset allows us to compute a performance measure for the model on the test dataset. This is an estimate of the skill of the algorithm trained on the problem when making predictions on unseen data.

通过将测试数据集的预测结果和我们事先已知的输出结果进行比对，可以衡量模型在测试数据上的表现，从而估计模型在未知数据集上的预测能力。
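As a minimal sketch of the train/test evaluation described above, the Python below holds out part of a toy dataset, trains on the rest, and compares predictions with the withheld outputs. The synthetic two-cluster data, the 70/30 split, and the helper name `knn_predict` are illustrative assumptions and do not come from the original article.

```python
import random


def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)


# Hypothetical toy dataset: two well-separated clusters labelled 0 and 1.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
X += [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(50)]
y = [0] * 50 + [1] * 50

# Shuffle the row indices and hold out 30 rows as the pretend "new" data.
indices = list(range(len(X)))
rng.shuffle(indices)
train_idx, test_idx = indices[:70], indices[70:]

train_X = [X[i] for i in train_idx]
train_y = [y[i] for i in train_idx]

# Predict from the test inputs only, then compare with the withheld outputs.
predictions = [knn_predict(train_X, train_y, X[i]) for i in test_idx]
correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
accuracy = correct / len(test_idx)
print(f"test accuracy: {accuracy:.2f}")
```

The printed accuracy is only an estimate of skill on unseen data. The model trained here exists to produce that estimate, not to be deployed.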

Let’s unpack this further

让我们进一步展开来解释

When we evaluate an algorithm, we are in fact evaluating all steps in the procedure, including how the training data was prepared (e.g. scaling), the choice of algorithm (e.g. kNN), and how the chosen algorithm was configured (e.g. k=3).

当我们评估一个算法，我们实际上是评估计算过程中的所有步骤，包括如何准备训练数据(例如：缩放)、算法的选择(例如：KNN)，以及如何配置我们的算法(例如：K = 3)。

The performance measure calculated on the predictions is an estimate of the skill of the whole procedure.

基于预测结果计算出的评价指标，是对整个流程(而不仅仅是算法本身)预测能力的估计。
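One way to make "the whole procedure" concrete is to bundle data preparation, algorithm, and configuration into a single function, so that whatever gets evaluated is exactly what gets reused later. The sketch below assumes min-max scaling, kNN, and k=3 (the examples named in the text). The function name `fit_procedure` and the cat/dog data are hypothetical.

```python
def fit_procedure(train_X, train_y, k=3):
    """Whole procedure: learn min-max scaling, then classify with kNN."""
    columns = list(zip(*train_X))
    lows = [min(c) for c in columns]
    # Guard against constant columns, which would give a zero range.
    spans = [(max(c) - min(c)) or 1.0 for c in columns]

    def scale(row):
        return [(v - lo) / span for v, lo, span in zip(row, lows, spans)]

    scaled_X = [scale(row) for row in train_X]

    def predict(row):
        x = scale(row)  # reuse the scaling learned from the training data
        nearest = sorted(
            range(len(scaled_X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(scaled_X[i], x)),
        )[:k]
        votes = [train_y[i] for i in nearest]
        return max(set(votes), key=votes.count)

    return predict


# Hypothetical data: the second feature has a much larger range than the first.
train_X = [[1.0, 1000.0], [1.2, 1100.0], [0.9, 900.0],
           [3.0, 5000.0], [3.2, 5200.0], [2.9, 4800.0]]
train_y = ["cat", "cat", "cat", "dog", "dog", "dog"]

model = fit_procedure(train_X, train_y, k=3)
print(model([1.1, 1050.0]))
print(model([3.1, 5100.0]))
```

Changing the scaling, the algorithm, or `k` gives a different procedure, and each variant would need its own skill estimate.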

We generalize the performance measure from:

我们把评价指标从：

• “the skill of the procedure on the test set”

· “该流程在测试集上的预测能力”

to

推广到：

• “the skill of the procedure on unseen data”.

· “该流程在未知数据上的预测能力”。

This is quite a leap and requires that:

这两者之间实际上有相当大的距离，这个过程需要满足以下条件：

• The procedure is sufficiently robust that the estimate of skill is close to what we actually expect on unseen data.

· 模型具有足够的鲁棒性，使得这种估计能够充分接近模型在未知数据集上的预测精度。

• The choice of performance measure accurately captures what we are interested in measuring in predictions on unseen data.

· 评价指标的选择能够真实反映我们对于数据预测的关注点。

• The choice of data preparation is well understood and repeatable on new data, and reversible if predictions need to be returned to their original scale or related to the original input values.

· 数据的预处理是合理的，并且能够在新数据集上重复; 同时如果预测过程需要回溯到原数据的量纲上，那么预处理过程还要是可逆的。

• The choice of algorithm makes sense for its intended use and operational environment (e.g. complexity or chosen programming language).

· 算法的选择应该考虑其实际的应用目标和操作环境(例如算法复杂度或编程语言的选择)。
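The requirement that data preparation be repeatable and reversible can be illustrated with standardization: the parameters are learned once, reapplied unchanged to new data, and inverted to bring a prediction back to its original scale. This is a sketch under assumed names (`fit_standardizer`, `transform`, `inverse_transform`) with hypothetical sales figures. The "scaled prediction" is a stand-in for a model output.

```python
def fit_standardizer(values):
    """Learn the mean and standard deviation used for scaling."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5 or 1.0  # avoid dividing by zero for constant data
    return mean, std


def transform(values, mean, std):
    """Scale values using previously learned parameters (repeatable)."""
    return [(v - mean) / std for v in values]


def inverse_transform(values, mean, std):
    """Map scaled values back to the original units (reversible)."""
    return [v * std + mean for v in values]


# Hypothetical daily sales in original units.
sales = [120.0, 135.0, 150.0, 160.0, 185.0]
mean, std = fit_standardizer(sales)
scaled = transform(sales, mean, std)

# Stand-in for a prediction made by a model trained on the scaled values.
scaled_prediction = 0.5
prediction = inverse_transform([scaled_prediction], mean, std)[0]
print(f"predicted sales in original units: {prediction:.1f}")
```

Keeping `mean` and `std` alongside the model is what makes the same preparation applicable to data that arrives later.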

A lot rides on the estimated skill of the whole procedure on the test set.

机器学习方法在测试数据上的表现将会决定我们最终模型，包括数据预处理过程、具体模型类型、参数的选择和训练环境等诸多因素。

In fact, using the train/test method of estimating the skill of the procedure on unseen data often has a high variance (unless we have a heck of a lot of data to split). This means that when it is repeated, it gives different results, often very different results.

事实上，使用训练/测试数据分割法来估计流程在未知数据上的预测能力，往往具有很高的方差(除非有海量的数据可供分割)。也就是说，重复进行这一过程时会得到不同的结果，而且往往差异很大。

The outcome is that we may be quite uncertain about how well the procedure actually performs on unseen data and how one procedure compares to another.

其结果是，我们可能不太确定流程在未知数据集上的实际表现如何，也不太确定不同流程之间孰优孰劣。
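The variance of a single train/test estimate can be seen by repeating the split several times on the same data. The sketch below assumes a small synthetic dataset with two overlapping classes and a simple midpoint-threshold classifier. Both are illustrative choices, as are the names `make_data` and `evaluate_split`.

```python
import random
import statistics


def make_data(rng, n=60):
    """Hypothetical 1-D data: two overlapping classes of equal size."""
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n // 2)]
    data += [(rng.gauss(1.5, 1.0), 1) for _ in range(n // 2)]
    return data


def evaluate_split(data, rng, test_size=18):
    """One train/test evaluation of a midpoint-threshold classifier."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    train, test = shuffled[test_size:], shuffled[:test_size]
    mean0 = statistics.mean(x for x, label in train if label == 0)
    mean1 = statistics.mean(x for x, label in train if label == 1)
    threshold = (mean0 + mean1) / 2
    correct = sum((x > threshold) == (label == 1) for x, label in test)
    return correct / len(test)


rng = random.Random(42)
data = make_data(rng)

# Same data, same procedure, ten different random splits.
scores = [evaluate_split(data, rng) for _ in range(10)]
print("accuracy per split:", [round(s, 2) for s in scores])
print("lowest:", round(min(scores), 2), "highest:", round(max(scores), 2))
```

With only 18 test rows, each misclassified example moves the accuracy by more than five percentage points, which is why individual splits can disagree noticeably.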

Often, time permitting, we prefer to use k-fold cross-validation instead.

因此，在时间允许的情况下，我们通常更倾向于使用k折交叉验证。

The Purpose of k-fold Cross Validation

交叉验证的目的

Why do we use k-fold cross validation?

为什么要使用交叉验证?

Cross-validation is another method to estimate the skill of a method on unseen data, like using a train-test split.

与前面提到的训练/测试分割法类似，“交叉验证”是另一种用来估计模型在未知数据集上预测能力的方法。

Cross-validation systematically creates and evaluates multiple models on multiple subsets of the dataset.

交叉验证系统地在原数据的多个子集上创建多个模型，并进行评估。

This, in turn, provides a population of performance measures.

这同时提供了相关模型的一组评价指标。

• We can calculate the mean of these measures to get an idea of how well the procedure performs on average.

· 我们可以对这组评价指标取均值以评估模型的性能。

• We can calculate the standard deviation of these measures to get an idea of how much the skill of the procedure is expected to vary in practice.

· 我们可以计算出这些指标的标准偏差，从而了解在真实数据集中会产生多大范围的变化。
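The mean and standard deviation described above can be sketched with a hand-rolled k-fold loop. The example assumes k=5, a synthetic 1-D dataset, and a midpoint-threshold classifier. These choices and the helper names `k_fold_indices` and `threshold_accuracy` are hypothetical.

```python
import random
import statistics


def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def threshold_accuracy(train, test):
    """Fit a midpoint threshold on train and return accuracy on test."""
    mean0 = statistics.mean(x for x, label in train if label == 0)
    mean1 = statistics.mean(x for x, label in train if label == 1)
    threshold = (mean0 + mean1) / 2
    correct = sum((x > threshold) == (label == 1) for x, label in test)
    return correct / len(test)


# Hypothetical data: two overlapping classes, 50 examples each.
rng = random.Random(1)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(50)]
data += [(rng.gauss(1.5, 1.0), 1) for _ in range(50)]

folds = k_fold_indices(len(data), 5, rng)
scores = []
for held_out in folds:
    held = set(held_out)
    train = [data[i] for i in range(len(data)) if i not in held]
    test = [data[i] for i in held_out]
    scores.append(threshold_accuracy(train, test))  # one model per fold

mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)
print(f"mean accuracy: {mean_score:.3f}, standard deviation: {std_score:.3f}")
```

Every row is used for testing exactly once, so the population of five scores says more about the procedure than any single split does.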

This is also helpful for providing a more nuanced comparison of one procedure to another when you are trying to choose which algorithm and data preparation procedures to use.

在你试图选择使用哪种算法和哪种数据预处理方法时，这也有助于更细致地比较不同的流程。

Also, this information is invaluable as you can use the mean and spread to give a confidence interval on the expected performance of a machine learning procedure in practice.

此外，这些信息的价值还在于你能计算它们的均值和范围来构建机器学习模型的预测能力的置信区间。
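The article does not say how the interval should be built. One common approximation, shown here as an assumption rather than the author's method, is a normal-theory interval on the mean: mean ± 1.96 × standard error. The scores are hypothetical, and because cross-validation folds share training data, the result is better read as a rough guide than an exact 95% interval.

```python
import math
import statistics

# Hypothetical accuracies from 10-fold cross-validation.
scores = [0.81, 0.79, 0.84, 0.77, 0.80, 0.83, 0.78, 0.82, 0.80, 0.76]

mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)  # sample standard deviation

# Normal approximation; assumes roughly independent, bell-shaped scores.
standard_error = std_score / math.sqrt(len(scores))
lower = mean_score - 1.96 * standard_error
upper = mean_score + 1.96 * standard_error

print(f"mean accuracy: {mean_score:.3f}")
print(f"approximate 95% interval: [{lower:.3f}, {upper:.3f}]")
```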

Both train-test splits and k-fold cross validation are examples of resampling methods.

训练/测试数据集和交叉验证都是使用重采样的方法。

Why do we use Resampling Methods?

为什么要使用重采样方法?

The problem with applied machine learning is that we are trying to model the unknown.

应用机器学习的难点在于，我们试图对未知的事物进行建模。

On a given predictive modeling problem, the ideal model is one that performs the best when making predictions on new data.

对于一个给定的预测建模问题，理想的模型是在对新数据进行预测时表现最好的模型。

We don’t have new data, so we have to pretend with statistical tricks.

但在此之前，我们没有新的数据，所以我们不得不通过统计方法来模拟。

The train-test split and k-fold cross validation are called resampling methods. Resampling methods are statistical procedures for sampling a dataset and estimating an unknown quantity.

训练/测试数据集和交叉验证采用所谓的“重采样方法”。“重采样方法”是对数据集进行采样，并估计未知量的统计方法。

In the case of applied machine learning, we are interested in estimating the skill of a machine learning procedure on unseen data. More specifically, the skill of the predictions made by a machine learning procedure.

在应用机器学习时，我们关注的是估计机器学习流程在未知数据上的预测能力; 具体来说，就是该流程所给出的预测值的准确性。

Once we have the estimated skill, we are finished with the resampling method.

一旦我们估计出模型的预测精度，那么重采样方法的任务也就结束了。

• If you are using a train-test split, that means you can discard the split datasets and the trained model.

· 如果你使用的是训练/测试数据分割，这意味着你现在可以丢弃分割后的数据集以及在其上训练出的模型了。

• If you are using k-fold cross-validation, that means you can throw away all of the trained models.

· 如果你使用的是k-fold交叉验证，这意味着你可以扔掉所有在数据子集上训练的模型了。

They have served their purpose and are no longer needed.

因为它们的任务已经完成了。

You are now ready to finalize your model.

现在你可以着手确定最终模型了。

How to Finalize a Model?

如何确定最终模型?

You finalize a model by applying the chosen machine learning procedure on all of your data.

确定最终模型的方法是：在你的全部数据上运用已选定的机器学习流程，也就是用所有数据重新训练一次。

That’s it.

就是这样。

With the finalized model, you can:

对于最终模型，您可以：

• Save the model for later or operational use.

· 保存模型，以备日后使用或投入实际运行。

• Make predictions on new data.

· 对新数据作出预测。
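Finalizing can be sketched as three steps: fit the chosen procedure on all available data with no held-out split, save the result, and predict on new inputs. The nearest-centroid classifier below is a hypothetical stand-in for whatever procedure was selected during evaluation. The data and the names `fit_nearest_centroid` and `predict` are likewise illustrative.

```python
import pickle


def fit_nearest_centroid(X, y):
    """The chosen procedure (hypothetical): one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(centroids[label], x)),
    )


# All available historical data (hypothetical), with no train/test split.
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]]
y = ["cat", "cat", "cat", "dog", "dog", "dog"]

final_model = fit_nearest_centroid(X, y)

# Save for later use; in practice these bytes would be written to a file.
saved = pickle.dumps(final_model)
restored = pickle.loads(saved)

# Predict on new inputs whose outputs are genuinely unknown.
print(predict(restored, [1.1, 0.9]))
print(predict(restored, [4.0, 4.0]))
```

No skill is measured at this stage. The estimate obtained earlier from the train/test split or cross-validation is what describes how this final model is expected to perform.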

What about the cross-validation models or the train-test datasets?

那交叉验证模型或训练/测试数据集呢?

They’ve been discarded. They are no longer needed. They have served their purpose to help you choose a procedure to finalize.

它们已经被丢弃，不再需要了。它们已经完成了自身的使命：帮助你选出用于确定最终模型的流程。

About the author: Dr. Jason Brownlee

Dr. Jason Brownlee is a husband, proud father, academic researcher, author, professional developer and a machine learning practitioner. He is dedicated to helping developers get started and get good at applied machine learning.

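About the invited reviewer: Dr. Xu.Tang

Dr. Tang holds a PhD in statistics from the National University of Singapore. He was formerly a data analysis manager at Dagong Global (大公国际) and is now a senior data mining analyst at OPEN01 (开数科技).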

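About OPEN01 (开数科技):

OPEN01 is dedicated to providing users across industries with real-time, efficient, multi-dimensional data analysis products and services, built on world-leading AI big-data processing technology, a distinctive IT architecture, deep learning, and pattern recognition algorithms. Its core team brings together big-data experts from MIT, Harvard University, the State University of New York, and the University of Cambridge, along with strategy and operations experts from Roland Berger and Accenture.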