Installing and Configuring a Docker-Based Deep Learning Environment from 0 to 1 - 51CTO.COM

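Artificial intelligence is advancing quickly, and deep learning, one of its core drivers, is reshaping both research and industry. For developers and researchers, an efficient and reliable deep learning environment is a basic requirement.

This article walks through installing and configuring a Docker-based deep learning environment from scratch.

The approach works with Nvidia GPUs on common hardware, from A100 servers to consumer cards such as the RTX 4090. Ubuntu is used as the example operating system.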

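I. Installing the Ubuntu Operating System

Installing Ubuntu works the same way it always has, in three familiar steps: download the image, create a bootable drive, and install the system.

1. Download a suitable Ubuntu image

Visit the official Ubuntu website and download the release you need: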

Desktop: https://releases.ubuntu.com/22.04/ubuntu-22.04-desktop-amd64.iso
Server: https://releases.ubuntu.com/22.04/ubuntu-22.04-live-server-amd64.iso

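The server edition suits servers that do not need a graphical interface, as well as highly customized systems. It emphasizes performance, remote management, and automated operations, which makes it a good fit for cloud servers, containerized applications, and database servers. The desktop edition ships a full graphical user interface (GUI) and suits everyday use, development machines, and workstations that need direct interaction. It comes with commonly used software and utilities preinstalled and has good hardware support, including automatic installation of most drivers, which is especially convenient on laptops.

2. Create the installation drive

This walkthrough uses balenaEtcher to create the installation drive.

After downloading and launching it, select the downloaded system image and the USB drive to write to, then click "Flash!". After a short wait, the installation drive is ready.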

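3. Install the system

(1) Set the BIOS/UEFI boot order

Restart the computer and press the designated key when the startup screen appears (usually F2, F10, F12, or Del, depending on the motherboard) to enter the BIOS or UEFI setup.
Find the "Boot" settings and move USB HDD, or whichever device has USB in its name, to the top of the boot order.
Save the changes and exit. The computer restarts and boots from the USB drive.

(2) Boot into the installer

When the Ubuntu logo appears, the system has booted from the USB drive. Wait a moment while the installer loads.
On the first screen of the installation wizard, choose "Install Ubuntu".

(3) The additional drivers option

During installation, the installer may detect that your system needs additional hardware drivers, particularly for Nvidia GPUs. You will then be asked whether to install third-party software (such as media codecs and proprietary hardware drivers). If you are unsure, tick this option so that the system can recognize and make full use of your hardware right after installation.

(4) Wait for the installation to finish

Click "Continue" to start the installation. This can take a while.
When it finishes, you are prompted to restart. Remove the USB drive and click "Restart Now".

After the restart, the login screen of the newly installed Ubuntu system appears. Log in with the username and password you set during installation.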

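II. Basic System Configuration

One of the first tasks after installing Ubuntu is updating the system so that it has the latest security patches, package upgrades, and bug fixes.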

sudo apt update && sudo apt -y upgrade

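If updates download too slowly, you can switch to a mirror in mainland China, such as the Tsinghua University (TUNA) mirror. The commands below assume /etc/apt/sources.list currently points at cn.archive.ubuntu.com and security.ubuntu.com.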

sudo sed -i -e "s/cn.archive.ubuntu.com/mirrors.tuna.tsinghua.edu.cn/" /etc/apt/sources.list
sudo sed -i -e "s/security.ubuntu.com/mirrors.tuna.tsinghua.edu.cn/" /etc/apt/sources.list
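
The two substitutions above only match the cn.archive.ubuntu.com and security.ubuntu.com hosts. As a minimal sketch, the same idea can be wrapped in a function that also handles the plain archive.ubuntu.com host and keeps a backup. The function name switch_apt_mirror and the .bak suffix are illustrative choices, not part of the original setup, and the function is meant to be tried on a copy of the file first.

```shell
# Rewrite Ubuntu archive/security hosts in an APT sources file to a mirror.
# Usage: switch_apt_mirror FILE [MIRROR_HOST]
# (hypothetical helper; the default mirror is the TUNA host used above)
switch_apt_mirror() {
  file="$1"
  mirror="${2:-mirrors.tuna.tsinghua.edu.cn}"
  # Keep a backup next to the file so the change is easy to revert.
  cp "$file" "$file.bak" || return 1
  # Match archive.ubuntu.com with an optional two-letter country prefix
  # (for example cn.archive.ubuntu.com), plus security.ubuntu.com.
  sed -E \
    -e "s#//([a-z]{2}\.)?archive\.ubuntu\.com#//$mirror#g" \
    -e "s#//security\.ubuntu\.com#//$mirror#g" \
    "$file.bak" > "$file"
}
```

To preview the result without touching the system, copy /etc/apt/sources.list to a scratch location, run switch_apt_mirror on the copy, and compare the copy with its .bak file using diff.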

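Once packages and system patches have finished updating, reboot so that the patches take effect (the first update usually includes a new kernel).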

sudo reboot

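Installing OpenSSH Server

The desktop edition does not include openssh-server by default, and the server edition includes it only if you selected it during installation. If you need to reach this machine from other devices on the local network, install it with the following command.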

sudo apt update && sudo apt install -y openssh-server

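Once it is installed, you can reach the Linux server with ssh username@host-ip. If your username on the device you are connecting from matches an account allowed to log in on the Linux machine, you can omit "username@".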

ssh 10.11.12.240

The authenticity of host '10.11.12.240 (10.11.12.240)' can't be established.
ED25519 key fingerprint is SHA256:cYodQ6Chywyna1JbHWfA7XAFonHKAz48cPmjRyVOCFU.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '10.11.12.240' (ED25519) to the list of known hosts.
soulteary@10.11.12.240's password:

On the first login, type yes so that your device trusts the target's host key fingerprint, then enter the password. The familiar terminal banner appears:

Welcome to Ubuntu 22.04 LTS (GNU/Linux 5.15.0-25-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

189 updates can be applied immediately.
73 of these updates are standard security updates.
To see these additional updates run: apt list --upgradable


The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

III. Installing the GPU Driver

You can use nvidia-detector to find the name of the latest stable driver package.

# nvidia-detector
nvidia-driver-525

Until the driver is installed, the nvidia-smi management tool is not available.

# nvidia-smi
zsh: command not found: nvidia-smi

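When installing the driver, it is worth installing the matching nvidia-dkms package alongside nvidia-driver. DKMS rebuilds the kernel module when the kernel changes, which saves trouble if you later upgrade or downgrade the kernel: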

sudo apt-get install -y nvidia-driver-525 nvidia-dkms-525

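After the driver is installed (a reboot may be needed before the new kernel module loads), run nvidia-smi again and you can manage the GPU.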

# nvidia-smi

Tue Mar 21 22:53:37 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  Off |
| 31%   34C    P8    19W / 450W |     53MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1331      G   /usr/lib/xorg/Xorg                 36MiB |
|    0   N/A  N/A      1552      G   /usr/bin/gnome-shell               15MiB |
+-----------------------------------------------------------------------------+
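
For scripted setups, the same check can be done without reading the table by eye. The sketch below assumes nvidia-smi is on the PATH and uses its --query-gpu=driver_version query. The helper names driver_major and driver_at_least are made up for illustration, and 525 is simply the driver series installed above.

```shell
# Print the major version of a driver version string such as 525.85.05.
driver_major() {
  printf '%s\n' "${1%%.*}"
}

# Succeed if the driver's major version is at least MIN_MAJOR.
# Usage: driver_at_least MIN_MAJOR [VERSION]
# VERSION is read from nvidia-smi when it is not passed explicitly.
driver_at_least() {
  min_major="$1"
  version="${2:-$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)}"
  [ "$(driver_major "$version")" -ge "$min_major" ]
}
```

A provisioning script could then run driver_at_least 525 and stop early with a clear message if the check fails.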

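IV. Installing and Configuring Docker with GPU Support

1. Install Docker on the host

Follow the official Docker documentation to install and configure Docker, and make sure the Docker service runs properly.

(1) Use apt remove to remove any older Docker packages that may be present, to avoid conflicts.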

sudo apt remove -y docker docker-engine docker.io containerd runc

(2) Install the required system tools and libraries: ca-certificates, curl, gnupg, and lsb-release.

sudo apt install -y ca-certificates curl gnupg lsb-release

(3) Download the GPG key used to sign the packages and configure the system to trust it.

sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

If the official address is unreachable, replace the key URL with one of the following.

# Tsinghua mirror
https://mirrors.tuna.tsinghua.edu.cn/docker-ce/linux/ubuntu/gpg
# Alibaba Cloud mirror
https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg

(4) Create a package source that matches the current CPU architecture and Ubuntu release.

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

Likewise, for faster downloads, you can point the package source at a mirror instead of the official address.

# Tsinghua mirror
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://mirrors.tuna.tsinghua.edu.cn/docker-ce/linux/ubuntu/ \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Alibaba Cloud mirror
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://mirrors.aliyun.com/docker-ce/linux/ubuntu/ \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
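
The three variants above differ only in the base URL. As a small sketch, the line can be generated by one function so that switching mirrors means changing a single argument. The name docker_apt_source_line and the optional ARCH and CODENAME arguments are illustrative additions. When they are omitted, the function falls back to the same dpkg and lsb_release calls used above.

```shell
# Print the APT source line for Docker CE.
# Usage: docker_apt_source_line BASE_URL [ARCH] [CODENAME]
docker_apt_source_line() {
  base_url="${1%/}"   # drop one trailing slash so all mirrors print alike
  arch="${2:-$(dpkg --print-architecture)}"
  codename="${3:-$(lsb_release -cs)}"
  printf 'deb [arch=%s signed-by=/etc/apt/keyrings/docker.gpg] %s %s stable\n' \
    "$arch" "$base_url" "$codename"
}
```

Its output can be piped into sudo tee /etc/apt/sources.list.d/docker.list in the same way as the echo commands above.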

The final step is to install Docker Community Edition along with the commonly used CLI tools.

sudo apt update && sudo apt install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin

2. Install the Docker GPU runtime

For containers to use the GPU, add the NVIDIA Container Toolkit package repository and then install the toolkit.

distribution=ubuntu22.04 && \
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && \
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
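
The commands above hard-code distribution=ubuntu22.04. On other Ubuntu releases, one possible way to derive the value is to read it from /etc/os-release, as sketched below. The function name distribution_id is made up for illustration, and whether the repository publishes a list file for a given value is an assumption to verify before relying on it.

```shell
# Print ID followed by VERSION_ID from an os-release file, e.g. ubuntu22.04.
# Usage: distribution_id [OS_RELEASE_FILE]
distribution_id() {
  # Source the file in a subshell so its variables do not leak into the caller.
  ( . "${1:-/etc/os-release}" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}
```

With this in place, distribution=$(distribution_id) could stand in for the hard-coded assignment.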

Once these commands finish, the libnvidia-container package source has been added to the system. Refresh the package list and install nvidia-container-toolkit:

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

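After installing nvidia-container-toolkit, run nvidia-ctk runtime configure to register the nvidia runtime with Docker. Once this is done and Docker has been restarted, applications in containers can use the GPU: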

sudo nvidia-ctk runtime configure --runtime=docker

If the command succeeds, you will see log output similar to the following:

# sudo nvidia-ctk runtime configure --runtime=docker

INFO[0000] Loading docker config from /etc/docker/daemon.json 
INFO[0000] Successfully loaded config                   
INFO[0000] Flushing docker config to /etc/docker/daemon.json 
INFO[0000] Successfully flushed config                  
INFO[0000] Wrote updated config to /etc/docker/daemon.json 
INFO[0000] It is recommended that the docker daemon be restarted.

After the configuration is written, restart the docker service so that it takes effect:

sudo systemctl restart docker

Once the service has restarted, list the Docker runtimes and confirm that nvidia is present.

# docker info | grep Runtimes

 Runtimes: nvidia runc io.containerd.runc.v2
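
In a setup script, this check can be made non-interactive. The sketch below reads docker info output from standard input and succeeds only if the Runtimes line lists nvidia. The function name has_nvidia_runtime is illustrative, and the output format is assumed to match the line shown above.

```shell
# Succeed if the text on stdin contains a "Runtimes:" line listing nvidia.
has_nvidia_runtime() {
  grep 'Runtimes:' | grep -qw 'nvidia'
}
```

For example, docker info 2>/dev/null | has_nvidia_runtime && echo "nvidia runtime registered" prints the message only when the runtime is present.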

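V. Pulling and Using AI Docker Images

Compared with installing and configuring a deep learning stack by hand, Docker lets you download ready-to-use environments with different capabilities. Such base images can usually be obtained from a few public registries. The examples in this section use NVIDIA's NGC registry (nvcr.io).

For example, right after a card like the RTX 4090 is released, official images are a good alternative to building an image from scratch: they tend to make better use of the GPU and spare you the tinkering.

Suppose you want the latest CUDA version with a PyTorch environment that works out of the box, and the Conda community has not caught up yet. Rather than digging through different package communities and hacking an installation together, the better choice is to use the official image directly.

For instance, a single command starts an experiment environment that bundles recent versions of CUDA and PyTorch (see the image's release notes for details):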

docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.02-py3

You can also change the command, for example appending nvidia-smi to check the runtime environment and the GPU status:

# docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.02-py3 nvidia-smi

=============
== PyTorch ==
=============

NVIDIA Release 23.02 (build 53420872)
PyTorch Version 1.14.0a0+44dac51

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for PyTorch.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

Tue Mar 21 15:30:19 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  Off |
| 31%   33C    P0    33W / 450W |    174MiB / 24564MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

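Although the command above did use the GPU, the log warns that the shared memory (SHMEM) limit is at its 64MB default, which may be too small for PyTorch. For best performance, adjust the command as the log recommends: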

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm nvcr.io/nvidia/pytorch:23.02-py3

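On a multi-GPU machine, replace --gpus all with a device selector to run on a specific card. A bare number after --gpus is interpreted as a GPU count rather than an index, so use the device= form:

docker run --gpus device=0 --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm nvcr.io/nvidia/pytorch:23.02-py3

To expose only the odd-numbered cards of an eight-GPU machine to the container, pass a comma-separated device list. The extra pair of double quotes keeps Docker from splitting the value at the commas:

--gpus '"device=1,3,5,7"'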
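
Rather than typing the index list by hand, it can be generated. The sketch below builds the comma-separated list of odd-numbered indices for a machine with a given number of GPUs. The function name odd_gpu_indices is made up for illustration, and the device= wrapping in the usage note follows the device-list syntax in Docker's documentation for --gpus, not anything specific to this setup.

```shell
# Print the odd-numbered GPU indices below COUNT as a comma-separated list.
# Usage: odd_gpu_indices COUNT
odd_gpu_indices() {
  count="$1"
  list=""
  i=1
  while [ "$i" -lt "$count" ]; do
    # Append a comma only when the list already has an entry.
    list="${list:+$list,}$i"
    i=$((i + 2))
  done
  printf '%s\n' "$list"
}
```

The result can be passed to Docker as --gpus "\"device=$(odd_gpu_indices 8)\"", where the escaped inner quotes reach Docker as part of the value.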