本文经DCOS(公众号 ID: indagate)授权转载,转载请联系出处。
撰文 | 段全锋
编辑 | zouyee
段全锋: 软件工程师,熟悉K8s架构、精通Runtime底层技术细节等。
目前我司现网的K8s集群的运行时已经完成从docker到Containerd的切换,有小伙伴对K8s与Containerd调用链涉及的组件不了解,其中Containerd和RunC是什么关系,docker和containerd又有什么区别,以及K8s调用Containerd创建容器又是怎样的流程,最终RunC又是如何创建容器的,诸如此类的疑问。本文就针对K8s使用Containerd作为运行时的整个调用链进行介绍和源码级别的分析。
其中关于kubelet与运行时的分层架构图可以参看下图:
那么关于各类运行时的介绍可以参看Containerd深度剖析-runtime篇
一、“运行时简介”
容器运行时意思就是能够管理容器运行的整个生命周期,具体一点就是如何制作容器的镜像、容器镜像格式是什么样子的、管理容器的镜像、容器镜像的分发、如何运行一个容器以及管理创建的容器实例等等。
容器运行时有一个行业标准叫做OCI规范,这个规范分成两部分:
a. 容器运行时规范:描述了如何通过一个bundle运行容器,bundle就是一个目录,里面包括一个容器的规格文 件,文件叫config.json 和一个rootfs,rootfs中包含了一个容器运行时所需操作系统的文件。
b. 容器镜像规范:定义了容器的镜像如何打包如何将镜像转换成一个bundle。
的老爷子,考虑是新冠肺炎转成重症。虽然意识还比较清醒,但血气分析结果很差,需要住进ICU进行呼吸监护。从普通病房转运至ICU,坐电梯上几层楼就能到,但如果给氧条件不够,风险会非常大。他的妻女想见老爷子一面、说说话,怕以后再也见不到了。
目前流行将运行时分成low-level运行时和high-level运行时,low-level运行时专注于如何创建一个容器例如runc和kata,high-level包含了更多上层功能,比如镜像管理,以docker和containerd为代表。
K8s的kubelet是调用容器运行时创建容器的,但是容器运行时这么多不可能逐个兼容,K8s在对接容器运行时定义了CRI接口,容器运行时只需实现该接口就能被使用。下图分别是k8s使用docker和containerd的调用链,使用containerd时CRI接口是在containerd代码中实现的;使用docker时的CRI接口是在k8s的代码中实现的,叫做docker-shim(kubernetes/pkg/kubelet/dockershim/docker_service.go),这部分代码在k8s代码中是历史原因,当时docker是容器方面行业事实上的标准,但随着越来越多运行时实现了CRI支持,docker-shim的维护日益变成社区负担,在最新的K8s版本中,该部分代码目前已经移出,暂时由mirantis进行维护,下图是插件的发展历程。
二、“Containerd CRI简介”
Containerd是一个行业标准的容器运行时,它是一个daemon进程,可以管理主机上容器的全部生命周期和它的文件系统,包括:镜像的分发和存储、容器的运行和监控,底层的存储和网络。
Containerd有多种客户端,比如K8s、docker等,为了不同客户端的容器或者镜像能隔离开,Containerd中有namespace概念,默认情况下docker的namespace是moby,K8s的是k8s.io。
container在Containerd中代表的是一个容器的元数据,containerd中的Task用于获取容器对象并将它转换成在操作系统中可运行的进程,它代表的就是容器中可运行的对象。
Containerd内部的cri模块实现K8s的CRI接口,所以K8s的kubelet可以直接使用containerd。CRI的接口包括:RuntimeService 和 ImageService
// Runtime service defines the public APIs for remote container runtimes
service RuntimeService {
// Version returns the runtime name, runtime version, and runtime API version.
rpc Version(VersionRequest) returns (VersionResponse) {}
// RunPodSandbox creates and starts a pod-level sandbox. Runtimes must ensure
// the sandbox is in the ready state on success.
rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {}
// StopPodSandbox stops any running process that is part of the sandbox and
// reclaims network resources (e.g., IP addresses) allocated to the sandbox.
// If there are any running containers in the sandbox, they must be forcibly
// terminated.
// This call is idempotent, and must not return an error if all relevant
// resources have already been reclaimed. kubelet will call StopPodSandbox
// at least once before calling RemovePodSandbox. It will also attempt to
// reclaim resources eagerly, as soon as a sandbox is not needed. Hence,
// multiple StopPodSandbox calls are expected.
rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}
// RemovePodSandbox removes the sandbox. If there are any running containers
// in the sandbox, they must be forcibly terminated and removed.
// This call is idempotent, and must not return an error if the sandbox has
// already been removed.
rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse) {}
// PodSandboxStatus returns the status of the PodSandbox. If the PodSandbox is not
// present, returns an error.
rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {}
// ListPodSandbox returns a list of PodSandboxes.
rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {}
// CreateContainer creates a new container in specified PodSandbox
rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}
// StartContainer starts the container.
rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}
// StopContainer stops a running container with a grace period (i.e., timeout).
// This call is idempotent, and must not return an error if the container has
// already been stopped.
// The runtime must forcibly kill the container after the grace period is
// reached.
rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}
// RemoveContainer removes the container. If the container is running, the
// container must be forcibly removed.
// This call is idempotent, and must not return an error if the container has
// already been removed.
rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {}
// ListContainers lists all containers by filters.
rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {}
// ContainerStatus returns status of the container. If the container is not
// present, returns an error.
rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {}
// UpdateContainerResources updates ContainerConfig of the container synchronously.
// If runtime fails to transactionally update the requested resources, an error is returned.
rpc UpdateContainerResources(UpdateContainerResourcesRequest) returns (UpdateContainerResourcesResponse) {}
// ReopenContainerLog asks runtime to reopen the stdout/stderr log file
// for the container. This is often called after the log file has been
// rotated. If the container is not running, container runtime can choose
// to either create a new log file and return nil, or return an error.
// Once it returns error, new container log file MUST NOT be created.
rpc ReopenContainerLog(ReopenContainerLogRequest) returns (ReopenContainerLogResponse) {}
// ExecSync runs a command in a container synchronously.
rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse) {}
// Exec prepares a streaming endpoint to execute a command in the container.
rpc Exec(ExecRequest) returns (ExecResponse) {}
// Attach prepares a streaming endpoint to attach to a running container.
rpc Attach(AttachRequest) returns (AttachResponse) {}
// PortForward prepares a streaming endpoint to forward ports from a PodSandbox.
rpc PortForward(PortForwardRequest) returns (PortForwardResponse) {}
// ContainerStats returns stats of the container. If the container does not
// exist, the call returns an error.
rpc ContainerStats(ContainerStatsRequest) returns (ContainerStatsResponse) {}
// ListContainerStats returns stats of all running containers.
rpc ListContainerStats(ListContainerStatsRequest) returns (ListContainerStatsResponse) {}
// PodSandboxStats returns stats of the pod sandbox. If the pod sandbox does not
// exist, the call returns an error.
rpc PodSandboxStats(PodSandboxStatsRequest) returns (PodSandboxStatsResponse) {}
// ListPodSandboxStats returns stats of the pod sandboxes matching a filter.
rpc ListPodSandboxStats(ListPodSandboxStatsRequest) returns (ListPodSandboxStatsResponse) {}
// UpdateRuntimeConfig updates the runtime configuration based on the given request.
rpc UpdateRuntimeConfig(UpdateRuntimeConfigRequest) returns (UpdateRuntimeConfigResponse) {}
// Status returns the status of the runtime.
rpc Status(StatusRequest) returns (StatusResponse) {}
}
// ImageService defines the public APIs for managing images.
service ImageService {
// ListImages lists existing images.
rpc ListImages(ListImagesRequest) returns (ListImagesResponse) {}
// ImageStatus returns the status of the image. If the image is not
// present, returns a response with ImageStatusResponse.Image set to
// nil.
rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse) {}
// PullImage pulls an image with authentication config.
rpc PullImage(PullImageRequest) returns (PullImageResponse) {}
// RemoveImage removes the image.
// This call is idempotent, and must not return an error if the image has
// already been removed.
rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse) {}
// ImageFSInfo returns information of the filesystem that is used to store images.
rpc ImageFsInfo(ImageFsInfoRequest) returns (ImageFsInfoResponse) {}
}
kubelet调用CRI接口创建一个包含A和B两个业务container的Pod流程如下所示:
① 为Pod创建sandbox
② 创建container A
③ 启动container A
④ 创建container B
⑤ 启动container B
三、“Containerd CRI实现”
RunPodSandbox
RunPodSandbox的流程如下:
① 拉取sandbox的镜像,在containerd中配置
② 获取创建pod要使用的runtime,可以在创建pod的yaml中指定,如果没指定使用containerd中默认的
(runtime在containerd中配置)
③ 如果pod不是hostNetwork那么添加创建新net namespace,并使用cni插件设置网络(criService在初始化时会加载containerd中cri指定的插件信息)
④ 调用containerd客户端创建一个container
⑤ 在rootDir/io.containerd.grpc.v1.cri/sandboxes下为当前pod以pod Id为名创建一个目录
(pkg/cri/cri.go)
⑥ 根据选择的runtime为sandbox容器创建task
⑦ 启动sandbox容器的task,将sandbox添加到数据库中
代码在containerd/pkg/cri/server/sanbox_run.go 中
// RunPodSandbox creates and starts a pod-level sandbox. Runtimes should ensure
// the sandbox is in ready state.
func (c *criService) RunPodSandbox(ctx context.Context, r *runtime.RunPodSandboxRequest) (_ *runtime.RunPodSandboxResponse, retErr error) {
config := r.GetConfig()
log.G(ctx).Debugf("Sandbox config %+v", config)
// Generate unique id and name for the sandbox and reserve the name.
id := util.GenerateID()
metadata := config.GetMetadata()
if metadata == nil {
return nil, errors.New("sandbox config must include metadata")
}
name := makeSandboxName(metadata)
log.G(ctx).WithField("podsandboxid", id).Debugf("generated id for sandbox name %q", name)
// cleanupErr records the last error returned by the critical cleanup operations in deferred functions,
// like CNI teardown and stopping the running sandbox task.
// If cleanup is not completed for some reason, the CRI-plugin will leave the sandbox
// in a not-ready state, which can later be cleaned up by the next execution of the kubelet's syncPod workflow.
var cleanupErr error
// Reserve the sandbox name to avoid concurrent `RunPodSandbox` request starting the
// same sandbox.
if err := c.sandboxNameIndex.Reserve(name, id); err != nil {
return nil, fmt.Errorf("failed to reserve sandbox name %q: %w", name, err)
}
defer func() {
// Release the name if the function returns with an error.
// When cleanupErr != nil, the name will be cleaned in sandbox_remove.
if retErr != nil && cleanupErr == nil {
c.sandboxNameIndex.ReleaseByName(name)
}
}()
var (
err error
sandboxInfo = sb.Sandbox{ID: id}
)
ociRuntime, err := c.getSandboxRuntime(config, r.GetRuntimeHandler())
if err != nil {
return nil, fmt.Errorf("unable to get OCI runtime for sandbox %q: %w", id, err)
}
sandboxInfo.Runtime.Name = ociRuntime.Type
// Retrieve runtime options
runtimeOpts, err := generateRuntimeOptions(ociRuntime, c.config)
if err != nil {
return nil, fmt.Errorf("failed to generate sandbox runtime options: %w", err)
}
...
// Create initial internal sandbox object.
sandbox := sandboxstore.NewSandbox(
...
)
if _, err := c.client.SandboxStore().Create(ctx, sandboxInfo); err != nil {
return nil, fmt.Errorf("failed to save sandbox metadata: %w", err)
}
...
// Setup the network namespace if host networking wasn't requested.
if !hostNetwork(config) {
netStart := time.Now()
// If it is not in host network namespace then create a namespace and set the sandbox
// handle. NetNSPath in sandbox metadata and NetNS is non empty only for non host network
// namespaces. If the pod is in host network namespace then both are empty and should not
// be used.
var netnsMountDir = "/var/run/netns"
if c.config.NetNSMountsUnderStateDir {
netnsMountDir = filepath.Join(c.config.StateDir, "netns")
}
sandbox.NetNS, err = netns.NewNetNS(netnsMountDir)
if err != nil {
return nil, fmt.Errorf("failed to create network namespace for sandbox %q: %w", id, err)
}
// Update network namespace in the store, which is used to generate the container's spec
sandbox.NetNSPath = sandbox.NetNS.GetPath()
defer func() {
// Remove the network namespace only if all the resource cleanup is done
if retErr != nil && cleanupErr == nil {
if cleanupErr = sandbox.NetNS.Remove(); cleanupErr != nil {
log.G(ctx).WithError(cleanupErr).Errorf("Failed to remove network namespace %s for sandbox %q", sandbox.NetNSPath, id)
return
}
sandbox.NetNSPath = ""
}
}()
if err := sandboxInfo.AddExtension(podsandbox.MetadataKey, &sandbox.Metadata); err != nil {
return nil, fmt.Errorf("unable to save sandbox %q to store: %w", id, err)
}
// Save sandbox metadata to store
if sandboxInfo, err = c.client.SandboxStore().Update(ctx, sandboxInfo, "extensions"); err != nil {
return nil, fmt.Errorf("unable to update extensions for sandbox %q: %w", id, err)
}
// Define this defer to teardownPodNetwork prior to the setupPodNetwork function call.
// This is because in setupPodNetwork the resource is allocated even if it returns error, unlike other resource
// creation functions.
defer func() {
// Remove the network namespace only if all the resource cleanup is done.
if retErr != nil && cleanupErr == nil {
deferCtx, deferCancel := ctrdutil.DeferContext()
defer deferCancel()
// Teardown network if an error is returned.
if cleanupErr = c.teardownPodNetwork(deferCtx, sandbox); cleanupErr != nil {
log.G(ctx).WithError(cleanupErr).Errorf("Failed to destroy network for sandbox %q", id)
}
}
}()
// Setup network for sandbox.
// Certain VM based solutions like clear containers (Issue containerd/cri-containerd#524)
// rely on the assumption that CRI shim will not be querying the network namespace to check the
// network states such as IP.
// In future runtime implementation should avoid relying on CRI shim implementation details.
// In this case however caching the IP will add a subtle performance enhancement by avoiding
// calls to network namespace of the pod to query the IP of the veth interface on every
// SandboxStatus request.
if err := c.setupPodNetwork(ctx, &sandbox); err != nil {
return nil, fmt.Errorf("failed to setup network for sandbox %q: %w", id, err)
}
sandboxCreateNetworkTimer.UpdateSince(netStart)
}
if err := sandboxInfo.AddExtension(podsandbox.MetadataKey, &sandbox.Metadata); err != nil {
return nil, fmt.Errorf("unable to save sandbox %q to store: %w", id, err)
}
controller, err := c.getSandboxController(config, r.GetRuntimeHandler())
if err != nil {
return nil, fmt.Errorf("failed to get sandbox controller: %w", err)
}
// Save sandbox metadata to store
if sandboxInfo, err = c.client.SandboxStore().Update(ctx, sandboxInfo, "extensions"); err != nil {
return nil, fmt.Errorf("unable to update extensions for sandbox %q: %w", id, err)
}
runtimeStart := time.Now()
if err := controller.Create(ctx, id); err != nil {
return nil, fmt.Errorf("failed to create sandbox %q: %w", id, err)
}
resp, err := controller.Start(ctx, id)
if err != nil {
sandbox.Container, _ = c.client.LoadContainer(ctx, id)
if resp != nil && resp.SandboxID == "" && resp.Pid == 0 && resp.CreatedAt == nil && len(resp.Labels) == 0 {
// if resp is a non-nil zero-value, an error occurred during cleanup
cleanupErr = fmt.Errorf("failed to cleanup sandbox")
}
return nil, fmt.Errorf("failed to start sandbox %q: %w", id, err)
}
// TODO: get rid of this. sandbox object should no longer have Container field.
if ociRuntime.SandboxMode == string(criconfig.ModePodSandbox) {
container, err := c.client.LoadContainer(ctx, id)
if err != nil {
return nil, fmt.Errorf("failed to load container %q for sandbox: %w", id, err)
}
sandbox.Container = container
}
labels := resp.GetLabels()
if labels == nil {
labels = map[string]string{}
}
sandbox.ProcessLabel = labels["selinux_label"]
if err := sandbox.Status.Update(func(status sandboxstore.Status) (sandboxstore.Status, error) {
// Set the pod sandbox as ready after successfully start sandbox container.
status.Pid = resp.Pid
status.State = sandboxstore.StateReady
status.CreatedAt = protobuf.FromTimestamp(resp.CreatedAt)
return status, nil
}); err != nil {
return nil, fmt.Errorf("failed to update sandbox status: %w", err)
}
// Add sandbox into sandbox store in INIT state.
if err := c.sandboxStore.Add(sandbox); err != nil {
return nil, fmt.Errorf("failed to add sandbox %+v into store: %w", sandbox, err)
}
// Send CONTAINER_CREATED event with both ContainerId and SandboxId equal to SandboxId.
// Note that this has to be done after sandboxStore.Add() because we need to get
// SandboxStatus from the store and include it in the event.
c.generateAndSendContainerEvent(ctx, id, id, runtime.ContainerEventType_CONTAINER_CREATED_EVENT)
// start the monitor after adding sandbox into the store, this ensures
// that sandbox is in the store, when event monitor receives the TaskExit event.
//
// TaskOOM from containerd may come before sandbox is added to store,
// but we don't care about sandbox TaskOOM right now, so it is fine.
go func() {
resp, err := controller.Wait(ctrdutil.NamespacedContext(), id)
if err != nil {
log.G(ctx).WithError(err).Error("failed to wait for sandbox controller, skipping exit event")
return
}
e := &eventtypes.TaskExit{
ContainerID: id,
ID: id,
// Pid is not used
Pid: 0,
ExitStatus: resp.ExitStatus,
ExitedAt: resp.ExitedAt,
}
c.eventMonitor.backOff.enBackOff(id, e)
}()
// Send CONTAINER_STARTED event with ContainerId equal to SandboxId.
c.generateAndSendContainerEvent(ctx, id, id, runtime.ContainerEventType_CONTAINER_STARTED_EVENT)
sandboxRuntimeCreateTimer.WithValues(labels["oci_runtime_type"]).UpdateSince(runtimeStart)
return &runtime.RunPodSandboxResponse{PodSandboxId: id}, nil
}
CreateContainer
CreateContainer在指定的PodSandbox中创建一个新的container元数据,流程如下:
① 获取容器的sandbox信息
② 为容器要用的镜像初始化镜像handler
③ 为容器在rootDir/io.containerd.grpc.v1.cri目录下以容器Id命名的目录
④ 从sandbox中获取所使用的runtime
⑤ 为容器创建containerSpec
⑥ 使用containerd客户端创建container
⑦ 保存container的信息
代码见containerd/pkg/cri/server/container_create.go 下面是省略过的代码。
func (c *criService) CreateContainer(ctx context.Context, r
*runtime.CreateContainerRequest) (_ *runtime.CreateContainerResponse, retErr error) {
sandbox, err := c.sandboxStore.Get(r.GetPodSandboxId())
s, err := sandbox.Container.Task(ctx, nil)
sandboxPid := s.Pid()
image, err := c.localResolve(config.GetImage().GetImage())
if err != nil {
return nil, errors.Wrapf(err, "failed to resolve image %q", config.GetImage().GetImage())
}
containerdImage, err := c.toContainerdImage(ctx, image)
// Run container using the same runtime with sandbox.
sandboxInfo, err := sandbox.Container.Info(ctx)
if err != nil {
return nil, errors.Wrapf(err, "failed to get sandbox %q info", sandboxID)
}
// Create container root directory. containerRootDir := c.getContainerRootDir(id)
if err = c.os.MkdirAll(containerRootDir, 0755); err != nil {
return nil, errors.Wrapf(err, "failed to create container root directory %q", containerRootDir)
}
ociRuntime, err := c.getSandboxRuntime(sandboxConfig, sandbox.Metadata.RuntimeHandler)
if err != nil {
return nil, errors.Wrap(err, "failed to get sandbox runtime")
}
spec, err := c.containerSpec(id, sandboxID, sandboxPid, sandbox.NetNSPath, containerName, containerdImage.Name(), config, sandboxConfig,
&image.ImageSpec.Config, append(mounts, volumeMounts...), ociRuntime)
if err != nil {
return nil, errors.Wrapf(err, "failed to generate container %q spec", id)
}
opts = append(opts, containerd.WithSpec(spec, specOpts...),
containerd.WithRuntime(sandboxInfo.Runtime.Name, runtimeOptions), containerd.WithContainerLabels(containerLabels), containerd.WithContainerExtension(containerMetadataExtension, &meta))
var cntr containerd.Container
if cntr, err = c.client.NewContainer(ctx, id, opts...); err != nil {
return nil, errors.Wrap(err, "failed to create containerd container")
}
// Add container into container store.
if err := c.containerStore.Add(container); err != nil {
return nil, errors.Wrapf(err, "failed to add container %q into store", id)
}
}
StartContainer
StartContainer用于启动一个容器,流程如下:
① 读取保存的container元数据
② 读取关联的sandbox信息
③ 为容器创建task
④ 启动task
代码见containerd/pkg/cri/server/container_start.go ,下面是该部分省略过后的代码:
func (c *criService) StartContainer(ctx context.Context, r *runtime.StartContainerRequest) (retRes *runtime.StartContainerResponse, retErr error) {
cntr, err := c.containerStore.Get(r.GetContainerId())
// Get sandbox config from sandbox store. sandbox, err := c.sandboxStore.Get(meta.SandboxID) ctrInfo, err := container.Info(ctx)
if err != nil {
return nil, errors.Wrap(err, "failed to get container info")
}
taskOpts := c.taskOpts(ctrInfo.Runtime.Name)
task, err := container.NewTask(ctx, ioCreation, taskOpts...)
if err != nil {
return nil, errors.Wrap(err, "failed to create containerd task")
}
// wait is a long running background request, no timeout needed. exitCh, err := task.Wait(ctrdutil.NamespacedContext())
// Start containerd task.
if err := task.Start(ctx); err != nil {
return nil, errors.Wrapf(err, "failed to start containerd task %q", id)
}
}
创建task的代码如下,调用了containerd的客户端的TasksClient,向服务器端发送创建task的请求
func (c *container) NewTask(ctx context.Context, ioCreate cio.Creator, opts
...NewTaskOpts) (_ Task, err error) {
......
request := &tasks.CreateTaskRequest{
ContainerID: c.id,
Terminal: cfg.Terminal, Stdin: cfg.Stdin,
Stdout: cfg.Stdout,
Stderr: cfg.Stderr,
}
......
response, err := c.client.TaskService().Create(ctx, request)
......
task启动的代码如下,调用了containerd的客户端的TasksClient,向服务器端发送启动task的请求。
func (t *task) Start(ctx context.Context) error {
r, err := t.client.TaskService().Start(ctx, &tasks.StartRequest{
ContainerID: t.id,
})
if err != nil {
if t.io != nil {
t.io.Cancel()
t.io.Close()
}
return errdefs.FromGRPC(err)
}
t.pid = r.Pid
return nil
}
四、“Task Service”
Task Service创建task流程
下面是tasks-service处理创建task请求的代码,根据容器运行时创建容器。
func (l *local) Create(ctx context.Context, r *api.CreateTaskRequest, _
...grpc.CallOption) (*api.CreateTaskResponse, error) { container, err := l.getContainer(ctx, r.ContainerID)
......
rtime, err := l.getRuntime(container.Runtime.Name)
if err != nil {
return nil, err
}
_, err = rtime.Get(ctx, r.ContainerID)
if err != nil && err != runtime.ErrTaskNotExists {
return nil, errdefs.ToGRPC(err)
}
if err == nil {
return nil, errdefs.ToGRPC(fmt.Errorf("task %s already exists", r.ContainerID))
}
c, err := rtime.Create(ctx, r.ContainerID, opts)
......
return &api.CreateTaskResponse{
ContainerID: r.ContainerID, Pid: c.PID(),
},
runtime创建容器代码如下,启动了shim并向shim发送创建请求。
func (m *TaskManager) Create(ctx context.Context, id string, opts runtime.CreateOpts) (_ runtime.Task, retErr error) {
......
shim, err := m.startShim(ctx, bundle, id, opts)
t, err := shim.Create(ctx, opts)
.....
}
startShim调用shim可执行文件启动了一个service,代码如下:
func (m *TaskManager) startShim(ctx context.Context, bundle *Bundle, id string, opts runtime.CreateOpts) (*shim, error) {
......
b := shimBinary(ctx, bundle, opts.Runtime, m.containerdAddress, m.containerdTTRPCAddress, m.events, m.tasks)
shim, err := b.Start(ctx, topts, func() {
log.G(ctx).WithField("id", id).Info("shim disconnected")
cleanupAfterDeadShim(context.Background(), id, ns, m.tasks, m.events, b) m.tasks.Delete(ctx, id)
})
......
执行shim命令所使用的可执行文件是containerd-shim-<runtime>-<version> ,比如我们平时使用的运行时类型是io.containerd.runc.v2 ,那么所用的可执行文件就是containerd-shim-runc-v2 ,完整的命令格式是
containerd-shim-runc-v2 -namespace xxxx -address xxxx -publish-binary xxxx -id xxxx start
func (b *binary) Start(ctx context.Context, opts *types.Any, onClose func()) (_ *shim, err error) {
args := []string{"id", b.bundle.ID}
args = append(args, "start")
cmd, err :=
client.Command(ctx,
b.runtime,
b.containerdAddress,
b.containerdTTRPCAddress,
b.bundle.Path,
opts,
args...,
)
......
out, err := cmd.CombinedOutput()
if err != nil {
return nil, errors.Wrapf(err, "%s", out)
}
address := strings.TrimSpace(string(out))
conn, err := client.Connect(address, client.AnonDialer)
if err != nil {
return nil, err
}
onCloseWithShimLog := func() {
onClose()
cancelShimLog()
f.Close()
}
client := ttrpc.NewClient(conn, ttrpc.WithOnClose(onCloseWithShimLog))
return &shim{
bundle: b.bundle,
client: client,
},
Task Service启动task流程
下面是tasks-service启动一个task的流程:
func (l *local) Start(ctx context.Context, r *api.StartRequest, _ ...grpc.CallOption) (*api.StartResponse, error) {
t, err := l.getTask(ctx, r.ContainerID)
if err != nil {
return nil, err
}
p := runtime.Process(t)
if r.ExecID != "" {
if p, err = t.Process(ctx, r.ExecID); err != nil {
return nil, errdefs.ToGRPC(err)
}
}
if err := p.Start(ctx); err != nil {
return nil, errdefs.ToGRPC(err)
}
state, err := p.State(ctx)
if err != nil {
return nil, errdefs.ToGRPC(err)
}
return &api.StartResponse{
Pid: state.Pid,
}, nil
}
启动容器的进程通过向shim的server端发送请求完成。
func (s *shim) Start(ctx context.Context) error {
response, err := s.task.Start(ctx,
&task.StartRequest{
ID: s.ID(),
})
if err != nil {
return errdefs.FromGRPC(err)
}
s.taskPid = int(response.Pid)
return nil
五、“Containerd-shim启动流程”
containerd/runtime/v2/shim/shim.go 中
RunManager(ctx context.Context, manager Manager, opts ...BinaryOpts)
是containerd-shim-runc-v2 start 的代码入口:
case "start":
opts := StartOpts{
Address: addressFlag,
TTRPCAddress: ttrpcAddress,
Debug: debugFlag,
}
address, err := manager.Start(ctx, id, opts)
if err != nil {
return err
}
if _, err := os.Stdout.WriteString(address); err != nil {
return err
}
return nil
}
containerd-shim-runc-v2 start进程会再次创建一个containerd-shim-runc-v2 -namespace xxxx -id xxxx - address xxxx 的进程用于启动shim server。
func (manager) Start(ctx context.Context, id string, opts shim.StartOpts) (_ string, retErr error) {
cmd, err := newCommand(ctx, id, opts.Address, opts.TTRPCAddress, opts.Debug)
...
// make sure that reexec shim-v2 binary use the value if need
if err := shim.WriteAddress("address", address); err != nil {
return "", err
}
...
if err := cmd.Start(); err != nil {
f.Close()
return "", err
}
...
// make sure to wait after start
go cmd.Wait()
...
server, err := newServer(ttrpc.WithUnaryServerInterceptor(unaryInterceptor))
if err != nil {
return fmt.Errorf("failed creating server: %w", err)
}
for _, srv := range ttrpcServices {
if err := srv.RegisterTTRPC(server); err != nil {
return fmt.Errorf("failed to register service: %w", err)
}
}
if err := serve(ctx, server, signals, sd.Shutdown); err != nil {
if err != shutdown.ErrShutdown {
return err
}
}
}
shim server是个ttrpc服务,提供如下接口:
type TaskService interface {
State(context.Context, *StateRequest) (*StateResponse, error)
Create(context.Context, *CreateTaskRequest) (*CreateTaskResponse, error)
Start(context.Context, *StartRequest) (*StartResponse, error)
Delete(context.Context, *DeleteRequest) (*DeleteResponse, error)
Pids(context.Context, *PidsRequest) (*PidsResponse, error)
Pause(context.Context, *PauseRequest) (*emptypb.Empty, error)
Resume(context.Context, *ResumeRequest) (*emptypb.Empty, error)
Checkpoint(context.Context, *CheckpointTaskRequest) (*emptypb.Empty, error)
Kill(context.Context, *KillRequest) (*emptypb.Empty, error)
Exec(context.Context, *ExecProcessRequest) (*emptypb.Empty, error)
ResizePty(context.Context, *ResizePtyRequest) (*emptypb.Empty, error)
CloseIO(context.Context, *CloseIORequest) (*emptypb.Empty, error)
Update(context.Context, *UpdateTaskRequest) (*emptypb.Empty, error)
Wait(context.Context, *WaitRequest) (*WaitResponse, error)
Stats(context.Context, *StatsRequest) (*StatsResponse, error)
Connect(context.Context, *ConnectRequest) (*ConnectResponse, error)
Shutdown(context.Context, *ShutdownRequest) (*emptypb.Empty, error)
}
创建task是执行了runc create --bundle xxxx xxxx 命令,参考代码:
func (r *Runc) Create(context context.Context, id, bundle string, opts *CreateOpts) error
{
args := []string{"create", "bundle", bundle}
......
cmd := r.command(context, append(args, id)...)
......
ec, err := Monitor.Start(cmd)
......
}
启动task是执行了runc start xxxx 命令,参考代码:
func (r *Runc) Start(context context.Context, id string) error {
return r.runOrError(r.command(context, "start", id))
}
小结
kubelet创建sandbox流程总结如下:
① containerd的cri模块创建sandbox元数据并保存
② containerd的cri模块创建sandbox容器并保存
③ containerd的cri模块通过grpc调用tasks-service创建task
④ tasks-service模块创建containerd-shim-xxxx-xxxx start 进程
⑤ containerd-shim-xxxx-xxxx start 进程创建containerd-shim- xxxx-xxxx 进程并退出
⑥ containerd-shim-xxxx-xxxx 进程启动shim server,提供ttrpc服务
⑦ tasks-service模块调用shim server的Create接口,创建task,shim server 执行runc create 命令
⑧ containerd的cri模块通过grpc调用tasks-service启动task
⑨ tasks-service模块调用shimserver的Start接口,启动task,shim server 执行runc start命令
本文转载自微信公众号「 DCOS」,作者「DCOS」,可以通过以下二维码关注。
转载本文请联系「DCOS」公众号。