With its efficient, stable, and responsive features, cloud-native has become a key driver of digital innovation in enterprises. At the same time, security risks are also increasing in cloud-native environments, prompting enterprises to seek appropriate architecture design solutions.
In this article, we invited Mr. Bai Liming, technology director of Dosec.cn, to present some best practices for building cloud-native security architectures based on the company's expertise and experience.
1. Development of cloud-native
The concept of cloud-native was first introduced in 2013 by Pivotal, a company recognized for its multi-cloud application platform Cloud Foundry. Two years later, Matt Stine, Pivotal's technical product manager, defined the five principles of cloud-native architecture in his book "Migrating to Cloud-Native Application Architectures":
· Compliance with the 12-factor app methodology;
· Microservice-oriented architecture;
· Self-service agile infrastructure;
· API-based collaboration;
· Antifragility.
According to the CNCF Cloud Native Definition v1.0, which was approved on June 11, 2018, cloud-native should have the following characteristics:
· Containers;
· Service meshes;
· Microservices;
· Immutable infrastructure;
· Declarative APIs.
Applications that exhibit all five of the characteristics above are considered cloud-native.
Throughout the evolution of cloud-native, containerization has further simplified the capabilities and features required of the operating system. Cloud-native operating systems were developed to meet the immutable infrastructure requirement. They feature a streamlined kernel, retain only container-related dependency libraries, and use a container client as the package manager.
In cloud-native operating systems, all processes must run in containers. As no application can be installed on the host OS, the OS becomes completely immutable. This is what is known as immutable infrastructure, and it is expected to be the future of OS development.
In the past, applications were run on physical machines, but as the infrastructure evolved, they moved to virtual machines and later to containers. In the era of cloud computing, serverless architecture seems to be the newest fad.
A physical machine's lifecycle is typically measured in years, and the machine is retired after anywhere from one to five years. For virtual machines, the unit of measurement is the month.
With the advent of containerization, each update requires building a new container; as a result, container lifecycles are measured in days. As serverless computing progresses, function lifecycles will be measured in minutes.
The emergence of containerization accelerated the process of standardizing containers. Containers and DevOps complement each other, and application container platforms should follow a DevOps development model to speed up the release process. Generally, containerization promotes DevOps, and containers rely on DevOps to speed up iteration.
In cloud-native, where the container is the basic unit, services represent the network boundary. Cloud-native has no concept of fixed IP addresses: addresses are dynamic, so they cannot be configured on conventional firewalls. Container services are updated every day, their IP addresses change accordingly, and the original network policies are no longer valid.
In the era of physical machines, deploying physical devices was more difficult, so running several applications on one physical machine was common. With virtual machines, each service was usually given its own virtual machine to improve service availability. Today, service interfaces increasingly depend on microservices, so applications must be adapted to microservice architectures.
Take Weibo (a Chinese microblogging site similar to Twitter) as an example: when a hot event occurs, physical and virtual machines require a build period measured in hours before the business recovers. In a containerized scenario, containers start in seconds, whereas physical and virtual machines start up much more slowly. Since Weibo adopted a container architecture, hot events have rarely caused downtime. This can also be attributed to the self-healing and dynamic scaling capabilities of the K8S platform.
Docker was commonly equated with containers during the early days of container runtimes. Similar to containers, which have four modules, Docker includes four interfaces. Docker, however, is a complete development kit, whereas K8S only uses the runtime. Therefore, to improve operational efficiency, K8S deprecated dockershim in version 1.20 (it was removed in version 1.24) and moved toward calling runtimes such as Containerd directly.
However, neither Containerd nor Docker provides comprehensive security features. CRI-O can meet relative security needs, and it has no daemon. Each CRI-O process consists of a parent and a child process and can run as a service. The next aspect of containers to consider is the security of the underlying infrastructure, including the containerization of security technology itself.
2. Risks associated with cloud-native
A cloud-native architecture needs to address five main security concerns:
· Image security
· Image repo security
· Cluster component security
· Container network risks
· Microservices risks
The risks associated with image security are by far the most extensive. Unlike infrastructure security, cloud-native focuses more on performance optimization and infrastructure containerization. At the moment, 51% of DockerHub images have high-risk vulnerabilities, while 80% have low- to medium-risk vulnerabilities. It is common for enterprises to download images from DockerHub.
As for image repositories, enterprises cannot upload all of their R&D and business images to a public repository and must instead store them, along with the source code, in their own repository. However, enterprise repositories can also contain vulnerabilities that hackers may exploit to replace images in the repository. The image a node actually pulls may then be one planted by a hacker and carrying a Trojan horse.
Cluster components such as Docker, K8s, OpenShift, and CRI-O have known vulnerabilities, and 45 vulnerabilities have been found in other container runtimes such as Containerd and Kata Containers. Vulnerabilities associated with cluster components are relatively few, but they do exist.
A hacker who exploits these vulnerabilities will also have access to other containers within the cluster. Physical firewalls can only block traffic coming from outside the cluster; attacks that originate inside it, for example over K8S overlay and underlay networks, are not covered, which poses an internal network risk to clusters.
The vulnerability of business images can also lead to a second problem: the vulnerability of the components built into the image. If a developer uses a vulnerable API or development framework, this type of security problem arises when the developer packages the components into an image. The widely felt Spring Framework 0-day, for example, was an infrastructure vulnerability that affected approximately 90% of Chinese Internet enterprises. R&D is typically responsible for introducing this type of microservices risk.
3. Design of a cloud-native security architecture
In the past, infrastructures were primarily protected by firewalls and physical security measures. In the container computing environment, container runtime security and image security require dedicated protection. Container security also involves the discovery of microservices and the protection of serverless applications.
A cloud-native scenario requires the R&D security system to be integrated, which differs from a traditional security system. Research and development personnel should be involved in the security design process, and they should always pay attention to the cloud-native data security in R&D and the permissions related to security management.
As part of Dosec.cn's container security solution, there are many built-in and machine behavior learning policies, as well as other disposal policies and events.
Auditing orchestration files is one of the features. It can read all the existing Dockerfiles, YAML files, and orchestration files directly from the developer's code repository. By parsing the syntax of a Dockerfile, it can detect errors in its commands.
If an issue is discovered during the audit, it is reported to the R&D team and image building is blocked. Once the problem has been fixed and the audit passes, the image is built. Next, the image is reversed into a Dockerfile and the two are compared. A warning is issued if any tampering with the Dockerfile is detected.
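As a rough illustration of what a syntax-level Dockerfile audit can look like, the following Python sketch checks three things: unknown instructions, instructions placed before the first FROM, and base images that are not pinned to a tag or digest. The function name audit_dockerfile and the rule set are assumptions made for this example; they are not taken from Dosec.cn's product, which is described as having a much larger rule base.

```python
# Minimal Dockerfile audit sketch. The rules below are illustrative assumptions.
KNOWN_INSTRUCTIONS = {
    "FROM", "RUN", "CMD", "LABEL", "EXPOSE", "ENV", "ADD", "COPY",
    "ENTRYPOINT", "VOLUME", "USER", "WORKDIR", "ARG", "ONBUILD",
    "STOPSIGNAL", "HEALTHCHECK", "SHELL", "MAINTAINER",
}


def audit_dockerfile(text):
    """Return a list of (line_number, message) findings."""
    findings = []
    seen_from = False
    continued = False  # True while inside a backslash-continued instruction
    stages = set()     # names declared with "FROM image AS name"
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if continued:
            continued = line.endswith("\\")
            continue
        continued = line.endswith("\\")
        parts = line.split(None, 1)
        keyword = parts[0]
        args = parts[1] if len(parts) > 1 else ""
        instruction = keyword.upper()
        if instruction not in KNOWN_INSTRUCTIONS:
            findings.append((number, f"unknown instruction {keyword!r}"))
        elif instruction == "FROM":
            seen_from = True
            # Skip flags such as --platform=...
            tokens = [t for t in args.split() if not t.startswith("--")]
            image = tokens[0] if tokens else ""
            name = image.rsplit("/", 1)[-1]
            if not image:
                findings.append((number, "FROM has no image"))
            elif (image != "scratch" and image not in stages
                  and "@" not in image
                  and (":" not in name or name.endswith(":latest"))):
                findings.append((number, f"base image {image!r} is not pinned"))
            if len(tokens) == 3 and tokens[1].upper() == "AS":
                stages.add(tokens[2])
        elif not seen_from and instruction != "ARG":
            findings.append((number, f"{instruction} appears before the first FROM"))
    return findings
```

A build pipeline could call audit_dockerfile on each Dockerfile in the repository and refuse to build when the returned list is not empty. Heredoc syntax and parser directives are not handled in this sketch.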
Moreover, the container business running on the image will also be reversed in order to check whether the image on which the container depends is correct and whether the process running in the image matches the process packaged in the Dockerfile. An alert will be raised if there is an inconsistency found, reporting that the business may be at risk.
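The consistency check described above can be sketched as a comparison between what was recorded at build time and what is observed at runtime. In this Python sketch, BuildRecord, RunningContainer, and check_container are hypothetical names, and the digests and process paths in the usage note are made-up sample values; how a real platform collects them is outside the scope of the example.

```python
from dataclasses import dataclass


@dataclass
class BuildRecord:
    """What the audited build declared (hypothetical structure)."""
    image_digest: str
    declared_processes: frozenset


@dataclass
class RunningContainer:
    """What is observed on the node (hypothetical structure)."""
    name: str
    image_digest: str
    processes: list


def check_container(container, record):
    """Return alert strings for any mismatch between runtime and build."""
    alerts = []
    if container.image_digest != record.image_digest:
        alerts.append(f"{container.name}: image digest differs from the audited build")
    unexpected = sorted(set(container.processes) - set(record.declared_processes))
    for process in unexpected:
        alerts.append(f"{container.name}: unexpected process {process}")
    return alerts
```

For example, a container built to run only /usr/sbin/nginx that is also found running /tmp/miner would produce one alert for the extra process, and a second alert if its image digest no longer matches the audited build.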
The cloud-native approach is immutable, and the underlying OS and image are also included in the immutable infrastructure, so the image is also immutable. An image is built according to the Dockerfile, and the running containers are associated with the image.
Another feature is the ability to read YAML files directly from the code repository and to check the permissions they grant. A warning is raised if a YAML file contains deprecated or incorrect syntax, high-risk commands, or other dangerous parameters. The purpose is to link the security, O&M, and R&D teams. A cloud-native security strategy must be developed jointly by the operations team, developers, and security personnel, and should never be solely the responsibility of the security department.
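To make the idea of flagging dangerous parameters concrete, here is a small Python sketch that audits a Kubernetes manifest after it has been parsed into a dictionary (the YAML parsing step is left out to keep the example free of third-party dependencies). The function name audit_manifest and the particular checks are assumptions chosen for illustration; the fields it inspects (hostNetwork, hostPID, hostIPC, hostPath volumes, securityContext.privileged) are standard Kubernetes pod spec fields.

```python
# API versions that Kubernetes has removed for workload resources.
DEPRECATED_API_VERSIONS = {"extensions/v1beta1", "apps/v1beta1", "apps/v1beta2"}


def audit_manifest(manifest):
    """Return a list of findings for an already-parsed Kubernetes manifest."""
    findings = []
    api_version = manifest.get("apiVersion")
    if api_version in DEPRECATED_API_VERSIONS:
        findings.append(f"deprecated apiVersion {api_version!r}")

    spec = manifest.get("spec", {})
    # Workload kinds nest the pod spec under spec.template.spec; bare Pods do not.
    pod_spec = spec.get("template", {}).get("spec", spec)

    for flag in ("hostNetwork", "hostPID", "hostIPC"):
        if pod_spec.get(flag) is True:
            findings.append(f"{flag} is enabled")

    for volume in pod_spec.get("volumes", []):
        if "hostPath" in volume:
            findings.append(f"volume {volume.get('name')!r} mounts a hostPath")

    for container in pod_spec.get("containers", []):
        context = container.get("securityContext", {})
        if context.get("privileged") is True:
            findings.append(f"container {container.get('name')!r} runs privileged")
    return findings
```

Running such a check in the code repository, before deployment, is what lets the findings reach R&D and O&M at the same time as the security team.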
A range of open-source image component scanning tools is available on the market. Dosec.cn's Jingjie Container Security Platform is currently available in both open-source and commercial editions; the main differences are the custom rules and the vulnerability library. The open-source edition's vulnerability library is based on open-source CVE data, while the commercial edition also supports the Chinese vulnerability database CNNVD. Access to CNNVD requires a cooperation agreement, so ordinary open-source vendors may not be able to obtain it. This is one of the key differences between the open-source and commercial editions.
Some custom features are available only in the commercial edition, such as trusted image, base image identification, and host image scanning. There are always security risks associated with image repositories, and we need to scan image repositories for vulnerabilities to build security capabilities within the enterprise.
Furthermore, Dosec.cn has been involved with Harbor for its vulnerabilities, so it has some advantages.
Components of the cluster are also at risk. To find the cluster components at risk, it is necessary to inventory the cluster's components and compare their versions against the affected versions listed in the vulnerability database. Version matching does not work for API interfaces and permission vulnerabilities, however, so POC tests are required to determine the risks associated with all cluster components.
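The version-matching step can be sketched in Python as follows. Advisory, parse_version, and match_advisories are hypothetical names, and the advisory IDs and version ranges in the usage note are placeholders rather than real vulnerability data. The sketch assumes plain dotted numeric versions; pre-release suffixes are not handled.

```python
from dataclasses import dataclass


def parse_version(text):
    """Turn '1.20.4' or 'v1.20.4' into a comparable tuple of integers."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


@dataclass(frozen=True)
class Advisory:
    advisory_id: str  # placeholder identifier, not a real CVE
    component: str
    introduced: str   # first affected version, inclusive
    fixed: str        # first fixed version, exclusive


def match_advisories(inventory, advisories):
    """Return IDs of advisories whose affected range covers an installed version.

    inventory maps component name to installed version string.
    """
    hits = []
    for advisory in advisories:
        installed = inventory.get(advisory.component)
        if installed is None:
            continue
        low = parse_version(advisory.introduced)
        high = parse_version(advisory.fixed)
        if low <= parse_version(installed) < high:
            hits.append(advisory.advisory_id)
    return hits
```

For instance, with an inventory of {"kubelet": "v1.20.4"} and a placeholder advisory covering kubelet from 1.20.0 up to but not including 1.20.7, the advisory is reported. Issues that do not depend on version, such as an exposed API or an overly broad permission, would not be caught this way, which is why the text calls for POC tests.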
Scanning each component's configuration in the cluster also makes it possible to check the permissions that configuration grants. In early versions of K8S, authentication was not enabled by default, whereas current versions default to HTTPS.
Moreover, settings such as whether audit logs are turned on need to be configured according to the cluster's security requirements and scanned against compliance check baselines.
With cloud-native microservices, the service split will lead to exponential growth in scale, which requires automatic discovery of microservices by security software and identification of the types of services, allowing automatic vulnerability scanning. This method is very labor-saving.
Two methods can be used to detect in-container security issues at runtime. The first is learning and standardizing all the behaviors of containers. Reads and writes on container files, process start-ups and shutdowns, and access calls are captured and recorded in a behavior model. All behavior recorded in the model is then considered normal, while any behavior outside it is treated as an exception.
Learning takes time, however, and if attacks occur or malicious code executes during the learning process, the results will be biased. The second method is therefore a built-in policy based on attack models, which excludes behaviors that violate it. Combined with machine learning, it protects against zero-day attacks while preventing attacks from being learned during the learning process. With blacklist policies integrated into the system, runtime security detection forms a closed loop. This seems to be the best practice for container runtime security at the moment.
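The combination of a learned baseline with built-in deny rules can be sketched in a few lines of Python. BehaviorModel and the sample rule netcat_exec are hypothetical names invented for this illustration; events are modeled as simple (kind, target) tuples, which is an assumption about the data shape rather than a description of any real product.

```python
class BehaviorModel:
    """Learned allowlist of container events, guarded by built-in deny rules."""

    def __init__(self, deny_rules):
        self.deny_rules = list(deny_rules)  # active in every phase
        self.baseline = set()
        self.learning = True

    def observe(self, event):
        """Classify an event as 'deny', 'learned', 'allow', or 'anomaly'."""
        if any(rule(event) for rule in self.deny_rules):
            # Denied events are never added to the baseline, so an attack
            # during the learning phase cannot bias the model.
            return "deny"
        if self.learning:
            self.baseline.add(event)
            return "learned"
        return "allow" if event in self.baseline else "anomaly"

    def finish_learning(self):
        self.learning = False


def netcat_exec(event):
    """Sample deny rule (illustrative only): executing nc or ncat."""
    kind, target = event
    return kind == "exec" and target.rsplit("/", 1)[-1] in {"nc", "ncat"}
```

During learning, ("exec", "/usr/sbin/nginx") is recorded while ("exec", "/bin/nc") is denied and left out of the baseline. After finish_learning is called, previously unseen events such as ("write", "/etc/passwd") are reported as anomalies, which is how behavior outside the model surfaces even when no signature exists for it.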
Microsegmentation in cloud-native is required to achieve the following features: First, it must enable visualization of access relationships. Inherently, cloud-native segmentation meets the zero trust requirement. K8s does not have an IP concept and relies solely on Labels. These labels are tagged by the R&D and business teams, who will utilize them to implement microsegmentation dynamically. Thus, it is necessary to automatically generate and rehearse the container's policy based on the learned access relationships.
When the policy learning is complete and confirmed, it enters rehearsal mode, where the rehearsal time can be set. Normal traffic is not blocked during this period. If traffic is found to be affected by the policy, a warning is raised. In this case, the company's R&D or business team can make a judgment in person, and if the business traffic is safe, the machine behavior learning model is edited to exclude it.
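Because policies are expressed against Labels rather than IP addresses, they keep working when containers are rescheduled and their addresses change. The Python sketch below shows the matching logic under a default-deny assumption; AllowRule, selects, and is_allowed are hypothetical names, and the labels in the usage note are sample values.

```python
from dataclasses import dataclass


@dataclass
class AllowRule:
    source: dict       # label selector for the calling workload
    destination: dict  # label selector for the called workload
    port: int


def selects(selector, labels):
    """A selector matches when every key/value pair is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def is_allowed(rules, source_labels, destination_labels, port):
    """Default deny: traffic passes only if some rule matches both ends and the port."""
    return any(
        selects(rule.source, source_labels)
        and selects(rule.destination, destination_labels)
        and rule.port == port
        for rule in rules
    )
```

With a single rule allowing workloads labeled app=web to reach workloads labeled app=db on port 5432, a web pod is allowed through regardless of its current IP address, while a pod labeled app=batch is denied.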
If no more exceptions are found after a certain period, the trained policy will not affect regular traffic patterns and can effectively defend against attacks. By clicking policy execution, the automatic policy can now be applied to the production environment without affecting it.
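The rehearsal workflow can be sketched as a policy object with two modes. In rehearsal mode, traffic outside the learned policy is forwarded but logged as a warning; once the policy is switched to enforcement, the same traffic is blocked. SegmentationPolicy and its method names are assumptions made for this Python sketch, and flows are modeled as (source, destination, port) tuples.

```python
class SegmentationPolicy:
    """Learned flow policy that starts in rehearsal (dry-run) mode."""

    def __init__(self, allowed_flows):
        self.allowed_flows = set(allowed_flows)
        self.mode = "rehearsal"
        self.warnings = []

    def handle(self, flow):
        """Return True if the flow is forwarded, False if it is blocked."""
        if flow in self.allowed_flows:
            return True
        if self.mode == "rehearsal":
            # Would be blocked under enforcement: warn but do not interrupt traffic.
            self.warnings.append(flow)
            return True
        return False

    def approve(self, flow):
        """A reviewer confirms the flow is legitimate business traffic."""
        self.allowed_flows.add(flow)
        self.warnings = [w for w in self.warnings if w != flow]

    def enforce(self):
        """Switch from rehearsal to enforcement."""
        self.mode = "enforce"
```

A reviewer would inspect the warnings list during the rehearsal window, call approve for flows that turn out to be legitimate, and call enforce once no new warnings appear.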
Lastly, in cloud-native environments, the security of its own software platform must comply with the three-layer architecture: first, there is the management layer, which must be decoupled from the task center so that all clusters are convergent.
If the image repository contains too much data, the scanner can be integrated directly with the repository. Instead of relying on network bandwidth to pull each image, it can scan images directly by reading the storage path. In this manner, network utilization as well as disk I/O usage can be significantly reduced. Currently, this is the most influential architecture design for container security.
4. Best practices in cloud-native security
There are three main components of DevSecOps design in cloud-native environments. First, there is the construction phase. Dosec.cn provides a golden image repository where all the images are reinforced. R&D personnel can directly pull and build business images from the golden repository.
Having cooperated with CNNVD, Dosec.cn's vulnerability library will be updated directly following synchronization. Additionally, Dosec.cn will maintain its golden image repository in real-time according to the daily vulnerability updates. Moreover, Dosec.cn has its own scanner and security researchers investigating the latest vulnerabilities and zero-day attacks.
The recommendation for enterprises is to maintain two image repositories and set trust judgments for the production image repositories in the cluster. Thus, hackers are prevented from entering the clusters and pulling down business containers directly.
Image scanning is used for business development to scan the configuration of the application layer, and if a vulnerability is discovered, it blocks synchronization. A trust judgment can be set up in the production environment that incorporates all conditions, such as whether the enterprise is using its own environment image repository.
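A trust judgment of this kind can be sketched as a small admission check: an image is accepted for production only if it comes from the enterprise's own production repository and has passed scanning. In the Python sketch below, the registry host registry.prod.example.com and the function names are hypothetical. The rule used to decide whether the first path component is a registry host (it contains a dot or a colon, or is localhost) follows Docker's image reference convention.

```python
# Hypothetical production registry; replace with the enterprise's own host.
TRUSTED_REGISTRIES = {"registry.prod.example.com"}


def registry_of(image):
    """Return the registry host of an image reference, defaulting to docker.io."""
    first, separator, _ = image.partition("/")
    if separator and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def is_trusted(image, scan_passed):
    """Combine the conditions of the trust judgment into one decision."""
    return registry_of(image) in TRUSTED_REGISTRIES and scan_passed
```

Under these assumptions, ubuntu:22.04 is rejected because it resolves to Docker Hub, and an image from the production registry is rejected as well if its scan did not pass. Further conditions, such as a signature check, could be added to is_trusted in the same way.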
Using the platform, it is also possible to assess the risks associated with vulnerabilities in cluster components and microservices. Among other things, scanning and analyzing image vulnerabilities makes it possible to filter images and to identify, for each image, its creator and the components affected, drawing on software composition analysis, source code scanning, development security scanning, and application vulnerability scanning.
In the event that a container security platform detects an attack, it will provide overall security prevention prior to, during, and after the event. A full evaluation and reinforcement of clusters are conducted beforehand, and all behavior learning will start after the enhancement. When an event occurs, it will check for and implement zero-day defenses, with real-time notifications sent out.
When an attack is detected, the running image should be stopped first. The image is then blocked from being uploaded during R&D, pulled into the repository, or run in production. For containers already running from existing images, segmentation policies can be executed automatically or manually, and rules can be set up for both automatic and manual execution.
As the network domains between clusters vary, and the K8S network plug-in operates as the overlay network plug-in by default, the network domain can naturally serve as the security domain between clusters.
Microsegmentation in cloud-native must support zero-trust, Label-based blocking as well as IP-based blocking and IP configuration.
The design of cloud-native security platforms is based on this principle. Meanwhile, we should not only deploy a dedicated cloud-native security firewall but also take full advantage of traditional security firewalls to protect security.
The prevention of zero-day attacks can be modeled based on the following five factors:
· Learn in-container behaviors to build a security model;
· Analyze the resulting list of risk events, such as file accesses, abnormal network connections, and system calls detected outside the model;
· Have team members respond to and take responsibility for preventing abnormal behavior or correcting errors as soon as possible;
· Develop models in the test environment and apply them directly to the production environment without the need to re-learn them;
· Zero-vulnerability, supporting 0-day mitigation.
During a learning cycle, the processes that start and stop, and the files they read and write, all need to be learned. Suppose that, after the learning cycle, a brute force attack is launched on a database, causing a large number of network and validation errors in a short period; this can be directly considered as not conforming to the learned specification.
The first four factors above learn the behavior of running containers, while the last one predicts the state of containers before they run. In addition, historical containers keep a record of the learning process in order to prevent zero-day attacks in the future.
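The brute force example amounts to a rate check against a limit established during learning. The Python sketch below counts failures inside a sliding time window; FailureRateMonitor, max_failures, and window_seconds are hypothetical names, and the idea that the limit comes from the learned baseline is an assumption made for illustration.

```python
from collections import deque


class FailureRateMonitor:
    """Flags bursts of validation failures that exceed a learned limit."""

    def __init__(self, max_failures, window_seconds):
        self.max_failures = max_failures      # highest count seen as normal
        self.window_seconds = window_seconds  # length of the sliding window
        self._times = deque()

    def record_failure(self, timestamp):
        """Record one failure; return True when the window exceeds the limit."""
        self._times.append(timestamp)
        # Drop failures that have fallen out of the sliding window.
        while self._times and timestamp - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) > self.max_failures
```

With a limit of three failures per ten seconds, a fourth failed login within the window is reported as outside the learned specification, while an isolated failure much later is not.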
Guest Introduction
Mr. Bai Liming is a technical partner with Dosec.cn and was previously responsible for the cloud-native platform for OurGame.com. He has over seven years of experience in DevSecOps R&D and is one of the key developers of the first cloud-native security product in China. Aside from this, he was also a key contributor to the establishment of "Classified Protection of Cybersecurity 2.0" issued by the Ministry of Public Security and the white paper on Cloud Native Architecture Security from the China Academy of Information and Communications Technology (CAICT).