
读书笔记《hands-on-docker-for-microservices-with-python》配置和保护生产系统

Configuring and Securing the Production System

生产(来自生产环境)是描述主系统的通用名称——为真正的客户工作的系统。这是公司可用的主要环境。它也可以称为 live。该系统需要在互联网上公开可用才能使用,这也使安全性和可靠性成为重中之重。在本章中,我们将了解如何为生产部署 Kubernetes 集群。

我们将了解如何使用第三方产品 Amazon Web Services (AWS) 进行设置,并将说明为什么创建自己的产品是个坏主意。我们将在这个新部署中部署我们的系统,并将检查如何设置负载均衡器以有序的方式将流量从旧的单体转移到新系统。

我们还将了解如何自动扩展 Kubernetes 集群内的 Pod 和节点以使资源适应需求。

本章将涵盖以下主题:

  • Using Kubernetes in the wild
  • Setting up the Docker registry
  • Creating the cluster
  • Using HTTPS and TLS to secure external access
  • Being ready for migration to microservices
  • Autoscaling the cluster
  • Deploying a new Docker image smoothly

我们还将介绍一些良好实践,以确保部署过程尽可能顺利和可靠。在本章结束时,您将把系统部署在一个公开可用的 Kubernetes 集群中。

Technical requirements

我们将使用 AWS 作为示例的云供应商。我们需要安装一些实用程序,以便从命令行与它交互。请查看这份文档(https://aws.amazon.com/cli/)了解如何安装 AWS CLI。此实用程序允许从命令行执行 AWS 任务。

为了操作 Kubernetes 集群,我们将使用 eksctl。查看此文档 (https://eksctl.io/introduction/installation/) 了解安装说明。

您还需要安装 aws-iam-authenticator。您可以在此处查看安装说明(https://docs.aws.amazon.com/eks/latest/userguide/install-aws-iam-authenticator.html)。

本章的代码可以在 GitHub 上的这个链接找到:https://github.com/PacktPublishing/Hands-On-Docker-for-Microservices-with-Python/tree/master/Chapter07

确保您的计算机上安装了 ab (Apache Bench)。它与 Apache 捆绑在一起,默认安装在 macOS 和一些 Linux 发行版中。您可以查看这篇文章:https://www.petefreitag.com/item/689.cfm

Using Kubernetes in the wild

在将集群部署为生产环境时,最好的建议是使用商业服务。所有主要的云提供商(AWS EKS、Google Kubernetes Engine (GKE) 和 Azure Kubernetes Service (AKS))都允许您创建托管 Kubernetes 集群,这意味着唯一需要的参数就是选择物理节点的数量和类型,然后通过 kubectl 访问它。

我们将使用 AWS 作为本书中的示例,但请查看其他提供商的文档,以防它们更适合您的用例。

Kubernetes 是一个抽象层,所以这种操作方式非常方便。其定价与为充当节点的原始实例付费类似,而且您无需安装和管理 Kubernetes 控制平面,这些实例直接充当 Kubernetes 节点即可。

值得再说一遍:除非你有非常充分的理由,否则不要自己部署 Kubernetes 集群;相反,请使用云提供商的托管产品。这会容易得多,并为您节省大量维护成本。以高性能的方式配置 Kubernetes 节点并实施良好实践以避免安全问题,并非易事。

如果您有自己的内部数据中心,创建自己的 Kubernetes 集群可能是不可避免的,但在任何其他情况下,直接使用由已知云提供商管理的集群更有意义。可能您当前的提供商已经提供了托管 Kubernetes 的产品!

Creating an IAM user

AWS 通过不同的用户来授予若干角色。这些角色带有不同的权限,决定了用户能执行哪些操作。此系统在 AWS 命名法中称为身份和访问管理(IAM)。

创建合适的 IAM 用户可能非常复杂,具体取决于您的设置以及 AWS 在您的组织中的使用方式。请查看文档(https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html),找到您组织中负责 AWS 的人员,与他们确认所需的步骤。

我们来看看创建IAM用户的步骤:

  1. We need to create an AWS user if it is not created with proper permissions. Be sure that it will be able to access the API by activating the Programmatic access as seen in the following screenshot:
(截图略)

这将显示它的访问密钥 ID 和秘密访问密钥。请务必安全地存放它们。

  2. To access through the command line, you need to use the AWS CLI. With the AWS CLI and the access information, configure your command line to use aws:
$ aws configure
AWS Access Key ID [None]: <your Access Key>
AWS Secret Access Key [None]: <your Secret Key>
Default region name [us-west-2]: <EKS region>
Default output format [None]:

您应该能够使用以下命令获取身份以检查配置是否成功:

$ aws sts get-caller-identity
{
"UserId": "<Access Key>",
"Account": "<account ID>",
"Arn": "arn:aws:iam::XXXXXXXXXXXX:user/jaime"
}

您现在可以访问命令行 AWS 操作。

Keep in mind that the IAM user can create more keys if necessary, revoke the existing ones, and so on. This normally is handled by an admin user in charge of AWS security. You can read more in the Amazon documentation ( https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey_API). Key rotation is a good idea to ensure that old keys are deprecated. You can do it through the aws client interface.
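下面是一个假设性的示意(用户名 jaime 取自上文 ARN 输出,旧密钥 ID 需自行替换),展示了大致的密钥轮换流程;实际的命令和权限请以 AWS 文档为准:

$ aws iam create-access-key --user-name jaime                                        # 先生成一对新密钥
$ aws iam update-access-key --user-name jaime --access-key-id <旧的 Access Key> --status Inactive   # 停用旧密钥
$ aws iam delete-access-key --user-name jaime --access-key-id <旧的 Access Key>       # 确认无误后删除旧密钥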

我们将使用 Web 控制台进行一些操作,但其他操作则需要使用 aws 命令行工具。

Setting up the Docker registry

我们需要能够访问用于存储待部署镜像的 Docker 注册表。确保 Docker 注册表可访问的最简单方法,是使用同一云服务商提供的 Docker 注册表。

You can still use the Docker Hub registry, but using a registry in the same cloud provider is typically easier as it's better integrated. It will also help in terms of authentication.

我们需要使用以下步骤配置 Elastic Container Registry (ECR):

  1. Log into the AWS console and search for Kubernetes or ECR:
(截图略)
  2. Create a new registry called frontend. It will create a full URL, which you will need to copy:
(截图略)
  3. We need to make our local docker log in to the registry. Note that aws ecr get-login will return a docker command that will log you in, so copy it and paste it:
$ aws ecr get-login --no-include-email
<command>
$ docker login -u AWS -p <token>
Login Succeeded
  4. Now we can tag the image that we want to push with the full registry name, and push it:
$ docker tag thoughts_frontend 033870383707.dkr.ecr.us-west-2.amazonaws.com/frontend
$ docker push 033870383707.dkr.ecr.us-west-2.amazonaws.com/frontend
The push refers to repository [033870383707.dkr.ecr.us-west-2.amazonaws.com/frontend]
...
latest: digest: sha256:21d5f25d59c235fe09633ba764a0a40c87bb2d8d47c7c095d254e20f7b437026 size: 2404
  5. The image is pushed! You can check it by opening the AWS console in the browser:
(截图略)
  6. We need to repeat the process to also push the Users Backend and Thoughts Backend (see the command sketch after the following notes).
We use the setting of two containers for the deployment of the Users Backend and Thoughts Backend, which includes one for the service and another for a volatile database. This is done for demonstration purposes, but won't be the configuration for a production system, as the data will need to be persistent.

At the end of the chapter, there's a question about how to deal with this situation. Be sure to check it!
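作为示意,为另外两个服务重复这一推送流程大致如下(仓库名 users_backend、thoughts_backend 以及本地镜像名都是假设的示例,注册表地址沿用上文的例子):

$ aws ecr create-repository --repository-name users_backend
$ docker tag <本地的 users backend 镜像> 033870383707.dkr.ecr.us-west-2.amazonaws.com/users_backend
$ docker push 033870383707.dkr.ecr.us-west-2.amazonaws.com/users_backend
$ aws ecr create-repository --repository-name thoughts_backend
$ docker tag <本地的 thoughts backend 镜像> 033870383707.dkr.ecr.us-west-2.amazonaws.com/thoughts_backend
$ docker push 033870383707.dkr.ecr.us-west-2.amazonaws.com/thoughts_backend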

将添加所有不同的注册表。您可以在浏览器 AWS 控制台中查看它们:

(截图略)

我们的管道将需要进行调整以推送到此存储库。

A good practice in deployment is to add a specific step called promotion, where the images ready to use in production are copied to a specific registry, lowering the chance that a bad image gets deployed by mistake in production.

This process may be done several times to promote the images through different environments. For example, deploy a version in a staging environment. Run some tests, and if they are correct, promote the version, copying it into the production registry and labelling it as good to deploy in the production environment.

This process can be done with different registries in different providers.
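举一个非常简化的例子,"提升"本身可以只是把经过验证的镜像重新打标签并复制到生产注册表(下面的注册表地址和标签均为假设的示例):

$ docker pull staging-registry.example.com/frontend:v1.2.3
$ docker tag staging-registry.example.com/frontend:v1.2.3 prod-registry.example.com/frontend:v1.2.3
$ docker push prod-registry.example.com/frontend:v1.2.3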

我们需要在部署文件中使用带有完整注册表 URL 的镜像名称。

Creating the cluster

为了让我们的代码在云端可用并可供公众访问,我们需要设置一个工作生产集群,这需要两个步骤:

  1. Create the EKS cluster in AWS cloud (this enables you to run the kubectl commands that operate in this cloud cluster).
  2. Deploy your services, using a set of .yaml files, as we've seen in previous chapters. The files require minimal changes to adapt them to the cloud.

让我们检查第一步。

Creating the Kubernetes cluster

创建集群的最佳方法是使用 eksctl 实用程序。这为我们自动化了大部分工作,并允许我们在以后必要时对其进行扩展。

Be aware that EKS is available only in some regions, not all. Check the AWS regional table ( https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/) to see the available zones. We will use the Oregon ( us-west-2 ) region.

要创建 Kubernetes 集群,我们需要执行以下步骤:

  1. First, check that eksctl is properly installed:
$ eksctl get clusters
No clusters found
  2. Create a new cluster. It will take around 10 minutes:
$ eksctl create cluster -n Example
[i] using region us-west-2
[i] setting availability zones to [us-west-2d us-west-2b us-west-2c]
...
[✔] EKS cluster "Example" in "us-west-2" region is ready

  3. This creates the cluster. Checking the AWS web interface will show the newly configured elements.
The --asg-access option needs to be added for a cluster capable of autoscaling. This will be described in more detail in the Autoscaling the cluster section.
  4. The eksctl create command also adds a new context with the information about the remote Kubernetes cluster and activates it, so kubectl will now point to this new cluster.
Note that kubectl has the concept of contexts, which are the different clusters it can connect to. You can see all the available contexts by running kubectl config get-contexts, and change them with kubectl config use-context <context-name>. Check the Kubernetes documentation (https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/) on how to create new contexts manually.
  5. This command sets kubectl with the proper context to run commands. By default, it generates a cluster with two nodes:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-X.us-west-2.internal Ready <none> 11m v1.13.7-eks-c57ff8
ip-Y.us-west-2.internal Ready <none> 11m v1.13.7-eks-c57ff8
  6. We can scale the number of nodes to reduce the usage of resources and save money. We need to retrieve the name of the nodegroup, which controls the number of nodes, and then downscale it:
$ eksctl get nodegroups --cluster Example
CLUSTER NODEGROUP CREATED MIN SIZE MAX SIZE DESIRED CAPACITY INSTANCE TYPE IMAGE ID
Example ng-fa5e0fc5 2019-07-16T13:39:07Z 2 2 0 m5.large ami-03a55127c613349a7
$ eksctl scale nodegroup --cluster Example --name ng-fa5e0fc5 -N 1
[i] scaling nodegroup stack "eksctl-Example-nodegroup-ng-fa5e0fc5" in cluster eksctl-Example-cluster
[i] scaling nodegroup, desired capacity from to 1, min size from 2 to 1
  7. You can contact the cluster through kubectl and carry out the operations normally:
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.100.0.1 <none> 443/TCP 7m31s

集群已设置完毕,我们可以从命令行对其运行命令。

Creating an EKS cluster can be tweaked in a lot of ways, but AWS can be temperamental in terms of access, users, and permissions. For example, the cluster likes to have a CloudFormation rule to handle the cluster, and all the elements should be created with the same IAM user. Check with anyone who works with the infrastructure definition in your organization to confirm what the proper configuration is. Don't be afraid of running tests; a cluster can be quickly removed through the eksctl configuration or the AWS console.

此外,eksctl 会尽可能把节点创建在不同的可用区(同一地理区域内相互隔离的 AWS 位置)中,从而最大限度地降低因 AWS 数据中心问题导致整个集群停机的风险。

Configuring the cloud Kubernetes cluster

Configuring the AWS image registry

第一个区别是我们需要把镜像名称改为带完整注册表地址的形式,这样集群就会使用 ECR 注册表中可用的镜像。

请记住,您需要在 AWS 中指定注册表,以便 AWS 集群可以正确访问它。

例如,在 frontend/deployment.yaml 文件中,我们需要这样定义它们:

containers:
  - name: frontend-service
    image: XXX.dkr.ecr.us-west-2.amazonaws.com/frontend:latest
    imagePullPolicy: Always

该镜像应从 AWS 注册表中拉取。拉取策略应改为 Always,以强制每次都从注册表拉取,而不是使用节点上的本地缓存。

在创建 example 命名空间后,您可以通过应用文件在远程服务器中进行部署:

$ kubectl create namespace example
namespace/example created
$ kubectl apply -f frontend/deployment.yaml
deployment.apps/frontend created

稍后,部署会创建 pod:

$ kubectl get pods -n example
NAME READY STATUS RESTARTS AGE
frontend-58898587d9-4hj8q 1/1 Running 0 13s

现在我们需要改变其余的元素。所有部署都需要进行调整以包含正确的注册表。
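例如,Thoughts Backend 的部署文件中,容器镜像大致会改成类似下面的样子(仓库名 thoughts_backend、容器名和账户 ID 只是沿用前文示例的假设,实际名称以仓库中的定义为准):

containers:
  - name: thoughts-backend-service
    image: 033870383707.dkr.ecr.us-west-2.amazonaws.com/thoughts_backend:latest
    imagePullPolicy: Always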

请查看 GitHub 上的代码,核对所有 deployment.yaml 文件。

Configuring the usage of an externally accessible load balancer

第二个区别是使前端服务在外部可用,因此互联网流量可以访问集群。

这很容易通过将服务从 NodePort 更改为 LoadBalancer 来完成。检查 frontend/service.yaml 文件:

apiVersion: v1
kind: Service
metadata:
  namespace: example
  labels:
    app: frontend-service
  name: frontend-service
spec:
  ports:
    - name: frontend
      port: 80
      targetPort: 8000
  selector:
    app: frontend
  type: LoadBalancer

这将创建一个可以从外部访问的新 Elastic Load Balancer (ELB)。现在,让我们开始部署。

Deploying the system

整个系统可以从Chapter07子目录部署,代码如下:

$ kubectl apply --recursive -f .
deployment.apps/frontend unchanged
ingress.extensions/frontend created
service/frontend-service created
deployment.apps/thoughts-backend created
ingress.extensions/thoughts-backend-ingress created
service/thoughts-service created
deployment.apps/users-backend created
ingress.extensions/users-backend-ingress created
service/users-service created

此命令迭代地遍历子目录并应用任何 .yaml 文件。

几分钟后,您应该会看到一切正常运行:

$ kubectl get pods -n example
NAME READY STATUS RESTARTS AGE
frontend-58898587d9-dqc97 1/1 Running 0 3m
thoughts-backend-79f5594448-6vpf4 2/2 Running 0 3m
users-backend-794ff46b8-s424k 2/2 Running 0 3m

要获取公共接入点,您需要检查服务:

$ kubectl get svc -n example
NAME TYPE CLUSTER-IP EXTERNAL-IP AGE
frontend-service LoadBalancer 10.100.152.177 a28320efca9e011e9969b0ae3722320e-357987887.us-west-2.elb.amazonaws.com 3m
thoughts-service NodePort 10.100.52.188 <none> 3m
users-service NodePort 10.100.174.60 <none> 3m

请注意,前端服务有一个可用的外部 ELB DNS。

如果将该 DNS 放在浏览器中,则可以按如下方式访问该服务:

(截图略)

恭喜,您拥有自己的云 Kubernetes 服务。服务可访问的 DNS 名称不是很好,因此我们将了解如何添加已注册的 DNS 名称并在 HTTPS 端点下公开它。

Using HTTPS and TLS to secure external access

为了向您的客户提供良好的服务,您的外部端点应通过 HTTPS 提供服务。这意味着您和您的客户之间的通信是私密的,无法在整个网络路由中被嗅探。

HTTPS 的工作方式是服务器和客户端对通信进行加密。为了确保服务器确实是它所声称的身份,需要由受认可的证书颁发机构颁发的 SSL 证书,该机构会对相应的 DNS 进行验证。

请记住,HTTPS 的重点不是服务器本质上值得信赖,而是客户端和服务器之间的通信是私密的。服务器仍然可能是恶意的。这就是为什么验证所访问的特定 DNS 没有拼写错误非常重要。

您可以在这部精彩的漫画中获得有关 HTTPS 工作原理的更多信息: https://howhttps.works/

为您的外部端点获取证书需要两个阶段:

  • You own a particular DNS name, normally by buying it from a name registrar.
  • You obtain a unique certificate for the DNS name by a recognized Certificate Authority (CA). The CA has to validate that you control the DNS name.
为了帮助推广 HTTPS,非营利组织 Let's Encrypt(https://letsencrypt.org)提供有效期为 90 天的免费证书。这会比通过云提供商获取证书多一些工作,但如果资金紧张,这可能是一种选择。

如今,云提供商可以轻松完成这一过程,因为它们可以同时充当域名注册商和证书颁发机构,从而简化流程。

The important element that needs to communicate through HTTPS is the edge of our network. The internal network where our own microservices communicate doesn't need to use HTTPS, and HTTP will suffice. It needs to be a private network, out of public interference, though.

按照我们的示例,AWS 允许我们创建证书并将其与 ELB 关联,从而以 HTTPS 提供流量服务。

Having AWS serve the HTTPS traffic ensures that we are using the latest and safest security protocols, such as Transport Layer Security (TLS) v1.3 (the latest at the time of writing), but also that it keeps backward compatibility with older protocols, such as SSL.

In other words, it is the best option to use the most secure environment by default.

设置 HTTPS 的第一步是直接从 AWS 购买 DNS 域名,或者把已有域名的控制权转移到 AWS。这可以通过其 Route 53 服务完成。您可以在文档中了解更多:https://aws.amazon.com/route53/。

It is not strictly required to transfer your DNS to Amazon, as long as you can point it toward the externally facing ELB, but it helps with the integration and obtaining of certificates. You'll need to prove that you own the DNS record when creating a certificate, and using AWS makes it simple as they create a certificate to a DNS record they control. Check the documentation at https://docs.aws.amazon.com/acm/latest/userguide/gs-acm-validate-dns.html.
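作为参考,也可以用 AWS CLI 向 ACM 申请一张通过 DNS 验证的证书,大致如下(域名只是假设的示例,后续验证和关联步骤请以 ACM 文档为准):

$ aws acm request-certificate --domain-name www.example.com --validation-method DNS --region us-west-2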

要在您的 ELB 上启用 HTTPS 支持,让我们检查以下步骤:

  1. Go to Listeners in the AWS console:
(截图略)
  2. Click on Edit and add a new rule for HTTPS support:
(截图略)
  3. As you can see, it will require an SSL certificate. Click on Change to go to management:
(截图略)
  4. From here, you can either add an existing certificate or buy one from Amazon.
Be sure to check the documentation about the load balancer in Amazon. There are several kinds of ELBs that can be used, and some have different features than others depending on your use case. For example, some of the new ELBs are able to redirect toward HTTPS even if your customer requests the data in HTTP. See the documentation at https://aws.amazon.com/elasticloadbalancing/.

恭喜,现在您的外部端点支持 HTTPS,确保您与客户的通信是私密的。

Being ready for migration to microservices

为了在迁移过程中顺利运行,您需要部署一个负载均衡器,允许您在后端之间快速交换并保持服务正常运行。

正如我们在第 1 章 采取行动:设计、计划和执行 中所讨论的,HAProxy 是一个很好的选择,因为它用途广泛,并且有很好的 UI,让您只需在网页上点击即可快速进行操作。它还有一个出色的统计页面,可以让您监控服务的状态。

AWS 有一个 HAProxy 替代方案,称为 Application Load Balancer (ALB)。这是 ELB 上的一项功能丰富的更新,它允许您将不同的 HTTP 路径路由到不同的后端服务。

HAProxy 具有更丰富的功能集和更好的仪表板可供交互。它也可以通过配置文件进行更改,这有助于控制变更,正如我们将在第 8 章 使用 GitOps 原则 中看到的那样。

显然,只有当所有服务都在 AWS 中时才能使用 ALB,但在这种情况下,它可能是一个很好的解决方案,因为它更简单,也更契合技术栈的其余部分。请查看这篇文档:https://aws.amazon.com/blogs/aws/new-aws-application-load-balancer/。

要在服务前部署负载均衡器,我建议不要将其部署在 Kubernetes 上,而是以与传统服务相同的方式运行它。这种负载均衡器将是系统的关键部分,消除不确定性对于成功运行很重要。这也是一项相对简单的服务。

Keep in mind that a load balancer needs to be properly replicated, or it becomes a single point of failure. Amazon and other cloud providers allow you to set up an ELB or other kinds of load balancer toward your own deployment of load balancers, enabling the traffic to be balanced among them.

例如,我们创建了一个示例配置和 docker-compose 文件来快速运行它,但是可以以您的团队最熟悉的任何方式设置配置。

Running the example

该代码可在 GitHub 上获得(https://github.com/PacktPublishing/Hands-On-Docker-for-Microservices-with-Python/tree/master/Chapter07/haproxy)。我们使用 Docker Hub 上的官方 HAProxy 镜像(https://hub.docker.com/_/haproxy/),并添加自己的配置文件。

让我们看一下配置文件 haproxy.cfg 中的主要元素:

frontend haproxynode
bind *:80
mode http
default_backend backendnodes

backend backendnodes
balance roundrobin
option forwardfor
server aws a28320efca9e011e9969b0ae3722320e-357987887.us-west-2.elb.amazonaws.com:80 check
server example www.example.com:80 check

listen stats
bind *:8001
stats enable
stats uri /
stats admin if TRUE

我们定义了一个前端,它接受所有到端口 80 的请求,并将请求发送到后端。后端以轮询方式在两个服务器 example 和 aws 之间平衡请求。基本上,example 指向 www.example.com(旧服务的占位符),aws 指向之前创建的负载均衡器。

我们在端口 8001 上启用统计服务器,并允许管理员访问。
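作为参考,一个与上述配置配合的最小 docker-compose 草图大致如下(服务名 proxy 取自下文的启动命令,端口映射是根据下文描述做出的假设,具体请以仓库中的文件为准):

version: '3'
services:
  proxy:
    build: .
    ports:
      - "8000:80"    # 本机 8000 端口映射到容器内 HAProxy 监听的 80 端口(负载均衡器)
      - "8001:8001"  # 统计页面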

docker-compose 配置会启动服务器,并把本机的 8000 端口(负载均衡器)和 8001 端口(统计信息)转发到容器。使用以下命令启动它:

$ docker-compose up --build proxy
...

现在我们可以访问 localhost:8000,这将在 thoughts 服务和 404 错误之间交替。

When calling example.com this way, we are forwarding the host request. This means we send a request requesting Host:localhost to example.com, which returns a 404 error. Be sure to check on your service that the same host information is accepted by all the backends.

打开统计页面以检查设置:

(截图略)

检查后端节点中 aws 和 example 的条目。还有很多有趣的信息,比如请求数、最后一次连接、传输的数据量等等。

您可以勾选 example 后端,然后在下拉菜单中将其状态设置为 MAINT,以执行管理操作。应用后,example 后端即处于维护模式,并从负载均衡器中移除。统计页面如下:

(截图略)

现在访问 localhost:8000 中的负载均衡器只会返回 thoughts 前端。您可以重新启用后端,将其设置为 READY 状态。

有一种状态叫 DRAIN 将阻止新会话进入所选服务器,但现有会话将继续进行。这在某些配置中可能很有趣,但如果后端真的是无状态的,则直接移动到 MAINT 状态应该足够了。

HAProxy 也可以配置为使用健康检查来确保后端可用。我们在示例中添加了一个被注释掉的检查,它会发送一个 HTTP 请求并校验返回结果:

option httpchk HEAD / HTTP/1.1\r\nHost:\ example.com

校验对两个后端都是一样的,所以需要成功返回。默认情况下,它将每隔几秒钟运行一次。

您可以在 http://www.haproxy.org/ 查看完整的 HAProxy 文档。有很多细节可以配置。与您的团队跟进以确保超时、转发标头等区域的配置正确无误。

Kubernetes 中也使用了健康检查的概念,以确保 pod 和容器准备好接受请求并保持稳定。 我们将在下一节中了解如何确保正确部署新映像。

Deploying a new Docker image smoothly

在生产环境中部署服务时,确保服务顺利运行以避免中断服务至关重要。

Kubernetes 和 HAProxy 能够检测服务何时正常运行,并在未正常运行时采取措施,但我们需要提供一个充当健康检查的端点并将其配置为定期 ping,以便及早发现问题。

For simplicity, we will use the root URL as a health check, but we can design specific endpoints to be tested. A good health check verifies that the service is working as expected, but is light and quick. Avoid the temptation of over-testing or performing an external verification that could make the endpoint take a long time.

An API endpoint that returns an empty response is a great example, as it checks that the whole piping system works, but it's very fast to answer.

在 Kubernetes 中,有两个测试可以确保 Pod 正常工作,即就绪探针和活跃度探针。

The liveness probe

活跃度探针检查容器是否正常工作。它是一个在容器中执行并且需要正确返回的进程。如果它连续多次返回错误(次数取决于配置),Kubernetes 将杀死该容器并重新启动它。

活跃度探针会在容器内部执行,所以相应的命令需要在容器内可用。对于 Web 服务,添加 curl 命令是个好主意:

spec:
  containers:
    - name: frontend-service
      livenessProbe:
        exec:
          command:
            - curl
            - http://localhost:8000/
        initialDelaySeconds: 5
        periodSeconds: 30

虽然还有诸如检查 TCP 端口是否打开或发送 HTTP 请求等选项,但运行命令是最通用的方式,而且也可以出于调试目的手动执行它。有关更多选项,请参阅文档。
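例如,如果想改用 HTTP 检查而不是执行命令,一个等价的写法大致如下(路径和端口沿用上面的示例,仅作示意):

livenessProbe:
  httpGet:
    path: /
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 30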

Be careful of being very aggressive on liveness probes. Each check puts some load on the container, so depending on load multiple probes can end up killing more containers than they should.

If your services are restarted often by the liveness probe, either the probe is too aggressive or the load is high for the number of containers, or a combination of both.

探针配置为等待 5 秒,然后每 30 秒运行一次。默认情况下,经过 3 次检查失败后,它会重新启动容器。

The readiness probe

就绪探针检查容器是否准备好接受更多请求。这是一个不那么激进的版本。如果测试返回错误或超时,容器不会被重新启动,而只会被标记为不可用。

就绪探针通常用于避免过早接受请求,但它在启动之后会一直运行。一个智能的就绪探针可以标记出容器何时已满负荷而无法接受更多请求,但通常,以与活跃度探针类似的方式配置的探针就足够了。

就绪探针在部署配置中定义,与活动探针相同。让我们来看看:

spec:
  containers:
    - name: frontend-service
      readinessProbe:
        exec:
          command:
            - curl
            - http://localhost:8000/
        initialDelaySeconds: 5
        periodSeconds: 10

就绪探针可以比活跃度探针更激进一些,因为失败的后果更安全。这就是 periodSeconds 更短的原因。您可能两者都需要,也可能只需要其中一个,取决于您的具体用例;但启用滚动更新需要就绪探针,我们接下来会看到。

示例代码中的 frontend/deployment.yaml 部署包括这两个探针。查看 Kubernetes 文档(https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/)了解更多详细信息和选项。

Be aware that the two probes are used for different objectives. The readiness probe delays the input of requests until the pod is ready, while the liveness probe helps with stuck containers.

A delay in the liveness probe getting back will restart the pod, so an increase in load could produce a cascade effect of restarting pods. Adjust accordingly, and remember that both probes don't need to repeat the same command.

就绪和活跃度探测都帮助 Kubernetes 控制如何创建 pod,这会影响部署的更新。

Rolling updates

默认情况下,每次我们更新部署镜像时,Kubernetes 部署都会重新创建容器。

Pushing a new image to the registry is not enough to notify Kubernetes that a new version is available, even if the tag is the same. You'll need to change the tag described in the image field in the deployment .yaml file.
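作为参考,除了修改 .yaml 文件后重新 kubectl apply,也可以用类似下面的命令直接更新镜像标签来触发一次部署(标签 v2 只是假设的示例,容器名取自前文的部署文件):

$ kubectl set image deployment/frontend frontend-service=033870383707.dkr.ecr.us-west-2.amazonaws.com/frontend:v2 -n example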

我们需要控制镜像的更改方式。为了不中断服务,我们需要执行滚动更新。这种更新会先添加新容器,等待它们就绪,将它们加入池中,然后再删除旧容器。这种部署方式比删除所有容器再重新启动要慢一些,但可以保证服务不中断。

可以通过调整部署中的 strategy 部分来配置此过程的执行方式:

spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 1

让我们理解这段代码:

  • strategy and type can be either RollingUpdate (the default) or Recreate, which stops existing pods and creates new ones.
  • maxUnavailable defines the maximum number of unavailable pods during a change. This defines how quickly new containers will be added and old ones removed. It can be described as a percentage, like our example, or as a fixed number.
  • maxSurge defines the number of extra pods that can be created over the limit of desired pods. This can be a specific number or a percentage of the total.
  • As we set replicas to 4, in both cases the result is one pod. This means that during a change, up to one pod may be unavailable and that we will create the new pods one by one.

较高的数字将更快地执行更新,但会消耗更多资源 (maxSurge) 或在更新期间更积极地减少可用资源 (maxUnavailable)。

For a small number of replicas, be conservative and grow the numbers when you are more comfortable with the process and have more resources.

最初,手动扩展 pod 将是最简单和最好的选择。如果流量变化很大,有高峰和低谷,则可能值得自动扩展集群。

Autoscaling the cluster

我们之前已经看到如何更改服务的 pod 数量,以及如何添加和删除节点。这些操作可以通过描述一些规则来实现自动化,让集群能够弹性地调整其资源。

Keep in mind that autoscaling requires tweaking to adjust to your specific use case. This is a technique to use if the resource utilization changes greatly over time; for example, if there's a daily pattern where some hours present way more activity than others, or if there's a viral element that means the service multiplies the requests by 10 unexpectedly.

If your usage of servers is small and the utilization stays relatively constant, there's probably no need to add autoscaling.

集群可以在两个不同的方面自动向上或向下扩展:

  • The number of pods can be set to increase or decrease automatically in a Kubernetes configuration.
  • The number of nodes can be set to increase or decrease automatically in AWS.

Pod 的数量和节点的数量都需要彼此保持一致,以允许自然增长。

如果 pod 的数量增加而没有增加更多的硬件(节点),Kubernetes 集群将不会有更多的容量,只会在不同的分布中分配相同的资源。

如果在没有创建更多 pod 的情况下增加节点数量,那么在某些时候,额外的节点将没有 pod 可以分配,从而导致资源利用率不足。另一方面,添加的任何新节点都会产生相关成本,因此我们希望正确使用它。

To be able to automatically scale a pod, be sure that it is scalable. To ensure the pod is scalable, check that it is a stateless web service that obtains all its information from an external source.

Note that, in our code example, the frontend pod is scalable, while the Thoughts and Users Backend is not, as they include their own database container the application connects to.

Creating a new pod creates a new empty database, which is not the expected behavior. This has been done on purpose to simplify the example code. The intended production deployment is, as described before, to connect to an external database instead.

Kubernetes 配置和 EKS 都具有允许根据规则更改 pod 和节点数量的功能。

Creating a Kubernetes Horizontal Pod Autoscaler

在 Kubernetes 命名法中,向上和向下扩展 pod 的服务称为 Horizontal Pod Autoscaler (HPA)。

要实现自动伸缩,HPA 需要一种方法来获取用于判断伸缩的度量指标。要启用这些指标,我们需要部署 Kubernetes 指标服务器。

Deploying the Kubernetes metrics server

Kubernetes 指标服务器捕获内部低级指标,例如 CPU 使用率、内存等。 HPA 将捕获这些指标并使用它们来扩展资源。

The Kubernetes metrics server is not the only available server for feeding metrics to the HPA, and other metrics systems can be defined. The list of the currently available adaptors is available in the Kubernetes metrics project ( https://github.com/kubernetes/metrics/blob/master/IMPLEMENTATIONS.md#custom-metrics-api).

This allows for custom metrics to be defined as a target. Start first with default ones, though, and only move to custom ones if there are real limitations for your specific deployment.

要部署 Kubernetes 指标服务器,请从官方项目页面下载最新版本(https://github.com/kubernetes-incubator/metrics-server/releases)。在撰写本文时,它是 0.3.3.

下载 tar.gz 文件,在撰写本文时该文件为 metrics-server-0.3.3.tar.gz。解压缩并将版本应用到集群:

$ tar -xzf metrics-server-0.3.3.tar.gz
$ cd metrics-server-0.3.3/deploy/1.8+/
$ kubectl apply -f .
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
serviceaccount/metrics-server created
deployment.extensions/metrics-server created
service/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created

您将在 kube-system 命名空间中看到新的 pod:

$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
...
metrics-server-56ff868bbf-cchzp 1/1 Running 0 42s

您可以使用 kubectl top 命令 获取有关节点和 pod 的基本信息:

$ kubectl top node
NAME CPU(cores) CPU% MEM(bytes) MEMORY%
ip-X.us-west-2.internal 57m 2% 547Mi 7%
ip-Y.us-west-2.internal 44m 2% 534Mi 7%
$ kubectl top pods -n example
NAME CPU(cores) MEMORY(bytes)
frontend-5474c7c4ff-d4v77 2m 51Mi
frontend-5474c7c4ff-dlq6t 1m 50Mi
frontend-5474c7c4ff-km2sj 1m 51Mi
frontend-5474c7c4ff-rlvcc 2m 51Mi
thoughts-backend-79f5594448-cvdvm 1m 54Mi
users-backend-794ff46b8-m2c6w 1m 54Mi

为了正确控制使用限制,我们需要在部署中配置分配的内容并限制资源。

Configuring the resources in deployments

在容器的配置中,我们可以指定请求的资源是什么以及它们的最大资源。

它们都向 Kubernetes 告知容器的预期内存和 CPU 使用率。创建新容器时,Kubernetes 会自动将其部署在有足够资源覆盖它的节点上。

frontend/deployment.yaml 文件中,我们包含以下 resources 实例:

spec:
  containers:
    - name: frontend-service
      image: 033870383707.dkr.ecr.us-west-2.amazonaws.com/frontend:latest
      imagePullPolicy: Always
      ...
      resources:
        requests:
          memory: "64M"
          cpu: "60m"
        limits:
          memory: "128M"
          cpu: "70m"

最初请求的内存是 64 MB,以及 0.06 个 CPU 内核。

内存资源除了使用 M(兆字节,10^6 字节)外,也可以使用 2 的幂单位 Mi,即 mebibyte(2^20 字节)。无论哪种情况,两者差异都很小。您也可以使用 G 或 T 来表示更大的数量。

CPU 资源按分数来衡量,其中 1 表示节点所在环境中的一个核心(例如一个 AWS vCPU)。请注意,1000m(即 1000 毫 CPU)相当于一个完整的核心。

限制为 128 MB 和 0.07 个 CPU 内核。容器将无法使用超过限制的内存或 CPU。

Aim at round simple numbers to understand the limits and requested resources. Don't expect to have them perfect the first time; the applications will change their consumption.

Measuring the metrics in an aggregated way, as we will talk about in Chapter 11, Handling Change, Dependencies, and Secrets in the System, will help you to see the evolution of the system and tweak it accordingly.

该限制为自动缩放器创建了基准,因为它将以资源的百分比来衡量。

Creating an HPA

要创建一个新的 HPA,我们可以使用 kubectl autoscale 命令:

$ kubectl autoscale deployment frontend --cpu-percent=10 --min=2 --max=8 -n example
horizontalpodautoscaler.autoscaling/frontend autoscaled

这将创建一个新的 HPA,它以 example 命名空间中的 frontend 部署为目标,并将 pod 的数量设置在 2 到 8 之间。用于伸缩判断的参数是 CPU,我们将其设置为可用 CPU 的 10%(所有 pod 的平均值)。如果高于该值,它将创建新的 pod;如果低于该值,它将减少 pod。

The 10% limit is used to be able to trigger the autoscaler and to demonstrate it.
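如果希望把 HPA 像其他资源一样纳入版本控制,也可以用等价的声明式清单来创建。下面是一个示意草图(数值与上面的命令保持一致):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
  namespace: example
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 2
  maxReplicas: 8
  targetCPUUtilizationPercentage: 10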

自动缩放器作为一种特殊的 Kubernetes 对象工作,可以这样查询:

$ kubectl get hpa -n example
NAME REFERENCE TARGETS MIN MAX REPLICAS AGE
frontend Deployment/frontend 2%/10% 2 8 4 80s

注意目标是如何说它目前在 2% 左右,接近极限。这是使用具有相对较高基线的小型可用 CPU 设计的。

几分钟后,副本数将下降,直到达到最小值 2

The downscaling may take a few minutes. This generally is the expected behavior, upscaling being more aggressive than downscaling.

为了创建一些负载,让我们将应用程序 Apache Bench (ab) 与前端中专门创建的端点结合使用,该端点使用大量 CPU:

$ ab -n 100 http://<LOADBALANCER>.elb.amazonaws.com/load
Benchmarking <LOADBALANCER>.elb.amazonaws.com (be patient)....

请注意,ab 是一个方便的测试应用程序,可以同时产生 HTTP 请求。如果您愿意,您可以快速连续多次从浏览器中点击 URL。

Remember to add the load balancer DNS, as retrieved in the Creating the cluster section.

这将在集群中产生额外的 CPU 负载,并使部署规模扩大:

NAME     REFERENCE           TARGETS MIN MAX REPLICAS AGE
frontend Deployment/frontend 47%/10% 2 8 8 15m

请求完成后,几分钟后,pod 的数量 会慢慢减少,直到再次碰到两个 pod。

但是我们也需要一种扩展节点的方法,否则我们将无法增加系统中的资源总数。

Scaling the number of nodes in the cluster

也可以增加在 EKS 集群中作为节点工作的 AWS 实例的数量。这为集群增加了额外的资源,并且可以启动更多的 Pod。

允许这样做的底层 AWS 服务是 Auto Scaling 组。这是一组 EC2 实例,它们共享相同的映像并具有定义的大小,包括最小和最大实例。

在任何 EKS 集群的核心,都有一个 Auto Scaling 组来控制集群的节点。请注意,eksctl 创建 Auto Scaling 组并将其公开为节点组:

$ eksctl get nodegroup --cluster Example
CLUSTER NODEGROUP MIN MAX DESIRED INSTANCE IMAGE ID
Example ng-74a0ead4 2 2 2 m5.large ami-X

使用 eksctl,我们可以像创建集群时描述的那样手动扩展或缩减集群:

$ eksctl scale nodegroup --cluster Example --name ng-74a0ead4 --nodes 4
[i] scaling nodegroup stack "eksctl-Example-nodegroup-ng-74a0ead4" in cluster eksctl-Example-cluster
[i] scaling nodegroup, desired capacity from to 4, max size from 2 to 4

此节点组在 AWS 控制台中也可见,位于 EC2 | Auto Scaling 组

(截图略)

在 Web 界面中,我们有一些有趣的信息可用于收集有关 Auto Scaling 组的信息。 Activity History 选项卡允许您查看任何放大或缩小事件,Monitoring 选项卡允许您检查指标。

大多数参数是由 eksctl 自动创建的,例如 Instance Type 和 AMI-ID(实例上的初始软件,包含操作系统),它们应该主要通过 eksctl 来控制。

If you need to change the Instance Type, eksctl requires you to create a new nodegroup, move all the pods, and then delete the old. You can learn more about the process in the eksctl documentation ( https://eksctl.io/usage/managing-nodegroups/).

但从 Web 界面可以轻松编辑缩放参数并添加自动缩放策略。

通过 Web 界面更改参数,可能会让 eksctl 检索到的数据与实际配置不一致,因为两者是各自独立设置的。

可以为 AWS 安装 Kubernetes 自动扩缩器,但它需要 secrets 配置文件以在 autoscaler pod 中包含适当的 AMI,并具有添加实例的 AWS 权限。

在代码中用 AWS 术语描述自动缩放策略也可能让人很困惑,Web 界面会让这件事容易一些。不过用配置文件的优点是,其中描述的所有内容都可以纳入源代码控制。

我们将在此处使用 Web 界面进行配置,但您也可以按照这里的说明操作:https://eksctl.io/usage/autoscaling/。
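作为示意,在 eksctl 的集群配置文件中为节点组开启自动扩缩容所需 IAM 附加策略的写法大致如下(节点组名称与数值均为假设,细节请以 eksctl 文档为准):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: Example
  region: us-west-2
nodeGroups:
  - name: ng-autoscaling
    minSize: 1
    maxSize: 4
    desiredCapacity: 2
    iam:
      withAddonPolicies:
        autoScaler: true   # 为节点组附加 Cluster Autoscaler 所需的 IAM 策略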

对于扩展策略,可以创建两个主要组件:

  • Scheduled actions: They are scale up and down events that happen at defined times. The action can change the number of nodes through a combination of the desired number and the minimum and maximum number, for example, increasing the cluster during the weekend. The actions can be repeated periodically, such as each day or each hour. The action can also have an ending time, which will revert the values to the ones previously defined. This can be used to give a boost for a few hours if we expect extra load in the system, or to reduce costs during night hours.
  • Scaling policies: These are policies that look for demand at a particular time and scale up or down the instances, between the described numbers. There are three types of policies: target tracking, step scaling, and simple scaling. Target tracking is the simplest, as it monitors the target (typically CPU usage) and scales up and down to keep close to the number. The other two policies require you to generate alerts using the AWS CloudWatch metrics system, which is more powerful but also requires using CloudWatch and a more complex configuration.
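例如,上面提到的 Scheduled actions(定时动作)可以通过 AWS CLI 针对集群对应的 Auto Scaling 组来配置。下面是一个假设性的示例(组名、时间和数值仅作说明):

$ aws autoscaling put-scheduled-update-group-action \
    --auto-scaling-group-name <Auto Scaling 组名称> \
    --scheduled-action-name scale-down-at-night \
    --recurrence "0 22 * * *" \
    --min-size 1 --max-size 2 --desired-capacity 1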

节点的数量不仅可以增加而且可以减少,这意味着删除节点。

Deleting nodes

删除节点时,运行的 pod 需要移动到另一个节点。这是由 Kubernetes 自动处理的,EKS 将以安全的方式进行操作。

This can also happen if a node is down for any reason, such as an unexpected hardware problem. As we've seen before, the cluster is created in multiple availability zones to minimize risks, but some nodes may have problems if there's a problem in an Amazon availability zone.

Kubernetes was designed for this kind of problem, so it's good at moving pods from one node to another in unforeseen circumstances.

将 pod 从一个节点移动到另一个节点是通过销毁 pod 并在新节点中重新启动它来完成的。由于 pod 由部署控制,它们将保留适当数量的 pod,如副本或自动缩放值所述。

请记住,Pod 本质上是易变的,因此应将其设计为可以销毁和重新创建。

升级还可以导致现有的 pod 移动到其他节点以更好地利用资源,尽管这种情况不太常见。节点数量的增加通常与 Pod 数量的增加同时进行。

控制节点的数量需要考虑要遵循的策略以达到最佳结果,具体取决于需求。

Designing a winning autoscaling strategy

正如我们所见,两种自动缩放,pod 和节点,都需要相互关联。减少节点数量可以降低成本,但会限制可用于增加 pod 数量的可用资源。

永远记住,自动缩放是一个大数字游戏。除非负载变化大到足以证明其合理性,否则调整自动缩放所节省的成本,无法抵消开发和维护这一流程的成本。请对预期收益和维护成本做一次成本分析。

在处理更改集群大小时优先考虑简单性。在晚上和周末缩小规模可以节省很多钱,而且比生成复杂的 CPU 算法来检测高点和低点要容易得多。

Keep in mind that autoscaling is not the only way of reducing costs with cloud providers, and can be used combined with other strategies.

For example, in AWS, reserving EC2 instances for a year or more allows you to greatly reduce the bill. They can be used for the cluster baseline and combined with more expensive on-demand instances for autoscaling, which yields an extra reduction in costs: https://aws.amazon.com/ec2/pricing/reserved-instances/.

作为一般规则,您的目标应该是拥有额外的硬件来允许扩展 pod,因为这样会更快。在不同 pod 以不同速率缩放的情况下,这是允许的。根据应用程序的不同,当一项服务的使用率上升时,另一项服务的使用率可能会下降,这将使利用率保持在相似的数字上。

This is not the use case that comes to mind, but for example, scheduled tasks during the night may make use of available resources that at daytime are being used by external requests.

They can work in different services, balancing automatically as the load changes from one service to the other.

减少净空后,开始扩展节点。始终留有安全边际,以避免陷入节点扩展速度不够快且由于资源不足而无法启动更多 Pod 的情况。

pod autoscaler 可以尝试创建新的 pod,如果没有可用资源,它们将不会启动。同理,如果一个节点被删除,任何没有被删除的 Pod 都可能因为资源不足而无法启动。

请记住,我们在部署的 resources 部分向 Kubernetes 描述了资源需求。确保那里的数字能准确反映 pod 实际所需的资源。

为了确保 pod 充分分布在不同的节点上,您可以使用 Kubernetes 亲和性和反亲和性规则。这些规则允许定义某种类型的 pod 是否应该在同一个节点上运行。

这很有用,例如,确保所有类型的 pod 均匀分布在 zone 中,或者确保两个服务始终部署在同一个节点上以减少延迟。

您可以在这篇博客文章中了解有关亲和性及其配置方式的更多信息:https://supergiant.io/blog/learn-how-to-assign-pods-to-nodes-in-kubernetes-using-nodeselector-and-affinity-features/,以及 Kubernetes 官方文档:https://kubernetes.io/docs/concepts/configuration/assign-pod-node/。
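下面是一个假设性的反亲和性草图,倾向于把带同一标签的 pod 分散到不同节点上(标签 app: frontend 沿用前文部署的示例,仅作示意):

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: frontend
            topologyKey: kubernetes.io/hostname   # 以节点为单位分散同类 pod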

一般来说,Kubernetes 和 eksctl 默认值适用于大多数应用程序。此建议仅用于高级配置。

Summary

在本章中,我们了解了如何将 Kubernetes 集群应用到生产环境中,并在云提供商(在本例中为 AWS)中创建 Kubernetes 集群。我们已经了解了如何设置 Docker 注册表、使用 EKS 创建集群以及调整现有 YAML 文件以便它们为环境做好准备。

请记住,尽管我们使用 AWS 作为示例,但我们讨论的所有元素在其他云提供商中都可用。检查他们的文档,看看它们是否更适合您。

我们还了解了如何部署 ELB 以使集群可用于公共接口,以及如何在其上启用 HTTPS 支持。

我们讨论了部署的不同要素,以使集群更具弹性,并在不中断服务的情况下顺利部署新版本:使用 HAProxy 快速启用或禁用服务,以及确保以有序的方式更换容器镜像。

我们还介绍了自动缩放如何帮助合理利用资源:通过创建更多 pod 和添加更多 AWS 实例,在需要时为集群增加资源以覆盖负载峰值,并在不需要时移除它们以避免不必要的成本。

在下一章中,我们将了解如何使用 GitOps 原则控制 Kubernetes 集群的状态,以确保对其进行的任何更改都得到适当的审查和捕获。

Questions

  1. What are the main disadvantages of managing your own Kubernetes cluster?
  2. Can you name some commercial cloud providers that have a managed Kubernetes solution?
  3. Is there any action you need to do to be able to push to an AWS Docker registry?
  4. What tool do we use to set up an EKS cluster?
  5. What are the main changes we did in this chapter to adapt the YAML files from previous chapters?
  6. Are there any Kubernetes elements that are not required in the cluster from this chapter?
  7. Why do we need to control the DNS associated with an SSL certificate?
  8. What is the difference between the liveness and readiness probes?
  9. Why are rolling updates important in production environments?
  10. What is the difference between autoscaling pods and nodes?
  11. In this chapter, we deployed our own database containers. In production this will change, as it's required to connect to an already existing external database. How would you change the configuration to do so?


Further reading