
[k8s source code analysis][kubelet] devicemanager: using a device-plugin (simulated GPU)

This article shows how a device plugin is used in practice. The articles that follow analyze how the device plugin and kubelet cooperate.

The example is a gpu-device-plugin. The machines used here have no real GPU, so several GPUs are simulated; functionally the behavior is the same.

2. Example

2.1 Current cluster state


[root@master kubectl]# ./kubectl get nodes
NAME STATUS ROLES AGE VERSION
172.21.0.12 NotReady <none> 15d v0.0.0-master+$Format:%h$
172.21.0.16 Ready <none> 15d v0.0.0-master+$Format:%h$
[root@master kubectl]#
[root@master kubectl]# ./kubectl describe node 172.21.0.12
Name: 172.21.0.12
...
Capacity:
cpu: 2
ephemeral-storage: 51473888Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3880944Ki
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 47438335103
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3778544Ki
pods: 110
...
[root@master kubectl]# ./kubectl describe node 172.21.0.16
Name: 172.21.0.16
...
Capacity:
cpu: 2
ephemeral-storage: 51473888Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8009720Ki
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 47438335103
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7907320Ki
pods: 110
...

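The focus here is on resources (Capacity and Allocatable), so unrelated output is omitted.
Capacity: the total amount of each resource on the node.
Allocatable: the amount of each resource that can be allocated to pods.
If this is not clear yet, the analysis of the device manager will give a clearer picture.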

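The output above shows that neither node in the cluster currently has any externally provided (extended) resource. One directory also deserves attention, /var/lib/kubelet/device-plugins, because it holds two important files: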

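kubelet_internal_checkpoint: stores the state of the device manager. When the device manager restarts, it loads its data from this file.
kubelet.sock: the server side of the device manager. Every device-plugin sends its registration request to this socket.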

[root@master device-plugins]# pwd
/var/lib/kubelet/device-plugins
[root@master device-plugins]# ls
DEPRECATION kubelet_internal_checkpoint kubelet.sock
[root@master device-plugins]# cat kubelet_internal_checkpoint
{"Data":{"PodDeviceEntries":null,"RegisteredDevices":{}},"Checksum":3467439661}
[root@master device-plugins]#
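
The checkpoint file is plain JSON, so it can be inspected with a few lines of Go. The following is only a minimal sketch: the struct and function names (checkpoint, podDeviceEntry, parseCheckpoint) are hypothetical, and the field layout is an assumption taken from the file content shown in this article, not from the kubelet source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podDeviceEntry mirrors one element of "PodDeviceEntries" as it appears
// in the file content shown in this article (assumed layout).
type podDeviceEntry struct {
	PodUID        string
	ContainerName string
	ResourceName  string
	DeviceIDs     []string
	AllocResp     []byte // a base64 JSON string is decoded into raw bytes
}

// checkpoint mirrors the top-level layout of kubelet_internal_checkpoint.
type checkpoint struct {
	Data struct {
		PodDeviceEntries  []podDeviceEntry
		RegisteredDevices map[string][]string
	}
	Checksum uint64
}

// parseCheckpoint decodes the JSON content of the checkpoint file.
func parseCheckpoint(raw []byte) (checkpoint, error) {
	var cp checkpoint
	err := json.Unmarshal(raw, &cp)
	return cp, err
}

func main() {
	// Content of the file before any device-plugin has registered.
	raw := []byte(`{"Data":{"PodDeviceEntries":null,"RegisteredDevices":{}},"Checksum":3467439661}`)
	cp, err := parseCheckpoint(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("entries=%d resources=%d checksum=%d\n",
		len(cp.Data.PodDeviceEntries), len(cp.Data.RegisteredDevices), cp.Checksum)
}
```

On a real node the bytes would come from reading /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint instead of the inline literal.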

2.2 Running the device-plugin


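Since there is no real GPU, I modified NVIDIA's code for discovering and monitoring GPUs. In essence that code collects the UUIDs of all GPUs on the machine and registers them with the device manager, so this article simply fabricates a few GPU UUIDs instead. (The effect is the same.)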

// k8s-device-plugin/nvidia.go
func getDevices() []*pluginapi.Device {
	n := uint(10)
	var devs []*pluginapi.Device
	for i := uint(0); i < n; i++ {
		devs = append(devs, &pluginapi.Device{
			ID:     fmt.Sprintf("%v-%v", resourceName, i),
			Health: pluginapi.Healthy,
		})
	}
	return devs
}

// k8s-device-plugin/main.go
newResourceName := os.Getenv("resourcename")
if newResourceName != "" {
	resourceName = newResourceName
}
serverSock = fmt.Sprintf("%v%v.sock", pluginapi.DevicePluginPath, resourceName)

// k8s-device-plugin/server.go
func (m *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	devs := m.devs
	name := fmt.Sprintf("NVIDIA_VISIBLE_DEVICES/%v", resourceName)
	...
	for _, req := range reqs.ContainerRequests {
		response := pluginapi.ContainerAllocateResponse{
			Envs: map[string]string{
				name: strings.Join(req.DevicesIDs, ","),
			},
		}
		...
	}

Run it:

[root@master NVIDIA]# pwd
/root/go/src/github.com/NVIDIA
[root@master NVIDIA]# git clone https://github.com/nicktming/k8s-device-plugin.git
[root@master k8s-device-plugin]# go build .
[root@master k8s-device-plugin]# export resourcename=nicktming.com/gpu
[root@master k8s-device-plugin]# ./k8s-device-plugin
2019/10/31 16:33:43 Loading NVML
2019/10/31 16:33:43 Fetching devices.
2019/10/31 16:33:43 Starting FS watcher.
2019/10/31 16:33:43 Starting OS watcher.
2019/10/31 16:33:43 Starting to serve on /var/lib/kubelet/device-plugins/gpu.sock
2019/10/31 16:33:43 Registered device plugin with Kubelet

2.3 Checking the node status

First look at the resource information of that node in the cluster:

[root@master kubectl]# ./kubectl describe node 172.21.0.16
Name: 172.21.0.16
...
Capacity:
cpu: 2
ephemeral-storage: 51473888Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8009720Ki
nicktming.com/gpu: 10
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 47438335103
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7907320Ki
nicktming.com/gpu: 10
pods: 110
...

The device-plugin that was just started on node 172.21.0.16 has registered the resource nicktming.com/gpu with the device manager inside kubelet, and the allocatable quantity is 10.

2.4 Requesting the resource

First request 8 GPUs.

[root@master kubectl]# ./kubectl get nodes
NAME STATUS ROLES AGE VERSION
172.21.0.12 Ready <none> 15d v0.0.0-master+$Format:%h$
172.21.0.16 Ready <none> 15d v0.0.0-master+$Format:%h$
[root@master kubectl]# ./kubectl get pods --all-namespaces
No resources found.
[root@master kubectl]# cat deviceplugin/pod-gpu-8.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu-8
spec:
  containers:
  - name: podtest-8
    image: nginx
    resources:
      limits:
        nicktming.com/gpu : 8
      requests:
        nicktming.com/gpu : 8
    ports:
    - containerPort: 80

[root@master kubectl]# ./kubectl apply -f deviceplugin/pod-gpu-8.yaml
pod/test-gpu-8 created

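Check the status: the 8 GPUs were allocated successfully. The pod can only run on node 172.21.0.16, because at this point it is the only node that offers the resource nicktming.com/gpu.

In a real setup, docker (nvidia docker) sees the environment variable NVIDIA_VISIBLE_DEVICES=<actual GPU UUIDs> and exposes the corresponding GPUs inside the container.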

[root@master kubectl]# ./kubectl get pods
NAME READY STATUS RESTARTS AGE
test-gpu-8 1/1 Running 0 50s
[root@master kubectl]# ./kubectl exec -it test-gpu-8 env | grep NVIDIA_VISIBLE_DEVICES
NVIDIA_VISIBLE_DEVICES/nicktming.com/gpu=nicktming.com/gpu-2,nicktming.com/gpu-3,nicktming.com/gpu-7,nicktming.com/gpu-6,nicktming.com/gpu-1,nicktming.com/gpu-5,nicktming.com/gpu-8,nicktming.com/gpu-9
[root@master kubectl]#
[root@master kubectl]# ./kubectl describe pods test-gpu-8 | grep -i node
Node: 172.21.0.16/172.21.0.16
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
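
To make the role of that environment variable more concrete, here is a small sketch of how a consumer could turn such entries back into a list of device IDs. It is not what nvidia docker actually does; the function name visibleDevices is hypothetical, and the "/<resource name>" suffix in the variable name exists only because of the modification made to the plugin in this article.

```go
package main

import (
	"fmt"
	"strings"
)

// envPrefix is the variable name that the modified plugin uses as a prefix.
const envPrefix = "NVIDIA_VISIBLE_DEVICES"

// visibleDevices scans "KEY=VALUE" entries and returns, for every variable
// whose name starts with envPrefix, the device IDs listed in its value.
func visibleDevices(environ []string) map[string][]string {
	out := map[string][]string{}
	for _, kv := range environ {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 || !strings.HasPrefix(parts[0], envPrefix) || parts[1] == "" {
			continue
		}
		out[parts[0]] = strings.Split(parts[1], ",")
	}
	return out
}

func main() {
	// Sample entries in the format printed by `env` inside the container.
	environ := []string{
		"PATH=/usr/bin",
		"NVIDIA_VISIBLE_DEVICES/nicktming.com/gpu=nicktming.com/gpu-2,nicktming.com/gpu-3",
	}
	for name, ids := range visibleDevices(environ) {
		fmt.Println(name, len(ids), ids)
	}
}
```

Inside a container the input would typically be os.Environ() rather than a hard-coded slice.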

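Look at the contents of /var/lib/kubelet/device-plugins: there is now a gpu.sock. This is the socket through which the device manager sends requests to the corresponding device-plugin. (The source-code articles analyze this in detail.)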

[root@master device-plugins]# pwd
/var/lib/kubelet/device-plugins
[root@master device-plugins]# ls
DEPRECATION gpu.sock kubelet_internal_checkpoint kubelet.sock
[root@master device-plugins]# cat kubelet_internal_checkpoint | jq .
{
"Data": {
"PodDeviceEntries": [
{
"PodUID": "94c13838-fbba-11e9-ba9e-525400d54f7e",
"ContainerName": "podtest-8",
"ResourceName": "nicktming.com/gpu",
"DeviceIDs": [
"nicktming.com/gpu-9",
"nicktming.com/gpu-2",
"nicktming.com/gpu-3",
"nicktming.com/gpu-7",
"nicktming.com/gpu-6",
"nicktming.com/gpu-1",
"nicktming.com/gpu-5",
"nicktming.com/gpu-8"
],
"AllocResp": "CroBChZOVklESUFfVklTSUJMRV9ERVZJQ0VTEp8Bbmlja3RtaW5nLmNvbS9ncHUtMixuaWNrdG1pbmcuY29tL2dwdS0zLG5pY2t0bWluZy5jb20vZ3B1LTcsbmlja3RtaW5nLmNvbS9ncHUtNixuaWNrdG1pbmcuY29tL2dwdS0xLG5pY2t0bWluZy5jb20vZ3B1LTUsbmlja3RtaW5nLmNvbS9ncHUtOCxuaWNrdG1pbmcuY29tL2dwdS05"
}
],
"RegisteredDevices": {
"nicktming.com/gpu": [
"nicktming.com/gpu-6",
"nicktming.com/gpu-7",
"nicktming.com/gpu-8",
"nicktming.com/gpu-0",
"nicktming.com/gpu-1",
"nicktming.com/gpu-2",
"nicktming.com/gpu-3",
"nicktming.com/gpu-4",
"nicktming.com/gpu-9",
"nicktming.com/gpu-5"
]
}
},
"Checksum": 3602853121
}
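
The devices that are still free on this node can be derived from the checkpoint as "registered minus allocated". The sketch below does this set difference on the ID lists shown above; freeDevices is a hypothetical helper written for this article, not a kubelet function.

```go
package main

import (
	"fmt"
	"sort"
)

// freeDevices returns the registered IDs that do not appear in allocated,
// sorted for stable output.
func freeDevices(registered, allocated []string) []string {
	used := make(map[string]bool, len(allocated))
	for _, id := range allocated {
		used[id] = true
	}
	var free []string
	for _, id := range registered {
		if !used[id] {
			free = append(free, id)
		}
	}
	sort.Strings(free)
	return free
}

func main() {
	// RegisteredDevices for nicktming.com/gpu: IDs 0 through 9.
	var registered []string
	for i := 0; i < 10; i++ {
		registered = append(registered, fmt.Sprintf("nicktming.com/gpu-%d", i))
	}
	// DeviceIDs of the PodDeviceEntries element for container podtest-8.
	var allocated []string
	for _, i := range []int{9, 2, 3, 7, 6, 1, 5, 8} {
		allocated = append(allocated, fmt.Sprintf("nicktming.com/gpu-%d", i))
	}
	fmt.Println(freeDevices(registered, allocated))
}
```

With the data above, the two remaining IDs are nicktming.com/gpu-0 and nicktming.com/gpu-4.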

Next, create a pod that requests 3 GPUs. As expected, this pod cannot start for now, because node 172.21.0.16 has only 2 GPUs left: nicktming.com/gpu-4 and nicktming.com/gpu-0.

[root@master kubectl]# cat deviceplugin/pod-gpu-3.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu-3
spec:
  containers:
  - name: podtest-3
    image: nginx
    resources:
      limits:
        nicktming.com/gpu : 3
      requests:
        nicktming.com/gpu : 3
    ports:
    - containerPort: 80

[root@master kubectl]# ./kubectl apply -f deviceplugin/pod-gpu-3.yaml
pod/test-gpu-3 created
[root@master kubectl]# ./kubectl get pods
NAME READY STATUS RESTARTS AGE
test-gpu-3 0/1 Pending 0 6s
test-gpu-8 1/1 Running 0 8m20s
[root@master kubectl]# ./kubectl describe pod test-gpu-3
Name: test-gpu-3
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 24s (x2 over 24s) default-scheduler 0/2 nodes are available: 2 Insufficient nicktming.com/gpu.

The pod stays in the Pending state and cannot be scheduled, because neither of the two nodes in the cluster can satisfy its request.
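
The arithmetic behind the "Insufficient nicktming.com/gpu" message can be sketched as a simple fit check: a node is feasible only if, for every requested resource, allocatable minus already requested is at least the requested amount. This is a deliberately simplified model, not the kube-scheduler implementation; the node type and the functions fits and feasibleNodes are hypothetical.

```go
package main

import "fmt"

// node holds, per extended resource, the allocatable amount and the amount
// already requested by pods running on the node.
type node struct {
	name        string
	allocatable map[string]int64
	requested   map[string]int64
}

// fits reports whether every requested quantity is within the node's free amount.
func fits(n node, request map[string]int64) bool {
	for res, want := range request {
		if n.allocatable[res]-n.requested[res] < want {
			return false
		}
	}
	return true
}

// feasibleNodes returns the names of all nodes on which the request fits.
func feasibleNodes(nodes []node, request map[string]int64) []string {
	var names []string
	for _, n := range nodes {
		if fits(n, request) {
			names = append(names, n.name)
		}
	}
	return names
}

func main() {
	const gpu = "nicktming.com/gpu"
	nodes := []node{
		// No device-plugin is running here, so the resource is absent.
		{name: "172.21.0.12", allocatable: map[string]int64{}, requested: map[string]int64{}},
		// 10 allocatable, 8 already taken by test-gpu-8, so 2 are free.
		{name: "172.21.0.16", allocatable: map[string]int64{gpu: 10}, requested: map[string]int64{gpu: 8}},
	}
	request := map[string]int64{gpu: 3}
	fmt.Printf("%d/%d nodes are available\n", len(feasibleNodes(nodes, request)), len(nodes))
}
```

A request that names several resources has to fit on a single node for all of them at once, which is why fits loops over the whole request map.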

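2.5 Adding the resource to the other node

Because resources are insufficient, add the resource on the other node, 172.21.0.12, by running a device-plugin for the same resource there.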

[root@worker device-plugin]# ifconfig 
...
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.21.0.12 netmask 255.255.240.0 broadcast 172.21.15.255
...
[root@worker device-plugin]# pwd
/root/worker/device-plugin
[root@worker device-plugin]# export resourcename=nicktming.com/gpu
[root@worker device-plugin]# ls
k8s-device-plugin
[root@worker device-plugin]# ./k8s-device-plugin
2019/10/31 17:00:42 Loading NVML
2019/10/31 17:00:42 Fetching devices.
2019/10/31 17:00:42 Starting FS watcher.
2019/10/31 17:00:42 Starting OS watcher.
2019/10/31 17:00:42 Starting to serve on /var/lib/kubelet/device-plugins/gpu.sock
2019/10/31 17:00:42 Registered device plugin with Kubelet

Check the status of node 172.21.0.12: it now has the resource nicktming.com/gpu as well.

[root@master kubectl]# ./kubectl describe node 172.21.0.12
Name: 172.21.0.12
...
Capacity:
cpu: 2
ephemeral-storage: 51473888Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3880944Ki
nicktming.com/gpu: 10
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 47438335103
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3778544Ki
nicktming.com/gpu: 10
pods: 110
...

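Check the pods: test-gpu-3 is now running on 172.21.0.12. For the scheduling side, see kube-scheduler: a pending pod is retried periodically, and once a node with enough free resources exists, the pod is scheduled onto it.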

[root@master kubectl]# ./kubectl get pods
NAME READY STATUS RESTARTS AGE
test-gpu-3 1/1 Running 0 10m
test-gpu-8 1/1 Running 0 18m
[root@master kubectl]# ./kubectl exec -it test-gpu-3 env | grep NVIDIA_VISIBLE_DEVICES
NVIDIA_VISIBLE_DEVICES/nicktming.com/gpu=nicktming.com/gpu-0,nicktming.com/gpu-2,nicktming.com/gpu-3
[root@master kubectl]# ./kubectl describe pod test-gpu-3 | grep -i node
Node: 172.21.0.12/172.21.0.12
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Warning FailedScheduling 3m57s (x16 over 11m) default-scheduler 0/2 nodes are available: 2 Insufficient nicktming.com/gpu.
[root@master kubectl]#

2.6 Creating another resource: rdma

[root@master k8s-device-plugin]# export resourcename=nicktming.com/rdma
[root@master k8s-device-plugin]# ./k8s-device-plugin
2019/10/31 18:02:34 Loading NVML
2019/10/31 18:02:34 Fetching devices.
2019/10/31 18:02:34 Starting FS watcher.
2019/10/31 18:02:34 Starting OS watcher.
2019/10/31 18:02:34 Starting to serve on /var/lib/kubelet/device-plugins/rdma.sock
2019/10/31 18:02:34 Registered device plugin with Kubelet


Check the status:

[root@master kubectl]# ./kubectl describe node 172.21.0.16
Name: 172.21.0.16
...
Capacity:
cpu: 2
ephemeral-storage: 51473888Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8009720Ki
nicktming.com/gpu: 10
nicktming.com/rdma: 10
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 47438335103
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7907320Ki
nicktming.com/gpu: 10
nicktming.com/rdma: 10
pods: 110
...

Now request 2 GPUs and 10 rdma devices.

[root@master kubectl]# ./kubectl get pods
NAME READY STATUS RESTARTS AGE
test-gpu-3 1/1 Running 0 82m
test-gpu-8 1/1 Running 0 90m
[root@master kubectl]# cat deviceplugin/pod-gpu2-rdma10.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu2-rdma10
spec:
  containers:
  - name: testpod-gpu2-rdma10
    image: nginx
    resources:
      limits:
        nicktming.com/gpu : 2
        nicktming.com/rdma : 10
      requests:
        nicktming.com/gpu : 2
        nicktming.com/rdma : 10
    ports:
    - containerPort: 80

[root@master kubectl]# ./kubectl apply -f deviceplugin/pod-gpu2-rdma10.yaml
pod/test-gpu2-rdma10 created
[root@master kubectl]# ./kubectl get pods
NAME READY STATUS RESTARTS AGE
test-gpu-3 1/1 Running 0 82m
test-gpu-8 1/1 Running 0 91m
test-gpu2-rdma10 1/1 Running 0 6s
[root@master kubectl]# ./kubectl exec -it test-gpu2-rdma10 env | grep NVIDIA_VISIBLE_DEVICES
NVIDIA_VISIBLE_DEVICES/nicktming.com/gpu=nicktming.com/gpu-0,nicktming.com/gpu-4
NVIDIA_VISIBLE_DEVICES/nicktming.com/rdma=nicktming.com/rdma-4,nicktming.com/rdma-2,nicktming.com/rdma-7,nicktming.com/rdma-3,nicktming.com/rdma-9,nicktming.com/rdma-5,nicktming.com/rdma-0,nicktming.com/rdma-1,nicktming.com/rdma-6,nicktming.com/rdma-8
[root@master kubectl]#

Check kubelet_internal_checkpoint:

[root@master device-plugins]# pwd
/var/lib/kubelet/device-plugins
[root@master device-plugins]# ls
DEPRECATION gpu.sock kubelet_internal_checkpoint kubelet.sock rdma.sock
[root@master device-plugins]#
[root@master device-plugins]# cat kubelet_internal_checkpoint | jq .
{
"Data": {
"PodDeviceEntries": [
{
"PodUID": "94c13838-fbba-11e9-ba9e-525400d54f7e",
"ContainerName": "podtest-8",
"ResourceName": "nicktming.com/gpu",
"DeviceIDs": [
"nicktming.com/gpu-8",
"nicktming.com/gpu-9",
"nicktming.com/gpu-2",
"nicktming.com/gpu-3",
"nicktming.com/gpu-7",
"nicktming.com/gpu-6",
"nicktming.com/gpu-1",
"nicktming.com/gpu-5"
],
"AllocResp": "CroBChZOVklESUFfVklTSUJMRV9ERVZJQ0VTEp8Bbmlja3RtaW5nLmNvbS9ncHUtMixuaWNrdG1pbmcuY29tL2dwdS0zLG5pY2t0bWluZy5jb20vZ3B1LTcsbmlja3RtaW5nLmNvbS9ncHUtNixuaWNrdG1pbmcuY29tL2dwdS0xLG5pY2t0bWluZy5jb20vZ3B1LTUsbmlja3RtaW5nLmNvbS9ncHUtOCxuaWNrdG1pbmcuY29tL2dwdS05"
},
{
"PodUID": "4d589c87-fbc7-11e9-ba9e-525400d54f7e",
"ContainerName": "testpod-gpu2-rdma10",
"ResourceName": "nicktming.com/rdma",
"DeviceIDs": [
"nicktming.com/rdma-9",
"nicktming.com/rdma-5",
"nicktming.com/rdma-0",
"nicktming.com/rdma-1",
"nicktming.com/rdma-6",
"nicktming.com/rdma-8",
"nicktming.com/rdma-3",
"nicktming.com/rdma-2",
"nicktming.com/rdma-7",
"nicktming.com/rdma-4"
],
"AllocResp": "Cv8BCilOVklESUFfVklTSUJMRV9ERVZJQ0VTL25pY2t0bWluZy5jb20vcmRtYRLRAW5pY2t0bWluZy5jb20vcmRtYS00LG5pY2t0bWluZy5jb20vcmRtYS0yLG5pY2t0bWluZy5jb20vcmRtYS03LG5pY2t0bWluZy5jb20vcmRtYS0zLG5pY2t0bWluZy5jb20vcmRtYS05LG5pY2t0bWluZy5jb20vcmRtYS01LG5pY2t0bWluZy5jb20vcmRtYS0wLG5pY2t0bWluZy5jb20vcmRtYS0xLG5pY2t0bWluZy5jb20vcmRtYS02LG5pY2t0bWluZy5jb20vcmRtYS04"
},
{
"PodUID": "4d589c87-fbc7-11e9-ba9e-525400d54f7e",
"ContainerName": "testpod-gpu2-rdma10",
"ResourceName": "nicktming.com/gpu",
"DeviceIDs": [
"nicktming.com/gpu-0",
"nicktming.com/gpu-4"
],
"AllocResp": "ClMKKE5WSURJQV9WSVNJQkxFX0RFVklDRVMvbmlja3RtaW5nLmNvbS9ncHUSJ25pY2t0bWluZy5jb20vZ3B1LTAsbmlja3RtaW5nLmNvbS9ncHUtNA=="
}
],
"RegisteredDevices": {
"nicktming.com/gpu": [
"nicktming.com/gpu-0",
"nicktming.com/gpu-4",
"nicktming.com/gpu-9",
"nicktming.com/gpu-7",
"nicktming.com/gpu-8",
"nicktming.com/gpu-1",
"nicktming.com/gpu-2",
"nicktming.com/gpu-3",
"nicktming.com/gpu-5",
"nicktming.com/gpu-6"
],
"nicktming.com/rdma": [
"nicktming.com/rdma-0",
"nicktming.com/rdma-1",
"nicktming.com/rdma-2",
"nicktming.com/rdma-7",
"nicktming.com/rdma-8",
"nicktming.com/rdma-9",
"nicktming.com/rdma-3",
"nicktming.com/rdma-4",
"nicktming.com/rdma-5",
"nicktming.com/rdma-6"
]
}
},
"Checksum": 3285376913
}
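
The AllocResp values in the checkpoint look opaque, but they appear to be the base64 form of the serialized protobuf response returned by the plugin's Allocate call. The sketch below decodes one of them by hand, assuming that field 1 of the message is the envs map (as in the v1beta1 device plugin API) and handling only length-delimited fields. The helpers readField and decodeEnvs are hypothetical; a real program would use the generated pluginapi types instead.

```go
package main

import (
	"encoding/base64"
	"encoding/binary"
	"errors"
	"fmt"
)

// readField reads one length-delimited protobuf field (wire type 2) from buf
// and returns its field number, its payload, and the remaining bytes.
func readField(buf []byte) (num uint64, payload, rest []byte, err error) {
	tag, n := binary.Uvarint(buf)
	if n <= 0 {
		return 0, nil, nil, errors.New("bad tag")
	}
	if tag&7 != 2 {
		return 0, nil, nil, fmt.Errorf("unsupported wire type %d", tag&7)
	}
	buf = buf[n:]
	size, n := binary.Uvarint(buf)
	if n <= 0 || uint64(len(buf)-n) < size {
		return 0, nil, nil, errors.New("bad length")
	}
	end := n + int(size)
	return tag >> 3, buf[n:end], buf[end:], nil
}

// decodeEnvs extracts the envs map (assumed to be field 1) from a base64
// AllocResp value. Each map entry is a nested message: key=1, value=2.
func decodeEnvs(allocResp string) (map[string]string, error) {
	raw, err := base64.StdEncoding.DecodeString(allocResp)
	if err != nil {
		return nil, err
	}
	envs := map[string]string{}
	for len(raw) > 0 {
		num, entry, rest, err := readField(raw)
		if err != nil {
			return nil, err
		}
		raw = rest
		if num != 1 {
			continue // not an envs entry
		}
		var key, val string
		for len(entry) > 0 {
			n, p, r, err := readField(entry)
			if err != nil {
				return nil, err
			}
			entry = r
			switch n {
			case 1:
				key = string(p)
			case 2:
				val = string(p)
			}
		}
		envs[key] = val
	}
	return envs, nil
}

func main() {
	// AllocResp of the nicktming.com/gpu entry for testpod-gpu2-rdma10.
	const allocResp = "ClMKKE5WSURJQV9WSVNJQkxFX0RFVklDRVMvbmlja3RtaW5nLmNvbS9ncHUSJ25pY2t0bWluZy5jb20vZ3B1LTAsbmlja3RtaW5nLmNvbS9ncHUtNA=="
	envs, err := decodeEnvs(allocResp)
	if err != nil {
		panic(err)
	}
	for k, v := range envs {
		fmt.Printf("%s=%s\n", k, v)
	}
}
```

For this entry the decoded variable matches what `env` printed inside the container: NVIDIA_VISIBLE_DEVICES/nicktming.com/gpu with the value nicktming.com/gpu-0,nicktming.com/gpu-4.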

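3. Summary

By now it should be fairly clear how a device-plugin is used. What actually happens underneath is covered in the source-code articles that follow, and the examples here prepare for that analysis.
The next articles look at how the device-plugin and the device manager work together, in two parts:
1. How the device-plugin registers resources with the device manager.
2. How a pod requests resources.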