K8s Troubleshooting: Pod Stuck in Terminating After Deletion
Problem Background
After deleting a business Pod with kubectl delete, the Pod remains stuck in the Terminating state.
Root Cause Analysis
Judging by the symptom, some step in the deletion flow must have gone wrong, leaving the deletion stuck at Terminating. Start with a describe:
[root@node1 ~]# kubectl describe pod -n xxx cam1-78b6fc6bc8-cjsw5
// Nothing abnormal found, so the output is omitted here
No obvious anomaly in the Events either, so look at the logs of the kubelet, the component responsible for deleting the Pod (filtered down to the key lines):
I0728 16:24:57.339295 9744 kubelet.go:1904] SyncLoop (DELETE, "api"): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)"
I0728 16:24:57.339720 9744 kuberuntime_container.go:581] Killing container "docker://a73082a4a9a4cec174bb0d1c256cc11d804d93137551b9bfd3e6fa1522e98589" with 60 second grace period
I0728 16:25:18.259418 9744 kubelet.go:1904] SyncLoop (DELETE, "api"): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)"
2021-07-28 16:25:19.247 [INFO][394011] ipam.go 1173: Releasing all IPs with handle 'cam.cam1-78b6fc6bc8-cjsw5'
2021-07-28 16:25:19.254 [INFO][393585] k8s.go 498: Teardown processing complete.
// Suspicious point 1: failed to get the pod IP
W0728 16:25:19.303513 9744 docker_sandbox.go:384] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "cam1-78b6fc6bc8-cjsw5_cam": Unexpected command output Device "eth0" does not exist.
with error: exit status 1
I0728 16:25:19.341068 9744 kubelet.go:1933] SyncLoop (PLEG): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)", event: &pleg.PodLifecycleEvent{ID:"5c948341-c030-4996-b888-f032577d97b0", Type:"ContainerDied", Data:"a73082a4a9a4cec174bb0d1c256cc11d804d93137551b9bfd3e6fa1522e98589"}
I0728 16:25:20.578095 9744 kubelet.go:1933] SyncLoop (PLEG): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)", event: &pleg.PodLifecycleEvent{ID:"5c948341-c030-4996-b888-f032577d97b0", Type:"ContainerDied", Data:"c3b992465cd2085300995066526a36665664558446ff6e1756135c3a5b6df2e6"}
I0728 16:25:20.711967 9744 kubelet_pods.go:1090] Killing unwanted pod "cam1-78b6fc6bc8-cjsw5"
// Suspicious point 2: Unmount failed
E0728 16:25:20.939400 9744 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/glusterfs/5c948341-c030-4996-b888-f032577d97b0-cam-pv-50g podName:5c948341-c030-4996-b888-f032577d97b0 nodeName:}" failed. No retries permitted until 2021-07-28 16:25:21.439325811 +0800 CST m=+199182.605079651 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"diag-log\" (UniqueName: \"kubernetes.io/glusterfs/5c948341-c030-4996-b888-f032577d97b0-cam-pv-50g\") pod \"5c948341-c030-4996-b888-f032577d97b0\" (UID: \"5c948341-c030-4996-b888-f032577d97b0\") : Unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g\nOutput: umount: /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g: target is busy.\n (In some cases useful info about processes that use\n the device is found by lsof(8) or fuser(1).)\n\n"
Two suspicious points stand out in the Pod deletion logs:
- the failure to read the pod IP, logged at docker_sandbox.go:384;
- the Unmount failure, logged at nestedpendingoperations.go:301.
Start with the first one. The warning is printed at the end of getIP:
pkg/kubelet/dockershim/docker_sandbox.go:348
func (ds *dockerService) getIP(podSandboxID string, sandbox *dockertypes.ContainerJSON) string {
if sandbox.NetworkSettings == nil {
return ""
}
if networkNamespaceMode(sandbox) == runtimeapi.NamespaceMode_NODE {
// For sandboxes using host network, the shim is not responsible for
// reporting the IP.
return ""
}
// Don't bother getting IP if the pod is known and networking isn't ready
ready, ok := ds.getNetworkReady(podSandboxID)
if ok && !ready {
return ""
}
ip, err := ds.getIPFromPlugin(sandbox)
if err == nil {
return ip
}
if sandbox.NetworkSettings.IPAddress != "" {
return sandbox.NetworkSettings.IPAddress
}
if sandbox.NetworkSettings.GlobalIPv6Address != "" {
return sandbox.NetworkSettings.GlobalIPv6Address
}
// This is where the warning seen in the log is printed
klog.Warningf("failed to read pod IP from plugin/docker: %v", err)
return ""
}
Next, look at the caller of getIP. If no IP is obtained, nothing unusual happens there either: the empty value is simply put into the PodSandboxStatusResponse and returned.
pkg/kubelet/dockershim/docker_sandbox.go:404
func (ds *dockerService) PodSandboxStatus(ctx context.Context, req *runtimeapi.PodSandboxStatusRequest) (*runtimeapi.PodSandboxStatusResponse, error) {
podSandboxID := req.PodSandboxId
r, metadata, err := ds.getPodSandboxDetails(podSandboxID)
if err != nil {
return nil, err
}
// Parse the timestamps.
createdAt, _, _, err := getContainerTimestamps(r)
if err != nil {
return nil, fmt.Errorf("failed to parse timestamp for container %q: %v", podSandboxID, err)
}
ct := createdAt.UnixNano()
// Translate container to sandbox state.
state := runtimeapi.PodSandboxState_SANDBOX_NOTREADY
if r.State.Running {
state = runtimeapi.PodSandboxState_SANDBOX_READY
}
// This is where getIP is called
var IP string
if IP = ds.determinePodIPBySandboxID(podSandboxID); IP == "" {
IP = ds.getIP(podSandboxID, r)
}
labels, annotations := extractLabels(r.Config.Labels)
status := &runtimeapi.PodSandboxStatus{
Id: r.ID,
State: state,
CreatedAt: ct,
Metadata: metadata,
Labels: labels,
Annotations: annotations,
Network: &runtimeapi.PodSandboxNetworkStatus{
Ip: IP,
},
Linux: &runtimeapi.LinuxPodSandboxStatus{
Namespaces: &runtimeapi.Namespace{
Options: &runtimeapi.NamespaceOption{
Network: networkNamespaceMode(r),
Pid: pidNamespaceMode(r),
Ipc: ipcNamespaceMode(r),
},
},
},
}
return &runtimeapi.PodSandboxStatusResponse{Status: status}, nil
}
So far it is not obvious whether this error can interrupt the deletion flow, so let's reproduce it locally. Modify the code above and add a debug hook right after the call to getIP (local verification showed the Pod was still deleted normally, so the problem has nothing to do with this error):
// This is where getIP is called
var IP string
if IP = ds.determinePodIPBySandboxID(podSandboxID); IP == "" {
IP = ds.getIP(podSandboxID, r)
}
// Newly added debug code: force the IP to be empty for the designated test pod
isTestPod := strings.Contains(metadata.GetName(), "testpod")
if isTestPod {
IP = ""
}
Now for the second suspicious point. This one is an ERROR-level message, and it means unmounting the mount point failed. Would a failed unmount abort the deletion flow early? There are plenty of source-code walkthroughs of the Pod deletion flow online, so let's look at a few of them [2,3,4] and see whether they answer this question.
To summarize briefly, deleting a Pod goes through the following steps (a client-go sketch of the two DELETE calls follows the list):
- kubectl calls the kube-apiserver DELETE API (with the default grace-period=30s);
- this first delete only updates the Pod object's metadata (the DeletionTimestamp and DeletionGracePeriodSeconds fields); the record is not yet removed from Etcd;
- the kubectl command blocks and reports that the Pod is being deleted;
- the kubelet sees the update event for the Pod object and runs the killPod() method;
- the kubelet sees the pod's deletion event and calls the kube-apiserver DELETE API a second time (with grace-period=0);
- the kube-apiserver DELETE API removes the Pod object from etcd;
- the kubectl command returns: the Pod has been deleted.
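To make the two DELETE calls concrete, here is a minimal client-go sketch. It is my own illustration, not code from the article's kubelet source: the kubeconfig path is an assumption, and it uses the newer Delete signature that takes a context, whereas the 1.15-era code quoted later calls Delete(name, options). It issues the graceful delete that kubectl performs, then the forced delete (grace period 0 plus a UID precondition) that the kubelet status manager performs once cleanup has finished.

package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a reachable cluster and a kubeconfig at the default location.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Namespace and name taken from the kubelet log above.
	ns, name := "cam", "cam1-78b6fc6bc8-cjsw5"

	// Step 1: what kubectl delete does - a graceful delete. The API server only sets
	// DeletionTimestamp/DeletionGracePeriodSeconds; it does not remove the object yet.
	grace := int64(30)
	if err := client.CoreV1().Pods(ns).Delete(context.TODO(), name,
		metav1.DeleteOptions{GracePeriodSeconds: &grace}); err != nil {
		panic(err)
	}

	// Step 2: what the kubelet status manager does after containers, volumes and the
	// pod cgroup are gone - a forced delete (grace period 0), guarded by the pod UID
	// so it can never remove a newer pod that reuses the same name.
	pod, err := client.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
	if err == nil {
		zero := int64(0)
		_ = client.CoreV1().Pods(ns).Delete(context.TODO(), name, metav1.DeleteOptions{
			GracePeriodSeconds: &zero,
			Preconditions:      metav1.NewUIDPreconditions(string(pod.UID)),
		})
	}
}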
Looking back at the kubelet logs from the failed deletion, there are indeed two DELETE operations with a Killing container line between them. But according to the flow above, the killPod() method should run between the two DELETE operations, and the source shows that the corresponding log line is Killing unwanted pod. So the second DELETE operation was in fact never triggered.
pkg/kubelet/kubelet_pods.go:1073
func (kl *Kubelet) podKiller() {
killing := sets.NewString()
// guard for the killing set
lock := sync.Mutex{}
for podPair := range kl.podKillingCh {
runningPod := podPair.RunningPod
apiPod := podPair.APIPod
lock.Lock()
exists := killing.Has(string(runningPod.ID))
if !exists {
killing.Insert(string(runningPod.ID))
}
lock.Unlock()
// A v2-level log line is printed here, right before killPod is called
if !exists {
go func(apiPod *v1.Pod, runningPod *kubecontainer.Pod) {
klog.V(2).Infof("Killing unwanted pod %q", runningPod.Name)
err := kl.killPod(apiPod, runningPod, nil, nil)
if err != nil {
klog.Errorf("Failed killing the pod %q: %v", runningPod.Name, err)
}
lock.Lock()
killing.Delete(string(runningPod.ID))
lock.Unlock()
}(apiPod, runningPod)
}
}
}
How can we confirm whether the second DELETE operation was triggered? Easy: read the code, or simply verify it in practice. Here I delete a Pod in a test environment and watch the relevant logs:
[root@node2 ~]# kubectl delete pod -n xxx testpodrc2-7b749f6c9c-qh68l
pod "testpodrc2-7b749f6c9c-qh68l" deleted
// filtered down to the key lines
[root@node2 ~]# tailf kubelet.log
I0730 13:27:31.854178 24588 kubelet.go:1904] SyncLoop (DELETE, "api"): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)"
I0730 13:27:31.854511 24588 kuberuntime_container.go:581] Killing container "docker://e2a1cd5f2165e12cf0b46e12f9cd4d656d593f75e85c0de058e0a2f376a5557e" with 30 second grace period
I0730 13:27:32.203167 24588 kubelet.go:1904] SyncLoop (DELETE, "api"): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)"
I0730 13:27:32.993294 24588 kubelet.go:1933] SyncLoop (PLEG): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)", event: &pleg.PodLifecycleEvent{ID:"85ee282f-a843-4f10-a99c-79d447f83f2a", Type:"ContainerDied", Data:"e2a1cd5f2165e12cf0b46e12f9cd4d656d593f75e85c0de058e0a2f376a5557e"}
I0730 13:27:32.993428 24588 kubelet.go:1933] SyncLoop (PLEG): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)", event: &pleg.PodLifecycleEvent{ID:"85ee282f-a843-4f10-a99c-79d447f83f2a", Type:"ContainerDied", Data:"c6a587614976beed0cbb6e5fabf70a2d039eec6c160154fce007fe2bb1ba3b4f"}
I0730 13:27:34.072494 24588 kubelet_pods.go:1090] Killing unwanted pod "testpodrc2-7b749f6c9c-qh68l"
I0730 13:27:40.084182 24588 kubelet.go:1904] SyncLoop (DELETE, "api"): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)"
I0730 13:27:40.085735 24588 kubelet.go:1898] SyncLoop (REMOVE, "api"): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)"
Comparing the logs of the normal and abnormal cases shows that in a normal deletion, the Killing unwanted pod line is followed by DELETE and REMOVE operations. This confirms that the problem is the second DELETE operation never being triggered. The relevant code:
pkg/kubelet/status/status_manager.go:470
// The kubelet has a statusManager module that calls syncPod() in a loop.
// Inside, it may call the kube-apiserver DELETE API (a force delete, not a graceful one).
func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
...
// The if branch is entered only when the pod has a DeletionTimestamp and several other conditions hold: its containers have been removed, its persistent volumes have been cleaned up, etc.
if m.canBeDeleted(pod, status.status) {
deleteOptions := metav1.NewDeleteOptions(0)
deleteOptions.Preconditions = metav1.NewUIDPreconditions(string(pod.UID))
// Force-delete the pod object, equivalent to: kubectl delete pod podA --grace-period=0
err = m.kubeClient.CoreV1().Pods(pod.Namespace).Delete(pod.Name, deleteOptions)
...
}
}
The source shows that whether the second DELETE operation fires depends on the result of the canBeDeleted check, and that method verifies, among other things, whether the persistent volumes have been cleaned up:
pkg/kubelet/status/status_manager.go:538
func (m *manager) canBeDeleted(pod *v1.Pod, status v1.PodStatus) bool {
if pod.DeletionTimestamp == nil || kubepod.IsMirrorPod(pod) {
return false
}
return m.podDeletionSafety.PodResourcesAreReclaimed(pod, status)
}
pkg/kubelet/kubelet_pods.go:900
func (kl *Kubelet) PodResourcesAreReclaimed(pod *v1.Pod, status v1.PodStatus) bool {
...
// This checks whether the pod's mounted volumes have been unmounted
if kl.podVolumesExist(pod.UID) && !kl.keepTerminatedPodVolumes {
// We shouldnt delete pods whose volumes have not been cleaned up if we are not keeping terminated pod volumes
klog.V(3).Infof("Pod %q is terminated, but some volumes have not been cleaned up", format.Pod(pod))
return false
}
if kl.kubeletConfiguration.CgroupsPerQOS {
pcm := kl.containerManager.NewPodContainerManager()
if pcm.Exists(pod) {
klog.V(3).Infof("Pod %q is terminated, but pod cgroup sandbox has not been cleaned up", format.Pod(pod))
return false
}
}
return true
}
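This check is exactly why a stuck umount blocks the whole deletion: as long as the pod's volume directory still has entries, canBeDeleted keeps returning false and the force DELETE is never sent. The following standalone sketch is my own illustration, not kubelet code (the real podVolumesExist also consults the volume manager's in-memory state); it shows roughly what the on-disk part of that check looks at, and is handy for confirming which volume is left behind on the node.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// leftoverPodVolumes lists entries under /var/lib/kubelet/pods/<uid>/volumes/<plugin>/<name>.
// For the failing pod above, the glusterfs mount that refused to unmount shows up here,
// e.g. .../volumes/kubernetes.io~glusterfs/cam-pv-50g.
func leftoverPodVolumes(kubeletRoot, podUID string) ([]string, error) {
	volumesDir := filepath.Join(kubeletRoot, "pods", podUID, "volumes")
	plugins, err := os.ReadDir(volumesDir)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil // already cleaned up
		}
		return nil, err
	}
	var leftovers []string
	for _, plugin := range plugins {
		vols, err := os.ReadDir(filepath.Join(volumesDir, plugin.Name()))
		if err != nil {
			return nil, err
		}
		for _, vol := range vols {
			leftovers = append(leftovers, filepath.Join(volumesDir, plugin.Name(), vol.Name()))
		}
	}
	return leftovers, nil
}

func main() {
	// Usage: leftover-volumes <pod-uid>, e.g. the UID from the kubelet log
	// (5c948341-c030-4996-b888-f032577d97b0). Run on the node as root.
	vols, err := leftoverPodVolumes("/var/lib/kubelet", os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, v := range vols {
		fmt.Println(v)
	}
}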
Combining this with the logs from the failure, we can be fairly confident the anomaly was caused by the failed Unmount of the mount point. So why would the Unmount fail?
// key log line for the umount failure
Unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/cam-pv-50g\nOutput: umount: /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/cam-pv-50g: target is busy.\n (In some cases useful info about processes that use\n the device is found by lsof(8) or fuser(1).)\n\n"
Looking closely at the unmount failure log, we can see that the backend storage for this mount point is glusterfs, and "target is busy" usually means something on the storage side is still using the device, so it cannot be unmounted. Let's find out whether some process is using the mount directory (the following investigation was provided by the colleague responsible for glusterfs):
[root@node1 ~]# fuser -mv /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/cam-pv-50g
USER        PID ACCESS COMMAND
root kernel mount /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/cam-dialog-gluster-pv-50g
root 94549 f.... glusterfs
Besides the kernel mount, there is a glusterfs process with pid=94549 holding the directory of the mount point. Let's see what that process is:
[root@node1 ~]# ps -ef| grep 94549
root 94549 1 0 Jul26 ? 00:01:13 /usr/sbin/glusterfs --log-level=ERROR --log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/global-diaglog-pv/web-fddf96444-stxpf-glusterfs.log --fuse-mountopts=auto_unmount --process-name fuse --volfile-server=xxx --volfile-server=xxx --volfile-server=xxx --volfile-id=global-diaglog --fuse-mountopts=auto_unmount /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/global-diaglog-pv
This process turns out to maintain the mount for web-xxx, yet web-xxx has nothing to do with cam-xxx. From this we can infer that the mount information managed by glusterfs got mixed up; working out exactly why was handed over to the colleagues responsible for glusterfs.
Solution
The analysis shows that the Pod deletion hung because a shared storage volume failed to unmount cleanly; this is not a K8s problem.
References
[1] https://github.com/kubernetes/kubernetes/tree/v1.15.12
[2] https://blog.csdn.net/nangonghen/article/details/109305635
[3] https://juejin.cn/post/6844903842321039368
[4] https://developer.aliyun.com/article/608727