K8S问题排查-删除Pod后处于Terminating状态

vlambda
2021-08-18

K8S问题排查-删除Pod后处于Terminating状态

问题背景

通过kubectl delete命令删除某个业务Pod后，该Pod一直处于Terminating状态。

原因分析

根据现象看，应该是删除过程中有哪个流程异常，导致最终的删除卡在了Terminating状态。先describe看一下：

[root@node1 ~]# kubectl describe pod -n xxx cam1-78b6fc6bc8-cjsw5
// 没有发现什么异常信息，这里就不贴日志了

Event事件中未见明显异常，那就看负责删除Pod的kubelet组件日志（已过滤出关键性日志）：

I0728 16:24:57.339295    9744 kubelet.go:1904] SyncLoop (DELETE, "api"): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)"
I0728 16:24:57.339720    9744 kuberuntime_container.go:581] Killing container "docker://a73082a4a9a4cec174bb0d1c256cc11d804d93137551b9bfd3e6fa1522e98589" with 60 second grace period
I0728 16:25:18.259418    9744 kubelet.go:1904] SyncLoop (DELETE, "api"): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)"
2021-07-28 16:25:19.247 [INFO][394011] ipam.go 1173: Releasing all IPs with handle 'cam.cam1-78b6fc6bc8-cjsw5'
2021-07-28 16:25:19.254 [INFO][393585] k8s.go 498: Teardown processing complete.

// 可疑点1：没有获取到pod IP
W0728 16:25:19.303513    9744 docker_sandbox.go:384] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "cam1-78b6fc6bc8-cjsw5_cam": Unexpected command output Device "eth0" does not exist.
 with error: exit status 1
 
I0728 16:25:19.341068    9744 kubelet.go:1933] SyncLoop (PLEG): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)", event: &pleg.PodLifecycleEvent{ID:"5c948341-c030-4996-b888-f032577d97b0", Type:"ContainerDied", Data:"a73082a4a9a4cec174bb0d1c256cc11d804d93137551b9bfd3e6fa1522e98589"}
I0728 16:25:20.578095    9744 kubelet.go:1933] SyncLoop (PLEG): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)", event: &pleg.PodLifecycleEvent{ID:"5c948341-c030-4996-b888-f032577d97b0", Type:"ContainerDied", Data:"c3b992465cd2085300995066526a36665664558446ff6e1756135c3a5b6df2e6"}

I0728 16:25:20.711967    9744 kubelet_pods.go:1090] Killing unwanted pod "cam1-78b6fc6bc8-cjsw5"

// 可疑点2：Unmount失败
E0728 16:25:20.939400    9744 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/glusterfs/5c948341-c030-4996-b888-f032577d97b0-cam-pv-50g podName:5c948341-c030-4996-b888-f032577d97b0 nodeName:}" failed. No retries permitted until 2021-07-28 16:25:21.439325811 +0800 CST m=+199182.605079651 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"diag-log\" (UniqueName: \"kubernetes.io/glusterfs/5c948341-c030-4996-b888-f032577d97b0-cam-pv-50g\") pod \"5c948341-c030-4996-b888-f032577d97b0\" (UID: \"5c948341-c030-4996-b888-f032577d97b0\") : Unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g\nOutput: umount: /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g：目标忙。\n        (有些情况下通过 lsof(8) 或 fuser(1) 可以\n         找到有关使用该设备的进程的有用信息。)\n\n"

从删除Pod的日志看，有2个可疑点：

docker_sandbox.go:384打印的获取 pod IP错误；
nestedpendingoperations.go:301打印的 Unmount失败错误；

pkg/kubelet/dockershim/docker_sandbox.go:348
func (ds *dockerService) getIP(podSandboxID string, sandbox *dockertypes.ContainerJSON) string {
 if sandbox.NetworkSettings == nil {
  return ""
 }
 if networkNamespaceMode(sandbox) == runtimeapi.NamespaceMode_NODE {
  // For sandboxes using host network, the shim is not responsible for
  // reporting the IP.
  return ""
 }

 // Don't bother getting IP if the pod is known and networking isn't ready
 ready, ok := ds.getNetworkReady(podSandboxID)
 if ok && !ready {
  return ""
 }

 ip, err := ds.getIPFromPlugin(sandbox)
 if err == nil {
  return ip
 }
 
 if sandbox.NetworkSettings.IPAddress != "" {
  return sandbox.NetworkSettings.IPAddress
 }
 if sandbox.NetworkSettings.GlobalIPv6Address != "" {
  return sandbox.NetworkSettings.GlobalIPv6Address
 }

 // 错误日志在这里
 klog.Warningf("failed to read pod IP from plugin/docker: %v", err)
 return ""
}

继续看getIP方法的调用处代码，这里如果没有拿到IP，也没有什么异常，直接把空值放到PodSandboxStatusResponse中并返回；

pkg/kubelet/dockershim/docker_sandbox.go:404
func (ds *dockerService) PodSandboxStatus(ctx context.Context, req *runtimeapi.PodSandboxStatusRequest) (*runtimeapi.PodSandboxStatusResponse, error) {
 podSandboxID := req.PodSandboxId

 r, metadata, err := ds.getPodSandboxDetails(podSandboxID)
 if err != nil {
  return nil, err
 }

 // Parse the timestamps.
 createdAt, _, _, err := getContainerTimestamps(r)
 if err != nil {
  return nil, fmt.Errorf("failed to parse timestamp for container %q: %v", podSandboxID, err)
 }
 ct := createdAt.UnixNano()

 // Translate container to sandbox state.
 state := runtimeapi.PodSandboxState_SANDBOX_NOTREADY
 if r.State.Running {
  state = runtimeapi.PodSandboxState_SANDBOX_READY
 }

 // 调用getIP方法的位置
 var IP string
 if IP = ds.determinePodIPBySandboxID(podSandboxID); IP == "" {
  IP = ds.getIP(podSandboxID, r)
 }

    labels, annotations := extractLabels(r.Config.Labels)
 status := &runtimeapi.PodSandboxStatus{
  Id:          r.ID,
  State:       state,
  CreatedAt:   ct,
  Metadata:    metadata,
  Labels:      labels,
  Annotations: annotations,
  Network: &runtimeapi.PodSandboxNetworkStatus{
   Ip: IP,
  },
  Linux: &runtimeapi.LinuxPodSandboxStatus{
   Namespaces: &runtimeapi.Namespace{
    Options: &runtimeapi.NamespaceOption{
     Network: networkNamespaceMode(r),
     Pid:     pidNamespaceMode(r),
     Ipc:     ipcNamespaceMode(r),
    },
   },
  },
 }
 return &runtimeapi.PodSandboxStatusResponse{Status: status}, nil
}

到此看不出这个错误会不会中断删除流程，那就本地构造一下试试。修改上面的代码，在调用getIP方法的位置后面增加调试日志（从本地验证结果看，Pod正常删除，说明异常问题与此处无关）；

 // 调用getIP方法的位置
 var IP string
 if IP = ds.determinePodIPBySandboxID(podSandboxID); IP == "" {
  IP = ds.getIP(podSandboxID, r)
 }
 
 // 新加调试日志，如果是指定的Pod，强制将IP置空
 isTestPod := strings.Contains(metadata.GetName(), "testpod")
 if isTestPod {
  IP = ""
 }

再看第2点，这个是ERROR级别的错误，问题出在Unmount挂载点时失败。那么卸载挂载点失败会导致卸载流程提前终止吗？网上关于Pod删除流程的源码分析文章很多，我们就直接找几篇[2,3,4]看看能不能解答上面的问题。

简单总结来说，删除一个Pod的流程如下：

调用 kube-apiserver的 DELETE接口（默认带 grace-period=30s）；
第一次的删除只是更新Pod对象的元信息（ DeletionTimestamp字段和 DeletionGracePeriodSeconds字段），并没有在 Etcd中删除记录；
kubectl命令的执行会阻塞并显示正在删除Pod；
kubelet组件监听到Pod对象的更新事件，执行 killPod()方法；
kubelet组件监听到pod的删除事件，第二次调用 kube-apiserver的 DELETE接口（带 grace-period=0）
kube-apiserver的 DELETE接口去 etcd中删除Pod对象；
kubectl命令的执行返回，删除Pod成功；

从前面kubelet删除异常的日志看，确实有两次DELETE操作，并且中间有个Killing container的日志，但从上面的删除流程看，两次DELETE操作之间应该是调用killPod()方法，通过查看源码，对应的日志应该是Killing unwanted pod，所以，实际上第二次的DELETE操作并没有触发。

pkg/kubelet/kubelet_pods.go:1073
func (kl *Kubelet) podKiller() {
 killing := sets.NewString()
 // guard for the killing set
 lock := sync.Mutex{}
 for podPair := range kl.podKillingCh {
  runningPod := podPair.RunningPod
  apiPod := podPair.APIPod

  lock.Lock()
  exists := killing.Has(string(runningPod.ID))
  if !exists {
   killing.Insert(string(runningPod.ID))
  }
  lock.Unlock()

  // 这里在调用killPod方法前会打印v2级别的日志
  if !exists {
   go func(apiPod *v1.Pod, runningPod *kubecontainer.Pod) {
    klog.V(2).Infof("Killing unwanted pod %q", runningPod.Name)
    err := kl.killPod(apiPod, runningPod, nil, nil)
    if err != nil {
     klog.Errorf("Failed killing the pod %q: %v", runningPod.Name, err)
    }
    lock.Lock()
    killing.Delete(string(runningPod.ID))
    lock.Unlock()
   }(apiPod, runningPod)
  }
 }
}

怎么确认第二次的DELETE操作有没有触发呢？很简单，看代码或者实际验证都可以。这里我就在测试环境删除个Pod看下相关日志：

[root@node2 ~]# kubectl delete pod -n xxx  testpodrc2-7b749f6c9c-qh68l
pod "testpodrc2-7b749f6c9c-qh68l" deleted

// 已过滤出关键性日志
[root@node2 ~]# tailf kubelet.log
I0730 13:27:31.854178   24588 kubelet.go:1904] SyncLoop (DELETE, "api"): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)"
I0730 13:27:31.854511   24588 kuberuntime_container.go:581] Killing container "docker://e2a1cd5f2165e12cf0b46e12f9cd4d656d593f75e85c0de058e0a2f376a5557e" with 30 second grace period
I0730 13:27:32.203167   24588 kubelet.go:1904] SyncLoop (DELETE, "api"): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)"

I0730 13:27:32.993294   24588 kubelet.go:1933] SyncLoop (PLEG): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)", event: &pleg.PodLifecycleEvent{ID:"85ee282f-a843-4f10-a99c-79d447f83f2a", Type:"ContainerDied", Data:"e2a1cd5f2165e12cf0b46e12f9cd4d656d593f75e85c0de058e0a2f376a5557e"}
I0730 13:27:32.993428   24588 kubelet.go:1933] SyncLoop (PLEG): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)", event: &pleg.PodLifecycleEvent{ID:"85ee282f-a843-4f10-a99c-79d447f83f2a", Type:"ContainerDied", Data:"c6a587614976beed0cbb6e5fabf70a2d039eec6c160154fce007fe2bb1ba3b4f"}

I0730 13:27:34.072494   24588 kubelet_pods.go:1090] Killing unwanted pod "testpodrc2-7b749f6c9c-qh68l"

I0730 13:27:40.084182   24588 kubelet.go:1904] SyncLoop (DELETE, "api"): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)"
I0730 13:27:40.085735   24588 kubelet.go:1898] SyncLoop (REMOVE, "api"): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)"

对比正常和异常场景下的日志可以看出，正常的删除操作下，Killing unwanted pod日志之后会有DELETE和REMOVE的操作，这也就说明问题出在第二次DELETE操作没有触发。查看相关代码：

pkg/kubelet/status/status_manager.go:470
//kubelet组件有一个statusManager模块，它会for循环调用syncPod()方法
//方法内部有机会调用kube-apiserver的DELETE接口(强制删除，非平滑)
func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
 ...
    
    //当pod带有DeletionTimestamp字段，并且其内容器已被删除、持久卷已被删除等的多条件下，才会进入if语句内部
    if m.canBeDeleted(pod, status.status) {
        deleteOptions := metav1.NewDeleteOptions(0)
        deleteOptions.Preconditions = metav1.NewUIDPreconditions(string(pod.UID))
        
        //强制删除pod对象：kubectl delete pod podA --grace-period=0
        err = m.kubeClient.CoreV1().Pods(pod.Namespace).Delete(pod.Name, deleteOptions) 
 ...
        
    }
}

从源码可以看出，第二次DELETE操作是否触发依赖于canBeDeleted方法的校验结果，而这个方法内会检查持久卷是否已经被删除：

pkg/kubelet/status/status_manager.go:538
func (m *manager) canBeDeleted(pod *v1.Pod, status v1.PodStatus) bool {
 if pod.DeletionTimestamp == nil || kubepod.IsMirrorPod(pod) {
  return false
 }
 return m.podDeletionSafety.PodResourcesAreReclaimed(pod, status)
}

pkg/kubelet/kubelet_pods.go:900
func (kl *Kubelet) PodResourcesAreReclaimed(pod *v1.Pod, status v1.PodStatus) bool {
 ...
    
 // 这里会判断挂载卷是否已卸载
 if kl.podVolumesExist(pod.UID) && !kl.keepTerminatedPodVolumes {
  // We shouldnt delete pods whose volumes have not been cleaned up if we are not keeping terminated pod volumes
  klog.V(3).Infof("Pod %q is terminated, but some volumes have not been cleaned up", format.Pod(pod))
  return false
 }
 if kl.kubeletConfiguration.CgroupsPerQOS {
  pcm := kl.containerManager.NewPodContainerManager()
  if pcm.Exists(pod) {
   klog.V(3).Infof("Pod %q is terminated, but pod cgroup sandbox has not been cleaned up", format.Pod(pod))
   return false
  }
 }
 return true
}

结合出问题的日志，基本能确认是Unmount挂载点失败导致的异常。那么，挂载点为啥会Unmount失败？

// umount失败关键日志
Unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/cam-pv-50g\nOutput: umount: /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/cam-pv-50g：目标忙。\n        (有些情况下通过 lsof(8) 或 fuser(1) 可以\n         找到有关使用该设备的进程的有用信息。)\n\n"

仔细看卸载失败的日志，可以看到这个挂载点的后端存储是glusterfs，而目标忙一般来说是存储设备侧在使用，所以无法卸载。那就找找看是不是哪个进程使用了这个挂载目录（以下定位由负责glusterfs的同事提供）：

[root@node1 ~]# fuser -mv /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/cam-pv-50g
用户  进程号  权限  命令
root  kernel mount /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/cam-dialog-gl.uster-pv-50g
root  94549  f.... glusterfs

除了内核的mount，还有个pid=94549的glusterfs进程在占用挂载点所在目录，看看是什么进程：

[root@node1 ~]# ps -ef| grep 94549
root 94549 1 0 7月26 ? 00:01:13 /usr/sbin/glusterfs --log-level=ERR0R --log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/global-diaglog-pv/web-fddf96444-stxpf-glusterfs.log --fuse-mountopts=auto_unmount --process-name fuse --volfile-server=xxx --volfile-server=xxx --tfolfile-server=xxx --volfile-id=global-diaglog --fuse-mountopts=auto_unmount /var/lib/kubelet/pods/xxx/volumes/kubernetef.io-glusterfs/global-diaglog-pv

发现这个进程维护的是web-xxx的挂载信息，而web-xxx和cam-xxx没有任何关联。由此推断出是glusterfs管理的挂载信息发送错乱导致，具体错乱原因就转给相关负责的同事看了。