Description of problem:
Upgrade from 4.8 to 4.9 failed. The failure happened because draining the node failed on a storage pod (csi-snapshot-controller-55b748c55b-hrsbq).

Version-Release number of selected component (if applicable):
4.8.24-x86_64
4.9.0-0.nightly-2021-12-13-195043

How reproducible:
Occurred only once.

Steps to Reproduce:
1.
2.
3.

Actual results:
Upgrade failed.

Expected results:
Upgrade successful.

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
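For triage, a hedged sketch of the kind of commands one could run to confirm which pod blocked the drain (these were not run for this report; the node, namespace, and pod names are the ones reported in this bug):

# Overall upgrade status and where it is stuck:
oc get clusterversion
# Drain failures during an upgrade usually surface in the machine-config conditions:
oc get co machine-config -o yaml
# Per-node machineconfiguration annotations/state on the affected node:
oc describe node ip-10-0-66-41.us-east-2.compute.internal
# The pod named in the drain error:
oc get pod -n openshift-cluster-storage-operator -o wide | grep csi-snapshot-controller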
From the Kubernetes perspective, csi-snapshot-controller-55b748c55b-hrsbq is just a very innocent pod that does not do anything special on the host; in particular, it does *not* manipulate any storage on the node. It only talks to the API server and watches and manipulates some CRDs, like any other operator (upstream they call it a "controller").

In the node logs I can see that kubelet is not able to clean up some cgroups:

$ grep "controller-55b748c55b-hrsbq\|a4cd6587-d0b6-4c2a-bc7f-4700789d5456" kubelet_service.log
Dec 14 11:19:57.236489 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.236450 1447 kubelet.go:1919] SyncLoop (ADD, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:19:57.317868 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.317832 1447 kubelet.go:1935] SyncLoop (DELETE, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:19:57.364152 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.363970 1447 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "kube-api-access-znzcb" (UniqueName: "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb") pod "csi-snapshot-controller-55b748c55b-hrsbq" (UID: "a4cd6587-d0b6-4c2a-bc7f-4700789d5456")
Dec 14 11:19:57.466646 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.465930 1447 reconciler.go:269] operationExecutor.MountVolume started for volume "kube-api-access-znzcb" (UniqueName: "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb") pod "csi-snapshot-controller-55b748c55b-hrsbq" (UID: "a4cd6587-d0b6-4c2a-bc7f-4700789d5456")
Dec 14 11:19:57.534028 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.533933 1447 operation_generator.go:672] MountVolume.SetUp succeeded for volume "kube-api-access-znzcb" (UniqueName: "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb") pod "csi-snapshot-controller-55b748c55b-hrsbq" (UID: "a4cd6587-d0b6-4c2a-bc7f-4700789d5456")
Dec 14 11:19:57.570613 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.570006 1447 kuberuntime_manager.go:443] No sandbox for pod "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)" can be found. Need to start a new one
Dec 14 11:19:59.148855 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:59.135912 1447 reconciler.go:196] operationExecutor.UnmountVolume started for volume "kube-api-access-znzcb" (UniqueName: "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb") pod "a4cd6587-d0b6-4c2a-bc7f-4700789d5456" (UID: "a4cd6587-d0b6-4c2a-bc7f-4700789d5456")
Dec 14 11:19:59.222740 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:59.222704 1447 operation_generator.go:797] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb" (OuterVolumeSpecName: "kube-api-access-znzcb") pod "a4cd6587-d0b6-4c2a-bc7f-4700789d5456" (UID: "a4cd6587-d0b6-4c2a-bc7f-4700789d5456"). InnerVolumeSpecName "kube-api-access-znzcb". PluginName "kubernetes.io/projected", VolumeGidValue ""
Dec 14 11:19:59.237607 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:59.237582 1447 reconciler.go:319] Volume detached for volume "kube-api-access-znzcb" (UniqueName: "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb") on node "ip-10-0-66-41.us-east-2.compute.internal" DevicePath ""
Dec 14 11:19:59.628645 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:59.627969 1447 kubelet.go:1935] SyncLoop (DELETE, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:20:00.648121 ip-10-0-66-41 hyperkube[1447]: E1214 11:20:00.647922 1447 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)" failed: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Dec 14 11:20:01.140452 ip-10-0-66-41 hyperkube[1447]: W1214 11:20:01.133168 1447 pod_container_manager_linux.go:198] failed to delete cgroup paths for [kubepods burstable poda4cd6587-d0b6-4c2a-bc7f-4700789d5456] : unable to destroy cgroup paths for cgroup [kubepods burstable poda4cd6587-d0b6-4c2a-bc7f-4700789d5456] : Failed to remove paths: map[blkio:/sys/fs/cgroup/blkio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice cpu:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice cpuacct:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice cpuset:/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice devices:/sys/fs/cgroup/devices/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice freezer:/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice hugetlb:/sys/fs/cgroup/hugetlb/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice memory:/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice net_cls:/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice net_prio:/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice perf_event:/sys/fs/cgroup/perf_event/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice pids:/sys/fs/cgroup/pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice systemd:/sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice]

[ "failed to delete cgroup paths" repeats a lot ]

From the storage perspective, the one and only Secret volume got mounted and unmounted without any issues. I'll let the node team dig into what else went wrong with the pod.
$ grep csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator kubelet_service.log
Dec 14 11:19:57.236489 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.236450 1447 kubelet.go:1919] SyncLoop (ADD, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:19:57.317868 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.317832 1447 kubelet.go:1935] SyncLoop (DELETE, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:19:57.570613 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.570006 1447 kuberuntime_manager.go:443] No sandbox for pod "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)" can be found. Need to start a new one
Dec 14 11:19:59.628645 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:59.627969 1447 kubelet.go:1935] SyncLoop (DELETE, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:20:00.648121 ip-10-0-66-41 hyperkube[1447]: E1214 11:20:00.647922 1447 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)" failed: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF

That sounds very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1952224 (quickly creating and deleting pods). Here they are using 4.8.24, which should have all the required fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1952224#c57
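For reference, a rough reproduction sketch of the "quickly create and delete pods" pattern described in bug 1952224; the pod names, image, and loop count are arbitrary assumptions, and there is no guarantee this hits the same race:

# Create a pod and delete it almost immediately, mimicking the sub-second
# ADD/DELETE gap seen in the kubelet log above.
for i in $(seq 1 50); do
  oc run race-test-$i --image=registry.access.redhat.com/ubi8/ubi-minimal \
      --restart=Never -- sleep 3600
  oc delete pod race-test-$i --wait=false
done
# Afterwards, check the kubelet journal on the node for "failed to delete cgroup paths".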
Not completed this sprint.
Sorry, my bad: 4.8 won't have all the fixes. Only a subset of the fixes was backported to 4.8 to address an urgent bug. The original fix won't be backported to 4.8 due to its complexity and lack of support upstream, as pointed out in [1] and [2].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1952224#c56
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1952224#c57

*** This bug has been marked as a duplicate of bug 1952224 ***