Description of problem:
Upgrade from 4.8 to 4.9 failed. The failure happened because draining the node failed on a storage pod (csi-snapshot-controller-55b748c55b-hrsbq).

Version-Release number of selected component (if applicable):
4.8.24-x86_64
4.9.0-0.nightly-2021-12-13-195043

How reproducible:
Occurred only once.

Steps to Reproduce:
1.
2.
3.

Actual results:
Upgrade failed.

Expected results:
Upgrade successful.

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
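For triage, a hedged sketch of the kind of commands one could run to confirm which pod blocked the drain (these were not run for this report; the node, namespace, and pod names are the ones reported in this bug):

# Overall upgrade status and where it is stuck:
oc get clusterversion
# Drain failures during an upgrade usually surface in the machine-config conditions:
oc get co machine-config -o yaml
# Per-node machineconfiguration annotations/state on the affected node:
oc describe node ip-10-0-66-41.us-east-2.compute.internal
# The pod named in the drain error:
oc get pod -n openshift-cluster-storage-operator -o wide | grep csi-snapshot-controller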
From the Kubernetes perspective, csi-snapshot-controller-55b748c55b-hrsbq is just a very innocent pod that does not do anything special on the host; in particular, it does *not* manipulate any storage on the node. It only talks to the API server and watches and manipulates some CRDs, like any other operator (upstream they call it a "controller").

In the node logs I can see that kubelet is not able to clean up some cgroups:

$ grep "controller-55b748c55b-hrsbq\|a4cd6587-d0b6-4c2a-bc7f-4700789d5456" kubelet_service.log
Dec 14 11:19:57.236489 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.236450 1447 kubelet.go:1919] SyncLoop (ADD, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:19:57.317868 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.317832 1447 kubelet.go:1935] SyncLoop (DELETE, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:19:57.364152 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.363970 1447 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "kube-api-access-znzcb" (UniqueName: "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb") pod "csi-snapshot-controller-55b748c55b-hrsbq" (UID: "a4cd6587-d0b6-4c2a-bc7f-4700789d5456")
Dec 14 11:19:57.466646 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.465930 1447 reconciler.go:269] operationExecutor.MountVolume started for volume "kube-api-access-znzcb" (UniqueName: "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb") pod "csi-snapshot-controller-55b748c55b-hrsbq" (UID: "a4cd6587-d0b6-4c2a-bc7f-4700789d5456")
Dec 14 11:19:57.534028 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.533933 1447 operation_generator.go:672] MountVolume.SetUp succeeded for volume "kube-api-access-znzcb" (UniqueName: "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb") pod "csi-snapshot-controller-55b748c55b-hrsbq" (UID: "a4cd6587-d0b6-4c2a-bc7f-4700789d5456")
Dec 14 11:19:57.570613 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.570006 1447 kuberuntime_manager.go:443] No sandbox for pod "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)" can be found. Need to start a new one
Dec 14 11:19:59.148855 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:59.135912 1447 reconciler.go:196] operationExecutor.UnmountVolume started for volume "kube-api-access-znzcb" (UniqueName: "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb") pod "a4cd6587-d0b6-4c2a-bc7f-4700789d5456" (UID: "a4cd6587-d0b6-4c2a-bc7f-4700789d5456")
Dec 14 11:19:59.222740 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:59.222704 1447 operation_generator.go:797] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb" (OuterVolumeSpecName: "kube-api-access-znzcb") pod "a4cd6587-d0b6-4c2a-bc7f-4700789d5456" (UID: "a4cd6587-d0b6-4c2a-bc7f-4700789d5456"). InnerVolumeSpecName "kube-api-access-znzcb". PluginName "kubernetes.io/projected", VolumeGidValue ""
Dec 14 11:19:59.237607 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:59.237582 1447 reconciler.go:319] Volume detached for volume "kube-api-access-znzcb" (UniqueName: "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb") on node "ip-10-0-66-41.us-east-2.compute.internal" DevicePath ""
Dec 14 11:19:59.628645 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:59.627969 1447 kubelet.go:1935] SyncLoop (DELETE, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:20:00.648121 ip-10-0-66-41 hyperkube[1447]: E1214 11:20:00.647922 1447 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)" failed: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Dec 14 11:20:01.140452 ip-10-0-66-41 hyperkube[1447]: W1214 11:20:01.133168 1447 pod_container_manager_linux.go:198] failed to delete cgroup paths for [kubepods burstable poda4cd6587-d0b6-4c2a-bc7f-4700789d5456] : unable to destroy cgroup paths for cgroup [kubepods burstable poda4cd6587-d0b6-4c2a-bc7f-4700789d5456] : Failed to remove paths: map[blkio:/sys/fs/cgroup/blkio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice cpu:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice cpuacct:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice cpuset:/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice devices:/sys/fs/cgroup/devices/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice freezer:/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice hugetlb:/sys/fs/cgroup/hugetlb/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice memory:/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice net_cls:/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice net_prio:/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice perf_event:/sys/fs/cgroup/perf_event/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice pids:/sys/fs/cgroup/pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice systemd:/sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice]

[ "failed to delete cgroup paths" repeats a lot ]

From the storage perspective, the one and only Secret volume got mounted and unmounted without any issues. I'll let the node team dig into what else went wrong with the pod.
$ grep csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator kubelet_service.log
Dec 14 11:19:57.236489 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.236450 1447 kubelet.go:1919] SyncLoop (ADD, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:19:57.317868 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.317832 1447 kubelet.go:1935] SyncLoop (DELETE, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:19:57.570613 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.570006 1447 kuberuntime_manager.go:443] No sandbox for pod "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)" can be found. Need to start a new one
Dec 14 11:19:59.628645 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:59.627969 1447 kubelet.go:1935] SyncLoop (DELETE, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:20:00.648121 ip-10-0-66-41 hyperkube[1447]: E1214 11:20:00.647922 1447 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)" failed: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF

That sounds very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1952224 (quickly creating and deleting pods). Here they are using 4.8.24, which should have all the required fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1952224#c57
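For reference, a rough reproduction sketch of the "quickly create and delete pods" pattern described in bug 1952224; the pod names, image, and loop count are arbitrary assumptions, and there is no guarantee this hits the same race:

# Create a pod and delete it almost immediately, mimicking the sub-second
# ADD/DELETE gap seen in the kubelet log above.
for i in $(seq 1 50); do
  oc run race-test-$i --image=registry.access.redhat.com/ubi8/ubi-minimal \
      --restart=Never -- sleep 3600
  oc delete pod race-test-$i --wait=false
done
# Afterwards, check the kubelet journal on the node for "failed to delete cgroup paths".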
Not completed this sprint.
Sorry, my bad: 4.8 won't have all the fixes. Only a subset of the fixes was backported to 4.8 to address an urgent bug. The original fix won't be backported to 4.8 due to its complexity and lack of support upstream, as pointed out in [1] and [2].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1952224#c56
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1952224#c57

*** This bug has been marked as a duplicate of bug 1952224 ***