Bug 2033278

Summary: Upgrade [4.8->4.9] failure happens because node drain failed on a storage pod
Product: OpenShift Container Platform Reporter: pmali
Component: Node    Assignee: Harshal Patil <harpatil>
Node sub component: Kubelet QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED DUPLICATE Docs Contact:
Severity: medium    
Priority: unspecified CC: aos-bugs, ehashman, harpatil, jsafrane, tsweeney
Version: 4.8   
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-01-07 04:11:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description pmali 2021-12-16 12:24:13 UTC
Description of problem:
Upgrade from 4.8 to 4.9 failed.
The failure happens because draining the node failed on a storage pod (csi-snapshot-controller-55b748c55b-hrsbq).
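For reference, the overall upgrade/drain state at the time of such a failure can be checked with standard commands (the node name below is the one that appears in the attached kubelet logs and is only illustrative):

$ oc get clusterversion
$ oc get mcp
$ oc describe node ip-10-0-66-41.us-east-2.compute.internal | grep -i taints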

Version-Release number of selected component (if applicable):

4.8.24-x86_64
4.9.0-0.nightly-2021-12-13-195043

How reproducible:
Occurred only once.

Steps to Reproduce:
1.
2.
3.

Actual results:
Upgrade failed.

Expected results:
Upgrade completes successfully.

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:

Comment 3 Jan Safranek 2021-12-17 14:04:38 UTC
From the Kubernetes perspective, csi-snapshot-controller-55b748c55b-hrsbq is just an innocent pod that does nothing special on the host; in particular, it does *not* manipulate any storage on the node. It only talks to the API server and watches and manipulates some CRDs, like any other operator (upstream it is called a "controller").
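A quick way to confirm that from the pod spec (namespace and pod name are the ones from this report; the pod may already be gone after the drain):

$ oc -n openshift-cluster-storage-operator get pod csi-snapshot-controller-55b748c55b-hrsbq -o jsonpath='{.spec.volumes}'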

In the node logs I can see that kubelet is not able to clean up some cgroups:

$ grep "controller-55b748c55b-hrsbq\|a4cd6587-d0b6-4c2a-bc7f-4700789d5456" kubelet_service.log

Dec 14 11:19:57.236489 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.236450    1447 kubelet.go:1919] SyncLoop (ADD, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:19:57.317868 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.317832    1447 kubelet.go:1935] SyncLoop (DELETE, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:19:57.364152 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.363970    1447 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "kube-api-access-znzcb" (UniqueName: "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb") pod "csi-snapshot-controller-55b748c55b-hrsbq" (UID: "a4cd6587-d0b6-4c2a-bc7f-4700789d5456")
Dec 14 11:19:57.466646 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.465930    1447 reconciler.go:269] operationExecutor.MountVolume started for volume "kube-api-access-znzcb" (UniqueName: "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb") pod "csi-snapshot-controller-55b748c55b-hrsbq" (UID: "a4cd6587-d0b6-4c2a-bc7f-4700789d5456")
Dec 14 11:19:57.534028 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.533933    1447 operation_generator.go:672] MountVolume.SetUp succeeded for volume "kube-api-access-znzcb" (UniqueName: "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb") pod "csi-snapshot-controller-55b748c55b-hrsbq" (UID: "a4cd6587-d0b6-4c2a-bc7f-4700789d5456")
Dec 14 11:19:57.570613 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.570006    1447 kuberuntime_manager.go:443] No sandbox for pod "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)" can be found. Need to start a new one
Dec 14 11:19:59.148855 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:59.135912    1447 reconciler.go:196] operationExecutor.UnmountVolume started for volume "kube-api-access-znzcb" (UniqueName: "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb") pod "a4cd6587-d0b6-4c2a-bc7f-4700789d5456" (UID: "a4cd6587-d0b6-4c2a-bc7f-4700789d5456")
Dec 14 11:19:59.222740 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:59.222704    1447 operation_generator.go:797] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb" (OuterVolumeSpecName: "kube-api-access-znzcb") pod "a4cd6587-d0b6-4c2a-bc7f-4700789d5456" (UID: "a4cd6587-d0b6-4c2a-bc7f-4700789d5456"). InnerVolumeSpecName "kube-api-access-znzcb". PluginName "kubernetes.io/projected", VolumeGidValue ""
Dec 14 11:19:59.237607 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:59.237582    1447 reconciler.go:319] Volume detached for volume "kube-api-access-znzcb" (UniqueName: "kubernetes.io/projected/a4cd6587-d0b6-4c2a-bc7f-4700789d5456-kube-api-access-znzcb") on node "ip-10-0-66-41.us-east-2.compute.internal" DevicePath ""
Dec 14 11:19:59.628645 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:59.627969    1447 kubelet.go:1935] SyncLoop (DELETE, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:20:00.648121 ip-10-0-66-41 hyperkube[1447]: E1214 11:20:00.647922    1447 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)" failed: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Dec 14 11:20:01.140452 ip-10-0-66-41 hyperkube[1447]: W1214 11:20:01.133168    1447 pod_container_manager_linux.go:198] failed to delete cgroup paths for [kubepods burstable poda4cd6587-d0b6-4c2a-bc7f-4700789d5456] : unable to destroy cgroup paths for cgroup [kubepods burstable poda4cd6587-d0b6-4c2a-bc7f-4700789d5456] : Failed to remove paths: map[blkio:/sys/fs/cgroup/blkio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice cpu:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice cpuacct:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice cpuset:/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice devices:/sys/fs/cgroup/devices/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice freezer:/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice hugetlb:/sys/fs/cgroup/hugetlb/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice memory:/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice net_cls:/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice net_prio:/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice perf_event:/sys/fs/cgroup/perf_event/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice pids:/sys/fs/cgroup/pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice systemd:/sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda4cd6587_d0b6_4c2a_bc7f_4700789d5456.slice]

[ "failed to delete cgroup paths" repeats a lot ]

From the storage perspective, the only Secret volume was mounted and unmounted without any issues. I'll let the node team dig into what else went wrong with the pod.
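Leftover volume mounts for the pod can also be ruled out directly on the node (pod UID taken from the logs above; no output means everything was unmounted):

$ oc debug node/ip-10-0-66-41.us-east-2.compute.internal -- chroot /host findmnt -r | grep a4cd6587-d0b6-4c2a-bc7f-4700789d5456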

Comment 4 Harshal Patil 2022-01-06 04:15:36 UTC
$ grep csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator kubelet_service.log 
Dec 14 11:19:57.236489 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.236450    1447 kubelet.go:1919] SyncLoop (ADD, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:19:57.317868 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.317832    1447 kubelet.go:1935] SyncLoop (DELETE, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:19:57.570613 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:57.570006    1447 kuberuntime_manager.go:443] No sandbox for pod "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)" can be found. Need to start a new one
Dec 14 11:19:59.628645 ip-10-0-66-41 hyperkube[1447]: I1214 11:19:59.627969    1447 kubelet.go:1935] SyncLoop (DELETE, "api"): "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)"
Dec 14 11:20:00.648121 ip-10-0-66-41 hyperkube[1447]: E1214 11:20:00.647922    1447 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "csi-snapshot-controller-55b748c55b-hrsbq_openshift-cluster-storage-operator(a4cd6587-d0b6-4c2a-bc7f-4700789d5456)" failed: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF


That sounds very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1952224 (quickly creating and deleting pods). Here 4.8.24 is in use, which should have all the required fixes (see https://bugzilla.redhat.com/show_bug.cgi?id=1952224#c57).
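For reference, the race described there is triggered by creating and deleting pods in quick succession; a crude loop along these lines (names and image are arbitrary, and this is only a sketch of the pattern rather than a confirmed reproducer) can be used to try to hit it:

$ for i in $(seq 1 50); do
    oc run repro-$i --image=registry.access.redhat.com/ubi8/ubi-minimal --restart=Never -- sleep 300
    oc delete pod repro-$i --grace-period=0 --force --wait=false
  done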

Comment 5 Tom Sweeney 2022-01-06 16:19:03 UTC
Not completed this sprint.

Comment 6 Harshal Patil 2022-01-07 04:11:06 UTC
Sorry, my bad, 4.8 won't have all the fixes. Only a subset of fixes was backported to 4.8 to fix an urgent bug. The original fix won't be backported to 4.8 due to its complexity and the lack of upstream support, as pointed out in [1] and [2].


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1952224#c56
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1952224#c57

*** This bug has been marked as a duplicate of bug 1952224 ***