Description of problem:

Hit this while trying to verify that https://bugzilla.redhat.com/show_bug.cgi?id=1912521 was fixed upstream. In a 4.7 reliability cluster, pods and namespaces are stuck in Terminating after the cluster had been running fine for ~7 days. The cluster started on 8-Feb and the stuck pods and namespaces were noted on 16-Feb. The user workload creates/deletes projects and apps, scales apps up/down, and builds apps, among other activities. Opening a new bug to start fresh on this issue.

oc adm must-gather and oc adm node-logs of a node with a stuck pod will be provided.
The node is ip-10-0-217-250.us-east-2.compute.internal.
The Terminating pod is rails-pgsql-persistent-21-bh446 in namespace rails-pgsql-persistent-697.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-02-08-164120
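For reference, the data was gathered with commands along these lines (a sketch; the exact invocations used may have differed):

  oc adm must-gather
  oc adm node-logs ip-10-0-217-250.us-east-2.compute.internal -u kubelet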
Starting a new cluster with the build in comment 2. It will be some days before we have results.
Deleting the pod with --force was verified as a workaround.
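For the record, the force delete would look something like this against the pod named in the description (exact flags may vary):

  oc delete pod rails-pgsql-persistent-21-bh446 -n rails-pgsql-persistent-697 --force --grace-period=0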
I don't know of any other tests trying to run a simulated user workload over multiple days, so I would not expect to see this in CI.
*** Bug 1920700 has been marked as a duplicate of this bug. ***
Mike, the PRs in comment 10 are merged in 4.8. Is this test running against 4.8, to give us a signal that once the PRs are merged into 4.7 we can expect better behavior?
Will start a 4.8 run this week.
The linked patches landed 24 days ago; please advise if we are still seeing this behaviour on either 4.8 or on the 4.7 reliability cluster. If not, I will close.
Hit this issue during the run mentioned in comment 16. Details in comment 18.
This actually prevents applying a MachineConfig in SNO, as the node won't drain.
13m     Warning   FailedToDrain      node/openshift-master-0.qe3.kni.lab.eng.bos.redhat.com   5 tries: [error when waiting for pod "prometheus-adapter-59466c4d7c-jvgzb" terminating: global timeout reached: 1m30s, error when waiting for pod "prometheus-adapter-59466c4d7c-lsqp9" terminating: global timeout reached: 1m30s]
3m30s   Warning   FailedToDrain      node/openshift-master-0.qe3.kni.lab.eng.bos.redhat.com   5 tries: [error when waiting for pod "prometheus-adapter-59466c4d7c-lsqp9" terminating: global timeout reached: 1m30s, error when waiting for pod "prometheus-adapter-59466c4d7c-jvgzb" terminating: global timeout reached: 1m30s]
60s     Normal    OSUpdateStarted    node/openshift-master-0.qe3.kni.lab.eng.bos.redhat.com   Changing kernel type
60s     Normal    InClusterUpgrade   node/openshift-master-0.qe3.kni.lab.eng.bos.redhat.com   Updating from oscontainer quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a14e99259a4615edaa033913ff64468f20bf67340ad964eb62672d45422a0010
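For reference, events like the above can be pulled cluster-wide with something along the lines of:

  oc get events -A | grep -E 'FailedToDrain|OSUpdateStarted|InClusterUpgrade'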
Changed the title, as this can happen on a new deployment (i.e. upon completing a deployment).
@ehashman Do you want to access the cluster, or shall I terminate it? The kubeconfig is at the same link as in comment 8.
Ryan has dug up some logs for the SNO case, and it appears that prometheus-adapter is not handling shutdown correctly, which is causing it to get stuck in Terminating. There doesn't appear to be a signal handler in the repo: https://github.com/kubernetes-sigs/prometheus-adapter/search?q=signal.notify

I suggest filing a bug against the monitoring team, as that failure seems unrelated to this one.
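If it helps, the same check can be done locally with something along these lines (vendored dependencies excluded to avoid false positives):

  git clone https://github.com/kubernetes-sigs/prometheus-adapter
  grep -rn --exclude-dir=vendor "signal.Notify" prometheus-adapter/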
https://bugzilla.redhat.com/show_bug.cgi?id=1920700 (which was opened on SNO and was later closed as a duplicate of this bug) has the must-gather attached. Do you need anything else? IIUC, I can just re-open that bug and assign it to the monitoring team. Could you please confirm it's not a duplicate?
Confirmed: I don't think it's a duplicate, and it should go back to the monitoring team.
$ KUBECONFIG=Downloads/kubeconfig.bz1929463 oc get pod -A -o wide | grep Terminating
dancer-mysql-persistent-1070        dancer-mysql-persistent-1-4t7x9         0/1   Terminating   0   2d7h    <none>   ip-10-0-207-60.us-west-2.compute.internal    <none>   <none>
django-psql-persistent-653          django-psql-persistent-2-2sgkz          0/1   Terminating   0   5d20h   <none>   ip-10-0-207-60.us-west-2.compute.internal    <none>   <none>
django-psql-persistent-692          django-psql-persistent-3-x5sgk          0/1   Terminating   0   5d12h   <none>   ip-10-0-207-60.us-west-2.compute.internal    <none>   <none>
nodejs-postgresql-persistent-1052   nodejs-postgresql-persistent-1-b6wzk    0/1   Terminating   0   2d10h   <none>   ip-10-0-188-114.us-west-2.compute.internal   <none>   <none>
nodejs-postgresql-persistent-405    nodejs-postgresql-persistent-2-8pj6z    0/1   Terminating   0   7d23h   <none>   ip-10-0-131-45.us-west-2.compute.internal    <none>   <none>
nodejs-postgresql-persistent-740    nodejs-postgresql-persistent-5-mx6xz    0/1   Terminating   0   4d23h   <none>   ip-10-0-188-114.us-west-2.compute.internal   <none>   <none>
nodejs-postgresql-persistent-805    nodejs-postgresql-persistent-13-w6pl9   0/1   Terminating   0   4d6h    <none>   ip-10-0-207-60.us-west-2.compute.internal    <none>   <none>
rails-pgsql-persistent-215          rails-pgsql-persistent-27-ftj4g         0/1   Terminating   0   8d      <none>   ip-10-0-188-114.us-west-2.compute.internal   <none>   <none>

Looking at the journal on ip-10-0-207-60.us-west-2.compute.internal:

Mar 21 05:34:26 ip-10-0-207-60 hyperkube[1428]: W0321 05:34:26.718880 1428 pod_container_manager_linux.go:198] failed to delete cgroup paths for [kubepods burstable podd8cab9a3-5dd0-418b-8c71-5648e07e7380] : unable to destroy cgroup paths for cgroup [kubepods burstable podd8cab9a3-5dd0-418b-8c71-5648e07e7380] : Failed to remove paths: map[blkio:/sys/fs/cgroup/blkio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd8cab9a3_5dd0_418b_8c71_5648e07e7380.slice cpu:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd8cab9a3_5dd0_418b_8c71_5648e07e7380.slice cpuacct:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd8cab9a3_5dd0_418b_8c71_5648e07e7380.slice cpuset:/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd8cab9a3_5dd0_418b_8c71_5648e07e7380.slice devices:/sys/fs/cgroup/devices/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd8cab9a3_5dd0_418b_8c71_5648e07e7380.slice freezer:/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd8cab9a3_5dd0_418b_8c71_5648e07e7380.slice hugetlb:/sys/fs/cgroup/hugetlb/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd8cab9a3_5dd0_418b_8c71_5648e07e7380.slice memory:/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd8cab9a3_5dd0_418b_8c71_5648e07e7380.slice net_cls:/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd8cab9a3_5dd0_418b_8c71_5648e07e7380.slice net_prio:/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd8cab9a3_5dd0_418b_8c71_5648e07e7380.slice perf_event:/sys/fs/cgroup/perf_event/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd8cab9a3_5dd0_418b_8c71_5648e07e7380.slice pids:/sys/fs/cgroup/pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd8cab9a3_5dd0_418b_8c71_5648e07e7380.slice systemd:/sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd8cab9a3_5dd0_418b_8c71_5648e07e7380.slice]
Mar 21 05:34:26 ip-10-0-207-60 hyperkube[1428]: W0321
05:34:26.718977 1428 pod_container_manager_linux.go:198] failed to delete cgroup paths for [kubepods burstable pod1a8a3b1b-a5e5-47f8-9a89-194141d40e4d] : unable to destroy cgroup paths for cgroup [kubepods burstable pod1a8a3b1b-a5e5-47f8-9a89-194141d40e4d] : Failed to remove paths: map[blkio:/sys/fs/cgroup/blkio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1a8a3b1b_a5e5_47f8_9a89_194141d40e4d.slice cpu:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1a8a3b1b_a5e5_47f8_9a89_194141d40e4d.slice cpuacct:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1a8a3b1b_a5e5_47f8_9a89_194141d40e4d.slice cpuset:/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1a8a3b1b_a5e5_47f8_9a89_194141d40e4d.slice devices:/sys/fs/cgroup/devices/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1a8a3b1b_a5e5_47f8_9a89_194141d40e4d.slice freezer:/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1a8a3b1b_a5e5_47f8_9a89_194141d40e4d.slice hugetlb:/sys/fs/cgroup/hugetlb/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1a8a3b1b_a5e5_47f8_9a89_194141d40e4d.slice memory:/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1a8a3b1b_a5e5_47f8_9a89_194141d40e4d.slice net_cls:/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1a8a3b1b_a5e5_47f8_9a89_194141d40e4d.slice net_prio:/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1a8a3b1b_a5e5_47f8_9a89_194141d40e4d.slice perf_event:/sys/fs/cgroup/perf_event/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1a8a3b1b_a5e5_47f8_9a89_194141d40e4d.slice pids:/sys/fs/cgroup/pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1a8a3b1b_a5e5_47f8_9a89_194141d40e4d.slice systemd:/sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1a8a3b1b_a5e5_47f8_9a89_194141d40e4d.slice] Mar 21 05:34:26 ip-10-0-207-60 hyperkube[1428]: W0321 05:34:26.719213 1428 pod_container_manager_linux.go:198] failed to delete cgroup paths for [kubepods burstable pod6eb848d7-f88f-4b42-9ea3-85e30cc01144] : unable to destroy cgroup paths for cgroup [kubepods burstable pod6eb848d7-f88f-4b42-9ea3-85e30cc01144] : Failed to remove paths: map[blkio:/sys/fs/cgroup/blkio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6eb848d7_f88f_4b42_9ea3_85e30cc01144.slice cpu:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6eb848d7_f88f_4b42_9ea3_85e30cc01144.slice cpuacct:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6eb848d7_f88f_4b42_9ea3_85e30cc01144.slice cpuset:/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6eb848d7_f88f_4b42_9ea3_85e30cc01144.slice devices:/sys/fs/cgroup/devices/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6eb848d7_f88f_4b42_9ea3_85e30cc01144.slice freezer:/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6eb848d7_f88f_4b42_9ea3_85e30cc01144.slice hugetlb:/sys/fs/cgroup/hugetlb/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6eb848d7_f88f_4b42_9ea3_85e30cc01144.slice memory:/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6eb848d7_f88f_4b42_9ea3_85e30cc01144.slice 
net_cls:/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6eb848d7_f88f_4b42_9ea3_85e30cc01144.slice net_prio:/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6eb848d7_f88f_4b42_9ea3_85e30cc01144.slice perf_event:/sys/fs/cgroup/perf_event/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6eb848d7_f88f_4b42_9ea3_85e30cc01144.slice pids:/sys/fs/cgroup/pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6eb848d7_f88f_4b42_9ea3_85e30cc01144.slice systemd:/sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6eb848d7_f88f_4b42_9ea3_85e30cc01144.slice] Mar 21 05:34:26 ip-10-0-207-60 hyperkube[1428]: W0321 05:34:26.721335 1428 pod_container_manager_linux.go:198] failed to delete cgroup paths for [kubepods burstable podb5f39178-ee7a-41fd-bc6e-51ee3eae3e3e] : unable to destroy cgroup paths for cgroup [kubepods burstable podb5f39178-ee7a-41fd-bc6e-51ee3eae3e3e] : Failed to remove paths: map[blkio:/sys/fs/cgroup/blkio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb5f39178_ee7a_41fd_bc6e_51ee3eae3e3e.slice cpu:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb5f39178_ee7a_41fd_bc6e_51ee3eae3e3e.slice cpuacct:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb5f39178_ee7a_41fd_bc6e_51ee3eae3e3e.slice cpuset:/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb5f39178_ee7a_41fd_bc6e_51ee3eae3e3e.slice devices:/sys/fs/cgroup/devices/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb5f39178_ee7a_41fd_bc6e_51ee3eae3e3e.slice freezer:/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb5f39178_ee7a_41fd_bc6e_51ee3eae3e3e.slice hugetlb:/sys/fs/cgroup/hugetlb/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb5f39178_ee7a_41fd_bc6e_51ee3eae3e3e.slice memory:/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb5f39178_ee7a_41fd_bc6e_51ee3eae3e3e.slice net_cls:/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb5f39178_ee7a_41fd_bc6e_51ee3eae3e3e.slice net_prio:/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb5f39178_ee7a_41fd_bc6e_51ee3eae3e3e.slice perf_event:/sys/fs/cgroup/perf_event/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb5f39178_ee7a_41fd_bc6e_51ee3eae3e3e.slice pids:/sys/fs/cgroup/pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb5f39178_ee7a_41fd_bc6e_51ee3eae3e3e.slice systemd:/sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb5f39178_ee7a_41fd_bc6e_51ee3eae3e3e.slice] Mar 21 05:34:28 ip-10-0-207-60 hyperkube[1428]: E0321 05:34:28.539946 1428 kubelet_volumes.go:154] orphaned pod "3a0ff8ac-8e94-4c64-b93a-29023d13e0f1" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them. This is https://github.com/kubernetes/kubernetes/issues/97497 I thought we had another BZ already open for that but looks like they might have been closed. Confirmed the same symptoms on ip-10-0-188-114.us-west-2.compute.internal and ip-10-0-131-45.us-west-2.compute.internal.
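For anyone wanting to confirm the same symptoms on one of the affected nodes, the leftover cgroup and orphaned pod directories mentioned in the excerpts above can be inspected with something along these lines (pod UIDs taken from the logs above):

  oc debug node/ip-10-0-207-60.us-west-2.compute.internal
  chroot /host
  ls /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/ | grep d8cab9a3
  ls /var/lib/kubelet/pods/3a0ff8ac-8e94-4c64-b93a-29023d13e0f1/volumes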
IIRC we needed a runc bump to get to the bottom of this. Per https://github.com/openshift/kubernetes/blob/5f82cdb9d3d2018383d0989e9b8df5d7344eba40/go.mod#L401, runc in OpenShift is still on rc92; we need rc93. Once https://github.com/openshift/kubernetes/pull/641 lands, we should get this fix.
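The runc level actually present on a node can also be checked directly; a sketch (the node name is a placeholder):

  oc debug node/<node-name> -- chroot /host runc --version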
The rebase has now landed; can you please check this again after upgrading?
Reproducing in 4.8.0-0.nightly-2021-04-15-074503

oc get pod -A|grep Term
openshift-monitoring   prometheus-adapter-69cfc86595-7m5bf   0/1   Terminating   0   25m

oc logs -n openshift-monitoring prometheus-adapter-69cfc86595-7m5bf
unable to retrieve container logs for cri-o://61613eb8167bc4c33792b866304c24622b13dc2f6b1a9d2b2563fb88429cbeca
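A hedged suggestion for digging into why that pod is stuck (pod name taken from the output above): check its events and whether it is still carrying a deletion timestamp or finalizers, e.g.:

  oc describe pod -n openshift-monitoring prometheus-adapter-69cfc86595-7m5bf
  oc get pod -n openshift-monitoring prometheus-adapter-69cfc86595-7m5bf -o yaml | grep -E 'deletionTimestamp|finalizers'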
Hi Alexander, that reproduces #1920700, which does not appear to be fixed yet. Are there any pods other than prometheus-adapter that are stuck?
No issues after 6 days on 4.8.0-0.nightly-2021-04-15-074503 which contains the kubelet rebase. Will continue to monitor through this weekend.
Reproduced:

[kni@r640-u09 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-21-084059   True        False         57m     Cluster version is 4.8.0-0.nightly-2021-04-21-084059

[kni@r640-u09 ~]$ oc get pod -A|grep Term
openshift-monitoring   prometheus-adapter-6b7474585-2drrw   0/1   Terminating   0   77m
Thanks Mike! I will mark this as closed for now.
@Alexander Chuzhoy: the issue you are commenting on is not the issue being tracked in this bug. Please follow up in https://bugzilla.redhat.com/show_bug.cgi?id=1920700
*** Bug 1945739 has been marked as a duplicate of this bug. ***
Might be https://bugzilla.redhat.com/show_bug.cgi?id=1952224 - let me try to confirm.
I can't reach the cluster to investigate further, and neither pod is referenced by name in the attached logs; however, the node is showing the same symptoms as above.

8fd12299-aee7-4184-898d-28fb2c91fdea] err="unable to destroy cgroup paths for cgroup [kubepods burstable pod8fd12299-aee7-4184-898d-28fb2c91fdea] : Failed to remove paths: map[blkio:/sys/fs/cgroup/blkio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8fd12299_aee7_4184_898d_28fb2c91fdea.slice cpu:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8fd12299_aee7_4184_898d_28fb2c91fdea.slice cpuacct:/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8fd12299_aee7_4184_898d_28fb2c91fdea.slice cpuset:/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8fd12299_aee7_4184_898d_28fb2c91fdea.slice devices:/sys/fs/cgroup/devices/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8fd12299_aee7_4184_898d_28fb2c91fdea.slice freezer:/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8fd12299_aee7_4184_898d_28fb2c91fdea.slice hugetlb:/sys/fs/cgroup/hugetlb/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8fd12299_aee7_4184_898d_28fb2c91fdea.slice memory:/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8fd12299_aee7_4184_898d_28fb2c91fdea.slice net_cls:/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8fd12299_aee7_4184_898d_28fb2c91fdea.slice net_prio:/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8fd12299_aee7_4184_898d_28fb2c91fdea.slice perf_event:/sys/fs/cgroup/perf_event/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8fd12299_aee7_4184_898d_28fb2c91fdea.slice pids:/sys/fs/cgroup/pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8fd12299_aee7_4184_898d_28fb2c91fdea.slice systemd:/sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8fd12299_aee7_4184_898d_28fb2c91fdea.slice]"
...

I think it is safe to close this as a dupe of #1952224, which is on QA and should have a fix available now.

*** This bug has been marked as a duplicate of bug 1952224 ***