Description of problem:

Running node density testing on a bare metal cluster on IBM Cloud hardware, I have had a test get stuck where 3 namespaces are stuck terminating. The test rapidly deployed 4 namespaces, each with a deployment of 250 pod replicas, for a total of 1000 pods. The cluster has two worker nodes, each configured with a max-pods of 530 to allow capacity for the 1000 pods. The goal was to validate the 500 pods/node limit in additional dimensions beyond what was previously tested. Several variations of this same test ran earlier without any namespaces stuck in terminating, suggesting this is a race condition of some sort.

The namespaces reported a similar error in the yaml output under conditions:

# oc get ns jetlag-4 -o yaml
apiVersion: v1
kind: Namespace
...
status:
  conditions:
  - lastTransitionTime: "2021-09-10T14:05:46Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2021-09-10T14:05:46Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2021-09-10T14:06:09Z"
    message: 'Failed to delete all resource types, 1 remaining: unexpected items still remain in namespace: jetlag-4 for gvr: /v1, Resource=pods'
    reason: ContentDeletionFailed
    status: "True"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2021-09-10T14:05:46Z"
    message: 'Some resources are remaining: pods. has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2021-09-10T14:05:46Z"
    message: All content-preserving finalizers finished
    reason: ContentHasNoFinalizers
    status: "False"
    type: NamespaceFinalizersRemaining
  phase: Terminating

The issue appears to be a pod stuck terminating that was not cleaned up:

# oc get all -n jetlag-4 -o wide
NAME                                     READY   STATUS        RESTARTS   AGE   IP           NODE         NOMINATED NODE   READINESS GATES
pod/jetlag-4-1-jetlag-595f699975-2z7w5   0/1     Terminating   0          86m   10.130.1.2   jetlag-bm5   <none>           <none>

The deployment object appears to be missing as well. After force deleting the pod, the namespace does clean up.

Version-Release number of selected component (if applicable):
4.8.10 with OpenShiftSDN

How reproducible:
Unknown, because the test has only been run a few times with different versions

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Must-gather was taken after 2 of the 3 namespaces were cleaned up; the remaining error namespace (jetlag-4) is in the must-gather data. I selected node/kubelet as the component for this bug since this testing is centered around validating the 500 pods/node limits.
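For reference, the force delete workaround mentioned above was roughly the following (pod and namespace names are the ones from this report; forcing removal of the API object is only safe once the container is confirmed gone on the node):

# oc delete pod jetlag-4-1-jetlag-595f699975-2z7w5 -n jetlag-4 --grace-period=0 --force

A generic way to see what is still blocking a terminating namespace (standard oc/kubectl, nothing specific to this bug):

# oc api-resources --verbs=list --namespaced -o name | xargs -n 1 oc get --show-kind --ignore-not-found -n jetlag-4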
Reopening because I ran into this issue (or something very similar to it) again, now on 4.9.0-rc.4. In this case I was creating a lot of pods (500/node x 2 nodes = 1000 pods with exec probes). This time I collected a must-gather and oc node-logs of worker type, since it seems must-gather might not have an option to collect the worker node host logs that are relevant to the issue. In this particular case I had attempted to test creating 500 pods/node where the pods had an exec probe for startup/liveness/readiness; one node even went NotReady and many pods were stuck in ContainerCreating. It seems the initial NotReady node and subsequent crash left some pods stuck in terminating while trying to clean up the job that was clearly stuck.

Here is the namespace stuck in terminating:

[root@jetlag-bm0 ~]# oc get ns boatload-1
NAME         STATUS        AGE
boatload-1   Terminating   5h41m

[root@jetlag-bm0 ~]# oc get ns boatload-1 -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c175,c140
    openshift.io/sa.scc.supplemental-groups: 1030730000/10000
    openshift.io/sa.scc.uid-range: 1030730000/10000
  creationTimestamp: "2021-09-28T11:18:15Z"
  deletionTimestamp: "2021-09-28T13:59:56Z"
  labels:
    kube-burner-job: boatload
    kube-burner-uuid: 374ac83f-7e65-4f18-9b43-d74e9be18e0f
    kubernetes.io/metadata.name: boatload-1
    name: boatload-1
  name: boatload-1
  resourceVersion: "3493124"
  uid: dfab0368-58cf-45a9-9c85-001df14dffe7
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2021-09-28T14:19:22Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2021-09-28T14:00:28Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2021-09-28T14:00:28Z"
    message: 'Failed to delete all resource types, 1 remaining: unexpected items still remain in namespace: boatload-1 for gvr: /v1, Resource=pods'
    reason: ContentDeletionFailed
    status: "True"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2021-09-28T14:00:28Z"
    message: 'Some resources are remaining: pods. has 5 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2021-09-28T14:00:28Z"
    message: All content-preserving finalizers finished
    reason: ContentHasNoFinalizers
    status: "False"
    type: NamespaceFinalizersRemaining
  phase: Terminating

Pods stuck in terminating:

[root@jetlag-bm0 ~]# oc get po -n boatload-1 -o wide
NAME                                     READY   STATUS        RESTARTS   AGE     IP             NODE         NOMINATED NODE   READINESS GATES
boatload-1-1-boatload-6b89557b59-24fhq   0/1     Terminating   0          5h42m   10.130.2.148   jetlag-bm5   <none>           <none>
boatload-1-1-boatload-6b89557b59-5hd86   0/1     Terminating   0          5h42m   10.130.2.204   jetlag-bm5   <none>           <none>
boatload-1-1-boatload-6b89557b59-6xf9j   0/1     Terminating   0          5h42m   10.130.2.174   jetlag-bm5   <none>           <none>
boatload-1-1-boatload-6b89557b59-c65gd   0/1     Terminating   0          5h42m   10.130.2.175   jetlag-bm5   <none>           <none>
boatload-1-1-boatload-6b89557b59-t42tz   0/1     Terminating   0          5h42m   10.130.2.181   jetlag-bm5   <none>           <none>

These log lines from the kubelet seem most relevant:

Sep 28 15:17:31.307762 jetlag-bm5 hyperkube[2775]: E0928 15:17:31.307730 2775 pod_workers.go:787] "Error syncing pod, skipping" err="failed to \"KillPodSandbox\" for \"575bd67f-89e1-44e7-94cd-5d2b9c23a085\" with KillPodSandboxError: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"" pod="boatload-1/boatload-1-1-boatload-6b89557b59-24fhq" podUID=575bd67f-89e1-44e7-94cd-5d2b9c23a085
Sep 28 15:19:31.930437 jetlag-bm5 hyperkube[2775]: E0928 15:19:31.930388 2775 pod_workers.go:787] "Error syncing pod, skipping" err="failed to \"KillPodSandbox\" for \"575bd67f-89e1-44e7-94cd-5d2b9c23a085\" with KillPodSandboxError: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"" pod="boatload-1/boatload-1-1-boatload-6b89557b59-24fhq" podUID=575bd67f-89e1-44e7-94cd-5d2b9c23a085

I'll provide links to the logs in a separate comment.
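In case it helps anyone pulling the same data, the worker node host logs above can be collected with oc adm node-logs; the greps below are only illustrative filters for the error and pod UID shown in this comment:

# oc adm node-logs --role=worker -u kubelet | grep KillPodSandboxError
# oc adm node-logs jetlag-bm5 -u crio | grep 575bd67f-89e1-44e7-94cd-5d2b9c23a085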
I ran into this issue another time, this time without any probes and with SNO. In this case, I am creating 418 namespaces, each with one deployment and one pod. Each pod contains 4 containers of the same image running a simple golang http server. There is no workload being applied to the application running in the containers. After all pods are created and running, we run a measurement period followed by a cleanup period. The cleanup is hanging with the namespace in terminating, and 17 of the 418 pods are stuck in terminating as well.

# oc get deploy,rs,po -n boatload-1
NAME                                         READY   STATUS        RESTARTS   AGE
pod/boatload-1-1-boatload-55dcc9f695-4l8rx   0/4     Terminating   0          123m
pod/boatload-1-1-boatload-55dcc9f695-4v2tv   0/4     Terminating   0          123m
pod/boatload-1-1-boatload-55dcc9f695-5gtvw   0/4     Terminating   0          123m
pod/boatload-1-1-boatload-55dcc9f695-75qq8   0/4     Terminating   0          123m
pod/boatload-1-1-boatload-55dcc9f695-ccvbp   0/4     Terminating   0          123m
pod/boatload-1-1-boatload-55dcc9f695-gbmhj   0/4     Terminating   0          123m
pod/boatload-1-1-boatload-55dcc9f695-jq4c9   0/4     Terminating   0          123m
pod/boatload-1-1-boatload-55dcc9f695-ls8df   0/4     Terminating   0          123m
pod/boatload-1-1-boatload-55dcc9f695-plvk5   0/4     Terminating   0          123m
pod/boatload-1-1-boatload-55dcc9f695-qpps7   0/4     Terminating   0          123m
pod/boatload-1-1-boatload-55dcc9f695-sb8h9   0/4     Terminating   0          123m
pod/boatload-1-1-boatload-55dcc9f695-sblrx   0/4     Terminating   0          123m
pod/boatload-1-1-boatload-55dcc9f695-vgdjd   0/4     Terminating   0          123m
pod/boatload-1-1-boatload-55dcc9f695-vrnxq   0/4     Terminating   0          123m
pod/boatload-1-1-boatload-55dcc9f695-wt4ww   0/4     Terminating   0          123m
pod/boatload-1-1-boatload-55dcc9f695-xlmgb   0/4     Terminating   0          123m
pod/boatload-1-1-boatload-55dcc9f695-zvw2m   0/4     Terminating   0          123m

The deployment objects and replicaset objects have been cleaned up, and the namespace reports the following issue:

# oc get ns boatload-1 -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c27,c24
    openshift.io/sa.scc.supplemental-groups: 1000750000/10000
    openshift.io/sa.scc.uid-range: 1000750000/10000
  creationTimestamp: "2021-10-04T15:21:31Z"
  deletionTimestamp: "2021-10-04T15:29:47Z"
  labels:
    kube-burner-job: boatload
    kube-burner-uuid: 57643fb4-36c6-492f-9be5-aa35c876f56c
    kubernetes.io/metadata.name: boatload-1
    name: boatload-1
  name: boatload-1
  resourceVersion: "586588"
  uid: 60067bf9-2ac3-4cfe-b7fe-ec9748d11937
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2021-10-04T15:30:01Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2021-10-04T15:30:01Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2021-10-04T15:35:59Z"
    message: 'Failed to delete all resource types, 1 remaining: unexpected items still remain in namespace: boatload-1 for gvr: /v1, Resource=pods'
    reason: ContentDeletionFailed
    status: "True"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2021-10-04T15:30:01Z"
    message: 'Some resources are remaining: pods. has 17 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2021-10-04T15:30:01Z"
    message: All content-preserving finalizers finished
    reason: ContentHasNoFinalizers
    status: "False"
    type: NamespaceFinalizersRemaining
  phase: Terminating

The node is reporting as Ready:

# oc get no
NAME         STATUS   ROLES           AGE   VERSION
jetlag-bm8   Ready    master,worker   36h   v1.22.0-rc.0+af080cb

I went to grab a must-gather after this, and the must-gather displayed a large number of failures related to gathering logs:

one or more errors occurred while gathering container data for pod service-ca-745ccbb578-rhcg9: [Get "https://10.5.190.12:10250/containerLogs/openshift-service-ca/service-ca-745ccbb578-rhcg9/service-ca-controller?previous=true&timestamps=true": remote error: tls: internal error, Get "https://10.5.190.12:10250/containerLogs/openshift-service-ca/service-ca-745ccbb578-rhcg9/service-ca-controller?timestamps=true": remote error: tls: internal error]]
error: unable to download output from pod must-gather-56szd: No available strategies to copy.

I then attempted a reboot of the node, to no avail, until I checked for unapproved certificates, of which I found many. I approved those and log gathering seems restored now. It seems this might be related to certificate rotation (at least in this case).
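For reference, the certificate check and approval that restored log gathering was along these lines (the go-template filter for not-yet-approved CSRs is the standard one from the OpenShift docs):

# oc get csr
# oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve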
it sounds like it's related to certs/isn't really an issue. can this be closed?
(In reply to Peter Hunt from comment #8)
> it sounds like it's related to certs/isn't really an issue. can this be
> closed?

I am still running into this occasionally. I do not think it is fair to close it or chalk it up to a "bad cert". It seems there are situations in which a namespace can be stuck terminating, possibly through some race condition when the cluster is under high load; shouldn't this clear out when the cluster is no longer under high load? I have an active 4.9.0 SNO "cluster" with many namespaces stuck in terminating at this moment, and I am happy to share this cluster directly with a dev to take a look. There are no pending CSRs, just namespaces that are stuck in terminating with pods stuck in terminating.

# oc get ns | grep terminating -i | wc -l
69

# oc get ns | grep terminating -i
boatload-132   Terminating   5h5m
boatload-136   Terminating   5h5m
boatload-157   Terminating   5h5m
boatload-158   Terminating   5h5m
...

# oc get po -n boatload-132
NAME                                       READY   STATUS        RESTARTS   AGE
boatload-132-1-boatload-6657dc456b-b2t5b   0/3     Terminating   0          5h6m

# oc get ns boatload-132 -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c102,c14
    openshift.io/sa.scc.supplemental-groups: 1010330000/10000
    openshift.io/sa.scc.uid-range: 1010330000/10000
  creationTimestamp: "2021-10-21T10:20:39Z"
  deletionTimestamp: "2021-10-21T10:29:01Z"
  labels:
    kube-burner-job: boatload
    kube-burner-uuid: 8507923b-7c72-4f35-9ee0-d5fdc3d264ef
    kubernetes.io/metadata.name: boatload-132
    name: boatload-132
  name: boatload-132
  resourceVersion: "1569169"
  uid: ca38b6bd-80e6-42a8-b3fd-1e8fbd832a56
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2021-10-21T10:29:08Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2021-10-21T10:29:08Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2021-10-21T10:29:55Z"
    message: 'Failed to delete all resource types, 1 remaining: unexpected items still remain in namespace: boatload-132 for gvr: /v1, Resource=pods'
    reason: ContentDeletionFailed
    status: "True"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2021-10-21T10:29:08Z"
    message: 'Some resources are remaining: pods. has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2021-10-21T10:29:08Z"
    message: All content-preserving finalizers finished
    reason: ContentHasNoFinalizers
    status: "False"
    type: NamespaceFinalizersRemaining
  phase: Terminating

This namespace has been stuck in terminating for about 5 hours now. I'll collect another must-gather, grab logs when I can from the SNO, and keep this available for a dev that could take a look. Please reach out to me over slack so we can have someone take a closer look.
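For anyone triaging a cluster in this state, a small sketch (plain bash around standard oc commands; the Terminating filter and jsonpath fields are generic Namespace status fields, nothing specific to this environment) that lists each stuck namespace, its failing conditions, and the pods still blocking deletion:

for ns in $(oc get ns --no-headers | awk '$2 == "Terminating" {print $1}'); do
  echo "== ${ns} =="
  # conditions currently reporting a problem for this namespace
  oc get ns "${ns}" -o jsonpath='{range .status.conditions[?(@.status=="True")]}{.type}: {.message}{"\n"}{end}'
  # pods that are still present and blocking deletion
  oc get po -n "${ns}" --no-headers
done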
Hey Alex, I've been swamped lately and haven't had a moment to take a look at this (until now!). Can I ask you to reproduce this situation and get the crio goroutine stacks (described at https://github.com/cri-o/cri-o/blob/main/tutorials/debugging.md)? It will help me find where cri-o is stuck.
(In reply to Peter Hunt from comment #11)
> Hey Alex, I've been swamped lately and haven't had a moment to take a look
> at this (until now!). can I ask you to reproduce this situation and get the
> crio goroutine stacks (described
> https://github.com/cri-o/cri-o/blob/main/tutorials/debugging.md)? It will
> help me find where cri-o is stuck

I believe I have captured the crio goroutine stacks. I still have the cluster online if you want to take a live look at it.
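For the record, a rough sketch of how the stacks can be captured, based on my reading of the linked cri-o debugging tutorial (SIGUSR1 triggering a goroutine dump is my understanding of that tutorial, and the dump location can vary by version, so the second command just checks the crio journal and /tmp for where the stacks landed; the node name is the SNO node from the earlier comment):

# oc debug node/jetlag-bm8 -- chroot /host sh -c 'kill -USR1 $(pidof crio)'
# oc debug node/jetlag-bm8 -- chroot /host sh -c 'journalctl -u crio --no-pager | tail -n 20; ls -l /tmp | grep -i goroutine'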
PR merged.
Configured an SNO node with a max-pods of 500 and created 420 namespaces, each with one deployment and one pod. Each pod contains 4 containers, with no workload applied to the application running in the containers. Tested 3 times; could not reproduce this issue.
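For completeness, the max-pods bump used in these tests is typically applied with a KubeletConfig CR along these lines (a sketch only; the machineConfigPoolSelector label below is an assumption about which MachineConfigPool the SNO node belongs to, so match it to your own pool's labels):

# cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""  # assumed label for the SNO master pool
  kubeletConfig:
    maxPods: 500
EOF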
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056