Description of problem:

While running multiple benchmarks to exercise the 250 pods/node max-pods limit on a bare-metal cluster with OVN-Kubernetes, several namespaces get stuck in the Terminating state. This testing attempts to validate that a bare-metal cluster with OVN-Kubernetes can create/run/delete pods up to the max-pods capacity of several worker nodes concurrently.

The cluster is 13 nodes (3 control-plane nodes, 10 worker nodes). Currently 5 of the 10 worker nodes are labeled for the workload (jetlag=true) and carry an additional node-role label of nodedensity to make it easier to track the pod and resource capacity of those nodes. Prior to running the benchmark, the 5 workload nodes are drained and then uncordoned so that they only host pods required to keep the node running and connected to the cluster. (This is 14 pods/node.)

Thus the capacity we test is:

  5 nodes * 250 max-pods       = 1250 total pod capacity
  5 nodes *  14 openshift pods =   70 steady-state openshift pods
  5 nodes * 234 workload pods  = 1170 workload pods

The sum of steady-state and workload pods (70 + 1170 = 1240 pods) leaves room for 10 extra pods in case a job pod decides to schedule on the workload nodes.

** Thus we are actually testing just below the 250 max-pods-per-node limit **

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-12-23-153012

How reproducible:
"Always", in the sense that it typically requires a few benchmark runs before a node ends up with many pods stuck terminating and, subsequently, namespaces stuck terminating.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Node status while multiple namespaces are stuck terminating:

# oc get no
NAME          STATUS   ROLES                AGE    VERSION
jetlag-bm10   Ready    master               2d4h   v1.22.1+6859754
jetlag-bm11   Ready    master               2d4h   v1.22.1+6859754
jetlag-bm12   Ready    master               2d4h   v1.22.1+6859754
jetlag-bm13   Ready    nodedensity,worker   2d4h   v1.22.1+6859754
jetlag-bm14   Ready    nodedensity,worker   2d4h   v1.22.1+6859754
jetlag-bm15   Ready    nodedensity,worker   2d4h   v1.22.1+6859754
jetlag-bm16   Ready    nodedensity,worker   2d4h   v1.22.1+6859754
jetlag-bm17   Ready    nodedensity,worker   2d4h   v1.22.1+6859754
jetlag-bm18   Ready    worker               2d4h   v1.22.1+6859754
jetlag-bm19   Ready    worker               2d4h   v1.22.1+6859754
jetlag-bm20   Ready    worker               2d4h   v1.22.1+6859754
jetlag-bm21   Ready    worker               2d4h   v1.22.1+6859754
jetlag-bm22   Ready    worker               2d4h   v1.22.1+6859754

Namespaces stuck terminating:

# oc get ns | head
NAME                 STATUS        AGE
assisted-installer   Active        2d4h
boatload-1037        Terminating   160m
boatload-1039        Terminating   160m
boatload-1051        Terminating   159m
boatload-1055        Terminating   159m
boatload-1060        Terminating   159m
boatload-1063        Terminating   159m
boatload-1066        Terminating   159m
boatload-1068        Terminating   159m

# oc get ns | grep "Terminating" -c
115

# oc get po -A | grep boatload | grep Terminating -c
115

# oc get po -A | grep boatload | head
boatload-1037   boatload-1037-1-boatload-774d9fb978-4sf6g   0/1   Terminating   0   161m
boatload-1039   boatload-1039-1-boatload-84d9bf6964-8xr22   0/1   Terminating   0   160m
boatload-1051   boatload-1051-1-boatload-67dd588b74-vqr9n   0/1   Terminating   0   160m
boatload-1055   boatload-1055-1-boatload-9f5d6b6c8-7qbqx    0/1   Terminating   0   160m
boatload-1060   boatload-1060-1-boatload-9bdd489c8-tqqbj    0/1   Terminating   0   160m
boatload-1063   boatload-1063-1-boatload-6d5d96cc89-lbnm9   0/1   Terminating   0   160m
boatload-1066   boatload-1066-1-boatload-6dc666fbdd-mbztf   0/1   Terminating   0   160m
boatload-1068   boatload-1068-1-boatload-569fbb78f8-zkxtr   0/1   Terminating   0   160m
boatload-113    boatload-113-1-boatload-7cc9f7887b-kv9r4    0/1   Terminating   0   175m
boatload-1130   boatload-1130-1-boatload-6476bd98dd-999w9   0/1   Terminating   0   159m
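A quick way to enumerate the stuck namespaces together with what is blocking each one is to read the NamespaceContentRemaining condition. A minimal sketch using standard oc jsonpath (my own helper, not part of the benchmark tooling):

for ns in $(oc get ns --no-headers | awk '$2 == "Terminating" {print $1}'); do
  echo -n "$ns: "
  oc get ns "$ns" -o jsonpath='{.status.conditions[?(@.type=="NamespaceContentRemaining")].message}{"\n"}'
done

In this run it consistently points at pods as the one remaining resource type, which matches the counts above (115 Terminating pods across 115 Terminating namespaces).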
Distribution of pods across nodes:

# oc get po -A -o wide | grep boatload | awk '{print $8}' | sort | uniq -c
    115 jetlag-bm13

* Just one node (jetlag-bm13) is hosting all of the pods that are stuck.

Looking at one stuck pod:

# oc get po -n boatload-113
NAME                                       READY   STATUS        RESTARTS   AGE
boatload-113-1-boatload-7cc9f7887b-kv9r4   0/1     Terminating   0          176m

# oc describe po -n boatload-113
Name:                      boatload-113-1-boatload-7cc9f7887b-kv9r4
Namespace:                 boatload-113
Priority:                  0
Node:                      jetlag-bm13/10.5.190.42
Start Time:                Fri, 07 Jan 2022 17:13:21 -0600
Labels:                    app=boatload-113-1
                           pod-template-hash=7cc9f7887b
Annotations:               k8s.ovn.org/pod-networks:
                             {"default":{"ip_addresses":["10.130.19.53/21"],"mac_address":"0a:58:0a:82:13:35","gateway_ips":["10.130.16.1"],"ip_address":"10.130.19.53/...
                           k8s.v1.cni.cncf.io/network-status:
                             [{
                                 "name": "ovn-kubernetes",
                                 "interface": "eth0",
                                 "ips": [
                                     "10.130.19.53"
                                 ],
                                 "mac": "0a:58:0a:82:13:35",
                                 "default": true,
                                 "dns": {}
                             }]
                           k8s.v1.cni.cncf.io/networks-status:
                             [{
                                 "name": "ovn-kubernetes",
                                 "interface": "eth0",
                                 "ips": [
                                     "10.130.19.53"
                                 ],
                                 "mac": "0a:58:0a:82:13:35",
                                 "default": true,
                                 "dns": {}
                             }]
                           openshift.io/scc: restricted
Status:                    Terminating (lasts 128m)
Termination Grace Period:  30s
IP:                        10.130.19.53
IPs:
  IP:           10.130.19.53
Controlled By:  ReplicaSet/boatload-113-1-boatload-7cc9f7887b
Containers:
  boatload-1:
    Container ID:   cri-o://6d677870614812541683d7ef2d9766c025eb7c575a9fc8e09925a41f5103e36b
    Image:          quay.io/redhat-performance/test-gohttp-probe:v0.0.2
    Image ID:       99b026db9534b7ede003ab26a626e1ce90e0a5d41ba2191f615601695afccfae
    Port:           8000/TCP
    Host Port:      0/TCP
    State:          Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Fri, 07 Jan 2022 17:13:24 -0600
      Finished:     Fri, 07 Jan 2022 17:37:20 -0600
    Ready:          False
    Restart Count:  0
    Environment:
      PORT:                         8000
      LISTEN_DELAY_SECONDS:         0
      LIVENESS_DELAY_SECONDS:       0
      READINESS_DELAY_SECONDS:      0
      RESPONSE_DELAY_MILLISECONDS:  0
      LIVENESS_SUCCESS_MAX:         0
      READINESS_SUCCESS_MAX:        0
    Mounts:
      /etc/cm-1 from cm-1 (rw)
      /etc/cm-2 from cm-2 (rw)
      /etc/cm-3 from cm-3 (rw)
      /etc/cm-4 from cm-4 (rw)
      /etc/cm-5 from cm-5 (rw)
      /etc/cm-6 from cm-6 (rw)
      /etc/cm-7 from cm-7 (rw)
      /etc/cm-8 from cm-8 (rw)
      /etc/secret-1 from secret-1 (rw)
      /etc/secret-2 from secret-2 (rw)
      /etc/secret-3 from secret-3 (rw)
      /etc/secret-4 from secret-4 (rw)
      /etc/secret-5 from secret-5 (rw)
      /etc/secret-6 from secret-6 (rw)
      /etc/secret-7 from secret-7 (rw)
      /etc/secret-8 from secret-8 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-b6gbq (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  cm-1:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      boatload-113-1-boatload
    Optional:  false
  cm-2:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      boatload-113-2-boatload
    Optional:  false
  cm-3:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      boatload-113-3-boatload
    Optional:  false
  cm-4:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      boatload-113-4-boatload
    Optional:  false
  cm-5:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      boatload-113-5-boatload
    Optional:  false
  cm-6:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      boatload-113-6-boatload
    Optional:  false
  cm-7:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      boatload-113-7-boatload
    Optional:  false
  cm-8:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      boatload-113-8-boatload
    Optional:  false
  secret-1:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  boatload-113-1-boatload
    Optional:    false
  secret-2:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  boatload-113-2-boatload
    Optional:    false
  secret-3:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  boatload-113-3-boatload
    Optional:    false
  secret-4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  boatload-113-4-boatload
    Optional:    false
  secret-5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  boatload-113-5-boatload
    Optional:    false
  secret-6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  boatload-113-6-boatload
    Optional:    false
  secret-7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  boatload-113-7-boatload
    Optional:    false
  secret-8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  boatload-113-8-boatload
    Optional:    false
  kube-api-access-b6gbq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              jetlag=true
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
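The describe output shows the core symptom: the container exited long ago (Terminated, exit code 2), Events is empty, yet the pod object persists far beyond its 30s grace period. A quick way to confirm that state straight from the API (a sketch reusing the pod name above):

# oc get po -n boatload-113 boatload-113-1-boatload-7cc9f7887b-kv9r4 \
    -o jsonpath='{.metadata.deletionTimestamp}{" "}{.status.containerStatuses[0].state.terminated.exitCode}{"\n"}'

A populated deletionTimestamp alongside a terminated container state means nothing is running; the pod is waiting solely on kubelet-side cleanup.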
Namespace of pod stuck:

# oc get ns boatload-113 -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c532,c54
    openshift.io/sa.scc.supplemental-groups: 1282600000/10000
    openshift.io/sa.scc.uid-range: 1282600000/10000
  creationTimestamp: "2022-01-07T23:13:20Z"
  deletionTimestamp: "2022-01-07T23:35:25Z"
  labels:
    kube-burner-job: boatload
    kube-burner-uuid: dc493669-588d-4e4f-85f0-544a76c21d4d
    kubernetes.io/metadata.name: boatload-113
  name: boatload-113
  resourceVersion: "8187751"
  uid: cc374d37-1e93-4c6a-a39d-6e8fc0ad841d
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2022-01-07T23:35:31Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2022-01-07T23:35:31Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2022-01-07T23:36:02Z"
    message: 'Failed to delete all resource types, 1 remaining: unexpected items still
      remain in namespace: boatload-113 for gvr: /v1, Resource=pods'
    reason: ContentDeletionFailed
    status: "True"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2022-01-07T23:35:31Z"
    message: 'Some resources are remaining: pods. has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2022-01-07T23:35:31Z"
    message: All content-preserving finalizers finished
    reason: ContentHasNoFinalizers
    status: "False"
    type: NamespaceFinalizersRemaining
  phase: Terminating
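The conditions above show the namespace controller itself is healthy; it is blocked solely on the one remaining pod object. A possible escape hatch (hypothetical here, not what was done in this report; the kubelet restart noted in a later comment was the actual workaround) is to force-remove the pod object so the namespace can finish, at the cost of skipping node-side cleanup:

# oc delete po -n boatload-113 boatload-113-1-boatload-7cc9f7887b-kv9r4 --grace-period=0 --force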
FWIW, I have taken a look at a node that Alex put together that had this issue happening, and I have concluded that it is not a cri-o problem. The container that ends up stuck terminating is never referenced in cri-o after it is removed, nor is there a runtime process or any container storage artifacts left. However, the kubelet seems to think the pod still has resources that need cleaning up. Tossing over to Harshal for further investigation.
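For reference, checks along these lines (run from a debug shell on the node: oc debug node/jetlag-bm13, then chroot /host) all come back empty for the stuck pod, which is what rules cri-o out. These are my sketch of the checks, not a transcript from the node:

# crictl pods | grep boatload-113
# crictl ps -a | grep boatload-113
# ls /var/lib/containers/storage/overlay-containers/ | grep 6d67787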
Hello Alex,

From the logs, I see this issue is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2038780

A sample log shows the pod in an indefinite loop while mounting its required volumes: setting up the volumes requires ConfigMaps that have already been deleted.

The following upstream links are helpful:

Issue Link: https://github.com/kubernetes/kubernetes/issues/96635

A PR with a possible resolution is also open.
PR Link: https://github.com/kubernetes/kubernetes/pull/96790

Thanks,
Ramesh
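For anyone triaging their own cluster: the loop described above typically surfaces in the kubelet journal as repeated MountVolume.SetUp failures against the already-deleted ConfigMaps/Secrets. An illustrative check (the sample line is the general shape of that kubelet error, not the exact log from this cluster):

# journalctl -u kubelet | grep 'MountVolume.SetUp failed' | tail
... MountVolume.SetUp failed for volume "cm-1" : configmap "boatload-113-1-boatload" not found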
(In reply to Sai Ramesh Vanka from comment #10)
> Hello Alex,
>
> From the logs, I see this issue is a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=2038780
>
> A sample log shows the pod in an indefinite loop while mounting its
> required volumes: setting up the volumes requires ConfigMaps that have
> already been deleted.
>
> The following upstream links are helpful:
>
> Issue Link: https://github.com/kubernetes/kubernetes/issues/96635
>
> A PR with a possible resolution is also open.
> PR Link: https://github.com/kubernetes/kubernetes/pull/96790
>
> Thanks,
> Ramesh

Thanks for sharing; it seems this has been a long-standing issue that is difficult to reproduce. FWIW, I was able to reproduce on 4.10.0-0.nightly-2022-01-15-092722, and once it reproduced, simply restarting the kubelet on the affected node resolved the pods that were stuck and allowed the namespaces to terminate. (The same thing happens when cri-o is restarted, but that is probably because restarting cri-o causes the kubelet to restart as well.)
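In case it is useful to others hitting this, the restart can be driven from outside the node; a sketch assuming jetlag-bm13 is the affected node:

# oc debug node/jetlag-bm13 -- chroot /host systemctl restart kubelet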
*** This bug has been marked as a duplicate of bug 2038780 ***