Created attachment 1835540 [details]
must gather and node log

Description of problem:

Pod deletion is slow when the cluster is under load. The namespace reports NamespaceContentRemaining and NamespaceDeletionContentFailure: "unexpected items still remain in namespace".

Version-Release number of selected component (if applicable):
4.9.0-rc.7

How reproducible:
Always when the cluster is under load

Steps to Reproduce:

1. In the 1st terminal, run the command below as one user (e.g. kubeadmin), and use the clusterbuster tool to put extra load on the cluster:

for i in {0..200}; do oc new-project test-$i --description=test-debug-$i --display-name test-debug-$i; done

2. In the 2nd terminal, run the command below as a different user (e.g. testuser-0):

oc new-project 3pndd; oc new-app --docker-image=quay.io/openshifttest/hello-openshift:openshift; oc delete projects 3pndd; oc get projects 3pndd --output=yaml

3. In the 3rd terminal, as the same user (e.g. testuser-0) who ran the command above, use the admin context to monitor pod deletion:

NS=$(oc --context=admin get ns -o jsonpath='{.items[?(@.metadata.annotations.openshift\.io/requester=="testuser-0")].metadata.name}')
while true; do
  date
  echo $NS
  oc --context=admin get ns $NS
  oc --context=admin get ns $NS -o yaml | grep -A 1000 "^status"
  oc --context=admin get po -n $NS
  oc --context=admin get po -n $NS -o yaml
  sleep 1
done

Actual results:

Pod deletion is slow, and `oc get ns <PROJECT_NAME> -o yaml` shows "unexpected items still remain in namespace".

Timestamp when the pod starts deleting:

Wed Oct 20 21:05:11 IST 2021
3pndd
NAME    STATUS        AGE
3pndd   Terminating   25s

oc --context=admin get ns <project name> -o yaml | grep -A 1000 "^status"

status:
  conditions:
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: 'Some resources are remaining: pods. has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: All content-preserving finalizers finished
    reason: ContentHasNoFinalizers
    status: "False"
    type: NamespaceFinalizersRemaining
  phase: Terminating

After 10 minutes the pod is still deleting:

Wed Oct 20 21:13:54 IST 2021
3pndd
NAME    STATUS        AGE
3pndd   Terminating   9m8s

status:
  conditions:
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2021-10-20T15:35:39Z"
    message: 'Failed to delete all resource types, 1 remaining: unexpected items still remain in namespace: 3pndd for gvr: /v1, Resource=pods'
    reason: ContentDeletionFailed
    status: "True"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: 'Some resources are remaining: pods. has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: All content-preserving finalizers finished
    reason: ContentHasNoFinalizers
    status: "False"
    type: NamespaceFinalizersRemaining
  phase: Terminating

Expected results:

This error should not appear, and pod deletion should be fast.

Additional info:

Must-gather and node logs: https://drive.google.com/file/d/1Zs0JG0V3ohPt89UU9c_HrW4V14ja4cbI/view?usp=sharing

Slack discussion with Mike Fiedler for reference: https://coreos.slack.com/archives/CS05TR7BK/p1634746479498400?thread_ts=1634739477.494700&cid=CS05TR7BK
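While the namespace is Terminating, a quick check like the following can show which pod is blocking deletion and whether a finalizer or deletion grace period is involved (illustrative only; the namespace name 3pndd is taken from the reproduction steps above):

oc --context=admin get po -n 3pndd -o jsonpath='{range .items[*]}{.metadata.name}{"\tdeletionTimestamp="}{.metadata.deletionTimestamp}{"\tgracePeriod="}{.metadata.deletionGracePeriodSeconds}{"\tfinalizers="}{.metadata.finalizers}{"\n"}{end}'
oc --context=admin get events -n 3pndd --sort-by=.lastTimestamp | tail -20

If deletionTimestamp is set, no finalizers are present, and the pod still lingers for minutes, that usually points at the node side (kubelet/CRI-O) rather than the API server.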
Is this reproducible with the released 4.9?
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2003206. Can we get the CRI-O goroutines to confirm? (see https://github.com/cri-o/cri-o/blob/main/tutorials/debugging.md)
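For reference, a rough sketch of collecting the goroutine stacks on the affected node (this assumes the SIGUSR1 stack-dump mechanism described in the linked tutorial; follow the tutorial for the exact dump location):

oc debug node/<worker-node>
chroot /host
# ask crio to dump its goroutine stacks
kill -USR1 $(pidof crio)
# collect the recent crio journal as well, in case the stacks land there
journalctl -u crio --since "10 minutes ago" > /tmp/crio-after-usr1.log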
@Peter Hunt, I shared the must-gather logs. Are the must-gather logs not useful?
Peter, yes, it is reproducible with the released 4.9.
must-gather doesn't yet contain the information requested in https://bugzilla.redhat.com/show_bug.cgi?id=2016339#c2
Not completed this sprint.
*** This bug has been marked as a duplicate of bug 2021432 ***
Harsha, I can still reproduce the issue on 4.10, and it is not a duplicate.

oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-17-182202   True        False         136m    Cluster version is 4.10.0-0.nightly-2022-01-17-182202

Tue Jan 18 12:08:51 IST 2022
3pndd
NAME    STATUS        AGE
3pndd   Terminating   2m24s

status:
  conditions:
  - lastTransitionTime: "2022-01-18T06:36:40Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2022-01-18T06:36:40Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2022-01-18T06:37:20Z"
    message: 'Failed to delete all resource types, 1 remaining: unexpected items still remain in namespace: 3pndd for gvr: /v1, Resource=pods'
    reason: ContentDeletionFailed
    status: "True"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2022-01-18T06:36:40Z"
    message: 'Some resources are remaining: pods. has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2022-01-18T06:36:40Z"
    message: All content-preserving finalizers finished
    reason: ContentHasNoFinalizers
    status: "False"
    type: NamespaceFinalizersRemaining
  phase: Terminating
CRI-O log: https://drive.google.com/file/d/18Qz7S1AmAzA3l8Sa0jc9o6MftaAbdqWg/view?usp=sharing
Hi Rahul,

I couldn't find any kubelet logs related to the namespace "3pndd" in the attached must-gather output - https://bugzilla.redhat.com/show_bug.cgi?id=2016339#c10

[svanka@svanka masters]$ ls -lrt
total 10696
-rw-r--r--. 1 svanka svanka  435691 Jan 18 12:34 ovs-vswitchd_service.log
-rw-r--r--. 1 svanka svanka   10332 Jan 18 12:34 ovsdb-server_service.log
-rw-r--r--. 1 svanka svanka   50834 Jan 18 12:34 ovs-configuration_service.log
-rw-r--r--. 1 svanka svanka    2348 Jan 18 12:34 openvswitch_service.log
-rw-r--r--. 1 svanka svanka  157923 Jan 18 12:34 NetworkManager_service.log
-rw-r--r--. 1 svanka svanka     612 Jan 18 12:34 machine-config-daemon-host_service.log
-rw-r--r--. 1 svanka svanka   30215 Jan 18 12:34 machine-config-daemon-firstboot_service.log
-rw-r--r--. 1 svanka svanka 6221625 Jan 18 12:34 kubelet_service.log
-rw-r--r--. 1 svanka svanka 4026143 Jan 18 12:34 crio_service.log
[svanka@svanka masters]$ grep -nri "3pndd" *
[svanka@svanka masters]$

Could you attach the kubelet log from the node on which the pods in the above namespace are running? This would really help in debugging the root cause of the issue.

Moreover, I didn't find any abnormal behavior in the kubelet and crio service logs attached in the must-gather output.

Thanks,
Ramesh
must-gather only gathers from a random master by default. We'll need the must-gather or sos_report from the node with the slowly deleting pod.
must-gather of all masters https://drive.google.com/file/d/1ZqyNX2kBs-Dga_AMPYmSNE8wiVnz_vEN/view?usp=sharing
Sorry if that was vague. I presume this workload is landing on a worker, so we'll need the must-gather of the worker this workload is stuck on.
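To identify which worker that is, something like the following should find the node hosting the stuck pod and pull its service logs directly (the node name is a placeholder; `oc adm node-logs -u <unit>` selects a systemd unit on the node):

oc --context=admin get po -n 3pndd -o wide        # NODE column shows the worker
oc adm node-logs -u kubelet <worker-node> > kubelet-<worker-node>.log
oc adm node-logs -u crio <worker-node> > crio-<worker-node>.log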
Hey Rahul,

I still couldn't find any logs relevant to the "3pndd" namespace in the must-gather logs attached in https://bugzilla.redhat.com/show_bug.cgi?id=2016339#c17

As Peter mentioned, could you attach the kubelet logs of the worker node on which the workload is present?

Thanks,
Ramesh
All node logs https://drive.google.com/file/d/1LaYBFoipWcot8Ml9b5IzK1JKRrZoQEyc/view?usp=sharing
Hi Rahul,

The attached must-gather logs aren't really helpful in debugging the actual root cause of the issue. Could you attach the kubelet logs of the worker node on which the workload is present, as requested in the comments above?

The following command gathers the kubelet logs from a node:

`oc adm node-logs -u kubelet <node-name>`

Thanks,
Ramesh
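Once those logs are pulled, grepping for the namespace or the pod UID usually narrows things down quickly (the pod UID below is a placeholder; grab it from `oc get po -n 3pndd -o yaml` before the namespace disappears):

oc adm node-logs -u kubelet <worker-node> | grep -i -e '3pndd' -e '<pod-uid>' > kubelet-3pndd.log
oc adm node-logs -u crio <worker-node> | grep -i 'hello-openshift' > crio-3pndd.log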
Kubelet logs of all nodes: https://drive.google.com/file/d/1rikI8nfDNKOJZ53y7S9ZOgr9GzdA9siN/view?usp=sharing. If this does not help, I can guide you through reproducing the issue.
*** This bug has been marked as a duplicate of bug 2038780 ***