Created attachment 1835540 [details]
must gather and node log

Description of problem:

Pod deletion is slow when the cluster is under load. The namespace reports NamespaceContentRemaining and NamespaceDeletionContentFailure: "unexpected items still remain in namespace".

Version-Release number of selected component (if applicable):
4.9.0-rc.7

How reproducible:
Always when the cluster is under load

Steps to Reproduce:

1. In the 1st terminal, run the command below as one user (e.g. kubeadmin), and use the clusterbuster tool to put extra load on the cluster:

for i in {0..200}; do oc new-project test-$i --description=test-debug-$i --display-name test-debug-$i; done

2. In the 2nd terminal, run the command below as a different user (e.g. testuser-0):

oc new-project 3pndd; oc new-app --docker-image=quay.io/openshifttest/hello-openshift:openshift; oc delete projects 3pndd; oc get projects 3pndd --output=yaml

3. In the 3rd terminal, as the same user (e.g. testuser-0) who ran the command above, use the admin context to monitor pod deletion:

NS=$(oc --context=admin get ns -o jsonpath='{.items[?(@.metadata.annotations.openshift\.io/requester=="testuser-0")].metadata.name}')
while true; do
  date
  echo $NS
  oc --context=admin get ns $NS
  oc --context=admin get ns $NS -o yaml | grep -A 1000 "^status"
  oc --context=admin get po -n $NS
  oc --context=admin get po -n $NS -o yaml
  sleep 1
done

Actual results:

Pod deletion is slow, and `oc get ns <PROJECT_NAME> -o yaml` shows "unexpected items still remain in namespace".

Timestamp when the pod starts deleting:

Wed Oct 20 21:05:11 IST 2021
3pndd
NAME    STATUS        AGE
3pndd   Terminating   25s

oc --context=admin get ns <project name> -o yaml | grep -A 1000 "^status"

status:
  conditions:
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: 'Some resources are remaining: pods. has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: All content-preserving finalizers finished
    reason: ContentHasNoFinalizers
    status: "False"
    type: NamespaceFinalizersRemaining
  phase: Terminating

After 10 minutes the pod is still deleting:

Wed Oct 20 21:13:54 IST 2021
3pndd
NAME    STATUS        AGE
3pndd   Terminating   9m8s

status:
  conditions:
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2021-10-20T15:35:39Z"
    message: 'Failed to delete all resource types, 1 remaining: unexpected items still remain in namespace: 3pndd for gvr: /v1, Resource=pods'
    reason: ContentDeletionFailed
    status: "True"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: 'Some resources are remaining: pods. has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2021-10-20T15:35:00Z"
    message: All content-preserving finalizers finished
    reason: ContentHasNoFinalizers
    status: "False"
    type: NamespaceFinalizersRemaining
  phase: Terminating

Expected results:

This error should not appear, and pod deletion should be fast.

Additional info:

Must-gather and node logs: https://drive.google.com/file/d/1Zs0JG0V3ohPt89UU9c_HrW4V14ja4cbI/view?usp=sharing

Slack discussion with Mike Fiedler for reference: https://coreos.slack.com/archives/CS05TR7BK/p1634746479498400?thread_ts=1634739477.494700&cid=CS05TR7BK
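While the namespace is Terminating, a quick check like the following can show which pod is blocking deletion and whether a finalizer or deletion grace period is involved (illustrative only; the namespace name 3pndd is taken from the reproduction steps above):

oc --context=admin get po -n 3pndd -o jsonpath='{range .items[*]}{.metadata.name}{"\tdeletionTimestamp="}{.metadata.deletionTimestamp}{"\tgracePeriod="}{.metadata.deletionGracePeriodSeconds}{"\tfinalizers="}{.metadata.finalizers}{"\n"}{end}'
oc --context=admin get events -n 3pndd --sort-by=.lastTimestamp | tail -20

If deletionTimestamp is set, no finalizers are present, and the pod still lingers for minutes, that usually points at the node side (kubelet/CRI-O) rather than the API server.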
Is this reproducible with the released 4.9?
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2003206. Can we get the CRI-O goroutines to confirm? (see https://github.com/cri-o/cri-o/blob/main/tutorials/debugging.md)
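For reference, a rough sketch of collecting the goroutine stacks on the affected node (this assumes the SIGUSR1 stack-dump mechanism described in the linked tutorial; follow the tutorial for the exact dump location):

oc debug node/<worker-node>
chroot /host
# ask crio to dump its goroutine stacks
kill -USR1 $(pidof crio)
# collect the recent crio journal as well, in case the stacks land there
journalctl -u crio --since "10 minutes ago" > /tmp/crio-after-usr1.log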
@Peter Hunt, I shared the must-gather logs. Are the must-gather logs not useful?
Peter, yes, it is reproducible with the released 4.9.
must-gather doesn't yet contain the information requested in https://bugzilla.redhat.com/show_bug.cgi?id=2016339#c2
Not completed this sprint.
*** This bug has been marked as a duplicate of bug 2021432 ***
Harsha, I can still reproduce the issue on 4.10, and it is not a duplicate.

oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-17-182202   True        False         136m    Cluster version is 4.10.0-0.nightly-2022-01-17-182202

Tue Jan 18 12:08:51 IST 2022
3pndd
NAME    STATUS        AGE
3pndd   Terminating   2m24s

status:
  conditions:
  - lastTransitionTime: "2022-01-18T06:36:40Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2022-01-18T06:36:40Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2022-01-18T06:37:20Z"
    message: 'Failed to delete all resource types, 1 remaining: unexpected items still remain in namespace: 3pndd for gvr: /v1, Resource=pods'
    reason: ContentDeletionFailed
    status: "True"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2022-01-18T06:36:40Z"
    message: 'Some resources are remaining: pods. has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2022-01-18T06:36:40Z"
    message: All content-preserving finalizers finished
    reason: ContentHasNoFinalizers
    status: "False"
    type: NamespaceFinalizersRemaining
  phase: Terminating
CRI-O log: https://drive.google.com/file/d/18Qz7S1AmAzA3l8Sa0jc9o6MftaAbdqWg/view?usp=sharing
Hi Rahul,

I couldn't find any kubelet logs related to the namespace "3pndd" in the attached must-gather output - https://bugzilla.redhat.com/show_bug.cgi?id=2016339#c10

[svanka@svanka masters]$ ls -lrt
total 10696
-rw-r--r--. 1 svanka svanka  435691 Jan 18 12:34 ovs-vswitchd_service.log
-rw-r--r--. 1 svanka svanka   10332 Jan 18 12:34 ovsdb-server_service.log
-rw-r--r--. 1 svanka svanka   50834 Jan 18 12:34 ovs-configuration_service.log
-rw-r--r--. 1 svanka svanka    2348 Jan 18 12:34 openvswitch_service.log
-rw-r--r--. 1 svanka svanka  157923 Jan 18 12:34 NetworkManager_service.log
-rw-r--r--. 1 svanka svanka     612 Jan 18 12:34 machine-config-daemon-host_service.log
-rw-r--r--. 1 svanka svanka   30215 Jan 18 12:34 machine-config-daemon-firstboot_service.log
-rw-r--r--. 1 svanka svanka 6221625 Jan 18 12:34 kubelet_service.log
-rw-r--r--. 1 svanka svanka 4026143 Jan 18 12:34 crio_service.log
[svanka@svanka masters]$ grep -nri "3pndd" *
[svanka@svanka masters]$

Could you attach the kubelet log from the node on which the pods in the above namespace are running? This would really help in debugging the root cause of the issue.

Moreover, I didn't find any abnormal behavior in the kubelet and crio service logs attached in the must-gather output.

Thanks,
Ramesh
must-gather only gathers from a random master by default. We'll need the must-gather or sos_report from the node with the slowly deleting pod.
must-gather of all masters https://drive.google.com/file/d/1ZqyNX2kBs-Dga_AMPYmSNE8wiVnz_vEN/view?usp=sharing
Sorry if that was vague. I presume this workload is landing on a worker, so we'll need the must-gather of the worker this workload is stuck on.
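To identify which worker that is, something like the following should find the node hosting the stuck pod and pull its service logs directly (the node name is a placeholder; `oc adm node-logs -u <unit>` selects a systemd unit on the node):

oc --context=admin get po -n 3pndd -o wide        # NODE column shows the worker
oc adm node-logs -u kubelet <worker-node> > kubelet-<worker-node>.log
oc adm node-logs -u crio <worker-node> > crio-<worker-node>.log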
Hey Rahul,

I still couldn't find any logs relevant to the "3pndd" namespace in the must-gather logs attached in https://bugzilla.redhat.com/show_bug.cgi?id=2016339#c17

As Peter mentioned, could you attach the kubelet logs of the worker node on which the workload is present?

Thanks,
Ramesh
All node logs https://drive.google.com/file/d/1LaYBFoipWcot8Ml9b5IzK1JKRrZoQEyc/view?usp=sharing
Hi Rahul,

The attached must-gather logs aren't really helpful in debugging the actual root cause of the issue. Could you attach the kubelet logs of the worker node on which the workload is present, as requested in the comments above?

The following command gathers the kubelet logs from a node:

`oc adm node-logs -u kubelet <node-name>`

Thanks,
Ramesh
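Once those logs are pulled, grepping for the namespace or the pod UID usually narrows things down quickly (the pod UID below is a placeholder; grab it from `oc get po -n 3pndd -o yaml` before the namespace disappears):

oc adm node-logs -u kubelet <worker-node> | grep -i -e '3pndd' -e '<pod-uid>' > kubelet-3pndd.log
oc adm node-logs -u crio <worker-node> | grep -i 'hello-openshift' > crio-3pndd.log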
Kubelet logs of all nodes: https://drive.google.com/file/d/1rikI8nfDNKOJZ53y7S9ZOgr9GzdA9siN/view?usp=sharing. If this does not help, I can guide you through reproducing the issue.
*** This bug has been marked as a duplicate of bug 2038780 ***