Bug 2030029

Summary: [4.10][goroutine]Namespace stuck terminating: Failed to delete all resource types, 1 remaining: unexpected items still remain in namespace
Product: OpenShift Container Platform
Reporter: Christoffer Back <cback>
Component: Node
Assignee: Peter Hunt <pehunt>
Sub Component: CRI-O
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: akrzos, aos-bugs, bzhai, djuran, fminafra, harpatil, igreen, joboyer, kkarampo, mavazque, minmli, mmethot, nagrawal, nchhabra, oarribas, openshift-bugs-escalate, pehunt, schoudha
Version: 4.8
Keywords: Reopened
Target Milestone: ---
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 2021431
: 2040711 2040712 (view as bug list)
Environment:
Last Closed: 2022-03-10 16:32:46 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2003206
Bug Blocks: 2021431, 2021432, 2040711, 2040712

Comment 1 Peter Hunt 2021-12-07 22:10:41 UTC
I've asked for a new bug in https://bugzilla.redhat.com/show_bug.cgi?id=2021431#c8 to investigate the new set of deadlocks on pod stop that we're seeing.

I have created a scratch build of cri-o, and I'm interested in seeing whether the problem still triggers with it:

http://brew-task-repos.usersys.redhat.com/repos/scratch/pehunt/cri-o/1.21.4/5.rhaos4.8.git84fa55d.el8/

Comment 2 Peter Hunt 2021-12-07 22:12:35 UTC
Oh, and if the issue reproduces even with the scratch build, I'll want the info from https://bugzilla.redhat.com/show_bug.cgi?id=2021431#c6 provided again.

Comment 7 Sascha Grunert 2021-12-09 10:58:36 UTC
Chris, the must-gather tells me that they run CRI-O 1.21.4-5.rhaos4.8.git84fa55d.el8, which is not the one Peter provided (1.21.4-6.rhaos4.8.gitc845cf4.el8). Can you ask them to override the rpm via rpm-ostree?
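
For reference, overriding the cri-o RPM on an RHCOS node via rpm-ostree would look roughly like the sketch below. The RPM URL and filename are illustrative only; the actual file should come from the Brew scratch repo for the build Peter provided.

```shell
# From a debug shell on the node (oc debug node/<node>, then chroot /host).
# Fetch the scratch-build RPM, then replace the installed cri-o package.
# The URL/filename below are illustrative; use the actual RPM from the Brew repo.
curl -LO http://brew-task-repos.usersys.redhat.com/repos/scratch/pehunt/cri-o/1.21.4/6.rhaos4.8.gitc845cf4.el8/x86_64/cri-o-1.21.4-6.rhaos4.8.gitc845cf4.el8.x86_64.rpm
rpm-ostree override replace ./cri-o-1.21.4-6.rhaos4.8.gitc845cf4.el8.x86_64.rpm

# rpm-ostree changes take effect on the next boot.
systemctl reboot
```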

Comment 11 Sascha Grunert 2021-12-10 09:45:37 UTC
Hey Chris, I uploaded two modified test binaries for 4.8 and 4.9. Please request another test from the customer. Then I will sync with Peter on Monday about how to approach this issue.

Comment 12 Christoffer Back 2021-12-10 10:52:05 UTC
(In reply to Sascha Grunert from comment #11)
> Hey Chris, I uploaded two modified test binaries for 4.8 and 4.9. Please
> request another test from the customer. Then I will sync with Peter on
> Monday about how to approach this issue.


Hi Sascha, the binaries and instructions have been delivered to the customer. I will link a set of logs once their testing is complete.

Instructions delivered:
########################################################

Create a node debug container:

```
oc debug node/ci-ln-2myl9xb-f76d1-ck27t-master-0
```

Copy the tarball to the container:

```
kubectl cp crio.tar.gz ci-ln-2myl9xb-f76d1-ck27t-master-0-debug:/tmp/crio.tar.gz
```

In the container, move the tarball to the destination and verify that the executable works:

```
mv /tmp/crio.tar.gz /host/tmp
chroot /host
tar xf /tmp/crio.tar.gz -C /usr/local/bin/
/usr/local/bin/crio version
```

```
Version:       1.21.3
GitCommit:     51409e1b2dc9ccfbb7d7f4fd543a094097627ae2
GitTreeState:  dirty
BuildDate:     1980-01-01T00:00:00Z
GoVersion:     go1.15.7
Compiler:      gc
Platform:      linux/amd64
Linkmode:      static
```

Edit the crio unit file:

```
systemctl edit crio
```

Add the following override:

```
[Service]
ExecStart=
ExecStart=-/usr/local/bin/crio
```

Restart crio:

```
systemctl daemon-reload
systemctl restart crio
```
############################################################
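
After restarting, it may be worth confirming that the override actually took effect. The checks below are a suggested verification, not part of the original instructions:

```shell
# Confirm systemd now launches the test binary from /usr/local/bin
systemctl show -p ExecStart crio

# Confirm the running binary reports the expected test version/commit
/usr/local/bin/crio version
```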


Br, 
Chris

Comment 50 Peter Hunt 2022-01-17 17:59:43 UTC
*** Bug 2014083 has been marked as a duplicate of this bug. ***

Comment 51 Sunil Choudhary 2022-01-18 11:27:13 UTC
Tested on 4.10.0-0.nightly-2022-01-17-223655

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-17-223655   True        False         136m    Cluster version is 4.10.0-0.nightly-2022-01-17-223655

$ oc get nodes -o wide
NAME                                                STATUS   ROLES           AGE    VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
master-00.sunilc410bm.qe.devcluster.openshift.com   Ready    master,worker   156m   v1.23.0+60f5a1c   147.75.80.115   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201171746-0 (Ootpa)   4.18.0-305.30.1.el8_4.x86_64   cri-o://1.23.0-102.rhaos4.10.git9c23ef3.el8

Comment 52 Peter Hunt 2022-01-26 19:55:44 UTC
*** Bug 2040485 has been marked as a duplicate of this bug. ***

Comment 53 Peter Hunt 2022-02-01 14:52:24 UTC
*** Bug 2015412 has been marked as a duplicate of this bug. ***

Comment 56 errata-xmlrpc 2022-03-10 16:32:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056