Bug 2069058

Summary: Pods hang on container creating after worker node update to 4.8.35
Product: OpenShift Container Platform
Reporter: Emmanuel Kasper <ekasprzy>
Component: Node
Assignee: Kir Kolyshkin <kir>
Node sub component: CRI-O
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
CC: aivaraslaimikis, aos-bugs, bkulawia, bverschu, gpulido, harpatil, jolee, ldixon, pehunt, ssonigra, wweber
Version: 4.8
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Linux
Type: Bug
Last Closed: 2022-05-25 15:28:52 UTC

Description Emmanuel Kasper 2022-03-28 07:29:07 UTC
Description of problem:
After upgrading to 4.8.35, the worker node is not able to create pods, and the node status stays unschedulable.

We can see the following message in the CRI-O journal:
Error reserving ctr name uuid0 for id : uuid1"

and in the kubelet journal:
CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name uuid0 for id uuid1: name is reserved" podSandboxID="uuid2
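
For context, here is a minimal Go sketch (a hypothetical simplification, not CRI-O's actual code; names and IDs are made up) of how a "name is reserved" error like the one above can arise: the container name is reserved when a CreateContainer request starts being processed, so a kubelet retry that reuses the same name while the original request is still in flight gets rejected.

package main

import (
	"fmt"
	"sync"
)

// nameRegistrar is a simplified stand-in for the name-reservation
// bookkeeping a container runtime performs when creating containers.
type nameRegistrar struct {
	mu    sync.Mutex
	names map[string]string // name -> container ID holding the reservation
}

func newNameRegistrar() *nameRegistrar {
	return &nameRegistrar{names: make(map[string]string)}
}

// Reserve associates name with id, failing if another ID already holds it.
// A kubelet retry that arrives with the same name while the first request
// is still in flight hits this error path.
func (r *nameRegistrar) Reserve(name, id string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if owner, ok := r.names[name]; ok && owner != id {
		return fmt.Errorf("error reserving ctr name %s for id %s: name is reserved", name, id)
	}
	r.names[name] = id
	return nil
}

// Release frees the name once the original request completes or fails.
func (r *nameRegistrar) Release(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.names, name)
}

func main() {
	reg := newNameRegistrar()
	// First CreateContainer request, still in flight:
	_ = reg.Reserve("k8s_mycontainer_mypod", "ctr-id-1")
	// Kubelet retry after a timeout, same name, new attempt:
	if err := reg.Reserve("k8s_mycontainer_mypod", "ctr-id-2"); err != nil {
		fmt.Println(err) // name is reserved
	}
}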


Version-Release number of selected component (if applicable): 4.8.35


How reproducible: Difficult to reproduce, as this was noticed in a production environment.

Additional info:
This took place on a very large node (800GB RAM, set to 500 pods max).
My supposition is that after the node was upgraded and uncordoned, a massive pod rescheduling took place which overloaded the kubelet and CRI-O on that node.

After draining the node, rebooting, and uncordoning, pods were scheduled to that node normally.

Comment 5 Emmanuel Kasper 2022-04-01 13:15:15 UTC
A bug producing a similar error message was fixed in OCP 4.7 https://bugzilla.redhat.com/show_bug.cgi?id=1785399

Comment 6 Peter Hunt 2022-04-12 14:28:06 UTC
Note that this can still happen even with the patch that came in via https://bugzilla.redhat.com/show_bug.cgi?id=2050312.
This just means the kubelet and CRI-O are miscommunicating, but as long as the node eventually resolves, it's not an issue. Does the issue eventually stop?
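
As an illustration of that miscommunication, here is a hypothetical Go sketch (not kubelet or CRI-O code) of the timeout-then-retry sequence: the kubelet's client-side deadline expires while CRI-O is still working on the request, the RPC fails with "context deadline exceeded", and the subsequent retry can collide with the name still held by the original request.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// createContainer stands in for a CRI CreateContainer call that the runtime
// is still processing when the caller's deadline expires.
func createContainer(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work): // the runtime finishes the request eventually...
		return nil
	case <-ctx.Done(): // ...but the caller has already given up.
		return ctx.Err()
	}
}

func main() {
	// Kubelet-style call with a runtime-request timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	err := createContainer(ctx, 1*time.Second) // overloaded runtime is slow
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("rpc failed:", err)
		// The caller now retries with the same container name, while the
		// original request may still hold the name reservation on the
		// runtime side, producing "name is reserved" on the retry.
	}
}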

Comment 7 Emmanuel Kasper 2022-04-12 14:36:56 UTC
After 24 hours, having collected all the logs we could, many pods were still in ContainerCreating status, so we rebooted the culprit node.

Comment 15 Emmanuel Kasper 2022-05-11 09:34:54 UTC
Could it be that the stuck ContainerCreating status has the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=2082344 and https://bugzilla.redhat.com/show_bug.cgi?id=2082233 ?

Comment 19 Emmanuel Kasper 2022-05-23 13:58:42 UTC
Looking at the Go stack trace attached here,
I indeed see a lot of goroutines where a call to StartTransientUnitContext is present, very similar to the stack trace we have in https://bugzilla.redhat.com/show_bug.cgi?id=2082344#c0


goroutine 12871381 [semacquire, 1436 minutes]:
sync.runtime_SemacquireMutex(0xc0043a3bbc, 0x0, 0x1)
        /usr/lib/golang/src/runtime/sema.go:71 +0x49
sync.(*Mutex).lockSlow(0xc0043a3bb8)
        /usr/lib/golang/src/sync/mutex.go:138 +0x10f
sync.(*Mutex).Lock(...)
        /usr/lib/golang/src/sync/mutex.go:81
github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus.(*Conn).startJob(0xc0043a3b80, 0x55cfd67058b0, 0xc00005e018, 0xc0001f8180, 0x55cfd5dfd57b, 0x33, 0xc011998000, 0x4, 0x4, 0x0, ...)
        /builddir/build/BUILD/cri-o-af6493181d9d32685678d5162f9462e70684ba86/_output/src/github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus/methods.go:56 +0x3a5
github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus.(*Conn).StartTransientUnitContext(0xc0043a3b80, 0x55cfd67058b0, 0xc00005e018, 0xc0001f8000, 0x52, 0x55cfd5d8d51c, 0x7, 0xc003eba360, 0x6, 0x6, ...)
        /builddir/build/BUILD/cri-o-af6493181d9d32685678d5162f9462e70684ba86/_output/src/github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus/methods.go:199 +0x1dc
github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus.(*Conn).StartTransientUnit(...)
        /builddir/build/BUILD/cri-o-af6493181d9d32685678d5162f9462e70684ba86/_output/src/github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus/methods.go:194

$ grep StartTransientUnitContext crio-goroutine-stacks-2022-03-25T095219Z.log | wc -l
411
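
For illustration, here is a minimal Go sketch (not the vendored go-systemd code) of the pattern those 411 parked goroutines suggest: transient unit jobs are funneled through a single systemd dbus connection guarded by a mutex, so one stalled job makes every later container creation queue behind it in semacquire, as in the stack trace above.

package main

import (
	"fmt"
	"sync"
	"time"
)

// conn stands in for a single systemd dbus connection that unit jobs are
// serialized through. This is an illustrative sketch only.
type conn struct {
	mu sync.Mutex
}

// startTransientUnit models startJob: every caller must take the
// connection mutex before talking to systemd over dbus.
func (c *conn) startTransientUnit(unit string, d time.Duration) {
	c.mu.Lock() // goroutines queue here ("semacquire" in the stacks)
	defer c.mu.Unlock()
	time.Sleep(d) // pretend dbus round-trip; if this stalls, everyone behind waits
	_ = unit
}

func main() {
	c := &conn{}
	var wg sync.WaitGroup

	// One slow or stuck job holds the mutex...
	wg.Add(1)
	go func() {
		defer wg.Done()
		c.startTransientUnit("crio-stuck.scope", 2*time.Second)
	}()

	// ...and every later container creation piles up behind it, which is
	// what many goroutines parked in StartTransientUnitContext look like
	// in a goroutine dump.
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.startTransientUnit(fmt.Sprintf("crio-%d.scope", i), 10*time.Millisecond)
		}(i)
	}
	wg.Wait()
	fmt.Println("all transient unit jobs completed")
}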

If this is indeed a duplicate of 2082344, maybe we should mark this one as a duplicate.

Comment 20 Peter Hunt 2022-05-25 15:28:52 UTC
yeah I think it's reasonable to believe this is a duplicate of that, so I'm setting it as such

*** This bug has been marked as a duplicate of bug 2082344 ***