Bug 2069058 - Pods hang on container creating after worker node update to 4.8.35
Summary: Pods hang on container creating after worker node update to 4.8.35
Keywords:
Status: CLOSED DUPLICATE of bug 2082344
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Kir Kolyshkin
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-28 07:29 UTC by Emmanuel Kasper
Modified: 2022-09-12 14:28 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-25 15:28:52 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Emmanuel Kasper 2022-03-28 07:29:07 UTC
Description of problem:
After upgrading to 4.8.35, the worker node is not able to create pods, and the node status stays unschedulable.

We can see the following message in the cri-o journal:
Error reserving ctr name uuid0 for id uuid1

and in the kubelet journal:
"CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name uuid0 for id uuid1: name is reserved" podSandboxID="uuid2"
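
For context, the "name is reserved" part comes from CRI-O's container name reservation: the first CreateContainer request reserves the name, and when the kubelet times out and retries while that request is still in flight, the retry is rejected because the name is still held. A minimal sketch of that kind of bookkeeping is below (the Registrar type, names, and message format are illustrative, not CRI-O's actual code):

package main

import (
	"fmt"
	"sync"
)

// Registrar is an illustrative stand-in for the name bookkeeping CRI-O does:
// each reserved container name maps to the ID that reserved it.
type Registrar struct {
	mu    sync.Mutex
	names map[string]string // container name -> container ID
}

func NewRegistrar() *Registrar {
	return &Registrar{names: make(map[string]string)}
}

// Reserve fails if the name is already held, even when the holder is an
// earlier, still-running request for the same container -- which is what a
// kubelet retry of a slow CreateContainer call runs into.
func (r *Registrar) Reserve(name, id string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.names[name]; ok {
		return fmt.Errorf("error reserving ctr name %s for id %s: name is reserved", name, id)
	}
	r.names[name] = id
	return nil
}

// Release frees the name once creation finishes (or fails) so a later
// request can reuse it.
func (r *Registrar) Release(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.names, name)
}

func main() {
	reg := NewRegistrar()
	_ = reg.Reserve("k8s_app_pod1_ns_uid_0", "id-1")    // first CreateContainer, stuck under load
	err := reg.Reserve("k8s_app_pod1_ns_uid_0", "id-2") // kubelet retry with a new container ID
	fmt.Println(err)                                    // prints the "name is reserved" error
}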


Version-Release number of selected component (if applicable): 4.8.35


How reproducible: Difficult to reproduce, as this was noticed in a production environment.

Additional info:
This took place on a very large node (800 GB RAM, configured for a maximum of 500 pods).
My supposition is that after the node was upgraded and uncordoned, a massive pod rescheduling took place, which overloaded the kubelet and cri-o on that node.

After draining the node, rebooting, and uncordoning, pods were scheduled to that node normally.

Comment 5 Emmanuel Kasper 2022-04-01 13:15:15 UTC
A bug producing a similar error message was fixed in OCP 4.7: https://bugzilla.redhat.com/show_bug.cgi?id=1785399

Comment 6 Peter Hunt 2022-04-12 14:28:06 UTC
Note that this can still happen even with the patch that came in via https://bugzilla.redhat.com/show_bug.cgi?id=2050312.
It just means the kubelet and cri-o are miscommunicating, but as long as the node eventually recovers, it's not an issue. Does the issue eventually stop?

Comment 7 Emmanuel Kasper 2022-04-12 14:36:56 UTC
After 24 hours, having collected all the logs we could, many pods were still in ContainerCreating status, so we rebooted the culprit node.

Comment 15 Emmanuel Kasper 2022-05-11 09:34:54 UTC
Could it be that the ContainerCreating hang has the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=2082344 and https://bugzilla.redhat.com/show_bug.cgi?id=2082233 ?

Comment 19 Emmanuel Kasper 2022-05-23 13:58:42 UTC
Looking at the Go stack trace attached here,
I indeed see a lot of goroutines with a call to StartTransientUnitContext, very similar to the stack trace we have in https://bugzilla.redhat.com/show_bug.cgi?id=2082344#c0


goroutine 12871381 [semacquire, 1436 minutes]:
sync.runtime_SemacquireMutex(0xc0043a3bbc, 0x0, 0x1)
        /usr/lib/golang/src/runtime/sema.go:71 +0x49
sync.(*Mutex).lockSlow(0xc0043a3bb8)
        /usr/lib/golang/src/sync/mutex.go:138 +0x10f
sync.(*Mutex).Lock(...)
        /usr/lib/golang/src/sync/mutex.go:81
github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus.(*Conn).startJob(0xc0043a3b80, 0x55cfd67058b0, 0xc00005e018, 0xc0001f8180, 0x55cfd5dfd57b, 0x33, 0xc011998000, 0x4, 0x4, 0x0, ...)
        /builddir/build/BUILD/cri-o-af6493181d9d32685678d5162f9462e70684ba86/_output/src/github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus/methods.go:56 +0x3a5
github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus.(*Conn).StartTransientUnitContext(0xc0043a3b80, 0x55cfd67058b0, 0xc00005e018, 0xc0001f8000, 0x52, 0x55cfd5d8d51c, 0x7, 0xc003eba360, 0x6, 0x6, ...)
        /builddir/build/BUILD/cri-o-af6493181d9d32685678d5162f9462e70684ba86/_output/src/github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus/methods.go:199 +0x1dc
github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus.(*Conn).StartTransientUnit(...)
        /builddir/build/BUILD/cri-o-af6493181d9d32685678d5162f9462e70684ba86/_output/src/github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus/methods.go:194

$ grep StartTransientUnitContext crio-goroutine-stacks-2022-03-25T095219Z.log | wc -l
411
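
For context on these frames: go-systemd serializes every unit-start request behind a per-connection lock inside (*Conn).startJob, so once one StartTransientUnit call to systemd stops making progress, every later container creation queues on that mutex, which matches the 411 parked goroutines above. Below is a minimal sketch of that call path, roughly as a systemd cgroup manager would drive it (the unit name, properties, PID, and timeout are illustrative, not CRI-O's actual code):

package main

import (
	"context"
	"fmt"
	"time"

	systemddbus "github.com/coreos/go-systemd/v22/dbus"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// One shared connection to the system bus; all StartTransientUnitContext
	// calls on it funnel through Conn.startJob, which holds a per-connection
	// lock while it waits for systemd's reply.
	conn, err := systemddbus.NewWithContext(ctx)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Hypothetical properties, standing in for the per-container scope a
	// cgroup manager would create.
	props := []systemddbus.Property{
		systemddbus.PropDescription("example container scope"),
		systemddbus.PropPids(uint32(1234)), // illustrative PID
	}

	ch := make(chan string, 1)
	if _, err := conn.StartTransientUnitContext(ctx, "crio-example.scope", "replace", props, ch); err != nil {
		// On the hung node, calls like this never returned: the goroutines
		// above sat blocked on the connection lock for 1436 minutes.
		panic(err)
	}

	select {
	case result := <-ch:
		fmt.Println("job result:", result) // "done" on success
	case <-ctx.Done():
		fmt.Println("systemd did not answer in time:", ctx.Err())
	}
}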

If this is indeed a duplicate of bug 2082344, maybe we should mark it as such.

Comment 20 Peter Hunt 2022-05-25 15:28:52 UTC
Yeah, I think it's reasonable to believe this is a duplicate of that, so I'm setting it as such.

*** This bug has been marked as a duplicate of bug 2082344 ***

