Description of problem:

After upgrading to 4.8.35, the worker node is not able to create pods, and the node status stays unschedulable. We can see the following message in the cri-o journal:

  Error reserving ctr name uuid0 for id uuid1

and in the kubelet journal:

  CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name uuid0 for id uuid1: name is reserved" podSandboxID="uuid2

Version-Release number of selected component (if applicable): 4.8.35

How reproducible: Difficult to reproduce, as this was noticed in a production environment.

Additional info: This took place on a very large node (800 GB RAM, set to 500 pods max). My supposition is that after the node was upgraded and uncordoned, a massive pod rescheduling took place which overloaded the kubelet and cri-o on that node. After draining the node, rebooting, and uncordoning, pods were scheduled to that node normally.
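For context, the "name is reserved" error is the symptom of a name-reservation conflict: the first CreateContainer attempt reserves the container name, and when the kubelet times out and retries while the original request is still in flight, the retry finds the name already held by a different container ID. The following is a minimal, hypothetical Go sketch of that reservation pattern for illustration only; it is not CRI-O's actual implementation, and names such as nameRegistry are made up.

package main

import (
	"fmt"
	"sync"
)

// nameRegistry is a hypothetical stand-in for a runtime's name reservation table.
type nameRegistry struct {
	mu       sync.Mutex
	reserved map[string]string // container name -> container ID
}

func newNameRegistry() *nameRegistry {
	return &nameRegistry{reserved: make(map[string]string)}
}

// Reserve associates name with id, or fails if another id already holds the name.
func (r *nameRegistry) Reserve(name, id string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if owner, ok := r.reserved[name]; ok && owner != id {
		return fmt.Errorf("error reserving ctr name %s for id %s: name is reserved", name, id)
	}
	r.reserved[name] = id
	return nil
}

// Release frees the name once the original request completes or is cleaned up.
func (r *nameRegistry) Release(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.reserved, name)
}

func main() {
	reg := newNameRegistry()
	_ = reg.Reserve("k8s_app_pod", "uuid1") // first (slow) attempt holds the name
	if err := reg.Reserve("k8s_app_pod", "uuid2"); err != nil {
		fmt.Println(err) // a retried request with a new ID fails here until the name is released
	}
}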
A bug producing a similar error message was fixed in OCP 4.7 https://bugzilla.redhat.com/show_bug.cgi?id=1785399
Note that this can still happen even with the patch that came in via https://bugzilla.redhat.com/show_bug.cgi?id=2050312. This just means the kubelet and cri-o are miscommunicating, but as long as the node eventually recovers, it's not an issue. Does the issue eventually stop?
After 24 hours, having collected all the logs we could, many pods were still in ContainerCreating status, so we rebooted the culprit node.
Could it be that the ContainerCreating status has the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=2082344 and https://bugzilla.redhat.com/show_bug.cgi?id=2082233 ?
Looking at the go stack trace attached here, I indeed see a lot of goroutines where the call to StartTransientUnitContext is present, very similar to the stack trace we have in https://bugzilla.redhat.com/show_bug.cgi?id=2082344#c0:

goroutine 12871381 [semacquire, 1436 minutes]:
sync.runtime_SemacquireMutex(0xc0043a3bbc, 0x0, 0x1)
	/usr/lib/golang/src/runtime/sema.go:71 +0x49
sync.(*Mutex).lockSlow(0xc0043a3bb8)
	/usr/lib/golang/src/sync/mutex.go:138 +0x10f
sync.(*Mutex).Lock(...)
	/usr/lib/golang/src/sync/mutex.go:81
github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus.(*Conn).startJob(0xc0043a3b80, 0x55cfd67058b0, 0xc00005e018, 0xc0001f8180, 0x55cfd5dfd57b, 0x33, 0xc011998000, 0x4, 0x4, 0x0, ...)
	/builddir/build/BUILD/cri-o-af6493181d9d32685678d5162f9462e70684ba86/_output/src/github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus/methods.go:56 +0x3a5
github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus.(*Conn).StartTransientUnitContext(0xc0043a3b80, 0x55cfd67058b0, 0xc00005e018, 0xc0001f8000, 0x52, 0x55cfd5d8d51c, 0x7, 0xc003eba360, 0x6, 0x6, ...)
	/builddir/build/BUILD/cri-o-af6493181d9d32685678d5162f9462e70684ba86/_output/src/github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus/methods.go:199 +0x1dc
github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus.(*Conn).StartTransientUnit(...)
	/builddir/build/BUILD/cri-o-af6493181d9d32685678d5162f9462e70684ba86/_output/src/github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus/methods.go:194

$ grep StartTransientUnitContext crio-goroutine-stacks-2022-03-25T095219Z.log | wc -l
411

If this is indeed a duplicate of 2082344, maybe we should mark this as a duplicate.
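For what it's worth, the shape of that trace (hundreds of goroutines parked in semacquire inside startJob before StartTransientUnitContext) is consistent with all transient-unit creation serializing on a single per-connection mutex. The following is a simplified, hypothetical Go sketch of that locking shape, not the real go-systemd code; the conn/startJob names here only mirror the trace for illustration.

package main

import (
	"fmt"
	"sync"
	"time"
)

// conn is a stand-in for a D-Bus connection whose job submissions are serialized.
type conn struct {
	jobMu sync.Mutex // one lock per connection: only one job in flight at a time
}

// startJob mimics the locking shape seen in the trace: lock, submit, unlock.
func (c *conn) startJob(unit string, submit func(string) error) error {
	c.jobMu.Lock()
	defer c.jobMu.Unlock()
	return submit(unit)
}

func main() {
	c := &conn{}
	var wg sync.WaitGroup

	// One stalled submission (stand-in for systemd never answering) ...
	wg.Add(1)
	go func() {
		defer wg.Done()
		c.startJob("crio-stuck.scope", func(string) error {
			time.Sleep(2 * time.Second) // in the real hang this would never return
			return nil
		})
	}()
	time.Sleep(100 * time.Millisecond) // let the stalled job grab the lock first

	// ... and every subsequent scope creation queues behind it on the same lock.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			start := time.Now()
			c.startJob(fmt.Sprintf("crio-pod-%d.scope", i), func(string) error { return nil })
			fmt.Printf("pod %d waited %v for the connection lock\n", i, time.Since(start).Round(time.Millisecond))
		}(i)
	}
	wg.Wait()
}

If one submission never completes, the queue behind it only grows, which would be consistent with the 411 goroutines stuck on StartTransientUnitContext above.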
Yeah, I think it's reasonable to believe this is a duplicate of that, so I'm setting it as such.

*** This bug has been marked as a duplicate of bug 2082344 ***