Bug 2069058
| Summary: | Pods hang on container creating after worker node update to 4.8.35 | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Emmanuel Kasper <ekasprzy> |
| Component: | Node | Assignee: | Kir Kolyshkin <kir> |
| Node sub component: | CRI-O | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | aivaraslaimikis, aos-bugs, bkulawia, bverschu, gpulido, harpatil, jolee, ldixon, pehunt, ssonigra, wweber |
| Version: | 4.8 | ||
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-05-25 15:28:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Emmanuel Kasper
2022-03-28 07:29:07 UTC
A bug producing a similar error message was fixed in OCP 4.7: https://bugzilla.redhat.com/show_bug.cgi?id=1785399

Note that this can still happen even with the patch that came in via https://bugzilla.redhat.com/show_bug.cgi?id=2050312. That just means the kubelet and CRI-O are miscommunicating, but as long as the node eventually resolves, it's not an issue. Does the issue eventually stop?

After 24 hours, having collected all the logs we could, many pods were still in ContainerCreating status, so we rebooted the culprit node.

Could the ContainerCreating status be caused by the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=2082344 and https://bugzilla.redhat.com/show_bug.cgi?id=2082233?

Looking at the Go stack trace attached here, I do indeed see a lot of goroutines where the call to StartTransientUnitContext is present, exactly like the stack trace we have in https://bugzilla.redhat.com/show_bug.cgi?id=2082344#c0:

```
goroutine 12871381 [semacquire, 1436 minutes]:
sync.runtime_SemacquireMutex(0xc0043a3bbc, 0x0, 0x1)
	/usr/lib/golang/src/runtime/sema.go:71 +0x49
sync.(*Mutex).lockSlow(0xc0043a3bb8)
	/usr/lib/golang/src/sync/mutex.go:138 +0x10f
sync.(*Mutex).Lock(...)
	/usr/lib/golang/src/sync/mutex.go:81
github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus.(*Conn).startJob(0xc0043a3b80, 0x55cfd67058b0, 0xc00005e018, 0xc0001f8180, 0x55cfd5dfd57b, 0x33, 0xc011998000, 0x4, 0x4, 0x0, ...)
	/builddir/build/BUILD/cri-o-af6493181d9d32685678d5162f9462e70684ba86/_output/src/github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus/methods.go:56 +0x3a5
github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus.(*Conn).StartTransientUnitContext(0xc0043a3b80, 0x55cfd67058b0, 0xc00005e018, 0xc0001f8000, 0x52, 0x55cfd5d8d51c, 0x7, 0xc003eba360, 0x6, 0x6, ...)
	/builddir/build/BUILD/cri-o-af6493181d9d32685678d5162f9462e70684ba86/_output/src/github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus/methods.go:199 +0x1dc
github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus.(*Conn).StartTransientUnit(...)
	/builddir/build/BUILD/cri-o-af6493181d9d32685678d5162f9462e70684ba86/_output/src/github.com/cri-o/cri-o/vendor/github.com/coreos/go-systemd/v22/dbus/methods.go:194
```

```
$ grep StartTransientUnitContext crio-goroutine-stacks-2022-03-25T095219Z.log | wc -l
411
```

If this is indeed a duplicate of 2082344, maybe we should mark this bug as a duplicate.

Yeah, I think it's reasonable to believe this is a duplicate of that, so I'm setting it as such.

*** This bug has been marked as a duplicate of bug 2082344 ***