Description of problem:
Seen in https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.6/1305774896491532288
This failure is pretty frequent: the CI job appears to pass only about 50% of the time because some containers fail to start.
Sep 15 08:32:24.393 W ns/e2e-provisioning-9532 pod/csi-hostpath-snapshotter-0 node/ci-op-p8tbzkwf-1a9c0-4wmdc-worker-b-kkxqz reason/Failed Error: container create failed: time="2020-09-15T08:32:21Z" level=warning msg="Timed out while waiting for StartTransientUnit(crio-8496af9a19c4f8bf6af7df83728d5546e1047c9937bf6064ccd3708e3f5a35f5.scope) completion signal from dbus. Continuing..."\ntime="2020-09-15T08:32:23Z" level=warning msg="signal: killed"\ntime="2020-09-15T08:32:24Z" level=error msg="container_linux.go:348: starting container process caused \"process_linux.go:438: container init caused \\\"process_linux.go:404: setting cgroup config for procHooks process caused \\\\\\\"Unit crio-8496af9a19c4f8bf6af7df83728d5546e1047c9937bf6064ccd3708e3f5a35f5.scope not found.\\\\\\\"\\\"\""\ncontainer_linux.go:348: starting container process caused "process_linux.go:438: container init caused \"process_linux.go:404: setting cgroup config for procHooks process caused \\\"Unit crio-8496af9a19c4f8bf6af7df83728d5546e1047c9937bf6064ccd3708e3f5a35f5.scope not found.\\\"\""\n
Starting with high severity as ~50% of the realtime jobs are affected.
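To help triage other runs for the same signature, here is a minimal sketch that checks an event line for the two messages seen above: the dbus timeout on StartTransientUnit and the follow-up "Unit ... not found" for the same scope. The function name and the regular expressions are assumptions written for illustration from the single event quoted in this report; they are not part of any CI tooling.

```python
import re

# Patterns derived from the one event quoted in this report (assumption:
# other occurrences use the same wording).
TIMEOUT_RE = re.compile(
    r"Timed out while waiting for StartTransientUnit\((crio-[0-9a-f]+\.scope)\)"
)
NOT_FOUND_RE = re.compile(r"Unit (crio-[0-9a-f]+\.scope) not found")


def match_scope_timeout(event):
    """Return the scope unit name if the event shows both the dbus timeout
    and a 'not found' error for the same unit; otherwise return None."""
    timed_out = TIMEOUT_RE.search(event)
    if timed_out is None:
        return None
    unit = timed_out.group(1)
    # Collect every unit reported as missing in the same event.
    missing = {m.group(1) for m in NOT_FOUND_RE.finditer(event)}
    return unit if unit in missing else None
```

Feeding it the event text from the job's event log should return the crio-...scope unit name for affected pods and None for unrelated failures.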
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
We have been seeing an increasing number of systemd timeout bugs recently:
https://bugzilla.redhat.com/show_bug.cgi?id=1819868 (with 2 dups)
Not sure what to do about them.
In previous cases, though, it was typically scale stress tests that triggered it. This is just a normal e2e run on an RT kernel, so that is strange.
Dup'ing this one as well. There is nothing kubelet/crio can do about this.
https://bugzilla.redhat.com/show_bug.cgi?id=1819868#c14
The issue is that a single thread in systemd gets overloaded, and the load is related to the number of mount points on the system.
On rt-kernel, systemd has even more of an issue since the kernel is optimized for scheduling latency and not throughput.
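Since the load is tied to the number of mount points, a rough way to compare nodes is to count the entries in mountinfo. This is a minimal sketch, assuming the standard Linux mountinfo layout (filesystem type is the first field after the " - " separator). The helper names are hypothetical, and reading /proc/self/mountinfo only shows the mounts visible in the caller's mount namespace, which may differ from what systemd sees.

```python
from collections import Counter
from pathlib import Path


def count_mounts(mountinfo_text):
    """Count mount entries: one non-empty line per mount point."""
    return sum(1 for line in mountinfo_text.splitlines() if line.strip())


def count_mounts_by_fstype(mountinfo_text):
    """Group mount entries by filesystem type.

    The filesystem type is the first field after the ' - ' separator.
    """
    counts = Counter()
    for line in mountinfo_text.splitlines():
        _, sep, tail = line.partition(" - ")
        if sep and tail.split():
            counts[tail.split()[0]] += 1
    return counts


if __name__ == "__main__":
    path = Path("/proc/self/mountinfo")
    if path.exists():  # only present on Linux
        text = path.read_text()
        print(count_mounts(text))
        print(count_mounts_by_fstype(text).most_common(5))
```

Running this on a worker node while the e2e tests are creating pods gives a rough idea of how many mounts systemd has to track at that moment.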
*** This bug has been marked as a duplicate of bug 1819868 ***