Description of problem:

Seen in https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.6/1305774896491532288

The failure is frequent; this CI job appears to be at about a 50% pass rate because some containers fail to start.

Sep 15 08:32:24.393 W ns/e2e-provisioning-9532 pod/csi-hostpath-snapshotter-0 node/ci-op-p8tbzkwf-1a9c0-4wmdc-worker-b-kkxqz reason/Failed Error: container create failed: time="2020-09-15T08:32:21Z" level=warning msg="Timed out while waiting for StartTransientUnit(crio-8496af9a19c4f8bf6af7df83728d5546e1047c9937bf6064ccd3708e3f5a35f5.scope) completion signal from dbus. Continuing..."\ntime="2020-09-15T08:32:23Z" level=warning msg="signal: killed"\ntime="2020-09-15T08:32:24Z" level=error msg="container_linux.go:348: starting container process caused \"process_linux.go:438: container init caused \\\"process_linux.go:404: setting cgroup config for procHooks process caused \\\\\\\"Unit crio-8496af9a19c4f8bf6af7df83728d5546e1047c9937bf6064ccd3708e3f5a35f5.scope not found.\\\\\\\"\\\"\""\ncontainer_linux.go:348: starting container process caused "process_linux.go:438: container init caused \"process_linux.go:404: setting cgroup config for procHooks process caused \\\"Unit crio-8496af9a19c4f8bf6af7df83728d5546e1047c9937bf6064ccd3708e3f5a35f5.scope not found.\\\"\""\n

Starting with high severity as ~50% of the realtime jobs are affected.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
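To gauge how often this shows up in a given job's event log, something like the following could help. It is only a rough sketch: the function name `find_timed_out_units` is made up for illustration, and it assumes the warning text always matches the wording quoted in the description above.

```python
import re

# Matches the runtime warning quoted in the description and captures the unit name.
# Assumption: the message wording is stable across the logs being scanned.
TIMEOUT_RE = re.compile(
    r"Timed out while waiting for StartTransientUnit\(([^)]+)\)"
)


def find_timed_out_units(log_text):
    """Return unit names from StartTransientUnit timeout warnings, in order seen."""
    return TIMEOUT_RE.findall(log_text)
```

Calling `len(find_timed_out_units(text))` on a downloaded log gives a count of affected container starts, which may make it easier to compare runs.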
We have been seeing an increasing number of systemd timeout bugs recently, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1819868 (with 2 dups). I am not sure what to do about them. In previous cases, though, it was typically scale stress tests that triggered them. This is just a normal e2e run on an RT kernel, so that is strange.
Lukas, Nykryn, we are seeing StartTransientUnit failures when using the RT kernel. Are there any known issues with systemd and the RT kernel?
cc'd Valentin Rothberg in case he has any thoughts/insights on the systemd angle.
*** Bug 1875045 has been marked as a duplicate of this bug. ***
Dup'ing this one as well. There is nothing kubelet/crio can do about this; see https://bugzilla.redhat.com/show_bug.cgi?id=1819868#c14. The issue is a single systemd thread being overloaded, and it is related to the number of mount points on the system. On the RT kernel, systemd has even more trouble because the kernel is optimized for scheduling latency rather than throughput.

*** This bug has been marked as a duplicate of bug 1819868 ***
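Since the number of mount points is the relevant factor, a quick way to see how many a node has is to count the entries in /proc/self/mountinfo. The sketch below is illustrative only: the helper names are hypothetical, and it relies on the documented mountinfo layout in which a lone "-" field separates the optional fields from the filesystem type.

```python
from collections import Counter


def count_mounts_by_fstype(mountinfo_text):
    """Count mount entries per filesystem type from mountinfo-formatted text."""
    counts = Counter()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        try:
            # The first six fields are fixed; the "-" separator follows the
            # optional fields, so start searching at index 6.
            sep = fields.index("-", 6)
        except ValueError:
            continue  # Not a well-formed mountinfo line; skip it.
        if sep + 1 < len(fields):
            counts[fields[sep + 1]] += 1
    return counts


def read_mountinfo(path="/proc/self/mountinfo"):
    """Read the mount table of the current process (Linux only)."""
    with open(path) as f:
        return f.read()
```

On a node, `sum(count_mounts_by_fstype(read_mountinfo()).values())` gives the total number of mounts, and the per-type breakdown may hint at what is creating them.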