1879152 – [ci][kernel-rt] Timed out while waiting for StartTransientUnit(crio-....scope) completion signal from dbus. Continuing..."

Summary: [ci][kernel-rt] Timed out while waiting for StartTransientUnit(crio-....scope...

Keywords:
Status:	CLOSED DUPLICATE of bug 1819868
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Seth Jennings
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:	TechnicalReleaseBlocker
Duplicates (1):	1875045
Depends On:
Blocks:

Reported:	2020-09-15 14:28 UTC by Michal Fojtik
Modified:	2020-10-20 14:28 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-09-28 15:59:33 UTC
Target Upstream Version:
Embargoed:

Description Michal Fojtik 2020-09-15 14:28:49 UTC

Description of problem:

Seen in https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.6/1305774896491532288

This failure is frequent: the CI job appears to be at a roughly 50% pass rate because some containers fail to start.

Sep 15 08:32:24.393 W ns/e2e-provisioning-9532 pod/csi-hostpath-snapshotter-0 node/ci-op-p8tbzkwf-1a9c0-4wmdc-worker-b-kkxqz reason/Failed Error: container create failed: time="2020-09-15T08:32:21Z" level=warning msg="Timed out while waiting for StartTransientUnit(crio-8496af9a19c4f8bf6af7df83728d5546e1047c9937bf6064ccd3708e3f5a35f5.scope) completion signal from dbus. Continuing..."\ntime="2020-09-15T08:32:23Z" level=warning msg="signal: killed"\ntime="2020-09-15T08:32:24Z" level=error msg="container_linux.go:348: starting container process caused \"process_linux.go:438: container init caused \\\"process_linux.go:404: setting cgroup config for procHooks process caused \\\\\\\"Unit crio-8496af9a19c4f8bf6af7df83728d5546e1047c9937bf6064ccd3708e3f5a35f5.scope not found.\\\\\\\"\\\"\""\ncontainer_linux.go:348: starting container process caused "process_linux.go:438: container init caused \"process_linux.go:404: setting cgroup config for procHooks process caused \\\"Unit crio-8496af9a19c4f8bf6af7df83728d5546e1047c9937bf6064ccd3708e3f5a35f5.scope not found.\\\"\""\n

Starting with high severity as ~50% of the realtime jobs are affected.
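
For triage across many CI event lines, the affected scope units can be pulled out of the warning text. The following is a minimal sketch, not part of the original report: the regular expression is an assumption derived from the single sample message quoted above, and find_timed_out_units is a hypothetical helper name.

```python
import re

# Pattern derived from the one warning quoted in this report (an assumption,
# not taken from the runc source).
TIMEOUT_RE = re.compile(
    r"Timed out while waiting for StartTransientUnit\((?P<unit>[^)]+)\)"
)


def find_timed_out_units(event_text):
    """Return distinct unit names from StartTransientUnit timeout warnings,
    in order of first appearance."""
    seen = []
    for match in TIMEOUT_RE.finditer(event_text):
        unit = match.group("unit")
        if unit not in seen:
            seen.append(unit)
    return seen


# Excerpt of the event message from the description above.
sample = (
    'level=warning msg="Timed out while waiting for StartTransientUnit('
    "crio-8496af9a19c4f8bf6af7df83728d5546e1047c9937bf6064ccd3708e3f5a35f5.scope"
    ') completion signal from dbus. Continuing..."'
)
print(find_timed_out_units(sample))
```

Collecting the unit names this way makes it easier to match each timeout against the later "Unit ... not found" error for the same container.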

Comment 1 Seth Jennings 2020-09-15 14:42:51 UTC

We have been seeing an increasing number of systemd timeout bugs recently:
https://bugzilla.redhat.com/show_bug.cgi?id=1819868 (with 2 dups)

Not sure what to do about them.

In previous cases, though, it was typically scale stress tests that triggered it.  This is just a normal e2e run on an RT kernel, so that is strange.

Comment 2 Mrunal Patel 2020-09-15 14:56:13 UTC

Lukas, Nykryn,
We are seeing StartTransientUnit failures when using rt kernel. Are there any known issues with systemd and rt kernel?

Comment 3 Tom Sweeney 2020-09-15 15:50:44 UTC

cc'd Valentin Rothberg in case he has any thoughts/insights on the systemd angle.

Comment 4 Tomas Smetana 2020-09-22 15:16:34 UTC

*** Bug 1875045 has been marked as a duplicate of this bug. ***

Comment 5 Seth Jennings 2020-09-28 15:59:33 UTC

dup'ing this one as well.  There is nothing kubelet/crio can do about this.

https://bugzilla.redhat.com/show_bug.cgi?id=1819868#c14

The issue is that a single thread in systemd gets overloaded, and the load is related to the number of mount points on the system.

On rt-kernel, systemd has even more of an issue since the kernel is optimized for scheduling latency and not throughput.
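
Since the load is tied to the number of mount points, it can help to see how many mounts a node has. The following is a rough sketch, not something from this bug or from bug 1819868: it counts entries in /proc/self/mountinfo by filesystem type, relying on the documented " - " separator in that file. The helper names and the sample lines are made up for illustration, and the count says nothing by itself about whether systemd will time out.

```python
from collections import Counter


def count_mounts_by_fstype(mountinfo_text):
    """Count mount entries per filesystem type in mountinfo-formatted text."""
    counts = Counter()
    for line in mountinfo_text.splitlines():
        # After the " - " separator the fields are: fstype, source, super options.
        _, sep, tail = line.partition(" - ")
        if not sep:
            continue  # not a mountinfo entry
        fields = tail.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def read_mountinfo(path="/proc/self/mountinfo"):
    """Read the mount table of the current process (Linux only)."""
    with open(path) as f:
        return f.read()


# Made-up sample lines in mountinfo format, for illustration only.
sample = """\
22 1 8:2 / / rw,relatime shared:1 - xfs /dev/sda2 rw,seclabel
23 22 0:21 / /proc rw,nosuid - proc proc rw
24 22 0:5 / /dev/shm rw,nosuid shared:2 - tmpfs tmpfs rw,seclabel
25 22 0:22 / /run rw,nosuid shared:3 - tmpfs tmpfs rw,mode=755
"""
counts = count_mounts_by_fstype(sample)
print(sum(counts.values()), dict(counts))
```

On a live node, count_mounts_by_fstype(read_mountinfo()) gives the same breakdown for the real mount table.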

*** This bug has been marked as a duplicate of bug 1819868 ***
