1888565 – [OSP] machine-config-daemon-firstboot.service failed with "error reading osImageURL from rpm-ostree"

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Antonio Murdaca
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-10-15 08:33 UTC by weiwei jiang
Modified:	2023-09-15 00:49 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-02-24 15:26:15 UTC
Target Upstream Version:
Embargoed:

Attachments
must_gather log (3.28 MB, application/gzip) 2020-10-15 08:33 UTC, weiwei jiang

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2291	None	closed	Bug 1888565: daemon: Explicitly start rpm-ostreed	2021-02-16 11:55:02 UTC
Github	openshift machine-config-operator pull 2296	None	closed	Revert "Bug 1888565: daemon: Explicitly start rpm-ostreed"	2021-02-16 11:55:02 UTC
Red Hat Product Errata	RHSA-2020:5633	None	None	None	2021-02-24 15:26:32 UTC

Description weiwei jiang 2020-10-15 08:33:05 UTC

Description of problem:
When launching a 4.7 IPI cluster on OSP, one of the three masters did not start as expected and logged the following:
-- Logs begin at Thu 2020-10-15 07:04:33 UTC, end at Thu 2020-10-15 07:34:42 UTC. --
Oct 15 07:06:22 qeci-9718-7lsl7-master-0 systemd[1]: Starting Machine Config Daemon Firstboot...
Oct 15 07:06:22 qeci-9718-7lsl7-master-0 machine-config-daemon[2507]: I1015 07:06:22.320043    2507 rpm-ostree.go:261] Running captured: rpm-ostree status --json
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 machine-config-daemon[2507]: error: error reading osImageURL from rpm-ostree: error running rpm-ostree status --json: error: Failed to activate service 'org.projectatomic.rpmostree1': timed out (service_start_timeout=25000ms)
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 machine-config-daemon[2507]: : exit status 1
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 systemd[1]: machine-config-daemon-firstboot.service: Main process exited, code=exited, status=1/FAILURE
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 systemd[1]: machine-config-daemon-firstboot.service: Failed with result 'exit-code'.
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 systemd[1]: Failed to start Machine Config Daemon Firstboot.
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 systemd[1]: machine-config-daemon-firstboot.service: Consumed 153ms CPU time



Version-Release number of selected component (if applicable):
./openshift-install 4.7.0-0.nightly-2020-10-14-214107
built from commit 7fd66477e7aa23d7a7ed5d0b8973e74cdae819ea
release image registry.svc.ci.openshift.org/ocp/release@sha256:575d71576dbcbc98d7df22c422938b56ed6234862c7a3c2938978f643826136f

How reproducible:
Always

Steps to Reproduce:
1. Launch an IPI cluster on OSP.

Actual results:
One of the three masters does not join the OCP cluster, and machine-config-daemon-firstboot.service fails to start.

Expected results:
All three masters should join the cluster, and machine-config-daemon-firstboot.service should start successfully.

Additional info:

Comment 1 weiwei jiang 2020-10-15 08:33:49 UTC

Created attachment 1721778 [details]
must_gather log

Comment 2 Colin Walters 2020-10-19 20:15:57 UTC

Unfortunately must-gather doesn't include the units we need here; can you get the output of at least:

`journalctl -b -u rpm-ostreed -u polkit -u dbus`

from the host?
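
The request above could be scripted roughly as follows. This is only a sketch: the helper name `collect_unit_logs`, the `JOURNALCTL` override variable, and the output path are hypothetical. Only the `journalctl` invocation comes from the comment, with `--no-pager` added so the output can be redirected to a file.

```shell
# Save the current boot's journal for the units requested above.
# JOURNALCTL is a hypothetical override, useful for dry runs or testing.
collect_unit_logs() {
    _out=$1
    "${JOURNALCTL:-journalctl}" --no-pager -b \
        -u rpm-ostreed -u polkit -u dbus > "$_out"
}

# Example, to be run on the affected master (the path is an assumption):
#   collect_unit_logs /tmp/rpm-ostreed-boot.log
```

The resulting file can then be attached to the bug alongside the must-gather archive.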

Comment 3 Colin Walters 2020-10-19 20:18:55 UTC

These timeouts often happen when the OS is provisioned on a slow storage medium. This failure case has happened in the past on, e.g., live systems run from a physical CD-ROM or slow USB sticks.

I bet this case is an OpenStack cluster with something like slow Ceph or other persistent storage.

Comment 4 Micah Abbott 2020-10-25 18:40:16 UTC

Higher priority items prevented work from happening on this issue; labeling for UpcomingSprint

Comment 6 Colin Walters 2020-10-30 14:18:46 UTC

If this is failing, it's highly likely that you wouldn't be able to run etcd either (persisting to the target disk).

Today, OpenShift CI by default basically disables etcd persistence on OpenStack and RHV:
https://github.com/openshift/release/blob/7180d60d8ceb277ea24989099e2df5dc54b866a4/ci-operator/templates/openshift/installer/cluster-launch-installer-openstack-e2e.yaml#L369

This is also related to the long-running "etcd on Azure" threads, see e.g.
https://bugzilla.redhat.com/show_bug.cgi?id=1877435

Personally I think we need a high level feature knob to use "instance local disks":
http://post-office.corp.redhat.com/archives/aos-devel/2020-August/msg00047.html

For a lot of our CI jobs and testing (and I'm guessing the test you're doing here) we're mostly interested in "sanity testing" and functionality testing - those clusters would be totally fine with a lower level of redundancy.

I'm closing this as DEFERRED because it needs to be fixed at a higher level.

Comment 7 Colin Walters 2020-12-09 21:26:13 UTC

Decided to reopen this since we can at least increase the timeout to match the global systemd one, and it does seem like we need to better ensure the MCO is reliably talking to rpm-ostreed. The PR here is a small step towards that.
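
One way to picture the idea of talking to rpm-ostreed more reliably is to start the daemon explicitly and retry the status query, rather than depend on a single D-Bus activation attempt. The shell sketch below is illustrative only and is not the code from the PR: the helper `retry_cmd` and the attempt and sleep values are hypothetical, while `systemctl start rpm-ostreed.service` and `rpm-ostree status --json` are the real commands involved.

```shell
# retry_cmd MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
retry_cmd() {
    _retry_max=$1
    _retry_sleep=$2
    shift 2
    _retry_n=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$_retry_n" -ge "$_retry_max" ]; then
            return 1
        fi
        _retry_n=$((_retry_n + 1))
        sleep "$_retry_sleep"
    done
}

# Hypothetical usage on an affected host (values are assumptions):
#   systemctl start rpm-ostreed.service
#   retry_cmd 3 10 rpm-ostree status --json
```

Retrying does not address the underlying slow storage; it only gives the daemon more time than the 25000 ms activation timeout seen in the log.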

Comment 11 Ronnie Lazar 2020-12-15 10:48:40 UTC

amurdaca, the status of this should be "POST", no?
I think that merging the revert moved it to "ON_QA".

Comment 13 Michael Nguyen 2021-01-05 15:43:57 UTC

The revert has made it into registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-05-055003

Comment 16 errata-xmlrpc 2021-02-24 15:26:15 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 17 Red Hat Bugzilla 2023-09-15 00:49:45 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days