Bug 1888565 - [OSP] machine-config-daemon-firstboot.service failed with "error reading osImageURL from rpm-ostree" [NEEDINFO]
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.7.0
Assignee: Antonio Murdaca
QA Contact: Michael Nguyen
Depends On:
Reported: 2020-10-15 08:33 UTC by weiwei jiang
Modified: 2021-02-24 15:26 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2021-02-24 15:26:15 UTC
Target Upstream Version:
mnguyen: needinfo? (walters)

Attachments
must_gather log (3.28 MB, application/gzip)
2020-10-15 08:33 UTC, weiwei jiang

System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 2291 | 0 | None | closed | Bug 1888565: daemon: Explicitly start rpm-ostreed | 2021-02-16 11:55:02 UTC
Github | openshift machine-config-operator pull 2296 | 0 | None | closed | Revert "Bug 1888565: daemon: Explicitly start rpm-ostreed" | 2021-02-16 11:55:02 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:26:32 UTC

Description weiwei jiang 2020-10-15 08:33:05 UTC
Description of problem:
When launching a 4.7 IPI cluster on OSP, one of the three masters did not start as expected and logged the following:
-- Logs begin at Thu 2020-10-15 07:04:33 UTC, end at Thu 2020-10-15 07:34:42 UTC. --
Oct 15 07:06:22 qeci-9718-7lsl7-master-0 systemd[1]: Starting Machine Config Daemon Firstboot...
Oct 15 07:06:22 qeci-9718-7lsl7-master-0 machine-config-daemon[2507]: I1015 07:06:22.320043    2507 rpm-ostree.go:261] Running captured: rpm-ostree status --json
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 machine-config-daemon[2507]: error: error reading osImageURL from rpm-ostree: error running rpm-ostree status --json: error: Failed to activate service 'org.projectatomic.rpmostree1': timed out (service_start_timeout=25000ms)
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 machine-config-daemon[2507]: : exit status 1
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 systemd[1]: machine-config-daemon-firstboot.service: Main process exited, code=exited, status=1/FAILURE
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 systemd[1]: machine-config-daemon-firstboot.service: Failed with result 'exit-code'.
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 systemd[1]: Failed to start Machine Config Daemon Firstboot.
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 systemd[1]: machine-config-daemon-firstboot.service: Consumed 153ms CPU time
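
The failure signature in the log above is a D-Bus activation timeout for the rpm-ostree service. For triaging similar journals, a minimal Python sketch along these lines could scan log lines for that error and pull out the service name and timeout value. The function name and the regular expression are illustrative assumptions and are not part of the MCO.

```python
import re

# Matches the D-Bus activation failure reported in the journal excerpt.
# \s+ tolerates extra whitespace between "timed" and "out".
ACTIVATION_TIMEOUT = re.compile(
    r"Failed to activate service '(?P<service>[^']+)': timed\s+out "
    r"\(service_start_timeout=(?P<timeout_ms>\d+)ms\)"
)


def find_activation_timeouts(lines):
    """Return (service, timeout_ms) for each line reporting an activation timeout."""
    hits = []
    for line in lines:
        match = ACTIVATION_TIMEOUT.search(line)
        if match:
            hits.append((match.group("service"), int(match.group("timeout_ms"))))
    return hits
```

Feeding it the lines of a saved journal (for example `open(path).read().splitlines()`) returns one tuple per matching line, so an empty list means this particular failure was not seen.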

Version-Release number of selected component (if applicable):
./openshift-install 4.7.0-0.nightly-2020-10-14-214107
built from commit 7fd66477e7aa23d7a7ed5d0b8973e74cdae819ea
release image registry.svc.ci.openshift.org/ocp/release@sha256:575d71576dbcbc98d7df22c422938b56ed6234862c7a3c2938978f643826136f

How reproducible:

Steps to Reproduce:
1. Launch an IPI cluster on OSP

Actual results:
One of the three masters does not join the OCP cluster, and machine-config-daemon-firstboot.service fails to start.

Expected results:
All three masters should join the cluster, and machine-config-daemon-firstboot.service should start successfully.

Additional info:

Comment 1 weiwei jiang 2020-10-15 08:33:49 UTC
Created attachment 1721778 [details]
must_gather log

Comment 2 Colin Walters 2020-10-19 20:15:57 UTC
Unfortunately must-gather doesn't include the units we need here; can you get the output of at least:

`journalctl -b -u rpm-ostreed -u polkit -u dbus`

from the host?
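
If the same logs need to be gathered from several hosts, the command above can be wrapped in a small helper. This is only a sketch: the function names are hypothetical, and it assumes `journalctl` is on the PATH of the host where it runs.

```python
import subprocess

# Units named in the request above.
UNITS = ("rpm-ostreed", "polkit", "dbus")


def journal_command(units=UNITS):
    """Build the journalctl invocation for the current boot (-b), one -u per unit."""
    cmd = ["journalctl", "-b"]
    for unit in units:
        cmd += ["-u", unit]
    return cmd


def collect_journal(units=UNITS, run=subprocess.run):
    """Run journalctl on the local host and return its output as text."""
    result = run(journal_command(units), capture_output=True, text=True, check=True)
    return result.stdout
```

The `run` parameter exists only so the helper can be exercised without a real journal; on a node, `collect_journal()` with the defaults is enough.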

Comment 3 Colin Walters 2020-10-19 20:18:55 UTC
These timeouts often happen when the OS is provisioned on a slow storage medium. This failure has happened in the past on, e.g., live systems run from a physical CD-ROM or slow USB sticks.

I bet this case is an OpenStack cluster with something like slow Ceph or other persistent storage.

Comment 4 Micah Abbott 2020-10-25 18:40:16 UTC
Higher priority items prevented work from happening on this issue; labeling for UpcomingSprint

Comment 6 Colin Walters 2020-10-30 14:18:46 UTC
If this is failing, it's highly likely that you wouldn't be able to run etcd either (persisting to the target disk).

Today, OpenShift CI by default basically disables etcd persistence on OpenStack and RHV:

This is also related to the long-running "etcd on Azure" threads, see e.g.

Personally I think we need a high level feature knob to use "instance local disks":

For a lot of our CI jobs and testing (and I'm guessing the test you're doing here) we're mostly interested in "sanity testing" and functionality testing - those clusters would be totally fine with a lower level of redundancy.

I'm closing this as DEFERRED because it needs to be fixed at a higher level.

Comment 7 Colin Walters 2020-12-09 21:26:13 UTC
Decided to reopen this since we can at least increase the timeout to match the global systemd one, and it does seem like we need to better ensure the MCO is reliably talking to rpm-ostreed. The PR here is a small step towards that.
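
One way to make a caller more tolerant of a slow-starting rpm-ostreed is to retry the status query rather than fail on the first activation timeout. The Python sketch below illustrates that idea only. It is not what the linked PR does, the MCO itself is written in Go, and the function name, attempt count, and delay are assumptions chosen for illustration.

```python
import json
import subprocess
import time


def rpm_ostree_status(attempts=5, delay_s=10.0, run=subprocess.run, sleep=time.sleep):
    """Return parsed `rpm-ostree status --json`, retrying when the command fails."""
    last_error = None
    for attempt in range(attempts):
        try:
            result = run(
                ["rpm-ostree", "status", "--json"],
                capture_output=True,
                text=True,
                check=True,  # non-zero exit raises CalledProcessError
            )
            return json.loads(result.stdout)
        except subprocess.CalledProcessError as err:
            last_error = err
            # Wait before the next attempt, but not after the final one.
            if attempt < attempts - 1:
                sleep(delay_s)
    raise RuntimeError(
        f"rpm-ostree status failed after {attempts} attempts"
    ) from last_error
```

Retrying only papers over slow service activation; it does not help if the underlying storage is too slow for the node to function, which is the concern raised in the earlier comments.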

Comment 11 Ronnie Lazar 2020-12-15 10:48:40 UTC
amurdaca@redhat.com the status of this should be "POST", no?
I think that merging the revert moved it to "ON_QA"

Comment 13 Michael Nguyen 2021-01-05 15:43:57 UTC
The revert has made it into registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-05-055003

Comment 16 errata-xmlrpc 2021-02-24 15:26:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

