Bug 1888565
| Summary: | [OSP] machine-config-daemon-firstboot.service failed with "error reading osImageURL from rpm-ostree" |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Machine Config Operator |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | high |
| Version: | 4.7 |
| Target Milestone: | --- |
| Target Release: | 4.7.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | weiwei jiang <wjiang> |
| Assignee: | Antonio Murdaca <amurdaca> |
| QA Contact: | Michael Nguyen <mnguyen> |
| CC: | alazar, bbreard, imcleod, jligon, miabbott, nstielau, walters, zzhao |
| Keywords: | Reopened |
| Doc Type: | If docs needed, set a value |
| Type: | Bug |
| Last Closed: | 2021-02-24 15:26:15 UTC |
Description (weiwei jiang, 2020-10-15 08:33:05 UTC)
Created attachment 1721778 [details]
must_gather log
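For context on what is failing: the machine-config-daemon's firstboot service determines the osImageURL by querying rpm-ostree's status. A minimal, hypothetical Go sketch of that idea (not the MCD's actual code; the JSON key and the sample image reference below are illustrative, and the exact field the MCD reads varies by version) parsing `rpm-ostree status --json` output for the booted deployment:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deployment mirrors only the subset of `rpm-ostree status --json`
// output this sketch needs.
type deployment struct {
	Booted   bool   `json:"booted"`
	ImageRef string `json:"container-image-reference"`
}

type status struct {
	Deployments []deployment `json:"deployments"`
}

// bootedImageRef returns the image reference of the booted deployment,
// or an error if none is marked booted or the JSON does not parse.
func bootedImageRef(raw []byte) (string, error) {
	var st status
	if err := json.Unmarshal(raw, &st); err != nil {
		return "", err
	}
	for _, d := range st.Deployments {
		if d.Booted {
			return d.ImageRef, nil
		}
	}
	return "", fmt.Errorf("no booted deployment found")
}

func main() {
	// Hypothetical sample; on a real node this JSON would come from
	// running `rpm-ostree status --json` (the call that times out here).
	sample := []byte(`{"deployments":[{"booted":true,"container-image-reference":"example.com/os/image:tag"}]}`)
	ref, err := bootedImageRef(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(ref)
}
```

If rpm-ostreed never answers (or answers after the client's D-Bus timeout), there is no JSON to parse at all, which is the "error reading osImageURL from rpm-ostree" in the summary.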
Unfortunately must-gather doesn't include the units we need here; can you get the output of at least `journalctl -b -u rpm-ostreed -u polkit -u dbus` from the host?

These timeouts can often happen when the OS is provisioned on a slow storage medium. That failure case has happened in the past on e.g. live systems being run from a physical CD-ROM or slow USB sticks. I bet this case is an OpenStack cluster with something like slow Ceph or other persistent storage.

Higher-priority items prevented work from happening on this issue; labeling for UpcomingSprint.

If this is failing, it's highly likely that you wouldn't be able to run etcd either (which persists to the target disk). Today, OpenShift CI by default basically disables etcd persistence on OpenStack and RHV: https://github.com/openshift/release/blob/7180d60d8ceb277ea24989099e2df5dc54b866a4/ci-operator/templates/openshift/installer/cluster-launch-installer-openstack-e2e.yaml#L369

This is also related to the long-running "etcd on Azure" threads; see e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1877435

Personally I think we need a high-level feature knob to use "instance local disks": http://post-office.corp.redhat.com/archives/aos-devel/2020-August/msg00047.html

For a lot of our CI jobs and testing (and I'm guessing the test you're doing here) we're mostly interested in sanity testing and functionality testing; those clusters would be totally fine with a lower level of redundancy. I'm closing this as DEFERRED because it needs to be fixed at a higher level.

Decided to reopen this since we can at least increase the timeout to match the global systemd one, and it does seem like we need to better ensure the MCO is reliably talking to rpm-ostreed. The PR here is a small step towards that.

amurdaca, the status of this should be "POST", no?
I think that merging the revert moved it to "ON_QA".

The revert has made it into registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-05-055003

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.