Bug 1888565
| Summary: | [OSP] machine-config-daemon-firstboot.service failed with "error reading osImageURL from rpm-ostree" |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Machine Config Operator |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | high |
| Version: | 4.7 |
| Target Milestone: | --- |
| Target Release: | 4.7.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | weiwei jiang <wjiang> |
| Assignee: | Antonio Murdaca <amurdaca> |
| QA Contact: | Michael Nguyen <mnguyen> |
| CC: | alazar, bbreard, imcleod, jligon, miabbott, nstielau, walters, zzhao |
| Keywords: | Reopened |
| Doc Type: | If docs needed, set a value |
| Type: | Bug |
| Last Closed: | 2021-02-24 15:26:15 UTC |
Description (weiwei jiang, 2020-10-15 08:33:05 UTC)
Created attachment 1721778 [details]
must_gather log
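For context on what is failing: the machine-config-daemon's firstboot service determines the osImageURL by querying rpm-ostree's status. A minimal, hypothetical Go sketch of that idea (not the MCD's actual code; the JSON key and the sample image reference below are illustrative, and the exact field the MCD reads varies by version) parsing `rpm-ostree status --json` output for the booted deployment:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deployment mirrors only the subset of `rpm-ostree status --json`
// output this sketch needs.
type deployment struct {
	Booted   bool   `json:"booted"`
	ImageRef string `json:"container-image-reference"`
}

type status struct {
	Deployments []deployment `json:"deployments"`
}

// bootedImageRef returns the image reference of the booted deployment,
// or an error if none is marked booted or the JSON does not parse.
func bootedImageRef(raw []byte) (string, error) {
	var st status
	if err := json.Unmarshal(raw, &st); err != nil {
		return "", err
	}
	for _, d := range st.Deployments {
		if d.Booted {
			return d.ImageRef, nil
		}
	}
	return "", fmt.Errorf("no booted deployment found")
}

func main() {
	// Hypothetical sample; on a real node this JSON would come from
	// running `rpm-ostree status --json` (the call that times out here).
	sample := []byte(`{"deployments":[{"booted":true,"container-image-reference":"example.com/os/image:tag"}]}`)
	ref, err := bootedImageRef(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(ref)
}
```

If rpm-ostreed never answers (or answers after the client's D-Bus timeout), there is no JSON to parse at all, which is the "error reading osImageURL from rpm-ostree" in the summary.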
Unfortunately must-gather doesn't include the units we need here; can you get the output of at least `journalctl -b -u rpm-ostreed -u polkit -u dbus` from the host?

These timeouts can often happen when the OS is provisioned on a slow storage medium. That failure case has happened in the past on e.g. live systems being run from a physical CD-ROM or slow USB sticks. I bet this case is an OpenStack cluster with something like slow Ceph or other persistent storage.

Higher-priority items prevented work from happening on this issue; labeling for UpcomingSprint.

If this is failing, it's highly likely that you wouldn't be able to run etcd either (which persists to the target disk). Today, OpenShift CI by default basically disables etcd persistence on OpenStack and RHV: https://github.com/openshift/release/blob/7180d60d8ceb277ea24989099e2df5dc54b866a4/ci-operator/templates/openshift/installer/cluster-launch-installer-openstack-e2e.yaml#L369

This is also related to the long-running "etcd on Azure" threads; see e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1877435

Personally I think we need a high-level feature knob to use "instance local disks": http://post-office.corp.redhat.com/archives/aos-devel/2020-August/msg00047.html

For a lot of our CI jobs and testing (and I'm guessing the test you're doing here) we're mostly interested in sanity testing and functionality testing; those clusters would be totally fine with a lower level of redundancy. I'm closing this as DEFERRED because it needs to be fixed at a higher level.

Decided to reopen this since we can at least increase the timeout to match the global systemd one, and it does seem like we need to better ensure the MCO is reliably talking to rpm-ostreed. The PR here is a small step towards that.

amurdaca, the status of this should be "POST", no?
I think that merging the revert moved it to "ON_QA".

The revert has made it into registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-05-055003

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.