Description of problem:

When trying to launch a 4.7 IPI cluster on OSP, 1/3 masters does not start as expected, with the following logs:

-- Logs begin at Thu 2020-10-15 07:04:33 UTC, end at Thu 2020-10-15 07:34:42 UTC. --
Oct 15 07:06:22 qeci-9718-7lsl7-master-0 systemd[1]: Starting Machine Config Daemon Firstboot...
Oct 15 07:06:22 qeci-9718-7lsl7-master-0 machine-config-daemon[2507]: I1015 07:06:22.320043    2507 rpm-ostree.go:261] Running captured: rpm-ostree status --json
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 machine-config-daemon[2507]: error: error reading osImageURL from rpm-ostree: error running rpm-ostree status --json: error: Failed to activate service 'org.projectatomic.rpmostree1': timed out (service_start_timeout=25000ms)
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 machine-config-daemon[2507]: : exit status 1
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 systemd[1]: machine-config-daemon-firstboot.service: Main process exited, code=exited, status=1/FAILURE
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 systemd[1]: machine-config-daemon-firstboot.service: Failed with result 'exit-code'.
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 systemd[1]: Failed to start Machine Config Daemon Firstboot.
Oct 15 07:07:07 qeci-9718-7lsl7-master-0 systemd[1]: machine-config-daemon-firstboot.service: Consumed 153ms CPU time

Version-Release number of selected component (if applicable):

./openshift-install 4.7.0-0.nightly-2020-10-14-214107
built from commit 7fd66477e7aa23d7a7ed5d0b8973e74cdae819ea
release image registry.svc.ci.openshift.org/ocp/release@sha256:575d71576dbcbc98d7df22c422938b56ed6234862c7a3c2938978f643826136f

How reproducible:

Always

Steps to Reproduce:
1. Launch an IPI cluster on OSP

Actual results:

1/3 masters does not join the OCP cluster, and machine-config-daemon-firstboot.service fails to start.

Expected results:

All three masters start and join the cluster.

Additional info:
Created attachment 1721778 [details] must_gather log
Unfortunately, must-gather doesn't include the units we need here; can you get the output of at least `journalctl -b -u rpm-ostreed -u polkit -u dbus` from the host?
These timeouts can often happen when the OS is provisioned on a slow storage medium. That failure has happened in the past on e.g. live systems being run from a physical CD-ROM or slow USB sticks. I suspect this case is an OpenStack cluster backed by something like slow Ceph or other persistent storage.
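As a hypothetical way to check that theory on the affected host (this command is not from the bug, just a rough illustrative probe): measure synchronous write latency on the disk backing /var, since D-Bus activation timeouts like the one above tend to correlate with slow backing storage.

```shell
# Rough probe, not a formal benchmark: write 500 4k blocks with O_DSYNC so
# each write hits the backing device; dd prints the effective throughput.
dd if=/dev/zero of=/var/tmp/ioprobe bs=4k count=500 oflag=dsync 2>&1 | tail -n 1
rm -f /var/tmp/ioprobe
```

On healthy local storage this completes in a second or two; on very slow network-attached volumes it can take far longer, which would be consistent with rpm-ostreed failing to start within the 25s D-Bus activation window.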
Higher priority items prevented work from happening on this issue; labeling for UpcomingSprint
If this is failing, it's highly likely that you wouldn't be able to run etcd either (persisting to the target disk). Today, OpenShift CI by default basically disables etcd persistence on OpenStack and RHV: https://github.com/openshift/release/blob/7180d60d8ceb277ea24989099e2df5dc54b866a4/ci-operator/templates/openshift/installer/cluster-launch-installer-openstack-e2e.yaml#L369

This is also related to the long-running "etcd on Azure" threads, see e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1877435

Personally I think we need a high-level feature knob to use "instance local disks": http://post-office.corp.redhat.com/archives/aos-devel/2020-August/msg00047.html

For a lot of our CI jobs and testing (and I'm guessing the test you're doing here) we're mostly interested in "sanity testing" and functionality testing - those clusters would be totally fine with a lower level of redundancy.

I'm closing this as DEFERRED because it needs to be fixed at a higher level.
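For context, what "disables etcd persistence" means in the linked CI template is roughly the following sketch (paraphrased for illustration, not copied verbatim from the template; the exact unit name and tmpfs size are assumptions): a systemd mount unit delivered via MachineConfig that puts the etcd data directory on tmpfs, trading durability for insensitivity to slow disks.

```
# Sketch of a var-lib-etcd.mount unit keeping etcd data in RAM.
# Losing a node loses that member's data, which is acceptable for
# throwaway CI/sanity-test clusters only.
[Unit]
Description=Mount etcd data directory on tmpfs (no persistence)
Before=local-fs.target

[Mount]
What=tmpfs
Where=/var/lib/etcd
Type=tmpfs
Options=size=2G

[Install]
WantedBy=local-fs.target
```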
Decided to reopen this since we can at least increase the timeout to match the global systemd one, and it does seem like we need to better ensure the MCO is reliably talking to rpm-ostreed. The PR here is a small step towards that.
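For reference, the "service_start_timeout=25000ms" in the error above is the D-Bus activation timeout, not a systemd unit timeout. As a sketch of where that knob lives (an assumption about a stock dbus-daemon system bus configuration; this is not the approach the PR takes, which is on the MCO side):

```
<!-- Hypothetical override, e.g. in /etc/dbus-1/system.d/; dbus-daemon's
     service_start_timeout limit is in milliseconds and defaults to 25000
     on the system bus. Shown only to illustrate where the 25s comes from. -->
<busconfig>
  <limit name="service_start_timeout">120000</limit>
</busconfig>
```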
amurdaca, the status of this should be "POST", no? I think that merging the revert moved it to "ON_QA".
The revert has made it into registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-05-055003
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days