Bug 1998035

Summary:

openstack IPI CI: custom var-lib-etcd.mount (ramdisk) unit is racing due to incomplete After/Before order

Product:

OpenShift Container Platform

Reporter:

Matthew Booth <mbooth>

Component:

Cloud Compute

Assignee:

Matthew Booth <mbooth>

Cloud Compute sub component:

OpenStack Provider

QA Contact:

Jon Uriarte <juriarte>

Status:

CLOSED ERRATA

Severity:

low

Priority:

low

CC:

bgilbert, dornelas, jligon, m.andre, mfedosin, miabbott, mrussell, nstielau, pprinett

Version:

4.9

Keywords:

Triaged

Target Release:

4.10.0

Hardware:

Unspecified

OS:

Unspecified

Doc Type:

No Doc Update

Last Closed:

2022-03-12 04:37:30 UTC

Type:

Bug

Attachments:

Description	Flags
master-1 failure to boot	none

Description Matthew Booth 2021-08-26 10:18:59 UTC

Created attachment 1817838
master-1 failure to boot

OCP Version at Install Time: 4.9.0-0.ci.test-2021-08-26-081628-ci-op-btm7f0rp-latest
RHCOS Version at Install Time: rhcos-49.84.202108221651-0-openstack.x86_64.qcow2.gz
Platform: OpenStack
Architecture: x86_64

A CI job failed:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/21117/rehearse-21117-pull-ci-openshift-cloud-provider-openstack-master-e2e-openstack-ccm-install/1430804969698627584

We can see from the openstack console logs that it failed because one of the 3 masters failed to boot:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/21117/rehearse-21117-pull-ci-openshift-cloud-provider-openstack-master-e2e-openstack-ccm-install/1430804969698627584/artifacts/e2e-openstack-ccm-install/openstack-gather/artifacts/nodes/console_btm7f0rp-13339-tjj6m-master-1.log

I have attached this log to the BZ in case it is reaped by prow.

Note that the other 2 masters both booted. The boot of master-1 ends in:

You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.
Press Enter for maintenance
(or press Control-D to continue): 

I can see earlier in the logs:

         Starting OSTree Remount OS/ Bind Mounts...
[   91.471884] ostree-remount[1270]: ostree-remount: failed to remount(ro) /sysroot: Device or resource busy
         Mounting Mount etcd as a ramdisk...
[FAILED] Failed to start OSTree Remount OS/ Bind Mounts.

which may be relevant.

I assume this is a startup race? I have not been able to reproduce it, so I am reporting it in case it is of interest to you. As long as it doesn't become a regular flake, it is not significantly impacting me. On the face of it, it appears to be similar to:

https://github.com/coreos/fedora-coreos-tracker/issues/746

Comment 1 Benjamin Gilbert 2021-08-26 17:54:11 UTC

Indeed, it seems https://github.com/coreos/fedora-coreos-tracker/issues/746 was fixed by https://github.com/ostreedev/ostree/pull/2387 which is not currently in RHCOS.  Thanks for the report.

Comment 2 Benjamin Gilbert 2021-08-26 22:49:16 UTC

Possible duplicate of bug 1992618.

Comment 3 Micah Abbott 2021-08-27 19:16:36 UTC

Setting medium Pri/Sev and targeting for 4.9.  It may not be possible to land that patch in RHCOS in time for 4.9 GA, but seems like something we should pursue for future releases.

@Luca could you see that the ostree change in comment #1 makes its way into RHCOS?

Comment 4 Luca BRUNO 2021-09-08 15:44:55 UTC

Thanks for the report and the attached logs! The symptoms are the same as in the Fedora CoreOS issue, but I think the root cause is different, so I wouldn't start by approaching it with a backport.

My gut feeling is that whatever "Mount etcd as a ramdisk" is, it is missing some After/Before relationship and thus racing with 'ostree-remount.service'.
That would indeed be similar to https://bugzilla.redhat.com/show_bug.cgi?id=1992618 as Benjamin noted.

Matthew, do you maybe know which unit is doing that etcd-ramdisk setup and where it is coming from? I did a quick search around but I couldn't locate it, and I believe it is not part of OCP proper.

Comment 5 Matthew Booth 2021-09-08 16:10:19 UTC

(In reply to Luca BRUNO from comment #4)
> Thanks for the report and the attached logs! The symptoms are the same as
> the Fedora CoreOS but I think the root cause is different, so I wouldn't
> start approaching it with a backport upfront.
> 
> My gut feeling is that whatever "Mount etcd as a ramdisk" is, it is missing
> some After/Before relationship and thus racing with 'ostree-remount.service'.
> That would indeed be similar to
> https://bugzilla.redhat.com/show_bug.cgi?id=1992618 as Benjamin noted.
> 
> Matthew, do you maybe know which unit is doing that etcd-ramdisk setup and
> where it is coming from? I did a quick search around but I couldn't locate
> it, and I believe it is not part of OCP proper.

It's a hack used in CI for when a cloud's underlying storage isn't fast enough for etcd. We're currently using it by default for OpenStack, but starting to phase it out as we're no longer running everything on the one cloud that really needs it. It's defined via ignition here: https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/conf/etcd/on-ramfs/ipi-conf-etcd-on-ramfs-commands.sh

Sounds like we need to fix the dependencies of that unit?

Comment 6 Matthew Booth 2021-09-08 16:12:22 UTC

Should have said: if the answer's yes please reassign it to Cloud Compute -> OpenStack Provider with my thanks and we'll fix it.

Comment 7 Luca BRUNO 2021-09-08 16:19:41 UTC

Yes, I think that an 'After=ostree-remount.service var.mount' on that mount unit should fix this race.

Re-assigning so that the CI can be directly fixed.

I'll keep working on the other ticket to see if we can make this less cumbersome for the users.
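
For illustration, a minimal sketch of how the ordering suggested above could look on the mount unit. The unit name var-lib-etcd.mount comes from the bug summary and the Description string from the console log; the [Mount] values (tmpfs type, size), the Before= line, and the [Install] section are assumptions made for this example and are not copied from the CI script.

```ini
# var-lib-etcd.mount (sketch only; values marked "assumed" are hypothetical)
[Unit]
Description=Mount etcd as a ramdisk
# Suggested fix: order this mount after ostree's remount and after /var,
# so it no longer races with ostree-remount.service.
After=ostree-remount.service var.mount
# Assumed: the mount should be in place before local filesystems are considered ready.
Before=local-fs.target

[Mount]
# Assumed values for illustration.
What=none
Where=/var/lib/etcd
Type=tmpfs
Options=size=2G

[Install]
# Assumed.
WantedBy=local-fs.target
```

Note that After= only orders units that are already part of the boot transaction; it does not pull them in. On a booted node, 'systemctl show -p After var-lib-etcd.mount' should show the resulting ordering dependencies.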

Comment 8 Pierre Prinetti 2021-09-22 15:22:40 UTC

Changing the severity to LOW because it doesn't happen often.

Comment 9 ShiftStack Bugwatcher 2021-11-25 16:12:14 UTC

Removing the Triaged keyword because:

* the QE automation assessment (flag qe_test_coverage) is missing

Comment 16 errata-xmlrpc 2022-03-12 04:37:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056