Bug 1998035 - openstack IPI CI: custom var-lib-etcd.mount (ramdisk) unit is racing due to incomplete After/Before order
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.10.0
Assignee: Matthew Booth
QA Contact: Jon Uriarte
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-08-26 10:18 UTC by Matthew Booth
Modified: 2022-03-12 04:37 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-12 04:37:30 UTC
Target Upstream Version:
Embargoed:


Attachments
master-1 failure to boot (98.54 KB, text/plain)
2021-08-26 10:18 UTC, Matthew Booth


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift release pull 24210 | 0 | None | Merged | Bug 1998035: Fix race mounting etcd on ramfs | 2022-01-27 15:27:52 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-12 04:37:52 UTC

Description Matthew Booth 2021-08-26 10:18:59 UTC
Created attachment 1817838
master-1 failure to boot

OCP Version at Install Time: 4.9.0-0.ci.test-2021-08-26-081628-ci-op-btm7f0rp-latest
RHCOS Version at Install Time: rhcos-49.84.202108221651-0-openstack.x86_64.qcow2.gz
Platform: OpenStack
Architecture: x86_64

A CI job failed:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/21117/rehearse-21117-pull-ci-openshift-cloud-provider-openstack-master-e2e-openstack-ccm-install/1430804969698627584

We can see from the openstack console logs that it failed because one of the 3 masters failed to boot:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/21117/rehearse-21117-pull-ci-openshift-cloud-provider-openstack-master-e2e-openstack-ccm-install/1430804969698627584/artifacts/e2e-openstack-ccm-install/openstack-gather/artifacts/nodes/console_btm7f0rp-13339-tjj6m-master-1.log

I have attached this log to the BZ in case it is reaped by prow.

Note that the other 2 masters both booted. The boot of master-1 ends in:

You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.
Press Enter for maintenance
(or press Control-D to continue): 

I can see earlier in the logs:

         Starting OSTree Remount OS/ Bind Mounts...
[   91.471884] ostree-remount[1270]: ostree-remount: failed to remount(ro) /sysroot: Device or resource busy
         Mounting Mount etcd as a ramdisk...
[FAILED] Failed to start OSTree Remount OS/ Bind Mounts.

which may be relevant.
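
As a rough way to spot this pattern in other console logs, the sketch below lists jobs that systemd announced while another unit was still starting. It is only a heuristic based on the excerpt above: the function name overlapping_jobs and the SAMPLE list are made up for illustration, and console output is interleaved, so an overlap hints at a possible ordering gap rather than proving one.

```python
# Hypothetical helper: not part of any OpenShift or systemd tooling.
SAMPLE = [
    "         Starting OSTree Remount OS/ Bind Mounts...",
    "[   91.471884] ostree-remount[1270]: ostree-remount: failed to remount(ro) /sysroot: Device or resource busy",
    "         Mounting Mount etcd as a ramdisk...",
    "[FAILED] Failed to start OSTree Remount OS/ Bind Mounts.",
]


def overlapping_jobs(lines, unit_desc="OSTree Remount OS/ Bind Mounts"):
    """Return descriptions of jobs announced while unit_desc was still starting."""
    active = False
    overlapping = []
    for line in lines:
        text = line.strip()
        if text.startswith(f"Starting {unit_desc}"):
            # systemd announced the start job for the unit of interest.
            active = True
        elif unit_desc in text and ("Started" in text or "Failed to start" in text):
            # The start job finished, successfully or not.
            active = False
        elif active and text.startswith(("Starting ", "Mounting ")):
            # Another job was announced before the first one finished.
            overlapping.append(text.rstrip(".").split(" ", 1)[1])
    return overlapping


if __name__ == "__main__":
    for job in overlapping_jobs(SAMPLE):
        print(f"started during OSTree remount: {job}")
```

For the excerpt above, it reports "Mount etcd as a ramdisk" as starting while the OSTree remount was still in progress.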

I assume this is a startup race, but I have not been able to reproduce it. I am reporting it in case it is of interest to you. As long as it doesn't become a regular flake, it is not significantly affecting me. On the face of it, it appears to be similar to:

https://github.com/coreos/fedora-coreos-tracker/issues/746

Comment 1 Benjamin Gilbert 2021-08-26 17:54:11 UTC
Indeed, it seems https://github.com/coreos/fedora-coreos-tracker/issues/746 was fixed by https://github.com/ostreedev/ostree/pull/2387, which is not currently in RHCOS. Thanks for the report.

Comment 2 Benjamin Gilbert 2021-08-26 22:49:16 UTC
Possible duplicate of bug 1992618.

Comment 3 Micah Abbott 2021-08-27 19:16:36 UTC
Setting medium Pri/Sev and targeting for 4.9. It may not be possible to land that patch in RHCOS in time for 4.9 GA, but it seems like something we should pursue for future releases.

@Luca could you see that the ostree change in comment #1 makes its way into RHCOS?

Comment 4 Luca BRUNO 2021-09-08 15:44:55 UTC
Thanks for the report and the attached logs! The symptoms are the same as in the Fedora CoreOS issue, but I think the root cause is different, so I wouldn't start by approaching it with a backport.

My gut feeling is that whatever "Mount etcd as a ramdisk" is, it is missing some After/Before relationship and thus racing with 'ostree-remount.service'.
That would indeed be similar to https://bugzilla.redhat.com/show_bug.cgi?id=1992618 as Benjamin noted.

Matthew, do you maybe know which unit is doing that etcd-ramdisk setup and where it is coming from? I did a quick search around but I couldn't locate it, and I believe it is not part of OCP proper.

Comment 5 Matthew Booth 2021-09-08 16:10:19 UTC
(In reply to Luca BRUNO from comment #4)
> Thanks for the report and the attached logs! The symptoms are the same as
> the Fedora CoreOS but I think the root cause is different, so I wouldn't
> start approaching it with a backport upfront.
> 
> My gut feeling is that whatever "Mount etcd as a ramdisk" is, it is missing
> some After/Before relationship and thus racing with 'ostree-remount.service'.
> That would indeed be similar to
> https://bugzilla.redhat.com/show_bug.cgi?id=1992618 as Benjamin noted.
> 
> Matthew, do you maybe know which unit is doing that etcd-ramdisk setup and
> where it is coming from? I did a quick search around but I couldn't locate
> it, and I believe it is not part of OCP proper.

It's a hack used in CI for when a cloud's underlying storage isn't fast enough for etcd. We're currently using it by default for OpenStack, but we are starting to phase it out as we're no longer running everything on the one cloud that really needs it. It's defined via ignition here: https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/conf/etcd/on-ramfs/ipi-conf-etcd-on-ramfs-commands.sh

Sounds like we need to fix the dependencies of that unit?

Comment 6 Matthew Booth 2021-09-08 16:12:22 UTC
Should have said: if the answer's yes please reassign it to Cloud Compute -> OpenStack Provider with my thanks and we'll fix it.

Comment 7 Luca BRUNO 2021-09-08 16:19:41 UTC
Yes, I think that an 'After=ostree-remount.service var.mount' on that mount unit should fix this race.

Re-assigning so that the CI can be directly fixed.

I'll keep working on the other ticket to see if we can make this less cumbersome for the users.
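
A minimal sketch of what the suggested ordering could look like on the CI mount unit. Only the After= line comes from this comment; the unit name and description are taken from the bug title and the console log. The What= and Type= values and the [Install] section are assumptions for illustration; the real definition is in the ipi-conf-etcd-on-ramfs-commands.sh script linked in comment 5.

```ini
# var-lib-etcd.mount (illustrative sketch, not the actual CI unit)
[Unit]
Description=Mount etcd as a ramdisk
# Wait until /var is mounted and ostree-remount.service has finished,
# so this mount no longer runs concurrently with the remount.
After=ostree-remount.service var.mount

[Mount]
# Where= must match the unit name (var-lib-etcd.mount -> /var/lib/etcd).
Where=/var/lib/etcd
# Assumed values; check the CI script for the real ones.
What=none
Type=tmpfs

[Install]
# Assumed: pull the mount in with the other local filesystems at boot.
WantedBy=local-fs.target
```

After= only orders the units when both are part of the same boot transaction; it does not pull ostree-remount.service in by itself.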

Comment 8 Pierre Prinetti 2021-09-22 15:22:40 UTC
Changing the severity to LOW because it doesn't happen often.

Comment 9 ShiftStack Bugwatcher 2021-11-25 16:12:14 UTC
Removing the Triaged keyword because:

* the QE automation assessment (flag qe_test_coverage) is missing

Comment 16 errata-xmlrpc 2022-03-12 04:37:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

