Bug 1998035
| Summary: | openstack IPI CI: custom var-lib-etcd.mount (ramdisk) unit is racing due to incomplete After/Before order | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Matthew Booth <mbooth> |
| Component: | Cloud Compute | Assignee: | Matthew Booth <mbooth> |
| Cloud Compute sub component: | OpenStack Provider | QA Contact: | Jon Uriarte <juriarte> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | low | | |
| Priority: | low | CC: | bgilbert, dornelas, jligon, m.andre, mfedosin, miabbott, mrussell, nstielau, pprinett |
| Version: | 4.9 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-12 04:37:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Matthew Booth
2021-08-26 10:18:59 UTC
Indeed, it seems https://github.com/coreos/fedora-coreos-tracker/issues/746 was fixed by https://github.com/ostreedev/ostree/pull/2387, which is not currently in RHCOS. Thanks for the report.

Possible duplicate of bug 1992618.

Setting medium Pri/Sev and targeting for 4.9. It may not be possible to land that patch in RHCOS in time for 4.9 GA, but it seems like something we should pursue for future releases. @Luca could you see that the ostree change in comment #1 makes its way into RHCOS?

Thanks for the report and the attached logs! The symptoms are the same as in the Fedora CoreOS issue, but I think the root cause is different, so I wouldn't start approaching it with a backport upfront.

My gut feeling is that whatever "Mount etcd as a ramdisk" is, it is missing some After/Before relationship and thus racing with 'ostree-remount.service'. That would indeed be similar to https://bugzilla.redhat.com/show_bug.cgi?id=1992618 as Benjamin noted.

Matthew, do you maybe know which unit is doing that etcd-ramdisk setup and where it is coming from? I did a quick search around but I couldn't locate it, and I believe it is not part of OCP proper.

(In reply to Luca BRUNO from comment #4)
> Thanks for the report and the attached logs! The symptoms are the same as
> in the Fedora CoreOS issue, but I think the root cause is different, so I
> wouldn't start approaching it with a backport upfront.
>
> My gut feeling is that whatever "Mount etcd as a ramdisk" is, it is missing
> some After/Before relationship and thus racing with 'ostree-remount.service'.
> That would indeed be similar to
> https://bugzilla.redhat.com/show_bug.cgi?id=1992618 as Benjamin noted.
>
> Matthew, do you maybe know which unit is doing that etcd-ramdisk setup and
> where it is coming from? I did a quick search around but I couldn't locate
> it, and I believe it is not part of OCP proper.

It's a hack used in CI for when a cloud's underlying storage isn't fast enough for etcd. We're currently using it by default for OpenStack, but starting to phase it out as we're no longer running everything on the one cloud that really needs it. It's defined via ignition here:

https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/conf/etcd/on-ramfs/ipi-conf-etcd-on-ramfs-commands.sh

Sounds like we need to fix the dependencies of that unit?

Should have said: if the answer's yes, please reassign it to Cloud Compute -> OpenStack Provider with my thanks and we'll fix it.

Yes, I think that an 'After=ostree-remount.service var.mount' on that mount unit should fix this race. Re-assigning so that the CI can be directly fixed. I'll keep working on the other ticket to see if we can make this less cumbersome for the users.

Changing the severity to LOW because it doesn't happen often.

Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
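
For reference, the ordering suggested in the comments ('After=ostree-remount.service var.mount' on the mount unit) might look roughly like the sketch below. This is only an illustration, not the unit shipped by the CI script: the unit name comes from the bug summary and the description string from the comments, while the [Mount] and [Install] settings (tmpfs type, install target) are assumptions made for the sake of a complete example.

```ini
# var-lib-etcd.mount
# Hypothetical sketch; the real unit is generated by the CI step linked above.

[Unit]
Description=Mount etcd as a ramdisk
# Ordering proposed in this bug: do not mount until ostree-remount.service
# has finished and /var itself is mounted, so the two no longer race.
After=ostree-remount.service var.mount

[Mount]
# Assumption: a tmpfs-backed mount; the actual filesystem type and options
# used in CI are not stated in this report.
What=tmpfs
Where=/var/lib/etcd
Type=tmpfs

[Install]
# Assumption: pulled in with the other local filesystems.
WantedBy=local-fs.target
```

Note that After= only orders the units relative to each other; it does not pull ostree-remount.service or var.mount into the boot transaction, which is fine here if both are already started on every boot.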