Bug 1874747
Summary: | oVirt VMs fail to pass ignition in some installations | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Gal Zaidman <gzaidman> |
Component: | RHCOS | Assignee: | slowrie |
Status: | CLOSED ERRATA | QA Contact: | Lucie Leistnerova <lleistne> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 4.6 | CC: | bbreard, bgilbert, deads, eslutsky, imcleod, jligon, jzmeskal, lleistne, miabbott, michal.skrivanek, nstielau, walters |
Target Milestone: | --- | ||
Target Release: | 4.6.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-10-27 16:36:55 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Gal Zaidman
2020-09-02 07:15:48 UTC
Colin, I know you took a look at it. Any thoughts?

Gal, could this be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1867043 ?

(In reply to Jan Zmeskal from comment #2)
> Gal, could this be the same issue as
> https://bugzilla.redhat.com/show_bug.cgi?id=1867043 ?

I don't think it is. In our case the VM starts, meaning it is UP, but Ignition fails.

Updating priority to reflect that oVirt jobs are failing to start 40%ish of the time.

@slowrie could you investigate further?

It's weird that this is seemingly only impacting oVirt. At least, I am not sure we've seen this often on other platforms, though to be fair our telemetry here is really bad outside of qemu.

The locking code in shadow-utils is terrifying; I don't understand why it's not using a standard flock()/fcntl(F_SETLK) and is instead manually writing PIDs to regular files.

We don't have anything special around oVirt that's running in the initramfs that I can think of today. (And to validate, oVirt isn't modifying the initramfs, right?)

Wait...actually hmm, I see that oVirt is using the OpenStack image. I wonder if there's something strange going on here in Ignition due to the way we fetch userdata via both cdrom and the metadata service. Which does oVirt use?

Debugging this "bottom up" is probably going to involve something like injecting a systemtap kernel module early on in the initramfs that logs which processes write files.

Possible avenues of investigation I can think of:
- systemd-sysusers being run in the initramfs?
- sssd (I am sure that the sssd nsswitch plugin is unaware of useradd --root; some weird interaction between that and the initramfs?)

See also https://pagure.io/SSSD/sssd/pull-request/3959

(In reply to Colin Walters from comment #6)
> We don't have anything special around oVirt that's running in the initramfs
> that I can think of today. (And to validate, oVirt isn't modifying the
> initramfs right?)

correct

> Wait...actually hmm, I see that oVirt is using the OpenStack image. I
> wonder if there's something strange going on here in Ignition due to the way
> we fetch userdata via both cdrom and the metadata service. Which does oVirt
> use?

only cdrom

Update - we're looking into the terraform logic around the template creation. It seems we're booting the instance and tearing it down forcefully (dirty power off), which seems like a bad idea:
https://github.com/oVirt/terraform-provider-ovirt/blob/f2b0157c5bb5b29495acb1a2fac902e9005a9a4d/ovirt/resource_ovirt_vm_template.go#L340-L366
We want to change the template creation flow so the VM won't be started before the template is sealed.

@Lucie would your team be able to verify this BZ?

I think we can verify it with CI since we don't hit it anymore. It was around 50% of the failures.

According to Comment 13, verified in ocp 4.6.0-0.nightly-2020-09-21-030155 and rhv 4.4.0.3-1.el8

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
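
For readers unfamiliar with the locking distinction raised in the discussion above (flock()/fcntl(F_SETLK) versus writing PIDs to regular files), here is a minimal Python sketch of a kernel-backed advisory lock. It is only an illustration and is not the shadow-utils code; the helper names `try_lock` and `unlock` and the lock file path are hypothetical. The relevant property is that the kernel drops a flock() lock when the holding descriptor is closed or the process goes away, so nothing is left on disk to go stale, whereas a PID written to a regular file persists until something removes it.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive, non-blocking flock() on path.

    Returns the open file descriptor on success, or None if another
    open file description already holds the lock.
    (Hypothetical helper, for illustration only.)
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock; do not wait for it.
        os.close(fd)
        return None
    return fd


def unlock(fd):
    """Release the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)


if __name__ == "__main__":
    # Hypothetical lock file in a scratch directory.
    lock_path = os.path.join(tempfile.mkdtemp(), "demo.lock")

    holder = try_lock(lock_path)
    print("first attempt acquired:", holder is not None)

    # A second, independent open of the same file conflicts with the first.
    print("second attempt acquired:", try_lock(lock_path) is not None)

    unlock(holder)

    # Once released, the lock can be taken again; the file's mere
    # existence on disk does not block anyone.
    again = try_lock(lock_path)
    print("after release acquired:", again is not None)
    unlock(again)
```

The lock file itself remains on disk after `unlock`, but its presence carries no meaning; only the kernel's lock state does.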