Description of problem:
On oVirt CI we started seeing VMs that fail during Ignition. The failure is very flaky and happens roughly once every 10 runs. On those runs the VMs are not starting.

We managed to get the journal from the VM:
http://pastebin.test.redhat.com/897899

In the journal we see the following error:

CRITICAL : files: ensureUsers: op(1): [failed] creating or modifying user "core": exit status 1: Cmd: "usermod" "--root" "/sysroot" "--comment" "CoreOS Admin" "--groups" "adm,sudo,systemd-journal,wheel" "core" Stdout: "" Stderr: "usermod: existing lock file /etc/passwd.lock without a PID\nusermod: cannot lock /etc/passwd; try again later.\n"

We don't understand why this is happening, or why it happens only part of the time.

This is an example of a run that hit this issue:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11520/rehearse-11520-pull-ci-openshift-cluster-api-provider-ovirt-master-e2e-ovirt/1301044050652041216

How reproducible:
Run an OCP installation on oVirt and cross your fingers.
Colin, I know you took a look at this. Any thoughts?
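For triaging CI runs, the failure signature quoted above can be matched mechanically. The following is only a minimal sketch: the function name `find_lock_failures` and the sample string are illustrative assumptions, and the pattern is derived solely from the stderr text in this report.

```python
import re

# Pattern built from the usermod stderr quoted in this report.
LOCK_ERROR = re.compile(
    r"usermod: existing lock file (?P<lock>\S+) without a PID"
)


def find_lock_failures(journal_text):
    """Return the lock file paths named in usermod lock errors (hypothetical helper)."""
    return [m.group("lock") for m in LOCK_ERROR.finditer(journal_text)]


# Sample line shaped like the journal excerpt above.
sample = (
    'Stderr: "usermod: existing lock file /etc/passwd.lock without a PID\\n'
    'usermod: cannot lock /etc/passwd; try again later.\\n"'
)
print(find_lock_failures(sample))
```

Running this over the journal of each failed run would show whether all of the flaky failures share this one signature.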
Gal, could this be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1867043 ?
(In reply to Jan Zmeskal from comment #2)
> Gal, could this be the same issue as
> https://bugzilla.redhat.com/show_bug.cgi?id=1867043 ?

I don't think it is. In our case the VM starts (it is UP), but Ignition fails.
Updating priority to reflect that oVirt jobs are failing to start roughly 40% of the time.
@slowrie could you investigate further?
It's weird that this is seemingly only impacting oVirt - at least, I am not sure we've seen this often on other platforms, though to be fair our telemetry here is really bad outside of qemu.

The locking code in shadow-utils is terrifying; I don't understand why it's not using a standard flock()/fcntl(F_SETLK) and is doing a thing with manually writing pids to regular files.

We don't have anything special around oVirt that's running in the initramfs that I can think of today. (And to validate, oVirt isn't modifying the initramfs right?)

Wait...actually hmm, I see that oVirt is using the OpenStack image. I wonder if there's something strange going on here in Ignition due to the way we fetch userdata via both cdrom and the metadata service. Which does oVirt use?

Debugging this "bottom up" is probably going to involve something like injecting a systemtap kernel module early on in the initramfs that is logging which processes write files or so.
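To make the concern about PID-in-a-file locking concrete, here is a simplified sketch of how such a lock can be classified. It is not the shadow-utils implementation; the function name `check_pid_lock` and the four state names are assumptions made for illustration. The point it shows is that an empty lock file carries no owner to check, which corresponds to the "without a PID" message in the report.

```python
import os
import tempfile


def check_pid_lock(lock_path):
    """Classify a PID-style lock file as 'free', 'held', 'stale' or 'no-pid'."""
    try:
        with open(lock_path) as f:
            content = f.read().strip()
    except FileNotFoundError:
        return "free"
    if not content.isdecimal() or int(content) == 0:
        # No usable owner PID: nothing tells us whether removal is safe.
        return "no-pid"
    try:
        os.kill(int(content), 0)  # signal 0 only probes for existence
    except ProcessLookupError:
        return "stale"  # owner is gone, the lock could be reclaimed
    except PermissionError:
        return "held"  # process exists but belongs to another user
    return "held"


with tempfile.TemporaryDirectory() as d:
    lock = os.path.join(d, "passwd.lock")
    print(check_pid_lock(lock))  # no lock file yet
    open(lock, "w").close()  # empty file, e.g. a write that never completed
    print(check_pid_lock(lock))
    with open(lock, "w") as f:
        f.write(str(os.getpid()))
    print(check_pid_lock(lock))
```

With flock() or fcntl() locks, by contrast, the kernel releases the lock when the holding process exits, so no lock state on disk has to be interpreted after a crash.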
Possible avenues of investigation I can think of:

- systemd-sysusers being run in the initramfs?
- sssd (I am sure that the sssd nsswitch plugin is unaware of useradd --root, some weird interaction between that and the initramfs?) See also https://pagure.io/SSSD/sssd/pull-request/3959
(In reply to Colin Walters from comment #6)
> We don't have anything special around oVirt that's running in the initramfs
> that I can think of today. (And to validate, oVirt isn't modifying the
> initramfs right?)

Correct.

> Wait...actually hmm, I see that oVirt is using the OpenStack image. I
> wonder if there's something strange going on here in Ignition due to the way
> we fetch userdata via both cdrom and the metadata service. Which does oVirt
> use?

Only cdrom.
Update: we're looking into the Terraform logic around template creation. It seems we're booting the instance and then tearing it down forcefully (a dirty power-off), which seems like a bad idea:
https://github.com/oVirt/terraform-provider-ovirt/blob/f2b0157c5bb5b29495acb1a2fac902e9005a9a4d/ovirt/resource_ovirt_vm_template.go#L340-L366
We want to change the template creation flow so that the VM won't be started before the template is sealed.
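One way to test whether the forced power-off matters would be to inspect a template's root filesystem for leftover account lock files. The sketch below assumes the disk is already mounted at some path, and assumes the usual shadow-utils lock names; the thread itself only confirms /etc/passwd.lock, and it does not confirm that the power-off is what leaves the file behind. The helper name `leftover_locks` is hypothetical.

```python
import os
import tempfile

# Assumed lock file names; only passwd.lock appears in this report.
LOCK_FILES = ("passwd.lock", "shadow.lock", "group.lock", "gshadow.lock")


def leftover_locks(root):
    """List account lock files present under <root>/etc (hypothetical helper)."""
    etc = os.path.join(root, "etc")
    return [
        "/etc/" + name
        for name in LOCK_FILES
        if os.path.exists(os.path.join(etc, name))
    ]


# Demo against a throwaway directory standing in for a mounted template root.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "etc"))
    open(os.path.join(root, "etc", "passwd.lock"), "w").close()
    print(leftover_locks(root))
```

An empty result on templates that were never booted, and a non-empty one on templates built with the power-off step, would support the theory.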
@Lucie would your team be able to verify this BZ?
I think we can verify it with CI, since we don't hit it anymore. It accounted for around 50% of the failures.
According to comment 13, verified in OCP 4.6.0-0.nightly-2020-09-21-030155 and RHV 4.4.0.3-1.el8.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196