Bug 1874747 - oVirt VMs fail to pass ignition in some installations
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: slowrie
QA Contact: Lucie Leistnerova
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-02 07:15 UTC by Gal Zaidman
Modified: 2020-10-27 16:37 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:36:55 UTC
Target Upstream Version:



Links
System ID Private Priority Status Summary Last Updated
Github oVirt terraform-provider-ovirt pull 227 0 None closed Add auto_start flag to ovirt_vm resource 2020-12-01 08:18:47 UTC
Github openshift installer pull 4168 0 None closed Bug 1874747: ovirt: dont start the temp VM before template creation 2020-12-01 08:18:47 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:37:10 UTC

Description Gal Zaidman 2020-09-02 07:15:48 UTC
Description of problem:

On oVirt CI we started seeing VMs that fail during Ignition. The failure is very flaky and happens roughly once every 10 runs. On those runs the VMs do not come up properly.
We managed to get the journal from one of the VMs:

http://pastebin.test.redhat.com/897899

In the journal we see the following error:

CRITICAL : files: ensureUsers: op(1): [failed]   creating or modifying user "core": exit status 1: Cmd: "usermod" "--root" "/sysroot" "--comment" "CoreOS Admin" "--groups" "adm,sudo,systemd-journal,wheel" "core" Stdout: "" Stderr: "usermod: existing lock file /etc/passwd.lock without a PID\nusermod: cannot lock /etc/passwd; try again later.\n"

We don't understand why this is happening, or why it happens only some of the time.

This is an example of a run that hit this issue:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11520/rehearse-11520-pull-ci-openshift-cluster-api-provider-ovirt-master-e2e-ovirt/1301044050652041216

How reproducible:
Run an OCP installation on oVirt and cross your fingers

Comment 1 Gal Zaidman 2020-09-02 07:27:40 UTC
Colin, I know you took a look at it. Any thoughts?

Comment 2 Jan Zmeskal 2020-09-02 07:34:32 UTC
Gal, could this be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1867043 ?

Comment 3 Gal Zaidman 2020-09-02 09:48:51 UTC
(In reply to Jan Zmeskal from comment #2)
> Gal, could this be the same issue as
> https://bugzilla.redhat.com/show_bug.cgi?id=1867043 ?

I don't think it is. In our case the VM does start (its status is UP), but Ignition fails.

Comment 4 David Eads 2020-09-02 11:52:11 UTC
Updating priority to reflect that oVirt jobs are failing to start roughly 40% of the time.

Comment 5 Micah Abbott 2020-09-02 16:39:46 UTC
@slowrie could you investigate further?

Comment 6 Colin Walters 2020-09-09 17:58:19 UTC
It's weird that this is seemingly only impacting oVirt - at least, I am not sure we've seen this often on other platforms, though to be fair our telemetry here is really bad outside of qemu.

The locking code in shadow-utils is terrifying; I don't understand why it's not using a standard flock()/fcntl(F_SETLK) and is doing a thing with manually writing pids to regular files.
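
To make the contrast above concrete, here is a minimal sketch of the two locking styles. It is not the shadow-utils code: `pidfile_lock`, `flock_lock`, and `demo` are hypothetical names, and the PID-file logic is only a simplified approximation of "manually writing pids to regular files". It assumes a Unix system, and it assumes that an unclean shutdown left an empty lock file behind, which would match the "without a PID" message in the journal but is not confirmed in this report.

```python
import fcntl
import os
import tempfile


def pidfile_lock(target):
    """Take target + '.lock' using a PID written into a regular file."""
    lock = target + ".lock"
    try:
        # O_EXCL makes creation atomic: it fails if the lock file exists.
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        with open(lock) as f:
            content = f.read().strip()
        if not content:
            # No PID recorded, so there is no owner to probe: give up.
            raise RuntimeError(f"existing lock file {lock} without a PID")
        pid = int(content)
        try:
            os.kill(pid, 0)  # signal 0 only checks that the PID exists
        except ProcessLookupError:
            os.unlink(lock)  # owner is gone, so the lock is stale
            return pidfile_lock(target)
        except PermissionError:
            pass  # the process exists but belongs to another user
        raise RuntimeError(f"lock {lock} already used by PID {pid}")
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))


def flock_lock(target):
    """Take an advisory kernel lock; it vanishes when the holder exits."""
    fd = os.open(target + ".lock", os.O_CREAT | os.O_RDWR, 0o600)
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return fd


def demo():
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "passwd")
        open(target, "w").close()
        # Simulate an empty lock file left behind by an unclean shutdown.
        open(target + ".lock", "w").close()
        try:
            pidfile_lock(target)
            pid_result = "acquired"
        except RuntimeError as err:
            pid_result = str(err)
        fd = flock_lock(target)  # the leftover file does not block flock()
        os.close(fd)
        return pid_result, "acquired"
```

In this sketch the PID-file variant cannot recover from an empty lock file, because there is no owner it can check, while the kernel-held lock has no on-disk state that could go stale.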

We don't have anything special around oVirt that's running in the initramfs that I can think of today.  (And to validate, oVirt isn't modifying the initramfs right?)

Wait...actually hmm, I see that oVirt is using the OpenStack image.  I wonder if there's something strange going on here in Ignition due to the way we fetch userdata via both cdrom and the metadata service.  Which does oVirt use?

Debugging this "bottom up" is probably going to involve something like injecting a systemtap kernel module early on in the initramfs that is logging which processes write files or so.

Comment 7 Colin Walters 2020-09-09 18:01:29 UTC
Possible avenues of investigation I can think of:

- systemd-sysusers being run in the initramfs?
- sssd (I am sure that the sssd nsswitch plugin is unaware of useradd --root, some weird interaction between that and the initramfs?)
  See also https://pagure.io/SSSD/sssd/pull-request/3959

Comment 8 Michal Skrivanek 2020-09-10 06:26:09 UTC
(In reply to Colin Walters from comment #6)
> We don't have anything special around oVirt that's running in the initramfs
> that I can think of today.  (And to validate, oVirt isn't modifying the
> initramfs right?)

correct

> Wait...actually hmm, I see that oVirt is using the OpenStack image.  I
> wonder if there's something strange going on here in Ignition due to the way
> we fetch userdata via both cdrom and the metadata service.  Which does oVirt
> use?

only cdrom

Comment 9 Michal Skrivanek 2020-09-10 08:13:56 UTC
Update - we're looking into the terraform logic around the template creation. It seems we're booting the instance and then tearing it down forcefully (dirty power off), which seems like a bad idea.

https://github.com/oVirt/terraform-provider-ovirt/blob/f2b0157c5bb5b29495acb1a2fac902e9005a9a4d/ovirt/resource_ovirt_vm_template.go#L340-L366

Comment 10 Evgeny Slutsky 2020-09-13 12:28:16 UTC
We want to change the template creation flow so that the VM won't be started before the template is sealed.
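
The change in flow can be pictured with a small toy model. This is an illustrative sketch only, not the terraform provider or installer code: `Disk`, `build_template`, and `clone_boots_cleanly` are hypothetical names, and the idea that the interrupted first boot leaves `/etc/passwd.lock` on the template disk is one plausible reading of this thread rather than a confirmed root cause. The `auto_start` parameter borrows its name from the linked terraform-provider-ovirt pull request; its exact semantics there are assumed. The model is also deterministic, whereas the real failure was flaky.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Disk:
    # Files left on the image; a hypothetical stand-in for real disk state.
    leftovers: Set[str] = field(default_factory=set)


def build_template(auto_start: bool) -> Disk:
    """Return the disk state that would be captured in the template."""
    disk = Disk()  # pristine image attached to the temporary VM
    if auto_start:
        # Old flow: the temporary VM boots, and first-boot processes may be
        # mid-write when the power is cut (assumed leftover, for illustration).
        disk.leftovers.add("/etc/passwd.lock")
        # Forced power-off follows: nothing gets a chance to clean up.
    # New flow (auto_start=False): the VM is never started, so the disk
    # is still pristine when the template is created from it.
    return disk


def clone_boots_cleanly(template: Disk) -> bool:
    # Every VM cloned from the template inherits whatever is on its disk.
    return "/etc/passwd.lock" not in template.leftovers
```

With `auto_start=False` the template is taken from a disk that has never booted, so clones in this model start from the same state as the original image.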

Comment 12 Micah Abbott 2020-09-19 15:14:26 UTC
@Lucie would your team be able to verify this BZ?

Comment 13 Gal Zaidman 2020-09-21 06:19:18 UTC
I think we can verify it with CI, since we don't hit it anymore.
It accounted for around 50% of the failures.

Comment 14 Lucie Leistnerova 2020-09-21 12:43:04 UTC
According to comment 13, verified in OCP 4.6.0-0.nightly-2020-09-21-030155 and RHV 4.4.0.3-1.el8.

Comment 17 errata-xmlrpc 2020-10-27 16:36:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

