Bug 1874747

Summary: oVirt VMs fail to pass ignition in some installations
Product: OpenShift Container Platform
Component: RHCOS
Version: 4.6
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Status: CLOSED ERRATA
Reporter: Gal Zaidman <gzaidman>
Assignee: slowrie
QA Contact: Lucie Leistnerova <lleistne>
CC: bbreard, bgilbert, deads, eslutsky, imcleod, jligon, jzmeskal, lleistne, miabbott, michal.skrivanek, nstielau, walters
Last Closed: 2020-10-27 16:36:55 UTC
Type: Bug

Description Gal Zaidman 2020-09-02 07:15:48 UTC
Description of problem:

In oVirt CI we have started seeing VMs that fail during Ignition. This is very flaky, happening roughly once every 10 runs; on those runs the VMs do not come up.
We managed to get the journal from the VM:

http://pastebin.test.redhat.com/897899

In the journal we see the following error:

CRITICAL : files: ensureUsers: op(1): [failed]   creating or modifying user "core": exit status 1: Cmd: "usermod" "--root" "/sysroot" "--comment" "CoreOS Admin" "--groups" "adm,sudo,systemd-journal,wheel" "core" Stdout: "" Stderr: "usermod: existing lock file /etc/passwd.lock without a PID\nusermod: cannot lock /etc/passwd; try again later.\n"

We don't understand why this is happening, nor why it happens only some of the time.

This is an example of a run that hit this issue:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/11520/rehearse-11520-pull-ci-openshift-cluster-api-provider-ovirt-master-e2e-ovirt/1301044050652041216

How reproducible:
Run an OCP installation on oVirt and cross your fingers
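For context, the `usermod` failure above comes from the PID-file lock protocol shadow-utils uses around /etc/passwd. The Python sketch below illustrates that protocol as we understand it from the error message (it is an illustration, not shadow-utils' actual code): a locker writes its own PID to a temp file and hard-links it to the lock path; if the lock file already exists but holds no valid PID, locking fails with exactly the error seen in the journal.

```python
import os

def read_lock_pid(lock_path):
    """Return the PID stored in a lock file; error out if it holds none."""
    with open(lock_path) as f:
        data = f.read().strip()
    if not data.isdigit():
        # this is the failure mode from the journal above
        raise RuntimeError(f"existing lock file {lock_path} without a PID")
    return int(data)

def acquire_lock(lock_path):
    """Simplified sketch of a link()-based PID lock (not shadow-utils' real code)."""
    my_pid = os.getpid()
    tmp = f"{lock_path}.{my_pid}"
    with open(tmp, "w") as f:
        f.write(str(my_pid))
    try:
        os.link(tmp, lock_path)            # atomic: fails if the lock exists
        return True
    except FileExistsError:
        holder = read_lock_pid(lock_path)  # raises on a PID-less lock file
        try:
            os.kill(holder, 0)             # is the holder still alive?
            return False                   # yes: genuinely locked
        except ProcessLookupError:
            os.unlink(lock_path)           # stale lock from a dead process
            os.link(tmp, lock_path)
            return True
    finally:
        os.unlink(tmp)
```

Note that a lock file captured on disk in a bad state (for example, empty) cannot be recovered from by this protocol: there is no PID to probe, so every later locker fails.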

Comment 1 Gal Zaidman 2020-09-02 07:27:40 UTC
Colin, I know you took a look at this; any thoughts?

Comment 2 Jan Zmeskal 2020-09-02 07:34:32 UTC
Gal, could this be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1867043 ?

Comment 3 Gal Zaidman 2020-09-02 09:48:51 UTC
(In reply to Jan Zmeskal from comment #2)
> Gal, could this be the same issue as
> https://bugzilla.redhat.com/show_bug.cgi?id=1867043 ?

I don't think it is. In our case the VM starts (it is UP), but Ignition then fails.

Comment 4 David Eads 2020-09-02 11:52:11 UTC
Updating priority to reflect that oVirt jobs are failing to start roughly 40% of the time.

Comment 5 Micah Abbott 2020-09-02 16:39:46 UTC
@slowrie could you investigate further?

Comment 6 Colin Walters 2020-09-09 17:58:19 UTC
It's weird that this is seemingly only impacting oVirt - at least, I am not sure we've seen this often on other platforms, though to be fair our telemetry here is really bad outside of qemu.

The locking code in shadow-utils is terrifying; I don't understand why it's not using a standard flock()/fcntl(F_SETLK) and is doing a thing with manually writing pids to regular files.

We don't have anything special around oVirt that's running in the initramfs that I can think of today.  (And to validate, oVirt isn't modifying the initramfs right?)

Wait...actually hmm, I see that oVirt is using the OpenStack image.  I wonder if there's something strange going on here in Ignition due to the way we fetch userdata via both cdrom and the metadata service.  Which does oVirt use?

Debugging this "bottom up" is probably going to involve something like injecting a systemtap kernel module early on in the initramfs that is logging which processes write files or so.
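By contrast, a kernel-managed lock needs no stale-lock cleanup at all: the kernel releases it when the holding process exits or the fd closes, no matter how the process dies. A minimal Python sketch of the flock() approach mentioned above (an illustration, not code from shadow-utils or Ignition):

```python
import fcntl
import os

def kernel_lock(path):
    """Try to take an exclusive advisory lock on `path`.

    Returns the fd on success (the lock lives as long as the fd), or None
    if another holder has it.  The kernel drops the lock automatically when
    the fd is closed or the process dies, so a crash can never leave a
    stale lock on disk the way a PID lock file can.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except BlockingIOError:
        os.close(fd)
        return None
```

With this scheme the "existing lock file without a PID" failure mode simply cannot occur, because no lock state survives the death of its holder.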

Comment 7 Colin Walters 2020-09-09 18:01:29 UTC
Possible avenues of investigation I can think of:

- systemd-sysusers being run in the initramfs?
- sssd (I am sure that the sssd nsswitch plugin is unaware of useradd --root, some weird interaction between that and the initramfs?)
  See also https://pagure.io/SSSD/sssd/pull-request/3959

Comment 8 Michal Skrivanek 2020-09-10 06:26:09 UTC
(In reply to Colin Walters from comment #6)
> We don't have anything special around oVirt that's running in the initramfs
> that I can think of today.  (And to validate, oVirt isn't modifying the
> initramfs right?)

correct

> Wait...actually hmm, I see that oVirt is using the OpenStack image.  I
> wonder if there's something strange going on here in Ignition due to the way
> we fetch userdata via both cdrom and the metadata service.  Which does oVirt
> use?

only cdrom

Comment 9 Michal Skrivanek 2020-09-10 08:13:56 UTC
Update: we're looking into the terraform logic around template creation. It seems we boot the instance and then tear it down forcefully (a dirty power-off), which seems like a bad idea:

https://github.com/oVirt/terraform-provider-ovirt/blob/f2b0157c5bb5b29495acb1a2fac902e9005a9a4d/ovirt/resource_ovirt_vm_template.go#L340-L366

Comment 10 Evgeny Slutsky 2020-09-13 12:28:16 UTC
We want to change the template creation flow so the VM won't be started before the template is sealed.
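The intended ordering can be sketched as follows; the `client` methods here are hypothetical stand-ins for the provider's oVirt calls, not the real terraform-provider-ovirt API:

```python
def create_template(client, vm_id, template_name):
    """Sketch of the intended flow: the source VM must be cleanly down and
    sealed before the template snapshot is taken."""
    if client.vm_status(vm_id) != "down":
        # Graceful shutdown, never a forced power-off: a dirty power-off can
        # freeze half-written state (such as /etc/passwd.lock) into the
        # template, which every VM cloned from it then inherits.
        client.shutdown_vm(vm_id)
        client.wait_for_status(vm_id, "down")
    client.seal_vm(vm_id)  # strip machine-specific state before templating
    return client.create_template_from_vm(vm_id, template_name)
```

This ordering removes the race entirely: nothing can be mid-write when the template snapshot is captured.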

Comment 12 Micah Abbott 2020-09-19 15:14:26 UTC
@Lucie would your team be able to verify this BZ?

Comment 13 Gal Zaidman 2020-09-21 06:19:18 UTC
I think we can verify it with CI, since we don't hit it anymore. It accounted for around 50% of the failures.

Comment 14 Lucie Leistnerova 2020-09-21 12:43:04 UTC
According to Comment 13, verified in ocp 4.6.0-0.nightly-2020-09-21-030155 and rhv 4.4.0.3-1.el8

Comment 17 errata-xmlrpc 2020-10-27 16:36:55 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196