Bug 2080504

Summary:

SNO stuck in "ignition" loop during install

Product:

OpenShift Container Platform

Reporter:

Alex Krzos <akrzos>

Component:

RHCOS

Assignee:

RHCOS Bug Triage <rhcos-triage>

Status:

CLOSED INSUFFICIENT_DATA

QA Contact:

Michael Nguyen <mnguyen>

Severity:

medium

Priority:

unspecified

Version:

4.10

CC:

bgilbert, dornelas, imiller, jligon, kzak, miabbott, mrussell, nstielau, travier, walters

Hardware:

Unspecified

OS:

Unspecified

Last Closed:

2022-08-11 13:42:08 UTC

Type:

Bug

Attachments:

Description	Flags
sno00131 console looping	none

Description Alex Krzos 2022-04-29 20:32:46 UTC

Created attachment 1876079
sno00131 console looping

OCP Version at Install Time: 4.10.8
RHCOS Version at Install Time: 
Platform: libvirt
Architecture: x86_64


We are installing 2000 SNOs via ZTP/Assisted-installer hosted in an ACM Hub cluster via gitops. The use-case is for Telco managing many SNO clusters.

A very small percentage of clusters get stuck in an Ignition loop during install. The count varies between 1 and 3 SNOs per 1780 installs, which makes this failure condition exceedingly rare.

We can provide access to the SNOs displaying the failure to facilitate debugging.
Please see the attached screen recording.

The two SNOs with captured screen recordings behave differently when viewing the console via virsh:

[root@e25-h05-000-r640 ~]# virsh console sno00131
Connected to domain sno00131
Escape character is ^]

[root@e25-h05-000-r640 ~]#

Nothing shows for sno00131 ^^^


[root@e30-h21-000-r640 ~]# virsh console sno00732
Connected to domain sno00732
Escape character is ^]

Red Hat Enterprise Linux CoreOS 410.84.202203290245-0 (Ootpa) 4.10
Ignition: ran on 2022/04/29 00:37:42 UTC (at least 1 boot ago)
Ignition: user-provided config was applied
SSH host key: SHA256:QA+co8Wrtol46ImZb9e/MDk7ofFsAIOZdE9FLGFSqEY (ECDSA)
SSH host key: SHA256:4PY3PVfq2LxReH8O945hoMtLqmPOIjEI3ux5k8C+nbU (ED25519)
SSH host key: SHA256:4bqIG9kBpJ5REWYDXxf6l9cpPnoigDYQclN2nCvhQ6c (RSA)
enp1s0:
sno00732 login:

A login prompt is at least available for sno00732.

Comment 2 Colin Walters 2022-04-29 21:15:07 UTC

These are clearly special SNOflakes.

OK so I see in there "systemd: Started emergency shell" - must be from something else failing and triggering OnFailure=emergency.target.

Ahh...scrolling a bit up frame by frame, I see the fatal error:

```
ostree-cmdline: mount: /proc/cmdline: mount(2) system call failed: No such file or directory
```

That's...weird.  `man 2 mount` says:
`ENOENT A pathname was empty or had a nonexistent component.`
But...that seems unlikely.

It might be some sort of kernel race.  

OK this is from ostree-prepare-root.c:
https://github.com/ostreedev/ostree/blob/main/src/switchroot/ostree-prepare-root.c

but the error message looks like it's from util-linux:
https://github.com/util-linux/util-linux/blob/24e896c1400c2328b8bdffde674a3d74429acdf1/libmount/src/context_mount.c#L2004

Comment 4 Colin Walters 2022-04-29 21:27:00 UTC

Confusing though...ostree-prepare-root does not use libmount from util-linux, only the plain libc mount().
I don't quite see how we'd be able to output that string...

Ah wait I see this is the Live ISO, where we have
https://github.com/coreos/fedora-coreos-config/blob/8d4d1ec1989d98a5a3d0c8e95215ae7c31688e4e/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/live-generator#L49
injected which does
https://github.com/coreos/fedora-coreos-config/blob/8d4d1ec1989d98a5a3d0c8e95215ae7c31688e4e/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/ostree-cmdline.sh#L15

which definitely uses libmount via /bin/mount.

OK.  But...we should clearly have /proc/cmdline at this point.

Comment 5 Colin Walters 2022-04-29 21:28:22 UTC

Adding Karel in the hope he has seen "unusual ENOENT from mount" races...

Comment 6 Alex Krzos 2022-04-30 00:20:38 UTC

Interestingly enough, if I `virsh destroy` the SNOs, the hub cluster operators start them back up (expected), and they then proceed further through the install: they get discovered and then install.

Comment 9 Timothée Ravier 2022-05-12 13:57:13 UTC

Generally we need the full console log from the failing node to be able to diagnose.

Comment 10 Benjamin Gilbert 2022-06-03 04:17:41 UTC

The repeated `fetched {user,base} config from "system"` messages are unexpected.  Ignition doesn't have any loops that should cause that message to be printed repeatedly.
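
Once a full console log has been captured to a file, the two symptoms discussed so far (the repeated `fetched {user,base} config from "system"` messages and the `ostree-cmdline` mount failure) could be counted mechanically. The following is only a sketch: the function name `summarize_console_log` and the result keys are hypothetical, and the match strings are taken from the messages quoted in this bug, not from the Ignition or ostree sources.

```python
import re
from collections import Counter

# Fatal line as quoted in comment 2 (prefix only, without the errno text).
FATAL_MOUNT = "ostree-cmdline: mount: /proc/cmdline: mount(2) system call failed"
# Message shape as quoted in comment 10; {user,base} expands to two variants.
FETCHED = re.compile(r'fetched (user|base) config from "system"')


def summarize_console_log(text):
    """Count the symptom lines from this bug in a captured console log.

    Returns a dict with (hypothetical) keys 'fetched_user', 'fetched_base'
    and 'mount_enoent'; a key is absent if its message never appears.
    """
    counts = Counter()
    for line in text.splitlines():
        match = FETCHED.search(line)
        if match:
            counts["fetched_" + match.group(1)] += 1
        if FATAL_MOUNT in line:
            counts["mount_enoent"] += 1
    return dict(counts)


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py <console-log-file>
    with open(sys.argv[1], errors="replace") as handle:
        print(summarize_console_log(handle.read()))
```

A `fetched_*` count above one within a single boot would match the "unexpected" repetition described above, though whether the log spans one boot or several would still need to be checked by hand.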

Comment 11 Micah Abbott 2022-06-03 13:54:30 UTC

@alex see comment #9; we need a complete console log showing the failing node booting and any error conditions observed during boot

Comment 12 Alex Krzos 2022-06-07 13:38:23 UTC

(In reply to Micah Abbott from comment #11)
> @alex see comment #9; we need a complete console log showing the failing
> node booting and any error conditions observed during boot

Hi, is there a method to obtain the full console log from a machine stuck in this condition when I cannot access it either from the console (virsh console or VNC) or via SSH? Would I need to enable something in the image before install?

Comment 13 Benjamin Gilbert 2022-06-07 13:51:04 UTC

libvirt can log the serial port output to a file, which will produce the console logs if `console=ttyS0,115200` is also specified on the kernel command line.  The latter is not the default if you're booting from the live ISO (though it currently is the default for the installed system).  You can manually add the argument in the bootloader, or use `coreos-installer iso kargs modify -a console=ttyS0,115200 <iso-file>` to modify the ISO to add the argument.
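
As a rough illustration of the two steps above, the sketch below adds a `<log>` element to the first serial device of a libvirt domain XML document and builds the `coreos-installer` command line quoted in this comment. The function names and the log file path are hypothetical, and this assumes the domain XML is obtained and re-applied separately (for example via `virsh dumpxml` and `virsh define`); it is not a description of how the ZTP tooling manages its VMs.

```python
import xml.etree.ElementTree as ET


def add_serial_log(domain_xml, log_path):
    """Return domain XML whose first serial device logs its output to log_path.

    If the domain has no serial device, a pty-backed one on port 0 is added
    (an assumption; adjust to match the real domain definition).
    """
    root = ET.fromstring(domain_xml)
    devices = root.find("devices")
    if devices is None:
        devices = ET.SubElement(root, "devices")
    serial = devices.find("serial")
    if serial is None:
        serial = ET.SubElement(devices, "serial", {"type": "pty"})
        ET.SubElement(serial, "target", {"port": "0"})
    log = serial.find("log")
    if log is None:
        log = ET.SubElement(serial, "log")
    # libvirt's character-device <log> element takes 'file' and 'append'.
    log.set("file", log_path)
    log.set("append", "on")
    return ET.tostring(root, encoding="unicode")


def iso_kargs_command(iso_path):
    """Build the argv for adding the serial console karg to a live ISO."""
    return ["coreos-installer", "iso", "kargs", "modify",
            "-a", "console=ttyS0,115200", iso_path]


if __name__ == "__main__":
    example = ("<domain><name>sno00131</name><devices>"
               "<serial type='pty'><target port='0'/></serial>"
               "</devices></domain>")
    # Hypothetical log location; any path writable by the hypervisor would do.
    print(add_serial_log(example, "/var/log/libvirt/qemu/sno00131-serial.log"))
    print(" ".join(iso_kargs_command("rhcos-live.iso")))
```

With `append` set to `on`, output from successive boots of the same VM would accumulate in one file, which seems useful here since the failing nodes get restarted by the hub cluster.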

Comment 14 RHCOS Bug Triage 2022-08-11 13:42:08 UTC

We are unable to make progress on this bug without the requested information, so the bug is now being closed. If the problem persists, please provide the requested information and reopen the bug.