Bug 2080504 - SNO stuck in "ignition" loop during install
Summary: SNO stuck in "ignition" loop during install
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: RHCOS Bug Triage
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-04-29 20:32 UTC by Alex Krzos
Modified: 2022-10-20 12:21 UTC (History)
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-11 13:42:08 UTC
Target Upstream Version:
Embargoed:


Attachments
sno00131 console looping (7.60 MB, application/x-matroska)
2022-04-29 20:32 UTC, Alex Krzos

Description Alex Krzos 2022-04-29 20:32:46 UTC
Created attachment 1876079 [details]
sno00131 console looping

OCP Version at Install Time: 4.10.8
RHCOS Version at Install Time: 
Platform: libvirt
Architecture: x86_64


We are installing 2000 SNOs via ZTP/Assisted-installer hosted in an ACM Hub cluster via gitops. The use-case is for Telco managing many SNO clusters.

We witnessed a very small percentage of clusters that seem to get stuck in an ignition loop during install. This seems to vary between 1 and 3 SNOs per 1780 installs, which makes the failure condition exceedingly rare.

We can provide access to the SNOs displaying the failure in order to facilitate debugging.
Please see the attached screen recording.

The two captured failure screen recordings show different behavior when trying to view the console via virsh:

[root@e25-h05-000-r640 ~]# virsh console sno00131
Connected to domain sno00131
Escape character is ^]

[root@e25-h05-000-r640 ~]#

Nothing shows for sno00131 ^^^


[root@e30-h21-000-r640 ~]# virsh console sno00732                         
Connected to domain sno00732                                          
Escape character is ^]
               
Red Hat Enterprise Linux CoreOS 410.84.202203290245-0 (Ootpa) 4.10
Ignition: ran on 2022/04/29 00:37:42 UTC (at least 1 boot ago)
Ignition: user-provided config was applied
SSH host key: SHA256:QA+co8Wrtol46ImZb9e/MDk7ofFsAIOZdE9FLGFSqEY (ECDSA)
SSH host key: SHA256:4PY3PVfq2LxReH8O945hoMtLqmPOIjEI3ux5k8C+nbU (ED25519)
SSH host key: SHA256:4bqIG9kBpJ5REWYDXxf6l9cpPnoigDYQclN2nCvhQ6c (RSA)
enp1s0:
sno00732 login:

So a login prompt is at least available for sno00732.

Comment 2 Colin Walters 2022-04-29 21:15:07 UTC
These are clearly special SNOflakes.

OK so I see in there "systemd: Started emergency shell" - must be from something else failing and triggering OnFailure=emergency.target.
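For reference, from the emergency shell something like this usually shows which unit failed and pulled in emergency.target (generic systemd commands, nothing specific to this report):

```
# list units that failed during this boot
systemctl --failed --no-legend
# error-level journal messages from the current boot
journalctl -b -p err --no-pager
```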

Ahh...scrolling a bit up frame by frame, I see the fatal error:

```
ostree-cmdline: mount: /proc/cmdline: mount(2) system call failed: No such file or directory
```

That's...weird.  `man 2 mount` says:
`ENOENT A pathname was empty or had a nonexistent component.`
But...that seems unlikely.

It might be some sort of kernel race.  

OK this is from ostree-prepare-root.c:
https://github.com/ostreedev/ostree/blob/main/src/switchroot/ostree-prepare-root.c

but the error message looks like it's from util-linux:
https://github.com/util-linux/util-linux/blob/24e896c1400c2328b8bdffde674a3d74429acdf1/libmount/src/context_mount.c#L2004
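A quick sanity check that the string really lives in libmount rather than in ostree-prepare-root could look like this (the library path assumes a standard x86_64 layout):

```
# The message format should be compiled into libmount/mount, while
# ostree-prepare-root only calls the raw libc mount() and has no such string.
strings /usr/lib64/libmount.so.1 /usr/bin/mount 2>/dev/null \
  | grep -F 'mount(2) system call failed'
```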

Comment 4 Colin Walters 2022-04-29 21:27:00 UTC
Confusing though...ostree-prepare-root does not use libmount from util-linux, only the plain libc mount().
I don't quite see how we'd be able to output that string...

Ah wait I see this is the Live ISO, where we have
https://github.com/coreos/fedora-coreos-config/blob/8d4d1ec1989d98a5a3d0c8e95215ae7c31688e4e/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/live-generator#L49
injected which does
https://github.com/coreos/fedora-coreos-config/blob/8d4d1ec1989d98a5a3d0c8e95215ae7c31688e4e/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/ostree-cmdline.sh#L15

which definitely uses libmount via /bin/mount.

OK.  But...we should clearly have /proc/cmdline at this point.
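The pattern in that script is roughly the following sketch (the path and the appended argument here are hypothetical, not the actual script contents); the relevant point is that the bind mount goes through /bin/mount, i.e. through libmount:

```
# Sketch only: write an adjusted kernel command line to a scratch file, then
# bind-mount it over /proc/cmdline so later tooling sees the extra argument.
printf '%s %s\n' "$(cat /proc/cmdline)" "extra-karg=example" > /run/ostree-cmdline
mount --bind /run/ostree-cmdline /proc/cmdline
```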

Comment 5 Colin Walters 2022-04-29 21:28:22 UTC
Adding Karel in the hope he has seen "unusual ENOENT from mount" races...

Comment 6 Alex Krzos 2022-04-30 00:20:38 UTC
Interestingly enough, if I virsh destroy the SNOs, the hub cluster operators start them back up (expected) and they then continue further through the install, including being discovered and then installing.

Comment 9 Timothée Ravier 2022-05-12 13:57:13 UTC
Generally we need the full console log from the failing node to be able to diagnose.

Comment 10 Benjamin Gilbert 2022-06-03 04:17:41 UTC
The repeated `fetched {user,base} config from "system"` messages are unexpected.  Ignition doesn't have any loops that should cause that message to be printed repeatedly.

Comment 11 Micah Abbott 2022-06-03 13:54:30 UTC
@alex see comment #9; we need a complete console log showing the failing node booting and any error conditions observed during boot.

Comment 12 Alex Krzos 2022-06-07 13:38:23 UTC
(In reply to Micah Abbott from comment #11)
> @alex see comment #9; we need a complete console log showing the failing
> node booting and any error conditions observed during boot

Hi, is there a method to obtain the full console log from a machine stuck in this condition, when I cannot access it either from the console (virsh console or VNC) or from ssh? Would I need to enable something in the image before install?

Comment 13 Benjamin Gilbert 2022-06-07 13:51:04 UTC
libvirt can log the serial port output to a file, which will produce the console logs if `console=ttyS0,115200` is also specified on the kernel command line.  The latter is not the default if you're booting from the live ISO (though it currently is the default for the installed system).  You can manually add the argument in the bootloader, or use `coreos-installer iso kargs modify -a console=ttyS0,115200 <iso-file>` to modify the ISO to add the argument.
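Putting that together, a capture setup could look roughly like this (the domain name and log path are just examples):

```
# Add the console karg to the live ISO (as above):
coreos-installer iso kargs modify -a console=ttyS0,115200 <iso-file>

# Then have libvirt log the serial device to a file, e.g. by adding this inside
# the domain's <devices> section (via `virsh edit sno00131`):
#   <serial type='pty'>
#     <log file='/var/log/libvirt/qemu/sno00131-serial.log' append='on'/>
#     <target port='0'/>
#   </serial>
virsh edit sno00131
```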

Comment 14 RHCOS Bug Triage 2022-08-11 13:42:08 UTC
We are unable to make progress on this bug without the requested information, so the bug is now being closed. If the problem persists, please provide the requested information and reopen the bug.

