Bug 2080504
Summary: | SNO stuck in "ignition" loop during install
---|---
Product: | OpenShift Container Platform
Component: | RHCOS
Version: | 4.10
Status: | CLOSED INSUFFICIENT_DATA
Severity: | medium
Priority: | unspecified
Hardware: | Unspecified
OS: | Unspecified
Type: | Bug
Reporter: | Alex Krzos <akrzos>
Assignee: | RHCOS Bug Triage <rhcos-triage>
QA Contact: | Michael Nguyen <mnguyen>
CC: | bgilbert, dornelas, imiller, jligon, kzak, miabbott, mrussell, nstielau, travier, walters
Last Closed: | 2022-08-11 13:42:08 UTC

These are clearly special SNOflakes.

OK so I see in there "systemd: Started emergency shell" - must be from something else failing and triggering OnFailure=emergency.target.

Ahh...scrolling a bit up frame by frame, I see the fatal error:

```
ostree-cmdline: mount: /proc/cmdline: mount(2) system call failed: No such file or directory
```

That's...weird. `man 2 mount` says: `ENOENT A pathname was empty or had a nonexistent component.` But...that seems unlikely. It might be some sort of kernel race.

OK this is from ostree-prepare-root.c: https://github.com/ostreedev/ostree/blob/main/src/switchroot/ostree-prepare-root.c
but the error message looks like it's from util-linux: https://github.com/util-linux/util-linux/blob/24e896c1400c2328b8bdffde674a3d74429acdf1/libmount/src/context_mount.c#L2004

Confusing though...ostree-prepare-root does not use libmount from util-linux, only the plain libc mount(). I don't quite see how we'd be able to output that string...

Ah wait I see this is the Live ISO, where we have https://github.com/coreos/fedora-coreos-config/blob/8d4d1ec1989d98a5a3d0c8e95215ae7c31688e4e/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/live-generator#L49 injected, which does https://github.com/coreos/fedora-coreos-config/blob/8d4d1ec1989d98a5a3d0c8e95215ae7c31688e4e/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/ostree-cmdline.sh#L15 which definitely uses libmount via /bin/mount.

OK. But...we should clearly have /proc/cmdline at this point. Adding Karel in the hope he has seen "unusual ENOENT from mount" races...

Interestingly enough, if I `virsh destroy` the SNOs, the hub cluster operators start them back up (expected) and they then get further into the install, including being discovered and then installing.

Generally we need the full console log from the failing node to be able to diagnose. The repeated `fetched {user,base} config from "system"` messages are unexpected.
Ignition doesn't have any loops that should cause that message to be printed repeatedly.

@alex see comment #9; we need a complete console log showing the failing node booting and any error conditions observed during boot

(In reply to Micah Abbott from comment #11)
> @alex see comment #9; we need a complete console log showing the failing
> node booting and any error conditions observed during boot

Hi, is there a method to obtain the full console log from a machine stuck in this condition, when I can not access it either from the console (virsh console or vnc into the console) or from ssh? Would I need to enable something in the image before install?

libvirt can log the serial port output to a file, which will produce the console logs if `console=ttyS0,115200` is also specified on the kernel command line. The latter is not the default if you're booting from the live ISO (though it currently is the default for the installed system). You can manually add the argument in the bootloader, or use `coreos-installer iso kargs modify -a console=ttyS0,115200 <iso-file>` to modify the ISO to add the argument.

We are unable to make progress on this bug without the requested information, so the bug is now being closed. If the problem persists, please provide the requested information and reopen the bug.

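For anyone reopening with the requested data, the serial-console suggestion in the comments above could be sketched roughly as follows. This is a dry-run illustration, not a tested procedure: the helper names, the ISO file name, and the log directory are hypothetical, and the `<log>` sub-element assumes a libvirt version that supports it for character devices. The `coreos-installer` command line is the one quoted in the comment.

```shell
#!/bin/sh
# Sketch only: helper names and the default log directory are hypothetical.

# Print the coreos-installer command (as quoted in the comment thread) that
# adds the serial console argument to a live ISO. It is printed rather than
# executed so the sketch stays a dry run.
iso_karg_command() {
    iso_file="$1"
    printf 'coreos-installer iso kargs modify -a console=ttyS0,115200 %s\n' "$iso_file"
}

# Print a libvirt <serial> device fragment that asks libvirt to log the
# guest's serial output to a file on the hypervisor.
# Assumption: /var/log/libvirt/qemu is writable by the libvirt/qemu user.
serial_log_xml() {
    domain="$1"
    log_dir="${2:-/var/log/libvirt/qemu}"
    cat <<EOF
<serial type='pty'>
  <log file='${log_dir}/${domain}-serial0.log' append='on'/>
  <target port='0'/>
</serial>
EOF
}

iso_karg_command rhcos-live.iso   # hypothetical ISO file name
serial_log_xml sno00131           # domain name taken from the report
```

The printed fragment would be placed in the domain's `<devices>` section (for example with `virsh edit sno00131`, replacing the existing serial device), and would normally take effect the next time the domain is started. With the kernel argument in place, the boot messages should then accumulate in the named log file even when `virsh console` shows nothing.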
Created attachment 1876079
sno00131 console looping

OCP Version at Install Time: 4.10.8
RHCOS Version at Install Time:
Platform: libvirt
Architecture: x86_64

We are installing 2000 SNOs via ZTP/Assisted-installer hosted in an ACM Hub cluster via gitops. The use-case is for Telco managing many SNO clusters. We witnessed a very tiny percentage of clusters that seem to get stuck in an ignition loop during install. This seems to vary between 1 and 3 SNOs per 1780 installs, which makes this failure condition exceedingly rare. We can provide access to the SNOs displaying the failure in order to facilitate debugging this.

Please see the captured screen recording.

The two captured failure screen recordings seem to show different behavior when trying to view the console via virsh:

[root@e25-h05-000-r640 ~]# virsh console sno00131
Connected to domain sno00131
Escape character is ^]

[root@e25-h05-000-r640 ~]#

Nothing shows for sno00131 ^^^

[root@e30-h21-000-r640 ~]# virsh console sno00732
Connected to domain sno00732
Escape character is ^]

Red Hat Enterprise Linux CoreOS 410.84.202203290245-0 (Ootpa) 4.10
Ignition: ran on 2022/04/29 00:37:42 UTC (at least 1 boot ago)
Ignition: user-provided config was applied
SSH host key: SHA256:QA+co8Wrtol46ImZb9e/MDk7ofFsAIOZdE9FLGFSqEY (ECDSA)
SSH host key: SHA256:4PY3PVfq2LxReH8O945hoMtLqmPOIjEI3ux5k8C+nbU (ED25519)
SSH host key: SHA256:4bqIG9kBpJ5REWYDXxf6l9cpPnoigDYQclN2nCvhQ6c (RSA)
enp1s0:
sno00732 login:

As a login prompt is at least available for sno00732