Bug 2093486 - 4.11.fc.0 SNO install in ignition loop
Summary: 4.11.fc.0 SNO install in ignition loop
Keywords:
Status: CLOSED DUPLICATE of bug 2074483
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.11
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: RHCOS Bug Triage
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 2052124
Reported: 2022-06-03 20:33 UTC by Dwaine Gonyier
Modified: 2022-06-21 19:35 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-16 15:43:31 UTC
Target Upstream Version:
Embargoed:


Attachments
coreos-boot-edit-1.png (21.74 KB, image/png)
2022-06-14 20:42 UTC, Marius Cornea
coreos-boot-edit-2.png (51.39 KB, image/png)
2022-06-14 20:43 UTC, Marius Cornea

Description Dwaine Gonyier 2022-06-03 20:33:58 UTC
Description of problem:
Attempting to install 4.11.fc.0 on SNO metal node results in ignition loop.

Version-Release number of selected component (if applicable):
quay.io/openshift-release-dev/ocp-release:4.11.0-fc.0-x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install this version of OCP 4.11 on SNO spoke cluster with DU profile (internal pipeline)
2. Observe console
3.

Actual results:
Continuous loop

Expected results:
Successful install.

Additional info:
See video capture attachment below for console log. It is fast scrolling so I recommend pausing and stepping through frames.

Comment 2 yliu1 2022-06-03 20:47:54 UTC
This happened during a ZTP install. After booting from the CD, the server is rebooted to boot from the hard drive, and the reboot loop occurs during that boot from the hard drive.

This issue is reproducible with 4.11 builds; the same server installed successfully with a 4.10 OCP build from the same hub cluster.

Comment 6 Ian Miller 2022-06-06 14:00:12 UTC
May relate to BZ 2080504?

Comment 7 Jonathan Lebon 2022-06-06 14:44:51 UTC
We need to see the logs from the start. In the video, Ignition is failing because it's being rerun and is unable to overwrite a file it wrote in a previous iteration. So those error messages are a red herring. The real error happened at the very start and may or may not have involved Ignition at all.

There's some funkiness going on with systemd/dracut where in some failure mode, instead of staying put it tries to rerun the whole transaction over and over again. I've experienced this as well in the past. Independently of this, we should look into how we can tighten this so that it doesn't happen.

Comment 8 Ken Young 2022-06-08 18:11:54 UTC
Setting this to blocker+ for OCP GA. This is fundamental for the Telco GA so that we can complete our testing on top of OCP. SNO also needs to be installable as a supported configuration.

Comment 9 Dwaine Gonyier 2022-06-08 19:36:36 UTC
(In reply to Jonathan Lebon from comment #7)
> We need to see the logs from the start. In the video, Ignition is failing
> because it's being rerun and is unable to overwrite a file it wrote in a
> previous iteration. So those error messages are a red herring. The real
> error happened at the very start and may or may not have involved Ignition
> at all.
> 
> There's some funkiness going on with systemd/dracut where in some failure
> mode, instead of staying put it tries to rerun the whole transaction over
> and over again. I've experienced this as well in the past. Independently of
> this, we should look into how we can tighten this so that it doesn't happen.

Added a new console video capture (337 MB MP4) with the first Ignition pass:
https://drive.google.com/file/d/1QbOK-yeip6itLy9FNd604iV6OCWHVqkY/view?usp=sharing

Some timestamps:
15:03 Ignition boot from HD
26:11 second reboot to HD, where the Ignition loop starts

Note that there is a separate known boot order issue with this host that
requires manually setting the internal HD as the first boot option to
avoid booting from the CD install repeatedly. This ensures that the
install behaves as intended.

You will see those efforts in the video :)

Comment 10 Jonathan Lebon 2022-06-08 20:07:02 UTC
Thanks for the recording! It helps a lot.

So what I'm seeing is that at 15:48, Ignition hangs for a long while trying to write `/sysroot/etc/kubernetes/static-pod-resources/etcd-member/etcd-all-certs/etcd-peer-worker-1...key`.

Then at 21:01, we just get a "reboot: Restarting system" and the machine reboots.

So we need to figure out why the initramfs started hanging and the machine subsequently rebooted. Is it possible something is interacting with the BMC at the same time?

Comment 11 Jonathan Lebon 2022-06-08 20:16:30 UTC
Adding `rd.debug` could provide more insight but the way the reboot message shows up without systemd unwinding the transaction makes it seem like a low-level request, e.g. triggered from the BMC or some service doing `systemctl reboot -ff`.

Comment 12 Dwaine Gonyier 2022-06-09 14:42:35 UTC
As far as I know, nothing was interacting with the BMC in the background at that point.

Comment 13 yliu1 2022-06-10 17:25:21 UTC
An update: this seems to be an issue in the 4.11 OS image used to boot the spoke in assisted install. 
https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-release/latest-4.11/

I was able to deploy 4.11.0-fc.0 OCP on my spoke cluster by using the 4.10 OS images in the AgentServiceConfig: https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.10/latest/

Comment 14 Jonathan Lebon 2022-06-13 19:11:22 UTC
(In reply to yliu1 from comment #13)
> An update: this seems to be an issue in the 4.11 OS image used to boot the
> spoke in assisted install. 
> https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-
> release/latest-4.11/
> 
> I was able to deploy 4.11.0-fc.0 OCP on my spoke cluster by using these 4.10
> os image in agentserviceconfig:
> https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.10/
> latest/

Ack thanks. So this might be a RHEL 8.4 to 8.5 issue.

The RHCOS 4.11 images in the mirror location are still at 8.5, but we've recently moved (back) to 8.6. It'd be good to confirm that this is still an issue in 8.6. We don't currently have bootimages but will try to get some out soon.

Meanwhile, one thing I'm interested in knowing is whether this issue also happens if you do an RHCOS install directly, i.e. outside the context of OCP. Can you try just installing RHCOS using `coreos-installer install` manually with a simple Ignition config?
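
For the standalone test suggested above, a minimal Ignition config could be generated along these lines. This is only a sketch: the spec version `3.2.0`, the `core` user, the placeholder SSH key, the output file name `simple.ign`, and the target device `/dev/sda` are assumptions, not details taken from this bug.

```python
import json


def build_minimal_ignition(ssh_key: str, version: str = "3.2.0") -> dict:
    """Return a minimal Ignition config that only adds an SSH key for `core`.

    The spec version is an assumption; use whatever the image under test supports.
    """
    return {
        "ignition": {"version": version},
        "passwd": {
            "users": [
                {"name": "core", "sshAuthorizedKeys": [ssh_key]},
            ]
        },
    }


if __name__ == "__main__":
    # Placeholder key: replace with a real public key before use.
    config = build_minimal_ignition("ssh-ed25519 PLACEHOLDER_PUBLIC_KEY")
    with open("simple.ign", "w") as f:
        json.dump(config, f, indent=2)
    # Then, from the live environment (target device is an assumption):
    #   sudo coreos-installer install /dev/sda --ignition-file simple.ign
```

Keeping the config this small means any hang or reboot seen during first boot is unlikely to come from the OCP-generated Ignition payload.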

Comment 16 Marius Cornea 2022-06-14 20:42:17 UTC
After troubleshooting the reproducer system, it seems that the failure is caused by the coreos-boot-edit service, which fails with `Error: Expected one vendor dir on /dev/sda2, got 2` (screenshot attached).

I found the same issue (it seems Dell-specific) reported in https://github.com/coreos/fedora-coreos-tracker/issues/1116 and the fix in https://github.com/coreos/coreos-installer/pull/802 . Can someone confirm the fix has landed in the RHCOS 4.11 images?

Thanks!

Comment 17 Marius Cornea 2022-06-14 20:42:56 UTC
Created attachment 1890026 [details]
coreos-boot-edit-1.png

Comment 18 Marius Cornea 2022-06-14 20:43:25 UTC
Created attachment 1890027 [details]
coreos-boot-edit-2.png

Comment 19 Marius Cornea 2022-06-14 20:53:55 UTC
(In reply to Marius Cornea from comment #16)
> After troubleshooting the reproducer system it seems that the failure is
> caused by coreos-boot-edit service which is failing with `Error: Expected
> one vendor dir on /dev/sda2, got 2`, screenshot attached. 
> 
> I found the same issue(seems Dell specific) reported in
> https://github.com/coreos/fedora-coreos-tracker/issues/1116 and the fix in
> https://github.com/coreos/coreos-installer/pull/802 . Can someone confirm
> the fix has landed in the RHCOS 4.11 images?
> 
> Thanks!

Contents of /dev/sda2 on the reproducer system:

find /mnt/
/mnt/
/mnt/EFI
/mnt/EFI/redhat
/mnt/EFI/redhat/fonts
/mnt/EFI/redhat/shimx64.efi
/mnt/EFI/redhat/BOOTX64.CSV
/mnt/EFI/redhat/grubx64.efi
/mnt/EFI/redhat/mmx64.efi
/mnt/EFI/redhat/shimx64-redhat.efi
/mnt/EFI/redhat/grub.cfg
/mnt/EFI/BOOT
/mnt/EFI/BOOT/BOOTX64.EFI
/mnt/EFI/BOOT/fbx64.efi
/mnt/EFI/Dell
/mnt/EFI/Dell/BootOptionCache
/mnt/EFI/Dell/BootOptionCache/BootOptionCache.dat
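
The listing above shows why a check that expects exactly one vendor directory trips on this hardware: besides `EFI/redhat`, the firmware has created `EFI/Dell`. The sketch below reproduces that count from the listing. It is illustrative only: `vendor_dirs` and the "must contain `grub.cfg`" filter are hypothetical and are not taken from the coreos-installer source or from the fix in PR 802.

```python
from pathlib import PurePosixPath

# Paths copied from the `find /mnt/` output above.
ESP_PATHS = [
    "/mnt/",
    "/mnt/EFI",
    "/mnt/EFI/redhat",
    "/mnt/EFI/redhat/fonts",
    "/mnt/EFI/redhat/shimx64.efi",
    "/mnt/EFI/redhat/BOOTX64.CSV",
    "/mnt/EFI/redhat/grubx64.efi",
    "/mnt/EFI/redhat/mmx64.efi",
    "/mnt/EFI/redhat/shimx64-redhat.efi",
    "/mnt/EFI/redhat/grub.cfg",
    "/mnt/EFI/BOOT",
    "/mnt/EFI/BOOT/BOOTX64.EFI",
    "/mnt/EFI/BOOT/fbx64.efi",
    "/mnt/EFI/Dell",
    "/mnt/EFI/Dell/BootOptionCache",
    "/mnt/EFI/Dell/BootOptionCache/BootOptionCache.dat",
]


def vendor_dirs(paths, root="/mnt/EFI", require=None):
    """List directories directly under EFI/, excluding the fallback BOOT dir.

    If `require` is set, keep only directories that contain that file
    (a hypothetical filter, used here purely for illustration).
    """
    root = PurePosixPath(root)
    entries = {PurePosixPath(p) for p in paths}
    # A first-level name is a directory if some listed path lies beneath it.
    names = sorted({
        p.relative_to(root).parts[0]
        for p in entries
        if root in p.parents and len(p.relative_to(root).parts) >= 2
    })
    names = [n for n in names if n.upper() != "BOOT"]
    if require is not None:
        names = [n for n in names if root / n / require in entries]
    return names


if __name__ == "__main__":
    print(vendor_dirs(ESP_PATHS))                      # every non-BOOT directory
    print(vendor_dirs(ESP_PATHS, require="grub.cfg"))  # only dirs with grub.cfg
```

Counting every non-BOOT directory gives two entries on this ESP, which matches the "got 2" in the error message; any stricter notion of what counts as a vendor directory would skip the Dell cache directory.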

Comment 20 Jonathan Lebon 2022-06-14 21:39:08 UTC
(In reply to Marius Cornea from comment #16)
> After troubleshooting the reproducer system it seems that the failure is
> caused by coreos-boot-edit service which is failing with `Error: Expected
> one vendor dir on /dev/sda2, got 2`, screenshot attached. 
> 
> I found the same issue(seems Dell specific) reported in
> https://github.com/coreos/fedora-coreos-tracker/issues/1116 and the fix in
> https://github.com/coreos/coreos-installer/pull/802 . Can someone confirm
> the fix has landed in the RHCOS 4.11 images?
> 
> Thanks!

Yes, that patch is in coreos-installer v0.14.0 which is in the latest 4.11 bootimages. The RHBZ tracking that is https://bugzilla.redhat.com/show_bug.cgi?id=2074483.

When you say "the reproducer system", are you talking about the same system from which the video was captured in comment 1?
The error mode in the video capture there seems very different.

To confirm, can you or the reporter retry this again on the same system with the latest 4.11 bootimages? (See https://github.com/openshift/installer/blob/master/data/data/coreos/rhcos.json).

Comment 21 Marius Cornea 2022-06-15 08:49:53 UTC
(In reply to Jonathan Lebon from comment #20)
> (In reply to Marius Cornea from comment #16)
> > After troubleshooting the reproducer system it seems that the failure is
> > caused by coreos-boot-edit service which is failing with `Error: Expected
> > one vendor dir on /dev/sda2, got 2`, screenshot attached. 
> > 
> > I found the same issue(seems Dell specific) reported in
> > https://github.com/coreos/fedora-coreos-tracker/issues/1116 and the fix in
> > https://github.com/coreos/coreos-installer/pull/802 . Can someone confirm
> > the fix has landed in the RHCOS 4.11 images?
> > 
> > Thanks!
> 
> Yes, that patch is in coreos-installer v0.14.0 which is in the latest 4.11
> bootimages. The RHBZ tracking that is
> https://bugzilla.redhat.com/show_bug.cgi?id=2074483.
> 
> When you say "the reproducer system", are you talking about the same system
> from which the video was captured in comment 1?
> The error mode in the video capture there seems very different.

It was not the exact same system, but a different one with the same hardware, which showed the same symptoms as in the video capture.

> To confirm, can you or the reporter retry this again on the same system with
> the latest 4.11 bootimages? (See
> https://github.com/openshift/installer/blob/master/data/data/coreos/rhcos.
> json).

I confirm the issue no longer reproduces with the latest 4.11 images; the node was able to boot without issues:

rhcos-411.85.202205101201-0-live-rootfs.x86_64.img
rhcos-411.85.202205101201-0-live.x86_64.iso
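
The build ID in these file names can be split into its parts to check which RHEL stream an image is based on, which is relevant to the 8.5 versus 8.6 question in comment 14. The sketch assumes the usual `<OCP>.<RHEL>.<timestamp>-<n>` layout of RHCOS build IDs; `parse_rhcos_build` is a hypothetical helper, not part of any existing tool.

```python
import re
from datetime import datetime

# Assumed layout: rhcos-<ocp major><ocp minor>.<rhel major><rhel minor>.<YYYYMMDDHHMM>-<n>
BUILD_RE = re.compile(r"rhcos-(\d)(\d+)\.(\d)(\d+)\.(\d{12})-(\d+)")


def parse_rhcos_build(filename: str) -> dict:
    """Split an RHCOS artifact file name into OCP stream, RHEL stream, and build time."""
    m = BUILD_RE.search(filename)
    if m is None:
        raise ValueError(f"not an RHCOS build file name: {filename!r}")
    ocp_major, ocp_minor, rhel_major, rhel_minor, stamp, rev = m.groups()
    return {
        "ocp": f"{ocp_major}.{ocp_minor}",
        "rhel": f"{rhel_major}.{rhel_minor}",
        "built": datetime.strptime(stamp, "%Y%m%d%H%M"),
        "revision": int(rev),
    }


if __name__ == "__main__":
    print(parse_rhcos_build("rhcos-411.85.202205101201-0-live.x86_64.iso"))
```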

Comment 22 Jonathan Lebon 2022-06-15 13:17:08 UTC
Thanks for testing. Closing as dupe.

*** This bug has been marked as a duplicate of bug 2074483 ***

Comment 23 yliu1 2022-06-15 18:29:17 UTC
@jlebon The linked BZ was closed because the fix landed in the OCP build, but we need the same fix in the images here before this one can be closed: https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-release/latest-4.11/

Could you please follow up on that?

Comment 24 yliu1 2022-06-15 18:31:11 UTC
Reopening because the fix has not yet landed in the expected place.

Comment 25 Micah Abbott 2022-06-16 12:37:23 UTC
(In reply to yliu1 from comment #23)
> @jlebon the linked bz got a fix in the ocp build so it is closed.
> But we would need the same fix to be here to close this one:
> https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-
> release/latest-4.11/
> 
> Could you please follow up on that?

The CoreOS team is not responsible for updating the mirrors; you would need to contact ART about the frequency of updating the mirror.

Comment 28 Jonathan Lebon 2022-06-16 15:43:31 UTC
Re-closing. Please track any mirroring requests to ART in Jira as above.

*** This bug has been marked as a duplicate of bug 2074483 ***

Comment 29 yliu1 2022-06-21 19:35:58 UTC
Thank you Micah!

