Bug 2093486

Summary:

4.11.fc.0 SNO install in ignition loop

Product:

OpenShift Container Platform

Reporter:

Dwaine Gonyier <dgonyier>

Component:

RHCOS

Assignee:

RHCOS Bug Triage <rhcos-triage>

Status:

CLOSED DUPLICATE

QA Contact:

Michael Nguyen <mnguyen>

Severity:

high

Priority:

unspecified

Version:

4.11

CC:

bgilbert, bzvonar, ccrum, dornelas, jhou, jlebon, jligon, keyoung, mcornea, miabbott, mifiedle, mrussell, nstielau, pamoedom, yliu1

Target Milestone:

---

Keywords:

AutomationBlocker, Reopened, TestBlocker

Target Release:

4.11.0

Hardware:

x86_64

OS:

Unspecified

Last Closed:

2022-06-16 15:43:31 UTC

Type:

Bug

Bug Blocks:

2052124

Attachments:

Description	Flags
coreos-boot-edit-1.png	none
coreos-boot-edit-2.png	none

Description Dwaine Gonyier 2022-06-03 20:33:58 UTC

Description of problem:
Attempting to install 4.11.0-fc.0 on an SNO metal node results in an Ignition loop.

Version-Release number of selected component (if applicable):
quay.io/openshift-release-dev/ocp-release:4.11.0-fc.0-x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install this version of OCP 4.11 on SNO spoke cluster with DU profile (internal pipeline)
2. Observe console

Actual results:
Continuous loop

Expected results:
Successful install.

Additional info:
See the video capture attachment below for the console log. It scrolls fast, so I recommend pausing and stepping through frames.

Comment 2 yliu1 2022-06-03 20:47:54 UTC

This happened during a ZTP install: after the CD boot, the server is rebooted to boot from the hard drive. The reboot loop happens during the boot from the hard drive.

This issue is reproducible with 4.11 builds, and the same server was installed successfully with a 4.10 OCP build from the same hub cluster.

Comment 6 Ian Miller 2022-06-06 14:00:12 UTC

May relate to BZ 2080504?

Comment 7 Jonathan Lebon 2022-06-06 14:44:51 UTC

We need to see the logs from the start. In the video, Ignition is failing because it's being rerun and is unable to overwrite a file it wrote in a previous iteration. So those error messages are a red herring. The real error happened at the very start and may or may not have involved Ignition at all.

There's some funkiness going on with systemd/dracut where in some failure mode, instead of staying put it tries to rerun the whole transaction over and over again. I've experienced this as well in the past. Independently of this, we should look into how we can tighten this so that it doesn't happen.

Comment 8 Ken Young 2022-06-08 18:11:54 UTC

Setting this to blocker+ for OCP GA.  This is fundamental for the Telco GA for us to be able to complete our testing on top of OCP.  As well, SNO needs to be installable as a supported configuration.

Comment 9 Dwaine Gonyier 2022-06-08 19:36:36 UTC

(In reply to Jonathan Lebon from comment #7)
> We need to see the logs from the start. In the video, Ignition is failing
> because it's being rerun and is unable to overwrite a file it wrote in a
> previous iteration. So those error messages are a red herring. The real
> error happened at the very start and may or may not have involved Ignition
> at all.
> 
> There's some funkiness going on with systemd/dracut where in some failure
> mode, instead of staying put it tries to rerun the whole transaction over
> and over again. I've experienced this as well in the past. Independently of
> this, we should look into how we can tighten this so that it doesn't happen.

Added a new console video capture (337 MB MP4) with the first Ignition pass:
https://drive.google.com/file/d/1QbOK-yeip6itLy9FNd604iV6OCWHVqkY/view?usp=sharing

some timestamps
15:03 ignition boot from HD
26:11 second reboot to HD where ignition loop starts.

Note that there is a separate known boot order issue with this host that
requires manually setting the internal HD as the first boot option to
avoid booting from the CD install repeatedly. This is to ensure the
install behavior is as intended.

You will see those efforts in the video :)

Comment 10 Jonathan Lebon 2022-06-08 20:07:02 UTC

Thanks for the recording! It helps a lot.

So what I'm seeing is that at 15:48, Ignition hangs for a long while trying to write `/sysroot/etc/kubernetes/static-pod-resources/etcd-member/etcd-all-certs/etcd-peer-worker-1...key`.

Then at 21:01, we just get a "reboot: Restarting system" and the machine reboots.

So we need to figure out why the initramfs started hanging and the machine subsequently rebooted. Is it possible something is interacting with the BMC at the same time?

Comment 11 Jonathan Lebon 2022-06-08 20:16:30 UTC

Adding `rd.debug` could provide more insight but the way the reboot message shows up without systemd unwinding the transaction makes it seem like a low-level request, e.g. triggered from the BMC or some service doing `systemctl reboot -ff`.

Comment 12 Dwaine Gonyier 2022-06-09 14:42:35 UTC

As far as I know, nothing was interacting with the BMC in the background at that point.

Comment 13 yliu1 2022-06-10 17:25:21 UTC

An update: this seems to be an issue in the 4.11 OS image used to boot the spoke in assisted install. 
https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-release/latest-4.11/

I was able to deploy 4.11.0-fc.0 OCP on my spoke cluster by using these 4.10 OS images in the AgentServiceConfig: https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.10/latest/

Comment 14 Jonathan Lebon 2022-06-13 19:11:22 UTC

(In reply to yliu1 from comment #13)
> An update: this seems to be an issue in the 4.11 OS image used to boot the
> spoke in assisted install. 
> https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-
> release/latest-4.11/
> 
> I was able to deploy 4.11.0-fc.0 OCP on my spoke cluster by using these 4.10
> os image in agentserviceconfig:
> https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.10/
> latest/

Ack thanks. So this might be a RHEL 8.4 to 8.5 issue.

The RHCOS 4.11 images in the mirror location are still at 8.5, but we've recently moved (back) to 8.6. It'd be good to confirm that this is still an issue in 8.6. We don't currently have bootimages but will try to get some out soon.

Meanwhile, one thing I'm interested in knowing is whether this issue also happens if you do an RHCOS install directly, i.e. outside the context of OCP. Can you try just installing RHCOS using `coreos-installer install` manually with a simple Ignition config?
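
For anyone trying the standalone test suggested above, here is a minimal sketch of what a "simple Ignition config" could look like, generated from Python. The spec version `3.2.0`, the SSH key placeholder, the file name `simple.ign`, and the target device `/dev/sda` are all assumptions for illustration; adjust them to the system under test. The script only builds the config and the command line, it does not run the installer.

```python
import json


def minimal_ignition_config(ssh_public_key):
    """Build a small Ignition config that only adds an SSH key for `core`.

    Assumption: spec version 3.2.0 is accepted by the bootimage under test.
    """
    return {
        "ignition": {"version": "3.2.0"},
        "passwd": {
            "users": [
                {"name": "core", "sshAuthorizedKeys": [ssh_public_key]}
            ]
        },
    }


def install_command(device, ignition_path):
    """Return the coreos-installer invocation as an argument list.

    The command is only constructed here, never executed.
    """
    return ["coreos-installer", "install", device, "--ignition-file", ignition_path]


# Placeholder key, not a real credential.
config = minimal_ignition_config("ssh-ed25519 AAAA...placeholder core@example")
ignition_json = json.dumps(config, indent=2)

# /dev/sda and simple.ign are hypothetical; pick the real disk and path.
cmd = install_command("/dev/sda", "simple.ign")

print(ignition_json)
print(" ".join(cmd))
```

Writing `ignition_json` to `simple.ign` and running the printed command from the live ISO would exercise the RHCOS install path without any OCP components involved, which is the point of the test.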

Comment 16 Marius Cornea 2022-06-14 20:42:17 UTC

After troubleshooting the reproducer system, it seems that the failure is caused by the coreos-boot-edit service, which is failing with `Error: Expected one vendor dir on /dev/sda2, got 2` (screenshot attached).

I found the same issue (seems Dell-specific) reported in https://github.com/coreos/fedora-coreos-tracker/issues/1116 and the fix in https://github.com/coreos/coreos-installer/pull/802. Can someone confirm the fix has landed in the RHCOS 4.11 images?

Thanks!

Comment 17 Marius Cornea 2022-06-14 20:42:56 UTC

Created attachment 1890026
coreos-boot-edit-1.png

Comment 18 Marius Cornea 2022-06-14 20:43:25 UTC

Created attachment 1890027
coreos-boot-edit-2.png

Comment 19 Marius Cornea 2022-06-14 20:53:55 UTC

(In reply to Marius Cornea from comment #16)
> After troubleshooting the reproducer system it seems that the failure is
> caused by coreos-boot-edit service which is failing with `Error: Expected
> one vendor dir on /dev/sda2, got 2`, screenshot attached. 
> 
> I found the same issue(seems Dell specific) reported in
> https://github.com/coreos/fedora-coreos-tracker/issues/1116 and the fix in
> https://github.com/coreos/coreos-installer/pull/802 . Can someone confirm
> the fix has landed in the RHCOS 4.11 images?
> 
> Thanks!

/dev/sda2 content on the reproducer system:

find /mnt/
/mnt/
/mnt/EFI
/mnt/EFI/redhat
/mnt/EFI/redhat/fonts
/mnt/EFI/redhat/shimx64.efi
/mnt/EFI/redhat/BOOTX64.CSV
/mnt/EFI/redhat/grubx64.efi
/mnt/EFI/redhat/mmx64.efi
/mnt/EFI/redhat/shimx64-redhat.efi
/mnt/EFI/redhat/grub.cfg
/mnt/EFI/BOOT
/mnt/EFI/BOOT/BOOTX64.EFI
/mnt/EFI/BOOT/fbx64.efi
/mnt/EFI/Dell
/mnt/EFI/Dell/BootOptionCache
/mnt/EFI/Dell/BootOptionCache/BootOptionCache.dat
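
To make the "got 2" in the error message concrete, here is a rough sketch that counts candidate vendor directories in the listing above. It assumes a vendor directory is any directory directly under `EFI` other than the fallback `BOOT` directory; that is a guess at the check, not the actual coreos-installer code. The `grub.cfg` filter in the second function is one hypothetical way to tell the two directories apart, and is not confirmed to be what the linked fix does.

```python
from pathlib import PurePosixPath

# The `find /mnt/` output from the reproducer system, copied verbatim.
LISTING = """\
/mnt/
/mnt/EFI
/mnt/EFI/redhat
/mnt/EFI/redhat/fonts
/mnt/EFI/redhat/shimx64.efi
/mnt/EFI/redhat/BOOTX64.CSV
/mnt/EFI/redhat/grubx64.efi
/mnt/EFI/redhat/mmx64.efi
/mnt/EFI/redhat/shimx64-redhat.efi
/mnt/EFI/redhat/grub.cfg
/mnt/EFI/BOOT
/mnt/EFI/BOOT/BOOTX64.EFI
/mnt/EFI/BOOT/fbx64.efi
/mnt/EFI/Dell
/mnt/EFI/Dell/BootOptionCache
/mnt/EFI/Dell/BootOptionCache/BootOptionCache.dat
"""


def vendor_dirs(paths, esp_root="/mnt"):
    """Names of directories directly under <esp_root>/EFI, excluding BOOT.

    Assumption: this mirrors the spirit of the failing check, nothing more.
    """
    efi = PurePosixPath(esp_root) / "EFI"
    all_paths = [PurePosixPath(p) for p in paths]
    # In a flat listing, treat a path as a directory if something lies beneath it.
    dirs = {parent for p in all_paths for parent in p.parents}
    return sorted(
        p.name
        for p in all_paths
        if p.parent == efi and p in dirs and p.name.upper() != "BOOT"
    )


def vendor_dirs_with_grub_cfg(paths, esp_root="/mnt"):
    """Hypothetical disambiguation: keep only vendor dirs holding a grub.cfg."""
    efi = PurePosixPath(esp_root) / "EFI"
    present = {PurePosixPath(p) for p in paths}
    return [d for d in vendor_dirs(paths, esp_root) if efi / d / "grub.cfg" in present]


paths = LISTING.splitlines()
print(vendor_dirs(paths))                # two candidates: Dell and redhat
print(vendor_dirs_with_grub_cfg(paths))  # only redhat carries a grub.cfg
```

Under these assumptions, the firmware-created `EFI/Dell` directory is what pushes the count from one to two.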

Comment 20 Jonathan Lebon 2022-06-14 21:39:08 UTC

(In reply to Marius Cornea from comment #16)
> After troubleshooting the reproducer system it seems that the failure is
> caused by coreos-boot-edit service which is failing with `Error: Expected
> one vendor dir on /dev/sda2, got 2`, screenshot attached. 
> 
> I found the same issue(seems Dell specific) reported in
> https://github.com/coreos/fedora-coreos-tracker/issues/1116 and the fix in
> https://github.com/coreos/coreos-installer/pull/802 . Can someone confirm
> the fix has landed in the RHCOS 4.11 images?
> 
> Thanks!

Yes, that patch is in coreos-installer v0.14.0 which is in the latest 4.11 bootimages. The RHBZ tracking that is https://bugzilla.redhat.com/show_bug.cgi?id=2074483.

When you say "the reproducer system", are you talking about the same system from which the video was captured in comment 1?
The error mode in the video capture there seems very different.

To confirm, can you or the reporter retry this again on the same system with the latest 4.11 bootimages? (See https://github.com/openshift/installer/blob/master/data/data/coreos/rhcos.json).

Comment 21 Marius Cornea 2022-06-15 08:49:53 UTC

(In reply to Jonathan Lebon from comment #20)
> (In reply to Marius Cornea from comment #16)
> > After troubleshooting the reproducer system it seems that the failure is
> > caused by coreos-boot-edit service which is failing with `Error: Expected
> > one vendor dir on /dev/sda2, got 2`, screenshot attached. 
> > 
> > I found the same issue(seems Dell specific) reported in
> > https://github.com/coreos/fedora-coreos-tracker/issues/1116 and the fix in
> > https://github.com/coreos/coreos-installer/pull/802 . Can someone confirm
> > the fix has landed in the RHCOS 4.11 images?
> > 
> > Thanks!
> 
> Yes, that patch is in coreos-installer v0.14.0 which is in the latest 4.11
> bootimages. The RHBZ tracking that is
> https://bugzilla.redhat.com/show_bug.cgi?id=2074483.
> 
> When you say "the reproducer system", are you talking about the same system
> from which the video was captured in comment 1?
> The error mode in the video capture there seems very different.

It was not the exact same system but a different one with the same hardware which showed the same symptoms as in the video capture.

> To confirm, can you or the reporter retry this again on the same system with
> the latest 4.11 bootimages? (See
> https://github.com/openshift/installer/blob/master/data/data/coreos/rhcos.
> json).

I confirm the issue no longer reproduced with the latest 4.11 images and the node was able to boot without issues:

rhcos-411.85.202205101201-0-live-rootfs.x86_64.img
rhcos-411.85.202205101201-0-live.x86_64.iso
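
As a quick way to read these bootimage names, the sketch below splits the version string into its parts. It assumes the usual RHCOS naming layout of `<OCP version>.<RHEL version>.<build timestamp>-<revision>` with single-digit major versions; that layout is an assumption here, not something stated in this bug.

```python
import re
from datetime import datetime

# Assumed layout: rhcos-<ocp>.<rhel>.<YYYYMMDDHHMM>-<revision>-...
RHCOS_BUILD = re.compile(
    r"rhcos-(?P<ocp_major>\d)(?P<ocp_minor>\d+)"
    r"\.(?P<rhel_major>\d)(?P<rhel_minor>\d+)"
    r"\.(?P<stamp>\d{12})-(?P<rev>\d+)"
)


def parse_rhcos_build(filename):
    """Extract OCP stream, RHEL base, and build time from an RHCOS artifact name."""
    m = RHCOS_BUILD.search(filename)
    if m is None:
        raise ValueError("not an RHCOS build name: %r" % filename)
    return {
        "ocp": "%s.%s" % (m.group("ocp_major"), m.group("ocp_minor")),
        "rhel": "%s.%s" % (m.group("rhel_major"), m.group("rhel_minor")),
        "built": datetime.strptime(m.group("stamp"), "%Y%m%d%H%M"),
        "revision": int(m.group("rev")),
    }


info = parse_rhcos_build("rhcos-411.85.202205101201-0-live.x86_64.iso")
print(info["ocp"], info["rhel"], info["built"].isoformat())
```

Read this way, the images that booted successfully are a 4.11 stream build on a RHEL 8.5 base, which matches the note in comment 14 that the 4.11 images were still at 8.5 at that point.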

Comment 22 Jonathan Lebon 2022-06-15 13:17:08 UTC

Thanks for testing. Closing as dupe.

*** This bug has been marked as a duplicate of bug 2074483 ***

Comment 23 yliu1 2022-06-15 18:29:17 UTC

@jlebon the linked BZ got a fix in the OCP build, so it is closed. But we would need the same fix to land here before closing this one: https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-release/latest-4.11/

Could you please follow up on that?

Comment 24 yliu1 2022-06-15 18:31:11 UTC

Reopening because the fix has not landed in the expected place yet.

Comment 25 Micah Abbott 2022-06-16 12:37:23 UTC

(In reply to yliu1 from comment #23)
> @jlebon the linked bz got a fix in the ocp build so it is closed.
> But we would need the same fix to be here to close this one:
> https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-
> release/latest-4.11/
> 
> Could you please follow up on that?

The CoreOS team is not responsible for updating the mirrors; you would need to contact ART about the frequency of updating the mirror.

Comment 28 Jonathan Lebon 2022-06-16 15:43:31 UTC

Re-closing. Please track any mirroring requests to ART in Jira as above.

*** This bug has been marked as a duplicate of bug 2074483 ***

Comment 29 yliu1 2022-06-21 19:35:58 UTC

Thank you Micah!