Bug 2093486
Summary: | 4.11.fc.0 SNO install in ignition loop | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Dwaine Gonyier <dgonyier>
Component: | RHCOS | Assignee: | RHCOS Bug Triage <rhcos-triage>
Status: | CLOSED DUPLICATE | QA Contact: | Michael Nguyen <mnguyen>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 4.11 | CC: | bgilbert, bzvonar, ccrum, dornelas, jhou, jlebon, jligon, keyoung, mcornea, miabbott, mifiedle, mrussell, nstielau, pamoedom, yliu1
Target Milestone: | --- | Keywords: | AutomationBlocker, Reopened, TestBlocker
Target Release: | 4.11.0 | |
Hardware: | x86_64 | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-06-16 15:43:31 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 2052124 | |
Attachments: | | |

Description
Dwaine Gonyier
2022-06-03 20:33:58 UTC
This happened during a ZTP install: after the CD boot, the server is rebooted to boot from the hard drive, and the reboot loop happens during that boot from the hard drive. The issue is reproducible with 4.11 builds, and the same server was installed successfully with a 4.10 OCP build from the same hub cluster.

May relate to BZ 2080504?

We need to see the logs from the start. In the video, Ignition is failing because it's being rerun and is unable to overwrite a file it wrote in a previous iteration. So those error messages are a red herring. The real error happened at the very start and may or may not have involved Ignition at all.

There's some funkiness going on with systemd/dracut where in some failure mode, instead of staying put, it tries to rerun the whole transaction over and over again. I've experienced this as well in the past. Independently of this, we should look into how we can tighten this so that it doesn't happen.

Setting this to blocker+ for OCP GA. This is fundamental for the Telco GA for us to be able to complete our testing on top of OCP. As well, SNO needs to be installable as a supported configuration.

(In reply to Jonathan Lebon from comment #7)
> We need to see the logs from the start. In the video, Ignition is failing
> because it's being rerun and is unable to overwrite a file it wrote in a
> previous iteration. So those error messages are a red herring. The real
> error happened at the very start and may or may not have involved Ignition
> at all.
>
> There's some funkiness going on with systemd/dracut where in some failure
> mode, instead of staying put it tries to rerun the whole transaction over
> and over again. I've experienced this as well in the past. Independently of
> this, we should look into how we can tighten this so that it doesn't happen.

Added a new console video capture (337 MB MP4) with the first Ignition pass: https://drive.google.com/file/d/1QbOK-yeip6itLy9FNd604iV6OCWHVqkY/view?usp=sharing

Some timestamps:
15:03 Ignition boot from HD
26:11 second reboot to HD, where the Ignition loop starts

Note that there is a separate known boot order issue with this host that requires manually setting the internal HD as the first boot option to avoid booting from the CD install repeatedly. This is to ensure the install behavior is as intended. You will see those efforts in the video :)

Thanks for the recording! It helps a lot. So what I'm seeing is that at 15:48, Ignition hangs for a long while trying to write `/sysroot/etc/kubernetes/static-pod-resources/etcd-member/etcd-all-certs/etcd-peer-worker-1...key`. Then at 21:01, we just get a "reboot: Restarting system" and the machine reboots.

So we need to figure out why the initramfs started hanging and the machine subsequently rebooted. Is it possible something is interacting with the BMC at the same time? Adding `rd.debug` could provide more insight, but the way the reboot message shows up without systemd unwinding the transaction makes it seem like a low-level request, e.g. triggered from the BMC or some service doing `systemctl reboot -ff`.

As far as I know, nothing was interacting with the BMC in the background at that point.

An update: this seems to be an issue in the 4.11 OS image used to boot the spoke in assisted install.
https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-release/latest-4.11/

I was able to deploy 4.11.0-fc.0 OCP on my spoke cluster by using the 4.10 OS images in agentserviceconfig:
https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.10/latest/

(In reply to yliu1 from comment #13)
> An update: this seems to be an issue in the 4.11 OS image used to boot the
> spoke in assisted install.
> https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-release/latest-4.11/
>
> I was able to deploy 4.11.0-fc.0 OCP on my spoke cluster by using these 4.10
> os image in agentserviceconfig:
> https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.10/latest/

Ack, thanks. So this might be a RHEL 8.4 to 8.5 issue. The RHCOS 4.11 images in the mirror location are still at 8.5, but we've recently moved (back) to 8.6. It'd be good to confirm that this is still an issue in 8.6. We don't currently have bootimages but will try to get some out soon.

Meanwhile, one thing I'm interested in knowing is whether this issue also happens if you do an RHCOS install directly, i.e. outside the context of OCP. Can you try just installing RHCOS using `coreos-installer install` manually with a simple Ignition config?

After troubleshooting the reproducer system, it seems that the failure is caused by the coreos-boot-edit service, which is failing with `Error: Expected one vendor dir on /dev/sda2, got 2`; screenshot attached.

I found the same issue (seems Dell-specific) reported in https://github.com/coreos/fedora-coreos-tracker/issues/1116 and the fix in https://github.com/coreos/coreos-installer/pull/802 . Can someone confirm the fix has landed in the RHCOS 4.11 images?

Thanks!

Created attachment 1890026 [details]
coreos-boot-edit-1.png
Created attachment 1890027 [details]
coreos-boot-edit-2.png
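
Editorial sketch for readers following the `Expected one vendor dir on /dev/sda2, got 2` error reported above. It is not the coreos-installer code (which is written in Rust); it only illustrates, under stated assumptions, how a check that treats every directory under `EFI/` except `BOOT` as a vendor directory would count two on this hardware. The path list is copied from the `/dev/sda2` listing posted in this thread. The function names and the `grub.cfg` filter are hypothetical and are not a description of what PR 802 actually changed.

```python
from pathlib import PurePosixPath

# Paths copied from the `find /mnt/` output for /dev/sda2 in this thread.
ESP_PATHS = [
    "/mnt/",
    "/mnt/EFI",
    "/mnt/EFI/redhat",
    "/mnt/EFI/redhat/fonts",
    "/mnt/EFI/redhat/shimx64.efi",
    "/mnt/EFI/redhat/BOOTX64.CSV",
    "/mnt/EFI/redhat/grubx64.efi",
    "/mnt/EFI/redhat/mmx64.efi",
    "/mnt/EFI/redhat/shimx64-redhat.efi",
    "/mnt/EFI/redhat/grub.cfg",
    "/mnt/EFI/BOOT",
    "/mnt/EFI/BOOT/BOOTX64.EFI",
    "/mnt/EFI/BOOT/fbx64.efi",
    "/mnt/EFI/Dell",
    "/mnt/EFI/Dell/BootOptionCache",
    "/mnt/EFI/Dell/BootOptionCache/BootOptionCache.dat",
]


def vendor_dirs(paths, mount="/mnt"):
    """Naive view (assumption): every directory under EFI/ except BOOT is a vendor dir."""
    efi = PurePosixPath(mount) / "EFI"
    found = set()
    for path in map(PurePosixPath, paths):
        try:
            rel = path.relative_to(efi)
        except ValueError:
            continue  # not under EFI/
        # Only count a name once something is seen beneath it, so it is a directory.
        if len(rel.parts) >= 2 and rel.parts[0].upper() != "BOOT":
            found.add(rel.parts[0])
    return sorted(found)


def vendor_dirs_with_grub_cfg(paths, mount="/mnt"):
    """Hypothetical tightening: keep only vendor dirs that contain a grub.cfg."""
    efi = PurePosixPath(mount) / "EFI"
    present = set(paths)
    return [d for d in vendor_dirs(paths, mount) if str(efi / d / "grub.cfg") in present]


if __name__ == "__main__":
    print("naive vendor dirs:", vendor_dirs(ESP_PATHS))
    print("with grub.cfg:    ", vendor_dirs_with_grub_cfg(ESP_PATHS))
```

With the listing from this thread, the naive count sees both `redhat` and the firmware-created `Dell` directory, which matches the "got 2" in the error message; a filter of the hypothetical kind shown leaves only `redhat`.
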
(In reply to Marius Cornea from comment #16)
> After troubleshooting the reproducer system it seems that the failure is
> caused by coreos-boot-edit service which is failing with `Error: Expected
> one vendor dir on /dev/sda2, got 2`, screenshot attached.
>
> I found the same issue(seems Dell specific) reported in
> https://github.com/coreos/fedora-coreos-tracker/issues/1116 and the fix in
> https://github.com/coreos/coreos-installer/pull/802 . Can someone confirm
> the fix has landed in the RHCOS 4.11 images?
>
> Thanks!

/dev/sda2 content on the reproducer system:

    find /mnt/
    /mnt/
    /mnt/EFI
    /mnt/EFI/redhat
    /mnt/EFI/redhat/fonts
    /mnt/EFI/redhat/shimx64.efi
    /mnt/EFI/redhat/BOOTX64.CSV
    /mnt/EFI/redhat/grubx64.efi
    /mnt/EFI/redhat/mmx64.efi
    /mnt/EFI/redhat/shimx64-redhat.efi
    /mnt/EFI/redhat/grub.cfg
    /mnt/EFI/BOOT
    /mnt/EFI/BOOT/BOOTX64.EFI
    /mnt/EFI/BOOT/fbx64.efi
    /mnt/EFI/Dell
    /mnt/EFI/Dell/BootOptionCache
    /mnt/EFI/Dell/BootOptionCache/BootOptionCache.dat

(In reply to Marius Cornea from comment #16)
> After troubleshooting the reproducer system it seems that the failure is
> caused by coreos-boot-edit service which is failing with `Error: Expected
> one vendor dir on /dev/sda2, got 2`, screenshot attached.
>
> I found the same issue(seems Dell specific) reported in
> https://github.com/coreos/fedora-coreos-tracker/issues/1116 and the fix in
> https://github.com/coreos/coreos-installer/pull/802 . Can someone confirm
> the fix has landed in the RHCOS 4.11 images?
>
> Thanks!

Yes, that patch is in coreos-installer v0.14.0, which is in the latest 4.11 bootimages. The RHBZ tracking that is https://bugzilla.redhat.com/show_bug.cgi?id=2074483.

When you say "the reproducer system", are you talking about the same system from which the video was captured in comment 1? The error mode in the video capture there seems very different.

To confirm, can you or the reporter retry this again on the same system with the latest 4.11 bootimages? (See https://github.com/openshift/installer/blob/master/data/data/coreos/rhcos.json).

(In reply to Jonathan Lebon from comment #20)
> (In reply to Marius Cornea from comment #16)
> > After troubleshooting the reproducer system it seems that the failure is
> > caused by coreos-boot-edit service which is failing with `Error: Expected
> > one vendor dir on /dev/sda2, got 2`, screenshot attached.
> >
> > I found the same issue(seems Dell specific) reported in
> > https://github.com/coreos/fedora-coreos-tracker/issues/1116 and the fix in
> > https://github.com/coreos/coreos-installer/pull/802 . Can someone confirm
> > the fix has landed in the RHCOS 4.11 images?
> >
> > Thanks!
>
> Yes, that patch is in coreos-installer v0.14.0 which is in the latest 4.11
> bootimages. The RHBZ tracking that is
> https://bugzilla.redhat.com/show_bug.cgi?id=2074483.
>
> When you say "the reproducer system", are you talking about the same system
> from which the video was captured in comment 1?
> The error mode in the video capture there seems very different.

It was not the exact same system but a different one with the same hardware, which showed the same symptoms as in the video capture.

> To confirm, can you or the reporter retry this again on the same system with
> the latest 4.11 bootimages? (See
> https://github.com/openshift/installer/blob/master/data/data/coreos/rhcos.json).

I confirm the issue no longer reproduces with the latest 4.11 images, and the node was able to boot without issues:

rhcos-411.85.202205101201-0-live-rootfs.x86_64.img
rhcos-411.85.202205101201-0-live.x86_64.iso

Thanks for testing. Closing as dupe.

*** This bug has been marked as a duplicate of bug 2074483 ***

@jlebon the linked BZ got a fix in the OCP build, so it is closed. But we would need the same fix to be here to close this one:
https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-release/latest-4.11/
Could you please follow up on that?

Reopen because the fix has not landed in the expected place yet.

(In reply to yliu1 from comment #23)
> @jlebon the linked bz got a fix in the ocp build so it is closed.
> But we would need the same fix to be here to close this one:
> https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-release/latest-4.11/
>
> Could you please follow up on that?

The CoreOS team is not responsible for updating the mirrors; you would need to contact ART about the frequency of updating the mirror.

Re-closing. Please track any mirroring requests to ART in Jira as above.

*** This bug has been marked as a duplicate of bug 2074483 ***

Thank you Micah!
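
Editorial sketch: much of this thread turns on which bootimage a mirror actually serves, so it can help to read the version out of an artifact name such as the two verified above. The parser below assumes the naming pattern seen in this bug, `rhcos-<OCP digits>.<RHEL digits>.<12-digit build timestamp>-<revision>`, with the first digit of each group taken as the major version (so `411.85` reads as OCP 4.11 on RHEL 8.5, consistent with the comment that the mirrored 4.11 images were still at 8.5). The function name and returned keys are hypothetical.

```python
import re
from datetime import datetime

# Assumed pattern, inferred from the artifact names quoted in this bug.
_RHCOS_NAME = re.compile(
    r"rhcos-(?P<ocp_major>\d)(?P<ocp_minor>\d+)"
    r"\.(?P<rhel_major>\d)(?P<rhel_minor>\d+)"
    r"\.(?P<stamp>\d{12})-(?P<rev>\d+)"
)


def parse_rhcos_name(name):
    """Split an RHCOS artifact name into OCP stream, RHEL base, and build time."""
    match = _RHCOS_NAME.search(name)
    if match is None:
        raise ValueError(f"not a recognised RHCOS artifact name: {name!r}")
    return {
        "ocp": f"{match['ocp_major']}.{match['ocp_minor']}",
        "rhel": f"{match['rhel_major']}.{match['rhel_minor']}",
        "built": datetime.strptime(match["stamp"], "%Y%m%d%H%M"),
        "revision": int(match["rev"]),
    }


if __name__ == "__main__":
    for artifact in (
        "rhcos-411.85.202205101201-0-live-rootfs.x86_64.img",
        "rhcos-411.85.202205101201-0-live.x86_64.iso",
    ):
        info = parse_rhcos_name(artifact)
        print(artifact, "->", info["ocp"], info["rhel"], info["built"].isoformat())
```

Comparing the parsed build time of whatever a mirror serves against a known-good image is one quick way to tell whether a mirror is still carrying an older bootimage.
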