Created attachment 1735426 [details]
bootstrap-0.log

Description of problem:

The latest RHCOS builds from CI are failing to install on both zVM and zKVM. I will attach the redirected output from the guest console log for bootstrap-0 (we see this on every guest), but these are the last lines reported before entering Emergency Mode:

[  138.420670] systemd[1]: Closed udev Control Socket.
[  138.517802] systemd-vconsole-setup[767]: KD_FONT_OP_GET failed while trying to get the font metadata: Function not implemented
[  138.517838] systemd-vconsole-setup[767]: Fonts will not be copied to remaining consoles
[  OK  ] Started Setup Virtual Console.
[  138.519442] systemd[1]: systemd-vconsole-setup.service: Succeeded.
[  138.519474] systemd[1]: Started Setup Virtual Console.
[  138.519494] systemd[1]: Started Emergency Shell.
[  OK  ] Started Emergency Shell.
[  138.519607] systemd[1]: Reached target Emergency Mode.
[  OK  ] Reached target Emergency Mode.
[  138.519695] systemd[1]: Startup finished in 4.318s (kernel) + 0 (initrd) + 2min 14.200s (userspace) = 2min 18.519s.

Displaying logs from failed units: sysroot.mount coreos-livepxe-rootfs.service
-- Logs begin at Tue 2020-12-01 23:03:14 UTC, end at Tue 2020-12-01 23:05:28 UTC. --
Dec 01 23:05:27 systemd[1]: Mounting /sysroot...
Dec 01 23:05:27 mount[758]: mount: /sysroot: failed to setup loop device for /root.squashfs.
Dec 01 23:05:27 systemd[1]: sysroot.mount: Mount process exited, code=exited status=32
Dec 01 23:05:27 systemd[1]: sysroot.mount: Failed with result 'exit-code'.
Dec 01 23:05:27 systemd[1]: Failed to mount /sysroot.
-- Logs begin at Tue 2020-12-01 23:03:14 UTC, end at Tue 2020-12-01 23:05:28 UTC. --
Dec 01 23:03:16 systemd[1]: Starting Acquire live PXE rootfs image...
Dec 01 23:03:16 coreos-livepxe-rootfs[737]: Fetching rootfs image from http://9.12.23.79:8080/CI/rhcos-47.83.202012010110-0//rhcos-47.83.202012010110-0-live-rootfs.s390x.img...
Dec 01 23:05:27 coreos-livepxe-rootfs[737]: Error: premature end of input data at offset 0
Dec 01 23:05:27 coreos-livepxe-rootfs[737]: Couldn't fetch, verify, and unpack image specified by coreos.live.rootfs_url=
Dec 01 23:05:27 coreos-livepxe-rootfs[737]: Check that the URL is correct and that the rootfs version matches the initramfs.
Dec 01 23:05:27 systemd[1]: coreos-livepxe-rootfs.service: Main process exited, code=exited, status=1/FAILURE
Dec 01 23:05:27 systemd[1]: coreos-livepxe-rootfs.service: Failed with result 'exit-code'.
Dec 01 23:05:27 systemd[1]: Failed to start Acquire live PXE rootfs image.
Dec 01 23:05:27 systemd[1]: coreos-livepxe-rootfs.service: Triggering OnFailure= dependencies.
Enter for emergency shell or wait 3 minutes for reboot.

The last working RHCOS build we succeeded in installing was rhcos-47.82.202011040711-0.

Version-Release number of selected component (if applicable):
rhcos-47.83.202011251911-0
rhcos-47.83.202012010110-0

How reproducible:
Consistently

Steps to Reproduce:
1. Obtain the RHCOS kernel, initramfs and rootfs
2. Punch the files
3.

Actual results:
Failure reported with mounting root.squashfs

Expected results:
RHCOS to successfully install

Additional info:
Folks,

To add to what Phil has mentioned for the zVM OCP on Z 4.7 installation issue using either of these RHCOS 47.83 builds, rhcos-47.83.202011251911-0 and rhcos-47.83.202012010110-0, here is some pertinent failure information from a bootstrap node console:

12/01/20 00:33:32 [   14.551207] systemd[1]: Mounting /sysroot...
12/01/20 00:33:32 [  OK  ] Started Persist osmet files (PXE).
12/01/20 00:33:32 [   14.567311] systemd[1]: Started Persist osmet files (PXE).
12/01/20 00:33:32 [   14.674085] squashfs: version 4.0 (2009/01/31) Phillip Lougher
12/01/20 00:33:32 [   14.679861] SQUASHFS error: zlib decompression failed, data probably corrupt
12/01/20 00:33:32 [   14.679864] SQUASHFS error: squashfs_read_data failed to read block 0x2c09b67f
12/01/20 00:33:32 [   14.679865] SQUASHFS error: Unable to read metadata cache entry [2c09b67f]
12/01/20 00:33:32 [   14.679866] SQUASHFS error: Unable to read inode 0x60005025c
12/01/20 00:33:32 [   14.681354] mount[852]: mount: /sysroot: can't read superblock on /dev/loop1.
12/01/20 00:33:32 [   14.712087] systemd[1]: sysroot.mount: Mount process exited, code=exited status=32
12/01/20 00:33:32 [   14.712160] systemd[1]: sysroot.mount: Failed with result 'exit-code'.
12/01/20 00:33:32 [   14.712477] systemd[1]: Failed to mount /sysroot.
12/01/20 00:33:32 [FAILED] Failed to mount /sysroot.
12/01/20 00:33:32 See 'systemctl status sysroot.mount' for details.
12/01/20 00:33:32 [   14.712597] systemd[1]: Dependency failed for OSTree Prepare OS/.
12/01/20 00:33:32 [DEPEND] Dependency failed for OSTree Prepare OS/.

Thank you,
Kyle
I'm seeing three errors here:

1. coreos-livepxe-rootfs[737]: Error: premature end of input data at offset 0
2. mount: /sysroot: failed to setup loop device for /root.squashfs.
3. SQUASHFS error: zlib decompression failed, data probably corrupt

#2 could be the result of #1, but #1 and #3 seem mutually exclusive: we wouldn't see a squashfs error unless we had fetched the squashfs.

We've seen #3 before. It was one of the issues reported in bug 1863466, and it went away when RHCOS 4.6 switched from an 8.3 development kernel to an 8.2 kernel. It's possible that there's an 8.3 kernel bug that made it into the final release.
Oh, also, between rhcos-47.82.202011040711-0 and rhcos-47.83.202011251911-0, RHCOS switched to RHEL 8.3.
Benjamin,

Thank you for the information. Issue #3, "SQUASHFS error: zlib decompression failed, data probably corrupt", does indeed look very similar to the RHCOS 46.82 issue previously seen, where a RHEL 8.3 kernel bug was introduced into the RHCOS build.

Thank you,
Kyle
Hi Phil/Kyle,

Which zVM are you using? I tried it on my z14 and it works fine.

Thanks,
Prashanth
Prashanth,

Thanks for the information -- this sounds very similar to the RHCOS 4.6 issue from August/September using a RHEL 8.3 build. We are using a z15, which was the only server type (of the z13, z14, and z15 servers) that had exhibited issue #3 for that RHCOS 4.6 issue.

Thank you,
Kyle
Using the RHCOS build 47.83.202012010110-0, I have seen the reported problem on the z13, z14 and z15. I also agree that this problem is reminiscent of the bug from RHCOS 4.6.

When the boot fails, I pressed Enter twice and tried to issue the following commands, but there doesn't seem to be a /root.squashfs:

Press Enter for emergency shell or wait 5 minutes for reboot.
Generating "/run/initramfs/rdsosreport.txt"
Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.

:/# ls -l /root.squashfs
ls: cannot access '/root.squashfs': No such file or directory
Philip, comment 7 sounds more like the fetch problem (#1). Can you curl the rootfs URL manually and see if an error is reported? https://github.com/coreos/fedora-coreos-config/pull/758 adds logging for fetch errors.
Hi all. This issue is z15-specific and was probably introduced with https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=aa5b395b69b65450e008b95ec623b4fc4b175f9f

I have informed people at IBM about this issue. For now you can add the `dfltcc=off` kernel option to `boot.parm`, and then installation should work.
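The workaround above amounts to one extra kernel argument in the parameter file that gets punched with the kernel and initramfs. A minimal sketch, assuming a single-line parm file named `boot.parm` (the file name and its sample contents here are illustrative, not taken from a real install):

```shell
# Illustrative: append dfltcc=off to a zVM parameter file.
# boot.parm and its starting contents are hypothetical examples.
parm_file=boot.parm
printf '%s\n' 'rd.neednet=1 coreos.inst=yes ip=dhcp' > "$parm_file"

# Add the karg only if it is not already present.
grep -qw 'dfltcc=off' "$parm_file" || sed -i 's/$/ dfltcc=off/' "$parm_file"

cat "$parm_file"
# rd.neednet=1 coreos.inst=yes ip=dhcp dfltcc=off
```

On zVM the updated parm file would then be re-punched with the kernel and initramfs; for a libvirt/KVM guest the same argument goes into the domain's <cmdline> element instead.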
@HaJo: please mirror to IBM
Hi @Nikita, we recreated this problem on z13 and z14 as well. I tried the 'dfltcc=off' workaround there and it did not help. We also found that this new DFLTCC instruction is specific to z15? I also just tested this on a z15, and I did not see an improvement. Just to be sure this option is being defined correctly, this is what I have for the cmdline under the domain libvirt xml to create the guest:

<cmdline>rd.neednet=1 console=ttysclp0 coreos.inst=yes coreos.inst.install_dev=vda dfltcc=off coreos.live.rootfs_url=http://9.12.23.79:8080/CI/rhcos-47.83.202012010110-0//rhcos-47.83.202012010110-0-live-rootfs.s390x.img coreos.inst.ignition_url=http://192.168.79.1:8080/ignition/bootstrap.ign ip=dhcp nameserver=192.168.79.1</cmdline>

Hi @Benjamin, I tried to fetch the rootfs file from within the Emergency shell, and it failed:

:/# curl http://9.12.23.79:8080/CI/rhcos-47.83.202012010110-0//rhcos-47.83.202012010110-0-live-rootfs.s390x.img
curl: (7) Couldn't connect to server

Could it be that the network is not fully up in emergency mode? I performed the same fetch command successfully from a bootstrap-0 node that is started on another cluster.
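One quick way to confirm that the option actually reached the guest kernel is to inspect the kernel command line from inside the guest (e.g. from the emergency shell). A small sketch; on a real guest you would read /proc/cmdline, and the sample string below merely stands in for it:

```shell
# Sketch: check whether dfltcc=off made it onto the kernel command line.
# On a live guest: cmdline=$(cat /proc/cmdline). The value below is a sample.
cmdline='rd.neednet=1 console=ttysclp0 coreos.inst=yes dfltcc=off ip=dhcp'
case "$cmdline" in
  *dfltcc=off*) echo "dfltcc disabled" ;;
  *)            echo "dfltcc active (hardware default)" ;;
esac
# prints: dfltcc disabled
```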
@Philip, on z15 `zlib` fails with something similar to:

[    7.287372] SQUASHFS error: zlib decompression failed, data probably corrupt
[    7.287377] SQUASHFS error: squashfs_read_data failed to read block 0x2c0a2541

Regarding networking -- in emergency mode it works; please check your machine with `ip a`. From bootstrap-0.log:

[    7.219814] NetworkManager[721]: <info>  [1606863796.7666] dhcp4 (enc1): option ip_address => '192.168.79.20'
[    7.220128] NetworkManager[721]: <info>  [1606863796.7667] dhcp4 (enc1): option routers => '192.168.79.1'
[    7.220144] NetworkManager[721]: <info>  [1606863796.7667] dhcp4 (enc1): option subnet_mask => '255.255.255.0'

Are you sure 9.12.23.79 is reachable?
To clarify, there appear to be two problems here, the zlib one and the networking one. @Philip, yes, networking should be automatically enabled in the initramfs when coreos.live.rootfs_url is specified.
Folks,

When using the RHCOS 47.83.202012031612-0 build on the bootstrap nodes for both zVM and KVM, and the OCP 4.7.0-0.nightly-s390x-2020-12-03-200158 build:

1. When using ECKD disk storage for the bootstrap node, we have successfully installed the bootstrap node. We're running some additional ECKD install tests with this RHCOS build to see how consistently this works.

2. When using FCP disk storage for the bootstrap node, the install consistently fails, with the bootstrap node crashing.

For the zVM environment, the attempted install of the bootstrap node fails with:

12/03/20 18:46:51 [  OK  ] Started Persist osmet files (PXE).
12/03/20 18:46:51 [   17.086066] squashfs: version 4.0 (2009/01/31) Phillip Lougher
12/03/20 18:46:51 [   17.118863] SQUASHFS error: zlib decompression failed, data probably corrupt
12/03/20 18:46:51 [   17.118870] SQUASHFS error: squashfs_read_data failed to read block 0x2c0ce338
12/03/20 18:46:51 [   17.118872] SQUASHFS error: Unable to read metadata cache entry [2c0ce338]
12/03/20 18:46:51 [   17.118875] SQUASHFS error: Unable to read inode 0x5ff110791
12/03/20 18:46:51 [   17.122934] mount[847]: mount: /sysroot: can't read superblock on /dev/loop1.
12/03/20 18:46:51 [FAILED] Failed to mount /sysroot.
12/03/20 18:46:51 See 'systemctl status sysroot.mount' for details.
12/03/20 18:46:51 [DEPEND] Dependency failed for Initrd Root File System.
12/03/20 18:46:51 [DEPEND] Dependency failed for Reload Configuration from the Real Root.
12/03/20 18:46:51 [DEPEND] Dependency failed for sysroot-relabel.service.
12/03/20 18:46:51 [DEPEND] Dependency failed for /sysroot/etc.
12/03/20 18:46:51 [DEPEND] Dependency failed for /sysroot/var.
12/03/20 18:46:51 [DEPEND] Dependency failed for sysroot-xfs-ephemeral-setup.service.
12/03/20 18:46:51 [DEPEND] Dependency failed for OSTree Prepare OS/.
12/03/20 18:46:51 [  OK  ] Stopped target Basic System.
12/03/20 18:46:51 [  OK  ] Reached target Initrd File Systems.
12/03/20 18:46:51 [  OK  ] Stopped dracut pre-mount hook.
12/03/20 18:46:51 [  OK  ] Stopped target System Initialization.
12/03/20 18:46:51 [  OK  ] Stopped target Subsequent (Not Ignition) boot complete.
12/03/20 18:46:51 [   17.177439] systemd[1]: sysroot.mount: Mount process exited, code=exited status=32
12/03/20 18:46:51 [   17.177911] systemd[1]: sysroot.mount: Failed with result 'exit-code'.
12/03/20 18:46:51 [   17.177940] systemd[1]: Failed to mount /sysroot.

Thank you,
Kyle
@krmoser It doesn't matter which disk you use; `zlib` fails even before the installation starts. It's interesting that you managed to install rhcos-47-83 on z15 without turning `dfltcc` off. Or did you turn it off?
Nikita,

Thanks. For the above mentioned tests, we did not turn `dfltcc` off when running on the z15 servers, including for the subset of RHCOS 47.83 install tests that succeeded.

When turning `dfltcc` off when running on z15 servers:

1. For zVM OCP 4.7 installations using ECKD and FCP storage, the RHCOS 47.83 installs consistently succeed.

2. For KVM OCP 4.7 installations using ECKD storage, the RHCOS 47.83 installs consistently succeed.

3. For KVM OCP 4.7 installations using FCP storage, some RHCOS 47.83 installations succeed; we believe the others that are failing are not related to this z15 zlib decompression issue, but to a downlevel RHEL 8.3 kernel as discussed in Bug 1899762. We will be conducting additional tests to confirm, and will update that bug, and this one, with our subsequent KVM FCP install test results.

Thank you,
Kyle
Since the decompression problem is in CoreOS based on a RHEL 8.3 kernel, it might hit RHEL 8.3 as well. How do we find out that we do not have the same problem there? This would hit customers in their production environments.
(In reply to Holger Wolf from comment #17)
> since uncompression problem is in CoreOS based on a RHEL 8.3 kernel, then
> this might hit RHEL 8.3 as well.
>
> How do we find out that we have not the same problem there? This would hit
> customers in their prod. environments.

Hey All, IBM KVM Solution Test Team here... We didn't observe any issues or have to make any customization surrounding zlib for our installation tests, so I have a feeling this is somehow scoped to RHCOS. For RHEL 8.3 we ran through mostly GUI installations and are just now working on kickstart-based installations. However, this should just be a process difference and should not affect the underlying installation procedure, I would think.
@Max Bender On z15 the issue was reproduced with:

qemu-system-s390x -nographic -kernel rhcos-47.83.202012091751-0-live-kernel-s390x -initrd rhcos-47.83.202012091751-0-live-initramfs.s390x.img -append "rd.systemd.unit=emergency.target" -m 1024 -enable-kvm -hda root.squashfs

Did you test squashfs?
@Nikita Dubrovskii Speaking for Max and the KVM team, the closest we've come to using squashfs is with kickstart installs and the install.img that comes with the RHEL install media. The automation for the kickstart installs is still under development, so we have not tried it in large numbers. Of the attempts to reproduce Phil's issue with kickstart installs, I was able to reproduce a similar result just once, with a scenario of 10 simultaneous kickstart installs. But it was not clear whether there were other bottlenecks affecting the host at the time (ex: cpuset for machine.slice on the host was limited to 4 cpus). A subsequent attempt to rerun the scenario, on the same host as well as on another host, ran successfully and did not encounter the error. The KVM team has not tried to reproduce with the RHCOS image itself so far.
The fix is ready: https://lore.kernel.org/lkml/20201215155551.894884-1-iii@linux.ibm.com/
Reducing severity, since the offending optimization can be disabled via karg.
(In reply to Nikita Dubrovskii (IBM) from comment #21)
> fix is ready:
> https://lore.kernel.org/lkml/20201215155551.894884-1-iii@linux.ibm.com/

Do we know when this fix will land in RHCOS?
Bug 1908011 requests a kernel backport, but there have been no updates on that bug.
(In reply to Benjamin Gilbert from comment #22)
> Reducing severity, since the offending optimization can be disabled via karg.

Given the above, should this be removed from the blocker list?
> Given the above should this be removed from the blocker list?

I would say so.
Given that this bug is dependent on a bug that targets RHEL 8.4, should we bump the target release to "---" or a later release?
Ideally the kernel fix would be backported to 8.3. The underlying bug is an 8.3 kernel regression.
Higher priority work has prevented this issue from being solved; adding UpcomingSprint keyword
(In reply to Dan Li from comment #27)
> Given that this bug is dependent on a bug that targets RHEL 8.4, should we
> bump the target release to "---" or a later release?

Since it is not clear when the patch for the kernel fix (https://bugzilla.redhat.com/show_bug.cgi?id=1908011) will be included in a kernel build, I'm inclined to agree and target this for 4.8. If and when we are able to get that fix included in an 8.3.z kernel, we can clone this BZ for 4.7.
Adding a note to the docs that `dfltcc=off` is required for IBM z15 and LinuxONE III: https://github.com/openshift/openshift-docs/pull/31861

Once this is fixed, please inform me and I will remove it.
The fixed kernel is attached to the RHEL 8.4 GA errata, so we can expect this to be resolved once RHCOS 4.8 rebases to RHEL 8.4 GA. Will move it to MODIFIED once that happens.
RHCOS 4.8 moved to using RHEL 8.4 GA content with build 48.84.202105182219-0, which included `kernel-4.18.0-305.el8`. This build and newer are available in the OCP 4.8 nightly payloads. Moving to MODIFIED.
Tested again today. Steps:

1) Make sure you use a z15 (8561):

$ cat /proc/cpuinfo | grep 8561
processor 0: version = FF, identification = 3F2E48, machine = 8561
processor 1: version = FF, identification = 3F2E48, machine = 8561
machine : 8561
machine : 8561

2) Extract root.squashfs from the live rootfs:

$ cpio -i < rhcos-48.84.202106231130-0-live-rootfs.s390x.img root.squashfs

3) Run qemu:

$ sudo qemu-system-s390x -nographic -enable-kvm -kernel rhcos-48.84.202106231130-0-live-kernel-s390x -initrd rhcos-48.84.202106231130-0-live-initramfs.s390x.img -m 1024 --append "rd.systemd.unit=emergency.target" -hda root.squashfs

4) In the emergency shell:

:/# modprobe virtio_blk virtio_scsi
[   29.581698] virtio_blk: unknown parameter 'virtio_scsi' ignored
[   29.583311] virtio_blk virtio1: [vda] 1572488 512-byte logical blocks (805 MB/768 MiB)
[   29.583314] vda: detected capacity change from 0 to 805113856
:/# mkdir /mnt
:/# mount -t squashfs /dev/vda /mnt
:/# ls -la /mnt
total 3
drwxr-xr-x.  4 root root  87 Jun 23 11:40 .
drwxr-xr-x  13 root root 400 Jun 23 16:21 ..
-rw-rw-r--.  1 root root 191 Jun 23 11:40 .coreos-aleph-version.json
drwxr-xr-x.  5 root root 130 Jun 23 11:40 boot
drwxr-xr-x.  5 root root  95 Jun 23 11:40 ostree
:/# cat /mnt/.coreos-aleph-version.json
{
  "build": "48.84.202106231130-0",
  "ref": "",
  "ostree-commit": "89b64bbc05a09586b083ee9e1465ede674cd1114ebcc9ef2ced8556d254ee892",
  "imgid": "rhcos-48.84.202106231130-0-metal.s390x.raw"
}
:/# uname -a
Linux localhost 4.18.0-305.3.1.el8_4.s390x #1 SMP Mon May 17 10:16:29 EDT 2021 s390x s390x s390x GNU/Linux

So it works as expected.
While Nikita was the assignee for this BZ, the fix landed in the kernel by another engineer, so his verification in comment #36 doesn't violate any notion of the "assignee cannot verify their own BZs". Marking this VERIFIED per comment #36
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438