Created attachment 1749392 [details]
openshift_install.log

Version:
$ openshift-install version
install_version: "4.7.0-0.nightly-s390x-2021-01-19-095800"
rhcos_version: "4.7.0-fc.2-s390x"

Platform: s390x

Please specify: UPI

What happened?
When installing OCP on a 4k SCSI LUN, the installer detects the sector size correctly and writes the correct image to the disk. It then tries to restart the node, but the reboot fails: the s390x SCSI boot loader cannot find the boot record on that disk. The first guess is that zipl fails to write it, although there are no obvious errors.

---- after image write to disk ----
[ 91.859350] GPT:Primary header thinks Alt. header is not at the end of the disk.
[ 91.859358] GPT:853247 != 31457279
[ 91.859359] GPT:Alternate GPT header not at the end of the disk.
[ 91.859360] GPT:853247 != 31457279
[ 91.859362] GPT: Use GNU Parted to correct GPT errors.
[ 91.859370] sda: sda3 sda4
[ 92.128978] EXT4-fs (sda3): mounted filesystem with ordered data mode. Opts: (null)
[ 92.131577] coreos-installer-service[1157]: Writing Ignition config
[ 92.131715] coreos-installer-service[1157]: Writing first-boot kernel arguments
[ 92.131782] coreos-installer-service[1157]: Modifying kernel arguments
[ 92.133024] coreos-installer-service[1157]: Installing bootloader
[ 92.247270] coreos-installer-service[1157]: Updating re-IPL device
[ 92.301487] coreos-installer-service[1157]: Install complete.
[  OK  ] Started CoreOS Installer.
[  OK  ] Reached target CoreOS Installer Target.
[  OK  ] Started Reboot after CoreOS Installer.
[  OK  ] Reached target Finalize CoreOS Installer Target.
[ 92.306811] GPT:Primary header thinks Alt. header is not at the end of the disk.
[ 92.306816] GPT:853247 != 31457279
[ 92.306818] GPT:Alternate GPT header not at the end of the disk.
[ 92.306820] GPT:853247 != 31457279
[ 92.306823] GPT: Use GNU Parted to correct GPT errors.
[ 92.306831] sda: sda3 sda4
         Stopping Login Service...
[  OK  ] Stopped target Network is Online.
[  OK  ] Stopped target Timers.
[  OK  ] Stopped Daily Cleanup of Temporary Directories.
[  OK  ] Stopped target Finalize CoreOS Installer Target.
         Unmounting /sysroot...
         Stopping Reboot after CoreOS Installer...
[  OK  ] Stopped Network Manager Wait Online.
[  OK  ] Stopped target Network.
         Stopping Restore /run/initramfs on shutdown...
         Stopping Network Manager...
[  OK  ] Closed LVM2 poll daemon socket.
[  OK  ] Stopped daily update of the root trust anchor for DNSSEC.
[  OK  ] Stopped Run update-ca-trust.
[  OK  ] Stopped Reboot after CoreOS Installer.
[  OK  ] Stopped Login Service.
[  OK  ] Stopped Restore /run/initramfs on shutdown.
         Unmounting /boot...
[  OK  ] Stopped target CoreOS Installer Target.
[  OK  ] Stopped target Prepare for CoreOS Installer Target.
[  OK  ] Unmounted /sysroot.
[  OK  ] Unmounted /boot.
[  OK  ] Stopped Network Manager.
         Stopping D-Bus System Message Bus...
[  OK  ] Stopped D-Bus System Message Bus.
[  OK  ] Stopped target Basic System.
[  OK  ] Stopped target Slices.
[  OK  ] Removed slice User and Session Slice.
[  OK  ] Stopped target Paths.
[  OK  ] Stopped target Sockets.
[  OK  ] Closed D-Bus System Message Bus Socket.
[  OK  ] Stopped target System Initialization.
[  OK  ] Stopped target Local Encrypted Volumes.
[  OK  ] Stopped Dispatch Password Requests to Console Directory Watch.
         Stopping Load/Save Random Seed...
         Stopping Update UTMP about System Boot/Shutdown...
[  OK  ] Stopped Apply Kernel Variables.
[  OK  ] Stopped Load Kernel Modules.
[  OK  ] Stopped Update is Completed.
[  OK  ] Stopped Rebuild Hardware Database.
[  OK  ] Stopped Rebuild Dynamic Linker Cache.
[  OK  ] Stopped Load/Save Random Seed.
[  OK  ] Stopped Update UTMP about System Boot/Shutdown.
[  OK  ] Stopped Create Volatile Files and Directories.
[  OK  ] Stopped target Local File Systems.
         Unmounting /run/ephemeral...
         Unmounting Temporary Directory (/tmp)...
         Unmounting /etc...
         Unmounting /var...
[FAILED] Failed unmounting /etc.
[FAILED] Failed unmounting /var.
[  OK  ] Unmounted /run/ephemeral.
[  OK  ] Stopped target Local File Systems (Pre).
[  OK  ] Stopped Create Static Device Nodes in /dev.
[  OK  ] Stopped Create System Users.
         Stopping Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
[  OK  ] Unmounted Temporary Directory (/tmp).
[  OK  ] Reached target Unmount All Filesystems.
[  OK  ] Stopped target Swap.
[  OK  ] Stopped Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
[  OK  ] Reached target Shutdown.
[  OK  ] Reached target Final Step.
         Starting Reboot...
[ 93.131336] printk: systemd-shutdow: 1 output lines suppressed due to ratelimiting
[ 93.137693] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 93.140114] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 93.142954] systemd-journald[1020]: Received SIGTERM from PID 1 (systemd-shutdow).
[ 93.209056] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 93.211032] systemd-shutdown[1]: Unmounting file systems.
[ 93.211723] [1307]: Remounting '/var' read-only in with options 'seclabel,attr2,discard,inode64,logbufs=8,logbsize=32k,noquota'.
[ 93.219215] [1308]: Unmounting '/var'.
[ 93.299096] [1309]: Remounting '/etc' read-only in with options 'seclabel,attr2,discard,inode64,logbufs=8,logbsize=32k,noquota'.
[ 93.299626] [1310]: Unmounting '/etc'.
[ 93.349533] XFS (loop0): Unmounting Filesystem
[ 93.401728] systemd-shutdown[1]: All filesystems unmounted.
[ 93.401732] systemd-shutdown[1]: Deactivating swaps.
[ 93.401747] systemd-shutdown[1]: All swaps deactivated.
[ 93.401749] systemd-shutdown[1]: Detaching loop devices.
[ 93.402192] systemd-shutdown[1]: Not all loop devices detached, 1 left.
[ 93.402194] systemd-shutdown[1]: Detaching DM devices.
01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 00.
02: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 00.
03: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 00.
00: Storage cleared - system reset.
00: HCPLDI2816I Acquiring the machine loader from the processor controller.
00: HCPLDI2817I Load completed from the processor controller.
00: HCPLDI2817I Now starting the machine loader.
01: HCPGSP2630I The virtual machine is placed in CP mode due to a SIGP stop and store status from CPU 00.
02: HCPGSP2630I The virtual machine is placed in CP mode due to a SIGP stop and store status from CPU 00.
03: HCPGSP2630I The virtual machine is placed in CP mode due to a SIGP stop and store status from CPU 00.

---- reboot ----
00: MLOEVL012I: Machine loader up and running (version v2.4.7).
00: MLOLOA013E: Invalid Program Table, wrong or missing magic number.
00: MLOLOA041E: Unable to find the component table pointer.
00: MLOEVL010E: IPL failed.
00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 0002887A
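The two numbers in the GPT warnings above can be turned into sizes with a little arithmetic. The sketch below assumes that both values are counted in 4096-byte logical sectors (the report says the LUN is 4k, but the log itself does not state the unit); the function name and constants are illustrative only.

```python
# Assumption: both LBAs in the kernel's GPT warning are in units of the
# disk's logical sector size, taken here as 4096 bytes (a "4k" LUN).
SECTOR_SIZE = 4096


def bytes_up_to(last_lba: int, sector_size: int = SECTOR_SIZE) -> int:
    """Size in bytes of the region spanning LBA 0 through last_lba inclusive."""
    return (last_lba + 1) * sector_size


backup_header_lba = 853247   # where the primary GPT header expects the backup header
disk_last_lba = 31457279     # last LBA actually present on the disk

image_bytes = bytes_up_to(backup_header_lba)
disk_bytes = bytes_up_to(disk_last_lba)

print(f"written image spans {image_bytes / 2**20:.0f} MiB")
print(f"target disk spans   {disk_bytes / 2**30:.0f} GiB")
```

Under that assumption the image covers 3333 MiB of a 120 GiB disk, so the warning most likely only says that the backup GPT header still sits at the end of the image rather than at the end of the disk, which is expected right after an image is written to a larger device.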
The component in question here is coreos-installer, not openshift-install.
A number of changes around multiarch support landed in recent devel builds. If possible, please retry with the latest builds from https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.7-s390x
Retested with install_version: "4.7.0-0.nightly-s390x-2021-01-22-032543" rhcos_version: "4.7.0-fc.3-s390x". Exactly the same behaviour.
@Alexander Klein - one thing you can try: boot without any coreos.inst.* arguments and with "ignition.firstboot ignition.platform.id=metal ignition.config.url=http://...", then log in as core and run coreos-installer manually to write the image to disk. That way you can see what the installer is doing:

sudo coreos-installer install <disk-device> --ignition-url <url> --insecure-ignition
Reproduced this:

[ 57.327334] coreos-installer-service[1127]: Installing bootloader
[ 57.433335] coreos-installer-service[1127]: Target device information
[ 57.433418] coreos-installer-service[1127]: Device..........................: 08:00
[ 57.433451] coreos-installer-service[1127]: Partition.......................: 08:03
[ 57.433482] coreos-installer-service[1127]: Device name.....................: sda
[ 57.433515] coreos-installer-service[1127]: Device driver name..............: sd
[ 57.433547] coreos-installer-service[1127]: Type............................: disk partition
[ 57.433580] coreos-installer-service[1127]: Disk layout.....................: SCSI disk layout
[ 57.433613] coreos-installer-service[1127]: Geometry - start................: 2048
[ 57.433644] coreos-installer-service[1127]: File system block size..........: 4096
[ 57.433676] coreos-installer-service[1127]: Physical block size.............: 4096
[ 57.433711] coreos-installer-service[1127]: Device size in physical blocks..: 98304
[ 57.433745] coreos-installer-service[1127]: Building bootmap in '/tmp/coreos-installer-xFJooy'
[ 57.433779] coreos-installer-service[1127]: Adding IPL section
[ 57.433813] coreos-installer-service[1127]: initial ramdisk...: /tmp/coreos-installer-xFJooy/ostree/rhcos-475697892c8c54e5ee323999298fea919e71e0c2e8824af7c6fab9e16b6bae14/initramfs-4.18.0-240.10.1.el8_3.s390x.img
[ 57.433843] coreos-installer-service[1127]: kernel image......: /tmp/coreos-installer-xFJooy/ostree/rhcos-475697892c8c54e5ee323999298fea919e71e0c2e8824af7c6fab9e16b6bae14/vmlinuz-4.18.0-240.10.1.el8_3.s390x
[ 57.433875] coreos-installer-service[1127]: kernel parmline...: 'random.trust_cpu=on ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/475697892c8c54e5ee323999298fea919e71e0c2e8824af7c6fab9e16b6bae14/0 rd.znet=qeth,0.0.bdf0,0.0.bdf1,0.0.bdf2,layer2=1,portno=0 zfcp.allow_lun_scan=0 cio_ignore=all,!condev rd.zfcp=0.0.1985,0x500507605e819cc2,0x0001000000000000 ignition.firstboot rd.neednet=1 ip=172.18.142.4::172.18.0.1:255.254.0.0:coreos:encbdf0:off nameserver=172.18.0.1 '
[ 57.433912] coreos-installer-service[1127]: component address:
[ 57.433944] coreos-installer-service[1127]: heap area.......: 0x00002000-0x00005fff
[ 57.433975] coreos-installer-service[1127]: stack area......: 0x0000f000-0x0000ffff
[ 57.434012] coreos-installer-service[1127]: internal loader.: 0x0000a000-0x0000dfff
[ 57.434043] coreos-installer-service[1127]: parameters......: 0x00009000-0x00009fff
[ 57.434073] coreos-installer-service[1127]: kernel image....: 0x00010000-0x00687fff
[ 57.434104] coreos-installer-service[1127]: parmline........: 0x00689000-0x00689fff
[ 57.434136] coreos-installer-service[1127]: initial ramdisk.: 0x00690000-0x04549fff
[ 57.434166] coreos-installer-service[1127]: Preparing boot device: sda.
[ 57.434196] coreos-installer-service[1127]: Detected SCSI PCBIOS disk layout.
[ 57.434229] coreos-installer-service[1127]: Writing SCSI master boot record.
[ 57.434261] coreos-installer-service[1127]: Syncing disks...
[ 57.434294] coreos-installer-service[1127]: Done.
[ 57.434398] coreos-installer-service[1127]: Updating re-IPL device
[ 57.435428] coreos-installer-service[1127]: Re-IPL type: fcp
[ 57.435460] coreos-installer-service[1127]: WWPN: 0x500507605e819cc2
[ 57.435494] coreos-installer-service[1127]: LUN: 0x0001000000000000
[ 57.435525] coreos-installer-service[1127]: Device: 0.0.1985
[ 57.435558] coreos-installer-service[1127]: bootprog: 0
[ 57.435589] coreos-installer-service[1127]: br_lba: 0
[ 57.435620] coreos-installer-service[1127]: Loadparm: ""
[ 57.435652] coreos-installer-service[1127]: Bootparms: ""
[ 57.480009] coreos-installer-service[1127]: Install complete.
00: Storage cleared - system reset.
00: HCPLDI2816I Acquiring the machine loader from the processor controller.
00: HCPLDI2817I Load completed from the processor controller.
00: HCPLDI2817I Now starting the machine loader.
01: HCPGSP2630I The virtual machine is placed in CP mode due to a SIGP stop and store status from CPU 00.
00: MLOEVL012I: Machine loader up and running (version v2.4.7).
00: MLOLOA013E: Invalid Program Table, wrong or missing magic number.
00: MLOLOA041E: Unable to find the component table pointer.
00: MLOEVL010E: IPL failed.
00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 0002887A

The same image works on DASD (4k); continuing to debug.
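The block counts in the zipl output above are easier to sanity-check as byte values. The sketch below is only arithmetic on numbers taken from the log and assumes nothing about zipl internals. In particular, the log does not say whether the "Geometry - start" value of 2048 is counted in 512-byte sectors or in 4096-byte physical blocks, so both readings are shown to illustrate how far apart they are on a 4k disk. The helper name is hypothetical.

```python
PHYSICAL_BLOCK_SIZE = 4096   # "Physical block size" reported by zipl
LEGACY_SECTOR_SIZE = 512     # classic sector size, for comparison only


def to_bytes(count: int, unit: int) -> int:
    """Convert a count of blocks or sectors to bytes."""
    return count * unit


# "Device size in physical blocks..: 98304" (presumably partition 08:03).
partition_bytes = to_bytes(98304, PHYSICAL_BLOCK_SIZE)

# "Geometry - start................: 2048" under the two possible units.
start_if_512 = to_bytes(2048, LEGACY_SECTOR_SIZE)
start_if_4k = to_bytes(2048, PHYSICAL_BLOCK_SIZE)

print(f"device size:               {partition_bytes // 2**20} MiB")
print(f"start as 512-byte sectors: {start_if_512 // 2**20} MiB")
print(f"start as 4096-byte blocks: {start_if_4k // 2**20} MiB")
```

The two readings of the start value differ by a factor of eight (1 MiB versus 8 MiB). A unit mix-up of that size could plausibly make a boot loader look for its program table in the wrong place, but the log alone does not establish that this is the mechanism here.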
Short update: wrote a small patcher tool (~200 lines of C) and was able to boot the system. Now looking for the root cause of the data corruption.
Fix is ready - https://github.com/ibm-s390-linux/s390-tools/pull/107
@hannsj_uhl.com please mirror this to IBM - this is an s390 patch that needs to be integrated into RHEL 8.
Discussed this with the multi-arch team and decided that this is not a blocker and will be targeted for 4.8. In summary, this problem has been present for some time, and RHEL is affected as well. Customers won't be able to perform similar operations on RHEL or RHCOS until the fix is packaged and shipped as part of RHEL. The fix is targeting 8.4, with a possibility of backporting it to 8.3.
Waiting on upstream patch to be merged and new RPM to be released
------- Comment From stefan.haberland.com 2021-02-12 04:16 EDT------- Fix is reviewed and tested and will be pulled.
@Dan: This is about the zipl issue whose fix needs to go into RHEL as well (https://github.com/ibm-s390-linux/s390-tools/pull/107). I am not sure whether you would like to track this for RHEL separately. However, I think that we also need an s390-tools z-stream update for rhel-8.4.z so that RHCOS can pick it up. Let me know how best to approach this from your side. Thanks.
I believe we will need a RHEL clone for this item. There is still a chance to squeeze it to 8.4 GA as an exception. If not approved, then 8.5 + 8.4.0.z.
------- Comment From Jan.Hoeppner.com 2021-02-19 07:54 EDT-------
The fix is now available upstream:
https://github.com/ibm-s390-linux/s390-tools/commit/4a3957fab5696cc410c5b495956859a424e3552a
("zipl: fix reading 4k disk's geometry")
(In reply to Dan Horák from comment #13)
> I believe we will need a RHEL clone for this item. There is still a chance
> to squeeze it to 8.4 GA as an exception. If not approved, then 8.5 + 8.4.0.z.

... which has now been opened as
RHEL Bug 1933235 - RHEL8.4 - installer fails to write boot record on 4k scsi lun on s390x
------- Comment From tstaudt.com 2021-02-26 03:46 EDT-------
(In reply to comment #22)
> (In reply to Dan Horák from comment #13)
> > I believe we will need a RHEL clone for this item. There is still a chance
> > to squeeze it to 8.4 GA as an exception. If not approved, then 8.5 + 8.4.0.z.
>
> ... which has now been opened as
> RHEL Bug 1933235 - RHEL8.4 - installer fails to write boot record on 4k scsi
> lun on s390x

which is
LTC Bug 191653 - RH1933235- RHEL8.4 - installer fails to write boot record on 4k scsi lun on s390x (zipl) (s390utils/s390-tools)
RHCOS 4.8 moved to using RHEL 8.4 GA content with build 48.84.202105182219-0, which included `s390utils-2.15.1-5.el8`. This build and newer builds are available in the OCP 4.8 nightly payloads. Moving to MODIFIED.
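Checking whether a given RHCOS build is "this build or newer" can be done by comparing build IDs field by field. The sketch below is a minimal example that assumes the ID always has the form `<a>.<b>.<timestamp>-<n>` with purely numeric fields, as in the ID quoted above, and that both IDs come from the same stream; the function names are hypothetical.

```python
FIRST_FIXED_BUILD = "48.84.202105182219-0"   # build named in the comment above


def parse_build_id(build_id: str) -> tuple:
    """Split an RHCOS build ID such as '48.84.202105182219-0' into integers.

    Assumption: the ID has the form <a>.<b>.<timestamp>-<n> with numeric fields.
    """
    version_part, _, suffix = build_id.partition("-")
    fields = [int(field) for field in version_part.split(".")]
    fields.append(int(suffix) if suffix else 0)
    return tuple(fields)


def has_zipl_fix(build_id: str) -> bool:
    """True if build_id is the first fixed build or a newer one."""
    return parse_build_id(build_id) >= parse_build_id(FIRST_FIXED_BUILD)


# Build ID taken from the verification comment in this thread.
print(has_zipl_fix("48.84.202105191019-0"))
```

Comparing integer tuples avoids the pitfalls of plain string comparison if a field ever changes width.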
Verified on z/VM z13

Cluster:
48.84.202105191019-0
4.8.0-0.nightly-s390x-2021-05-19-141937

[core@master-0 ~]$ lsblk -o NAME,PHY-SEC
NAME   PHY-SEC
sda       4096
├─sda3    4096
└─sda4    4096

Successfully installed.
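The sector sizes that lsblk reports can also be read from sysfs, which is handy in scripts. The sketch below is a minimal example, assuming a Linux host that exposes `queue/logical_block_size` and `queue/physical_block_size` under `/sys/block/<device>`; the function names and the `sysfs_root` parameter are hypothetical and exist only so the code can be pointed at a test directory.

```python
from pathlib import Path


def read_block_sizes(device: str, sysfs_root: str = "/sys/block") -> dict:
    """Return the logical and physical block sizes (in bytes) of a block device."""
    queue = Path(sysfs_root) / device / "queue"
    return {
        "logical": int((queue / "logical_block_size").read_text().strip()),
        "physical": int((queue / "physical_block_size").read_text().strip()),
    }


def is_4k_native(sizes: dict) -> bool:
    """True if both the logical and the physical block size are 4096 bytes."""
    return sizes["logical"] == 4096 and sizes["physical"] == 4096
```

On the verified system, `read_block_sizes("sda")` would be expected to report a physical size of 4096, matching the lsblk output; the logical size is not shown in this thread.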
------- Comment From tstaudt.com 2021-05-21 08:02 EDT------- Closing bug based on previous comment. Thanks.
Marking VERIFIED based on comment #19
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438