Bug 1918723 - installer fails to write boot record on 4k scsi lun on s390x
Summary: installer fails to write boot record on 4k scsi lun on s390x
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.7
Hardware: s390x
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Nikita Dubrovskii (IBM)
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1933235
Blocks: ocp-47-z-tracker
Reported: 2021-01-21 13:26 UTC by Alexander Klein
Modified: 2021-07-27 22:36 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The zipl command configured the disk geometry assuming a sector size of 512 bytes. Consequence: On SCSI disks with 4K sectors, the zipl bootloader configuration contained incorrect offsets and therefore zVM wasn't able to boot. Fix: zipl takes the disk sector size into account. Result: zVM can be booted successfully.
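To illustrate the Cause above: a block number only becomes a byte offset once it is multiplied by the sector size, so assuming 512 bytes on a device with 4096-byte sectors makes every computed offset wrong by a factor of 8. The sketch below is a hypothetical illustration of that arithmetic only, not zipl's actual code. The function name is made up, and the sample block number 2048 is borrowed from the "Geometry - start" line in comment 5 purely as an example value.

```python
# Hypothetical illustration of the sector-size mismatch; not taken from zipl.

def byte_offset(block_number: int, sector_size: int) -> int:
    """Convert a block number into a byte offset on the device."""
    return block_number * sector_size


ASSUMED_SECTOR_SIZE = 512   # the size the Doc Text says was assumed
ACTUAL_SECTOR_SIZE = 4096   # physical block size reported in comment 5

sample_block = 2048         # example value only

wrong = byte_offset(sample_block, ASSUMED_SECTOR_SIZE)
right = byte_offset(sample_block, ACTUAL_SECTOR_SIZE)

print(f"offset assuming 512-byte sectors: {wrong}")
print(f"offset using 4096-byte sectors:   {right}")
print(f"the two offsets differ by a factor of {right // wrong}")
```

Any structure located through an offset computed with the wrong sector size ends up somewhere other than where the reader looks for it, which is consistent with the "wrong or missing magic number" message from the machine loader.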
Clone Of:
: 1933235 (view as bug list)
Environment:
Last Closed: 2021-07-27 22:36:15 UTC
Target Upstream Version:
Embargoed:


Attachments
openshift_install.log (75.91 KB, text/plain)
2021-01-21 13:26 UTC, Alexander Klein


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | ibm-s390-tools s390-tools pull 107 | 0 | None | closed | zipl: fix reading 4k disk's geometry | 2021-02-19 13:57:20 UTC
IBM Linux Technology Center | 191077 | 0 | None | None | None | 2021-01-28 13:52:01 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:36:33 UTC

Description Alexander Klein 2021-01-21 13:26:34 UTC
Created attachment 1749392
openshift_install.log

Version:

$ openshift-install version
install_version: "4.7.0-0.nightly-s390x-2021-01-19-095800"
rhcos_version: "4.7.0-fc.2-s390x"

Platform:

s390x

Please specify:
UPI

What happened?

When installing OCP on a 4k SCSI LUN, the installer detects the sector size correctly and writes the correct image to the disk. It then tries to restart the node, but the reboot fails: the s390x SCSI bootloader cannot find the boot record on that disk.

The first guess is that zipl fails to write it, although there are no obvious errors.

---- after image write to disk ----

[   91.859350] GPT:Primary header thinks Alt. header is not at the end of the disk.
[   91.859358] GPT:853247 != 31457279
[   91.859359] GPT:Alternate GPT header not at the end of the disk.
[   91.859360] GPT:853247 != 31457279
[   91.859362] GPT: Use GNU Parted to correct GPT errors.
[   91.859370]  sda: sda3 sda4
[   92.128978] EXT4-fs (sda3): mounted filesystem with ordered data mode. Opts: (null)
[   92.131577] coreos-installer-service[1157]: Writing Ignition config
[   92.131715] coreos-installer-service[1157]: Writing first-boot kernel arguments
[   92.131782] coreos-installer-service[1157]: Modifying kernel arguments
[   92.133024] coreos-installer-service[1157]: Installing bootloader
[   92.247270] coreos-installer-service[1157]: Updating re-IPL device
[   92.301487] coreos-installer-service[1157]: Install complete.
[  OK  ] Started CoreOS Installer.
[  OK  ] Reached target CoreOS Installer Target.
[  OK  ] Started Reboot after CoreOS Installer.
[  OK  ] Reached target Finalize CoreOS Installer Target.
[   92.306811] GPT:Primary header thinks Alt. header is not at the end of the disk.
[   92.306816] GPT:853247 != 31457279
[   92.306818] GPT:Alternate GPT header not at the end of the disk.
[   92.306820] GPT:853247 != 31457279
[   92.306823] GPT: Use GNU Parted to correct GPT errors.
[   92.306831]  sda: sda3 sda4
         Stopping Login Service...
[  OK  ] Stopped target Network is Online.
[  OK  ] Stopped target Timers.
[  OK  ] Stopped Daily Cleanup of Temporary Directories.
[  OK  ] Stopped target Finalize CoreOS Installer Target.
         Unmounting /sysroot...
         Stopping Reboot after CoreOS Installer...
[  OK  ] Stopped Network Manager Wait Online.
[  OK  ] Stopped target Network.
         Stopping Restore /run/initramfs on shutdown...
         Stopping Network Manager...
[  OK  ] Closed LVM2 poll daemon socket.
[  OK  ] Stopped daily update of the root trust anchor for DNSSEC.
[  OK  ] Stopped Run update-ca-trust.
[  OK  ] Stopped Reboot after CoreOS Installer.
[  OK  ] Stopped Login Service.
[  OK  ] Stopped Restore /run/initramfs on shutdown.
         Unmounting /boot...
[  OK  ] Stopped target CoreOS Installer Target.
[  OK  ] Stopped target Prepare for CoreOS Installer Target.
[  OK  ] Unmounted /sysroot.
[  OK  ] Unmounted /boot.
[  OK  ] Stopped Network Manager.
         Stopping D-Bus System Message Bus...
[  OK  ] Stopped D-Bus System Message Bus.
[  OK  ] Stopped target Basic System.
[  OK  ] Stopped target Slices.
[  OK  ] Removed slice User and Session Slice.
[  OK  ] Stopped target Paths.
[  OK  ] Stopped target Sockets.
[  OK  ] Closed D-Bus System Message Bus Socket.
[  OK  ] Stopped target System Initialization.
[  OK  ] Stopped target Local Encrypted Volumes.
[  OK  ] Stopped Dispatch Password Requests to Console Directory Watch.
         Stopping Load/Save Random Seed...
         Stopping Update UTMP about System Boot/Shutdown...
[  OK  ] Stopped Apply Kernel Variables.
[  OK  ] Stopped Load Kernel Modules.
[  OK  ] Stopped Update is Completed.
[  OK  ] Stopped Rebuild Hardware Database.
[  OK  ] Stopped Rebuild Dynamic Linker Cache.
[  OK  ] Stopped Load/Save Random Seed.
[  OK  ] Stopped Update UTMP about System Boot/Shutdown.
[  OK  ] Stopped Create Volatile Files and Directories.
[  OK  ] Stopped target Local File Systems.
         Unmounting /run/ephemeral...
         Unmounting Temporary Directory (/tmp)...
         Unmounting /etc...
         Unmounting /var...
[FAILED] Failed unmounting /etc.
[FAILED] Failed unmounting /var.
[  OK  ] Unmounted /run/ephemeral.
[  OK  ] Stopped target Local File Systems (Pre).
[  OK  ] Stopped Create Static Device Nodes in /dev.
[  OK  ] Stopped Create System Users.
         Stopping Monitoring of LVM2 mirrors…ng dmeventd or progress polling...
[  OK  ] Unmounted Temporary Directory (/tmp).
[  OK  ] Reached target Unmount All Filesystems.
[  OK  ] Stopped target Swap.
[  OK  ] Stopped Monitoring of LVM2 mirrors,…sing dmeventd or progress polling.
[  OK  ] Reached target Shutdown.
[  OK  ] Reached target Final Step.
         Starting Reboot...
[   93.131336] printk: systemd-shutdow: 1 output lines suppressed due to ratelimiting
[   93.137693] systemd-shutdown[1]: Syncing filesystems and block devices.
[   93.140114] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[   93.142954] systemd-journald[1020]: Received SIGTERM from PID 1 (systemd-shutdow).
[   93.209056] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[   93.211032] systemd-shutdown[1]: Unmounting file systems.
[   93.211723] [1307]: Remounting '/var' read-only in with options 'seclabel,attr2,discard,inode64,logbufs=8,logbsize=32k,noquota'.
[   93.219215] [1308]: Unmounting '/var'.
[   93.299096] [1309]: Remounting '/etc' read-only in with options 'seclabel,attr2,discard,inode64,logbufs=8,logbsize=32k,noquota'.
[   93.299626] [1310]: Unmounting '/etc'.
[   93.349533] XFS (loop0): Unmounting Filesystem
[   93.401728] systemd-shutdown[1]: All filesystems unmounted.
[   93.401732] systemd-shutdown[1]: Deactivating swaps.
[   93.401747] systemd-shutdown[1]: All swaps deactivated.
[   93.401749] systemd-shutdown[1]: Detaching loop devices.
[   93.402192] systemd-shutdown[1]: Not all loop devices detached, 1 left.
[   93.402194] systemd-shutdown[1]: Detaching DM devices.
01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 00.
02: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 00.
03: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 00.
00: Storage cleared - system reset.
00: HCPLDI2816I Acquiring the machine loader from the processor controller.
00: HCPLDI2817I Load completed from the processor controller.
00: HCPLDI2817I Now starting the machine loader.
01: HCPGSP2630I The virtual machine is placed in CP mode due to a SIGP stop and store status from CPU 00.
02: HCPGSP2630I The virtual machine is placed in CP mode due to a SIGP stop and store status from CPU 00.
03: HCPGSP2630I The virtual machine is placed in CP mode due to a SIGP stop and store status from CPU 00.

---- reboot ----

00: MLOEVL012I: Machine loader up and running (version v2.4.7).
00: MLOLOA013E: Invalid Program Table, wrong or missing magic number.
00: MLOLOA041E: Unable to find the component table pointer.
00: MLOEVL010E: IPL failed.
00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 0002887A

Comment 1 Scott Dodson 2021-01-21 13:29:25 UTC
This is the CoreOS installer, not the OpenShift installer, that's in question here.

Comment 2 Micah Abbott 2021-01-21 15:56:13 UTC
A number of changes around multiarch support landed in recent devel builds.

If possible, please retry with the latest builds from https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.7-s390x

Comment 3 Alexander Klein 2021-01-22 08:41:54 UTC
Retested with
install_version: "4.7.0-0.nightly-s390x-2021-01-22-032543"
rhcos_version: "4.7.0-fc.3-s390x"

Exactly the same behaviour.

Comment 4 Prashanth Sundararaman 2021-01-22 14:34:04 UTC
@Alexander Klein - one thing you can try is:

Boot without any coreos.inst.* arguments, and with "ignition.firstboot ignition.platform.id=metal ignition.config.url=http://..."

Log in as core, then run coreos-installer manually to write the image to disk. That way you will see what the installer is doing:

sudo coreos-installer install <disk-device> --ignition-url <url> --insecure-ignition

Comment 5 Nikita Dubrovskii (IBM) 2021-01-25 14:02:02 UTC
Reproduced this:

[   57.327334] coreos-installer-service[1127]: Installing bootloader 
[   57.433335] coreos-installer-service[1127]: Target device information 
[   57.433418] coreos-installer-service[1127]:   Device..........................: 08:00 
[   57.433451] coreos-installer-service[1127]:   Partition.......................: 08:03 
[   57.433482] coreos-installer-service[1127]:   Device name.....................: sda 
[   57.433515] coreos-installer-service[1127]:   Device driver name..............: sd 
[   57.433547] coreos-installer-service[1127]:   Type............................: disk partition 
[   57.433580] coreos-installer-service[1127]:   Disk layout.....................: SCSI disk layout 
[   57.433613] coreos-installer-service[1127]:   Geometry - start................: 2048 
[   57.433644] coreos-installer-service[1127]:   File system block size..........: 4096 
[   57.433676] coreos-installer-service[1127]:   Physical block size.............: 4096 
[   57.433711] coreos-installer-service[1127]:   Device size in physical blocks..: 98304 
[   57.433745] coreos-installer-service[1127]: Building bootmap in '/tmp/coreos-installer-xFJooy' 
[   57.433779] coreos-installer-service[1127]: Adding IPL section 
[   57.433779] coreos-installer-service[1127]:   initial ramdisk...: /tmp/coreos-installer-xFJooy/ostree/rhcos-475697892c8c54e5ee323999298fea919e71e0c2e8824af7c6fab9e16b6bae14/initramfs-4.18.0-240.10.1.el8_3.s390x.img 
[   57.433843] coreos-installer-service[1127]:   kernel image......: /tmp/coreos-installer-xFJooy/ostree/rhcos-475697892c8c54e5ee323999298fea919e71e0c2e8824af7c6fab9e16b6bae14/vmlinuz-4.18.0-240.10.1.el8_3.s390x 
[   57.433875] coreos-installer-service[1127]:   kernel parmline...: 'random.trust_cpu=on ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/475697892c8c54e5ee323999298fea919e71e0c2e8824af7c6fab9e16b6bae14/0 rd.znet=qeth,0.0.bdf0,0.0.bdf1,0.0.bdf2,layer2=1,potno=0 zfcp.allow_lun_scan=0 cio_ignore=all,!condev rd.zfcp=0.0.1985,0x500507605e819cc2,0x0001000000000000 ignition.firstboot rd.neednet=1 ip=172.18.142.4::172.18.0.1:255.254.0.0:coreos:encbdf0:off nameserver=172.18.0.1 ' 
[   57.433912] coreos-installer-service[1127]:   component address: 
[   57.433944] coreos-installer-service[1127]:     heap area.......: 0x00002000-0x00005fff 
[   57.433975] coreos-installer-service[1127]:     stack area......: 0x0000f000-0x0000ffff 
[   57.434012] coreos-installer-service[1127]:     internal loader.: 0x0000a000-0x0000dfff 
[   57.434043] coreos-installer-service[1127]:     parameters......: 0x00009000-0x00009fff 
[   57.434073] coreos-installer-service[1127]:     kernel image....: 0x00010000-0x00687fff 
[   57.434104] coreos-installer-service[1127]:     parmline........: 0x00689000-0x00689fff 
[   57.434136] coreos-installer-service[1127]:     initial ramdisk.: 0x00690000-0x04549fff 
[   57.434166] coreos-installer-service[1127]: Preparing boot device: sda. 
[   57.434196] coreos-installer-service[1127]: Detected SCSI PCBIOS disk layout. 
[   57.434229] coreos-installer-service[1127]: Writing SCSI master boot record. 
[   57.434261] coreos-installer-service[1127]: Syncing disks... 
[   57.434294] coreos-installer-service[1127]: Done. 
[   57.434398] coreos-installer-service[1127]: Updating re-IPL device 
[   57.435428] coreos-installer-service[1127]: Re-IPL type: fcp 
[   57.435460] coreos-installer-service[1127]: WWPN:        0x500507605e819cc2 
[   57.435494] coreos-installer-service[1127]: LUN:         0x0001000000000000 
[   57.435525] coreos-installer-service[1127]: Device:      0.0.1985 
[   57.435558] coreos-installer-service[1127]: bootprog:    0 
[   57.435589] coreos-installer-service[1127]: br_lba:      0 
[   57.435620] coreos-installer-service[1127]: Loadparm:    "" 
[   57.435652] coreos-installer-service[1127]: Bootparms:   "" 
[   57.480009] coreos-installer-service[1127]: Install complete. 
00: Storage cleared - system reset.
00: HCPLDI2816I Acquiring the machine loader from the processor controller.
00: HCPLDI2817I Load completed from the processor controller. 
00: HCPLDI2817I Now starting the machine loader.
01: HCPGSP2630I The virtual machine is placed in CP mode due to a SIGP stop and store status from CPU 00.
00: MLOEVL012I: Machine loader up and running (version v2.4.7).
00: MLOLOA013E: Invalid Program Table, wrong or missing magic number.
00: MLOLOA041E: Unable to find the component table pointer.
00: MLOEVL010E: IPL failed.
00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 0002887A


The same image works on DASD (4k); continuing to debug.

Comment 6 Nikita Dubrovskii (IBM) 2021-01-26 15:33:16 UTC
Short update:
Wrote a small patcher tool (~200 lines of C code) and was able to boot the system. Now looking for the root cause of the data corruption.

Comment 7 Nikita Dubrovskii (IBM) 2021-01-27 15:46:48 UTC
Fix is ready - https://github.com/ibm-s390-tools/s390-tools/pull/107

Comment 8 Holger Wolf 2021-01-27 18:37:20 UTC
@hannsj_uhl.com please mirror this to IBM - this is an s390 patch that needs to be integrated into RHEL 8

Comment 9 Micah Abbott 2021-01-29 14:25:23 UTC
Discussed this with the multi-arch team and decided that this was not a blocker and would be targeted for 4.8

In summary, this problem has been present for some time, and even RHEL is affected by it. Customers won't be able to do similar operations in RHEL or RHCOS until this fix is packaged and shipped as part of RHEL. The fix is targeting 8.4, with a possibility of backporting it to 8.3.

Comment 10 Micah Abbott 2021-02-07 20:31:56 UTC
Waiting on the upstream patch to be merged and a new RPM to be released.

Comment 11 IBM Bug Proxy 2021-02-15 08:00:40 UTC
------- Comment From stefan.haberland.com 2021-02-12 04:16 EDT-------
Fix is reviewed and tested and will be pulled.

Comment 12 Hendrik Brueckner 2021-02-18 15:32:44 UTC
@Dan: This is about the zipl issue that needs to go into RHEL as well (https://github.com/ibm-s390-linux/s390-tools/pull/107). I am not sure whether you would like to track this for RHEL separately. However, I think that we also need a z-stream update of s390-tools for rhel-8.4.z so that RHCOS can pick it up. Let me know how best to approach this from your side. Thanks.

Comment 13 Dan Horák 2021-02-18 15:43:07 UTC
I believe we will need a RHEL clone for this item. There is still a chance to squeeze it to 8.4 GA as an exception. If not approved, then 8.5 + 8.4.0.z.

Comment 14 IBM Bug Proxy 2021-02-19 13:00:28 UTC
------- Comment From Jan.Hoeppner.com 2021-02-19 07:54 EDT-------
The fix is now available upstream:
https://github.com/ibm-s390-linux/s390-tools/commit/4a3957fab5696cc410c5b495956859a424e3552a ("zipl: fix reading 4k disk's geometry")

Comment 15 Thomas Staudt 2021-02-26 08:23:28 UTC
(In reply to Dan Horák from comment #13)
> I believe we will need a RHEL clone for this item. There is still a chance
> to squeeze it to 8.4 GA as an exception. If not approved, then 8.5 + 8.4.0.z.

... which has now been opened as
RHEL Bug 1933235 - RHEL8.4 - installer fails to write boot record on 4k scsi lun on s390x

Comment 16 IBM Bug Proxy 2021-02-26 08:50:36 UTC
------- Comment From tstaudt.com 2021-02-26 03:46 EDT-------
(In reply to comment #22)
> (In reply to Dan Horák from comment #13)
> > I believe we will need a RHEL clone for this item. There is still a chance
> > to squeeze it to 8.4 GA as an exception. If not approved, then 8.5 + 8.4.0.z.
>
> ... which has now been opened as
> RHEL Bug 1933235 - RHEL8.4 - installer fails to write boot record on 4k scsi
> lun on s390x

which is
LTC Bug 191653 - RH1933235- RHEL8.4 - installer fails to write boot record on 4k scsi lun on s390x (zipl) (s390utils/s390-tools)

Comment 17 Micah Abbott 2021-05-20 16:50:21 UTC
RHCOS 4.8 moved to using RHEL 8.4 GA content with build 48.84.202105182219-0 which included `s390utils-2.15.1-5.el8`

This build and newer are available in the OCP 4.8 nightly payloads.

Moving to MODIFIED.

Comment 19 Stefan Orth 2021-05-21 10:31:20 UTC
Verified on z/VM z13 Cluster:

48.84.202105191019-0
4.8.0-0.nightly-s390x-2021-05-19-141937


[core@master-0 ~]$ lsblk -o NAME,PHY-SEC
NAME   PHY-SEC
sda       4096
├─sda3    4096
└─sda4    4096

successfully installed.
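
The same check can be scripted. As a sketch, assuming the standard Linux sysfs layout (/sys/block/<dev>/queue/physical_block_size and logical_block_size), the following reads the sizes for a whole disk such as sda; lsblk reports the physical one as PHY-SEC. The function name and the sysfs_root parameter are illustrative and not part of any existing tool.

```python
from pathlib import Path


def sector_sizes(device: str, sysfs_root: str = "/sys/block") -> dict:
    """Return the physical and logical sector sizes (in bytes) of a whole disk.

    `device` is a kernel name such as "sda". Partitions have no queue/
    directory of their own, so pass the parent disk.
    """
    queue = Path(sysfs_root) / device / "queue"
    return {
        "physical": int((queue / "physical_block_size").read_text().strip()),
        "logical": int((queue / "logical_block_size").read_text().strip()),
    }


# Example use on a live system (commented out so the sketch has no side effects):
# print(sector_sizes("sda"))
```

On the verified system above, the physical value for sda would be expected to read 4096, matching the lsblk output.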

Comment 20 IBM Bug Proxy 2021-05-21 12:11:02 UTC
------- Comment From tstaudt.com 2021-05-21 08:02 EDT-------
Closing bug based on previous comment.
Thanks.

Comment 21 Micah Abbott 2021-06-14 16:12:39 UTC
Marking VERIFIED based on comment #19

Comment 24 errata-xmlrpc 2021-07-27 22:36:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

