Bug 1863466 - RHCOS build 46.82.202007071640-0 fails to install on new guest
Summary: RHCOS build 46.82.202007071640-0 fails to install on new guest
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.6
Hardware: s390x
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Benjamin Gilbert
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-03 16:26 UTC by Philip Chan
Modified: 2020-10-27 16:22 UTC
CC: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:22:34 UTC
Target Upstream Version:


Attachments (Terms of Use)
guest console output from bootstrap-0 (27.96 KB, text/plain)
2020-08-03 16:27 UTC, Philip Chan
guest console output from master-2 (29.21 KB, text/plain)
2020-08-13 19:09 UTC, Philip Chan
rdsosreport from the bootstrap (52.19 KB, text/plain)
2020-08-19 15:27 UTC, wvoesch
Successful 4.5 to 4.6 upgrade info (14.77 KB, application/zip)
2020-08-19 17:04 UTC, Christian LaPolt
guest console output from bootstrap-0 - 2020-08-26 (35.73 KB, application/rtf)
2020-08-27 03:08 UTC, Philip Chan
Failed dasd installation: Dependency failed for Reboot after CoreOS Installer. (208.55 KB, image/jpeg)
2020-08-28 15:26 UTC, wvoesch
guest console output from bootstrap-0 - 2020-08-28 (28.79 KB, application/rtf)
2020-08-28 15:36 UTC, Philip Chan
screenshots and parmfile of the installation version 4.6.0-0.nightly-s390x-2020-08-28-040333 without the option "coreos.inst=yes" (4.25 MB, application/gzip)
2020-09-02 09:07 UTC, wvoesch
screenshots and parmfile of the installation version 4.6.0-0.nightly-s390x-2020-08-28-040333 with the option "coreos.inst=yes" (4.88 MB, application/gzip)
2020-09-02 09:08 UTC, wvoesch


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:22:57 UTC

Description Philip Chan 2020-08-03 16:26:38 UTC
Description of problem:
When installing the RHCOS build 46.82.202007071640-0 on zKVM guest(s), a number of dependency failures occur. This causes the installation to go into Emergency Mode.

Version-Release number of selected component (if applicable):
RHCOS Build 46.82.202007071640-0

How reproducible: Consistently


Steps to Reproduce:
1. For direct kernel boot, I defined the following kernel and initrd stored on the host OS to install the new guest RHCOS:
  <os>
    <type arch='s390x' machine='s390-ccw-virtio-rhel7.6.0'>hvm</type>
    <kernel>/bootkvm/rhcos-46.82.202007071640-0-installer-kernel-s390x</kernel>
    <initrd>/bootkvm/rhcos-46.82.202007071640-0-installer-initramfs.s390x.img</initrd>
    <cmdline>rd.neednet=1 console=ttysclp0 coreos.inst=yes coreos.inst.install_dev=vda coreos.inst.image_url=http://9.12.23.79:8080/4.6-nightly//rhcos-46.82.202007071640-0-metal.s390x.raw.gz coreos.inst.ignition_url=http://192.168.79.1:8080/ignition/master.ign ip=dhcp nameserver=192.168.79.1</cmdline>
  </os>
2. IPL the zKVM guest
3. A number of errors occur during the installation.  I have attached the console output from bootstrap-0 guest.  The same errors occur on the master and worker guests.

Actual results: RHCOS installation fails and enters emergency mode.


Expected results: RHCOS installation succeeds and OS boots up.


Additional info: The last RHCOS build to successfully install on zKVM was build rhcos-45.82.202006190257-0-s390x.

Comment 1 Philip Chan 2020-08-03 16:27:15 UTC
Created attachment 1704016 [details]
guest console output from bootstrap-0

Comment 2 krmoser 2020-08-03 16:38:33 UTC
We have also encountered similar RHCOS 46.82.202007071640-0 install issues on the zVM platform.

Comment 3 Micah Abbott 2020-08-03 17:11:12 UTC
From the linked console log:

```
[    6.152603] systemd[1]: Started dracut pre-mount hook.
         Mounting /sysroot...
[    6.153038] systemd[1]: Mounting /sysroot...
[    6.271141] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[    6.271632] SQUASHFS error: zlib decompression failed, data probably corrupt
[    6.271634] SQUASHFS error: squashfs_read_data failed to read block 0x2ae3a284
[    6.271635] SQUASHFS error: Unable to read metadata cache entry [2ae3a284]
[    6.271635] SQUASHFS error: Unable to read inode 0x7f2700e9b
[    6.271857] mount[751]: mount: /sysroot: can't read superblock on /dev/loop1.
[    6.318987] systemd[1]: sysroot.mount: Mount process exited, code=exited status=32
[    6.319010] systemd[1]: sysroot.mount: Failed with result 'exit-code'.
[    6.319160] systemd[1]: Failed to mount /sysroot.
[FAILED] Failed to mount /sysroot.
See 'systemctl status sysroot.mount' for details.
[    6.319269] systemd[1]: Dependency failed for sysroot-xfs-ephemeral-setup.service.
[DEPEND] Dependency failed for sysroot-xfs-ephemeral-setup.service.
[    6.319318] systemd[1]: Dependency failed for /sysroot/var.
[DEPEND] Dependency failed for /sysroot/var.
[DEPEND] Dependency failed for Initrd Root File System.
[    6.319351] systemd[1]: Dependency failed for Initrd Root File System.
[    6.319366] systemd[1]: Dependency failed for Reload Configuration from the Real Root.
[DEPEND] Dependency failed for Reload Configuration from the Real Root.
```

Tagging in Andy and Prashanth for visibility.

Comment 4 Colin Walters 2020-08-03 19:54:41 UTC
How much RAM are you providing?
https://github.com/coreos/fedora-coreos-tracker/issues/407

Comment 5 krmoser 2020-08-03 19:57:28 UTC
For the zVM cluster nodes, all of the nodes (bootstrap, masters, workers) have 32GB Real Memory each.

Thank you.

Comment 6 Philip Chan 2020-08-04 02:10:03 UTC
For the zKVM cluster nodes, I used 32GB memory for each guest.  I also tried 64GB, but the same errors occur.

Comment 7 Benjamin Gilbert 2020-08-05 16:18:29 UTC
> <kernel>/bootkvm/rhcos-46.82.202007071640-0-installer-kernel-s390x</kernel>
> <initrd>/bootkvm/rhcos-46.82.202007071640-0-installer-initramfs.s390x.img</initrd>

The above implies that you're using the old installer PXE image, but there's no squashfs in that image.  Is your initramfs actually a concatenation of the live initramfs and live rootfs images?  If so, could you try passing only the initramfs image via the bootloader, and adding a kernel argument "coreos.live.rootfs_url=" with an HTTP/HTTPS URL to the rootfs image?  I'm wondering whether you're encountering a platform-specific limitation on initramfs size.
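For reference, the two options above can be sketched like this. This is only a sketch: the file names are placeholders standing in for the actual live images of a given build, and the stand-in content lines exist only so the sketch runs as-is.

```shell
# Placeholder file names; in practice these are the downloaded live images
# (e.g. rhcos-<build>-live-initramfs.s390x.img and -live-rootfs.s390x.img).
initramfs=rhcos-live-initramfs.s390x.img
rootfs=rhcos-live-rootfs.s390x.img

# Stand-in content so the sketch is runnable; not part of a real install.
[ -f "$initramfs" ] || printf 'initramfs-data' > "$initramfs"
[ -f "$rootfs" ]    || printf 'rootfs-data'    > "$rootfs"

# Option A: concatenate the live initramfs and live rootfs into a single
# file and point the bootloader's <initrd> at the result.
cat "$initramfs" "$rootfs" > rhcos-combined-initrd.img

# Option B: pass only "$initramfs" as <initrd>, and instead add
#   coreos.live.rootfs_url=http://<server>/rhcos-live-rootfs.s390x.img
# to the kernel command line so the rootfs is fetched over HTTP/HTTPS.
```

Option B keeps the initramfs small, which is what the comment above suggests trying if a platform-specific initramfs size limit is the culprit.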

Comment 8 krmoser 2020-08-06 14:37:40 UTC
Benjamin,

Here's some additional information to help with this issue. I am continuing to debug using a zVM ECKD (DASD) based cluster, while my colleague Phil Chan works to test on a zKVM FCP based cluster.

1. Using a zVM ECKD (DASD) based cluster, we are able to successfully install the OCP 4.6 nightly build 4.6.0-0.nightly-s390x-2020-07-30-034519 and previous 4.6 nightly builds, using rhcos-4.5.4-s390x for the bootstrap, and then successfully install Red Hat Enterprise Linux CoreOS 46.82.202007071640-0 (Ootpa) 4.6 for the master and worker nodes.

2. The break point appears to be the OCP 4.6 nightly build 4.6.0-0.nightly-s390x-2020-08-01-045347 (working to further pinpoint this). We are pulling these OCP 4.6 nightly builds from https://mirror.openshift.com/pub/openshift-v4/s390x/clients/ocp-dev-preview.

3. There appear to be 2 issues here:
 1. The bootstrap install of 46.82.202007071640-0 Red Hat Enterprise Linux CoreOS on the bastion node, which seems to consistently fail.
 2. The install of any OCP 4.6 nightly build after 4.6.0-0.nightly-s390x-2020-07-30-034519, including using rhcos-4.5.4-s390x for the bootstrap.  The bootstrap installs using rhcos-4.5.4-s390x for the 4.6.0-0.nightly-s390x-2020-07-30-034519 nightly build, but not for any OCP 4.5 build after.


We are working to provide additional zVM console information, and will update this bugzilla with this information.

Thank you,
Kyle

Comment 9 krmoser 2020-08-06 14:40:09 UTC
Benjamin,

My apologies for the typo, for 3.2 in the above post, I meant to say:

2. The install of any OCP 4.6 nightly build after 4.6.0-0.nightly-s390x-2020-07-30-034519, including using rhcos-4.5.4-s390x for the bootstrap.  The bootstrap installs using rhcos-4.5.4-s390x for the 4.6.0-0.nightly-s390x-2020-07-30-034519 nightly build, but not for any OCP 4.6 nightly build after.

Thank you,
Kyle

Comment 10 krmoser 2020-08-06 15:05:48 UTC
Benjamin,

Would it be possible to post the rhcos 46.82 files (Red Hat Enterprise Linux CoreOS 46.82.202007071640-0 and later when available) to the following website, following the same or similar naming convention that has previously been used for OCP 4.3, 4.4, and 4.5? 

https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/pre-release/

From this website, the following rhcos directories exist, but not a directory for "latest-46.82" or equivalent:
[DIR]	4.3.0-0.nightly-s390x-2020-04-20-185529/	20-Apr-2020 23:47 	- 	 
[DIR]	4.4.0-0.nightly-s390x-2020-05-08-033629/	06-Jul-2020 18:30 	- 	 
[DIR]	4.4.0-0.nightly-s390x-2020-06-01-021037/	06-Jul-2020 18:30 	- 	 
[DIR]	45.81.202005020555-0/	05-Jun-2020 12:21 	- 	 
[DIR]	45.82.202006190257-0/	19-Jun-2020 14:59 	- 	 
[DIR]	latest-4.3/	20-Apr-2020 23:47 	- 	 
[DIR]	latest-4.4/	06-Jul-2020 18:30 	- 	 
[DIR]	latest-4.5/	05-Jun-2020 12:21 	- 	 
[DIR]	latest-45.82/	19-Jun-2020 14:59 	- 	 
[DIR]	latest/	19-Jun-2020 14:59 	- 	 

Thank you,
Kyle

Comment 11 Prashanth Sundararaman 2020-08-06 15:44:36 UTC
Kyle,

Where did you get the kernel and initramfs images from? Did you by any chance rename the live images by removing the "live" keyword from the image names? We only have live images now:

https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.6-s390x&release=46.82.202007071640-0#46.82.202007071640-0

Also, I am not sure I understand the comments above from you. You mention using rhcos-4.5 as a boot image for 4.6? I am assuming this is a UPI install. In that case, what are the kernel and initrd images you are using? Are you saying you can successfully use a 4.5 kernel and initrd for the bootstrap, but not a 4.6 kernel and initrd?

Thanks,
Prashanth

Comment 12 krmoser 2020-08-06 16:56:43 UTC
Prashanth,

Thanks for the questions and I hope all is well.  

1. Yes, we downloaded the rhcos live images from the internal CI site. Please see my previous post requesting that the rhcos 4.6 images also be posted on the public mirror site at some point.

Specifically from: 
    https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/pre-release/


2. Correct, we are performing OCP 4.6 UPI installs.

3. Also correct: we have successfully used rhcos-4.5 on the bootstrap node for OCP 4.6 installs of the 4.6.0-0.nightly-s390x-2020-07-30-034519 nightly build and earlier July 2020 nightly builds. Here are the OCP 4.5.4 rhcos images we have successfully used:

  rhcos-4.5.4-s390x-dasd.s390x.raw.gz
  rhcos-4.5.4-s390x-installer-initramfs.s390x.img
  rhcos-4.5.4-s390x-installer-kernel-s390x
  rhcos-4.5.4-s390x-installer.s390x.iso
  rhcos-4.5.4-s390x-metal.s390x.raw.gz
  rhcos-4.5.4-s390x-openstack.s390x.qcow2.gz
  rhcos-4.5.4-s390x-ostree.s390x.tar
  rhcos-4.5.4-s390x-qemu.s390x.qcow2.gz

4. Also correct: on both the zVM and zKVM platforms, we have not been able to successfully install a bootstrap node using the rhcos-46 live images.


Thank you,
Kyle

Comment 13 Benjamin Gilbert 2020-08-06 17:13:54 UTC
Kyle,

Thanks for the additional info.  Have you had a chance to try the suggestion in comment 7?

Best,
Benjamin

Comment 14 Philip Chan 2020-08-06 19:12:06 UTC
Hi Benjamin,

With regards to Comment 7 -

1) I did rename the kernel and initramfs images to remove "live" from their filenames. We have some bash scripts that depend on the file names; adding "live" to the names would break them, so I just renamed the files.
2) I'm in the process of following your steps using "coreos.live.rootfs_url=" but have not been successful in IPL'ing the guest with the new configuration. Here's a sample of what I've tried:

  <os>
    <type arch='s390x' machine='s390-ccw-virtio-rhel7.6.0'>hvm</type>
    <initrd>/bootkvm/rhcos-46.82.202007071640-0-installer-initramfs.s390x.img</initrd>
    <cmdline>rd.neednet=1 console=ttysclp0 coreos.inst=yes coreos.inst.install_dev=vda coreos.live.rootfs_url=http://9.12.23.79:8080/46.82.202007071640-nightly/rhcos-46.82.202007071640-0-installer-kernel-s390x coreos.inst.image_url=http://9.12.23.79:8080/46.82.202007071640-nightly/rhcos-46.82.202007071640-0-metal.s390x.raw.gz coreos.inst.ignition_url=http://192.168.79.1:8080/ignition/bootstrap.ign ip=dhcp nameserver=192.168.79.1</cmdline>
  </os>

I'm not entirely sure if that is the correct syntax. When I IPL, I get this error:

error: Failed to create domain from guest-bootstrap-0.xml
error: internal error: qemu unexpectedly closed the monitor: 2020-08-06T19:08:29.003781Z qemu-kvm: -append only allowed with -kernel option

I also wanted to add to what Kyle tested: I was able to successfully install on a zKVM (FCP) based cluster with OCP 4.6 nightly build 4.6.0-0.nightly-s390x-2020-07-30-034519 using the RHCOS 4.5.4 build. The zVM and zKVM cluster results have been in sync.

Regards,
-Phil

Comment 15 Benjamin Gilbert 2020-08-06 19:28:56 UTC
Hi Phil,

1.  Renaming the files is fine, I just wanted to check which ones you were using.
2.  You should leave the <kernel> element as you originally had it, and add the live-rootfs image as the argument to coreos.live.rootfs_url.  So:

  <os>
    <type arch='s390x' machine='s390-ccw-virtio-rhel7.6.0'>hvm</type>
    <kernel>/bootkvm/rhcos-46.82.202007071640-0-installer-kernel-s390x</kernel>
    <initrd>/bootkvm/rhcos-46.82.202007071640-0-installer-initramfs.s390x.img</initrd>
    <cmdline>rd.neednet=1 console=ttysclp0 coreos.live.rootfs_url=http://9.12.23.79:8080/46.82.202007071640-nightly/rhcos-46.82.202007071640-0-live-rootfs.s390x.img coreos.inst=yes coreos.inst.install_dev=vda coreos.inst.image_url=http://9.12.23.79:8080/4.6-nightly//rhcos-46.82.202007071640-0-metal.s390x.raw.gz coreos.inst.ignition_url=http://192.168.79.1:8080/ignition/master.ign ip=dhcp nameserver=192.168.79.1</cmdline>
  </os>

3.  With the new live installer, you can drop coreos.inst=yes.  If you want to install the RHCOS version matching the version of the live image, you can also drop coreos.inst.image_url.
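Applying point 3 to the stanza above yields a minimal sketch like this (same illustrative paths and URLs as earlier in this bug; not a tested configuration):

```xml
  <os>
    <type arch='s390x' machine='s390-ccw-virtio-rhel7.6.0'>hvm</type>
    <kernel>/bootkvm/rhcos-46.82.202007071640-0-installer-kernel-s390x</kernel>
    <initrd>/bootkvm/rhcos-46.82.202007071640-0-installer-initramfs.s390x.img</initrd>
    <!-- coreos.inst=yes and coreos.inst.image_url dropped per point 3 -->
    <cmdline>rd.neednet=1 console=ttysclp0 coreos.live.rootfs_url=http://9.12.23.79:8080/46.82.202007071640-nightly/rhcos-46.82.202007071640-0-live-rootfs.s390x.img coreos.inst.install_dev=vda coreos.inst.ignition_url=http://192.168.79.1:8080/ignition/master.ign ip=dhcp nameserver=192.168.79.1</cmdline>
  </os>
```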

Best,
Benjamin

Comment 16 Philip Chan 2020-08-06 21:12:28 UTC
Hi Benjamin, 

We currently do not have a local copy of the live-rootfs image.  I'm logged into the CI RHCOS build site to check.  There does not seem to be a live-rootfs image under the rhcos-4.6-s390x stream.  I do see it available under the rhcos-4.6 (for x86_64) and rhcos-4.6-ppc64le streams.

Regards,
-Phil

Comment 17 Benjamin Gilbert 2020-08-06 21:17:21 UTC
Hi Phil,

Indeed, it seems there hasn't been a new build since we started generating the rootfs image.  I'll check into that.

Since you're using an older build, that also explains how you're getting a squashfs error without a rootfs image.

Best,
Benjamin

Comment 18 Prashanth Sundararaman 2020-08-07 00:47:24 UTC
We are working to fix the s390x RHCOS builds for 4.6. In the meantime, you can use the legacy installer initramfs image, which can be accessed like this for a particular RHCOS release:

https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.6-s390x/46.82.202007071640-0/s390x/rhcos-46.82.202007071640-0-installer-initramfs.s390x.img

Comment 19 krmoser 2020-08-10 05:17:36 UTC
Benjamin and Prashanth,

Thank you for all your help with debugging and resolving this issue.  

Just wanted to check whether you have sufficient information to resolve this OCP 4.6 s390x rhcos build issue, as it seems from your latest post, or whether you need Phil and me to provide any additional information.

Thank you,
Kyle

Comment 20 Benjamin Gilbert 2020-08-11 04:09:57 UTC
Hi Kyle,

I don't need anything else for now.  Once we have a new 4.6 s390x build for testing, I'd be interested in a couple more data points from you:

1. Whether you still see the squashfs errors on the new build if you specify an <initrd> file which is the concatenation of the live-initramfs and live-rootfs images.
2. Whether you still see the errors if you specify the live-initramfs image as the <initrd>, and add coreos.live.rootfs_url=https://url/to/live-rootfs/image to the kernel command line.

Thanks,
Benjamin

Comment 21 krmoser 2020-08-11 12:51:37 UTC
Benjamin,

Thanks for the information.  

Just to clarify our understanding: will the new OCP 4.6 s390x rhcos build hopefully resolve the OCP 4.6 install issues we have been seeing with every OCP 4.6 s390x build after the July 31, 2020 build, or will it be a debug build to gather additional information?

Thank you,
Kyle

Comment 22 Benjamin Gilbert 2020-08-11 13:01:35 UTC
Hi Kyle,

I'm not aware of the issues you mentioned.  Do you have a Bugzilla reference?

With respect to this bug, there won't be any concrete fix in the newer build.  It'll just give us a chance to try the split rootfs live image and see if the issue is still present.

Best,
Benjamin

Comment 23 krmoser 2020-08-11 13:31:24 UTC
Benjamin,

Thanks for the information.

THIS bugzilla is that issue: we are not able to install any OCP 4.6 s390x build after the July 30, 2020 build. This pertains to using either rhcos 4.5.4 or rhcos 4.6 on the bootstrap, master, and worker nodes, and to both the zKVM and zVM platforms.

Please see the original description, and comments 2, 9, 10, and 12 where I describe this issue and the behavior.

Thank you,
Kyle

Comment 24 Benjamin Gilbert 2020-08-11 13:45:50 UTC
Hi Kyle,

Okay, I was confused by the date you mentioned, since the original issue reports an install failure with a July 7 build.

As I understand it, there have been a number of s390x fixes in the last month or so.  I don't know that the new build will result in a successful install, but we'll see how far we get.

Best,
Benjamin

Comment 25 krmoser 2020-08-11 14:52:25 UTC
Benjamin,

Thanks for the update.  

1. Would there be any information on when there may be a new RHCOS 4.6 build (rhcos 46.82) available?  

2. Just a follow-up to comment 10, would it be possible to post the rhcos 46.82 files (Red Hat Enterprise Linux CoreOS 46.82.202007071640-0 and later builds when available) to the following website, following the same or similar naming convention that has previously been used for OCP 4.3, 4.4, and 4.5? 

   https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/pre-release/


Thank you,
Kyle

Comment 26 Prashanth Sundararaman 2020-08-13 15:40:55 UTC
Hi Benjamin/Kyle,

I tried the install with the new rhcos bits generated yesterday here: https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.6-s390x&release=46.82.202008131339-0#46.82.202008131339-0

I notice that the coreos-installer segfaults when installing, with:

coreos-installer-service: /usr/libexec/coreos-installer-service: line 105: 1205 Segmentation fault    (core dumped)

Unfortunately I did the install on a zVM with an x3270 console without ssh access, so I couldn't get any more logs out. This is what I was using:

Kernel="rhcos-46.82.202008121739-0-live-kernel-s390x"
Initramfs="rhcos-46.82.202008121739-0-live-initramfs.s390x.img"

rd.neednet=1 console=ttysclp0 coreos.inst.install_dev=sda coreos.live.rootfs_url=http://10.19.14.24:8080/ignition/rhcos-46.82.202008121739-0-live-rootfs.s390x.img coreos.inst.ignition_url=http://bastion.test.example.com:8080/ignition/bootstrap.ign ip=10.19.14.30::10.19.14.254:255.255.255.0:::none nameserver=10.19.14.24 rd.znet=qeth,0.0.0600,0.0.0601,0.0.0602,layer2=1,portno=0 rd.zfcp=0.0.0120,0xC05076CE76800D1C,0x0000000000000000 rd.zfcp=0.0.0130,0xC05076CE76800D70,0x0000000000000000

Kyle - could you also give it a try to confirm, and gather the journal logs if you can?

Thanks
Prashanth

Comment 27 Philip Chan 2020-08-13 19:09:11 UTC
Hi Prashanth,

I tried RHCOS build images on zKVM from here: https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?release=46.82.202008121739-0&stream=releases%2Frhcos-4.6-s390x#46.82.202008121739-0

I am using the following configuration for the KVM guest:

  <os>
    <type arch='s390x' machine='s390-ccw-virtio-rhel7.6.0'>hvm</type>
    <kernel>/bootkvm/rhcos-46.82.202008121739-0-live-kernel-s390x</kernel>
    <initrd>/bootkvm/rhcos-46.82.202008121739-0-live-initramfs.s390x.img</initrd>
    <cmdline>rd.neednet=1 console=ttysclp0 coreos.live.rootfs_url=http://9.12.23.79:8080/CI/4.6.0-0.nightly-s390x-2020-08-13-143202/rhcos-46.82.202008121739-0-live-rootfs.s390x.img coreos.inst=yes coreos.inst.install_dev=vda coreos.inst.image_url=http://9.12.23.79:8080/CI/4.6.0-0.nightly-s390x-2020-08-13-143202/rhcos-46.82.202008121739-0-metal.s390x.raw.gz coreos.inst.ignition_url=http://192.168.79.1:8080/ignition/master.ign ip=dhcp nameserver=192.168.79.1</cmdline>
  </os>

When I IPL the KVM master-2 guest for this case, the console output seems to match the bootstrap console output as originally reported.  I'll attach the latest master-2 console output here.

Please let me know if you need any additional information.

Thank you,
-Phil

Comment 28 Philip Chan 2020-08-13 19:09:40 UTC
Created attachment 1711368 [details]
guest console output from master-2

Comment 29 wvoesch 2020-08-19 15:27:14 UTC
Hi all,

I tested the two ways (“live”, “installer”) again with new builds (listed below), which should include the required fixes.
Can we please get advice on which way to go for future installations, the “live” way or the “installer” way? And is this the same for KVM and zVM?

Tests were conducted on a z13, zVM, Hipersockets, zFCP.

_____________________
Test 1 (“live”):

Used resources:
  OCP version: 
    4.6.0-0.nightly-s390x-2020-08-18-112620
  RHCOS version:
    rhcos-46.82.202008171940-0-live-initramfs.s390x.img
    rhcos-46.82.202008171940-0-live-kernel-s390x
    rhcos-46.82.202008171940-0-metal-zfcp.s390x.raw.gz
  Download-url:
    https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-....

Problem: 
The installation of OCP 4.6 fails because the installer wants to load the rootfs image via PXE, which does not work on zVM. Below is the relevant part of the rdsosreport output from the bootstrap; the full report is added as an attachment.

[  190.167762] localhost.localdomain systemd[1]: Started dracut initqueue hook.
[  190.169354] localhost.localdomain systemd[1]: Starting Acquire live PXE rootfs image...
[  190.169424] localhost.localdomain systemd[1]: Reached target Remote File Systems (Pre).
[  190.169483] localhost.localdomain systemd[1]: Reached target Remote File Systems.
[  190.170423] localhost.localdomain systemd[1]: Starting dracut pre-mount hook...
[  190.186191] localhost.localdomain systemd[1]: Started dracut pre-mount hook.
[  190.192015] localhost.localdomain coreos-livepxe-rootfs[812]: No rootfs image found.  Modify your PXE configuration to add the rootfs
[  190.192330] localhost.localdomain coreos-livepxe-rootfs[812]: image as a second initrd, or use the coreos.live.rootfs_url= kernel parameter
[  190.192330] localhost.localdomain coreos-livepxe-rootfs[812]: to specify an HTTP or HTTPS URL to the rootfs.
[  190.194842] localhost.localdomain systemd[1]: coreos-livepxe-rootfs.service: Main process exited, code=exited, status=1/FAILURE
[  190.194996] localhost.localdomain systemd[1]: coreos-livepxe-rootfs.service: Failed with result 'exit-code'.
[  190.195341] localhost.localdomain systemd[1]: Failed to start Acquire live PXE rootfs image.
[  190.195418] localhost.localdomain systemd[1]: Dependency failed for Initrd Root File System.
[  190.195456] localhost.localdomain systemd[1]: Dependency failed for Reload Configuration from the Real Root.

Additional question:
There also seem to be Asian characters in the rdsosreport (see below). Is there a connection to the problem we are seeing?

ID_FS_PARTLABEL=戀漀漀琀
...
ID_FS_PARTLABEL=氀甀欀猀开爀漀漀琀
...
lrwxrwxrwx 1 root root 10 Aug 19 11:56 戀漀漀琀 -> ../../sda1
lrwxrwxrwx 1 root root 10 Aug 19 11:56 氀甀欀猀开爀漀漀琀 -> ../../sda4

_____________________
Test 2 (“installer”):

Used resources:
  OCP version: 
    4.6.0-0.nightly-s390x-2020-08-18-112620
  RHCOS version:
    rhcos-46.82.202008171940-0-installer-initramfs.s390x.img
    rhcos-46.82.202008171940-0-installer-kernel-s390x
    rhcos-46.82.202008171940-0-metal-zfcp.s390x.raw.gz
  Download-url:
    https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-....
    (even if the “installer”-bits are not shown, they can be downloaded nonetheless, following the naming scheme.)

Problem: 

The bootstrap comes up and shows the “standard” login screen, but it did not pick up its hostname and calls itself localhost. It does not bring the network online; pinging its IP address gets no answer. DNS is working in this setup.
The masters are looping, waiting for ignition. These machines respond to pings.

Comment 30 wvoesch 2020-08-19 15:27:53 UTC
Created attachment 1711904 [details]
rdsosreport from the bootstrap

Comment 31 Christian LaPolt 2020-08-19 17:02:20 UTC
Just a note that I have tested the upgrade from 4.5 to the 4.6 nightly from 8/18, 4.6.0-0.nightly-s390x-2020-08-18-095743. The upgrade works fine and I didn't see any issues, even though this is one of the OCP nightlies that fails to install.

I am attaching the results of that upgrade.

Comment 32 Christian LaPolt 2020-08-19 17:04:28 UTC
Created attachment 1711917 [details]
Successful 4.5 to 4.6 upgrade info

Comment 33 Prashanth Sundararaman 2020-08-19 17:08:06 UTC
(In reply to wvoesch from comment #29)
The live images are used going forward. The installer images use the old coreos-installer, which will be deprecated soon. For the live method, instead of specifying a metal/dasd image, you just specify the rootfs image using coreos.live.rootfs_url. The rootfs image is like this one: https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.6-s390x/46.82.202008181640-0/s390x/rhcos-46.82.202008181640-0-live-rootfs.s390x.img.

This works fine for DASD, although there is an issue with installation on metal which Nikita is looking into.

Comment 34 Prashanth Sundararaman 2020-08-19 17:08:35 UTC
(In reply to Prashanth Sundararaman from comment #33)
> > rhcos-....
> >     (even if the “installer”-bits are not shown, they can be downloaded
> > nonetheless, following the naming scheme.)
> > 
> > Problem: 
> > 
> > The bootstrap will come up and shows the “standard” login screen, but it did
> > not pick up its name and calls itself localhost. It does not set the network
> > online, pinging the ip-address results in no answer. DNS is working in this
> > setup. 
> > The masters are looping in the wait for ignition. These machines respond to
> > pings.
> 
> The live images are used going forward. The installer images use the old
> coreos-installer which will be deprecated soon. For the live method, instead
> of specifying a metal/dasd image - you just specify the rootfs image using
> coreos.live.rootfs_url. The rootfs image is like this one:
> https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/
> rhcos-4.6-s390x/46.82.202008181640-0/s390x/rhcos-46.82.202008181640-0-live-
> rootfs.s390x.img.
> 
> This works fine for DASD although there is an issue with installation on
> metal which Nikita is looking into.

I meant an issue with Fibre Channel iSCSI.

Comment 35 krmoser 2020-08-19 18:16:45 UTC
Prashanth,

Thanks for the information.  When possible, please send us a link to the updated OCP 4.6 install procedure so that we may test it and review the documentation.

Thank you,
Kyle

Comment 36 Benjamin Gilbert 2020-08-20 02:02:29 UTC
Hi all,

Re comment 26, has anyone reproduced the segmentation fault from coreos-installer?  Core dumps or stacktraces would be very useful.

Re comment 27, Phil, thanks for the info.  I was unable to reproduce the boot failure with those RHCOS images in a QEMU emulated s390x VM.  When the boot fails, could you press Enter twice and report the output of the following commands:

ls -l /root.squashfs
sha256sum /root.squashfs

Thanks,
Benjamin

Comment 37 Benjamin Gilbert 2020-08-20 03:09:44 UTC
Hi Kyle,

Re comment 35, docs are still being written.  The main change is to 1) drop coreos.inst.image_url and 2) add coreos.live.rootfs_url with a URL pointing to the live-rootfs image.  There are some other cleanups to the kernel command line, but they're optional.
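For illustration, the change amounts to something like this on the kernel command line (all URLs, hostnames, and device names below are placeholders, not values from this setup):

```text
# old ("installer" images):
rd.neednet=1 console=ttysclp0 coreos.inst=yes coreos.inst.install_dev=sda coreos.inst.image_url=http://example.com/rhcos-metal.s390x.raw.gz coreos.inst.ignition_url=http://example.com/bootstrap.ign ip=dhcp

# new ("live" images): drop coreos.inst.image_url, add coreos.live.rootfs_url
rd.neednet=1 console=ttysclp0 coreos.inst=yes coreos.inst.install_dev=sda coreos.live.rootfs_url=http://example.com/rhcos-live-rootfs.s390x.img coreos.inst.ignition_url=http://example.com/bootstrap.ign ip=dhcp
```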

Best,
Benjamin

Comment 38 Philip Chan 2020-08-20 04:24:28 UTC
Hi Benjamin,

Based on all the above comments regarding the new OCP 4.6 install procedure, I was able to get further with the CoreOS installation under zKVM.  I was also able to reproduce the segmentation fault reported by Prashanth (comment 26).  However, I am hitting the same difficulties in obtaining the logs: I am unable to log on when it enters emergency mode.  I've tried hitting Enter once and twice, but neither gets me to a point where I can enter any commands.

I'm going to walk through the install process, as I believe there are multiple issues here.  I used RHCOS 46.82.202008181640-0 and OCP 4.6.0-0.nightly-s390x-2020-08-18-221317.  The <cmdline> I'm using for each of my cluster nodes consists of the following:

    <cmdline>rd.neednet=1 console=ttysclp0 coreos.live.rootfs_url=http://9.12.23.79:8080/CI/4.6.0-0.nightly-s390x-2020-08-18-221317/rhcos-46.82.202008181640-0-live-rootfs.s390x.img coreos.inst=yes coreos.inst.install_dev=vda coreos.inst.ignition_url=http://192.168.79.1:8080/ignition/bootstrap.ign ip=dhcp nameserver=192.168.79.1</cmdline>

Issue #1 - Ignition file permissions are not set correctly on the httpd bastion:

When IPLing the KVM guest (bootstrap), it hits this error:

        Starting CoreOS Installer...
[   32.962116] coreos-installer-service[1118]: coreos-installer install /dev/vda --ignition-url http://192.168.79.1:8080/ignition/bootstrap.ign --insecure-ignition --firstboot-args rd.neednet=1 ip=dhcp nameserver=192.168.79.1
[   33.021058] coreos-installer-service[1118]: Error: parsing arguments
[   33.022195] coreos-installer-service[1118]: Caused by: downloading source Ignition config http://192.168.79.1:8080/ignition/bootstrap.ign
[   33.022277] coreos-installer-service[1118]: Caused by: fetching 'http://192.168.79.1:8080/ignition/bootstrap.ign'
[   33.022358] coreos-installer-service[1118]: Caused by: HTTP status client error (403 Forbidden) for url (http://192.168.79.1:8080/ignition/bootstrap.ign)
[FAILED] Failed to start CoreOS Installer.
See 'systemctl status coreos-installer.service' for details.
[DEPEND] Dependency failed for CoreOS Installer Target.

I checked permissions for the ignition files which are only set to 600:

# ls -la /var/www/html/ignition/
total 296
-rw-------. 1 root root 294625 Aug 19 19:21 bootstrap.ign
-rw-------. 1 root root   1738 Aug 19 19:20 master.ign
-rw-------. 1 root root   1738 Aug 19 19:20 worker.ign

I proceeded to modify the files to 755:

# chmod 755 /var/www/html/ignition/*ign
# ls -la /var/www/html/ignition/
total 296
-rwxr-xr-x. 1 root root 294625 Aug 19 19:21 bootstrap.ign
-rwxr-xr-x. 1 root root   1738 Aug 19 19:20 master.ign
-rwxr-xr-x. 1 root root   1738 Aug 19 19:20 worker.ign

I then IPLed the bootstrap guest again, which succeeded:
...

Red Hat Enterprise Linux CoreOS 46.82.202008181640-0 (Ootpa) 4.6
SSH host key: SHA256:R5QSUYN7Xsz9CuqIAp/NPcQ01uxGNneib15lDFa5wDE (ECDSA)
SSH host key: SHA256:JcMC7Pbk9brJy02BcBcBLU10i0Smp7oOjAyIlgRWwBo (ED25519)
SSH host key: SHA256:+Ikms3gamNU7nQmXfq6hvkoCkFlZcadKc429R0W0FBM (RSA)
enc1: 192.168.79.20 fe80::a622:cf4d:5366:d208
bootstrap-0 login:

At this point, I proceeded to IPL the other cluster nodes (master and worker), which failed due to haproxy being down on the bastion.

Issue #2 - DNS server does not start properly, which prevents HAProxy from starting

Unfortunately, I do not have the logs for this, but it is another file permission problem.  Our colleague Wolfgang Voesch had also encountered this and updated the Ansible playbook to resolve it.  I proceeded to manually update the two zone files and restart the DNS server:

# chmod 744 /var/named/79.168.192.in-addr.arpa.zone
# chmod 744 /var/named/pok-192-aug19-46.pok.stglabs.ibm.com.zone
# systemctl stop named-chroot
# systemctl start named-chroot
# systemctl status named-chroot
● named-chroot.service - Berkeley Internet Name Domain (DNS)
   Loaded: loaded (/usr/lib/systemd/system/named-chroot.service; enabled; vendor preset: disabled)
   Active: active (running) (thawing) since Wed 2020-08-19 23:37:39 EDT; 7s ago
  Process: 42385 ExecStop=/bin/sh -c /usr/sbin/rndc stop > /dev/null 2>&1 || /bin/kill -TERM $MAINPID (code=exited, status=0/SUCCESS)
  Process: 42497 ExecStart=/usr/sbin/named -u named -c ${NAMEDCONF} -t /var/named/chroot $OPTIONS (code=exited, status=0/SUCCESS)
  Process: 42494 ExecStartPre=/bin/bash -c if [ ! "$DISABLE_ZONE_CHECKING" == "yes" ]; then /usr/sbin/named-checkconf -t /var/named/chroot -z "$NAMEDCONF"; else echo "Checking of zone files is disabled>
 Main PID: 42498 (named)
    Tasks: 7 (limit: 24228)
   Memory: 57.6M
   CGroup: /system.slice/named-chroot.service
           └─42498 /usr/sbin/named -u named -c /etc/named.conf -t /var/named/chroot

Aug 19 23:37:39 bastion named[42498]: zone 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa/IN: loaded serial 0
Aug 19 23:37:39 bastion named[42498]: zone 1.0.0.127.in-addr.arpa/IN: loaded serial 0
Aug 19 23:37:39 bastion named[42498]: zone localhost.localdomain/IN: loaded serial 0
Aug 19 23:37:39 bastion named[42498]: all zones loaded
Aug 19 23:37:39 bastion systemd[1]: Started Berkeley Internet Name Domain (DNS).
Aug 19 23:37:39 bastion named[42498]: running
Aug 19 23:37:39 bastion named[42498]: zone 79.168.192.in-addr.arpa/IN: sending notifies (serial 2019062001)
Aug 19 23:37:39 bastion named[42498]: managed-keys-zone: Key 20326 for zone . acceptance timer complete: key now trusted
Aug 19 23:37:39 bastion named[42498]: resolver priming query complete
Aug 19 23:37:40 bastion named[42498]: resolver priming query complete

At this state, we can restart the HA Proxy successfully:

# systemctl stop haproxy.service
# systemctl start haproxy.service
# systemctl status haproxy.service
● haproxy.service - HAProxy Load Balancer
   Loaded: loaded (/usr/lib/systemd/system/haproxy.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-08-19 23:38:19 EDT; 8s ago
  Process: 42514 ExecStartPre=/usr/sbin/haproxy -f $CONFIG -c -q (code=exited, status=0/SUCCESS)
 Main PID: 42516 (haproxy)
    Tasks: 2 (limit: 24228)
   Memory: 2.9M
   CGroup: /system.slice/haproxy.service
           ├─42516 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
           └─42517 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid

Aug 19 23:38:09 bastion systemd[1]: Starting HAProxy Load Balancer...
Aug 19 23:38:19 bastion haproxy[42516]: [WARNING] 231/233819 (42516) : config : 'option forwardfor' ignored for frontend 'ocp4-kubernetes-api-server' as it requires HTTP mode.
Aug 19 23:38:19 bastion haproxy[42516]: [WARNING] 231/233819 (42516) : config : 'option forwardfor' ignored for frontend 'ocp4-machine-config-server' as it requires HTTP mode.
Aug 19 23:38:19 bastion haproxy[42516]: [WARNING] 231/233819 (42516) : config : 'option forwardfor' ignored for frontend 'ocp4-router-http' as it requires HTTP mode.
Aug 19 23:38:19 bastion haproxy[42516]: [WARNING] 231/233819 (42516) : config : 'option forwardfor' ignored for frontend 'ocp4-router-https' as it requires HTTP mode.
Aug 19 23:38:19 bastion haproxy[42516]: [WARNING] 231/233819 (42516) : config : 'option forwardfor' ignored for backend 'ocp4-kubernetes-api-server' as it requires HTTP mode.
Aug 19 23:38:19 bastion haproxy[42516]: [WARNING] 231/233819 (42516) : config : 'option forwardfor' ignored for backend 'ocp4-machine-config-server' as it requires HTTP mode.
Aug 19 23:38:19 bastion haproxy[42516]: [WARNING] 231/233819 (42516) : config : 'option forwardfor' ignored for backend 'ocp4-router-http' as it requires HTTP mode.
Aug 19 23:38:19 bastion haproxy[42516]: [WARNING] 231/233819 (42516) : config : 'option forwardfor' ignored for backend 'ocp4-router-https' as it requires HTTP mode.
Aug 19 23:38:19 bastion systemd[1]: Started HAProxy Load Balancer.

Issue #3 - Segmentation fault when IPLing the zKVM guest nodes

Once the above DNS and HAProxy are back up and running, when I attempt to IPL the guest, it hits the segmentation fault:

         Starting CoreOS Installer...
[   19.586615] coreos-installer-service[1109]: coreos-installer install /dev/vda --ignition-url http://192.168.79.1:8080/ignition/bootstrap.ign --insecure-ignition --firstboot-args rd.neednet=1 ip=dhcp nameserver=192.168.79.1
[   19.645909] coreos-installer-service[1109]: Installing Red Hat Enterprise Linux CoreOS 46.82.202008181640-0 (Ootpa) s390x (512-byte sectors)
[   19.662157]  vda: vda1 vda4
[  OK  ] Created slice system-systemd\x2dcoredump.slice.
[  OK  ] Started Process Core Dump (PID 1131/UID 0).
[   20.172525] coreos-installer-service[1109]: /usr/libexec/coreos-installer-service: line 105:  1125 Segmentation fault      (core dumped) coreos-installer "${args[@]}"
[FAILED] Failed to start CoreOS Installer.
See 'systemctl status coreos-installer.service' for details.
[DEPEND] Dependency failed for CoreOS Installer Target.
[  OK  ] Stopped Daily Cleanup of Temporary Directories.
         Stopping Hostname Service...
[  OK  ] Stopped Run update-ca-trust.
[  OK  ] Stopped daily update of the root trust anchor for DNSSEC.
[  OK  ] Stopped Network Manager Wait Online.
         Stopping Network Manager...
[   20.175643]  vda: vda1 vda4
[  OK  ] Stopped Hostname Service.
[  OK  ] Stopped Network Manager.
         Stopping D-Bus System Message Bus...
         Stopping Network Manager Script Dispatcher Service...
[  OK  ] Stopped Network Manager Script Dispatcher Service.
[  OK  ] Stopped D-Bus System Message Bus.
[  OK  ] Stopped target Basic System.
[  OK  ] Closed D-Bus System Message Bus Socket.
[  OK  ] Stopped target System Initialization.
[  OK  ] Started Emergency Shell.
[  OK  ] Reached target Emergency Mode.
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.

Cannot open access to console, the root account is locked.
See sulogin(8) man page for more details.

Press Enter to continue.

If I hit Enter, it will not provide any login or access, but it does show additional logging details:

Reloading system manager configuration
Starting default target
[  357.096492] coreos-installer-service[2014]: coreos-installer install /dev/vda --ignition-url http://192.168.79.1:8080/ignition/bootstrap.ign --insecure-ignition --firstboot-args rd.neednet=1 ip=dhcp nameserver=192.168.79.1
[  357.107111] coreos-installer-service[2014]: Installing Red Hat Enterprise Linux CoreOS 46.82.202008181640-0 (Ootpa) s390x (512-byte sectors)
[  357.122886]  vda: vda1 vda4
[  357.127241] User process fault: interruption code 003b ilc:3 in coreos-installer[2aa3ca80000+7bf000]
[  357.127245] Failing address: 0000000000000000 TEID: 0000000000000800
[  357.127246] Fault in primary space mode while using user ASCE.
[  357.127248] AS:0000000799eec1c7 R3:0000000000000024
[  357.127251] CPU: 5 PID: 2038 Comm: coreos-installe Not tainted 4.18.0-211.el8.s390x #1
[  357.127252] Hardware name: IBM 3906 M04 716 (KVM/Linux)
[  357.127253] User PSW : 0705200180000000 000002aa3cceeb50
[  357.127255]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:2 PM:0 RI:0 EA:3
[  357.127257] User GPRS: 0000000000000000 0000000000000000 0000000000000001 000002aa3d1d0420
[  357.127258]            000000000000000f 0000000000000006 0000000000000200 000003ff00000001
[  357.127260]            0000000000000000 000002aa7a1ae230 0000000000000000 0000000000000008
[  357.127261]            0000000000002800 0000000000000008 000002aa3cceeb3a 000003ffc89fb5a8
[  357.127267] User Code: 000002aa3cceeb42: a77a0001		ahi	%r7,1
[  357.127267]            000002aa3cceeb46: e380f1500004	lg	%r8,336(%r15)
[  357.127267]           #000002aa3cceeb4c: b90400bd		lgr	%r11,%r13
[  357.127267]           >000002aa3cceeb50: e55cb0000001	chsi	0(%r11),1
[  357.127267]            000002aa3cceeb56: a784000f		brc	8,2aa3cceeb74
[  357.127267]            000002aa3cceeb5a: a7f4001b		brc	15,2aa3cceeb90
[  357.127267]            000002aa3cceeb5e: a78bffc0		aghi	%r8,-64
[  357.127267]            000002aa3cceeb62: 41b0b040		la	%r11,64(%r11)
[  357.127281] Last Breaking-Event-Address:
[  357.127284]  [<00000007a6676754>] system_call+0x128/0x2c8
[  357.503233] audit: type=1400 audit(1597895260.960:90): avc:  denied  { unlink } for  pid=2016 comm="NetworkManager" name="resolv.conf" dev="tmpfs" ino=21647 scontext=system_u:system_r:NetworkManager_t:s0 tcontext=system_u:object_r:tmpfs_t:s0 tclass=file permissive=0
[  357.697359] coreos-installer-service[2014]: /usr/libexec/coreos-installer-service: line 105:  2038 Segmentation fault      (core dumped) coreos-installer "${args[@]}"
[  357.700080]  vda: vda1 vda4
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.

Cannot open access to console, the root account is locked.
See sulogin(8) man page for more details.

Press Enter to continue.

Let me know if there's a way to pull any additional logging details.

Thank you,
-Phil

Comment 39 Benjamin Gilbert 2020-08-20 05:35:53 UTC
Hi Phil,

> Cannot open access to console, the root account is locked.

Ah, right, that happens on RHCOS.  To do this the long way around:

1. Create an Ignition config that sets a password for the core user.
2. Boot without any coreos.inst.* arguments, and with "ignition.firstboot ignition.platform.id=metal ignition.config.url=http://..."
3. Log in as core.
4. Run:

sudo coreos-installer install /dev/vda --ignition-url http://192.168.79.1:8080/ignition/bootstrap.ign --insecure-ignition --firstboot-args "rd.neednet=1 ip=dhcp nameserver=192.168.79.1"

If you can reproduce the crash, a stack trace would be useful, and/or core dump, and/or the last few lines of running the installer under "sudo strace".
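As a sketch of step 1, a minimal Ignition config of that shape might look like this (the file name is arbitrary and PASSWORD_HASH is a placeholder; generate a real hash with e.g. "openssl passwd -6"):

```shell
# Sketch of step 1: write a minimal Ignition config that sets a password
# for the "core" user. PASSWORD_HASH is a placeholder, not a real hash.
cat > core-password.ign <<'EOF'
{
  "ignition": { "version": "3.0.0" },
  "passwd": {
    "users": [
      { "name": "core", "passwordHash": "PASSWORD_HASH" }
    ]
  }
}
EOF
```

Serve the resulting file over HTTP and point ignition.config.url at it in step 2.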

Best,
Benjamin

Comment 40 Benjamin Gilbert 2020-08-20 05:41:09 UTC
Also, Phil, I'm not familiar with the context of issues 1 and 2 in comment 38, but should they be split out to separate bugs?

Comment 41 Nikita Dubrovskii (IBM) 2020-08-20 13:11:46 UTC
(In reply to Benjamin Gilbert from comment #36)
> Hi all,
> 
> Re comment 26, has anyone reproduced the segmentation fault from
> coreos-installer?  Core dumps or stacktraces would be very useful.
> 
> Re comment 27, Phil, thanks for the info.  I was unable to reproduce the
> boot failure with those RHCOS images in a QEMU emulated s390x VM.  When the
> boot fails, could you press Enter twice and report the output of the
> following commands:
> 
> ls -l /root.squashfs
> sha256sum /root.squashfs
> 
> Thanks,
> Benjamin

Hi, 
I'm able to reproduce this only with the release build of the installer; the debug build works fine. I've seen similar issues before on F31 with an old Rust compiler, but never on F32.
Will try to get a coredump.

Comment 42 Nikita Dubrovskii (IBM) 2020-08-21 14:15:03 UTC
Backtrace:

Core was generated by `coreos-installer install /dev/sda --ignition-url http://172.18.10.243/fcos.ign'.
Program terminated with signal SIGSEGV, Segmentation fault.
b#0  libcoreinst::blockdev::SavedPartitions::matches_filters::{{closure}} (f=0x8) at src/blockdev.rs:625
625                 Index(Some(first), _) if first.get() > i => false,
[Current thread is 1 (Thread 0x3ff85c74aa0 (LWP 941))]
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /home/zukku/rpmbuild/BUILD/coreos-installer-0.5.1/target/release/coreos-installer.
Use `info auto-load python-scripts [REGEXP]' to list them.
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.31-4.fc32.s390x libgcc-10.2.1-1.fc32.s390x openssl-libs-1.1.1g-1.fc32.s390x xz-libs-5.2.5-1.fc32.s390x zlib-1.2.11-21.fc32.s390x
(gdb) bt
#0  libcoreinst::blockdev::SavedPartitions::matches_filters::{{closure}} (f=0x8) at src/blockdev.rs:625
#1  <core::slice::Iter<T> as core::iter::traits::iterator::Iterator>::any (self=<optimized out>, f=...) at /rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libcore/slice/mod.rs:3392
#2  libcoreinst::blockdev::SavedPartitions::matches_filters (i=1, p=0x2aa34720120, filters=...) at src/blockdev.rs:624
#3  libcoreinst::blockdev::SavedPartitions::new (disk=0x3ffd367e3ec, sector_size=512, filters=...) at src/blockdev.rs:557
#4  libcoreinst::blockdev::SavedPartitions::new_from_disk (disk=0x3ffd367e3ec, filters=...) at src/blockdev.rs:510
#5  libcoreinst::install::install (config=0x3ffd377ecd8) at src/install.rs:78
#6  0x000002aa335e9b8e in coreos_installer::main () at src/main.rs:29

If I compile a debug build, or add a print of the entire `filters`, then the installer works! Could it be an optimization-related issue?

Comment 43 Benjamin Gilbert 2020-08-21 20:34:41 UTC
Nikita, thanks for tracking that down!  For the record, your fix is in https://github.com/coreos/coreos-installer/pull/360.

Phil, does comment 38 imply that you can no longer reproduce the error from comment 28?

Comment 44 Philip Chan 2020-08-24 22:48:14 UTC
Hi Benjamin,

Yes, that is correct -- I can no longer reproduce the error in comment 28 now that I am using the live installer procedure.

I also have not been able to produce a strace and/or core dump; however, I believe Nikita has provided you some of those details.  I was following your steps in comment 39, but the password hash I set is not accepted for the core user.  I will continue looking into this.

Thank you,
-Phil

Comment 45 Benjamin Gilbert 2020-08-24 23:06:24 UTC
Hi Phil,

Great, thank you.  Nikita submitted a fix for the segmentation fault, but it hasn't landed in a new build yet.  I don't think you need to do anything further right now; I'll update the bug when a fixed build is ready.

Best,
Benjamin

Comment 46 Prashanth Sundararaman 2020-08-26 13:03:29 UTC
Phil/Kyle,

We have the latest build: https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.6-s390x&release=46.82.202008261039-0#46.82.202008261039-0, which contains the fixes for the various issues reported above. Could you give it a try on zVM with both FC and DASD disks? These builds work for me.

Thanks
Prashanth

Comment 47 Philip Chan 2020-08-27 03:07:28 UTC
Hi Prashanth,

I tried the rhcos-46.82.202008261039-0 build on two clusters under zKVM.  

For the ECKD-based OCP cluster, the guest nodes IPL successfully and I am able to complete an OCP 4.6.0-0.nightly-s390x-2020-08-26-191428 install.

This is what I am using for the guest configuration (for bootstrap-0):

  <os>
    <type arch='s390x' machine='s390-ccw-virtio-rhel7.6.0'>hvm</type>
    <kernel>/bootkvm/rhcos-46.82.202008261039-0-live-kernel-s390x</kernel>
    <initrd>/bootkvm/rhcos-46.82.202008261039-0-live-initramfs.s390x.img</initrd>
    <cmdline>rd.neednet=1 console=ttysclp0 coreos.live.rootfs_url=http://9.12.23.79:8080/CI/4.6.0-0.nightly-s390x-2020-08-26-123142/rhcos-46.82.202008261039-0-live-rootfs.s390x.img coreos.inst=yes coreos.inst.install_dev=vda coreos.inst.ignition_url=http://192.168.79.1:8080/ignition/bootstrap.ign ip=dhcp nameserver=192.168.79.1</cmdline>
    <boot dev='hd'/>
  </os>

For the FCP based OCP cluster, the guests fail to IPL using the same guest configuration.  I will attach the console IPL output from the KVM guest.

Please let me know if you need any additional information.

Thank you,
-Phil

Comment 48 Philip Chan 2020-08-27 03:08:47 UTC
Created attachment 1712764 [details]
guest console output from bootstrap-0 - 2020-08-26

Comment 49 Benjamin Gilbert 2020-08-27 14:22:34 UTC
Hi Phil,

Thanks for the update.  The error in comment 48 is the same as in comment 28 and in the original report.  Could you post the output of the commands from comment 36?

Thanks,
Benjamin

Comment 50 wvoesch 2020-08-28 15:26:33 UTC
Created attachment 1712966 [details]
Failed dasd installation: Dependency failed for Reboot after CoreOS Installer.

Comment 51 wvoesch 2020-08-28 15:26:52 UTC
Hi all,

the information below is for zVM, OCP version: 4.6.0-0.nightly-s390x-2020-08-28-040333 and RHCOS version 46.82.202008271939-0.

1. For FCP disks, the installation works with the live images.

2. For DASD disks, the installation fails.
a) If the option coreos.inst=yes is used, I receive the segmentation fault, as discussed above.
b) If I don't use the option coreos.inst=yes, the installation fails with the message: "Dependency failed for Reboot after CoreOS Installer." As the emergency console is locked, I have only attached a screenshot.

If you need more information, please let me know. 

Thanks,
Wolfgang

Comment 52 Philip Chan 2020-08-28 15:35:44 UTC
Hi Benjamin,

For the rhcos-46.82.202008261039-0 build, these are the results of the output from the commands in comment 36:

:/# ls -l /root.squashfs
-rw-r--r-- 1 root root 734994432 Aug 26 11:15 /root.squashfs
:/# sha256sum /root.squashfs
cf409c02892429aa6fe4d139aecdbfc79df5b042e59a9bb71f0784a50cb1ed05  /root.squashfs


For the rhcos-46.82.202008271939-0 build from today, the IPL still fails but with different output.  I'll attach the full console output to the bugzilla, but will add a snippet of the failure here:

         Starting Setup Virtual Console...
[[   11.752309] systemd[1]: Stopped target System Initialization.
  OK  ] Stopped target System Initialization.
[[   11.752773] systemd[1]: Started Dump journal to virtio port.
  OK  ] Started Dump journal to virtio port.
[   11.777764] systemd-vconsole-setup[754]: KD_FONT_OP_GET failed while trying to get the font metadata: Function not implemented
[   11.777788] systemd-vconsole-setup[754]: Fonts will not be copied to remaining consoles
[   11.778271] systemd[1]: Started Setup Virtual Console.
[  OK  ] Started Setup Virtual Console.
[   11.778690] systemd[1]: Started Emergency Shell.
[  OK  ] Started Emergency Shell.
[[   11.778768] systemd[1]: Reached target Emergency Mode.
  OK  ] Reached target Emergency Mode.
[   11.778811] systemd[1]: Startup finished in 3.075s (kernel) + 0 (initrd) + 8.703s (userspace) = 11.778s.
Displaying logs from failed units: sysroot.mount
-- Logs begin at Fri 2020-08-28 15:03:54 UTC, end at Fri 2020-08-28 15:04:02 UTC. --
Aug 28 15:04:02 systemd[1]: Mounting /sysroot...
Aug 28 15:04:02 mount[737]: mount: /sysroot: can't read superblock on /dev/loop1.
Aug 28 15:04:02 systemd[1]: sysroot.mount: Mount process exited, code=exited status=32
Aug 28 15:04:02 systemd[1]: sysroot.mount: Failed with result 'exit-code'.
Aug 28 15:04:02 systemd[1]: Failed to mount /sysroot.
Press Enter for emergency shell or wait 3 minutes 30 seconds for reboot.

Generating "/run/initramfs/rdsosreport.txt"

This is the output from the commands for build rhcos-46.82.202008271939-0:

:/# ls -l /root.squashfs
-rw-r--r-- 1 root root 735059968 Aug 27 20:19 /root.squashfs
:/# sha256sum /root.squashfs
102135f87bb80fcbc8d149adc858438625f36e26dbbe73ed0219aefec9a00aae  /root.squashfs

In both cases from above, they were performed on FCP attached disks.

Comment 53 Philip Chan 2020-08-28 15:36:19 UTC
Created attachment 1712969 [details]
guest console output from bootstrap-0 - 2020-08-28

Comment 54 Benjamin Gilbert 2020-09-02 00:52:41 UTC
Wolfgang, the screenshot in comment 50 only shows the consequences of the error, not the original error.  Could you take a screenshot from further up in the scrollback where the original error is reported?  Also, re comment 51: is the only difference between the two cases the coreos.inst=yes argument?  coreos.inst=yes doesn't do anything in the live image.

Philip, the sizes and hashes in comment 52 are correct, so the complaints of corruption are strange.  Next I'd like to test whether there's some race condition that's trying to mount the squashfs before it's fully unpacked.  At the emergency prompt, please run the following and report the output:

sha256sum /root.squashfs
mkdir -p /mnt
mount -o ro /root.squashfs /mnt
ls /mnt

Best,
Benjamin

Comment 55 wvoesch 2020-09-02 09:07:09 UTC
Created attachment 1713418 [details]
screenshots and parmfile of the installation version 4.6.0-0.nightly-s390x-2020-08-28-040333 without the option "coreos.inst=yes"

Comment 56 wvoesch 2020-09-02 09:08:05 UTC
Created attachment 1713419 [details]
screenshots and parmfile of the installation version 4.6.0-0.nightly-s390x-2020-08-28-040333 with the option "coreos.inst=yes"

Comment 57 wvoesch 2020-09-02 09:11:34 UTC
Hello Benjamin,

thank you for this information. 
Indeed, the option coreos.inst=yes causes only slight differences in the installation (see below), but no difference in the actual error. For your convenience I typed out the part that (hopefully) contains the error. Please find screenshots of the whole installation with both options in the archives attached to comment #55 and comment #56.

coreos-installer-service: Error: deserialization failed
Created slice system-systemd\x2dcoredump.slice.
coreos-installer-service: Caused by: /usr/libexec/coreos-installer-service: line 105: 1172 [with coreos.inst=yes] / 1156 [without] Segmentation fault (core dumped) coreos-installer "${args[@]}"
[FAILED] Failed to start CoreOS Installer.

A difference between an install with and without the option coreos.inst=yes is that, with the option, the installer checks several times for dhcp4, changes the network interface and so on, only continuing after a 45 s timeout. Please see "Screenshot 2020-09-02 at 10.08.30.png" and following from comment #56.

Additional info: in the screenshot series I included what happens if I hit "enter" as suggested by the system to continue, just in case it could be helpful.

Best,
Wolfgang

Comment 58 Philip Chan 2020-09-02 17:19:49 UTC
Hi Benjamin,

Here is the output requested from comment 54:

:/# sha256sum /root.squashfs
cf409c02892429aa6fe4d139aecdbfc79df5b042e59a9bb71f0784a50cb1ed05  /root.squashfs

:/# mkdir -p /mnt

:/# mount -o ro /root.squashfs /mnt
mount: /mnt: can't read superblock on /dev/loop1.

:/#  mount /root.squashfs /mnt/ -t squashfs -o loop       <=== Note: I had to use this mount command instead of the previous one provided
:/# ls -la /mnt
total 3
drwxr-xr-x  4 root root  87 Aug 26 10:59 .
drwxr-xr-x 13 root root 480 Sep  2 17:10 ..
-rw-rw-r--  1 root root 255 Aug 26 10:59 .coreos-aleph-version.json
drwxr-xr-x  5 root root 118 Aug 26 10:59 boot
drwxr-xr-x  5 root root  95 Aug 26 10:59 ostree

Thank you,
-Phil

Comment 59 Benjamin Gilbert 2020-09-03 03:54:01 UTC
Wolfgang, thanks for the additional info.  I'll submit a change to disable LTO for coreos-installer and we'll see whether that fixes the segmentation fault.
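For reference, disabling LTO is typically a one-line change in the release profile of Cargo.toml (a sketch; the actual coreos-installer change may differ):

```toml
[profile.release]
lto = false
```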

Philip, that's interesting.  What are the last few lines of output from dmesg after you execute the mount command that fails?  Is it the -t squashfs that makes the mount work, the -o loop, or are both required?

Best,
Benjamin

Comment 60 Hendrik Brueckner 2020-09-03 10:22:37 UTC
Regarding the -t mount option: it could be necessary if squashfs is built as a module and not yet loaded; in that case mount's filesystem autodetection might fail. See also the explanation of the -t option in the mount manpage:

              If no -t option is given, or if the auto type is specified, mount will try to guess the desired type.  Mount uses the
              blkid library for guessing the filesystem type; if that does not turn up anything that looks familiar, mount will try
              to read the file /etc/filesystems, or, if that does not exist, /proc/filesystems.  All of the filesystem types listed
              there  will  be  tried,  except for those that are labeled "nodev" (e.g., devpts, proc and nfs).  If /etc/filesystems
              ends in a line with a single * only, mount will read /proc/filesystems afterwards. All of the filesystem  types  will
              be mounted with mount option "silent".

So the contents of /proc/filesystems before and after the mount call could shed some light.
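As a quick sanity check of that theory, one could inspect the registered filesystems before attempting the mount (a sketch; assumes a Linux shell on the affected node):

```shell
# List the filesystems the kernel currently knows about; "nodev" entries
# are virtual filesystems that mount's autodetection skips.
cat /proc/filesystems

# Check whether squashfs is already registered; if it is not, mounting with
# an explicit -t squashfs would normally trigger the module load.
if grep -qw squashfs /proc/filesystems; then
    echo "squashfs is built in or already loaded"
else
    echo "squashfs not registered yet; 'modprobe squashfs' (as root) would load it"
fi
```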

Comment 61 Hendrik Brueckner 2020-09-03 14:27:07 UTC
For the LTO issues, see also https://github.com/coreos/coreos-installer/issues/372

Comment 62 Philip Chan 2020-09-03 15:14:36 UTC
Hi Benjamin,

The last lines in dmesg caused by the failed mount command are these:

[  189.162741] SQUASHFS error: zlib decompression failed, data probably corrupt
[  189.162743] SQUASHFS error: squashfs_read_data failed to read block 0x2bac5131
[  189.162744] SQUASHFS error: Unable to read metadata cache entry [2bac5131]
[  189.162744] SQUASHFS error: Unable to read inode 0x7f5910202

Both -t squashfs and -o loop are required to mount the squashfs in this case.  Using just one individually will fail.

:/# mount /root.squashfs /mnt/ -t squashfs
mount: /mnt: can't read superblock on /dev/loop1.

:/# mount /root.squashfs /mnt/ -o loop
mount: /mnt: can't read superblock on /dev/loop1.

:/# mount /root.squashfs /mnt/ -t squashfs -o loop
:/# ls -la /mnt
total 3
drwxr-xr-x  4 root root  87 Aug 26 10:59 .
drwxr-xr-x 13 root root 480 Sep  3 15:09 ..
-rw-rw-r--  1 root root 255 Aug 26 10:59 .coreos-aleph-version.json
drwxr-xr-x  5 root root 118 Aug 26 10:59 boot
drwxr-xr-x  5 root root  95 Aug 26 10:59 ostree

However, I just observed some very odd behavior when retrying the same mount commands.  If I umount this and try again, it fails:

:/# umount /mnt
:/# mount /root.squashfs /mnt/ -t squashfs -o loop
mount: /mnt: can't read superblock on /dev/loop1.
:/# ls -la /mnt
total 0
drwxr-xr-x  2 root root  40 Sep  3 15:09 .
drwxr-xr-x 13 root root 480 Sep  3 15:09 ..

The dmesg output shows the same 'data probably corrupt' message we get when trying -o loop or -t squashfs.  There seems to be some inconsistency here.

Please let me know if you need anything.

Thank you,
-Phil

Comment 63 Benjamin Gilbert 2020-09-04 03:48:20 UTC
Wolfgang, the segmentation faults should be fixed in this build: https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.6-s390x&release=46.82.202009040140-0#46.82.202009040140-0.  PTAL.

Philip, if you repeatedly mount and unmount the filesystem using mount with both -t and -o, do you get a mix of successes and failures?

Best,
Benjamin
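To exercise that question mechanically, a small loop can repeat the mount/unmount cycle and count intermittent failures (a sketch only; the /root.squashfs path and mount options come from the transcripts above, root privileges are required, and the loop skips itself if the file is absent):

```shell
#!/bin/sh
# Repeatedly mount and unmount the squashfs to check for intermittent
# "can't read superblock" failures.
IMG=/root.squashfs
MNT=/mnt
fails=0
for i in 1 2 3 4 5; do
    if [ ! -r "$IMG" ]; then
        echo "attempt $i: $IMG not present, skipping"
        continue
    fi
    if mount -t squashfs -o loop,ro "$IMG" "$MNT" 2>/dev/null; then
        echo "attempt $i: mount ok"
        umount "$MNT"
    else
        fails=$((fails + 1))
        echo "attempt $i: mount FAILED"
        dmesg | tail -n 4   # capture the SQUASHFS error lines for this attempt
    fi
done
echo "failures: $fails"
```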

Comment 64 Philip Chan 2020-09-05 02:50:04 UTC
Hi Benjamin,

Yes, that's what I observed, as mentioned in comment 62.  Performing the same commands produces different results. I've also seen a mount report a failure yet still show content files under /mnt.

Regards,
-Phil

Comment 65 wvoesch 2020-09-07 12:38:58 UTC
Hi Benjamin, 

I could install OpenShift on dasd successfully. Thank you. 
For reference I used this version: 

Server Version: 4.6.0-0.nightly-s390x-2020-09-05-222506
Kubernetes Version: v1.19.0-rc.2+068702d
Kernel Version:                          4.18.0-211.el8.s390x
OS Image:                                Red Hat Enterprise Linux CoreOS 46.82.202009042339-0 (Ootpa)
Container Runtime Version:               cri-o://1.19.0-8.rhaos4.6.git488942f.el8-rc.1

Comment 66 krmoser 2020-09-09 07:58:59 UTC
Benjamin and Prashanth,

Thank you for all your ongoing help with this issue.  Here is some additional information, along with some questions, to help make progress toward resolving this issue.

1. Live installer issue with RHCOS 46.82.202009042339-0 version
===============================================================
Similar to the information Phil Chan provided in Comment 62 for zKVM OCP 4.6 cluster installations, we're encountering the following rhcos coreos-livepxe-rootfs issue multiple times when using the rhcos-46.82.202009042339-0-live-rootfs.s390x.img file to attempt zVM OCP 4.6 cluster installations with ECKD or FCP storage.

Unlike our colleague Wolfgang Voesch, we have not been able to successfully install this rhcos version using the "live" installer method on a zVM cluster.  Please see the following zVM bootstrap node console error information:

09/08/20 17:56:28 [  184.232169] coreos-livepxe-rootfs[818]: Fetching rootfs image from http://10.20.116.244:8080/4.6.0-0.nightly-s390x-2020-09-05-222506/rhcos-46.82.202009042339-0-live-rootfs.s390x.img..
09/08/20 17:56:30 [[0;32m  OK  [0m] Started Acquire live PXE rootfs image.                                                                                      
09/08/20 17:56:30 [  186.342496] systemd[1]: Started Acquire live PXE rootfs image.                                                                             
09/08/20 17:56:30          Starting Persist osmet files (PXE)...[  186.343145] systemd[1]: Starting Persist osmet files (PXE)...                                
09/08/20 17:56:30          Mounting /sysroot...                                                                                                                 
09/08/20 17:56:30 [  186.343575] systemd[1]: Mounting /sysroot...                                                                                               
09/08/20 17:56:30 [  186.381282] systemd[1]: Started Persist osmet files (PXE).                                                                                 
09/08/20 17:56:30 [[0;32m  OK  [0m] Started Persist osmet files (PXE).                                                                                          
09/08/20 17:56:30 [  186.475437] squashfs: version 4.0 (2009/01/31) Phillip Lougher                                                                             
09/08/20 17:56:30 [  186.476605] SQUASHFS error: zlib decompression failed, data probably corrupt                                                               
09/08/20 17:56:30 [  186.476607] SQUASHFS error: squashfs_read_data failed to read block 0x2b7f57f4                                                             
09/08/20 17:56:30 [  186.476609] SQUASHFS error: Unable to read metadata cache entry [2b7f57f4]                                                                 
09/08/20 17:56:30 [  186.476610] SQUASHFS error: Unable to read inode 0x7f3831d7e                                                                               
09/08/20 17:56:30 [  186.477602] mount[838]: mount: /sysroot: can't read superblock on /dev/loop1.                                                              
09/08/20 17:56:30 [  186.514916] systemd[1]: sysroot.mount: Mount process exited, code=exited status=32                                                         
09/08/20 17:56:30 [  186.514950] systemd[1]: sysroot.mount: Failed with result 'exit-code'.                                                                     
09/08/20 17:56:30 [  186.515226] systemd[1]: Failed to mount /sysroot.                                                                                          
09/08/20 17:56:30 [[0;1;31mFAILED[0m] Failed to mount /sysroot.                       


2. Public mirror OCP 4.6 RHCOS version: 4.6.0-0.nightly-s390x-2020-08-26-181417
===============================================================================  
1. With the rhcos-4.6.0-0.nightly-s390x-2020-08-26-181417 build available at https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/pre-release/latest-4.6/ there are both "legacy" and "live" installer modules as follows:
   rhcos-4.6.0-0.nightly-s390x-2020-08-26-181417-dasd.s390x.raw.gz
   rhcos-4.6.0-0.nightly-s390x-2020-08-26-181417-installer-initramfs.s390x.img
   rhcos-4.6.0-0.nightly-s390x-2020-08-26-181417-installer-kernel-s390x
   rhcos-4.6.0-0.nightly-s390x-2020-08-26-181417-installer.s390x.iso
   rhcos-4.6.0-0.nightly-s390x-2020-08-26-181417-live-initramfs.s390x.img
   rhcos-4.6.0-0.nightly-s390x-2020-08-26-181417-live-kernel-s390x
   rhcos-4.6.0-0.nightly-s390x-2020-08-26-181417-live-rootfs.s390x.img
   rhcos-4.6.0-0.nightly-s390x-2020-08-26-181417-live.s390x.iso
   rhcos-4.6.0-0.nightly-s390x-2020-08-26-181417-metal.s390x.raw.gz

2. This OCP 4.6 rhcos build's "legacy" installer is successful for both zVM ECKD and FCP based OCP 4.6 cluster installations.



3. Red Hat CI OCP 4.6 RHCOS versions: 46.82.202009042339-0 and rhcos-46.82.202009071639-0 builds
================================================================================================
1. With the Red Hat CI OCP 4.6 RHCOS 46.82.202009042339-0 version, there are only "live" installer modules as follows (and no "legacy" installer versions).
   rhcos-46.82.202009042339-0-dasd.s390x.raw.gz
   rhcos-46.82.202009042339-0-live-initramfs.s390x.img
   rhcos-46.82.202009042339-0-live-kernel-s390x
   rhcos-46.82.202009042339-0-live-rootfs.s390x.img
   rhcos-46.82.202009042339-0-metal.s390x.raw.gz

2. With the Red Hat CI OCP 4.6 RHCOS 46.82.202009071639-0 version, there are only "live" installer modules as follows (and no "legacy" installer versions).
   rhcos-46.82.202008271939-0-dasd.s390x.raw.gz
   rhcos-46.82.202008271939-0-live-initramfs.s390x.img
   rhcos-46.82.202008271939-0-live-kernel-s390x
   rhcos-46.82.202008271939-0-live-rootfs.s390x.img
   rhcos-46.82.202008271939-0-metal4k.s390x.raw.gz
   rhcos-46.82.202008271939-0-metal.s390x.raw.gz

3. Given there are only "live" installer versions available for these 2 most recent Red Hat CI OCP 4.6 rhcos builds (and no "legacy" installer versions), is this consistent with only the "live" version of the OCP 4.6 installer being supported for OCP 4.6 (and going forward from approximately 09/04/2020)?


4. Public mirror OCP 4.6 RHCOS version availability question
============================================================
Similar to the OCP 4.6 RHCOS 4.6.0-0.nightly-s390x-2020-08-26-181417 public mirror build, will subsequent OCP 4.6 rhcos "live" only builds also be made available at https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/pre-release/latest-4.6/?


5. OCP 4.6 Live installer documentation availability question
=============================================================
Just checking to see if any OCP 4.6 live installer documentation is now available, in draft or any other form?  


Thank you,
Kyle

Comment 67 Colin Walters 2020-09-09 14:54:40 UTC
> is this consistent with only the "live" version of the OCP 4.6 installer being supported for OCP 4.6 (and going forward from approximately 09/04/2020)?

Yes, that's been the plan since https://github.com/openshift/enhancements/blob/master/enhancements/rhcos/liveisoinstall.md

> Just checking to see if any OCP 4.6 live installer documentation is now available, in draft or any other form?  

I'd reference the enhancement but we do need to start migrating that into the product documentation.

> Similar to the OCP 4.6 RHCOS 4.6.0-0.nightly-s390x-2020-08-26-181417 public mirror build, will subsequent OCP 4.6 rhcos "live" only builds also be made available at https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/pre-release/latest-4.6/?

Yes but unfortunately that's a manual process at the moment.

For now you can also access the latest "installer pinned" images at:
https://github.com/openshift/installer/blob/master/data/data/rhcos.json

See also https://github.com/openshift/enhancements/pull/201

> 09/08/20 17:56:30 [  186.476605] SQUASHFS error: zlib decompression failed, data probably corrupt    

So this is clearly the fatal error.  First, have you validated the sha256 checksum of the downloaded ISO or PXE media?  

How reproducible is this?
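For reference, checksum validation of downloaded media follows the usual sha256sum pattern (illustrative only; the file below is a stand-in for the real rootfs/ISO artifact and its published checksum list):

```shell
# Stand-in for a downloaded artifact; in practice this would be e.g.
# rhcos-46.82.202009042339-0-live-rootfs.s390x.img.
printf 'example image contents\n' > artifact.img

# The mirror publishes a checksum file alongside the images; simulate it here.
sha256sum artifact.img > sha256sum.txt

# Verification: prints "artifact.img: OK" on success and exits non-zero,
# printing FAILED, if the file is corrupt.
sha256sum -c sha256sum.txt
```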

Comment 68 Benjamin Gilbert 2020-09-09 14:59:13 UTC
> Just checking to see if any OCP 4.6 live installer documentation is now available, in draft or any other form?

Documentation is being developed but I'm not aware of any drafts yet.

> So this is clearly the fatal error.  First, have you validated the sha256 checksum of the downloaded ISO or PXE media?  How reproducible is this?

See comment 52, comment 58, and comment 62.  The hash of the squashfs file is correct.  Some reporters seem to encounter the problem consistently and some don't encounter it.

Comment 69 krmoser 2020-09-10 05:56:18 UTC
Benjamin, Colin, and Prashanth,

Thanks for all the information and your assistance.  We've been continuing to investigate, and here is some additional information and some questions to help toward resolving this issue.

1. For zKVM, we have been able to successfully install the OCP 4.6 RHCOS 46.82.202009042339-0 and the OCP 4.6.0-0.nightly-s390x-2020-09-05-222506 build on ECKD (DASD) storage using the "live" installer on an IBM z14 server, but not an IBM z15 server.

2. For zVM, our BOE colleague Wolfgang Voesch has successfully installed OCP 4.6 with the "live" installer on an IBM z13 server for both ECKD (DASD) and FCP storage.  Please see comment 65 for the specifics.

3. Would you know what System z model(s) that you have been successfully installing your OCP 4.6 clusters with the "live" installer?  z13, z14, and/or z15?

4. To determine the IBM server Machine Type, please run this command from your bastion and master nodes: "grep Type: /proc/sysinfo"
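A guarded version of that check, with the machine-type mapping spelled out (the Type-to-generation mapping is my annotation for convenience, not from the thread):

```shell
# Print the IBM Z machine type; only meaningful on an s390x kernel.
# Common types: 2964/2965 = z13/z13s, 3906/3907 = z14/z14 ZR1,
# 8561/8562 = z15 T01/T02.
if [ -r /proc/sysinfo ]; then
    grep '^Type:' /proc/sysinfo
else
    echo "/proc/sysinfo not present (not an s390x system)"
fi
```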


Thank you,
Kyle

Comment 70 wvoesch 2020-09-10 12:12:13 UTC
Hi all,


I can confirm that RHCOS 46.82.202009042339-0 does not install on z15 on minidisk using the installer versions 4.6.0-0.nightly-s390x-2020-09-07-065507 or 4.6.0-0.nightly-s390x-2020-09-05-222506.
I have successfully installed OCP on a z13 with the exact same images. 
Both times the error is:
[    4.235254] localhost coreos-livepxe-rootfs[797]: Fetching rootfs image from http://172.18.73.49/redhat/alkl/rhcos/nightly/PE/rhcos-46.82.202009042339-0//rhcos-46.82.202009042339-0-live-rootfs.img...
[    6.179921] localhost systemd[1]: Started Acquire live PXE rootfs image.
[    6.180175] localhost systemd[1]: Mounting /sysroot...
[    6.180493] localhost systemd[1]: Starting Persist osmet files (PXE)...
[    6.204031] localhost systemd[1]: Started Persist osmet files (PXE).
[    6.320051] localhost mount[816]: mount: /sysroot: can't read superblock on /dev/loop1.
[    6.325779] localhost systemd-journald[259]: Missed 33 kernel messages
[    6.318010] localhost kernel: squashfs: version 4.0 (2009/01/31) Phillip Lougher
[    6.319213] localhost kernel: SQUASHFS error: zlib decompression failed, data probably corrupt
[    6.319215] localhost kernel: SQUASHFS error: squashfs_read_data failed to read block 0x2b7f57f4
[    6.319216] localhost kernel: SQUASHFS error: Unable to read metadata cache entry [2b7f57f4]
[    6.319218] localhost kernel: SQUASHFS error: Unable to read inode 0x7f3831d7e
[    6.358519] localhost systemd[1]: sysroot.mount: Mount process exited, code=exited status=32
[    6.358567] localhost systemd[1]: sysroot.mount: Failed with result 'exit-code'.
[    6.358586] localhost systemd[1]: Failed to mount /sysroot.
If you need the rdsosreport from both runs, please let me know. 


I was able to install RHCOS 46.82.202009091339-0 on z15 on minidisk using the installer version 4.6.0-0.nightly-s390x-2020-09-10-082639. This test was done in the exact same environment as the failed installations from above.

Best,
Wolfgang

Comment 71 Nikita Dubrovskii (IBM) 2020-09-10 15:33:01 UTC
Hi all,

Just checked rhcos-46.82.202009090139-0-live-{rootfs,kernel,initramfs}.s390x.img from https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.6-s390x and it works.
I also checked a manually built rhcos-46.82 on zVM with DASD; it works as well.

Example of the parm file I use:
  rd.neednet=1                                                                                                           
  rd.znet=qeth,0.0.bdf0,0.0.bdf1,0.0.bdf2,layer2=1,portno=0                                                              
  console=ttysclp0                                                                                                                                                                                                                              
  ip=172.18.142.6::172.18.0.1:255.255.255.0:fmp:encbdf0:off                                                                                                                                                                                     
  nameserver=172.18.78.6                                                                                                 
  coreos.inst=yes                                          
  coreos.inst.insecure=yes                                                                                               
  coreos.inst.image_url=http://172.18.10.243/rhcos-46.82.202009101300-0-dasd.s390x.raw                                   
  coreos.inst.ignition_url=http://172.18.10.243/fcos.ign                                                                 
  coreos.live.rootfs_url=http://172.18.10.243/rhcos-46.82.202009101300-0-live-rootfs.s390x.img                           
  zfcp.allow_lun_scan=0                                                                                                  
  cio_ignore=all,!condev                                   
  coreos.inst.install_dev=/dev/disk/by-path/ccw-0.0.6608
  rd.dasd=0.0.6608  


Best, Nikita

Comment 72 Benjamin Gilbert 2020-09-11 17:34:11 UTC
Nikita, per comment 69, what hardware type were you able to successfully install on?

Since the problem seems to be hardware-specific, and seems to be happening on a known-good squashfs image (see comment 58 and comment 62), I'm wondering whether there might be some sort of kernel bug with squashfs decompression.

Comment 73 Benjamin Gilbert 2020-09-11 17:55:02 UTC
Wolfgang, between the non-working and working RHCOS releases mentioned in comment 70, RHCOS switched from a development kernel snapshot to an 8.2 production kernel.  That might be the culprit.

Kyle or anyone else, can you still reproduce the squashfs error with RHCOS 46.82.202009091339-0 or above?

Best,
Benjamin

Comment 74 Nikita Dubrovskii (IBM) 2020-09-12 08:24:34 UTC
> Nikita, per comment 69, what hardware type were you able to successfully install on?

I have:
Type:   IBM 8561, which is z15. 
Kernel: Linux fmp 4.18.0-193.19.1.el8_2.s390x #1 SMP Wed Aug 26 15:15:48 EDT 2020 s390x s390x s390x GNU/Linux

I can test some other kernels:

kernel-4.18.0-147.0.2.el8_1.s390x.rpm
kernel-4.18.0-147.0.3.el8_1.s390x.rpm
kernel-4.18.0-147.3.1.el8_1.s390x.rpm
kernel-4.18.0-147.5.1.el8_1.s390x.rpm
kernel-4.18.0-147.8.1.el8_1.s390x.rpm
kernel-4.18.0-147.el8.s390x.rpm
kernel-4.18.0-193.1.2.el8_2.s390x.rpm
kernel-4.18.0-193.13.2.el8_2.s390x.rpm
kernel-4.18.0-193.14.3.el8_2.s390x.rpm
kernel-4.18.0-193.6.3.el8_2.s390x.rpm
kernel-4.18.0-193.el8.s390x.rpm
kernel-4.18.0-80.11.1.el8_0.s390x.rpm
kernel-4.18.0-80.11.2.el8_0.s390x.rpm
kernel-4.18.0-80.1.2.el8_0.s390x.rpm
kernel-4.18.0-80.4.2.el8_0.s390x.rpm
kernel-4.18.0-80.7.1.el8_0.s390x.rpm
kernel-4.18.0-80.7.2.el8_0.s390x.rpm
kernel-4.18.0-80.el8.s390x.rpm

Any suggestion which version to select first?

Comment 75 Benjamin Gilbert 2020-09-14 04:26:23 UTC
Hi Nikita,

The failing kernel was 4.18.0-211.el8; recent RHCOS images switched to an older one.  If you're not seeing the issue on a current image, and if no one else can reproduce it there either, I think no additional testing should be needed.

Best,
Benjamin
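For anyone triaging a node against that finding, a quick kernel check can be sketched like this (the version-prefix match is an assumption based on the build string quoted above):

```shell
# Flag the kernel snapshot associated with the squashfs decompression
# failures; any 4.18.0-211.el8 build is treated as suspect.
kver=$(uname -r)
case "$kver" in
    4.18.0-211.el8*) verdict="known-bad for squashfs on z15" ;;
    *)               verdict="not the known-bad 4.18.0-211.el8 build" ;;
esac
echo "$kver: $verdict"
```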

Comment 76 Micah Abbott 2020-09-14 17:28:14 UTC
(In reply to Benjamin Gilbert from comment #75)
> Hi Nikita,
> 
> The failing kernel was 4.18.0-211.el8; recent RHCOS images switched to an
> older one.  If you're not seeing the issue on a current image, and if no one
> else can reproduce it there either, I think no additional testing should be
> needed.
> 
> Best,
> Benjamin

New 4.6 pre-release images were published to the mirror at the end of last week - https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest-4.6/

They include the intended kernel (4.18.0-193.19.1.el8_2) that will go out with RHCOS 4.6

Comment 77 krmoser 2020-09-16 13:59:58 UTC
Benjamin, Colin, Nikita, and Prashanth,

Thanks for all the information and assistance.  Just an update that with the OCP 4.6 live installer we have successfully installed zVM ECKD and FCP storage based clusters with the OCP 4.6.0-0.nightly-s390x-2020-09-15-223248 build and the RHCOS 4.6.0-0.nightly-s390x-2020-09-10-112115 build. 

We'll be in touch with some additional updates, information, and questions within the next day.

Thank you,
Kyle

Comment 78 Benjamin Gilbert 2020-09-16 15:20:49 UTC
Great, I'll close this out.  Thanks for all the help, everyone!

Best,
Benjamin

Comment 79 krmoser 2020-09-16 15:40:40 UTC
Benjamin,

Please don't close this bugzilla until we've had a few more days to confirm, provide some additional information, and also follow up here with some questions to better understand this issue going forward.

Thank you,
Kyle

Comment 80 Benjamin Gilbert 2020-09-16 15:47:09 UTC
Hi Kyle,

Do we have reason to think the squashfs mount failure hasn't actually been fixed?  Otherwise, we should get this bug handed off to QA (which is what I meant by "closed out"; apologies for being unclear) so we can verify it from our end.

In any event, you can still follow up here as needed.

Best,
Benjamin

Comment 82 Philip Chan 2020-09-17 03:20:26 UTC
Thank you again for your assistance.  I wanted to provide an update for zKVM, where the live installer has now succeeded on both the z14 and z15.  We've been able to successfully install on zKVM ECKD and FCP storage based clusters using OCP 4.6.0-0.nightly-s390x-2020-09-17-004643 and RHCOS 4.6.0-0.nightly-s390x-2020-09-10-112115.

Thank you,
-Phil

Comment 83 Benjamin Gilbert 2020-09-24 06:28:11 UTC
Marking VERIFIED based on comment 77 and comment 82.

Comment 86 errata-xmlrpc 2020-10-27 16:22:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
