Bug 2031637

Summary: zVM cluster installation fails for OCP 4.10 4.10.0-0.nightly-s390x-2021-12-10-233457 build with RHCOS 410.84.202112091602-0 build: fails to install with "executing lszdev on /dev/dasda; No such file or directory (os error 2)" message
Product: OpenShift Container Platform Reporter: krmoser
Component: Multi-Arch Assignee: Dan Horák <dhorak>
Multi-Arch sub component: IBM P / Z QA Contact: Douglas Slavens <dslavens>
Status: CLOSED NEXTRELEASE Docs Contact:
Severity: medium    
Priority: high CC: aos-bugs, chanphil, christian.lapolt, danili, dhorak, dslavens, fleber, Holger.Wolf, jcajka, jschinta, madeel, psundara
Version: 4.10   
Target Milestone: ---   
Target Release: 4.10.z   
Hardware: s390x   
OS: Linux   
Whiteboard: multi-arch
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2032486 (view as bug list) Environment:
Last Closed: 2022-03-07 15:48:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2032507    
Bug Blocks: 2009709    

Description krmoser 2021-12-13 06:06:14 UTC
Description of problem:
1. zVM environment OCP 4.10 on Z cluster installation fails for OCP 4.10 4.10.0-0.nightly-s390x-2021-12-10-233457 build with RHCOS 410.84.202112091602-0 build:  fails to install with "executing lszdev on /dev/dasda; No such file or directory (os error 2)" message.


2. Here are the error messages received for each bootstrap, master (control), worker (compute) node when attempting to boot and install RHCOS 410.84.202112091602-0:

12/12/21 23:59:06 [   13.648751] coreos-installer-service[1229]: coreos-installer install /dev/dasda --ignition-url http://bastion.pok-25.ocptest.pok.stglabs.i 
12/12/21 23:59:06 [   13.747032] coreos-installer-service[1229]: Error: getting sector size of /dev/dasda                                                       
12/12/21 23:59:06 [   13.747120] coreos-installer-service[1229]: Caused by:                                                                                     
12/12/21 23:59:06 [   13.747140] coreos-installer-service[1229]:     0: executing lszdev on /dev/dasda                                                          
12/12/21 23:59:06 [   13.747172] coreos-installer-service[1229]:     1: No such file or directory (os error 2)                                                  
12/12/21 23:59:06 [FAILED] Failed to start CoreOS Installer.


3. Inspection of the /usr/sbin directory in maintenance mode indicates that, as expected by the above messages, the /usr/sbin/lszdev file is not present.
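The manual inspection in item 3 could be scripted. The following is a minimal sketch, not part of the original report: the function name missing_tools is hypothetical, and it only assumes that the installer needs lszdev (per the error above) to be an executable file in the given directory.

```shell
# Print every tool from the argument list that is not an executable file in DIR.
# Usage: missing_tools DIR TOOL...
# (missing_tools is a hypothetical helper name used for illustration only.)
missing_tools() {
    dir=$1
    shift
    for tool in "$@"; do
        # -x is true only if the file exists and is executable
        [ -x "$dir/$tool" ] || printf '%s\n' "$tool"
    done
}
```

For example, running "missing_tools /usr/sbin lszdev fdasd" from the maintenance shell would print one line for each of the two tools that is absent, and nothing if both are present.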


Version-Release number of selected component (if applicable):
1. OCP 4.10 nightly build 4.10.0-0.nightly-s390x-2021-12-10-233457
2. RHCOS build 410.84.202112091602-0

How reproducible:
Consistently reproducible.

Steps to Reproduce:
1. Attempt to install OCP 4.10 nightly build 4.10.0-0.nightly-s390x-2021-12-10-233457 with RHCOS 410.84.202112091602-0.

Actual results:
Bootstrap, master (control plane), and worker (compute) nodes all fail to boot and install RHCOS with the same error message as contained in the first section of this bugzilla.

Expected results:
The bootstrap, master (control plane), and worker (compute) nodes should all successfully install the RHCOS build.

Additional info:


Thank you.

Comment 1 Prashanth Sundararaman 2021-12-13 19:13:45 UTC
Looks like lszdev is not in s390utils-core. According to this comment a while ago: https://github.com/coreos/fedora-coreos-config/pull/756#issuecomment-754699901 Dan Horak did move fdasd and lszdev to s390utils-core, but looks like it is only present in version 2.16 which is part of RHEL 8.5

Dan,

Can this change be backported to 8.4 s390utils-core to fix this issue?

Thanks
Prashanth

Comment 2 Prashanth Sundararaman 2021-12-13 19:33:37 UTC
rhcos uses the 8.4 EUS rpms to build: http://rhsm-pulp.corp.redhat.com/content/eus/rhel8/8.4/s390x/baseos/os/Packages/s/ and they have an older version of s390utils-core which does not contain fdasd and lszdev
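Since comment 1 indicates that lszdev and fdasd are only in s390utils-core from version 2.16 onward, a version check is one way to tell whether a given build is affected. This is a rough sketch under that assumption: version_ge is a hypothetical helper name, and it relies on the -V (version sort) option of GNU sort.

```shell
# Return success (0) when version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
# (version_ge is a hypothetical helper name used for illustration only.)
version_ge() {
    # After a version sort, the smaller version comes first; if that is $2,
    # then $1 is at least as new as $2.
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

On a node, the installed version could be obtained with "rpm -q --queryformat '%{VERSION}' s390utils-core" and passed as the first argument, with 2.16 as the second. The 2.16 threshold is taken from comment 1 and does not account for later backports to older versions.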

Comment 3 Dan Li 2021-12-13 23:32:36 UTC
Setting a needinfo for Dan per Comment 1

Comment 4 krmoser 2021-12-14 06:53:21 UTC
Folks,

FYI and for OCP on Z support documentation purposes: OCP 4.10 on Z RHCOS build 410.84.202112132002-0 fails with the same issue.

Thank you,
Kyle

Comment 5 Dan Horák 2021-12-14 09:43:45 UTC
(In reply to Prashanth Sundararaman from comment #1)
> Looks like lszdev is not in s390utils-core. According to this comment a
> while ago:
> https://github.com/coreos/fedora-coreos-config/pull/756#issuecomment-754699901
> Dan Horak did move fdasd and lszdev to s390utils-core, but looks
> like it is only present in version 2.16 which is part of RHEL 8.5
> 
> Dan,
> 
> Can this change be backported to 8.4 s390utils-core to fix this issue?

yes, that shouldn't be a big issue. Could you somehow clone this bug into a RHEL bug for s390utils? I will then clone it further for the 8.4.0.z zstream.

Comment 6 Prashanth Sundararaman 2021-12-14 15:00:17 UTC
Thanks Dan. opened https://bugzilla.redhat.com/show_bug.cgi?id=2032486 against RHEL.

Comment 7 Muhammad Adeel (IBM) 2021-12-14 16:02:00 UTC
The vmcp command, which was available in older RHCOS releases, is now also missing from RHCOS. Could you also add that?

Comment 8 Dan Horák 2021-12-14 16:38:40 UTC
yes, we can do that, vmcp is moving to core in 8.6.0 via bug #2021071

Comment 9 Prashanth Sundararaman 2021-12-16 18:27:55 UTC
I'm wondering now whether this is a blocker. We can also use an older bootimage to get the install to work.

Kyle,

Could you use an older 4.10 rhcos image for the installation or even a 4.9 rhcos image to install the 4.10 payload ? if that works, we can remove the blocker flag.

Thanks
Prashanth

Comment 10 krmoser 2021-12-16 19:20:58 UTC
Prashanth,

Thank you for the assistance.  

1. To work around this issue for zVM environments, we have been using the last/latest OCP 4.10 on Z RHCOS build that does boot properly for the bootstrap, master/control, and worker/compute nodes, as it contains the required s390x tools.

2. This latest OCP 4.10 on Z RHCOS build that does boot properly is the OCP 4.10 on Z RHCOS 410.84.202112062233-0 build.

3. As you know, although using the RHCOS 410.84.202112062233-0 build to boot the bootstrap, master/control, and worker/compute nodes is a workaround for the issue, in doing so we do not exercise any changes to the OCP 4.10 on Z RHCOS builds released after this RHCOS 410.84.202112062233-0 build for the boot process.  

4. Any boot related changes introduced into these post OCP 4.10 on Z RHCOS 410.84.202112062233-0 builds will not be tested until a subsequent OCP 4.10 on Z RHCOS build is made available with the required s390x tools to properly boot the bootstrap, master/control, and worker/compute nodes.

Thank you,
Kyle

Comment 11 Prashanth Sundararaman 2021-12-16 21:24:46 UTC
Hi Kyle,

Thanks for testing it and getting back so quickly. I certainly understand your concern about missing boot related changes, but those are few and far between. Also, after install, the OS will be pivoted to the one present in the OCP payload as part of the machine-os-content, so all your libraries will be up to date after install.

I am removing the blocker flag for this as there is a workaround.

Thanks
Prashanth

Comment 12 Prashanth Sundararaman 2021-12-22 03:33:57 UTC
Kyle,

We have latest 4.10 rhcos images with the version of s390utils-core containing the missing binaries: https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.10-s390x&release=410.84.202112212202-0#410.84.202112212202-0 . If you could test this out that would be great.

Thanks
Prashanth

Comment 13 krmoser 2021-12-22 13:59:21 UTC
Prashanth,

Thanks for the update and OCP 4.10 RHCOS build.  I'll test today with this OCP 4.10 RHCOS build and provide an update.

Thank you,
Kyle

Comment 14 krmoser 2021-12-22 17:42:38 UTC
Prashanth,

Using the OCP 4.10 on Z RHCOS 410.84.202112212202-0 build to boot the bootstrap, master/control, and worker/compute nodes, we have successfully installed the 5 following OCP 4.10 on Z builds in a zVM environment.  These 5 OCP 4.10 on Z builds subsequently installed their required OCP 4.10 on Z RHCOS builds as part of the normal installation process.

1. OCP 4.10.0-0.nightly-s390x-2021-12-16-185334
2. OCP 4.10.0-0.nightly-s390x-2021-12-18-034912
3. OCP 4.10.0-0.nightly-s390x-2021-12-20-215258
4. OCP 4.10.0-0.nightly-s390x-2021-12-21-231942
5. OCP 4.10.0-0.nightly-s390x-2021-12-22-053640


Thank you,
Kyle

Comment 15 Prashanth Sundararaman 2021-12-22 19:33:47 UTC
Thanks Kyle. I'll keep this open till https://bugzilla.redhat.com/show_bug.cgi?id=2032507 makes it, but I will lower the severity.

Comment 16 krmoser 2021-12-22 19:47:48 UTC
Prashanth,


Thank you.  Sounds good.

Thank you,
Kyle

Comment 17 Dan Li 2022-01-04 15:58:29 UTC
Hi Dan, do you know if https://bugzilla.redhat.com/show_bug.cgi?id=2032507 will move to ON_QA before the end of the current OpenShift sprint (January 8th)? If not, I'd like to add "reviewed-in-sprint" flag to this bug

Comment 18 Dan Horák 2022-01-04 16:31:30 UTC
I think it's unlikely to meet the January 8th date, I expect progress during the next week.

Comment 19 Dan Li 2022-01-04 16:34:27 UTC
Thank you Dan. Setting the flag.

Comment 20 Dan Li 2022-01-24 14:34:56 UTC
Hi Dan, do you know if the related bug BZ 2032507 would move to ON_QA before the end of the current OpenShift sprint (January 29th)? If not, I'd like to add "reviewed-in-sprint" flag to indicate that the team has reviewed the progress of this bug during this sprint.

Comment 21 Dan Horák 2022-01-26 11:10:23 UTC
(In reply to Dan Li from comment #20)
> Hi Dan, do you know if the related bug BZ 2032507 would move to ON_QA before
> the end of the current OpenShift sprint (January 29th)? If not, I'd like to
> add "reviewed-in-sprint" flag to indicate that the team has reviewed the
> progress of this bug during this sprint.

I don't think so, it's waiting on our QA to do first level of verification.

Comment 22 Dan Li 2022-01-26 11:16:24 UTC
Thanks Dan. Adding reviewed-in-sprint.

Comment 23 krmoser 2022-02-03 13:42:26 UTC
Folks,

In testing the latest OCP 4.10 FC build, 4.10.0-fc.4, with the corresponding RHCOS 410.84.202201280202-0 build, the vmcp command and s390utils-base rpm are not contained within this build.

1. The s390utils-base fdasd and lszdev commands are contained within the RHCOS 410.84.202201280202-0 build, in the s390utils-core-2.16.0-2.el8.s390x rpm.

2. The s390utils-base vmcp command is not contained within the RHCOS 410.84.202201280202-0 build, as the s390utils-base rpm is not contained within this RHCOS build, and the vmcp command has not been copied to the s390utils-core-2.16.0-2.el8.s390x rpm.

   [root@master-0 bin]# which vmcp
   /usr/bin/which: no vmcp in (/root/.local/bin:/root/bin:/sbin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
   [root@master-0 bin]# vmcp
   -bash: vmcp: command not found
   [root@master-0 bin]#


   [root@master-0 bin]# rpm -qa | grep s390utils
   s390utils-core-2.16.0-2.el8.s390x
   [root@master-0 bin]#


3. Given the s390utils-base rpm is not contained within the RHCOS 410.84.202201280202-0 build, other important s390utils commands are missing.  For example, but not limited to:
   (1) /usr/sbin/chccwdev
   (2) /usr/sbin/chchp
   (3) /usr/sbin/chcpumf
   (4) /usr/sbin/chshut
   (5) /usr/sbin/chzcrypt
   (6) /usr/sbin/dasdstat
   (7) /usr/sbin/dasdview
   (8) /usr/sbin/dbginfo.sh
   (9) /usr/sbin/lschp
  (10) /usr/sbin/lscpumf
  (11) /usr/sbin/lscss
  (12) /usr/sbin/lsdasd
  (13) /usr/sbin/lsluns
  (14) /usr/sbin/lsqeth
  (15) /usr/sbin/lstape
  (16) /usr/sbin/lszcrypt
  (17) /usr/sbin/lszfcp


4. For Client debug and support purposes, and general test purposes, it would be very helpful if the full list of s390utils-base commands could be included in the OCP 4.10 on Z RHCOS builds, as with the OCP 4.9 on Z and previous OCP 4.x on Z releases' RHCOS builds.


5. The current workaround for not having the full list of s390utils-base commands within the OCP 4.10 on Z RHCOS builds is to copy them (scp) over from the bastion and/or other RHEL 8.4/8.5 server with the s390utils-base-2.16.0-2.el8.s390x rpm or other appropriate s390utils-base rpm level.  This is problematic, though, as the s390utils-base rpm level will potentially not always be compatible with the   


6. For comparison purposes, the latest OCP 4.9 build, 4.9.19, with its corresponding RHCOS 49.84.202201262102-0 build contains both the s390utils-core and s390utils-base rpms, including the fdasd, lszdev, and vmcp commands in the s390utils-base-2.15.1-5.el8.s390x rpm.

   [root@master-0 ~]# rpm -qa | grep s390utils
   s390utils-core-2.15.1-5.el8.s390x
   s390utils-base-2.15.1-5.el8.s390x
   [root@master-0 ~]#

   [root@master-0 ~]# rpm -qf `which lsdasd `
   s390utils-base-2.15.1-5.el8.s390x
   [root@master-0 ~]# rpm -qf `which fdasd `
   s390utils-base-2.15.1-5.el8.s390x
   [root@master-0 ~]# rpm -qf `which lszdev `
   s390utils-base-2.15.1-5.el8.s390x
   [root@master-0 ~]# rpm -qf `which vmcp `
   s390utils-base-2.15.1-5.el8.s390x
   [root@master-0 ~]#



   [root@master-0 ~]# which vmcp
   /sbin/vmcp
   [root@master-0 ~]#

   [root@master-0 ~]# vmcp
   Usage: vmcp [OPTIONS] CP-command

   z/VM CP command interface

   OPTIONS
    -k, --keepcase     Do not convert CP-command string to uppercase
    -b, --buffer SIZE  Specify buffer size in bytes, kilobytes (k) or megabytes
                       (M). SIZE range from 4096 to 1048576 bytes
    -h, --help         Print this help, then exit
    -v, --version      Print version information, then exit
   [root@master-0 ~]#
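The rpm -qf lookups shown in item 6 could be repeated for any list of commands with a small loop. The following is only a sketch: owning_package is a hypothetical helper name, and it assumes rpm is available on the node. A command that is not on the PATH is reported as not found rather than being passed to rpm.

```shell
# Print "<command>: <owning rpm>" or "<command>: not found" for each argument.
# (owning_package is a hypothetical helper name used for illustration only.)
owning_package() {
    for cmd in "$@"; do
        # command -v fails when the command is not on the PATH
        path=$(command -v "$cmd") || { printf '%s: not found\n' "$cmd"; continue; }
        # rpm -qf reports the package that owns the given file
        printf '%s: %s\n' "$cmd" "$(rpm -qf "$path")"
    done
}
```

For example, "owning_package lsdasd fdasd lszdev vmcp" would report the owning rpm for each command that is installed and flag the ones that are missing.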



Thank you,
Kyle

Comment 24 krmoser 2022-02-03 18:32:23 UTC
Prashanth,

Just a follow-up to our discussion earlier today to document the following clarification:
1. Although the vmcp command is missing from the latest OCP 4.10 on Z RHCOS builds, the actual OCP 4.10 on Z installs continue to succeed since the December 21, 2021 RHCOS update, as documented above in comments 12 through 16, as they contain the lszdev command.

2. I will provide a list here of the requested/highly recommended s390utils-base commands to be included in the s390utils-core rpm for RHCOS 4.10 and beyond, as the s390utils-base rpm will no longer be included in the RHCOS 4.10 and beyond builds.


Thank you,
Kyle

Comment 25 Prashanth Sundararaman 2022-02-04 00:13:51 UTC
Thanks Kyle - could you actually create a new BZ and assign it to the RHEL component with the commands we would need ?

Comment 26 krmoser 2022-02-09 19:08:47 UTC
Prashanth,

Thanks.  I've opened Red Hat RHEL 8.4 bugzilla https://bugzilla.redhat.com/show_bug.cgi?id=2052687 for this issue.

Thank you,
Kyle

Comment 27 Dan Li 2022-02-14 18:14:39 UTC
Hi Dan, as BZ 2032507 has moved to ON_QA, do you know if this bug would progress past ON_QA before the end of this current OpenShift sprint (Feb. 19th)? If not, I'd like to add "reviewed-in-sprint" flag to indicate that the team has reviewed the progress of this bug during this sprint.

Comment 28 Dan Li 2022-02-17 23:58:10 UTC
Setting reviewed-in-sprint, as the dependent bug has just been verified but it may take longer for this bug to reach ON_QA.

Comment 29 Dan Li 2022-03-07 14:10:25 UTC
Hi Dan, as BZ 2032507 has moved to VERIFIED, should this bug be moved to a further state than ASSIGNED? If not, I'd like to add "reviewed-in-sprint" flag to indicate that the team has reviewed the progress of this bug during this sprint.

Comment 30 Dan Horák 2022-03-07 14:40:12 UTC
I think it depends on how the OCP team needs to track the progress. The RHEL update has made progress, so we know it will be released as part of the 8.4.0 batch update 8 (on 2022-04-19). AFAIK, to be consumed before then it still needs some manual action by the OCP/CoreOS team.

Comment 31 Dan Li 2022-03-07 15:48:44 UTC
I had a chat with the team and we think that since the RHEL bug (BZ 2032507) has been verified, we can close this bug for now.

Note for future viewers of this bug: the fix for this bug will ship with the 8.6 s390utils-core.