Description of problem:
An installation with multipath parameters in the parmfile (rd.multipath=default coreos.inst.install_dev=/dev/mapper/mpatha) and hostnames in the parmfile fails. The installation ends in an emergency shell. The network (IP) is configured, but name resolution is not working; pinging another system by IP works.

The same installation (with MP parameters) works if IP addresses are specified instead of hostnames in the parmfile. It also works with hostnames in the parmfile if rd.multipath=default is removed and sda is used instead of /dev/mapper/mpatha. It looks like the MP parameters break the correct setup of name resolution during installation. Not sure if it should be there, but there is no /etc/resolv.conf in the booted Linux (emergency shell).

Version-Release number of selected component (if applicable):
oc version
Client Version: 4.8.0-0.nightly-s390x-2021-06-18-055818
Server Version: 4.8.0-0.nightly-s390x-2021-06-18-055818
Kubernetes Version: v1.21.0-rc.0+120883f

How reproducible:
Install a node with MP parameters and hostnames in the parmfile.

Steps to Reproduce:
1.
2.
3.

Actual results:
Installation ends in an emergency shell.

Expected results:
Installation process works.

Additional info:
Created attachment 1792825 [details] error snapshot
Hi Jonathan, Is this a possible regression caused by https://github.com/coreos/fedora-coreos-config/pull/1011 ? Like the original description says, if the ignition url is configured with a hostname, the coreos-installer errors out. If configured with an ip address, it works. Thanks Prashanth
Setting "Blocker-" after discussing with the team. Based on these reasons: 1. configuring multipath as a day 2 operation still works 2. specifying ip address instead of hostname works
Hmm, I'm not sure how this could be multipath related. It looks a lot like https://bugzilla.redhat.com/show_bug.cgi?id=1967483, except in the initrd. Full logs from the initrd would be helpful, esp. NetworkManager.
Funnily enough, coreos-livepxe-rootfs.service succeeds, so it is able to resolve the hostname there, but not when running coreos-installer.
(In reply to Jonathan Lebon from comment #4)
> Hmm, I'm not sure how this could be multipath related.
> It looks a lot like https://bugzilla.redhat.com/show_bug.cgi?id=1967483,
> except in the initrd.

Sorry, this is incorrect. This BZ matches rhbz#1967483 in that respect as well, since coreos-installer.service runs in the real root. (I'm so used to "emergency shell" referring to the initrd emergency shell that my brain jumped to that. :) )
Today I did some testing of a custom rhcos-4.8 (with https://github.com/coreos/coreos-installer/pull/564); the Ignition config gets downloaded from github.com, and the system works without any DNS issues. (Here is the cmdline:
```
Kernel command line: rd.neednet=1 dfltcc=off random.trust_cpu=on rd.znet=qeth,0.0.bdf0,0.0.bdf1,0.0.bdf2,layer2=1,portno=0 console=ttysclp0 ip=172.18.142.3::172.18.0.1:255.254.0.0:coreos:encbdf0:off nameserver=172.18.0.1 coreos.inst=yes coreos.inst.insecure=yes coreos.inst.ignition_url=https://raw.githubusercontent.com/nikita-dubrovskii/s390x-ignition-configs/master/ignition.ign coreos.live.rootfs_url=http://172.18.10.243/rhcos-48.84.202106231130-0-live-rootfs.s390x.img zfcp.allow_lun_scan=0 cio_ignore=all,!condev rd.zfcp=0.0.1903,0x500507630910d435,0x408240d100000000 rd.zfcp=0.0.1943,0x500507630914d435,0x408240d100000000 coreos.inst.install_dev=sda coreos.inst.mpath=yes
```
)

Using another zVM/Linux as the HTTP server with the Ignition config also works (http://m1314001.lnxne.boe:8080/ignition/ignition.ign). But using http://bastion.ocp-m1314001.lnxne.boe:8080/ignition/ignition.ign doesn't work, so I guess there is something wrong with the bastion node's config (as you can see, the same m1314001 is used as the HTTP server).
> Using another zVM/Linux as http-server with ignition config - also works
> (http://m1314001.lnxne.boe:8080/ignition/ignition.ign). But using
> http://bastion.ocp-m1314001.lnxne.boe:8080/ignition/ignition.ign - doesn't
> work, so i guess there is smth wrong with bastion node's config (as you can
> see same m1314001 is used as http-server).

That's interesting, thanks for the tests. I did some interactive debugging via screenshare with @madeel on this, and indeed we saw the install pass without multipath enabled and fail with it enabled. I'm still not sure how multipath can affect DNS resolution, unless it simply makes an existing race easier to trigger. If that's the case, then it might be helped by https://github.com/coreos/coreos-installer/pull/565.

I've made a scratch build with that patch:
http://brew-task-repos.usersys.redhat.com/repos/scratch/jlebon/coreos-installer/0.9.0/7.pr565.rhaos4.8.el8/s390x/

Re-hosted RPMs in a public space if you don't have VPN access:
https://jlebon.fedorapeople.org/coreos-installer-0.9.0-7.pr565.rhaos4.8.el8.s390x.rpm
https://jlebon.fedorapeople.org/coreos-installer-bootinfra-0.9.0-7.pr565.rhaos4.8.el8.s390x.rpm

Developers with access to an s390x machine who can reproduce this bug should be able to build an RHCOS image with those RPMs and test that.
OK, looks like I've got what's wrong here:

1) With 'coreos.inst.install_dev=/dev/mapper/mpatha rd.multipath=default' and `hostname.com/ignition.conf` in the parm file: coreos-installer cannot fetch the Ignition config (DNS), but first CoreOS tries to propagate 'multipath.conf' to '/sysroot', so we end up with a failure:

```
coreos-propagate-multipath-conf[926]: cp: cannot create regular file '/sysroot/etc/multipath.conf': Read-only file system
systemd[1]: coreos-propagate-multipath-conf.service: Main process exited, code=exited, status=1/FAILURE
...
systemd[1]: Reached target Emergency Mode.
```

2) With `coreos.inst.install_dev=/dev/mapper/mpatha rd.multipath=default` and `1.2.3.4/ignition.conf` in the parm file: coreos-installer can fetch the Ignition config (no DNS needed), but fails with `kpartx` (propagation of 'multipath.conf' to '/sysroot' also failed):

```
coreos-propagate-multipath-conf[926]: cp: cannot create regular file '/sysroot/etc/multipath.conf': Read-only file system
...
systemd[1]: Reached target Emergency Mode.
...
[   23.522376] coreos-installer-service[1859]: device-mapper: resume ioctl on mpatha4 failed: Invalid argument
[   23.522453] coreos-installer-service[1859]: resume failed on mpatha4
[   23.811211] coreos-installer-service[1859]: Error: getting partition table for /dev/mapper/mpatha
[   23.811374] coreos-installer-service[1859]: Caused by:
[   23.811395] coreos-installer-service[1859]: "kpartx" "-u" "-n" "/dev/dm-0" failed with exit code: 1
Failed to start CoreOS Installer.
```

If we take a look at /etc/resolv.conf without multipath, we have a valid config:
```
search lnxne.boe
nameserver 172.18.0.1
```

But with `rd.multipath=default` it's empty; systemd had already failed, so to me it does not look like a DNS issue. And installing this way also makes no sense: during firstboot CoreOS starts without multipath, so I don't see any reason for installing CoreOS with `rd.multipath=default` right now.

I would consider this not a bug, or at least not a DNS bug.
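For anyone hitting the same emergency shell, a quick check is whether /sysroot is still mounted read-only at that point, which would explain the cp failure above. This is just a sketch: the `is_ro` helper is hypothetical (not part of the image), and it assumes `findmnt` from util-linux is available in the live environment:

```shell
# Hypothetical helper: report whether a given mount point carries the 'ro'
# mount option. Returns non-zero if the path is not a mount point at all.
is_ro() {
  findmnt -rno OPTIONS "$1" 2>/dev/null | tr ',' '\n' | grep -qx 'ro'
}

# Mirrors the failure above: cp could not write /sysroot/etc/multipath.conf.
if is_ro /sysroot; then
  echo "/sysroot is mounted read-only"
else
  echo "/sysroot is writable or not mounted"
fi
```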
(In reply to Nikita Dubrovskii (IBM) from comment #9) > Ok, look's like i've got what' wrong here: > > 1) with 'coreos.inst.install_dev=/dev/mapper/mpatha rd.multipath=default' > and `hostname.com/ignition.conf` and in the parm file: What is that karg? Do you mean `ip=...`? Can you show the full parmfile you used? > coreos-installer cannot fetch ignition (DNS), but! at first coreos tries to > propagate 'multipat.conf' to the '/sysroot', so we end up with a failure: > > ``` > coreos-propagate-multipath-conf[926]: cp: cannot create regular file > '/sysroot/etc/multipath.conf': Read-only file system > systemd[1]: coreos-propagate-multipath-conf.service: Main process exited, > code=exited, status=1/FAILURE > ... > > systemd[1]: Reached target Emergency Mode. > ``` Ouch good catch. So we continue on to the real root even if the service failed. > 2) with `coreos.inst.install_dev=/dev/mapper/mpatha rd.multipath=default` > and `1.2.3.4/ignition.conf` in the parm file: > coreos-installer can fetch ignition (no DNS) , but fails with `kpartx` > (propagation of 'multipat.conf' to the '/sysroot' also failed): > > ``` > coreos-propagate-multipath-conf[926]: cp: cannot create regular file > '/sysroot/etc/multipath.conf': Read-only file system > ... > systemd[1]: Reached target Emergency Mode. > ... > > [ 23.522376] coreos-installer-service[1859]: device-mapper: resume ioctl > on mpatha4 failed: Invalid argument > [ 23.522453] coreos-installer-service[1859]: resume failed on mpatha4 > [ 23.811211] coreos-installer-service[1859]: Error: getting partition > table for /dev/mapper/mpatha > [ 23.811374] coreos-installer-service[1859]: Caused by: > [ 23.811395] coreos-installer-service[1859]: "kpartx" "-u" "-n" > "/dev/dm-0" failed with exit code: 1 > Failed to start CoreOS Installer. 
> ```
>
> If we take a look at /etc/resolv.conf without multipath, we have valid
> config:
> ```
> search lnxne.boe
> nameserver 172.18.0.1
> ```
>
> But with `rd.multipath=default` it's empty, systemd already had failed, so
> for me it looks not like a DNS issue.

OK, so I think there are two issues here:
1. `coreos-propagate-multipath-conf.service` doesn't have
```
OnFailure=emergency.target
OnFailureJobMode=isolate
```
2. We have no ordering between `coreos-propagate-multipath-conf.service` and `sysroot-etc.mount`.

<time passes>

Filed: https://github.com/coreos/fedora-coreos-config/pull/1077

Can you try that out?

> And installing this way also makes no sense - during fristboot coreos starts
> without multipath,
> so i don't see any reason for installing coreos with `rd.multipath=default`
> right now.

It's valid to turn on multipath at installation time so that coreos-installer can copy the content on top of the multipath target (for the same reasons as https://github.com/coreos/fedora-coreos-config/pull/1011). coreos-installer should support this already (see e.g. https://github.com/coreos/coreos-installer/pull/499), but if we hit issues with kpartx there, let's work on fixing them.
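For illustration, the two fixes listed above could look roughly like this as directives in the unit file (a sketch of the approach only, not the exact contents of the PR):

```
# coreos-propagate-multipath-conf.service (sketch, not the actual patch)
[Unit]
# 1. Fail into the emergency target instead of silently continuing to boot:
OnFailure=emergency.target
OnFailureJobMode=isolate
# 2. Order the copy after /sysroot/etc is actually mounted writable:
After=sysroot-etc.mount
```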
Hi Muhammad, do you think this bug will be resolved before the end of this sprint (July 3rd)? If not, can we set "Reviewed-in-Sprint"?
Hi Dan, The root cause is still not clear, so please set the reviewed flag.
(In reply to Jonathan Lebon from comment #10)
> (In reply to Nikita Dubrovskii (IBM) from comment #9)
> > Ok, look's like i've got what' wrong here:
> >
> > 1) with 'coreos.inst.install_dev=/dev/mapper/mpatha rd.multipath=default'
> > and `hostname.com/ignition.conf` and in the parm file:
>
> What is that karg? Do you mean `ip=...`? Can you show the full parmfile you
> used?

No, it's not an IP here but a hostname:
```
ip=172.18.142.3::172.18.0.1:255.254.0.0:coreos:encbdf0:off nameserver=172.18.0.1 coreos.inst=yes coreos.inst.ignition_url=http://m1314001.lnxne.boe:8080/ignition/ignition.ign
```

> OK, so I think there are two issues here:
> 1. `coreos-propagate-multipath-conf.service` doesn't have
>
> ```
> OnFailure=emergency.target
> OnFailureJobMode=isolate
> ```
> 2. We have no ordering between `coreos-propagate-multipath-conf.service` and
> `sysroot-etc.mount`.
>
> <times passes>
>
> Filed: https://github.com/coreos/fedora-coreos-config/pull/1077
>
> Can you try that out?

Tried it; works as expected:
- the system can be installed using DNS (`coreos.inst.ignition_url=http://m1314001.lnxne.boe:8080/ignition/ignition.ign`)
- the system can be installed using an IP (`coreos.inst.ignition_url=http://172.18.10.243/ignition.ign`)

> with kpartx there, let's work on fixing them.

Here is the PR for the kpartx issue: https://github.com/coreos/coreos-installer/pull/566
Hi Muhammad, do you think this bug will move past ON_QA by the end of this Sprint? If not, can we add "reviewed-in-sprint" flag?
Hi Jonathan, do you know when the fix [https://github.com/coreos/fedora-coreos-config/pull/1077] will be picked up by RHCOS?
Will try to get it in the next 4.8 bootimage bump.
Latest RHCOS 4.9 build should have the necessary patches for this bug, so should be ready to be verified. Muhammad, can you verify it's fixed?
Sorry for the confusion on this. It has to stay in POST until the 4.9 bootimage bump PR gets merged.
Setting reviewed-in-sprint as we are waiting for OpenShift to pick up the RHCOS PR
Hi Muhammad, do you think this bug will reach ON_QA by the end of this sprint (August 14th)? If not, can we add "reviewed-in-sprint" flag?
Hi Dan, the fix has landed in 4.9, though it still needs to be tested, so you can add the "reviewed-in-sprint" flag.
I successfully installed two nodes with rd.multipath=default in the parmfile:

rd.neednet=1 rd.multipath=default console=ttysclp0 coreos.inst.install_dev=/dev/mapper/mpatha coreos.live.rootfs_url=http://bistro.lnxne.boe/redhat/alkl/rhcos/nightly/PE/rhcos-49.84.202108041448-0/rhcos-49.84.202108041448-0-live-rootfs.s390x.img coreos.inst.ignition_url=http://bastion.m3558001.lnxne.boe:8080/ignition/worker.ign ip=10.107.1.52::10.107.1.51:255.255.255.0::ence383:none nameserver=10.107.1.51 zfcp.allow_lun_scan=0 cio_ignore=all,!condev rd.znet=qeth,0.0.e383,0.0.e384,0.0.e385,layer2=1 rd.zfcp=0.0.1c42,0x5001738030290140,0x0002000000000000 rd.zfcp=0.0.1c02,0x5001738030290140,0x0002000000000000 rd.zfcp=0.0.1c42,0x5001738030290151,0x0002000000000000 rd.zfcp=0.0.1c02,0x5001738030290151,0x0002000000000000

On worker node (without Day-2 operation):
-----------------------------------------
[core@bootstrap-0 ~]$ lsblk
NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda        8:0    0 112.2G  0 disk
|-sda3     8:3    0   384M  0 part /boot
`-sda4     8:4    0 111.8G  0 part
sdb        8:16   0 112.2G  0 disk
|-sdb3     8:19   0   384M  0 part
`-sdb4     8:20   0 111.8G  0 part
sdc        8:32   0 112.2G  0 disk
|-sdc3     8:35   0   384M  0 part
`-sdc4     8:36   0 111.8G  0 part /sysroot
sdd        8:48   0 112.2G  0 disk
|-sdd3     8:51   0   384M  0 part
`-sdd4     8:52   0 111.8G  0 part

[core@bootstrap-0 ~]$ cat /proc/cmdline
random.trust_cpu=on ignition.platform.id=metal $ignition_firstboot ostree=/ostree/boot.0/rhcos/98ec6ea1fe3b3ff8df599883fdae7041851b624a27b00bb70a3f8488616b6e93/0 zfcp.allow_lun_scan=0 cio_ignore=all,!condev rd.znet=qeth,0.0.e383,0.0.e384,0.0.e385,layer2=1 rd.zfcp=0.0.1c42,0x5001738030290140,0x0002000000000000 rd.zfcp=0.0.1c02,0x5001738030290140,0x0002000000000000 rd.zfcp=0.0.1c42,0x5001738030290151,0x0002000000000000 rd.zfcp=0.0.1c02,0x5001738030290151,0x0002000000000000 root=UUID=7c1ea61a-5c64-4a97-807f-f8f55f949faa rw rootflags=prjquota

[core@bootstrap-0 ~]$ lszdev
TYPE        ID                                              ON   PERS  NAMES
zfcp-host   0.0.1c02                                        yes  no
zfcp-host   0.0.1c42                                        yes  no
zfcp-lun    0.0.1c02:0x5001738030290140:0x0002000000000000  yes  no    sdd sg3
zfcp-lun    0.0.1c02:0x5001738030290151:0x0002000000000000  yes  no    sdc sg2
zfcp-lun    0.0.1c42:0x5001738030290140:0x0002000000000000  yes  no    sda sg0
zfcp-lun    0.0.1c42:0x5001738030290151:0x0002000000000000  yes  no    sdb sg1
qeth        0.0.e383:0.0.e384:0.0.e385                      yes  no    ence383
generic-ccw 0.0.0009                                        yes  no
----------------------------------------------------------------------------------

BUT: On one system, I have run the installation 3 times, on the other system 2 times, to have success. The failed installations stop in the emergency shell, but DNS / hostname / IP / FCP look good:

bash-4.4# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop0         7:0    0   7.9G  0 loop  /run/ephemeral
loop1         7:1    0 771.9M  0 loop  /sysroot
sda           8:0    0 112.2G  0 disk
`-mpatha    253:0    0 112.2G  0 mpath
  |-mpatha3 253:1    0   384M  0 part
  `-mpatha4 253:2    0 111.8G  0 part
sdb           8:16   0 112.2G  0 disk
`-mpatha    253:0    0 112.2G  0 mpath
  |-mpatha3 253:1    0   384M  0 part
  `-mpatha4 253:2    0 111.8G  0 part
sdc           8:32   0 112.2G  0 disk
`-mpatha    253:0    0 112.2G  0 mpath
  |-mpatha3 253:1    0   384M  0 part
  `-mpatha4 253:2    0 111.8G  0 part
sdd           8:48   0 112.2G  0 disk
`-mpatha    253:0    0 112.2G  0 mpath
  |-mpatha3 253:1    0   384M  0 part
  `-mpatha4 253:2    0 111.8G  0 part

bash-4.4# lszdev
TYPE        ID                                              ON   PERS  NAMES
zfcp-host   0.0.1c02                                        yes  no
zfcp-host   0.0.1c42                                        yes  no
zfcp-lun    0.0.1c02:0x5001738030290140:0x0002000000000000  yes  no    sdd sg3
zfcp-lun    0.0.1c02:0x5001738030290151:0x0002000000000000  yes  no    sdc sg2
zfcp-lun    0.0.1c42:0x5001738030290140:0x0002000000000000  yes  no    sdb sg1
zfcp-lun    0.0.1c42:0x5001738030290151:0x0002000000000000  yes  no    sda sg0
qeth        0.0.e383:0.0.e384:0.0.e385                      yes  no    ence383
generic-ccw 0.0.0009                                        yes  no

bash-4.4# ping -c 4 bastion.m3558001.lnxne.boe
PING bastion.m3558001.lnxne.boe (172.18.160.1) 56(84) bytes of data.
64 bytes from 172.18.160.1 (172.18.160.1): icmp_seq=1 ttl=64 time=0.175 ms
64 bytes from 172.18.160.1 (172.18.160.1): icmp_seq=2 ttl=64 time=0.203 ms
64 bytes from 172.18.160.1 (172.18.160.1): icmp_seq=3 ttl=64 time=0.184 ms
64 bytes from 172.18.160.1 (172.18.160.1): icmp_seq=4 ttl=64 time=0.207 ms

--- bastion.m3558001.lnxne.boe ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3090ms
rtt min/avg/max/mdev = 0.175/0.192/0.207/0.016 ms
bash-4.4#

For some reason, the image was not downloaded. @madeel mentioned that this could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1991928. On my system, there is only one NIC configured.
After Day-2 Operation:

Last login: Wed Aug 11 09:32:37 2021 from 10.107.1.51
[core@bootstrap-0 ~]$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda           8:0    0 112.2G  0 disk
`-mpatha    253:0    0 112.2G  0 mpath
  |-mpatha3 253:1    0   384M  0 part  /boot
  `-mpatha4 253:2    0 111.8G  0 part  /sysroot
sdb           8:16   0 112.2G  0 disk
`-mpatha    253:0    0 112.2G  0 mpath
  |-mpatha3 253:1    0   384M  0 part  /boot
  `-mpatha4 253:2    0 111.8G  0 part  /sysroot
sdc           8:32   0 112.2G  0 disk
`-mpatha    253:0    0 112.2G  0 mpath
  |-mpatha3 253:1    0   384M  0 part  /boot
  `-mpatha4 253:2    0 111.8G  0 part  /sysroot
sdd           8:48   0 112.2G  0 disk
`-mpatha    253:0    0 112.2G  0 mpath
  |-mpatha3 253:1    0   384M  0 part  /boot
  `-mpatha4 253:2    0 111.8G  0 part  /sysroot

[core@bootstrap-0 ~]$ sudo multipath -ll
mpatha (20017380030290193) dm-0 IBM,2810XIV
size=112G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 1:0:0:2 sdc 8:32 active ready running
  |- 1:0:1:2 sdd 8:48 active ready running
  |- 0:0:0:2 sda 8:0  active ready running
  `- 0:0:1:2 sdb 8:16 active ready running
The Z team tested the specific error described in the initial comment, and it has been fixed; however, other bugs have come up that we think may be related to BZ 1991928 (https://bugzilla.redhat.com/show_bug.cgi?id=1991928), so we are tracking those in that bug. Moving this to VERIFIED for now.
The fix for this bug will not be delivered to customers until it lands in an updated bootimage. That process is tracked in bug 1981999, which is in state ASSIGNED. Moving this bug back to POST.
Setting needinfo- as there are other networking related patches coming with the tracker bug 1981999. We will wait for the bootimage bump.
Boot image bump is merged, moving to MODIFIED
Adding "reviewed-in-sprint" as the bug will not be resolved before the end of this sprint.
Hi Muhammad, do you think this bug is still waiting for the PRs to merge? If so, we may want to add "reviewed-in-sprint"
Hi Dan, we can no longer reproduce the problem mentioned in the BZ. We can close this.
Closing per Muhammad's Comment 33
The fix for this bug will not be delivered to customers until it lands in an updated bootimage. That process is tracked in bug 1981999, which is in state POST. Moving this bug back to POST.
Bot is linked with the bootimage bug and therefore cannot close. Adding "reviewed-in-sprint"
The fix for this bug has landed in a bootimage bump, as tracked in bug 1981999 (now in status MODIFIED). Moving this bug to MODIFIED.
This has been verified.
I believe Doug's Comment 40 refers back to Comment 33 regarding the fact that we can no longer reproduce the problem; however, we were unable to close this bug as it is linked to BZ 1981999 (which is VERIFIED)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759