Bug 2157082

Summary: "nm-run.sh" runs only "online" hook once, which may lead to not fetching the Stage2 at all
Product: Red Hat Enterprise Linux 9 Reporter: Renaud Métrich <rmetrich>
Component: dracut Assignee: Pavel Valena <pvalena>
Status: CLOSED ERRATA QA Contact: Frantisek Sumsal <fsumsal>
Severity: high Docs Contact:
Priority: high    
Version: 9.1 CC: bgalvani, dtardon, fsumsal, jstodola, lrintel, nilesh.javali, pvalena, rkhan, rvykydal, sfaye, sukulkar, till
Target Milestone: rc Keywords: Bugfix, Triaged
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: dracut-057-21.git20230214.el9 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-05-09 08:24:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2153361    

Description Renaud Métrich 2022-12-30 15:02:55 UTC
Description of problem:

A customer hits this issue when installing a system over an InfiniBand interface.
On the customer's system, for some reason, the "nm-run.sh" script executes before the InfiniBand interface is discovered by the kernel.
At the end of the script, "/tmp/nm.done" is created, so the online hook for the interface never executes and Stage2 is never fetched.
Below is a partial "set -x" sample output; the system was booted with "ip=ibs1f0:dhcp" ("ibs1f0" is initially seen as "ib0" by the kernel).
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
[   17.943242] localhost.localdomain dracut-initqueue[2290]: ++ '[' -e /tmp/nm.done ']'
[   17.943399] localhost.localdomain dracut-initqueue[2290]: ++ '[' -z 1 ']'
[   17.943399] localhost.localdomain dracut-initqueue[2290]: ++ '[' -s /run/NetworkManager/initrd/hostname ']'
[   17.943399] localhost.localdomain dracut-initqueue[2290]: ++ for _i in /sys/class/net/*
[   17.943399] localhost.localdomain dracut-initqueue[2290]: ++ '[' -d /sys/class/net/ens8f0 ']'
[   17.945130] localhost.localdomain dracut-initqueue[2496]: +++ cat /sys/class/net/ens8f0/ifindex
[   17.945700] localhost.localdomain dracut-initqueue[2290]: ++ state=/run/NetworkManager/devices/2
[   17.945700] localhost.localdomain dracut-initqueue[2290]: ++ grep -q '^connection-uuid=' /run/NetworkManager/devices/2
[   17.945700] localhost.localdomain dracut-initqueue[2290]: ++ continue
[   17.945700] localhost.localdomain dracut-initqueue[2290]: ++ for _i in /sys/class/net/*
[   17.945700] localhost.localdomain dracut-initqueue[2290]: ++ '[' -d /sys/class/net/ens8f1 ']'
[   17.947365] localhost.localdomain dracut-initqueue[2498]: +++ cat /sys/class/net/ens8f1/ifindex
[   17.947940] localhost.localdomain dracut-initqueue[2290]: ++ state=/run/NetworkManager/devices/3
[   17.947940] localhost.localdomain dracut-initqueue[2290]: ++ grep -q '^connection-uuid=' /run/NetworkManager/devices/3
[   17.947940] localhost.localdomain dracut-initqueue[2290]: ++ continue
[   17.947940] localhost.localdomain dracut-initqueue[2290]: ++ for _i in /sys/class/net/*
[   17.947940] localhost.localdomain dracut-initqueue[2290]: ++ '[' -d /sys/class/net/ens8f2 ']'
[   17.949659] localhost.localdomain dracut-initqueue[2500]: +++ cat /sys/class/net/ens8f2/ifindex
[   17.950163] localhost.localdomain dracut-initqueue[2290]: ++ state=/run/NetworkManager/devices/4
[   17.950163] localhost.localdomain dracut-initqueue[2290]: ++ grep -q '^connection-uuid=' /run/NetworkManager/devices/4
[   17.950163] localhost.localdomain dracut-initqueue[2290]: ++ continue
[   17.950163] localhost.localdomain dracut-initqueue[2290]: ++ for _i in /sys/class/net/*
[   17.950163] localhost.localdomain dracut-initqueue[2290]: ++ '[' -d /sys/class/net/ens8f3 ']'
[   17.951827] localhost.localdomain dracut-initqueue[2502]: +++ cat /sys/class/net/ens8f3/ifindex
[   17.952376] localhost.localdomain dracut-initqueue[2290]: ++ state=/run/NetworkManager/devices/5
[   17.952376] localhost.localdomain dracut-initqueue[2290]: ++ grep -q '^connection-uuid=' /run/NetworkManager/devices/5
[   17.952376] localhost.localdomain dracut-initqueue[2290]: ++ continue
[   17.952376] localhost.localdomain dracut-initqueue[2290]: ++ for _i in /sys/class/net/*
[   17.952376] localhost.localdomain dracut-initqueue[2290]: ++ '[' -d /sys/class/net/lo ']'
[   17.953457] localhost.localdomain dracut-initqueue[2504]: +++ cat /sys/class/net/lo/ifindex
[   17.953910] localhost.localdomain dracut-initqueue[2290]: ++ state=/run/NetworkManager/devices/1
[   17.953910] localhost.localdomain dracut-initqueue[2290]: ++ grep -q '^connection-uuid=' /run/NetworkManager/devices/1
[   17.953910] localhost.localdomain dracut-initqueue[2290]: ++ continue
[   17.953910] localhost.localdomain dracut-initqueue[2290]: ++ :

---> HERE ABOVE "ibs1f0" doesn't exist yet

[   18.217260] localhost.localdomain kernel: mlx5_core 0000:31:00.0 ibs1f0: renamed from ib0

---> Interface is now discovered by the kernel
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

When booting with "rd.debug", the issue doesn't happen because of slowness induced by "rd.debug" (especially writing to the console), causing the interface to be discovered before the script executes.

Version-Release number of selected component (if applicable):

dracut-057-13.git20220816.el9

How reproducible:

Always

Steps to Reproduce: this can be reproduced using QEMU/KVM by hot-plugging the interface
1. Configure a VM with a network interface that *won't be used* (usually "enp1s0")
2. Configure booting directly on kernel/initrd

  "Direct Kernel Boot"
  kernel: rhel91 DVD kernel
  initrd: rhel91 DVD initrd
  arguments: console=tty0 console=ttyS0,115200n8 ip=enp5s0:dhcp inst.repo=http://192.168.122.1/rhel91 rd.debug rd.break

3. Boot the system and wait for dracut-initqueue to start

4. Add network interface "enp5s0"

    -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
    <interface type="network">
      <mac address="52:54:00:ce:e3:e4"/>
      <source network="default" portid="f8966d36-8586-430d-8f57-265a878ddc35" bridge="virbr0"/>
      <target dev="vnet10"/>
      <model type="virtio"/>
      <alias name="net1"/>
      <address type="pci" domain="0x0000" bus="0x05" slot="0x00" function="0x0"/>
    </interface>
    -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
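
    Step 4 can be done from the host with "virsh attach-device". The sketch below is only an illustration: the guest name "rhel91-guest" and the file name "enp5s0.xml" (holding the interface XML above) are placeholders, not values from this report. By default it only prints the command; set RUN=1 to actually run it.

```shell
#!/bin/sh
# Placeholders (assumptions, not from the report): guest name and XML file.
DOMAIN="${DOMAIN:-rhel91-guest}"
XML_FILE="${XML_FILE:-enp5s0.xml}"

# Hot-plug the interface described in XML_FILE into the running guest.
# Prints the command unless RUN=1 is set in the environment.
hotplug_nic() {
    if [ "${RUN:-0}" = "1" ]; then
        virsh attach-device "$1" "$2" --live
    else
        echo "virsh attach-device $1 $2 --live"
    fi
}

hotplug_nic "$DOMAIN" "$XML_FILE"
```

    Run it once dracut-initqueue has started (step 3), so that the interface shows up after "nm-run.sh" has already executed.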

Actual results:

dracut-initqueue times out and Stage2 is never downloaded

Expected results:

Stage2 gets downloaded because the "online" hook for enp5s0 eventually executes

Additional info:

The root cause of the issue is that line 72 executes unconditionally, which prevents any further execution of the "for" loop on line 62:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
 :
  5 if [ -e /tmp/nm.done ]; then
  6     return
  7 fi
 :
 62 for _i in /sys/class/net/*; do
 63     [ -d "$_i" ] || continue
 64     state="/run/NetworkManager/devices/$(cat "$_i"/ifindex)"
 65     grep -q '^connection-uuid=' "$state" 2> /dev/null || continue
 66     ifname="${_i##*/}"
 67     dhcpopts_create "$state" > /tmp/dhclient."$ifname".dhcpopts
 68     source_hook initqueue/online "$ifname"
 69     /sbin/netroot "$ifname"
 70 done
 71 
 72 : > /tmp/nm.done
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Because of this, even if no interface was eligible for the "online" hook (lines 68-69), the loop is never entered again.

I believe a fix is to create the "nm.done" file only if "source_hook" could execute, something like this:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
 :
 62 for _i in /sys/class/net/*; do
 63     [ -d "$_i" ] || continue
 64     state="/run/NetworkManager/devices/$(cat "$_i"/ifindex)"
 65     grep -q '^connection-uuid=' "$state" 2> /dev/null || continue
 66     ifname="${_i##*/}"
 67     dhcpopts_create "$state" > /tmp/dhclient."$ifname".dhcpopts
 68     source_hook initqueue/online "$ifname"
 69     /sbin/netroot "$ifname"
 70     : > /tmp/nm.done
 71 done
 72 
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

But I'm not completely sure how this works, especially whether the "online" hook is expected to execute for multiple interfaces. In that case the proposed fix will not work, because we cannot be sure that the interface for netroot is already there.
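
The behaviour of the proposed change can be tried outside the initrd with a small simulation. This is only a sketch, not the upstream dracut fix: "run_online_hooks" and its three parameters are hypothetical names, the directories stand in for /sys/class/net and /run/NetworkManager/devices, and "source_hook" and "/sbin/netroot" are replaced by an echo. It creates the marker file only when at least one interface was processed, so it has the same multi-interface limitation as described above.

```shell
#!/bin/sh
# Simulation sketch; all names below are hypothetical stand-ins.
run_online_hooks() {
    sysnet=$1     # stands in for /sys/class/net
    nmstate=$2    # stands in for /run/NetworkManager/devices
    donefile=$3   # stands in for /tmp/nm.done

    # Same early return as at the top of nm-run.sh.
    if [ -e "$donefile" ]; then
        return 0
    fi

    ran=0
    for _i in "$sysnet"/*; do
        [ -d "$_i" ] || continue
        [ -r "$_i/ifindex" ] || continue
        state="$nmstate/$(cat "$_i/ifindex")"
        # Skip devices that NetworkManager has not activated a connection on.
        grep -q '^connection-uuid=' "$state" 2> /dev/null || continue
        ifname="${_i##*/}"
        # Stub for: source_hook initqueue/online + /sbin/netroot
        echo "online hook for $ifname"
        ran=1
    done

    # Create the marker only if at least one interface was processed.
    if [ "$ran" -eq 1 ]; then
        : > "$donefile"
    fi
    return 0
}
```

Calling the function first with a directory tree that has no activated device, and again after adding one, shows that the marker is left absent on the first call and created on the second.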

Comment 1 Lukáš Nykrýn 2023-01-03 14:53:59 UTC
Can we set up the NM generator so that, if the user specifies a concrete device on the kernel cmdline, NM will wait for it?

Comment 2 Renaud Métrich 2023-01-03 15:28:10 UTC
I'm ok with having NM wait on the interface (which it currently doesn't, or only for a few seconds).
However this won't solve the case when booting with "ip=dhcp".

Comment 4 Lubomir Rintel 2023-01-12 15:48:59 UTC
Reassigning this to dracut. This is a dracut regression, pull request for a revert filed here: https://github.com/dracutdevs/dracut/pull/2134

Dracut maintainers, please review & apply as appropriate. Thank you!

Comment 5 David Tardon 2023-01-24 13:50:18 UTC
*** Bug 2134060 has been marked as a duplicate of this bug. ***

Comment 6 Pavel Valena 2023-01-24 14:45:14 UTC
Alternative solution proposed by lnykryn: https://github.com/dracutdevs/dracut/pull/2173

Comment 11 Radek Vykydal 2023-03-02 07:55:06 UTC
I wonder when the fix will hit rawhide. Seems that it was merged here: https://github.com/dracutdevs/dracut/pull/2134 (https://bugzilla.redhat.com/show_bug.cgi?id=2153361#c16) but I can't see it in the current rawhide (dracut-059-03.fc39).

Comment 12 Renaud Métrich 2023-03-03 07:22:25 UTC
Hello,

Even when there is a single network interface, it's possible that 99-nm-run.sh executes before any interface has been enumerated, making it impossible to install the system.
This can be seen with IB interfaces (mlx5_core) on a customer site.

So now the question is what we can do to work around this reliably on 9.0 and 9.1?
9.0 has EUS, so it's even more critical than 9.1 (once the BZ is released with - hopefully - 9.2).

Renaud.

Comment 15 Pavel Valena 2023-03-14 19:00:34 UTC
FYI I've added the fix to the prepared Rawhide PR, soon to be merged: https://src.fedoraproject.org/rpms/dracut/pull-request/32

Comment 16 Radek Vykydal 2023-03-21 13:59:49 UTC
(In reply to Pavel Valena from comment #15)
> FYI I've added the fix to the prepared Rawhide PR, soon to be merged:
> https://src.fedoraproject.org/rpms/dracut/pull-request/32

Thank you, works for me in my local tests with Rawhide and anaconda part of the fix.

Comment 18 errata-xmlrpc 2023-05-09 08:24:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (dracut bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2547