Bug 1292623 - dracut: fails to fetch kickstart config+stage2 reliably on v6-only networks [NEEDINFO]
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: dracut
Version: 7.1
Hardware: Unspecified Unspecified
Priority: unspecified Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Lukáš Nykrýn
QA Contact: Release Test Team
Depends On:
Blocks:
Reported: 2015-12-17 18:18 EST by Bryan Wann
Modified: 2017-03-27 19:25 EDT
CC List: 9 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-04 04:02:22 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
pallotron: needinfo? (lnykryn)


Attachments
Syslog of kickstart+stage2 download success with dracut sleep patch (4.20 KB, text/plain)
2015-12-17 18:18 EST, Bryan Wann
dracut sleep patch (569 bytes, application/x-shellscript)
2015-12-17 18:19 EST, Bryan Wann

Description Bryan Wann 2015-12-17 18:18:00 EST
Created attachment 1106910
Syslog of kickstart+stage2 download success with dracut sleep patch

Description of problem:
When kickstarting a CentOS 7 host on an IPv6-only network, there's a race condition in dracut's initramfs between bringing up the NIC, learning the v6 gateway, and downloading the kickstart config. This often causes the fetch of the kickstart config and the stage2 image to fail, and the host eventually drops to an emergency dracut shell.

In our network we rely on learning the v6 default gateway from layer 3 rack switches via ICMPv6 router solicitation/router advertisement. For installing hosts we assign static IPv6 addresses (for DNS mapping) and disable SLAAC.

Syslog shows the Ethernet link state changing up/down/up immediately before the failure. Based on the "network is unreachable" errors, I highly suspect the link is considered to be online and dracut continues to try to fetch kickstart config+stage2 before the host has been able to learn a v6 default gateway from the switch.


Version-Release number of selected component (if applicable):
dracut-033-240.el7

How reproducible:
This is very reproducible when kickstart installing hosts in our v6-only environment.

Steps to Reproduce:
1. Kickstart a host (e.g. with iPXE or UEFI) over a v6 network, specifying a v6 IP address for the initrd, stage2 and kickstart options. Example:

  4.0.9-30_vmlinuz noipv4 console=ttyS1,57600 selinux=0 pcie_aspm=off net.ifnames=0 initrd=\4.0.9-30_centos7.1.1503_initrd.img biosdevname=0 inst.ks.sendmac ip=[2401:db00:20:153:face:0:2d:0]:::64:::none inst.stage2=http://[2401:db00:11:df:face:b00c:0:134]/yum/centos/7.x/stage2/7.1-r18/ ramdisk_size=2097152 inst.loglevel=debug inst.cmdline inst.sshd rd.live.image nameserver=2401:db00:f0:a53:: nameserver=2401:db00:f0:b53:: inst.ks=http://[2401:db00:11:df:face:b00c:0:76]/ks-aux1.prn1.facebook.com.cfg

2. Wait for dracut to load and start bringing up the Basic System target.
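
For reference, the static ip= value in step 1 can be split into its fields with a small shell sketch. The field order is assumed from dracut.cmdline(7) (client-IP, peer, gateway-IP, netmask, hostname, interface, autoconf). describe_ip_arg is a hypothetical helper, not part of dracut, and it only handles a bracketed IPv6 client address.

```shell
#!/bin/sh
# Hypothetical helper (not part of dracut): print the fields of a static
# ip= value whose client address is a bracketed IPv6 literal.
# Assumed field order, per dracut.cmdline(7):
#   <client-IP>:[<peer>]:<gateway-IP>:<netmask>:<hostname>:<interface>:<autoconf>
describe_ip_arg() {
    arg=$1
    client=${arg#\[}          # drop the leading "["
    client=${client%%\]*}     # keep everything up to the first "]"
    rest=${arg#*\]:}          # remaining colon-separated fields
    old_ifs=$IFS
    IFS=:
    set -f                    # no globbing while splitting
    set -- $rest              # split on ":" (empty fields are preserved)
    set +f
    IFS=$old_ifs
    echo "client=$client peer=$1 gateway=$2 netmask=$3 hostname=$4 interface=$5 autoconf=$6"
}

describe_ip_arg '[2401:db00:20:153:face:0:2d:0]:::64:::none'
```

The gateway field comes out empty, so the host has no static default route and depends on a router advertisement arriving before the first fetch.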

Actual results:

Kernel boots, dracut runs, attempts to fetch kickstart configuration and specified stage2 squashfs.img over network, fails.  Drops to dracut prompt, e.g.

[   16.033918] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   16.210200] ixgbe 0000:03:00.0 eth0: NIC Link is Down
[   16.319369] ixgbe 0000:03:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
dracut-initqueue[1219]: curl: (7) Failed to connect to 2401:db00:11:df:face:b00c:0:134: Network is unreachable
dracut-initqueue[1219]: Warning: failed to fetch kickstart from http://[2401:db00:11:df:face:b00c:0:134]/ks-aux1.prn1.facebook.com.cfg
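
A rough way to triage many boot logs for this pattern is sketched below. It is only a heuristic that assumes the two message strings seen in the excerpt above; classify_boot_log is a hypothetical name.

```shell
#!/bin/sh
# Heuristic sketch: read a boot log on stdin and report whether a link-down
# event is later followed by a "Network is unreachable" error.
# The matched strings are taken from the log excerpt in this report.
classify_boot_log() {
    awk '
        /NIC Link is Down/       { flapped = 1 }
        /Network is unreachable/ { if (flapped) hit = 1 }
        END { print (hit ? "race-suspected" : "no-match") }
    '
}
```

For example, `classify_boot_log < console.log` (console.log being a placeholder file name) prints one of the two labels.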

Expected results:

Dracut fetches kickstart config and stage2, starts anaconda, installation succeeds


Additional info:

I have been able to reliably work around this by inserting a dracut online hook script named 10-fb-network-sleep-fix.sh that does nothing but sleep for 4 seconds before 11-fetch-kickstart-net.sh runs.
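
A less timing-dependent variant of this workaround might poll for an IPv6 default route rather than sleep for a fixed time. The following is only a sketch under assumptions, not the attached patch: it assumes the ip tool is present in the initramfs, and the function names and the ROUTE_CMD override are hypothetical.

```shell
#!/bin/sh
# Return 0 if the given routing-table text (as printed by `ip -6 route show`)
# contains a default route.
has_default_route() {
    printf '%s\n' "$1" | grep -q '^default '
}

# Poll once per second until an IPv6 default route appears or the timeout
# (seconds, default 10) expires. ROUTE_CMD is a hypothetical override that
# lets the loop be exercised without a real network.
wait_for_v6_default_route() {
    timeout=${1:-10}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        routes=$(${ROUTE_CMD:-ip -6 route show})
        has_default_route "$routes" && return 0
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

Called from an online hook that sorts ahead of 11-fetch-kickstart-net.sh, `wait_for_v6_default_route 10` would return as soon as the route is learned and give up after about 10 seconds.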

Dracut rd.debug logs are not very useful here. The race window is only 1-2 seconds, and enabling debug logging slows down the boot process just enough that the race is never exposed with debug turned on. :(


Failure log:  https://gist.github.com/bwann/7bc1c4217807b01087ff

Originally reported on anaconda's github issue tracker:
https://github.com/rhinstaller/anaconda/issues/464
Comment 1 Bryan Wann 2015-12-17 18:19 EST
Created attachment 1106911
dracut sleep patch

Inserted into the initrd as /usr/lib/dracut/hooks/initqueue/online/10-fb-network-sleep-fix.sh
Comment 5 Jan Stodola 2016-09-06 09:28:39 EDT
Bryan Wann,
this issue should be fixed in RHEL-7.3 Beta. Could you please check whether it works for you now?

I was able to reproduce it with RHEL-7.2. When retesting on the same system with the latest dracut-033-458.el7, the issue didn't appear (the system was booted ~100 times without any problem).
Comment 6 Phil Dibowitz 2016-09-07 19:38:41 EDT
It looks like master needs a similar fix?
Comment 7 Bryan Wann 2016-09-08 01:51:39 EDT
Great!  I need to get my hands on 7.3 to test.  Looking over the Dracut commits in the RHEL-7 branch I think I see the applicable commits that address this (e.g. network-lib wait for network RA), but as Phil mentioned those commits don't seem to be in the 0.44 rawhide/master?
Comment 9 errata-xmlrpc 2016-11-04 04:02:22 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2530.html
Comment 10 Bryan Wann 2017-03-20 23:18:41 EDT
Sorry for the delay, but yes this does seem to be better in 7.3. Thanks!
Comment 11 Angelo FAilla 2017-03-27 19:06:03 EDT
Hi, can you please confirm whether this has been merged to master/rawhide too? We really need those in rawhide... Thanks!
Comment 12 Angelo FAilla 2017-03-27 19:07:29 EDT
Can you please confirm this has been merged into rawhide/master? We really need this merged there... thanks!
Comment 13 Angelo FAilla 2017-03-27 19:25:51 EDT
I see master doesn't have the fix:

https://github.com/dracutdevs/dracut/blob/master/modules.d/40network/net-lib.sh#L665

same on:

https://git.kernel.org/pub/scm/boot/dracut/dracut.git/tree/modules.d/40network/net-lib.sh?h=master#n665

not sure which one of the two is authoritative.
