Created attachment 1106910
Syslog of kickstart+stage2 download success with dracut sleep patch
Description of problem:
When kickstarting a CentOS 7 host on an IPv6-only network, there's a race condition in dracut's initramfs between bringing up the NIC, learning the v6 gateway, and downloading the kickstart config. This often causes fetching of the kickstart and the stage2 image to fail, and the host eventually drops to an emergency dracut shell.
In our network we rely on learning the v6 default gateway from layer 3 rack switches via ICMPv6 router solicitation/router advertisements. For installing hosts we assign static IPv6 addresses (for DNS mapping) and disable SLAAC.
Syslog shows the Ethernet link state changing up/down/up immediately before the failure. Based on the "Network is unreachable" errors, I strongly suspect that dracut considers the link online and goes on to fetch the kickstart config and stage2 before the host has been able to learn a v6 default gateway from the switch.
Version-Release number of selected component (if applicable):
How reproducible:
This is very reproducible when kickstart installing hosts in our v6-only environment.
Steps to Reproduce:
1. Kickstart a host (e.g. with iPXE or UEFI) over a v6 network, specifying a v6 IP address for the initrd, stage2 and kickstart options. Example:
4.0.9-30_vmlinuz noipv4 console=ttyS1,57600 selinux=0 pcie_aspm=off net.ifnames=0 initrd=\4.0.9-30_centos7.1.1503_initrd.img biosdevname=0 inst.ks.sendmac ip=[2401:db00:20:153:face:0:2d:0]:::64:::none inst.stage2=http://[2401:db00:11:df:face:b00c:0:134]/yum/centos/7.x/stage2/7.1-r18/ ramdisk_size=2097152 inst.loglevel=debug inst.cmdline inst.sshd rd.live.image nameserver=2401:db00:f0:a53:: nameserver=2401:db00:f0:b53:: inst.ks=http://[2401:db00:11:df:face:b00c:0:76]/ks-aux1.prn1.facebook.com.cfg
2. Wait for dracut to load and start bringing up the Basic System target.
Actual results:
The kernel boots and dracut runs, attempts to fetch the kickstart configuration and the specified stage2 squashfs.img over the network, and fails. It then drops to the dracut prompt, e.g.:
[ 16.033918] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 16.210200] ixgbe 0000:03:00.0 eth0: NIC Link is Down
[ 16.319369] ixgbe 0000:03:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
dracut-initqueue: curl: (7) Failed to connect to 2401:db00:11:df:face:b00c:0:134: Network is unreachable
dracut-initqueue: Warning: failed to fetch kickstart from http://[2401:db00:11:df:face:b00c:0:134]/ks-aux1.prn1.facebook.com.cfg
Expected results:
Dracut fetches the kickstart config and stage2, starts anaconda, and the installation succeeds.
Additional info:
I have been able to reliably work around this by inserting a dracut online script named 10-fb-network-sleep-fix.sh that does nothing but sleep for 4 seconds before 11-fetch-kickstart-net.sh runs.
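A fixed sleep works here but depends on timing. As a rough sketch only (this is not the attached patch, nor the fix that later landed in RHEL's dracut), a hook in the same slot could instead poll until an IPv6 default route appears. The function names, the ROUTE_CMD variable, and the retry defaults below are assumptions made for illustration; the only external command relied on is `ip -6 route show default` from iproute2.

```shell
#!/bin/sh
# Sketch of an initqueue/online hook that waits for an IPv6 default route
# (learned from a router advertisement) instead of sleeping a fixed time.
# All names here are hypothetical; this is not the attached patch.

# Command used to list IPv6 default routes; overridable for testing.
: "${ROUTE_CMD:=ip -6 route show default}"

# Succeeds when the route command prints at least one "default ..." line.
has_v6_default_route() {
    $ROUTE_CMD 2>/dev/null | grep -q '^default'
}

# Poll up to $1 times (default 20), sleeping $2 seconds (default 0.5)
# between attempts. Returns 0 once a default route exists, 1 on timeout.
wait_for_v6_default_route() {
    tries=${1:-20}
    interval=${2:-0.5}
    while [ "$tries" -gt 0 ]; do
        has_v6_default_route && return 0
        sleep "$interval"
        tries=$((tries - 1))
    done
    return 1
}
```

Ending the hook with `wait_for_v6_default_route 20 0.5` would wait up to roughly 10 seconds and return as soon as the route shows up. Fractional `sleep` is not guaranteed by POSIX, so the interval may need to be a whole number depending on the `sleep` available in the initramfs.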
Dracut rd.debug logs are not very useful here. The race window is only 1-2 seconds, and enabling debug logging slows the boot process just enough that the race is never exposed with debug turned on. :(
Failure log: https://gist.github.com/bwann/7bc1c4217807b01087ff
Originally reported on anaconda's github issue tracker:
Created attachment 1106911
dracut sleep patch
Inserted into initrd as /usr/lib/dracut/hooks/initqueue/online/10-fb-network-sleep-fix.sh
This issue should be fixed in RHEL-7.3 Beta. Could you please check whether it works for you now?
I was able to reproduce it with RHEL-7.2. When retesting on the same system with the latest dracut-033-458.el7, this issue didn't appear (the system was booted ~100 times without any problem).
It looks like master needs a similar fix?
Great! I need to get my hands on 7.3 to test. Looking over the dracut commits in the RHEL-7 branch, I think I see the applicable commits that address this (e.g. network-lib wait for network RA), but as Phil mentioned, those commits don't seem to be in the 0.44 rawhide/master?
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
Sorry for the delay, but yes, this does seem to be better in 7.3. Thanks!
Hi, can you please confirm whether this has been merged to master/rawhide too? We really need those in rawhide... Thanks!
Can you please confirm this has been merged into rawhide/master? We really need this merged there... thanks!
I see master doesn't have the fix:
Not sure which one of the two is authoritative.