Bug 1292623

Summary: dracut: fails to fetch kickstart config+stage2 reliably on v6-only networks
Product: Red Hat Enterprise Linux 7 Reporter: Bryan Wann <bwann>
Component: dracut Assignee: Lukáš Nykrýn <lnykryn>
Status: CLOSED ERRATA QA Contact: Release Test Team <release-test-team-automation>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.1 CC: bwann, dracut-maint-list, harald, josef, jstodola, lnykryn, mikolaj, pallotron, phil
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-04 08:02:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Syslog of kickstart+stage2 download success with dracut sleep patch none
dracut sleep patch none

Description Bryan Wann 2015-12-17 23:18:00 UTC
Created attachment 1106910 [details]
Syslog of kickstart+stage2 download success with dracut sleep patch

Description of problem:
When kickstarting a CentOS 7 host on an IPv6-only network, there's a race condition in dracut's initramfs between bringing up the NIC, learning the v6 gateway, and downloading the kickstart config. This often causes fetching of the kickstart config and the stage2 image to fail, after which the boot eventually drops to an emergency dracut shell.

In our network we rely on learning the v6 default gateway from layer 3 rack switches via ICMPv6 router solicitations/router advertisements. For installing hosts we assign static IPv6 addresses (for DNS mapping) and disable SLAAC.

Syslog shows the Ethernet link state changing up/down/up immediately before the failure. Based on the "Network is unreachable" errors, I strongly suspect that the link is considered online and dracut goes on to fetch the kickstart config and stage2 before the host has learned a v6 default gateway from the switch.


Version-Release number of selected component (if applicable):
dracut-033-240.el7

How reproducible:
This is very reproducible when kickstart installing hosts in our v6-only environment.

Steps to Reproduce:
1. Kickstart a host (e.g. with iPXE or UEFI) over a v6 network, specifying a v6 IP address for the initrd, stage2 and kickstart options. Example:

  4.0.9-30_vmlinuz noipv4 console=ttyS1,57600 selinux=0 pcie_aspm=off net.ifnames=0 initrd=\4.0.9-30_centos7.1.1503_initrd.img biosdevname=0 inst.ks.sendmac ip=[2401:db00:20:153:face:0:2d:0]:::64:::none inst.stage2=http://[2401:db00:11:df:face:b00c:0:134]/yum/centos/7.x/stage2/7.1-r18/ ramdisk_size=2097152 inst.loglevel=debug inst.cmdline inst.sshd rd.live.image nameserver=2401:db00:f0:a53:: nameserver=2401:db00:f0:b53:: inst.ks=http://[2401:db00:11:df:face:b00c:0:76]/ks-aux1.prn1.facebook.com.cfg

2. Wait for dracut to load and start bringing up the Basic System target.

Actual results:

The kernel boots and dracut runs, but its attempt to fetch the kickstart configuration and the specified stage2 squashfs.img over the network fails, and it drops to the dracut prompt, e.g.:

[   16.033918] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   16.210200] ixgbe 0000:03:00.0 eth0: NIC Link is Down
[   16.319369] ixgbe 0000:03:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
dracut-initqueue[1219]: curl: (7) Failed to connect to 2401:db00:11:df:face:b00c:0:134: Network is unreachable
dracut-initqueue[1219]: Warning: failed to fetch kickstart from http://[2401:db00:11:df:face:b00c:0:134]/ks-aux1.prn1.facebook.com.cfg

Expected results:

Dracut fetches the kickstart config and stage2 and starts anaconda, and the installation succeeds.


Additional info:

I have been able to reliably work around this by inserting a dracut online script named 10-fb-network-sleep-fix.sh that does nothing but sleep for 4 seconds before 11-fetch-kickstart-net.sh runs.
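
As an illustration of an alternative to a fixed delay, a hook could poll until a default IPv6 route appears and only then let the fetch proceed. The following is a minimal sketch under assumptions, not the attached patch: the file name 10-wait-ipv6-gateway.sh, the function name wait_for_default_route6, and the ROUTE_CMD override are all hypothetical, and it assumes the `ip` tool is available in the initramfs.

```shell
#!/bin/sh
# Hypothetical hook, e.g. installed as
# /usr/lib/dracut/hooks/initqueue/online/10-wait-ipv6-gateway.sh
# Sketch only: polls for a default IPv6 route instead of sleeping a fixed time.

# ROUTE_CMD is an illustrative override so the loop can be exercised
# without a real network; by default it asks iproute2 for default v6 routes.
: "${ROUTE_CMD:=ip -6 route show default}"

# wait_for_default_route6 [tries]
# Returns 0 as soon as a default IPv6 route is listed,
# or 1 after the given number of one-second attempts (default 10).
wait_for_default_route6() {
    tries=${1:-10}
    while [ "$tries" -gt 0 ]; do
        # Word splitting of $ROUTE_CMD is intentional (command plus arguments).
        routes=$($ROUTE_CMD 2>/dev/null)
        case "$routes" in
            default*) return 0 ;;   # a line such as "default via fe80::... dev eth0"
        esac
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}
```

A hook built on this sketch might call `wait_for_default_route6 10` and log a warning if it returns non-zero. Unlike a fixed sleep, the wait ends as soon as the router advertisement has been processed, and it is bounded when no gateway ever appears.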

Dracut rd.debug logs are not very useful here. The race window is only 1-2 seconds, and enabling debug logging slows down the boot process just enough that the race is never exposed with debug turned on. :(


Failure log:  https://gist.github.com/bwann/7bc1c4217807b01087ff

Originally reported on anaconda's github issue tracker:
https://github.com/rhinstaller/anaconda/issues/464

Comment 1 Bryan Wann 2015-12-17 23:19:01 UTC
Created attachment 1106911 [details]
dracut sleep patch

Inserted into initrd as  /usr/lib/dracut/hooks/initqueue/online/10-fb-network-sleep-fix.sh

Comment 5 Jan Stodola 2016-09-06 13:28:39 UTC
Bryan Wann,
this issue should be fixed in RHEL-7.3 Beta. Could you please check whether it works for you now?

I was able to reproduce it with RHEL-7.2. When retesting on the same system with the latest dracut-033-458.el7, this issue didn't appear (the system was booted ~100 times without any problem).

Comment 6 Phil Dibowitz 2016-09-07 23:38:41 UTC
It looks like master needs a similar fix?

Comment 7 Bryan Wann 2016-09-08 05:51:39 UTC
Great!  I need to get my hands on 7.3 to test.  Looking over the Dracut commits in the RHEL-7 branch I think I see the applicable commits that address this (e.g. network-lib wait for network RA), but as Phil mentioned those commits don't seem to be in the 0.44 rawhide/master?

Comment 9 errata-xmlrpc 2016-11-04 08:02:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2530.html

Comment 10 Bryan Wann 2017-03-21 03:18:41 UTC
Sorry for the delay, but yes, this does seem to be better in 7.3. Thanks!

Comment 11 Angelo FAilla 2017-03-27 23:06:03 UTC
Hi, can you please confirm whether this has been merged to master/rawhide too? We really need those in rawhide... Thanks!

Comment 12 Angelo FAilla 2017-03-27 23:07:29 UTC
Can you please confirm this has been merged into rawhide/master? We really need this merged there... thanks!

Comment 13 Angelo FAilla 2017-03-27 23:25:51 UTC
I see master doesn't have the fix:

https://github.com/dracutdevs/dracut/blob/master/modules.d/40network/net-lib.sh#L665

same on:

https://git.kernel.org/pub/scm/boot/dracut/dracut.git/tree/modules.d/40network/net-lib.sh?h=master#n665

not sure which one of the two is authoritative.