Created attachment 1106893 [details]
Anaconda .treeinfo retry patch

Description of problem:

During non-interactive kickstart installation on an IPv6-only host, package source and software selection setup fails in Anaconda due to a race condition with NetworkManager bringing up the NIC and learning a v6 gateway via router solicitation.

In our network we rely on learning the v6 default gateway from layer 3 rack switches via ICMPv6 router solicitation/router advertisements. For installing hosts we assign a static IPv6 address (for DNS mapping) and disable SLAAC.

Anaconda starts, which then starts NetworkManager. Anaconda immediately tries to fetch .treeinfo from the defined repos. However, NetworkManager has not yet finished bringing up the NIC to a CONNECTED_GLOBAL state. Thus Anaconda's .treeinfo download fails with a "Network is unreachable" error, and Anaconda removes the repo(s) from consideration. Anaconda fails 1-2 seconds before NetworkManager has finished and has installed a v6 default gateway.

Version-Release number of selected component (if applicable):
anaconda 19.31.123-1
NetworkManager 1.0.0-14.git20150121.b4ea599c.el7

How reproducible:
Very reproducible; virtually all v6-only kickstart installations.

Steps to Reproduce:
1. Build a kickstart configuration with v6-only base/repo URLs and a v6-only network configuration:

    url --url http://[2401:db00:11:df:face:b00c:0:134]/yum/centos/7.x/os/x86_64
    network --noipv4 --hostname=aux1.prn1.facebook.com --bootproto=static --ipv6=2401:db00:19:5a:face:0:31:0 --device=eth0 --nameserver=2401:db00:f0:a53::,2401:db00:f0:b53::

2. Kickstart the host on an IPv6-only network (e.g. via iPXE or UEFI), specifying options for a static v6 address, disabling SLAAC, and specifying no gateway on the kernel/dracut command line:

    ip=[2401:db00:11:815a:face:0:31:0]:::64:::none noipv4

3. Watch /tmp/packaging.log vs /tmp/syslog for NetworkManager progress on the host being installed.

Actual results:
Anaconda starts, fails to fetch repo tree data, and considers the repos unusable. On the console this results in:

    3) [!] Software selection (Installation source not set up)
    4) [!] Installation source (Error setting up software source)

Expected results:
No software selection/source errors; Anaconda finishes the installation.

Additional info:
I've been able to fix this problem by adding a retry mechanism to the .treeinfo download function in pyanaconda/packaging/__init__.py. This is attached as fb-anaconda-package-treeinfo.patch.

There's an upstream Anaconda patch that retries package repo metadata downloads. I basically did exactly this in my treeinfo fix:
https://github.com/rhinstaller/anaconda/commit/8c2544c6f537240179569d0068a7b22250e21a25

Logs of failures: https://gist.github.com/bwann/396ea39750264ebb56a6
Logs of success after the workaround are attached as fixed-anaconda-packaging.log.

--bwann
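For readers without the attachment, the retry idea described above can be sketched roughly as follows. This is not the attached patch and not Anaconda's actual code: `fetch_with_retry`, `make_flaky_fetch`, and the retry count and delay are hypothetical names and values chosen for illustration, and the sketch assumes the failure surfaces as an `OSError` carrying `ENETUNREACH` ("Network is unreachable").

```python
import errno
import time


def fetch_with_retry(fetch, retries=5, delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the retry budget is used up.

    Only "Network is unreachable" is treated as transient; any other
    error is re-raised immediately. All names and defaults here are
    illustrative assumptions, not values taken from the patch.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except OSError as exc:
            if exc.errno != errno.ENETUNREACH:
                raise
            last_error = exc
            # Give the network time to learn a default gateway,
            # but do not sleep after the final attempt.
            if attempt < retries:
                sleep(delay)
    raise last_error


def make_flaky_fetch(failures):
    """Hypothetical stand-in for a .treeinfo download that fails
    `failures` times before the default route is available."""
    state = {"calls": 0}

    def fetch():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise OSError(errno.ENETUNREACH, "Network is unreachable")
        return "placeholder .treeinfo contents"

    return fetch, state


if __name__ == "__main__":
    fetch, state = make_flaky_fetch(failures=2)
    data = fetch_with_retry(fetch, retries=5, delay=0.1)
    print(state["calls"], data)
```

Passing `sleep` as a parameter is only there so the delay can be replaced in tests; a real fix would live inside the existing download function.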
Created attachment 1106894 [details] Log output after patch applied
Issue originally reported on Anaconda's github page, but my workaround there was incorrect: https://github.com/rhinstaller/anaconda/issues/466
Hi Bryan, would you be able to help with testing of this bug once the fix is available? Thanks, Martin
Sure thing
Hello Bryan,

From your first comment I understand that your issue should now be fixed by the commit you mentioned:
https://github.com/rhinstaller/anaconda/commit/8c2544c6f537240179569d0068a7b22250e21a25

If I am correct, could you please test your issue on RHEL 7.2, where this patch should be included?

Thank you
No, it's not the same thing. The code that's already in Anaconda handles retries for package repo metadata. My issue happens earlier, when Anaconda is fetching .treeinfo, since that's the first download operation that happens during the install.

My fix replicated the same code from that commit and applied it to packaging/__init__.py so we retry fetching there. This gives us enough time for v6 to have gone through things like duplicate address detection and RS/RA and to have a usable gateway. This could all take 1-4 seconds to complete. Otherwise we will likely fail downloading .treeinfo and mark the repo as unusable.

I was looking at the NetworkManager code in Anaconda yesterday. It looks like the root cause is that we wait for NM to signal any sort of 'connected' state (local, site, or global) before allowing Anaconda to continue. This seems kind of broken: if we have to go outside our local network for package repos etc. but proceed on a connected_local state, the remote fetch can fail. (The FIXME comment in the code alludes to this.)

NM code where this happens:
https://github.com/rhinstaller/anaconda/blob/master/pyanaconda/nm.py#L159

Unfortunately a more rigorous fix for this seems pretty thorny; we'd have to figure out which repos/resources used during installation are remote. Perhaps this retry mechanism for .treeinfo is the best compromise.
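To make the root cause above concrete, here is a small sketch of the difference between accepting any 'connected' state and requiring global connectivity. It is not the code in pyanaconda/nm.py: `connected_any`, `connected_global`, and `wait_until` are hypothetical helpers, and the polling loop is a simplification. The numeric values are the NMState constants from the NetworkManager D-Bus API.

```python
import time

# NMState values from the NetworkManager D-Bus API.
NM_STATE_CONNECTING = 40
NM_STATE_CONNECTED_LOCAL = 50
NM_STATE_CONNECTED_SITE = 60
NM_STATE_CONNECTED_GLOBAL = 70


def connected_any(state):
    """Accept local, site, or global connectivity (the behaviour
    described in the comment above)."""
    return state in (NM_STATE_CONNECTED_LOCAL,
                     NM_STATE_CONNECTED_SITE,
                     NM_STATE_CONNECTED_GLOBAL)


def connected_global(state):
    """Accept only full connectivity."""
    return state == NM_STATE_CONNECTED_GLOBAL


def wait_until(poll_state, accept, attempts=10, interval=0.5,
               sleep=time.sleep):
    """Poll until accept(state) is true; return that state, or None
    if the attempts run out. Hypothetical helper for illustration."""
    for attempt in range(attempts):
        state = poll_state()
        if accept(state):
            return state
        if attempt < attempts - 1:
            sleep(interval)
    return None


if __name__ == "__main__":
    # Simulated state progression while the NIC comes up.
    sequence = [NM_STATE_CONNECTING, NM_STATE_CONNECTED_LOCAL,
                NM_STATE_CONNECTED_SITE, NM_STATE_CONNECTED_GLOBAL]

    states = iter(sequence)
    print(wait_until(lambda: next(states), connected_any, interval=0))

    states = iter(sequence)
    print(wait_until(lambda: next(states), connected_global, interval=0))
```

With the first predicate the wait ends as soon as a local-only state is reported, which is the window in which a fetch from a remote repo can still hit "Network is unreachable". The stricter predicate closes that window but, as noted above, would also stall installs that never need to leave the local network.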
Sorry for misunderstanding your issue, and thank you for your patch and for the explanation. I'll look into this soon.
PR: https://github.com/rhinstaller/anaconda/pull/561

I've created a patch based on yours, Bryan. Thank you for your work on it. The final solution (checking the NetworkManager state) seems too invasive for RHEL, but I'm going to create that fix for the master branch later.
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions
Hi Bryan, this issue should be fixed in the RHEL-7.3 Beta compose. Could you please retest and confirm that the issue is fixed for you? Thanks, Martin
Bryan, any update? Thanks, M.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2158.html