Created attachment 1106893
Anaconda .treeinfo retry patch
Description of problem:
During non-interactive kickstart installation on an IPv6-only host, package source and software selection setup fails in Anaconda due to a race condition with NetworkManager bringing up the NIC and learning a v6 gateway via router solicitation.
In our network we rely on learning the v6 default gateway from layer 3 rack switches via ICMPv6 router solicitation/router advertisements. For installing hosts we assign static IPv6 addresses (for DNS mapping) and disable SLAAC.
Anaconda starts and launches NetworkManager, then immediately tries to fetch .treeinfo from the defined repos. However, NetworkManager has not yet finished bringing the NIC up to a CONNECTED_GLOBAL state, so Anaconda's .treeinfo download fails with a "Network is unreachable" error and the repo(s) are removed from consideration.
Anaconda fails 1-2 seconds before NetworkManager has finished and installed a v6 default gateway.
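To confirm on an affected host that the failure lines up with the missing gateway, one option is to check the kernel's IPv6 routing table for a default route with a real next hop. The following is only a rough sketch, assuming the usual Linux /proc/net/ipv6_route layout of ten whitespace-separated fields per line; the function name has_ipv6_default_gateway is made up for illustration and is not part of Anaconda or NetworkManager.

```python
def has_ipv6_default_gateway(route_table_text):
    """Return True if the text contains an IPv6 default route via a gateway.

    Assumes the /proc/net/ipv6_route field order:
    dest, dest_prefix_len, src, src_prefix_len, next_hop,
    metric, refcnt, use, flags, device (all hex except device).
    """
    zero = "0" * 32
    for line in route_table_text.splitlines():
        fields = line.split()
        if len(fields) != 10:
            continue  # skip blank or unexpected lines
        dest, dest_plen, _src, _src_plen, next_hop = fields[:5]
        # A default route is ::/0; require a non-zero next hop so that
        # placeholder "unreachable" entries are not counted as a gateway.
        if dest == zero and dest_plen == "00" and next_hop != zero:
            return True
    return False
```

On a live system this could be called as has_ipv6_default_gateway(open("/proc/net/ipv6_route").read()) from a shell on the installer console while watching the logs.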
Version-Release number of selected component (if applicable):
How reproducible:
Very reproducible; virtually all v6-only kickstart installations fail.
Steps to Reproduce:
1. Build a kickstart configuration with v6-only base/repo URLs and a v6-only network configuration:
url --url http://[2401:db00:11:df:face:b00c:0:134]/yum/centos/7.x/os/x86_64
network --noipv4 --hostname=aux1.prn1.facebook.com --bootproto=static --ipv6=2401:db00:19:5a:face:0:31:0 --device=eth0 --nameserver=2401:db00:f0:a53::,2401:db00:f0:b53::
2. Kickstart the host on an IPv6-only network (e.g. via iPXE or UEFI), specifying options for a static v6 address, disabling SLAAC, and specifying no gateway on the kernel/dracut command-line:
3. Compare /tmp/packaging.log against /tmp/syslog to follow NetworkManager progress on the host being installed.
Actual results:
Anaconda starts, fails to fetch repo tree data, and considers the repos unusable. On the console this results in:
3) [!] Software selection (Installation source not set up)
4) [!] Installation source (Error setting up software source)
Expected results:
No software selection/source errors; Anaconda finishes the installation.
I've been able to fix this problem by adding a retry mechanism to the .treeinfo download function in pyanaconda/packaging/__init__.py. This is attached as fb-anaconda-package-treeinfo.patch.
There's an upstream Anaconda patch that retries package repo metadata downloads. I basically did exactly this in my treeinfo fix:
Logs of failures:
Logs of success after workaround attached as fixed-anaconda-packaging.log
Created attachment 1106894
Log output after patch applied
Issue originally reported on Anaconda's github page, but my workaround there was incorrect:
Would you be able to help with testing this bug once the fix is available?
From your first comment I understand that your issue should already be fixed by the commit you mentioned.
If I am correct, could you please test your issue in RHEL 7.2, where this patch should be included?
No, it's not the same thing. The code that's already in Anaconda handles retries for package repo metadata. My issue happens earlier when Anaconda is fetching .treeinfo since it's the first download operation that happens during the install. My fix replicated the same code from that commit and applied it to packaging/__init__.py so we retry fetching there.
This gives v6 enough time to get through things like duplicate address detection and RS/RA and end up with a usable gateway. All of this can take 1-4 seconds to complete. Without the retry we will likely fail downloading .treeinfo and mark the repo as unusable.
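For anyone reading along without the attachment, the general shape of such a retry is sketched below. This is not the attached patch and not the upstream Anaconda code; it is a standalone Python 3 illustration, and fetch_with_retries, its parameters, and the zero-argument fetch callable are all hypothetical names chosen for the example. The retry count and delay are arbitrary assumptions that only need to cover the few seconds described above.

```python
import time


def fetch_with_retries(fetch, retries=5, delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts are used up.

    fetch is a hypothetical zero-argument callable that returns the
    downloaded data or raises OSError (for example "Network is
    unreachable" while no default gateway is installed yet).
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except OSError as err:
            last_error = err
            # Wait before the next attempt, but not after the final one.
            if attempt < retries:
                sleep(delay)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

The sleep function is passed in so the delay can be stubbed out in tests; in Python 3 urllib's URLError is a subclass of OSError, so a urllib-based fetch would be covered by the same except clause.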
I was looking at the NetworkManager code in Anaconda yesterday. It looks like the root cause is that we wait for NM to signal any sort of 'connected' state (local, site, or global) before allowing Anaconda to continue. This seems kind of broken: if we have to go outside our local network for package repos etc. but proceed on a connected_local state, those downloads can fail. (The FIXME comment in the code alludes to this.)
NM code where this happens:
Unfortunately, a more rigorous fix for this seems pretty thorny: we'd have to figure out which repos/resources used during installation are remote. Perhaps this retry mechanism for .treeinfo is the best compromise.
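To illustrate the difference between proceeding on any connected state and waiting for the global one, here is a small polling sketch. It is an assumption-laden illustration, not Anaconda's code: the numeric values are the NMState constants from the NetworkManager D-Bus API, while wait_for_state and the get_state callable (something that returns the current NM state as an integer) are hypothetical names introduced for the example.

```python
import time

# NMState values as defined by the NetworkManager D-Bus API.
NM_STATE_CONNECTED_LOCAL = 50
NM_STATE_CONNECTED_SITE = 60
NM_STATE_CONNECTED_GLOBAL = 70


def wait_for_state(get_state, required=NM_STATE_CONNECTED_GLOBAL,
                   timeout=30.0, interval=0.5,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it reaches `required` or `timeout` expires.

    get_state is a hypothetical callable returning the current NMState
    as an int. Returns True if the state was reached, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if get_state() >= required:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Passing required=NM_STATE_CONNECTED_LOCAL mimics the "any connected state" behaviour described above, whereas the default waits for CONNECTED_GLOBAL. Whether waiting for the global state is appropriate depends on where the repos live, which is exactly the thorny part.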
Sorry for misunderstanding your issue, and thank you for your patch and for the explanation.
I'll look into this soon.
I've created a patch based on your patch, Bryan. Thank you for your work on it.
The final solution (NetworkManager state) seems too invasive to me for RHEL, but I'm going to create that fix for the master branch later.
This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.
This issue should be fixed in the RHEL-7.3 Beta compose. Could you please retest and confirm that it is fixed for you?
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.