Bug 1292613

Summary: Race condition between NetworkManager and anaconda on IPv6-only hosts
Product: Red Hat Enterprise Linux 7
Reporter: Bryan Wann <bwann>
Component: anaconda
Assignee: Jiri Konecny <jkonecny>
Status: CLOSED ERRATA
QA Contact: Release Test Team <release-test-team-automation>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 7.1
CC: bwann, josef, mbanas, mikolaj
Target Milestone: rc
Keywords: OtherQA
Target Release: ---
Flags: bwann: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: anaconda-21.48.22.63-1
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-03 23:20:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                       Flags
Anaconda .treeinfo retry patch    none
Log output after patch applied    none

Description Bryan Wann 2015-12-17 22:39:27 UTC
Created attachment 1106893
Anaconda .treeinfo retry patch

Description of problem:
During a non-interactive kickstart installation on an IPv6-only host, package source and software selection setup fails in Anaconda because of a race with NetworkManager, which is still bringing up the NIC and learning a v6 gateway via router solicitation.

In our network we rely on learning the v6 default gateway from layer 3 rack switches via ICMPv6 router solicitation/router advertisements. For installing hosts we assign static IPv6 addresses (for DNS mapping) and disable SLAAC.

Anaconda starts, which then starts NetworkManager. Anaconda immediately tries to fetch .treeinfo from the defined repos. However, NetworkManager has not yet finished bringing the NIC up to a CONNECTED_GLOBAL state, so anaconda's .treeinfo download fails with a "Network is unreachable" error and anaconda removes the repo(s) from consideration.

Anaconda fails 1-2 seconds before NetworkManager finishes and installs a v6 default gateway.


Version-Release number of selected component (if applicable):
anaconda 19.31.123-1
NetworkManager 1.0.0-14.git20150121.b4ea599c.el7


How reproducible:
Very reproducible; it affects virtually all v6-only kickstart installations.

Steps to Reproduce:
1. Build a kickstart configuration with v6-only base/repo urls, and v6-only network configuration:
  url --url http://[2401:db00:11:df:face:b00c:0:134]/yum/centos/7.x/os/x86_64
  network --noipv4 --hostname=aux1.prn1.facebook.com --bootproto=static --ipv6=2401:db00:19:5a:face:0:31:0 --device=eth0 --nameserver=2401:db00:f0:a53::,2401:db00:f0:b53::

2. Kickstart the host on an IPv6-only network (e.g. via iPXE or UEFI), specifying options for a static v6 address, disabling SLAAC, and specifying no gateway on the kernel/dracut command-line:

  ip=[2401:db00:11:815a:face:0:31:0]:::64:::none noipv4

3. Watch /tmp/packaging.log vs /tmp/syslog for NetworkManager progress on the host being installed
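
To make the "no gateway" part of step 2 concrete, here is a rough Python sketch that splits the ip= value into its fields. It assumes the static-configuration layout documented in dracut.cmdline(7) (client, peer, gateway, netmask, hostname, interface, autoconf) and a bracketed IPv6 client address; the function name `parse_dracut_ip` is made up for illustration and is not part of dracut or anaconda.

```python
import ipaddress


def parse_dracut_ip(value):
    """Split a static dracut ip= value that starts with a bracketed IPv6 address.

    Hypothetical helper for illustration only; it handles just the
    seven-field static form used in this report.
    """
    if not value.startswith("["):
        raise ValueError("expected a bracketed IPv6 client address")
    client, bracket, rest = value[1:].partition("]")
    if not bracket or not rest.startswith(":"):
        raise ValueError("malformed client address")
    ipaddress.IPv6Address(client)  # raises ValueError if not valid IPv6

    names = ["peer", "gateway", "netmask", "hostname", "interface", "autoconf"]
    fields = rest[1:].split(":")
    if len(fields) != len(names):
        raise ValueError("expected %d fields after the client address" % len(names))

    parsed = dict(zip(names, fields))
    parsed["client"] = client
    return parsed


if __name__ == "__main__":
    cfg = parse_dracut_ip("[2401:db00:11:815a:face:0:31:0]:::64:::none")
    # The gateway field is empty, so the default route has to come from RAs.
    print(cfg)
```

The empty gateway field is the relevant detail: nothing on the command line provides a default route, so the host depends on the router advertisement arriving before the first download.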

Actual results:
Anaconda starts, fails to fetch the repo tree data, and considers the repos unusable. On the console this results in:

3) [!] Software selection (Installation source not set up)
4) [!] Installation source (Error setting up software source)


Expected results:
No software selection/source errors, anaconda finishes the installation.


Additional info:
I've been able to fix this problem by adding a retry mechanism to the .treeinfo download function in pyanaconda/packaging/__init__.py.  This is attached as fb-anaconda-package-treeinfo.patch.

There's an upstream Anaconda patch that retries package repo metadata downloads. I basically did exactly this in my treeinfo fix:
https://github.com/rhinstaller/anaconda/commit/8c2544c6f537240179569d0068a7b22250e21a25
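
The retry idea can be sketched roughly as follows. This is an illustrative Python 3 sketch, not the attached patch and not the upstream commit: `fetch_with_retry`, its parameters, and the `fetch` callable are hypothetical names, and the retry count and delay are placeholder values.

```python
import time


def fetch_with_retry(fetch, retries=5, delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts are used up.

    fetch   -- zero-argument callable that downloads .treeinfo (assumed)
    retries -- total number of attempts, must be >= 1
    delay   -- seconds to wait between attempts
    sleep   -- injectable so the loop can be exercised without waiting
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except OSError as err:
            # "Network is unreachable" (ENETUNREACH) is an OSError in Python 3.
            last_error = err
            if attempt < retries:
                sleep(delay)
    # Every attempt failed; surface the last error to the caller.
    raise last_error
```

With a one-second delay, a handful of attempts would cover the window in which NetworkManager is still installing the v6 default gateway, while a repo that is really unreachable still fails after the last attempt.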

Logs of failures:
https://gist.github.com/bwann/396ea39750264ebb56a6

Logs of success after workaround attached as fixed-anaconda-packaging.log



--bwann

Comment 1 Bryan Wann 2015-12-17 22:40:10 UTC
Created attachment 1106894
Log output after patch applied

Comment 3 Bryan Wann 2015-12-17 23:20:12 UTC
Issue originally reported on Anaconda's github page, but my workaround there was incorrect:

https://github.com/rhinstaller/anaconda/issues/466

Comment 4 Martin Banas 2016-03-01 08:23:43 UTC
Hi Bryan,
would you be able to help with testing of this bug once the fix is available?

Thanks,
Martin

Comment 5 Bryan Wann 2016-03-01 20:32:28 UTC
Sure thing

Comment 6 Jiri Konecny 2016-03-17 13:53:43 UTC
Hello Bryan,

from your first comment I understand that your issue should be fixed now by the commit you have mentioned.

https://github.com/rhinstaller/anaconda/commit/8c2544c6f537240179569d0068a7b22250e21a25

If I am correct, could you please test your issue in RHEL 7.2, where this patch should be included?

Thank you

Comment 7 Bryan Wann 2016-03-17 18:59:44 UTC
No, it's not the same thing. The code that's already in Anaconda handles retries for package repo metadata. My issue happens earlier when Anaconda is fetching .treeinfo since it's the first download operation that happens during the install. My fix replicated the same code from that commit and applied it to packaging/__init__.py so we retry fetching there.

This gives v6 enough time to go through things like duplicate address detection and RS/RA and to end up with a usable gateway. All of this can take 1-4 seconds to complete. Otherwise we will likely fail downloading .treeinfo and mark the repo as unusable.

I was looking at the NetworkManager code in Anaconda yesterday. It looks like the root cause is that we wait for NM to signal any sort of 'connected' state (local, site, or global) before allowing Anaconda to continue. This seems kind of broken: if we have to go outside our local network for package repos etc. but proceed on a connected_local state, the remote repos may not be reachable yet. (The FIXME comment in the code alludes to this.)

NM code where this happens:
https://github.com/rhinstaller/anaconda/blob/master/pyanaconda/nm.py#L159
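
For illustration, a stricter wait could look roughly like the sketch below. This is not the code in pyanaconda/nm.py: `wait_for_global` and the `get_state` callable are hypothetical, and only the numeric NMState values (50, 60, 70) come from the NetworkManager D-Bus API. Whether NM reports the global state at the right moment in this scenario is an assumption that would need testing.

```python
import time

# NMState values as defined by the NetworkManager D-Bus API.
NM_STATE_CONNECTED_LOCAL = 50
NM_STATE_CONNECTED_SITE = 60
NM_STATE_CONNECTED_GLOBAL = 70


def wait_for_global(get_state, timeout=10.0, interval=0.5,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until NM reports CONNECTED_GLOBAL or timeout expires.

    get_state -- zero-argument callable returning the current NMState
                 (assumed; in practice it would query NM over D-Bus)
    Returns True on CONNECTED_GLOBAL, False if the timeout is reached.
    """
    deadline = clock() + timeout
    while True:
        if get_state() == NM_STATE_CONNECTED_GLOBAL:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The difference from accepting any 'connected' state is that local and site connectivity keep the loop waiting, at the cost of delaying installs that only need local resources.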

Unfortunately, a more rigorous fix for this seems pretty thorny: we'd have to figure out which repos/resources used during installation are remote. Perhaps this retry mechanism for .treeinfo is the best compromise.

Comment 8 Jiri Konecny 2016-03-18 09:00:46 UTC
Sorry for misunderstanding your issue, and thank you for your patch and for the explanation.

I'll look into this soon.

Comment 9 Jiri Konecny 2016-03-22 10:00:43 UTC
PR: https://github.com/rhinstaller/anaconda/pull/561

I've created a patch based on your patch, Bryan. Thank you for your work on it.

The final solution (Network Manager state) seems too invasive for RHEL to me, but I'm going to create that fix on the master branch later.

Comment 10 Mike McCune 2016-03-28 22:46:32 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation. Please see mmccune with any questions.

Comment 12 Martin Banas 2016-08-31 11:22:48 UTC
Hi Bryan,
this issue should be fixed in the RHEL-7.3 Beta compose. Could you please retest and confirm that the issue is fixed for you?

Thanks,
Martin

Comment 13 Martin Banas 2016-09-21 07:55:48 UTC
Bryan,
any update?

Thanks,
M.

Comment 16 errata-xmlrpc 2016-11-03 23:20:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2158.html