1292613 – Race condition between NetworkManager and anaconda on IPv6-only hosts

Summary: Race condition between NetworkManager and anaconda on IPv6-only hosts

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	anaconda
Sub Component:
Version:	7.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Jiri Konecny
QA Contact:	Release Test Team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2015-12-17 22:39 UTC by Bryan Wann
Modified:	2019-12-25 02:13 UTC
CC List:	4 users
Fixed In Version:	anaconda-21.48.22.63-1
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-11-03 23:20:56 UTC
Target Upstream Version:
Embargoed:
Flags:	bwann: needinfo-

Attachments
Anaconda .treeinfo retry patch (2.79 KB, patch) 2015-12-17 22:39 UTC, Bryan Wann	no flags
Log output after patch applied (16.29 KB, text/plain) 2015-12-17 22:40 UTC, Bryan Wann	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2016:2158	0	normal	SHIPPED_LIVE	anaconda bug fix and enhancement update	2016-11-03 13:13:55 UTC

Description Bryan Wann 2015-12-17 22:39:27 UTC

Created attachment 1106893
Anaconda .treeinfo retry patch

Description of problem:
During a non-interactive kickstart installation on an IPv6-only host, setting up the package source and software selection fails in Anaconda because of a race condition with NetworkManager, which is still bringing up the NIC and learning a v6 gateway via router solicitation.

In our network we rely on learning the v6 default gateway from layer 3 rack switches via ICMPv6 router solicitation/router advertisement. For installing hosts we assign a static IPv6 address (for DNS mapping) and disable SLAAC.

Anaconda starts, which then starts NetworkManager. Anaconda immediately tries to fetch treeinfo from the defined repos. However, NetworkManager has not yet finished bringing up the NIC to a CONNECTED_GLOBAL state. Thus anaconda's treeinfo download fails with a "Network is unreachable" error, and anaconda removes the repo(s) from consideration.

Anaconda fails 1-2 seconds before NetworkManager has finished and has installed a v6 default gateway.


Version-Release number of selected component (if applicable):
anaconda 19.31.123-1
NetworkManager 1.0.0-14.git20150121.b4ea599c.el7


How reproducible:
Very reproducible; it affects virtually all v6-only kickstart installations

Steps to Reproduce:
1. Build a kickstart configuration with v6-only base/repo URLs and a v6-only network configuration:
  url --url http://[2401:db00:11:df:face:b00c:0:134]/yum/centos/7.x/os/x86_64
  network --noipv4 --hostname=aux1.prn1.facebook.com --bootproto=static --ipv6=2401:db00:19:5a:face:0:31:0 --device=eth0 --nameserver=2401:db00:f0:a53::,2401:db00:f0:b53::

2. Kickstart the host on an IPv6-only network (e.g. via iPXE or UEFI), specifying options for a static v6 address, disabling SLAAC, and specifying no gateway on the kernel/dracut command-line:

  ip=[2401:db00:11:815a:face:0:31:0]:::64:::none noipv4

3. Watch /tmp/packaging.log vs /tmp/syslog for NetworkManager progress on the host being installed

Actual results:
Anaconda starts, fails to fetch repo tree data, and considers the repos unusable. On the console this results in:

3) [!] Software selection (Installation source not set up)
4) [!] Installation source (Error setting up software source)


Expected results:
No software selection/source errors, anaconda finishes the installation.


Additional info:
I've been able to fix this problem by adding a retry mechanism to the .treeinfo download function in pyanaconda/packaging/__init__.py.  This is attached as fb-anaconda-package-treeinfo.patch.

There's an upstream Anaconda patch that retries package repo metadata downloads. I basically did exactly this in my treeinfo fix:
https://github.com/rhinstaller/anaconda/commit/8c2544c6f537240179569d0068a7b22250e21a25
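
For readers without the attachment, the retry idea can be sketched roughly as follows. This is a minimal illustration and not the attached patch or the upstream commit: `fetch_with_retry`, its parameters, and the choice to catch `OSError` are all assumptions made for the example, and `fetch` stands in for whatever callable actually downloads .treeinfo.

```python
import time


def fetch_with_retry(fetch, attempts=5, delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    fetch    -- hypothetical zero-argument callable that downloads .treeinfo
    attempts -- total number of tries (assumed value, not from the patch)
    delay    -- seconds to wait between tries (assumed value)
    sleep    -- injectable so the loop can be tested without real waiting
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as err:
            # "Network is unreachable" surfaces as an OSError subclass
            # in Python 3; remember it in case every attempt fails.
            last_error = err
            if attempt < attempts:
                sleep(delay)
    # Every attempt failed: re-raise the most recent error.
    raise last_error
```

With these assumed defaults the loop keeps trying for roughly 8 seconds (four waits of 2 seconds), which would comfortably cover the 1-2 second gap described above.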

Logs of failures:
https://gist.github.com/bwann/396ea39750264ebb56a6

Logs of success after workaround attached as fixed-anaconda-packaging.log



--bwann

Comment 1 Bryan Wann 2015-12-17 22:40:10 UTC

Created attachment 1106894
Log output after patch applied

Comment 3 Bryan Wann 2015-12-17 23:20:12 UTC

This issue was originally reported on Anaconda's GitHub page, but my workaround there was incorrect:

https://github.com/rhinstaller/anaconda/issues/466

Comment 4 Martin Banas 2016-03-01 08:23:43 UTC

Hi Bryan,
would you be able to help with testing of this bug once the fix is available?

Thanks,
Martin

Comment 5 Bryan Wann 2016-03-01 20:32:28 UTC

Sure thing

Comment 6 Jiri Konecny 2016-03-17 13:53:43 UTC

Hello Bryan,

From your first comment I understand that your issue should already be fixed by the commit you mentioned.

https://github.com/rhinstaller/anaconda/commit/8c2544c6f537240179569d0068a7b22250e21a25

If I am correct, could you please test your issue on RHEL 7.2, where this patch should be included?

Thank you

Comment 7 Bryan Wann 2016-03-17 18:59:44 UTC

No, it's not the same thing. The code that's already in Anaconda handles retries for package repo metadata. My issue happens earlier when Anaconda is fetching .treeinfo since it's the first download operation that happens during the install. My fix replicated the same code from that commit and applied it to packaging/__init__.py so we retry fetching there.

This gives v6 enough time to go through things like duplicate address detection and RS/RA and end up with a usable gateway. All of this can take 1-4 seconds to complete. Otherwise we will likely fail to download .treeinfo and mark the repo as unusable.

I was looking at the NetworkManager code in Anaconda yesterday. It looks like the root cause is that we wait for NM to signal any sort of 'connected' state (local, site, or global) before allowing Anaconda to continue. This seems kind of broken: if we have to go outside our local network for package repos etc. but proceed on a connected_local state, the first download can fail. (The FIXME comment in the code alludes to this.)

NM code where this happens:
https://github.com/rhinstaller/anaconda/blob/master/pyanaconda/nm.py#L159
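
The difference between waiting for any 'connected' state and waiting specifically for global connectivity can be sketched as below. The numeric values are the NMState constants from the NetworkManager D-Bus API; everything else (`wait_for_state`, `get_state`, the timeout and polling interval) is hypothetical and is not how pyanaconda/nm.py is written.

```python
import time

# NMState values as defined by NetworkManager.
NM_STATE_CONNECTED_LOCAL = 50
NM_STATE_CONNECTED_SITE = 60
NM_STATE_CONNECTED_GLOBAL = 70

# Accepting any of these mirrors the behaviour described in this comment.
ANY_CONNECTED = (
    NM_STATE_CONNECTED_LOCAL,
    NM_STATE_CONNECTED_SITE,
    NM_STATE_CONNECTED_GLOBAL,
)


def wait_for_state(get_state, wanted=(NM_STATE_CONNECTED_GLOBAL,),
                   timeout=30.0, interval=0.5, sleep=time.sleep):
    """Poll get_state() until it returns one of `wanted` or time runs out.

    get_state -- hypothetical callable returning the current NMState integer
    Returns True if a wanted state was seen, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_state() in wanted:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Passing `wanted=ANY_CONNECTED` returns as soon as NM reports connected_local, which is the window in which a remote repo may still be unreachable; the default waits for connected_global instead.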

Unfortunately a more rigorous fix for this seems pretty thorny: we'd have to figure out which repos/resources used during installation are remote. Perhaps this retry mechanism for .treeinfo is the best compromise.

Comment 8 Jiri Konecny 2016-03-18 09:00:46 UTC

Sorry for misunderstanding your issue, and thank you for your patch and the explanation.

I'll look into this soon.

Comment 9 Jiri Konecny 2016-03-22 10:00:43 UTC

PR: https://github.com/rhinstaller/anaconda/pull/561

I've created a patch based on yours, Bryan. Thank you for your work on it.

The final solution (waiting on the NetworkManager state) seems too invasive for RHEL, but I'm going to create that fix for the master branch later.

Comment 10 Mike McCune 2016-03-28 22:46:32 UTC

This bug was accidentally moved from POST to MODIFIED via an error in automation. Please see mmccune with any questions.

Comment 12 Martin Banas 2016-08-31 11:22:48 UTC

Hi Bryan,
this issue should be fixed in the RHEL-7.3 Beta compose. Could you please retest and confirm that the issue is fixed for you?

Thanks,
Martin

Comment 13 Martin Banas 2016-09-21 07:55:48 UTC

Bryan,
any update?

Thanks,
M.

Comment 16 errata-xmlrpc 2016-11-03 23:20:56 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2158.html
