Bug 1262950 - Race condition with NetworkManager on discovery image
Status: CLOSED ERRATA
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Discovery Image
Version: 6.1.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: Beta
Target Release: --
Assigned To: Lukas Zapletal
QA Contact: Sachin Ghai
URL: http://projects.theforeman.org/issues...
Keywords: Triaged
Depends On:
Blocks:
Reported: 2015-09-14 13:52 EDT by David Critch
Modified: 2016-07-27 07:06 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-07-27 07:06:43 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
screenshot of discovery (472.08 KB, image/jpeg)
2015-09-14 13:52 EDT, David Critch
journal from discovery boot (119.67 KB, text/x-vhdl)
2015-10-05 14:08 EDT, David Critch
journal output from foreman-discovery-image-2.1.1-1.el7sat.noarch.rpm (65.60 KB, text/x-vhdl)
2015-12-11 13:25 EST, David Critch


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Bugzilla 1227017 None None None Never

Description David Critch 2015-09-14 13:52:08 EDT
Created attachment 1073364 [details]
screenshot of discovery

Description of problem:
Essentially the same as https://bugzilla.redhat.com/show_bug.cgi?id=1227017 but that was specific to the OSP installer.

This is in a lab within Red Hat. Note that this is on some older gear. I don't see the issue in another lab w/ more modern kit.

It seems that DHCP is still coming up when the discoveryd process starts, so the process fails to resolve the foreman host.

If I -HUP the discovery process in another terminal, the host properly resolves the foreman IP, reports in and is discovered.

Version-Release number of selected component (if applicable):
foreman-discovery-image-2.1.0-36.el7sat.noarch

How reproducible:
Always

Steps to Reproduce:
1. configure Satellite or capsule for discovery, with fdi.rootpw=PASSWD
2. boot host over PXE
3. host gets DHCP lease and gets TFTP file

Actual results:
4. host fails to resolve the foreman IP; I must kill -HUP the discovery process in a separate terminal

Expected results:
4. host checks in to foreman for discovery

Additional info:
this is in a lab within Red Hat and I can share some access/logs as needed.
Comment 2 David Critch 2015-10-05 14:08 EDT
Created attachment 1080015 [details]
journal from discovery boot

This is a dump of journalctl from the host once booted in discovery mode. You can see the first couple of times, the host fails to send to foreman due to DNS issues. 

After kill -HUP the discovery process, you can see it then register properly.

Discovery is starting before DHCP is fully up, and can't resolve the foreman URL at that time. Eventually the host is on the network and can resolve foreman, but the discovery process never seems to learn that the host is resolvable until after restarting the discovery daemon.
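
To make the failure mode above concrete, here is a minimal Python sketch of a client that retries name resolution instead of giving up after the first failure. It is not the discovery image's code; the function name, its parameters, and the example hostname are all hypothetical. Whether an in-process retry is enough depends on how the resolver picks up configuration changes once DHCP completes; the upstream change for the related foreman-proxy race, quoted later in this report, orders the service after the network instead.

```python
import socket
import time


def wait_for_resolution(hostname, attempts=10, delay=1.0,
                        resolver=socket.getaddrinfo, sleep=time.sleep):
    """Retry name resolution until it succeeds or the attempts run out.

    Hypothetical helper for illustration only. `resolver` and `sleep`
    are injectable so the behaviour can be exercised without a network.
    Returns the resolver's result, or re-raises the last socket.gaierror.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # Port is None because only name resolution matters here.
            return resolver(hostname, None)
        except socket.gaierror as err:
            last_error = err
            # Do not sleep after the final failed attempt.
            if attempt < attempts:
                sleep(delay)
    raise last_error
```

A caller would invoke something like wait_for_resolution("foreman.example.com") before sending the first discovery request; the hostname here is a placeholder.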
Comment 4 Lukas Zapletal 2015-10-20 07:06:02 EDT
Hello David,

I can confirm we encountered this kind of behaviour and it has been fixed upstream already. We are planning a discovery erratum in one or two months that will rebase the image and include this fix as well.
Comment 18 Lukas Zapletal 2015-12-08 10:25:40 EST
I can make you another build if you want and upload it for you. Just let me know on IRC.
Comment 20 David Critch 2015-12-11 13:24:57 EST
With the latest build, the discovery-register service doesn't start. Attaching the latest journalctl output.
Comment 21 David Critch 2015-12-11 13:25 EST
Created attachment 1104757 [details]
journal output from foreman-discovery-image-2.1.1-1.el7sat.noarch.rpm
Comment 24 Lukas Zapletal 2015-12-17 06:14:35 EST
Yes, Brad, there is a patch pending we need to include.

David, does the above link from comment 22 work?

Anyway, the 6.1.5 errata is out and it contains a completely rebased image. It won't be compatible with OSP anymore, but this bug was filed against Satellite 6, so use it.

We are tracking one additional race condition whose fix hasn't been merged upstream yet. I am attaching it to this BZ; the symptoms are similar (this time foreman-proxy is not started properly): http://projects.theforeman.org/issues/12429
Comment 26 Bryan Kearney 2016-02-24 18:10:31 EST
Moving to POST since upstream bug http://projects.theforeman.org/issues/12429 has been closed
-------------
Kamil Madac
Solution (workaround?) is to start foreman-proxy after NetworkManager-wait-online.service is ready.
More in https://github.com/theforeman/foreman-discovery-image/pull/48
-------------
Kamil Madac
My fault. I forgot to replace both lines (Wants= and After=) in foreman-proxy.service (https://github.com/lzap/foreman-discovery-image/commit/c72e80902b4cd34c4b8369f3ec118b8ef7ac9bf6). Once I did it, provisioning works as expected.
-------------
Lukas Zapletal
Great, can you please confirm in the PR itself that the build I made works as expected? Or at least show us the patch you made on your own build. Thanks. It's the https://github.com/theforeman/foreman-discovery-image/pull/50
-------------
Anonymous
Applied in changeset commit:foreman-discovery-image|0c18ba2a6d04e5105db1e2085fe69f091b6922c7.
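
The upstream thread above notes that both the Wants= and After= lines in foreman-proxy.service have to name NetworkManager-wait-online.service, and that changing only one of them was not enough. A small Python sketch of a check for that condition follows. The helper names and the sample unit text are hypothetical and are not the contents of the real foreman-proxy.service; the parser is deliberately simple and only looks at the [Unit] section.

```python
def ordering_directives(unit_text, target):
    """Return the [Unit] directive names whose value lists `target`."""
    found = set()
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section != "Unit" or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Directive values are space-separated lists of unit names.
        if target in value.split():
            found.add(key.strip())
    return found


def waits_for(unit_text, target="NetworkManager-wait-online.service"):
    """True only if the unit both pulls in and orders itself after target."""
    return {"Wants", "After"} <= ordering_directives(unit_text, target)


# Hypothetical unit text for illustration; ExecStart is a placeholder.
EXAMPLE_UNIT = """\
[Unit]
Description=Foreman Proxy
Wants=NetworkManager-wait-online.service
After=NetworkManager-wait-online.service

[Service]
ExecStart=/usr/bin/true
"""
```

Wants= pulls the target unit into the boot transaction and After= delays startup until it has finished, which is why the check requires both.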
Comment 28 Sachin Ghai 2016-04-15 02:23:23 EDT
@Lukas, please provide verification steps. I assume the host should have multiple interfaces to reproduce this? Please advise.
Comment 29 Lukas Zapletal 2016-04-15 03:45:37 EDT
QA steps:

Simply verify if discovery works with one or multiple NICs in various environments.

Also, if possible, simulate slow DHCP and verify it starts correctly as well. You could easily simulate this by turning off the DHCP server on the network, waiting until the Welcome screen appears and then turning it on. The background process should start the discovery request and after a few seconds, you should be able to refresh the screen. The status will likely be UNKNOWN - Use Refresh button to update info, this is expected.
Comment 30 Sachin Ghai 2016-04-18 06:17:23 EDT
Verified with sat6.2 beta snap8.2

I discovered a host with two NICs and tried to simulate the slow DHCP as suggested in comment 29. However, I'm not able to reproduce the reported issue.

Host is discovered successfully and I can see that host in webUI.
Comment 31 Bryan Kearney 2016-07-27 07:06:43 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1501
