Red Hat Bugzilla – Bug 1262950
Race condition with NetworkManager on discovery image
Last modified: 2016-07-27 07:06:43 EDT
Created attachment 1073364 [details]
screenshot of discovery
Description of problem:
Essentially the same as https://bugzilla.redhat.com/show_bug.cgi?id=1227017 but that was specific to the OSP installer.
This is in a lab within Red Hat. Note that this is on some older gear. I don't see the issue in another lab w/ more modern kit.
It seems that DHCP is still coming up when the discoveryd process starts, so the process fails to resolve the foreman host.
If I send SIGHUP (kill -HUP) to the discovery process from another terminal, the host properly resolves the foreman IP, reports in and is discovered.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. configure Satellite or capsule for discovery, with fdi.rootpw=PASSWD
2. boot host over PXE
3. host gets DHCP lease and gets TFTP file
4. host fails to resolve foreman IP; must kill -HUP the discovery process in a separate terminal
5. host checks in to foreman for discovery
this is in a lab within Red Hat and I can share some access/logs as needed.
Created attachment 1080015 [details]
journal from discovery boot
This is a dump of journalctl from the host once booted in discovery mode. You can see the first couple of times, the host fails to send to foreman due to DNS issues.
After sending kill -HUP to the discovery process, you can see it register properly.
Discovery is starting before DHCP is fully up, and can't resolve the foreman URL at that time. Eventually the host is on the network and can resolve foreman, but the discovery process never seems to learn that the host is resolvable until after restarting the discovery daemon.
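To illustrate the race described above, here is a minimal sketch of a "wait until the name resolves" loop. It is not the discovery image's actual code; the function name, the retry parameters and the hostname in the usage note are all assumptions made for illustration.

```python
import socket
import time


def wait_until_resolvable(hostname, attempts=30, delay=2.0,
                          resolve=socket.getaddrinfo, sleep=time.sleep):
    """Return True once hostname resolves, False if every attempt fails.

    Hypothetical helper: resolve and sleep are injectable so the loop
    can be exercised without a real network.
    """
    for attempt in range(1, attempts + 1):
        try:
            # A fresh lookup on every pass, so a DHCP lease that arrives
            # late is picked up without restarting the process.
            resolve(hostname, None)
            return True
        except socket.gaierror:
            # DNS is not usable yet; pause before the next attempt,
            # but do not sleep after the final failure.
            if attempt < attempts:
                sleep(delay)
    return False
```

A caller could run something like `wait_until_resolvable("foreman.example.com")` (a placeholder hostname) before sending the first registration request, instead of failing once at startup and never retrying.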
I can confirm we encountered this kind of behaviour and it has already been fixed upstream. We are planning a discovery erratum in one or two months that will rebase the image and include this fix as well.
I can make you another build if you want and upload it for you. Just let me know on IRC.
With the latest build, the discovery-register service doesn't start. Attaching the latest journalctl output.
Created attachment 1104757 [details]
journal output from foreman-discovery-image-2.1.1-1.el7sat.noarch.rpm
Yes, Brad, there is a patch pending we need to include.
David, does the above link from comment 22 work?
Anyway, the 6.1.5 errata is out and it contains a completely rebased image. It is no longer compatible with OSP, but this bug was filed against Satellite 6, so use it.
We are tracking one additional race condition whose fix hasn't been merged upstream yet. I am attaching it to this BZ; the symptoms are similar (this time foreman-proxy is not started properly): http://projects.theforeman.org/issues/12429
Moving to POST since upstream bug http://projects.theforeman.org/issues/12429 has been closed
Solution (workaround?) is to start foreman-proxy after NetworkManager-wait-online.service is ready.
More in https://github.com/theforeman/foreman-discovery-image/pull/48
My fault. I forgot to replace both lines (Wants= and After=) in foreman-proxy.service (https://github.com/lzap/foreman-discovery-image/commit/c72e80902b4cd34c4b8369f3ec118b8ef7ac9bf6). Once I did it, provisioning works as expected.
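For readers following along, the two directives mentioned above could look roughly like this in a systemd drop-in. This is a sketch only: the drop-in path is a hypothetical example, and the linked commit changes the unit in the image build, so the exact lines there may differ.

```ini
# Hypothetical drop-in, e.g.
# /etc/systemd/system/foreman-proxy.service.d/wait-online.conf
[Unit]
# Pull in the wait-online unit when foreman-proxy is started...
Wants=NetworkManager-wait-online.service
# ...and do not start foreman-proxy until that unit has finished.
After=NetworkManager-wait-online.service
```

Both lines are needed: Wants= alone only pulls the unit in without ordering, and After= alone only orders the units if something else has already pulled NetworkManager-wait-online.service in.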
Great, can you please confirm in the PR itself that the build I made works as expected? Or at least show us the patch you made on your own build. Thanks. It's the https://github.com/theforeman/foreman-discovery-image/pull/50
Applied in changeset commit:foreman-discovery-image|0c18ba2a6d04e5105db1e2085fe69f091b6922c7.
@Lukas, please provide verification steps. I assume the host should have multiple interfaces to reproduce this? Please advise.
Simply verify if discovery works with one or multiple NICs in various environments.
Also, if possible, simulate slow DHCP and verify that discovery starts correctly as well. You can easily simulate this by turning off the DHCP server on the network, waiting until the Welcome screen appears, and then turning it back on. The background process should then send the discovery request, and after a few seconds you should be able to refresh the screen. The status will likely be "UNKNOWN - Use Refresh button to update info"; this is expected.
Verified with sat6.2 beta snap8.2
I discovered a host with two NICs and tried to simulate the slow DHCP as suggested in comment 29. However, I'm not able to reproduce the reported issue.
Host is discovered successfully and I can see that host in webUI.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.