Bug 1554597

Summary: Provisioning a host with a lease (discovered host) leads to Unable to set DHCP entry (409 Conflict)
Product: Red Hat Satellite
Reporter: Sandeep MJ <sjayapra>
Component: Provisioning
Assignee: Lukas Zapletal <lzap>
Status: CLOSED UPSTREAM
QA Contact: Roman Plevka <rplevka>
Severity: high
Docs Contact:
Priority: high
Version: 6.3.0
CC: ajambhul, aperotti, bbuckingham, bkearney, fgarciad, gapatil, hprakash, hshukla, inecas, jalviso, ktordeur, logank, lzap, mhulan, mina.asaad, mmccune, rchauhan, vgunasek, vijsingh, vvasilev
Target Milestone: Unspecified
Keywords: Regression, Triaged
Target Release: Unused
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-03-08 13:32:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Sandeep MJ 2018-03-13 01:06:11 UTC
Description of problem:

After upgrading to Satellite 6.3, rebuilding a host with a different subnet fails with the following DHCP error in the UI:

~~~
Error: Failed to enable satellite.example.com for installation: ["Create DHCP Settings for satellite.example.com task failed with the following error: ERF12-6899 [ProxyAPI::ProxyException]: Unable to set DHCP entry ([RestClient::Conflict]: 409 Conflict) for Capsule https://capsule.example.com:9090/dhcp", "Failed to perform rollback on Remove DHCP Settings for satellite.example.com - ERF12-6899 [ProxyAPI::ProxyException]: Unable to set DHCP entry ([RestClient::Conflict]: 409 Conflict) for Capsule https://capsule.example.com:9090/dhcp"]
~~~

Version-Release number of selected component (if applicable):
Satellite 6.3
foreman-1.15.6.34-1.el7sat.noarch

How reproducible:
Every time

Steps to Reproduce:
1. Change the subnet of an existing built host and try to rebuild it.
2. The overwrite button fails to overwrite the existing record in the DHCP lease, and the rebuild fails with the error posted above.


Actual results:
Errors as specified above

Expected results:
No errors

Additional info:
Appears to be related to http://projects.theforeman.org/issues/19634 which was fixed in foreman-1.15.5

When the server info is cleaned from these 2 files - /var/lib/dhcpd/dhcpd.leases and /var/lib/dhcpd/dhcpd.leases, build mode is ok.

Comment 2 Lukas Zapletal 2018-03-22 13:34:53 UTC
I can confirm the analysis is correct, good work. The issue

http://projects.theforeman.org/issues/19634

should be the root cause of the problem. However, it was merged in 1.17 rather than 1.15 because of a long review. There are TWO CHANGES associated with this problem:

https://github.com/theforeman/foreman/pull/4555/files
https://github.com/theforeman/smart-proxy/pull/532/files

We need to backport both. They are reasonable changes for the 6.3 z-stream and I consider this an important bug to fix in 6.3. Affected: all customers.

WORKAROUND: Delete host, create new one

Comment 7 Ivan Necas 2018-06-08 10:37:53 UTC
Connecting redmine issue http://projects.theforeman.org/issues/19634 from this bug

Comment 8 Satellite Program 2018-06-08 12:19:03 UTC
Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/19634 has been resolved.

Comment 9 Lukas Zapletal 2018-06-11 13:26:53 UTC
REL-ENG: Note there is a smart-proxy patch needed for the core part to work:

https://github.com/theforeman/smart-proxy/pull/532

Comment 14 Logan Kuhn 2018-08-13 13:48:58 UTC
Any chance I could get the RPMs with the fix once they are available?  I'm testing Satellite for deployment at my company and this is a rather annoying bug to deal with every time I rebuild a system.

Support case: 02153260

Comment 15 Lukas Zapletal 2018-08-14 06:25:30 UTC
Hello, we are currently increasing the priority of this ticket and we will be able to create a hotfix once it's aligned with a z-stream release.

Comment 35 Lukas Zapletal 2019-03-07 08:29:51 UTC
Thanks

After another round of investigation, I've found that this particular BZ is fixed for both 6.4.0 (GA) and 6.3.0 (GA), as we rebased on Foreman 1.15.6.2 before the release of 6.3.0. Therefore, if you see 409 DHCP conflicts, it is likely for a different reason. Please investigate and provide more details on how to reproduce it, or provide an instance with a reproducer.

Comment 37 Bryan Kearney 2019-03-07 09:01:09 UTC
Moving this bug to POST for triage into Satellite 6 since the upstream issue https://projects.theforeman.org/issues/19634 has been resolved.

Comment 38 Logan Kuhn 2019-03-07 12:45:47 UTC
Steps to reproduce:
1. Create Host in Satellite
2. Fill out all the relevant fields
3. Build the host
4. Install it via PXE (probably not required, but this is what I've been doing)
5. Once it's built
6. Click Build again in Satellite on that host.
7. DHCP error 409 pops up every time.
8. To fix it, stop dhcpd, remove the host's entry in /var/lib/dhcpd/dhcpd.leases, and restart dhcpd. Then you can rebuild the host.
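
The manual cleanup in step 8 can be sketched as a small filter over the lease file. This is only a minimal sketch, not a supported tool: `drop_leases_for_mac` and `LEASE_BLOCK` are hypothetical names, and it assumes the usual ISC dhcpd layout in which each `lease <ip> { ... }` block closes with `}` at the start of a line and contains no nested braces. It only transforms text; dhcpd still has to be stopped before the file is edited and restarted afterwards.

```python
import re

# Assumption: one "lease <ip> { ... }" block per entry, closed by "}" in column 0.
LEASE_BLOCK = re.compile(r"^lease\s+(\S+)\s*\{.*?^\}\n?", re.DOTALL | re.MULTILINE)


def drop_leases_for_mac(leases_text: str, mac: str) -> str:
    """Return leases_text without any lease block bound to the given MAC."""
    needle = f"hardware ethernet {mac.lower()};"

    def keep_or_drop(match: re.Match) -> str:
        block = match.group(0)
        # Drop the whole block if it names the MAC, otherwise keep it untouched.
        return "" if needle in block.lower() else block

    return LEASE_BLOCK.sub(keep_or_drop, leases_text)
```

With dhcpd stopped, one would read /var/lib/dhcpd/dhcpd.leases, pass its contents and the host's MAC address to the function, and write the result back before restarting dhcpd. Keeping a backup copy of the original file is advisable.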

Error: Failed to enable logank-test1.wolfram.com for installation: ["Create DHCP Settings for logank-test1.wolfram.com task failed with the following error: ERF12-6899 [ProxyAPI::ProxyException]: Unable to set DHCP entry ([RestClient::Conflict]: 409 Conflict) for Capsule https://satellite-tst2.wolfram.com:9090/dhcp", "Failed to perform rollback on Remove DHCP Settings for logank-test1.wolfram.com - ERF12-6899 [ProxyAPI::ProxyException]: Unable to set DHCP entry ([RestClient::Conflict]: 409 Conflict) for Capsule https://satellite-tst2.wolfram.com:9090/dhcp"]

Let me know what other information you want me to provide... this is the same information that I provided initially though on my support ticket (02153260).

Comment 40 Logan Kuhn 2019-03-07 17:14:59 UTC
In case it helps, these are the versions we're using:
rpm -qa | grep -i satellite
satellite-tst2.wolfram.com-foreman-proxy-client-1.0-1.noarch
satellite-6.3.5-1.el7sat.noarch
tfm-rubygem-foreman_theme_satellite-1.0.4.19-1.el7sat.noarch
satellite-installer-6.3.0.12-1.el7sat.noarch
satellite-tst2.wolfram.com-qpid-router-client-1.0-1.noarch
satellite-tst2.wolfram.com-tomcat-1.0-1.noarch
satellite-tst2.wolfram.com-qpid-broker-1.0-2.noarch
satellite-tst2.wolfram.com-foreman-proxy-1.0-1.noarch
satellite-tst2.wolfram.com-puppet-client-1.0-1.noarch
satellite-tst2.wolfram.com-qpid-client-cert-1.0-1.noarch
satellite-common-6.3.5-1.el7sat.noarch
satellite-tst2.wolfram.com-apache-1.0-1.noarch
satellite-cli-6.3.5-1.el7sat.noarch
satellite-tst2.wolfram.com-foreman-client-1.0-1.noarch
satellite-tst2.wolfram.com-qpid-router-server-1.0-1.noarch

Comment 41 Lukas Zapletal 2019-03-08 13:30:38 UTC
Logan, thanks for the detailed repro steps and helpful RPM list. However, I have just performed these steps and it works just fine. There must be something different in the configuration on your end. I tested with:

rpm -q satellite
satellite-6.3.5-1.el7sat.noarch

For the record here is the proxy log from the moment when I hit Build button again:

I, [2019-03-08T13:17:19.803213 ]  INFO -- : 127.0.0.1 - - [08/Mar/2019:13:17:19 +0000] "GET /tftp/serverName HTTP/1.1" 200 30 0.0008
I, [2019-03-08T13:17:19.862013 ]  INFO -- : 127.0.0.1 - - [08/Mar/2019:13:17:19 +0000] "GET /dhcp/192.168.199.0/mac/52:54:00:60:60:01 HTTP/1.1" 200 247 0.0007
I, [2019-03-08T13:17:19.922321 ]  INFO -- : 127.0.0.1 - - [08/Mar/2019:13:17:19 +0000] "GET /dhcp/192.168.199.0/ip/192.168.199.154 HTTP/1.1" 200 249 0.0010
I, [2019-03-08T13:17:19.979843 ]  INFO -- : 127.0.0.1 - - [08/Mar/2019:13:17:19 +0000] "GET /dhcp/192.168.199.0/mac/52:54:00:60:60:01 HTTP/1.1" 200 247 0.0010
I, [2019-03-08T13:17:20.221370 ]  INFO -- : 127.0.0.1 - - [08/Mar/2019:13:17:20 +0000] "GET /unattended/templateServer HTTP/1.1" 200 46 0.0003
I, [2019-03-08T13:17:20.289092 ]  INFO -- : 127.0.0.1 - - [08/Mar/2019:13:17:20 +0000] "POST /tftp/PXELinux/52:54:00:60:60:01 HTTP/1.1" 200 - 0.0016
I, [2019-03-08T13:17:20.327690 ]  INFO -- : 127.0.0.1 - - [08/Mar/2019:13:17:20 +0000] "POST /tftp/fetch_boot_file HTTP/1.1" 200 - 0.0018
I, [2019-03-08T13:17:20.349147 ]  INFO -- : 127.0.0.1 - - [08/Mar/2019:13:17:20 +0000] "POST /tftp/fetch_boot_file HTTP/1.1" 200 - 0.0023
E, [2019-03-08T13:18:00.993864 ] ERROR -- : Attempt to remove nonexistent client certificate for boyd-boteilho.nat.lan
I, [2019-03-08T13:18:00.994337 ]  INFO -- : 127.0.0.1 - - [08/Mar/2019:13:18:00 +0000] "DELETE /puppet/ca/boyd-boteilho.nat.lan HTTP/1.1" 404 74 1.2325
I, [2019-03-08T13:18:01.055048 ]  INFO -- : 127.0.0.1 - - [08/Mar/2019:13:18:01 +0000] "POST /puppet/ca/autosign/boyd-boteilho.nat.lan HTTP/1.1" 200 - 0.0009
I, [2019-03-08T13:18:01.777388 ]  INFO -- : 192.168.199.154 - - [08/Mar/2019:13:18:01 +0000] "GET /unattended/provision?token=d4142f85-1ff4-408d-b4fd-ae6493cb2fc1 HTTP/1.1" 200 4748 2.0773

Here is the Satellite log:

2019-03-08 13:17:19 19a447a8 [app] [I] Started PUT "/hosts/boyd-boteilho.nat.lan/setBuild?auth_object=boyd-boteilho.nat.lan&permission=build_hosts" for 192.168.199.1 at 2019-03-08 13:17:19 +0000
2019-03-08 13:17:19 19a447a8 [app] [I] Processing by HostsController#setBuild as HTML
2019-03-08 13:17:19 19a447a8 [app] [I]   Parameters: {"utf8"=>"✓", "authenticity_token"=>"xxx", "commit"=>"Build", "auth_object"=>"boyd-boteilho.nat.lan", "permission"=>"build_hosts", "id"=>"boyd-boteilho.nat.lan"}
2019-03-08 13:17:19 19a447a8 [app] [I] Current user: admin (administrator)
2019-03-08 13:17:19 19a447a8 [app] [I] Expire fragment views/tabs_and_title_records-3 (0.1ms)
2019-03-08 13:17:19 19a447a8 [app] [I] Fetching DHCP reservation boyd-boteilho.nat.lan for boyd-boteilho.nat.lan-52:54:00:60:60:01/192.168.199.154
2019-03-08 13:17:20 19a447a8 [templates] [I] Rendering template 'Kickstart default PXELinux'
2019-03-08 13:17:20 19a447a8 [app] [I] Redirected to https://sat63.nat.lan/hosts/boyd-boteilho.nat.lan
2019-03-08 13:17:20 19a447a8 [app] [I] Completed 302 Found in 778ms (ActiveRecord: 79.1ms)

As you can see, in my case the DHCP record is (correctly) not being orchestrated at all. When a user hits the Build button, there is no need to rebuild DHCP because it's expected that the host can keep its reserved IP address. Therefore, it only checks whether the reservation is present (Fetching DHCP reservation) and carries on.

Now, in my case I have provisioned my host in a subnet managed by Satellite, therefore it was assigned and reserved an IP address: 192.168.199.154. This IP address hasn't changed, so everything works fine. Note that since the server is using a reserved IP, there is no need for a lease, therefore no lease exists that could cause a conflict.

However, there might be issues when the IP or MAC address changes over time, either via Satellite or manually on that host. Let me know if you have performed anything like that.

Comment 42 Lukas Zapletal 2019-03-08 13:32:49 UTC
That said, the scenario you have described is not this bug. This one tracks problems when a host is discovered, a lease is created, and the host is then provisioned. It was confirmed, fixed and verified. Let's work together in this BZ to identify the problem, and we can create a new BZ if needed.

Comment 43 Logan Kuhn 2019-03-08 13:37:45 UTC
Right, I mean if you have a dhcp reservation set you wouldn't expect there to be a lease to conflict, but if you remove that reservation and it's assigned a lease in the normal fashion of a dhcp request, are you able to replicate the issue?  Our setup does not have anything specific configured with DHCP, when it PXE boots, it requests a lease from the server.  Once the server finishes installing, if I try to rebuild it, the lease is still there and Satellite can't delete it, producing the 409 error.

I understand, and I'm 100% fine moving this discussion to a new BZ if that is preferable. I just need this fixed so that I can put this server in production.

Comment 44 Lukas Zapletal 2019-03-13 14:46:27 UTC
So before I try, let me sum it up:

- install Satellite 6.3
- set up provisioning for PXE
- make sure that the default lease time is longer than Anaconda installation time (lease time defaults to 10 minutes so I'd probably need to extend that to 2 hours)
- PXE install a host
- once it completes, ensure the lease for the MAC address is still valid
- click on build button
- Satellite throws a DHCP conflict

Once you confirm, I will try to reproduce. Please review and provide any other details, because I am pretty sure this is a normal workflow which is tested by the Satellite QA department for each release, and it should work. A longer lease time might be a deviation from our standard configuration, though.
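
The "ensure the lease for the MAC address is still valid" step in the summary above could be checked with something like the following. It is a sketch under assumptions: it expects the default dhcpd.leases time format (`ends <weekday> YYYY/MM/DD HH:MM:SS;` in UTC), skips entries such as `ends never;`, and treats the last entry written for an IP as the current one. `active_leases_for_mac` is a hypothetical helper name.

```python
import re
from datetime import datetime, timezone
from typing import Dict, List

# Assumption: "lease <ip> { ... }" blocks closed by "}" in column 0, no nested braces.
LEASE_BLOCK = re.compile(r"^lease\s+(\S+)\s*\{(.*?)^\}", re.DOTALL | re.MULTILINE)
ENDS = re.compile(r"\bends\s+\d\s+(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2});")


def active_leases_for_mac(leases_text: str, mac: str, now: datetime) -> List[str]:
    """Return IPs whose most recent lease for this MAC ends after 'now' (UTC)."""
    needle = f"hardware ethernet {mac.lower()};"
    latest_end: Dict[str, datetime] = {}
    for ip, body in LEASE_BLOCK.findall(leases_text):
        if needle not in body.lower():
            continue
        m = ENDS.search(body)
        if m is None:
            continue  # e.g. "ends never;" is not handled in this sketch
        ends = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S")
        # Later entries in the file override earlier ones for the same IP.
        latest_end[ip] = ends.replace(tzinfo=timezone.utc)
    return [ip for ip, ends in latest_end.items() if ends > now]
```

Calling it with the contents of the lease file, the host's MAC address and `datetime.now(timezone.utc)` right before clicking Build would show whether a lease is still active at that moment.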

Comment 45 Logan Kuhn 2019-03-13 15:56:30 UTC
I'll be honest, I'm surprised it's a bug that exists, but here we are... I'll be thrilled to find out it's a configuration problem, but there really hasn't been a lot of configuration done on the server.

Your summary is correct. My DHCP lease time is set to what I think is the default: "default-lease-time 43200"

If need be, I can do a screen recording of what I do and upload that as well.

Comment 46 Lukas Zapletal 2019-03-15 07:22:57 UTC
OK, I am reproducing now. Let's take the conversation back to the support case, since this BZ does not track that problem.