Bug 1969908 - Deployment failed: dnsmasq DHCPOFFERing the same address twice
Summary: Deployment failed: dnsmasq DHCPOFFERing the same address twice
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: dnsmasq
Version: 8.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: beta
Target Release: ---
Assignee: Petr Menšík
QA Contact: rhel-cs-infra-services-qe
URL:
Whiteboard:
Depends On: 2028704
Blocks:
 
Reported: 2021-06-09 12:57 UTC by Raviv Bar-Tal
Modified: 2022-01-01 07:26 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-01 07:26:59 UTC
Type: Bug
Target Upstream Version:
Embargoed:
rpittau: needinfo+


Attachments
master-0-1 screenshot (32.63 KB, image/png)
2021-06-09 13:00 UTC, Raviv Bar-Tal
no flags
conductor and inspector logs (1.94 MB, application/gzip)
2021-06-09 13:06 UTC, Raviv Bar-Tal
no flags

Description Raviv Bar-Tal 2021-06-09 12:57:31 UTC
Description of problem:
Our jobs sometimes fail with this error:
"Error: could not inspect: could not inspect node, node is currently 'inspect failed', last error was 'timeout reached while inspecting the node'"

Looking at the setup we can see that two of the masters were successfully deployed,
while the third one (master-0-1) failed.
The failed master's console shows no error, but it looks like it never loaded the IPA.
 

Version-Release number of selected component (if applicable):


How reproducible:
This has a low reproduction rate, so we simply run deployments in a loop.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
Ironic on the bootstrap node should try to inspect the master again and provision it.

Additional info:
Relevant logs and screenshot are attached

Comment 1 Raviv Bar-Tal 2021-06-09 13:00:07 UTC
Created attachment 1789563 [details]
master-0-1 screenshot

Comment 2 Raviv Bar-Tal 2021-06-09 13:06:24 UTC
Created attachment 1789564 [details]
conductor and inspector logs

Comment 3 Dmitry Tantsur 2021-06-15 15:00:31 UTC
I assume it's a virtual environment? Could you please watch the node booting to see what is actually happening (e.g. if it fails to PXE boot and just falls back to the disk)?

Is there anything special about this environment? Can you reproduce the same on other environments?

Comment 4 Raviv Bar-Tal 2021-06-17 06:32:45 UTC
This is a virtual environment; unfortunately it is no longer available.
We use this environment to run our CI jobs, so there is nothing special about it.
I will keep a note to watch the machine booting and see what happens on the next environment we have.

Comment 5 Derek Higgins 2021-06-17 08:23:32 UTC
Also, can you include the httpd and dnsmasq logs next time?
They might help us figure out whether the node attempted to PXE boot.

Comment 9 Raviv Bar-Tal 2021-06-24 12:58:37 UTC
The httpd and dnsmasq logs were attached.

Comment 10 Derek Higgins 2021-06-28 10:50:38 UTC
(In reply to Raviv Bar-Tal from comment #9)
> The httpd and dnsmasq logs were attached.

Thanks. Looking at the httpd logs I can see that two of the nodes downloaded the PXE config as expected:
172.22.0.233 - - [07/Jun/2021:12:09:08 +0000] "GET /dualboot.ipxe HTTP/1.1" 200 741 "-" "iPXE/1.0.0+"
172.22.0.104 - - [07/Jun/2021:12:09:13 +0000] "GET /dualboot.ipxe HTTP/1.1" 200 741 "-" "iPXE/1.0.0+"

But one didn't. The dnsmasq logs show that it did attempt DHCP; it looks to me like dnsmasq replied to DHCPDISCOVERs from two masters with an offer of the same IP:
dnsmasq-dhcp: 3004428367 DHCPDISCOVER(ens4) 52:54:00:2e:bc:9e 
dnsmasq-dhcp: 3004428367 DHCPOFFER(ens4) 172.22.0.104 52:54:00:2e:bc:9e 
dnsmasq-dhcp: 3004428367 DHCPDISCOVER(ens4) 52:54:00:a2:5a:82 
dnsmasq-dhcp: 3004428367 DHCPOFFER(ens4) 172.22.0.104 52:54:00:a2:5a:82 

Then, when each node requested the same IP, one got ACK'd and the other NAK'd:
dnsmasq-dhcp: 3004428367 DHCPREQUEST(ens4) 172.22.0.104 52:54:00:2e:bc:9e 
dnsmasq-dhcp: 3004428367 DHCPACK(ens4) 172.22.0.104 52:54:00:2e:bc:9e 
dnsmasq-dhcp: 3004428367 DHCPREQUEST(ens4) 172.22.0.104 52:54:00:a2:5a:82 
dnsmasq-dhcp: 3004428367 DHCPNAK(ens4) 172.22.0.104 52:54:00:a2:5a:82 address in use

My guess then is that because PXE failed, the node fell back to booting from the hard disk.

I'm surprised dnsmasq would DHCPOFFER the same address twice. I suggest you attach the dnsmasq
version, config, and command line; we can then assign the bug to the dnsmasq component to assess
whether this is a legitimate bug.
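
A quick way to spot this pattern in a larger capture is to group the DHCPOFFER lines by address. Below is a minimal Python sketch, not part of the bug's attachments, that assumes the dnsmasq-dhcp log format shown above; it reads log lines from stdin and reports any address offered to more than one MAC.

#!/usr/bin/env python3
# Minimal sketch: flag IPs that dnsmasq DHCPOFFERed to more than one MAC.
# Assumes the "dnsmasq-dhcp: ... DHCPOFFER(iface) <ip> <mac>" format above.
import re
import sys
from collections import defaultdict

OFFER_RE = re.compile(
    r"DHCPOFFER\(\S+\)\s+(?P<ip>\d+\.\d+\.\d+\.\d+)\s+"
    r"(?P<mac>(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2})"
)

def find_duplicate_offers(lines):
    offers = defaultdict(set)  # ip -> set of MACs it was offered to
    for line in lines:
        m = OFFER_RE.search(line)
        if m:
            offers[m.group("ip")].add(m.group("mac").lower())
    return {ip: macs for ip, macs in offers.items() if len(macs) > 1}

if __name__ == "__main__":
    for ip, macs in find_duplicate_offers(sys.stdin).items():
        print(f"{ip} offered to multiple clients: {', '.join(sorted(macs))}")

Feeding the dnsmasq container log to the script on stdin would print the 172.22.0.104 case from the excerpt above.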

Comment 12 Raviv Bar-Tal 2021-07-01 12:31:23 UTC
Hey, I attached the dnsmasq inspect file, which has the version and CreateCommand in it.
Can you please assign it to dnsmasq? I can't seem to find it.

Comment 13 Derek Higgins 2021-07-01 13:15:58 UTC
(In reply to Raviv Bar-Tal from comment #12)
> Hey, I attached the dnsmasq inspect file, which has the version and
> CreateCommand in it.
> Can you please assign it to dnsmasq? I can't seem to find it.

This isn't the file I was thinking of; you'll have to exec into the dnsmasq container to find it.
In the meantime I'll move this bug to dnsmasq, as there might be enough here to assess it.

This is OpenShift 4.8, so the version of dnsmasq would be dnsmasq-2.79-15.el8.x86_64.
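
One way to gather the version and config is sketched below. It assumes the dnsmasq container on the bootstrap host runs under podman, is named "dnsmasq", and keeps its config at /etc/dnsmasq.conf; all three are assumptions, so adjust them to the actual deployment before use.

#!/usr/bin/env python3
# Sketch: dump dnsmasq version and config from the (assumed) "dnsmasq"
# container via podman exec. Container name and config path are guesses.
import subprocess

CONTAINER = "dnsmasq"          # assumed container name
CONFIG = "/etc/dnsmasq.conf"   # assumed config path inside the container

def podman_exec(*cmd):
    # Run a command inside the container and return its stdout.
    result = subprocess.run(
        ["podman", "exec", CONTAINER, *cmd],
        capture_output=True, text=True, check=False,
    )
    return result.stdout

if __name__ == "__main__":
    print(podman_exec("dnsmasq", "--version"))
    print(podman_exec("cat", CONFIG))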

Comment 17 Petr Menšík 2021-09-17 09:16:00 UTC
Ah, of course. I meant bug #1998448, which deals with a similar issue for IPv6.

Comment 18 Petr Menšík 2021-09-18 21:35:11 UTC
The DHCP RFC [1] recommends not offering an already offered address, but it is not required, so dnsmasq does not violate the RFC. Citation:

While not required for correct operation of DHCP, the server SHOULD
NOT reuse the selected network address before the client responds to
the server's DHCPOFFER message. The server may choose to record the
address as offered to the client.

I think solving this issue properly would mean creating short-term reservations after sending a DHCPOFFER. Such a change would be complex, with possible new regressions. I am not sure exactly what configuration was passed to dnsmasq; I am guessing just from the logs. It would require creating a "soft" lease, which would not be offered again on a DISCOVER but, if no free addresses were available, could still be leased on a DHCPREQUEST. No similar concept exists for either IPv4 or IPv6 now.

1. https://www.rfc-editor.org/rfc/rfc2131#section-4.3.1
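
To make the "soft" lease idea above concrete, here is an illustrative Python sketch; it is not dnsmasq code and not an upstream proposal, and the 60-second hold time is an arbitrary assumption. Offered addresses are held briefly so a later DISCOVER from a different client prefers other free addresses, while a held address can still be handed out if the pool is otherwise exhausted.

import time

OFFER_HOLD_SECONDS = 60  # assumed hold time; dnsmasq has no such setting today

class OfferTracker:
    # Tracks addresses recently sent in a DHCPOFFER ("soft" leases).
    def __init__(self):
        self._offers = {}  # ip -> (mac, expiry timestamp)

    def record_offer(self, ip, mac, now=None):
        if now is None:
            now = time.time()
        self._offers[ip] = (mac, now + OFFER_HOLD_SECONDS)

    def _prune(self, now):
        self._offers = {ip: v for ip, v in self._offers.items() if v[1] > now}

    def pick_address(self, free_addresses, mac, now=None):
        # Prefer addresses not currently soft-held for another client.
        if now is None:
            now = time.time()
        self._prune(now)
        for ip in free_addresses:
            held = self._offers.get(ip)
            if held is None or held[0] == mac:
                return ip
        # Pool exhausted apart from soft-held offers: reuse one anyway,
        # matching the "could still be leased on DHCPREQUEST" fallback.
        return free_addresses[0] if free_addresses else None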

Comment 20 Petr Menšík 2021-12-11 02:32:33 UTC
This is very similar to bug #2028704, detected on OpenStack. Because that bug already has a reproducer and more details, the work will be done there. I think a non-trivial change upstream is required; I have not found a simpler fix so far. It needs to be discussed upstream.

Not closing as a duplicate (yet), because this bug was filed first and is on a different component.

Comment 21 RHEL Program Management 2022-01-01 07:26:59 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

