Bug 2028704 - Even with an allocation range large enough, some introspected nodes fail to get IP address from DHCP on first boot
Summary: Even with an allocation range large enough, some introspected nodes fail to g...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: dnsmasq
Version: 16.2 (Train)
Hardware: x86_64
OS: All
urgent
urgent
Target Milestone: ---
: ---
Assignee: Petr Menšík
QA Contact: Ofer Blaut
URL:
Whiteboard:
Depends On: 1998448
Blocks: 1969908
TreeView+ depends on / blocked
 
Reported: 2021-12-03 00:59 UTC by David Hill
Modified: 2023-05-04 18:32 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-04-17 14:51:59 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Prints a "selected addr" when selecting an IP. (2.21 KB, patch)
2021-12-04 14:14 UTC, David Hill
no flags Details | Diff
dnsmasq 2.86 additional change (3.23 KB, patch)
2021-12-08 00:32 UTC, Petr Menšík
no flags Details | Diff
dnsmasq 2.86 ping optimization (3.81 KB, patch)
2021-12-08 00:33 UTC, Petr Menšík
no flags Details | Diff
basic reproducer script (921 bytes, application/x-shellscript)
2021-12-09 00:38 UTC, Petr Menšík
no flags Details
dnsmasq.conf being used (1.21 KB, text/plain)
2021-12-09 01:47 UTC, David Hill
no flags Details
pcap (314.33 KB, application/vnd.tcpdump.pcap)
2021-12-09 13:47 UTC, David Hill
no flags Details
dnsmasq.log (1.06 MB, text/plain)
2021-12-09 13:49 UTC, David Hill
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-11162 0 None None None 2021-12-03 00:59:50 UTC
Red Hat Knowledge Base (Solution) 6594761 0 None None None 2021-12-16 15:48:46 UTC

Description David Hill 2021-12-03 00:59:33 UTC
Description of problem:
Even with an allocation range large enough, some introspected nodes fail to get IP address from DHCP on first boot as "no address available" is seen in the logs.    It's as if dnsmasq-dhcp would've allocated all the available IPs or there was a race condition where the IP it selected for this DHCPDISCOVER would've been allocated to another host.   This happens only when a bunch of nodes are introspected concurrently and decreases with the amount of introspected nodes.   I'm not sure if concurrent deployment of overcloud nodes would also trigger this issue as we do limit concurrent deployment to 10 in nova (if I'm not mistaken).


[root@undercloud-0-rhosp16 ironic-inspector]# cat dnsmasq.log  | grep 44:c6
Dec  2 18:11:33 dnsmasq[8]: inotify, new or changed file /var/lib/ironic-inspector/dhcp-hostsdir/52:54:00:32:44:c6
Dec  2 18:11:33 dnsmasq-dhcp[8]: read /var/lib/ironic-inspector/dhcp-hostsdir/52:54:00:32:44:c6
Dec  2 18:14:52 dnsmasq[8]: inotify, new or changed file /var/lib/ironic-inspector/dhcp-hostsdir/52:54:00:32:44:c6
Dec  2 18:14:52 dnsmasq-dhcp[8]: read /var/lib/ironic-inspector/dhcp-hostsdir/52:54:00:32:44:c6
Dec  2 18:16:00 dnsmasq-dhcp[8]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 no address available
Dec  2 18:16:00 dnsmasq-dhcp[8]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 no address available
Dec  2 18:16:09 dnsmasq-dhcp[8]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 no address available
Dec  2 18:16:31 dnsmasq-dhcp[8]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 no address available
Dec  2 18:35:08 dnsmasq[8]: inotify, new or changed file /var/lib/ironic-inspector/dhcp-hostsdir/52:54:00:32:44:c6
Dec  2 18:35:08 dnsmasq-dhcp[8]: read /var/lib/ironic-inspector/dhcp-hostsdir/52:54:00:32:44:c6
Dec  2 18:35:09 dnsmasq[8]: inotify, new or changed file /var/lib/ironic-inspector/dhcp-hostsdir/52:54:00:32:44:c6
Dec  2 18:35:09 dnsmasq-dhcp[8]: read /var/lib/ironic-inspector/dhcp-hostsdir/52:54:00:32:44:c6
Dec  2 18:36:03 dnsmasq-dhcp[8]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 
Dec  2 18:36:03 dnsmasq-dhcp[8]: DHCPOFFER(br-ctlplane) 192.0.2.104 52:54:00:32:44:c6 
Dec  2 18:36:03 dnsmasq-dhcp[8]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 
Dec  2 18:36:03 dnsmasq-dhcp[8]: DHCPOFFER(br-ctlplane) 192.0.2.104 52:54:00:32:44:c6 
Dec  2 18:36:11 dnsmasq-dhcp[8]: DHCPREQUEST(br-ctlplane) 192.0.2.104 52:54:00:32:44:c6 
Dec  2 18:36:11 dnsmasq-dhcp[8]: DHCPACK(br-ctlplane) 192.0.2.104 52:54:00:32:44:c6 
Dec  2 18:36:39 dnsmasq-dhcp[8]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 
Dec  2 18:36:39 dnsmasq-dhcp[8]: DHCPOFFER(br-ctlplane) 192.0.2.104 52:54:00:32:44:c6 
Dec  2 18:36:39 dnsmasq-dhcp[8]: DHCPREQUEST(br-ctlplane) 192.0.2.104 52:54:00:32:44:c6 
Dec  2 18:36:39 dnsmasq-dhcp[8]: DHCPACK(br-ctlplane) 192.0.2.104 52:54:00:32:44:c6 
Dec  2 18:38:48 dnsmasq[8]: inotify, new or changed file /var/lib/ironic-inspector/dhcp-hostsdir/52:54:00:32:44:c6
Dec  2 18:38:48 dnsmasq-dhcp[8]: read /var/lib/ironic-inspector/dhcp-hostsdir/52:54:00:32:44:c6
Dec  2 18:40:30 dnsmasq-dhcp[8]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 ignored
Dec  2 18:40:35 dnsmasq-dhcp[8]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 ignored
Dec  2 18:41:34 dnsmasq-dhcp[8]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 ignored
Dec  2 19:05:56 dnsmasq-dhcp[6]: read /var/lib/ironic-inspector/dhcp-hostsdir/52:54:00:32:44:c6
Dec  2 19:44:01 dnsmasq[6]: inotify, new or changed file /var/lib/ironic-inspector/dhcp-hostsdir/52:54:00:32:44:c6
Dec  2 19:44:01 dnsmasq-dhcp[6]: read /var/lib/ironic-inspector/dhcp-hostsdir/52:54:00:32:44:c6
Dec  2 19:45:43 dnsmasq-dhcp[6]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 no address available
Dec  2 19:45:43 dnsmasq-dhcp[6]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 no address available
Dec  2 19:45:50 dnsmasq-dhcp[6]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 no address available
Dec  2 19:46:14 dnsmasq-dhcp[6]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 no address available
Dec  2 19:53:25 dnsmasq-dhcp[6]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 
Dec  2 19:53:25 dnsmasq-dhcp[6]: DHCPOFFER(br-ctlplane) 192.0.2.110 52:54:00:32:44:c6 
Dec  2 19:53:26 dnsmasq-dhcp[6]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 
Dec  2 19:53:26 dnsmasq-dhcp[6]: DHCPOFFER(br-ctlplane) 192.0.2.110 52:54:00:32:44:c6 
Dec  2 19:53:36 dnsmasq-dhcp[6]: DHCPREQUEST(br-ctlplane) 192.0.2.110 52:54:00:32:44:c6 
Dec  2 19:53:36 dnsmasq-dhcp[6]: DHCPACK(br-ctlplane) 192.0.2.110 52:54:00:32:44:c6 
Dec  2 19:53:36 dnsmasq-dhcp[6]: DHCPREQUEST(br-ctlplane) 192.0.2.110 52:54:00:32:44:c6 
Dec  2 19:53:36 dnsmasq-dhcp[6]: DHCPACK(br-ctlplane) 192.0.2.110 52:54:00:32:44:c6 
Dec  2 19:53:36 dnsmasq-dhcp[6]: DHCPREQUEST(br-ctlplane) 192.0.2.110 52:54:00:32:44:c6 
Dec  2 19:53:36 dnsmasq-dhcp[6]: DHCPACK(br-ctlplane) 192.0.2.110 52:54:00:32:44:c6 
Dec  2 19:53:36 dnsmasq-dhcp[6]: DHCPREQUEST(br-ctlplane) 192.0.2.110 52:54:00:32:44:c6 
Dec  2 19:53:36 dnsmasq-dhcp[6]: DHCPACK(br-ctlplane) 192.0.2.110 52:54:00:32:44:c6 
Dec  2 19:53:58 dnsmasq-dhcp[6]: DHCPDISCOVER(br-ctlplane) 52:54:00:32:44:c6 
Dec  2 19:53:58 dnsmasq-dhcp[6]: DHCPOFFER(br-ctlplane) 192.0.2.110 52:54:00:32:44:c6 
Dec  2 19:53:58 dnsmasq-dhcp[6]: DHCPREQUEST(br-ctlplane) 192.0.2.110 52:54:00:32:44:c6 
Dec  2 19:53:58 dnsmasq-dhcp[6]: DHCPACK(br-ctlplane) 192.0.2.110 52:54:00:32:44:c6 


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 David Hill 2021-12-03 01:43:08 UTC
I've configured a lease file and we can see the allocation pool is really underused and yet we get the "no address available" before we even reach half the available IP range.

Comment 12 Petr Menšík 2021-12-03 16:15:45 UTC
It seems address_allocate() does not change state except do_icmp_ping() function. Retries should receive exactly the same results. But icmp pings might return different result next time.

I could not find anything obvious in address_allocate(), why it should not assign any next free address. Unless there is not really any free address available. I am not sure dnsmasq can report such situation well. Or make it obvious why did that happen. I think printing number of still available addresses would help to find a reason for these failures. It is possible later retries are successful, because some previous leases already expired.

Comment 13 David Hill 2021-12-03 16:34:28 UTC
With my patch it gives this:

Dec  3 11:29:16 dnsmasq-dhcp[7]: 2827111017 sent size:  4 option:  3 router  192.0.2.1
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 available DHCP range: 192.0.2.80 -- 192.0.2.120
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 DHCPDISCOVER(br-ctlplane) 52:54:00:c6:ca:60 
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 tags: ctlplane-subnet, known, br-ctlplane
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 DHCPOFFER(br-ctlplane) 192.0.2.86 52:54:00:c6:ca:60 
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 requested options: 1:netmask, 28:broadcast, 2:time-offset, 121:classless-static-route, 
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 requested options: 15:domain-name, 6:dns-server, 12:hostname, 
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 requested options: 40:nis-domain, 41:nis-server, 42:ntp-server, 
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 requested options: 26:mtu, 119:domain-search, 3:router, 121:classless-static-route, 
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 requested options: 249, 33:static-route, 252, 42:ntp-server, 
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 requested options: 17:root-path
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 bootfile name: undionly.kpxe
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 server name: localhost.localdomain
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 next server: 192.0.2.1
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 sent size:  1 option: 53 message-type  2
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 sent size:  4 option: 54 server-identifier  192.0.2.1
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 sent size:  4 option: 51 lease-time  10m
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 sent size:  4 option: 58 T1  5m
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 sent size:  4 option: 59 T2  8m45s
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 sent size:  4 option:  1 netmask  255.255.255.0
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 sent size:  4 option: 28 broadcast  192.0.2.255
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 sent size:  2 option: 26 mtu  1500
Dec  3 11:29:16 dnsmasq-dhcp[7]: 934218282 sent size:  4 option:  3 router  192.0.2.1
Dec  3 11:29:16 dnsmasq-dhcp[7]: 3750542452 available DHCP range: 192.0.2.80 -- 192.0.2.120
Dec  3 11:29:16 dnsmasq-dhcp[7]: retry 0/10
Dec  3 11:29:19 dnsmasq-dhcp[7]: retry 1/10
Dec  3 11:29:22 dnsmasq-dhcp[7]: retry 2/10
Dec  3 11:29:22 dnsmasq-dhcp[7]: retry 3/10
Dec  3 11:29:25 dnsmasq-dhcp[7]: retry 4/10
Dec  3 11:29:28 dnsmasq-dhcp[7]: retry 5/10
Dec  3 11:29:31 dnsmasq-dhcp[7]: retry 6/10
Dec  3 11:29:34 dnsmasq-dhcp[7]: retry 7/10
Dec  3 11:29:37 dnsmasq-dhcp[7]: retry 8/10
Dec  3 11:29:40 dnsmasq-dhcp[7]: retry 9/10
Dec  3 11:29:43 dnsmasq-dhcp[7]: retry 10/10
Dec  3 11:29:46 dnsmasq-dhcp[7]: 3750542452 DHCPDISCOVER(br-ctlplane) 52:54:00:7b:d3:de no address available 3

But there was still 30 available IPs at that time.

Comment 14 David Hill 2021-12-03 17:50:34 UTC
Dec  3 12:49:56 dnsmasq-dhcp[8]: 3962316867 available DHCP range: 192.0.2.80 -- 192.0.2.120
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1459749056
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1476526272
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1493303488
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1510080704
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1526857920
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1543635136
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1560412352
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1577189568
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1593966784
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1610744000
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1627521216
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1644298432
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1661075648
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1677852864
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1694630080
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1711407296
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1728184512
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1744961728
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1761738944
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1778516160
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1795293376
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1812070592
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1828847808
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1845625024
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1862402240
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1879179456
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1895956672
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1912733888
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1929511104
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1946288320
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1963065536
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1979842752
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1996619968
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 2013397184
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1342308544
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1359085760
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1375862976
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1392640192
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1409417408
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1426194624
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1442971840
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG END addr: 1442971840
Dec  3 12:49:56 dnsmasq-dhcp[8]: 3962316867 DHCPDISCOVER(br-ctlplane) 52:54:00:a9:a4:35 no address available

Comment 15 David Hill 2021-12-03 17:55:29 UTC
At that time we had 15 out of 41 available IPs allocated and yet:

Dec  3 12:49:51 dnsmasq-dhcp[8]: 3697435191 DHCPDISCOVER(br-ctlplane) 52:54:00:c8:4f:7a 
Dec  3 12:49:51 dnsmasq-dhcp[8]: DEBUG addr: 1442971840
Dec  3 12:49:51 dnsmasq-dhcp[8]: DEBUG addr: 1459749056
Dec  3 12:49:51 dnsmasq-dhcp[8]: DEBUG addr: 1476526272
Dec  3 12:49:54 dnsmasq-dhcp[8]: 3438233111 DHCPDISCOVER(br-ctlplane) 52:54:00:b7:62:8e 
Dec  3 12:49:55 dnsmasq-dhcp[8]: DEBUG addr: 1459749056
Dec  3 12:49:55 dnsmasq-dhcp[8]: DEBUG addr: 1476526272
Dec  3 12:49:55 dnsmasq-dhcp[8]: 3438233111 DHCPDISCOVER(br-ctlplane) 52:54:00:b7:62:8e 
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1459749056
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1476526272
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1493303488
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1510080704
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1526857920
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1543635136
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1560412352
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1577189568
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1593966784
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1610744000
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1627521216
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1644298432
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1661075648
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1677852864
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1694630080
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1711407296
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1728184512
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1744961728
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1761738944
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1778516160
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1795293376
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1812070592
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1828847808
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1845625024
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1862402240
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1879179456
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1895956672
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1912733888
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1929511104
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1946288320
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1963065536
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1979842752
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1996619968
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 2013397184
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1342308544
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1359085760
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1375862976
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1392640192
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1409417408
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1426194624
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG addr: 1442971840
Dec  3 12:49:56 dnsmasq-dhcp[8]: DEBUG END addr: 1442971840
Dec  3 12:49:56 dnsmasq-dhcp[8]: 3962316867 DHCPDISCOVER(br-ctlplane) 52:54:00:a9:a4:35 no address available <==========================

Comment 16 David Hill 2021-12-03 18:28:17 UTC
Dec  3 13:19:50 dnsmasq-dhcp[7]: 639609606 available DHCP range: 192.0.2.80 -- 192.0.2.120
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1711407296
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1711407296
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1728184512
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1728184512
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1744961728
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1744961728
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1761738944
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1761738944
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1778516160
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1778516160
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1795293376
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1795293376
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1812070592
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1812070592
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1828847808
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1828847808
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1845625024
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1845625024
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1862402240
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1862402240
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1879179456
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1879179456
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1895956672
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1895956672
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1912733888
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1912733888
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1929511104
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1929511104
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1946288320
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1946288320
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1963065536
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1963065536
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1979842752
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1979842752
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1996619968
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 1996619968
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 2013397184
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG ping addr: 2013397184
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1342308544
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1359085760
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1375862976
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1392640192
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1409417408
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1426194624
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1442971840
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1459749056
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1476526272
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1493303488
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1510080704
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1526857920
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1543635136
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1560412352
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1577189568
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1593966784
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1610744000
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1627521216
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1644298432
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1661075648
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1677852864
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG addr: 1694630080
Dec  3 13:19:50 dnsmasq-dhcp[7]: DEBUG END addr: 1694630080
Dec  3 13:19:50 dnsmasq-dhcp[7]: 639609606 DHCPDISCOVER(br-ctlplane) 52:54:00:dd:bd:c8 no address available

Comment 17 David Hill 2021-12-03 18:29:19 UTC
--- dnsmasq-2.79/src/dhcp.c.orig	2018-03-17 14:39:23.000000000 -0400
+++ dnsmasq-2.79/src/dhcp.c	2021-12-03 13:05:55.530713689 -0500
@@ -766,6 +766,7 @@
 		 ((ntohl(addr.s_addr) & 0xff) != 0xff && ((ntohl(addr.s_addr) & 0xff) != 0x0))))
 	      {
 		struct ping_result *r;
+	    my_syslog(MS_DHCP | LOG_INFO, _("DEBUG ping addr: %d"), addr);
 		
 		if ((r = do_icmp_ping(now, addr, j, loopback)))
  		  {
@@ -790,10 +791,13 @@
 	    
 	    if (addr.s_addr == htonl(ntohl(c->end.s_addr) + 1))
 	      addr = c->start;
+
+	    my_syslog(MS_DHCP | LOG_INFO, _("DEBUG addr: %d"), addr);
 	    
 	  } while (addr.s_addr != start.s_addr);
 	}
 
+	    my_syslog(MS_DHCP | LOG_INFO, _("DEBUG END addr: %d"), addr);
   return 0;
 }

Comment 18 David Hill 2021-12-03 18:46:35 UTC
I think this is where it fails:
~~~
            if (!d &&
                !lease_find_by_addr(addr) &&
                !config_find_by_address(daemon->dhcp_conf, addr) &&
                (!IN_CLASSC(ntohl(addr.s_addr)) ||
                 ((ntohl(addr.s_addr) & 0xff) != 0xff && ((ntohl(addr.s_addr) & 0xff) != 0x0))))
~~~
because we don't see some of the IPs hitting the ping condition:
~~~
                if ((r = do_icmp_ping(now, addr, j, loopback)))
                  {
                    /* consec-ip mode: we offered this address for another client
                       (different hash) recently, don't offer it to this one. */
                    if (!option_bool(OPT_CONSEC_ADDR) || r->hash == j)
                      {
                        *addrp = addr;
                        return 1;
                      }
                  }
~~~

Comment 28 David Hill 2021-12-04 14:14:00 UTC
Created attachment 1844699 [details]
Prints a "selected addr" when selecting an IP.

Comment 30 David Hill 2021-12-04 15:11:57 UTC
I think the issue is here :

                    if (!option_bool(OPT_CONSEC_ADDR) || r->hash == j)


because we get this:
Dec  4 09:50:38 dnsmasq-dhcp[8]: 1006450191 available DHCP range: 192.0.2.80 -- 192.0.2.120
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.107 context 0 lease 1 config 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.108
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.108 ping 1 epoch 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.108 context 0 lease 0 config 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.109
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.109 ping 1 epoch 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.109 context 0 lease 0 config 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.110
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.110 ping 1 epoch 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.110 context 0 lease 0 config 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.111
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.111 ping 1 epoch 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.111 context 0 lease 0 config 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.112
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.112 ping 1 epoch 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.112 context 0 lease 0 config 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.113
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.113 ping 1 epoch 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.113 context 0 lease 0 config 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.114
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.114 ping 1 epoch 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.114 context 0 lease 0 config 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.115
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.115 ping 1 epoch 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.115 context 0 lease 0 config 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.116
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.116 ping 1 epoch 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.116 context 0 lease 0 config 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.117
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.117 ping 1 epoch 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.117 context 0 lease 0 config 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.118
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.118 ping 1 epoch 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.118 context 0 lease 0 config 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.119
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.119 ping 1 epoch 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.119 context 0 lease 0 config 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.120
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.120 ping 1 epoch 0
Dec  4 09:50:38 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.120 context 0 lease 0 config 0


and no IP is selected even though many of them are available.

Comment 31 David Hill 2021-12-04 15:32:39 UTC
Issue appears to be that r->hash is sometimes undefined:

Dec  4 10:24:54 dnsmasq-dhcp[7]: 283874850 available DHCP range: 192.0.2.80 -- 192.0.2.120
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.108 context 0 lease 1 config 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: not null 192.0.2.109 0 -1591018936
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.109 ping 1 epoch 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.109 context 0 lease 0 config 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: not null 192.0.2.110 0 -1591018936
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.110 ping 1 epoch 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.110 context 0 lease 0 config 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: not null 192.0.2.111 0 -1591018936
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.111 ping 1 epoch 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.111 context 0 lease 0 config 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: not null 192.0.2.112 0 -1591018936
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.112 ping 1 epoch 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.112 context 0 lease 0 config 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: not null 192.0.2.113 0 -1591018936
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.113 ping 1 epoch 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.113 context 0 lease 0 config 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: not null 192.0.2.114 0 -1591018936
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.114 ping 1 epoch 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.114 context 0 lease 0 config 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: not null 192.0.2.115 0 -1591018936
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.115 ping 1 epoch 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.115 context 0 lease 0 config 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: not null 192.0.2.116 0 -1591018936
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.116 ping 1 epoch 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.116 context 0 lease 0 config 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: not null 192.0.2.117 0 -1591018936
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.117 ping 1 epoch 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.117 context 0 lease 0 config 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: not null 192.0.2.118 0 -1591018936
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.118 ping 1 epoch 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.118 context 0 lease 0 config 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: not null 192.0.2.119 0 -1591018936
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.119 ping 1 epoch 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.119 context 0 lease 0 config 0
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: not null 192.0.2.120 0 -1591018936
Dec  4 10:24:54 dnsmasq-dhcp[7]: DEBUG addr: 192.0.2.120 ping 1 epoch 0

from this patch:
                if ((r = do_icmp_ping(now, addr, j, loopback)))
                  {
                    my_syslog(MS_DHCP | LOG_INFO, _("DEBUG addr: not null %s %d %d"), inet_ntoa(addr), r->hash, j);
                    /* consec-ip mode: we offered this address for another client
                       (different hash) recently, don't offer it to this one. */
                    if (!option_bool(OPT_CONSEC_ADDR) || r->hash == j)



and I suspect the issue comes from here:

  if ((count >= max) || option_bool(OPT_NO_PING) || loopback)
    {
      /* overloaded, or configured not to check, loopback interface, return "not in use" */
      dummy.hash = 0;
      return &dummy;
    }

Comment 32 David Hill 2021-12-04 15:43:56 UTC
Yep, that's it:

Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.92 -1102741403 -1755461948
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.92 ping 1 epoch 0
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.92 context 0 lease 0 config 0
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: Throtteling192.0.2.93
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.93 0 -1755461948
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.93 ping 1 epoch 0
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.93 context 0 lease 0 config 0
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: Throtteling192.0.2.94
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.94 0 -1755461948
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.94 ping 1 epoch 0
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.94 context 0 lease 0 config 0
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: Throtteling192.0.2.95
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.95 0 -1755461948
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.95 ping 1 epoch 0
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.95 context 0 lease 0 config 0
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: Throtteling192.0.2.96
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.96 0 -1755461948
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.96 ping 1 epoch 0
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.96 context 0 lease 0 config 0
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: Throtteling192.0.2.97
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: not null 192.0.2.97 0 -1755461948
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.97 ping 1 epoch 0
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: 192.0.2.97 context 0 lease 0 config 0
Dec  4 10:41:46 dnsmasq-dhcp[8]: DEBUG addr: Throtteling192.0.2.98



The DHCP ping throtteling functionnality appears to be either broken or too sensitive ... you can see above it hit that dummy.hash = 0 .

Comment 33 David Hill 2021-12-04 16:10:11 UTC
I change the behavior to kick in throttling only if count > 10 but we could probably do it many other ways:

  if (((count > 10) && (count >= max)) || option_bool(OPT_NO_PING) || loopback)
    {
      my_syslog(MS_DHCP | LOG_INFO, _("DEBUG addr: Throttling %s %d/%d"), inet_ntoa(addr),count,max);
      /* overloaded, or configured not to check, loopback interface, return "not in use" */
      dummy.hash = 0;
      return &dummy;
    }


... after how many pings is dnsmasq really overloaded ?   In this case, we got 6 ping in the last 30/3*0.6 so this means after 6 ping in the last few seconds, we stop pinging and all DHCPDISCOVERY are halted... I feel like this is a higher cost than just overloading dnsmasq with ping or returning something else than 0 ?  Perhaps we should just return the right hash always ?

Comment 34 David Hill 2021-12-04 16:12:46 UTC
with count > 10, I no longer hit my issue ...

Comment 39 David Hill 2021-12-04 17:57:33 UTC
I changed the 0.6 value to 0.7 and will see what happens.

Comment 40 David Hill 2021-12-04 18:12:55 UTC
Looks like with 0.7  it works fine too.

Comment 41 David Hill 2021-12-04 18:16:21 UTC
This is enough to solve my current issue:

[dhill@knox src]$ git diff
diff --git a/src/dhcp.c b/src/dhcp.c
index e500bc2..5f23407 100644
--- a/src/dhcp.c
+++ b/src/dhcp.c
@@ -710,7 +710,7 @@ struct ping_result *do_icmp_ping(time_t now, struct in_addr addr, unsigned int h
 {
   static struct ping_result dummy;
   struct ping_result *r, *victim = NULL;
-  int count, max = (int)(0.6 * (((float)PING_CACHE_TIME)/
+  int count, max = (int)(0.7 * (((float)PING_CACHE_TIME)/
                                ((float)PING_WAIT)));
 
   /* check if we failed to ping addr sometime in the last

Comment 45 Petr Menšík 2021-12-07 22:23:29 UTC
(In reply to David Hill from comment #37)
> I'm ashamed... the right fix is this:
> 
>       if (!option_bool(OPT_CONSEC_ADDR) || r->hash == j || r->hash == 0)
> 
> instead of:
>       if (!option_bool(OPT_CONSEC_ADDR) || r->hash == j)
> 
> ... right ?

This seems correct. This do_icmp_ping logic seems strange to me. It maxes check to 6 pings done in last 30 second. If it does not find recent ping or pings are disabled, it would return dummy ping result. Purpose of that were to limit number of pings done.

Of course this is done only when dhcp-sequential-ip is used. In the other case it would always skip hash check. I think pings with hash == 0 are done when dhcp request contained specific IP requested from client. That is tested when lease for requested address does not exist (anymore).

src/rfc2131.c:1115>
  else if (opt && address_available(context, addr, tagif_netid) && !lease_find_by_addr(addr) && 
		   !config_find_by_address(daemon->dhcp_conf, addr) && do_icmp_ping(now, addr, 0, loopback))

I think it should usually mean such address were requested in DISCOVER message, but no other record than ping is tracked for it. If we try it the second time from allocate_address, do_icmp_ping would find previous record. But because hash value is calculated only in allocate_address, it would not reuse previous ping. I think it should try to use hash also from opt ping. It would help preventing collisions, because hash should match for the same hwaddr request. Your change side-stepped it by assuming any ping with hash 0 is okay always.

I am not sure the logic is okay for busy networks. Be it 6 or 7, it is too little. It seems to me it should be derived from total addresses in range or something like that. It now ensures no more than 6 addresses pinged in last 30 seconds are taken into the account. I think it should prevent pinging the same address often. But if we have 253 address if offered range, I think 30 pings in last 30s is still quite okay. So if 15 machines/containers starts at the similar time, it would be better to ping them also. On the other hand if all machines use only DHCP, pinging might be disabled altogether.

Comment 46 Petr Menšík 2021-12-07 23:06:48 UTC
Okay, because of icmp_ping() is implemented in synchronous and blocking way, I don't think we want to increase number of pings. It should be just occasional, for higher performance it should be disabled. It would do just limited incoming queries processing in delay_dhcp, when icmp_ping is running. It might run up to 3 seconds in this function.

Comment 47 Petr Menšík 2021-12-07 23:24:26 UTC
There is missing upstream commit 0669ee7a69a004ce34fed41e50aa575f8e04427b [1], which should improve the situation. RHEL8 dnsmasq 2.79 assigns dummy.hash = 0, but that is then not accepted. Different change were made upstream with similar results to David's change. It ensures hashes would always match when dummy is returned. That change were first in 2.80 release. It might still hit corner case via opt requested address with 0 hash value, but with lesser probability.

1. http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=0669ee7a69a004ce34fed41e50aa575f8e04427b

Comment 49 Petr Menšík 2021-12-08 00:29:40 UTC
Posted a patch [1] to upstream, fixing the second place doing do_icmp_ping() to also use hash value. It seems to me increasing max value should not be the correct fix for this issue. It may just reduce likehood of this issue happening. Address assignment has to work both in case of "overload" and when ping did timeout. If increasing max would fix anything, then the algorithm is wrong.

1. https://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2021q4/015728.html

Comment 50 Petr Menšík 2021-12-08 00:32:25 UTC
Created attachment 1845163 [details]
dnsmasq 2.86 additional change

Adds hash to ping used from requested client address, which does not have a lease. Either yet or not anymore. But if that address is not taken and not pinged, it should count as successful allocation check.

Comment 51 Petr Menšík 2021-12-08 00:33:24 UTC
Created attachment 1845164 [details]
dnsmasq 2.86 ping optimization

Comment 62 Petr Menšík 2021-12-08 12:04:56 UTC
(In reply to Petr Menšík from comment #49)
> Posted a patch [1] to upstream, fixing the second place doing do_icmp_ping()
> to also use hash value. It seems to me increasing max value should not be
> the correct fix for this issue. It may just reduce likehood of this issue
> happening. Address assignment has to work both in case of "overload" and
> when ping did timeout. If increasing max would fix anything, then the
> algorithm is wrong.
> 
> 1.
> https://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2021q4/015728.html

Used wrong link, correct one is:
https://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2021q4/015982.html

Comment 63 Petr Menšík 2021-12-08 12:27:15 UTC
There is also one additional upstream commit 9c0d445e [1] from 2.81 release, which fixes sequential address assignment after one address were refused. It might be related also to our case. But I have not seen any decline in provided logs. Not sure it should help us.

1. http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=9c0d445ef4abffa2b9342ad65e85ef425c1f83bb

Comment 68 Petr Menšík 2021-12-09 00:30:42 UTC
Oh my, that concept of ping and cache for it is implemented in very wrong way. I have prepared relative simple reproducer spawning number of dhclients in namespace. They don't even configure the interface, but just do some real DHCP transaction. When multiple clients do discover at the same time, dnsmasq with pings enabled process them quite slow way. Because it pings just the first client and then enters delay_dhcp from icmp_ping() function. Then it waits for 3 seconds and does not send any offers in the mean time. When ping does not receive answer, it means that is good. Only one OFFER is sent to client which were pinged and a new one is processed. It can ping this way for 18 seconds, then enters "overload" mode. That skips more pings and sends offer more often. After about 30s again it starts pinging again.

My build containing changes from comment #47, comment #50 and comment #63 improved the situation somehow. My test instances eventually were able to obtain an address, but it took 16 instances of clients almost 70 seconds to receive all addresses. That seems way too much. Pinging and waiting for just single client is a bad idea, I think that has to change in more recent versions.

I think no-ping should be used to avoid issues where many clients start at similar time. Take restoration after power outage as a good example. A good dhcp client may check address after that is offered. With current dnsmasq code pinging is not scalable. But comment #47 change is required for working no-ping in sequential mode.

Comment 69 Petr Menšík 2021-12-09 00:38:57 UTC
Created attachment 1845356 [details]
basic reproducer script

Uses netns and dhclient to emulate DHCP sessions. Starts multiple instances at the same time. Expects BRIDGE interface connected to listening dnsmasq instance, which is configured to provide dhcp.

Used just this in addition to default config:

interface=enXY # replace with a proper one

dhcp-range=192.168.120.30,192.168.120.50
log-dhcp
dhcp-sequential-ip

Comment 73 Petr Menšík 2021-12-09 01:17:31 UTC
With no-ping option in place, all 16 instances receive address under 8 seconds. I think that is how it should work. I would recommend using no-ping on every network, where clients behave in a good way and do not manually set IP addresses in DHCP ranges. Ie. where non-leased address conflicts can be prevented. I think IPv4 ping algorithm needs not a small rework to be more effective and usable. It is more workaround and performance improvement, but I would recommend it where possible.

Not only that, dnsmasq does not track 'offered' addresses at all. It partially uses ping cache as a workaround, but because its implementation is lacking, I do not think it is enough. I think short lived lease to offered host might be better, but has to be implemented well. Offers has to have lower priority than address requests. If more machines discovers than unassigned addresses pool size can supply, it would have to race for remaining addresses. Mentioned two upstream changes improves it, but they are not enough. My previous change does not fix the root of this issue also. I think it needs discussion with upstream to select a good way to improve it.

Comment 74 David Hill 2021-12-09 01:33:50 UTC
Would this work:
~~~
dhcp-range=192.168.120.30,192.168.120.50
log-dhcp
dhcp-sequential-ip
no-ping
~~~
?

Comment 75 David Hill 2021-12-09 01:38:15 UTC
Also, what's wrong with allowing more than 6 pings in 30 seconds ?  I like just tweaking that value and maybe we should just allow users to override that value through a configuration change.

How can I download your latest build ?  It looks like I can't from brew...

Comment 77 David Hill 2021-12-09 01:47:01 UTC
Created attachment 1845361 [details]
dnsmasq.conf being used

Comment 78 David Hill 2021-12-09 01:53:55 UTC
Ok so with this dnsmasq-2.79-22 package and with "no-ping" and "dhcp-sequential-ip", I get around 70% of the PXE booting VMs that will just break and get no IPs at all.

Comment 80 Petr Menšík 2021-12-09 02:07:33 UTC
I am not sure why more than 6 pings helps. I think it is some unintended side effect anyway. According to packet dumps pinging makes dhcpoffers painfully slow. I don't think it is wise to increase their count in this situation. If those pings were made parallel way, it would work nice. But we are not there yet.

Comment 81 David Hill 2021-12-09 13:47:18 UTC
Created attachment 1845474 [details]
pcap

Comment 82 David Hill 2021-12-09 13:49:20 UTC
Created attachment 1845475 [details]
dnsmasq.log

Comment 85 Petr Menšík 2021-12-11 02:25:28 UTC
(In reply to David Hill from comment #81)
> Created attachment 1845474 [details]
> pcap

Thank you David. I think it illustrates well real problem in dnsmasq. Which is present still in latest versions.

Used wireshark filter:
(dhcp.hw.mac_addr == 52:54:00:b9:d5:c0) || (dhcp.hw.mac_addr == 52:54:00:84:b8:a0)

$client1=52:54:00:b9:d5:c0
$client2=52:54:00:84:b8:a0

Problem is address 192.0.2.102 is offered to both clients.

$client1 requests first in frame 686. dnsmask ACKS in 689.
$client2 requests after that in frame 699. Requests multiple times. And gets multiple NAK responses, starting in frame 716.
$client2 would not fall back to discover phase after NAK, but requests again in frames 731 and 737. Of course with the same result.

dhclient seems to solve this misbehaviour of dnsmasq, but PXE clients on boot cannot. I think primary problem is in dnsmasq offering the same address to both clients in the first place. It still has free leases and it is not required. I think enabled pings mitigates this behaviour, because ping cache remembers 6 last hw addr hashes. If 9 machines start at the same time, it would happen even with ping enabled. With no-ping, it happens much often with dhcp-sequential-ip. I think mandatory to hit this issue is limited address pool. If available address pool is about the same size of started machines together, it would happen almost always.

Without dhcp-sequential-ip at least addr_epoch helps offering the same address multiple times. But the main issue is always there, just different chances to hit that. no-ping+dhcp-sequential-ip has highest chance to hit this issue I think.

It is related to bug #1998448, where similar issue happened on IPv6. We were able to side-step it by offering different address on ACK. I think it should too change behaviour and offer unique addresses to unique hw addresses if enough unused leases remains.
It seems ISC DHCP server creates lease right after discover, before it sends offer. It therefore would not offer multiple

Comment 86 Petr Menšík 2021-12-11 02:35:58 UTC
I think exactly this issue were reported already in bug #1969908, but I failed to spot what and why is wrong back then. It is more clear with reproducer now and pcap records for analysis. Keeping both open, because they have different components. Both should be fixed by a single fix however.

Comment 87 Petr Menšík 2021-12-11 02:44:08 UTC
Putting back condnack. Previously proposed upstream patch helps, but clearly not properly fixes the issue. Not in the way needed. I think temporary leases has to be created before sending offer. That way offers would receive different addresses, because address would be reserved to fixed hw address.

There might be still a race condition when multiple clients discovers, but only single remaining free lease is available. It has to allow second client to finish lease successfully even if client1 does not continue with request. But that should be solved only when not enough unoffered addresses is available and avoided otherwise.

Comment 88 Petr Menšík 2021-12-11 12:25:51 UTC
Pushed candidate fix to public branch at: https://github.com/InfrastructureServices/dnsmasq/tree/dhcp-temp-leases

Comment 90 Petr Menšík 2022-07-08 21:29:55 UTC
Posted patches to upstream for review.

https://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2022q3/016471.html

Comment 91 Petr Menšík 2023-01-23 17:26:40 UTC
The upstream maintainer refused this proposed change and suggested fixed allocation for every machine should be done to prevent races. Do we have use cases, where we do not know which machines might be up?

In other words, when IP pool has 30 addresses, does the deployment sometime run different set of machines? I understand it does not include relative simple change in dnsmasq, but then requires non-trivial orchestration around to assign fixed IPs to all machines. Do we know use cases where this would not be even possible?

Do we know use cases where information for netbooted instances are not known in advance to configure their leases properly? I am not sure how to solve this issue with upstream refusing my proposed change. Can you help me gather minimal requirements needed from dnsmasq?

Comment 93 Fernando Royo 2023-04-17 14:51:59 UTC
[Bug Triage] We close as no info has been reported from the CU after several weeks, so if the CU provides user case info feel free to reopen.


Note You need to log in before you can comment on or make changes to this bug.