1536844 – Sometimes dhcp_release packet isn't reaching dnsmasq process because it's being reloaded

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	dnsmasq
Sub Component:
Version:	7.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	urgent
Target Milestone:	pre-dev-freeze
Target Release:	7.4
Assignee:	Petr Menšík
QA Contact:	Ofer Blaut
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1549614 1565615 1574212 1578414 1578415

Reported:	2018-01-21 16:07 UTC by David Hill
Modified:	2021-06-10 14:17 UTC
CC List:	20 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1565615
Environment:
Last Closed:	2018-05-21 12:31:05 UTC
Target Upstream Version:
Embargoed:

Attachments
dnsmasq release all script (228 bytes, application/x-shellscript) 2018-02-28 21:51 UTC, Petr Menšík
mass lease creation script (854 bytes, application/x-shellscript) 2018-02-28 21:51 UTC, Petr Menšík
mass lease creation with short delay. (986 bytes, application/x-shellscript) 2018-03-13 19:21 UTC, Petr Menšík

Description David Hill 2018-01-21 16:07:25 UTC

Description of problem:
dhcp_release isn't run every time a VM is deleted, so the lease stays in the lease file, which prevents that IP from being allocated again. The following simple loop reproduces the issue fairly easily in the customer environment:

while true; do
  openstack server create rdcshjy_TestVm --image rdcshjy_TestVm --flavor vran-vrc-2 --nic net-id=om_ran --max 10
  sleep 120
  openstack server list
  for server in $(openstack server list | awk -F'|' '/TestVm/{print($2)}'); do openstack server delete $server; done
  sleep 120
done


This might be a race condition in the deletion steps, so to make sure that's what we're hitting, I've asked the customer to retry with a 120-second sleep after each delete:

while true; do
  openstack server create rdcshjy_TestVm --image rdcshjy_TestVm --flavor vran-vrc-2 --nic net-id=om_ran --max 10
  sleep 120
  openstack server list
  for server in $(openstack server list | awk -F'|' '/TestVm/{print($2)}'); do openstack server delete $server; sleep 120; done
  sleep 120
done



Version-Release number of selected component (if applicable):


How reproducible:
Almost always

Steps to Reproduce:
1. Create VMs
2. Delete VMs
3.

Actual results:
sometimes the lease isn't released from the lease file / dnsmasq process

Expected results:
Should always be released

Additional info:

Comment 12 David Hill 2018-02-12 17:05:01 UTC

I'm wondering if this is not simply due to dnsmasq being reloaded while another dhcp_release is called.

This would look like a race condition:

        self._release_unused_leases()
        self._spawn_or_reload_process(reload_with_HUP=True)

So we send one dhcp_release (or a bunch of them) and then we spawn or reload the dnsmasq process.

The port unreachable returned via ICMP would indicate that the process might have been reloading at the moment the other dhcp_release was sent.

Comment 18 Petr Menšík 2018-02-28 21:51:09 UTC

Created attachment 1402126
dnsmasq release all script

Comment 19 Petr Menšík 2018-02-28 21:51:41 UTC

Created attachment 1402127
mass lease creation script

Comment 29 Petr Menšík 2018-03-13 19:21:55 UTC

Created attachment 1407701
mass lease creation with short delay.

Use with -w 100 to allocate 100 leases with delay, -n 100 without delay.

Comment 64 Brian Haley 2018-04-02 15:05:03 UTC

Regarding this from Comment #58:

> First problem is that _release_lease never retries on any error

Do you think it's worthwhile to change the agent to retry on failure, maybe 3 times with a slight delay between each?  The only downside is that if we have a large number of leases to release and try multiple times for each, we could increase the time it takes to reload things.  And setting the delay to say, 0.1 seconds, might not be enough time between tries for whatever caused the error to clear.
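
To make the trade-off concrete, here is a minimal sketch of a bounded retry with a fixed delay. It is not the agent's code: release_with_retry and the release_once callable are hypothetical names, and the sketch assumes a failed release surfaces as a RuntimeError, as in the agent's existing except clause.

```python
import time


def release_with_retry(release_once, attempts=3, delay=0.1, sleep=time.sleep):
    """Call release_once() until it succeeds or the attempts run out.

    release_once is a hypothetical callable that raises RuntimeError on
    failure. Returns the number of attempts used; re-raises the last
    error if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            release_once()
            return attempt
        except RuntimeError:
            if attempt == attempts:
                raise
            # Fixed pause between tries; total worst-case extra time per
            # lease is (attempts - 1) * delay.
            sleep(delay)
```

With 3 attempts and a 0.1 second delay, each lease that keeps failing adds about 0.2 seconds of sleeping, which is the slowdown mentioned above when many leases are released at once.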

Comment 65 David Hill 2018-04-03 02:30:05 UTC

Since it's UDP, I'm not sure the dhcp_release call will get an error here:

        try:
            ip_wrapper.netns.execute(cmd, run_as_root=True)
        except RuntimeError as e:
            # when failed to release single lease there's
            # no need to propagate error further
            LOG.warning(_LW('DHCP release failed for %(cmd)s. '
                            'Reason: %(e)s'), {'cmd': cmd, 'e': e})


The only time it'll fail is when the command actually fails to load... unless I'm forgetting something here?
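
As a rough illustration of the UDP point, the sketch below sends a datagram from an unconnected socket to a loopback port that nothing is listening on. It says nothing about how dhcp_release itself is implemented; send_to_closed_port is a made-up name, and the sketch assumes the loopback interface is up.

```python
import socket


def send_to_closed_port(payload=b"release"):
    """Send a UDP datagram to a local port with no listener.

    Returns the number of bytes handed to the kernel by sendto().
    """
    # Find a port that is very likely unused: bind to port 0, note the
    # port the kernel picked, then close the socket again.
    probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    probe.bind(("127.0.0.1", 0))
    port = probe.getsockname()[1]
    probe.close()

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # sendto() on an unconnected UDP socket only queues the datagram;
        # it does not wait for, or report, an ICMP port unreachable reply.
        return sender.sendto(payload, ("127.0.0.1", port))
    finally:
        sender.close()
```

The call returns normally even though no process receives the datagram, so a caller that only looks at the return value or exit status would not notice the loss.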

Comment 68 Miguel Angel Ajo 2018-04-03 12:22:00 UTC

Why don't we monitor the lease file dnsmasq generates to make sure we SIGHUP it only once the lease has been cleared? We could retry based on that file until we verify it has been properly cleaned up.

And fail after some amount of attempts.

As @Brian said, this would slow the process down, but it would make it more robust regardless of whether there is a bug in dnsmasq, which comment 2 could be indicative of.
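
The idea above could look roughly like the following sketch. It is not the neutron patch: leased_ips, release_until_cleared, read_leases and release are hypothetical names, and the parser assumes the usual IPv4 lease-file line layout of expiry time, MAC address, IP address, hostname and client ID.

```python
import time


def leased_ips(lease_text):
    """Return the set of IP addresses found in dnsmasq lease-file text.

    Assumes each lease line is "expiry mac ip hostname client-id";
    lines with fewer than three fields are skipped.
    """
    ips = set()
    for line in lease_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            ips.add(fields[2])
    return ips


def release_until_cleared(read_leases, release, ips, attempts=3,
                          delay=0.1, sleep=time.sleep):
    """Release ips and re-check the lease file, retrying leftovers.

    read_leases() returns the current lease-file text and release(ip)
    sends one release; both are supplied by the caller. Returns the set
    of IPs still leased after the last attempt (empty on success).
    """
    pending = set(ips) & leased_ips(read_leases())
    for _ in range(attempts):
        if not pending:
            break
        for ip in sorted(pending):
            release(ip)
        # Give dnsmasq a moment to rewrite the lease file before checking.
        sleep(delay)
        pending &= leased_ips(read_leases())
    return pending
```

A caller would send SIGHUP only when the returned set is empty, and log or fail otherwise.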

Comment 70 Brian Haley 2018-04-12 01:37:49 UTC

WIP patch for watching lease file for success in neutron dhcp-agent:

https://review.openstack.org/#/c/560703/

Tracking progress in https://bugzilla.redhat.com/show_bug.cgi?id=1565615

Comment 74 Brian Haley 2018-05-02 19:38:32 UTC

The neutron change has merged upstream, and it's unclear to me if there is any remaining work for RHEL to do here, so this can probably be closed if that is the case.
