Bug 1536844

Summary: Sometimes dhcp_release packet isn't reaching dnsmasq process because it's being reloaded
Product: Red Hat Enterprise Linux 7
Reporter: David Hill <dhill>
Component: dnsmasq
Assignee: Petr Menšík <pemensik>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Ofer Blaut <oblaut>
Severity: urgent
Docs Contact:
Priority: medium
Version: 7.4
CC: amuller, bhaley, chrisw, dhill, dwojewod, jlebon, jlibosva, ljozsa, majopela, miabbott, mlinden, nyechiel, oblaut, pablo.iranzo, pemensik, psklenar, ragiman, srevivo, thozza, vcojot
Target Milestone: pre-dev-freeze
Target Release: 7.4
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1565615 (view as bug list)
Environment:
Last Closed: 2018-05-21 12:31:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1549614, 1565615, 1574212, 1578414, 1578415
Attachments:
Description                              Flags
dnsmasq release all script               none
mass lease creation script               none
mass lease creation with short delay.    none

Description David Hill 2018-01-21 16:07:25 UTC
Description of problem:
dhcp_release isn't run every time a VM is deleted, so the lease stays in the lease file, which prevents that IP from being allocated again. The following simple loop reproduces the issue fairly easily in the customer environment:

while true; do
  openstack server create rdcshjy_TestVm --image rdcshjy_TestVm --flavor vran-vrc-2 --nic net-id=om_ran --max 10
  sleep 120
  openstack server list
  for server in $(openstack server list | awk -F'|' '/TestVm/{print($2)}'); do openstack server delete $server; done
  sleep 120
done


This might be a race condition in the deletion steps, so to confirm we're hitting it, I've asked the customer to retry with a delay after each deletion:

while true; do
  openstack server create rdcshjy_TestVm --image rdcshjy_TestVm --flavor vran-vrc-2 --nic net-id=om_ran --max 10
  sleep 120
  openstack server list
  for server in $(openstack server list | awk -F'|' '/TestVm/{print($2)}'); do openstack server delete $server; sleep 120; done
  sleep 120
done



Version-Release number of selected component (if applicable):


How reproducible:
Almost always

Steps to Reproduce:
1. Create VMs
2. Delete VMs

Actual results:
Sometimes the lease isn't released and stays in the lease file / dnsmasq process.

Expected results:
The lease should always be released.

Additional info:

Comment 12 David Hill 2018-02-12 17:05:01 UTC
I'm wondering if this is not simply due to dnsmasq being reloaded while another dhcp_release is called.

This would look like a race condition:

        self._release_unused_leases()
        self._spawn_or_reload_process(reload_with_HUP=True)

So we send one (or several) dhcp_release calls and then we spawn or reload the dnsmasq process.

The port unreachable returned via ICMP would indicate that the process might have been reloading at the moment another dhcp_release was being sent.

Comment 18 Petr Menšík 2018-02-28 21:51:09 UTC
Created attachment 1402126
dnsmasq release all script

Comment 19 Petr Menšík 2018-02-28 21:51:41 UTC
Created attachment 1402127
mass lease creation script

Comment 29 Petr Menšík 2018-03-13 19:21:55 UTC
Created attachment 1407701
mass lease creation with short delay.

Use with -w 100 to allocate 100 leases with delay, -n 100 without delay.

Comment 64 Brian Haley 2018-04-02 15:05:03 UTC
Regarding this from Comment #58:

> First problem is that _release_lease never retries on any error

Do you think it's worthwhile to change the agent to retry on failure, maybe 3 times with a slight delay between each? The only downside is that if we have a large number of leases to release and try multiple times for each, we could increase the time it takes to reload things. And setting the delay to, say, 0.1 seconds might not be enough time between tries for whatever caused the error to clear.
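
A minimal sketch of the retry idea above, not the agent's actual code. The names release_with_retry and release_once are hypothetical; release_once is assumed to raise RuntimeError on failure, and the defaults of 3 attempts and 0.1 seconds are just the figures floated above.

```python
import time


def release_with_retry(release_once, attempts=3, delay=0.1):
    """Call release_once() until it succeeds or the attempts run out.

    release_once is a hypothetical callable that raises RuntimeError on
    failure. Returns the number of attempts used, or re-raises the last
    error once every attempt has failed.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            release_once()
            return attempt
        except RuntimeError as exc:
            last_error = exc
            if attempt < attempts:
                # Give whatever caused the error some time to clear.
                time.sleep(delay)
    raise last_error
```

With these defaults the worst case sleeps twice per lease, so 100 failing leases would add roughly 100 * 2 * 0.1 = 20 seconds before the reload.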

Comment 65 David Hill 2018-04-03 02:30:05 UTC
Since it's UDP, I'm not sure the dhcp_release call will get an error here:

        try:
            ip_wrapper.netns.execute(cmd, run_as_root=True)
        except RuntimeError as e:
            # when failed to release single lease there's
            # no need to propagate error further
            LOG.warning(_LW('DHCP release failed for %(cmd)s. '
                            'Reason: %(e)s'), {'cmd': cmd, 'e': e})


The only time it'll fail is when the command itself fails to start... unless I'm forgetting something here?
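
To illustrate the UDP point, here is a generic sketch of how a sender can (or cannot) learn that nobody is listening. It says nothing about how dhcp_release itself opens its socket, which I have not checked. On Linux, a connected UDP socket gets the ICMP port unreachable back as ECONNREFUSED on a later call, whereas a sender that uses an unconnected socket and never reads from it typically sees nothing. probe_udp_port is a hypothetical helper name.

```python
import socket


def probe_udp_port(host, port, timeout=0.5):
    """Send one datagram on a connected UDP socket and report the outcome.

    Returns "refused" if the kernel reported an ICMP port unreachable,
    or "no-error-seen" if no error surfaced within the timeout.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        # connect() on UDP sends nothing, but it lets the kernel deliver
        # ICMP errors for this destination back to this socket.
        sock.connect((host, port))
        try:
            sock.send(b"probe")
            sock.recv(1)  # the error, if any, surfaces on a later call
        except ConnectionRefusedError:
            return "refused"
        except socket.timeout:
            return "no-error-seen"
    return "no-error-seen"
```

A "no-error-seen" result does not prove the datagram was processed, only that no error came back, which is why checking the lease file is a more reliable signal than the command's exit status.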

Comment 68 Miguel Angel Ajo 2018-04-03 12:22:00 UTC
Why don't we monitor the lease file dnsmasq generates to make sure we SIGHUP it only once the lease has been cleared? We could retry based on that file until we verify it has been properly cleaned up.

And fail after some amount of attempts.

As @Brian said, this would slow the process down, but it would make it more robust whether or not there is a bug in dnsmasq, which comment 2 could be indicative of.
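
A rough sketch of the lease-file check described above, under stated assumptions: it is not the neutron change, the function names are hypothetical, and it assumes the usual dnsmasq lease file layout where each lease line is "expiry mac ip hostname client-id" (so the address is the third field) and DHCPv6 files carry an extra "duid ..." line.

```python
import time


def leased_ips(lease_file):
    """Return the set of addresses currently listed in a dnsmasq lease file."""
    ips = set()
    with open(lease_file) as handle:
        for line in handle:
            fields = line.split()
            # Skip blank or short lines and the "duid ..." line used for DHCPv6.
            if len(fields) < 3 or fields[0] == "duid":
                continue
            ips.add(fields[2])  # third field is the leased address
    return ips


def wait_until_released(lease_file, ip, release_once, attempts=3, delay=0.1):
    """Retry release_once() until ip disappears from lease_file.

    release_once is a hypothetical callable that sends one release.
    Returns True once the lease is gone, False if it is still present
    after the given number of attempts.
    """
    for _ in range(attempts):
        if ip not in leased_ips(lease_file):
            return True
        release_once()
        time.sleep(delay)  # let dnsmasq process the release and rewrite the file
    return ip not in leased_ips(lease_file)
```

The caller would send SIGHUP only after wait_until_released() returns True for every lease it tried to release, and log a failure otherwise.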

Comment 70 Brian Haley 2018-04-12 01:37:49 UTC
WIP patch for watching lease file for success in neutron dhcp-agent:

https://review.openstack.org/#/c/560703/

Tracking progress in https://bugzilla.redhat.com/show_bug.cgi?id=1565615

Comment 74 Brian Haley 2018-05-02 19:38:32 UTC
The neutron change has merged upstream, and it's unclear to me if there is any remaining work for RHEL to do here, so this can probably be closed if that is the case.