Bug 1536844
| Summary: | Sometimes dhcp_release packet isn't reaching dnsmasq process because it's being reloaded | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | David Hill <dhill> | ||||||||
| Component: | dnsmasq | Assignee: | Petr Menšík <pemensik> | ||||||||
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Ofer Blaut <oblaut> | ||||||||
| Severity: | urgent | Docs Contact: | |||||||||
| Priority: | medium | ||||||||||
| Version: | 7.4 | CC: | amuller, bhaley, chrisw, dhill, dwojewod, jlebon, jlibosva, ljozsa, majopela, miabbott, mlinden, nyechiel, oblaut, pablo.iranzo, pemensik, psklenar, ragiman, srevivo, thozza, vcojot | ||||||||
| Target Milestone: | pre-dev-freeze | ||||||||||
| Target Release: | 7.4 | ||||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | |||||||||||
| : | 1565615 (view as bug list) | Environment: | |||||||||
| Last Closed: | 2018-05-21 12:31:05 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Bug Depends On: | |||||||||||
| Bug Blocks: | 1549614, 1565615, 1574212, 1578414, 1578415 | ||||||||||
| Attachments: |
I'm wondering if this is not simply due to dnsmasq being reloaded while another dhcp_release is called.
This would look like a race condition:
self._release_unused_leases()
self._spawn_or_reload_process(reload_with_HUP=True)
So we send one (or a batch of) dhcp_release calls and then spawn or reload the dnsmasq process.
The port unreachable error returned via ICMP would indicate that dnsmasq might have been reloading, and so not listening on its port, when another dhcp_release was being sent.
Created attachment 1402126 [details]
dnsmasq release all script
Created attachment 1402127 [details]
mass lease creation script
Created attachment 1407701 [details]
mass lease creation with short delay.
Use with -w 100 to allocate 100 leases with delay, -n 100 without delay.
Regarding this from Comment #58:

> First problem is that _release_lease never retries on any error

Do you think it's worthwhile to change the agent to retry on failure, maybe 3 times with a slight delay between each? The only downside is that if we have a large number of leases to release and try multiple times for each, we could increase the time it takes to reload things. And setting the delay to, say, 0.1 seconds might not be enough time between tries for whatever caused the error to clear.

Since it's UDP, I'm not sure the dhcp_release call will get an error here:
    try:
        ip_wrapper.netns.execute(cmd, run_as_root=True)
    except RuntimeError as e:
        # when failed to release single lease there's
        # no need to propagate error further
        LOG.warning(_LW('DHCP release failed for %(cmd)s. '
                        'Reason: %(e)s'), {'cmd': cmd, 'e': e})
The only time it'll fail is when the command itself fails to load... unless I'm forgetting something here?
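For illustration, the retry idea from the comment above might look roughly like the sketch below. `release_with_retry` and `run_cmd` are hypothetical names, not neutron APIs; in the agent, `run_cmd` would presumably wrap `ip_wrapper.netns.execute`. As noted above, dhcp_release sends over UDP, so this only helps in cases where the command itself raises `RuntimeError`.

```python
import time


def release_with_retry(run_cmd, cmd, attempts=3, delay=0.1, sleep=time.sleep):
    """Hypothetical helper: run cmd via run_cmd, retrying on RuntimeError.

    Returns the number of the attempt that succeeded, or None if every
    attempt raised RuntimeError.
    """
    for attempt in range(1, attempts + 1):
        try:
            run_cmd(cmd)
            return attempt
        except RuntimeError:
            # Wait before the next try, but not after the final failure.
            if attempt < attempts:
                sleep(delay)
    return None
```

With N leases, the worst case adds about N * (attempts - 1) * delay seconds before the reload, which is the slowdown mentioned above.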
Why don't we monitor the lease file dnsmasq generates, so that we SIGHUP it only once the lease has been cleared? We could retry based on that file until we verify it has been properly cleaned up, and fail after some number of attempts. As @Brian said, this would slow the process down, but it would make it more robust regardless of whether there is a bug in dnsmasq, which comment 2 could be indicative of.

WIP patch for watching lease file for success in neutron dhcp-agent: https://review.openstack.org/#/c/560703/

Tracking progress in https://bugzilla.redhat.com/show_bug.cgi?id=1565615

The neutron change has merged upstream, and it's unclear to me if there is any remaining work for RHEL to do here, so this can probably be closed if that is the case.
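As a rough sketch of the lease-file check proposed earlier in this thread (not the code from the linked neutron review), the agent could re-send the release until the IP disappears from the lease file. The function names are hypothetical, and the sketch assumes each lease line has the IP address as its third whitespace-separated field (expiry, MAC, IP, hostname, client-id).

```python
import time


def lease_present(lease_file, ip):
    """Return True if ip appears as the third field of any lease line."""
    try:
        with open(lease_file) as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 3 and fields[2] == ip:
                    return True
    except FileNotFoundError:
        # No lease file means there is no lease to wait for.
        return False
    return False


def wait_for_release(lease_file, ip, release, attempts=3, delay=0.1):
    """Hypothetical helper: call release() until ip leaves the lease file.

    Returns True once the lease is gone, False if it is still present
    after the given number of attempts.
    """
    for _ in range(attempts):
        if not lease_present(lease_file, ip):
            return True
        release()
        time.sleep(delay)
    return not lease_present(lease_file, ip)
```

Only after `wait_for_release` returns True for every lease would the agent go on to send SIGHUP; a False result could be logged as a warning rather than blocking the reload.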
Description of problem:
dhcp_release isn't run every time a VM is deleted, so the lease stays in the lease file, which prevents that IP from being allocated again.

The following simple loop reproduces this issue fairly easily in the customer environment:

while true; do
  openstack server create rdcshjy_TestVm --image rdcshjy_TestVm --flavor vran-vrc-2 --nic net-id=om_ran --max 10
  sleep 120
  openstack server list
  for server in $(openstack server list | awk -F'|' '/TestVm/{print($2)}'); do openstack server delete $server; done
  sleep 120
done

This might be a race condition in the deletion steps, so in order to make sure we're hitting this, I've asked the customer to retry with:

while true; do
  openstack server create rdcshjy_TestVm --image rdcshjy_TestVm --flavor vran-vrc-2 --nic net-id=om_ran --max 10
  sleep 120
  openstack server list
  for server in $(openstack server list | awk -F'|' '/TestVm/{print($2)}'); do openstack server delete $server; sleep 120; done
  sleep 120
done

Version-Release number of selected component (if applicable):

How reproducible:
Almost always

Steps to Reproduce:
1. Create VMs
2. Delete VMs
3.

Actual results:
Sometimes the lease isn't released from the lease file / dnsmasq process

Expected results:
Should always be released

Additional info: