Bug 1023818

Summary: Neutron DHCP agent failing to keep dnsmasq in sync when lots of instances are created/deleted at a time
Product: Red Hat OpenStack
Reporter: Graeme Gillies <ggillies>
Component: openstack-neutron
Assignee: Maru Newby <mnewby>
Status: CLOSED ERRATA
QA Contact: Roey Dekel <rdekel>
Severity: high
Priority: medium
Version: 3.0
CC: ahoness, chrisw, ctrianta, danken, dmaley, ggillies, lpeer, mlopes, mwagner, nyechiel, oblaut, pep, rdekel, sclewis, sputhenp, tvvcox, twilson, yeylon
Target Milestone: z3
Keywords: Reopened, ZStream
Target Release: 4.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-neutron-2013.2.2-1.el6ost
Doc Type: Bug Fix
Doc Text:
Networking previously failed to process agent heartbeats in a timely fashion when overloaded, causing agents to be considered in a 'down' state. Since Networking was only sending network changes to the DHCP Agent if it was 'up', an overloaded service would fail to send notifications to the agent. As a result, the DHCP agent would be unable to serve requests for ports for which it was not notified. With this fix, Networking has been updated to always send DHCP notifications even if the target agent is considered 'down'. In addition, an error message will be logged if a DHCP notification cannot be sent due to unavailable DHCP agents. DHCP agents will now always be notified of network changes, and will be able to eventually respond to DHCP requests for configured ports.
Story Points: ---
Clones: 1074344 (view as bug list)
Last Closed: 2014-03-25 19:23:11 UTC
Type: Bug
Bug Blocks: 1066642, 1074344
Attachments:
- nova list after stress deletion (flags: none)
- nova list moment after sending stress deletion cmd (flags: none)
- nova compute log after stress deletion (flags: none)
- nova list which shows bug is resolved (flags: none)

Description Graeme Gillies 2013-10-28 01:51:40 UTC
Hi,

We are using a medium-sized RHOS 3.0 installation and have encountered multiple problems when provisioning and removing a large number of instances at once (between 50 and 100).

What we find is that, in some cases, the DHCP agent fails to keep the dnsmasq processes in sync: entries are either not added in time or, worse, old entries are not removed, so one IP address ends up assigned to multiple MAC addresses (and dnsmasq then fails to respond to the instances' DHCP discover requests).

Looking at bugs upstream it seems this is a known issue

https://bugs.launchpad.net/neutron/+bug/1191768
https://bugs.launchpad.net/neutron/+bug/1185916

There have been some fixes in those upstream issues which have been applied to master. Can we ensure these fixes have been backported into RHOS 3.0?

Here is an example from our dnsmasq leases file which clearly demonstrates the problem (note that 192.168.0.124 and 192.168.0.117 each appear twice; a command for spotting such duplicates follows the listing):

fa:16:3e:b9:09:5a,host-192-168-0-21.openstacklocal,192.168.0.21
fa:16:3e:b4:e6:8a,host-192-168-0-26.openstacklocal,192.168.0.26
fa:16:3e:27:ed:dd,host-192-168-0-121.openstacklocal,192.168.0.121
fa:16:3e:9e:2a:98,host-192-168-0-3.openstacklocal,192.168.0.3
fa:16:3e:01:2f:27,host-192-168-0-23.openstacklocal,192.168.0.23
fa:16:3e:f7:21:ba,host-192-168-0-117.openstacklocal,192.168.0.117
fa:16:3e:e9:09:da,host-192-168-0-131.openstacklocal,192.168.0.131
fa:16:3e:e2:24:0e,host-192-168-0-124.openstacklocal,192.168.0.124
fa:16:3e:41:47:43,host-192-168-0-1.openstacklocal,192.168.0.1
fa:16:3e:b7:ce:1f,host-192-168-0-25.openstacklocal,192.168.0.25
fa:16:3e:86:8e:c3,host-192-168-0-22.openstacklocal,192.168.0.22
fa:16:3e:ea:e9:1b,host-192-168-0-2.openstacklocal,192.168.0.2
fa:16:3e:a6:67:60,host-192-168-0-120.openstacklocal,192.168.0.120
fa:16:3e:07:0e:e9,host-192-168-0-110.openstacklocal,192.168.0.110
fa:16:3e:a0:50:a9,host-192-168-0-115.openstacklocal,192.168.0.115
fa:16:3e:86:55:cc,host-192-168-0-116.openstacklocal,192.168.0.116
fa:16:3e:27:5b:aa,host-192-168-0-117.openstacklocal,192.168.0.117
fa:16:3e:51:c6:35,host-192-168-0-118.openstacklocal,192.168.0.118
fa:16:3e:1f:45:65,host-192-168-0-122.openstacklocal,192.168.0.122
fa:16:3e:a3:0f:c1,host-192-168-0-123.openstacklocal,192.168.0.123
fa:16:3e:f1:e3:51,host-192-168-0-124.openstacklocal,192.168.0.124
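
The duplicates can be spotted with something like the following (an illustrative check only; <path-to-file> stands for the dnsmasq hosts/leases file shown above, e.g. under /var/lib/neutron/dhcp/<network-uuid>/):

    # cut -d"," -f3 <path-to-file> | sort | uniq -d

This prints every IP address that appears more than once (here, 192.168.0.117 and 192.168.0.124).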

If there is any additional information I can provide, let me know. I can also provide access to an environment that replicates the problem if need be.

Regards,

Graeme

Comment 2 Terry Wilson 2013-11-14 18:27:12 UTC
The fix mentioned in the above bug has made it in via normal rebasing off of upstream releases. Any chance you could try out 2013.1.4 and see if this is still an issue?

Comment 3 Graeme Gillies 2013-11-14 23:16:46 UTC
I am running openstack-quantum-2013.1.4-3.el6ost.noarch and haven't seen the problem yet (it has only been a few days). I'll give it another week or so and see if the problem happens again.

Comment 4 lpeer 2013-11-17 07:42:53 UTC
Thanks Graeme.
I'm closing the bug for now; if you encounter the issue again, please reopen.

Comment 5 Graeme Gillies 2013-11-17 22:00:50 UTC
Hi,

We are hitting the problem again; see the dnsmasq entries below (note that 192.168.0.133 appears twice).

fa:16:3e:e7:8d:2b,host-192-168-0-126.openstacklocal,192.168.0.126
fa:16:3e:b9:09:5a,host-192-168-0-21.openstacklocal,192.168.0.21
fa:16:3e:ab:55:d6,host-192-168-0-121.openstacklocal,192.168.0.121
fa:16:3e:32:53:d1,host-192-168-0-127.openstacklocal,192.168.0.127
fa:16:3e:b4:e6:8a,host-192-168-0-26.openstacklocal,192.168.0.26
fa:16:3e:1a:2f:79,host-192-168-0-143.openstacklocal,192.168.0.143
fa:16:3e:9e:2a:98,host-192-168-0-3.openstacklocal,192.168.0.3
fa:16:3e:01:2f:27,host-192-168-0-23.openstacklocal,192.168.0.23
fa:16:3e:c5:aa:af,host-192-168-0-139.openstacklocal,192.168.0.139
fa:16:3e:a9:80:67,host-192-168-0-135.openstacklocal,192.168.0.135
fa:16:3e:e9:09:da,host-192-168-0-131.openstacklocal,192.168.0.131
fa:16:3e:80:67:f4,host-192-168-0-119.openstacklocal,192.168.0.119
fa:16:3e:d6:6a:0e,host-192-168-0-125.openstacklocal,192.168.0.125
fa:16:3e:41:47:43,host-192-168-0-1.openstacklocal,192.168.0.1
fa:16:3e:00:3c:04,host-192-168-0-130.openstacklocal,192.168.0.130
fa:16:3e:d9:d9:d1,host-192-168-0-138.openstacklocal,192.168.0.138
fa:16:3e:b7:ce:1f,host-192-168-0-25.openstacklocal,192.168.0.25
fa:16:3e:86:8e:c3,host-192-168-0-22.openstacklocal,192.168.0.22
fa:16:3e:87:25:b5,host-192-168-0-136.openstacklocal,192.168.0.136
fa:16:3e:ea:e9:1b,host-192-168-0-2.openstacklocal,192.168.0.2
fa:16:3e:9c:2b:5d,host-192-168-0-137.openstacklocal,192.168.0.137
fa:16:3e:92:9f:17,host-192-168-0-124.openstacklocal,192.168.0.124
fa:16:3e:98:ff:51,host-192-168-0-134.openstacklocal,192.168.0.134
fa:16:3e:0c:3d:bb,host-192-168-0-140.openstacklocal,192.168.0.140
fa:16:3e:13:b4:80,host-192-168-0-133.openstacklocal,192.168.0.133
fa:16:3e:b6:84:ca,host-192-168-0-141.openstacklocal,192.168.0.141
fa:16:3e:58:b2:c4,host-192-168-0-145.openstacklocal,192.168.0.145
fa:16:3e:62:ad:d1,host-192-168-0-146.openstacklocal,192.168.0.146
fa:16:3e:2d:7f:be,host-192-168-0-147.openstacklocal,192.168.0.147
fa:16:3e:13:d8:21,host-192-168-0-149.openstacklocal,192.168.0.149
fa:16:3e:28:73:07,host-192-168-0-151.openstacklocal,192.168.0.151
fa:16:3e:b6:bb:21,host-192-168-0-152.openstacklocal,192.168.0.152
fa:16:3e:6f:2c:aa,host-192-168-0-153.openstacklocal,192.168.0.153
fa:16:3e:08:5f:7e,host-192-168-0-154.openstacklocal,192.168.0.154
fa:16:3e:cf:85:df,host-192-168-0-128.openstacklocal,192.168.0.128
fa:16:3e:2b:44:b8,host-192-168-0-129.openstacklocal,192.168.0.129
fa:16:3e:bd:05:5b,host-192-168-0-132.openstacklocal,192.168.0.132
fa:16:3e:03:39:dd,host-192-168-0-133.openstacklocal,192.168.0.133

Regards,

Graeme

Comment 6 Maru Newby 2013-11-19 18:22:26 UTC
Can you reproduce on Havana or only on Grizzly?  I need to know whether we need to target an upstream fix to get into 4.0 or whether the problem is 3.0 only.

Comment 7 Graeme Gillies 2013-11-20 04:11:27 UTC
I haven't tried on Havana yet. I don't have ready access to a Havana environment right now, so it might take me a few weeks to get one up and try to reproduce.

Comment 8 Maru Newby 2013-11-25 16:43:47 UTC
I can confirm that this bug is exhibited upstream.  Attempting to boot 50 instances resulted in a hosts file that was missing 8 of the 50 IP addresses.

Comment 9 Maru Newby 2013-11-25 21:36:24 UTC
Updating to point to the proper upstream bug.

Comment 11 Maru Newby 2013-12-17 06:39:17 UTC
There is a patch under review upstream that should reduce the incidence of the problem and make it more visible (via logging) if it does occur:

https://review.openstack.org/#/c/61168/

A refactor of the dhcp agent to improve scalability is under way but not yet ready for review.

Comment 12 Maru Newby 2014-01-07 06:04:49 UTC
The refactoring effort has gained a blueprint: 

https://blueprints.launchpad.net/neutron/+spec/eventually-consistent-dhcp-agent

Comment 14 Roey Dekel 2014-02-11 14:28:35 UTC
The refactoring effort depends on bug #1046782

Comment 15 Maru Newby 2014-02-11 19:07:49 UTC
Upstream patch to send notifications regardless of agent status has merged to stable/havana and will appear in the next stable release: 

https://review.openstack.org/#/c/65590/

Comment 17 Roey Dekel 2014-02-23 14:01:56 UTC
Created attachment 866662 [details]
nova list after stress deletion

Comment 18 Roey Dekel 2014-02-23 14:09:53 UTC
Created attachment 866663 [details]
nova list moment after sending stress deletion cmd

Comment 19 Roey Dekel 2014-02-23 14:13:27 UTC
Created attachment 866678 [details]
nova compute log after stress deletion

Comment 20 Roey Dekel 2014-02-23 14:15:19 UTC
Tried to verify on Havana with:

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
puddle: 2014-02-21.1
openstack-neutron-2013.2.2-1.el6ost.noarch

Reproduce Steps:
----------------
1. Set up an environment with a tenant network (internal VLAN) and update quotas for:
    instances - 100
    cores - 100
    ports - 150
2. Boot a VM to verify the setup is working.
3. Boot 90 VMs in parallel:
    # for i in $(seq 1 90); do nova boot stress-${i} --flavor 1 --image cirros-0.3.1 --nic net-id=`neutron net-list | grep netInt | cut -d" " -f2` & done
4. Verify working VMs with non-identical IPs (see the check commands after these steps).
5. Delete 90 VMs in parallel:
    # for i in $(seq 1 90); do nova delete stress-${i} & done
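
For steps 4 and 5, a rough way to count instance states during and after the run (illustrative commands only):

    # nova list | grep -c "ACTIVE"
    # nova list | grep -c "ERROR"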

Expected Results:
-----------------
Environment returns to the same state it was in before the stress boot.

Results:
--------
26 VMs in ERROR state
3 VMs in ACTIVE state
(attachment 866662 [details])

Comments:
---------
1. Step 4 was verified with the following command, which showed 92 unique IPs (91 VMs + the DHCP port):
    # cat /var/lib/neutron/dhcp/cef6f793-0b5f-429f-9e4b-f9e4e70dbca3/host | cut -d"," -f3 | sort -n | uniq | wc -l
2. Attached is the nova list output from a moment after sending the deletion command (step 5): attachment 866663 [details].
3. Attached is the nova compute log, which indicates a problem with a deleted port (the port was probably deleted before the VM itself was): attachment 866678 [details].

Comment 21 Roey Dekel 2014-02-24 08:09:14 UTC
In advance of Comment 20.

Cleared the VMs that were not deleted and tried again, this time deleting sequentially.

Reproduce Steps:
----------------
1. Boot 90 VMs in parallel:
    # for i in $(seq 1 90); do nova boot stress-${i} --flavor 1 --image cirros-0.3.1 --nic net-id=`neutron net-list | grep netInt | cut -d" " -f2` & done
2. Delete 90 VMs sequentially:
    # for i in $(seq 1 90); do nova delete stress-${i} ; done

Expected Results:
-----------------
1. 90 new ACTIVE VMs with valid IPs.
2. All 90 VMs deleted.

Results:
--------
1. 3 VMs stuck in BUILD.
2. The 3 VMs in BUILD were not deleted;
   1 VM in ERROR;
   1 VM in ACTIVE (as if nothing had happened).

Comment 22 Roey Dekel 2014-02-24 08:10:30 UTC
On comment 21 I meant further to comment 20.

Comment 23 Maru Newby 2014-02-24 17:34:02 UTC
The fix in question is included in the latest release, but can only improve reliability as opposed to guaranteeing it, as per a previous comment:

"There is a patch under review upstream that should _reduce_ incidence of the problem and make it more visible (via logging) if it does occur."

The DHCP agent not keeping in sync is only one of a number of contributing factors that can result in a failed VM boot.  VM boot can also race port wiring, and the only way to fix this is to have Nova wait for Neutron to notify it that a port has become active.  This is under active development upstream:

https://review.openstack.org/#/c/75253/

So, in short: your testing is exhibiting known scalability failure modes that the upstream fix in question does not resolve, but which are the subject of continued upstream efforts.

Comment 24 Maru Newby 2014-02-27 06:43:57 UTC
Roey: The deletion errors you are reporting are very curious and should be investigated further.  However, it is unlikely that they are caused by the dnsmasq instances falling out of sync, since DHCP negotiation is not involved in VM deletion.  The only reason the bug subject mentions deletion is to ensure a consistently high rate of VM boot in the face of limited resources.

Can you please file a separate bug against Nova regarding the VM deletion failures?  Whether or not Neutron is contributing to the problem, handling of errors during VM deletion is Nova's responsibility.

Comment 25 lpeer 2014-03-06 10:56:54 UTC
After a discussion with Ofer, moving back to ON_QA: the original verification of this bug did not exercise the fix or the original problem that was reported.

Comment 28 Roey Dekel 2014-03-14 10:57:43 UTC
Created attachment 874323 [details]
nova list which shows bug is resolved

Comment 29 Roey Dekel 2014-03-14 11:00:38 UTC
Verified on Havana with:


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
puddle: 2014-02-28.3
openstack-neutron-2013.2.2-1.el6ost.noarch

Reproduce Steps:
----------------
1. Set up an environment with a tenant network (internal VLAN) and update quotas for:
    instances - 1024
    cores - 1024
    ports - 1024
2. Set up a new "mini" flavor with 64 MB RAM and no disk, and a minimal image (such as cirros-0.3.1).
3. Boot a VM to verify the setup is working.
4. Boot 128 VMs in parallel:
    # for i in $(seq 1 128); do nova boot stress-${i} --flavor 0 --image cirros-0.3.1 --nic net-id=`neutron net-list | grep netInt | cut -d" " -f2` & done
5. Verify working VMs with non-identical IPs.

Expected Results:
-----------------
128 working VMs, each with an IP on the network.

Results:
--------
91 working VMs
5 VMs with 2 IPs
32 VMs in ERROR state due to insufficient nova resources.
(attachment 874323 [details])

Comments:
---------
1. Environment status described below.

Comment 30 Roey Dekel 2014-03-14 11:11:48 UTC
Verified on Havana with:


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
puddle: 2014-02-28.3
openstack-neutron-2013.2.2-1.el6ost.noarch

Reproduce Steps:
----------------
1. Set up an environment with a tenant network (internal VLAN) and update quotas for:
    instances - 1024
    cores - 1024
    ports - 1024
2. Set up a new "mini" flavor with 64 MB RAM and no disk, and a minimal image (such as cirros-0.3.1).
3. Boot a VM to verify the setup is working.
4. Boot 128 VMs in parallel:
    # for i in $(seq 1 128); do nova boot stress-${i} --flavor 0 --image cirros-0.3.1 --nic net-id=`neutron net-list | grep netInt | cut -d" " -f2` & done
5. Verify working VMs with non-identical IPs.

Expected Results:
-----------------
128 working VMs, each with an IP on the network.

Results:
--------
91 working VMs
5 VMs with 2 IPs
32 VMs in ERROR state due to insufficient nova resources.
(attachment 874323 [details])

Comments:
---------
1. Environment status described below.
2. Bug for VMs getting 2 IPs: Launchpad 1292077
3. Bug for the uninformative nova log when VMs are launched into ERROR: Launchpad 1292090
4. This bug led to other bugs, but the original issue was fixed.

Environment Status:
-------------------
[root@cougar16 ~(keystone_admin)]# nova image-list
+--------------------------------------+--------------+--------+--------+
| ID                                   | Name         | Status | Server |
+--------------------------------------+--------------+--------+--------+
| fd375460-5380-4dda-a71b-4a616064126e | cirros-0.3.1 | ACTIVE |        |
+--------------------------------------+--------------+--------+--------+

[root@cougar16 ~(keystone_admin)]# nova flavor-list
+----+-----------+-----------+------+-----------+------+-------+-------------+-----------+
| ID | Name      | Memory_MB | Disk | Ephemeral | Swap | VCPUs | RXTX_Factor | Is_Public |
+----+-----------+-----------+------+-----------+------+-------+-------------+-----------+
| 0  | mini      | 64        | 0    | 0         |      | 1     | 1.0         | True      |
| 1  | m1.tiny   | 512       | 1    | 0         |      | 1     | 1.0         | True      |
| 2  | m1.small  | 2048      | 20   | 0         |      | 1     | 1.0         | True      |
| 3  | m1.medium | 4096      | 40   | 0         |      | 2     | 1.0         | True      |
| 4  | m1.large  | 8192      | 80   | 0         |      | 4     | 1.0         | True      |
| 5  | m1.xlarge | 16384     | 160  | 0         |      | 8     | 1.0         | True      |
+----+-----------+-----------+------+-----------+------+-------+-------------+-----------+

[root@cougar16 ~(keystone_admin)]# nova quota-show
+-----------------------------+-------+
| Quota                       | Limit |
+-----------------------------+-------+
| instances                   | 1024  |
| cores                       | 1024  |
| ram                         | 51200 |
| floating_ips                | 10    |
| fixed_ips                   | -1    |
| metadata_items              | 128   |
| injected_files              | 5     |
| injected_file_content_bytes | 10240 |
| injected_file_path_bytes    | 255   |
| key_pairs                   | 100   |
| security_groups             | 10    |
| security_group_rules        | 20    |
+-----------------------------+-------+

[root@cougar16 ~(keystone_admin)]# neutron quota-show
+---------------------+-------+
| Field               | Value |
+---------------------+-------+
| floatingip          | 50    |
| network             | 10    |
| port                | 1024  |
| router              | 10    |
| security_group      | 10    |
| security_group_rule | 100   |
| subnet              | 10    |
+---------------------+-------+

Comment 32 errata-xmlrpc 2014-03-25 19:23:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0334.html