Bug 1023818

| Field | Value |
|---|---|
| Summary | Neutron DHCP agent failing to keep dnsmasq in sync when lots of instances are created/deleted at a time |
| Product | Red Hat OpenStack |
| Reporter | Graeme Gillies <ggillies> |
| Component | openstack-neutron |
| Assignee | Maru Newby <mnewby> |
| Status | CLOSED ERRATA |
| QA Contact | Roey Dekel <rdekel> |
| Severity | high |
| Priority | medium |
| Version | 3.0 |
| Target Milestone | z3 |
| Target Release | 4.0 |
| Keywords | Reopened, ZStream |
| CC | ahoness, chrisw, ctrianta, danken, dmaley, ggillies, lpeer, mlopes, mwagner, nyechiel, oblaut, pep, rdekel, sclewis, sputhenp, tvvcox, twilson, yeylon |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | openstack-neutron-2013.2.2-1.el6ost |
| Doc Type | Bug Fix |
| Clones | 1074344 (view as bug list) |
| Bug Blocks | 1066642, 1074344 |
| Type | Bug |
| Last Closed | 2014-03-25 19:23:11 UTC |

Doc Text:

Networking previously failed to process agent heartbeats in a timely fashion when overloaded, causing agents to be considered in a 'down' state. Because Networking only sent network changes to a DHCP agent that was 'up', an overloaded service would fail to send notifications to the agent. As a result, the DHCP agent was unable to serve requests for ports it had not been notified about.

With this fix, Networking always sends DHCP notifications, even if the target agent is considered 'down'. In addition, an error message is logged if a DHCP notification cannot be sent because no DHCP agents are available. DHCP agents are now always notified of network changes, and can eventually respond to DHCP requests for configured ports.
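As a rough illustration of the behavior change described in the Doc Text, here is a minimal sketch with hypothetical names (`notify_dhcp_agents`, `cast_to_agent`, the agent attributes) rather than the actual Neutron code: before the fix, notifications to agents reported 'down' were dropped; after it, they are always sent, and the no-agents case is logged as an error.

```python
import logging

LOG = logging.getLogger(__name__)


def notify_dhcp_agents(network_id, agents, cast_to_agent):
    """Sketch of the fixed notification path (hypothetical names).

    `agents` is assumed to be a list of objects with `.host` and
    `.is_active`; `cast_to_agent(host, payload)` stands in for the
    RPC cast to a specific agent.
    """
    if not agents:
        # Added by the fix: log an error instead of silently dropping
        # the notification when no DHCP agent is scheduled.
        LOG.error("Unable to send DHCP notification for network %s: "
                  "no DHCP agents available", network_id)
        return
    for agent in agents:
        # Before the fix, agents considered 'down' were skipped here, so
        # an agent whose heartbeats lagged under load never heard about
        # new ports. The fix sends the notification regardless; the agent
        # processes it from its queue once it catches up.
        if not agent.is_active:
            LOG.warning("DHCP agent %s appears down; notifying anyway",
                        agent.host)
        cast_to_agent(agent.host, {"network_id": network_id})
```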
Description
Graeme Gillies
2013-10-28 01:51:40 UTC
The fix mentioned in the above bug has made it in via normal rebasing off of upstream releases. Any chance you could try out 2013.1.4 and see if this is still an issue?

I am running openstack-quantum-2013.1.4-3.el6ost.noarch and haven't seen the problem yet (it has only been a few days). I'll give it another week or so and see if the problem happens again.

Thanks Graeme. I'm closing the bug for now; if you encounter the issue again, please reopen.

Hi,

We are hitting the problem again; see the dnsmasq hosts file below (note that address 192.168.0.133 appears twice, under two different MACs):

fa:16:3e:e7:8d:2b,host-192-168-0-126.openstacklocal,192.168.0.126
fa:16:3e:b9:09:5a,host-192-168-0-21.openstacklocal,192.168.0.21
fa:16:3e:ab:55:d6,host-192-168-0-121.openstacklocal,192.168.0.121
fa:16:3e:32:53:d1,host-192-168-0-127.openstacklocal,192.168.0.127
fa:16:3e:b4:e6:8a,host-192-168-0-26.openstacklocal,192.168.0.26
fa:16:3e:1a:2f:79,host-192-168-0-143.openstacklocal,192.168.0.143
fa:16:3e:9e:2a:98,host-192-168-0-3.openstacklocal,192.168.0.3
fa:16:3e:01:2f:27,host-192-168-0-23.openstacklocal,192.168.0.23
fa:16:3e:c5:aa:af,host-192-168-0-139.openstacklocal,192.168.0.139
fa:16:3e:a9:80:67,host-192-168-0-135.openstacklocal,192.168.0.135
fa:16:3e:e9:09:da,host-192-168-0-131.openstacklocal,192.168.0.131
fa:16:3e:80:67:f4,host-192-168-0-119.openstacklocal,192.168.0.119
fa:16:3e:d6:6a:0e,host-192-168-0-125.openstacklocal,192.168.0.125
fa:16:3e:41:47:43,host-192-168-0-1.openstacklocal,192.168.0.1
fa:16:3e:00:3c:04,host-192-168-0-130.openstacklocal,192.168.0.130
fa:16:3e:d9:d9:d1,host-192-168-0-138.openstacklocal,192.168.0.138
fa:16:3e:b7:ce:1f,host-192-168-0-25.openstacklocal,192.168.0.25
fa:16:3e:86:8e:c3,host-192-168-0-22.openstacklocal,192.168.0.22
fa:16:3e:87:25:b5,host-192-168-0-136.openstacklocal,192.168.0.136
fa:16:3e:ea:e9:1b,host-192-168-0-2.openstacklocal,192.168.0.2
fa:16:3e:9c:2b:5d,host-192-168-0-137.openstacklocal,192.168.0.137
fa:16:3e:92:9f:17,host-192-168-0-124.openstacklocal,192.168.0.124
fa:16:3e:98:ff:51,host-192-168-0-134.openstacklocal,192.168.0.134
fa:16:3e:0c:3d:bb,host-192-168-0-140.openstacklocal,192.168.0.140
fa:16:3e:13:b4:80,host-192-168-0-133.openstacklocal,192.168.0.133
fa:16:3e:b6:84:ca,host-192-168-0-141.openstacklocal,192.168.0.141
fa:16:3e:58:b2:c4,host-192-168-0-145.openstacklocal,192.168.0.145
fa:16:3e:62:ad:d1,host-192-168-0-146.openstacklocal,192.168.0.146
fa:16:3e:2d:7f:be,host-192-168-0-147.openstacklocal,192.168.0.147
fa:16:3e:13:d8:21,host-192-168-0-149.openstacklocal,192.168.0.149
fa:16:3e:28:73:07,host-192-168-0-151.openstacklocal,192.168.0.151
fa:16:3e:b6:bb:21,host-192-168-0-152.openstacklocal,192.168.0.152
fa:16:3e:6f:2c:aa,host-192-168-0-153.openstacklocal,192.168.0.153
fa:16:3e:08:5f:7e,host-192-168-0-154.openstacklocal,192.168.0.154
fa:16:3e:cf:85:df,host-192-168-0-128.openstacklocal,192.168.0.128
fa:16:3e:2b:44:b8,host-192-168-0-129.openstacklocal,192.168.0.129
fa:16:3e:bd:05:5b,host-192-168-0-132.openstacklocal,192.168.0.132
fa:16:3e:03:39:dd,host-192-168-0-133.openstacklocal,192.168.0.133

Regards,

Graeme

Can you reproduce on Havana, or only on Grizzly? I need to know whether we need to target an upstream fix to get into 4.0, or whether the problem is 3.0 only.

I haven't tried on Havana yet. I don't have ready access to a Havana environment right now, so it might take me a few weeks to get one up and try to reproduce.

I can confirm that this bug is exhibited upstream. Attempting to boot 50 instances resulted in a hosts file that was missing 8 of the 50 IP addresses. Updating to point to the proper upstream bug.
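The duplicated 192.168.0.133 entry above can be spotted mechanically. A minimal sketch (the path placeholder is illustrative, not from this report) that scans a dnsmasq hosts file of the format shown above (MAC,hostname,IP) for IPs assigned to more than one MAC:

```python
from collections import defaultdict


def find_duplicate_ips(path):
    """Return {ip: set_of_macs} for IPs listed under more than one MAC."""
    macs_by_ip = defaultdict(set)
    with open(path) as hosts_file:
        for line in hosts_file:
            fields = line.strip().split(",")
            if len(fields) >= 3:
                # fields are MAC, hostname, IP
                macs_by_ip[fields[2]].add(fields[0])
    return {ip: macs for ip, macs in macs_by_ip.items() if len(macs) > 1}


if __name__ == "__main__":
    # The path is illustrative; substitute the real network UUID.
    dupes = find_duplicate_ips("/var/lib/neutron/dhcp/<network-uuid>/host")
    for ip, macs in sorted(dupes.items()):
        print("%s assigned to %d MACs: %s"
              % (ip, len(macs), ", ".join(sorted(macs))))
```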
There is a patch under review upstream that should reduce incidence of the problem and make it more visible (via logging) if it does occur: https://review.openstack.org/#/c/61168/

A refactor of the DHCP agent to improve scalability is under way but not yet ready for review. The refactoring effort has gained a blueprint: https://blueprints.launchpad.net/neutron/+spec/eventually-consistent-dhcp-agent

The refactoring effort depends on bug #1046782.

Upstream patch to send notifications regardless of agent status has merged to stable/havana and will appear in the next stable release: https://review.openstack.org/#/c/65590/

Created attachment 866662 [details]
nova list after stress deletion
Created attachment 866663 [details]
nova list moment after sending stress deletion cmd
Created attachment 866678 [details]
nova compute log after stress deletion
Tried to verify on Havana with:

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
puddle: 2014-02-21.1
openstack-neutron-2013.2.2-1.el6ost.noarch

Reproduce Steps:
----------------
1. Set up environment with tenant network (internal VLAN) and update quotas for:
instances - 100
cores - 100
ports - 150
2. Boot a VM to verify the setup works.
3. Boot 90 VMs in parallel:
# for i in $(seq 1 90); do nova boot stress-${i} --flavor 1 --image cirros-0.3.1 --nic net-id=`neutron net-list | grep netInt | cut -d" " -f2` & done
4. Verify working VMs have non-identical IPs.
5. Delete the 90 VMs in parallel:
# for i in $(seq 1 90); do nova delete stress-${i} & done

Expected Results:
-----------------
Environment returns to the same state as before the stress boot.

Results:
--------
26 VMs in ERROR
3 VMs in ACTIVE (attachment 866662 [details])

Comments:
---------
1. Step 4 was verified with the following command, which showed 92 unique IPs (91 VMs + DHCP):
# cat /var/lib/neutron/dhcp/cef6f793-0b5f-429f-9e4b-f9e4e70dbca3/host | cut -d"," -f3 | sort -n | uniq | wc -l
2. Attached is the instance list from a moment after sending the deletion command (step 5) - attachment 866663 [details].
3. Attached is the nova compute log, which indicates a problem with a deleted port - the port was probably deleted before the VM was - attachment 866678 [details].

Further to comment 20: cleared the VMs that were not deleted and tried again, deleting sequentially this time.

Reproduce Steps:
----------------
1. Boot 90 VMs in parallel:
# for i in $(seq 1 90); do nova boot stress-${i} --flavor 1 --image cirros-0.3.1 --nic net-id=`neutron net-list | grep netInt | cut -d" " -f2` & done
2. Delete the 90 VMs sequentially:
# for i in $(seq 1 90); do nova delete stress-${i} ; done

Expected Results:
-----------------
1. 90 new ACTIVE VMs with valid IPs.
2. All 90 VMs deleted.

Results:
--------
1. 3 VMs stuck in BUILD.
2. Of the VMs that were not deleted: 3 in BUILD, 1 in ERROR, 1 in ACTIVE (as if nothing had happened).

The fix in question is included in the latest release, but it can only improve reliability rather than guarantee it, as per a previous comment: "There is a patch under review upstream that should _reduce_ incidence of the problem and make it more visible (via logging) if it does occur."

The DHCP agent not keeping in sync is only one of a number of contributing factors that can result in a failed VM boot. VM boot can also race port wiring, and the only way to fix this is to have Nova wait for Neutron to notify it that a port has become active. This is under active development upstream: https://review.openstack.org/#/c/75253/

So, short answer: your testing is exhibiting known scalability failure modes that are known not to be resolved by the upstream fix in question, but they are the subject of continued upstream efforts.

Roey: The deletion errors you are reporting are very curious and should be investigated further. However, it is unlikely that they are caused by the dnsmasq instances falling out of sync, since DHCP negotiation is not involved in VM deletion. The only reason the bug subject mentions deletion is to ensure a consistently high rate of VM boots in the face of limited resources. Can you please file a separate bug against Nova regarding VM deletion failing? Whether Neutron is contributing to the problem or not, handling of errors during VM deletion is Nova's responsibility.
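To make the port-wiring race described above concrete, here is a minimal polling sketch with hypothetical names (`get_port_status`, `wait_for_port_active`); note the upstream work referenced above replaces this kind of polling with an event sent from Neutron to Nova when the port becomes active.

```python
import time


class PortWiringTimeout(Exception):
    pass


def wait_for_port_active(get_port_status, port_id, timeout=60, interval=2):
    """Poll until the port reports ACTIVE or the timeout expires.

    `get_port_status(port_id)` is a hypothetical callable wrapping a
    `neutron port-show <port-id>` lookup.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_port_status(port_id) == "ACTIVE":
            return
        time.sleep(interval)
    raise PortWiringTimeout("port %s not ACTIVE after %s seconds"
                            % (port_id, timeout))
```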
After a discussion with Ofer, moving back to ON_QA: the original verification of this bug did not exercise the fix or the originally reported problem.

Created attachment 874323 [details]
nova list which shows bug is resolved
Verified on Havana with:
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
puddle: 2014-02-28.3
openstack-neutron-2013.2.2-1.el6ost.noarch
Reproduce Steps:
----------------
1. Set up environment with tenant network (internal VLAN) and update quotas for:
instances - 1024
cores - 1024
ports - 1024
2. Set up a new mini flavor with 64MB RAM and no disk, and a mini image (such as cirros-0.3.1).
3. Boot a VM to verify the setup works.
4. Boot 128 VMs in parallel:
# for i in $(seq 1 128); do nova boot stress-${i} --flavor 0 --image cirros-0.3.1 --nic net-id=`neutron net-list | grep netInt | cut -d" " -f2` & done
5. Verify working VMs have non-identical IPs.
Expected Results:
-----------------
128 working VMs, each with an IP on the network.
Results:
--------
91 working VMs
5 VMs with 2 IPs each
32 VMs in ERROR state due to insufficient Nova resources.
(attachment 874323 [details])
Comments:
---------
1. Environment status described below.
2. Bug for VMs receiving 2 IPs: Launchpad 1292077
3. Bug for uninformative Nova log (VMs launched into ERROR): Launchpad 1292090
4. This bug led to other bugs, but the original issue was fixed.
Environment Status:
-------------------
[root@cougar16 ~(keystone_admin)]# nova image-list
+--------------------------------------+--------------+--------+--------+
| ID | Name | Status | Server |
+--------------------------------------+--------------+--------+--------+
| fd375460-5380-4dda-a71b-4a616064126e | cirros-0.3.1 | ACTIVE | |
+--------------------------------------+--------------+--------+--------+
[root@cougar16 ~(keystone_admin)]# nova flavor-list
+----+-----------+-----------+------+-----------+------+-------+-------------+-----------+
| ID | Name | Memory_MB | Disk | Ephemeral | Swap | VCPUs | RXTX_Factor | Is_Public |
+----+-----------+-----------+------+-----------+------+-------+-------------+-----------+
| 0 | mini | 64 | 0 | 0 | | 1 | 1.0 | True |
| 1 | m1.tiny | 512 | 1 | 0 | | 1 | 1.0 | True |
| 2 | m1.small | 2048 | 20 | 0 | | 1 | 1.0 | True |
| 3 | m1.medium | 4096 | 40 | 0 | | 2 | 1.0 | True |
| 4 | m1.large | 8192 | 80 | 0 | | 4 | 1.0 | True |
| 5 | m1.xlarge | 16384 | 160 | 0 | | 8 | 1.0 | True |
+----+-----------+-----------+------+-----------+------+-------+-------------+-----------+
[root@cougar16 ~(keystone_admin)]# nova quota-show
+-----------------------------+-------+
| Quota | Limit |
+-----------------------------+-------+
| instances | 1024 |
| cores | 1024 |
| ram | 51200 |
| floating_ips | 10 |
| fixed_ips | -1 |
| metadata_items | 128 |
| injected_files | 5 |
| injected_file_content_bytes | 10240 |
| injected_file_path_bytes | 255 |
| key_pairs | 100 |
| security_groups | 10 |
| security_group_rules | 20 |
+-----------------------------+-------+
[root@cougar16 ~(keystone_admin)]# neutron quota-show
+---------------------+-------+
| Field | Value |
+---------------------+-------+
| floatingip | 50 |
| network | 10 |
| port | 1024 |
| router | 10 |
| security_group | 10 |
| security_group_rule | 100 |
| subnet | 10 |
+---------------------+-------+
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0334.html