Bug 1348998

Summary:	Removing network fails with "internal server error" in HA environment
Product:	Red Hat OpenStack	Reporter:	Arie Bregman <abregman>
Component:	openstack-neutron	Assignee:	Hynek Mlnarik <hmlnarik>
Status:	CLOSED NOTABUG	QA Contact:	Toni Freger <tfreger>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	7.0 (Kilo)	CC:	abregman, amuller, chrisw, nyechiel, srevivo
Target Milestone:	async	Keywords:	AutomationBlocker, ZStream
Target Release:	7.0 (Kilo)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: Deleting a subnet while DHCP ports are being automatically allocated on it causes an exception, due to either a database deadlock or concurrently updated database rows. Consequence: Deletion of the subnet fails, and one of the following exceptions is logged to the neutron server log: * StaleDataError: UPDATE statement on table 'ports' expected to update 1 row(s); 0 were matched. * DBDeadlock: Deadlock found when trying to get lock; try restarting transaction Fix: The transaction is restarted when this condition is encountered. Result: The subnet is deleted correctly.	Story Points:	---
Clone Of:
Clones:	1351101 (view as bug list)		Environment:
Last Closed:	2016-09-12 14:59:52 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1351101

Description Arie Bregman 2016-06-22 13:11:54 UTC

Description of problem:

When running one of the following tempest tests:

tempest.api.network.test_networks.NetworksTest.test_create_delete_subnet_with_gw
tempest.api.network.test_networks.NetworksIpV6Test.test_update_subnet_gw_dns_host_routes_dhcp
tempest.api.network.test_networks.NetworksIpV6TestAttrs.test_update_subnet_gw_dns_host_routes_dhcp
tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario
tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_router_rescheduling

Each test fails when it reaches this call in the code: 'self.networks_client.delete_network(net_id)'

From neutron server logs:
---------------------------
2016-06-22 12:14:26.597 3414 TRACE oslo_messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 920, in _non_result
2016-06-22 12:14:26.597 3414 TRACE oslo_messaging.rpc.dispatcher     "This result object does not return rows. "
2016-06-22 12:14:26.597 3414 TRACE oslo_messaging.rpc.dispatcher ResourceClosedError: This result object does not return rows. It has been closed automatically.
----------------------------
And:

2016-06-22 12:14:48.009 3416 TRACE oslo_messaging.rpc.dispatcher     flush_context.execute()
2016-06-22 12:14:48.009 3416 TRACE oslo_messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 373, in execute
2016-06-22 12:14:48.009 3416 TRACE oslo_messaging.rpc.dispatcher     rec.execute(self)
2016-06-22 12:14:48.009 3416 TRACE oslo_messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 532, in execute
2016-06-22 12:14:48.009 3416 TRACE oslo_messaging.rpc.dispatcher     uow
2016-06-22 12:14:48.009 3416 TRACE oslo_messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 170, in save_obj
2016-06-22 12:14:48.009 3416 TRACE oslo_messaging.rpc.dispatcher     mapper, table, update)
2016-06-22 12:14:48.009 3416 TRACE oslo_messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 712, in _emit_update_statements
2016-06-22 12:14:48.009 3416 TRACE oslo_messaging.rpc.dispatcher     (table.description, len(records), rows))
2016-06-22 12:14:48.009 3416 TRACE oslo_messaging.rpc.dispatcher StaleDataError: UPDATE statement on table 'ports' expected to update 1 row(s); 0 were matched.
-------------------------------------------------------------------------

Version-Release number of selected component (if applicable):

openstack-neutron-ml2-2015.1.2-14.el7ost.noarch
openstack-neutron-lbaas-2015.1.2-2.el7ost.noarch
python-neutron-2015.1.2-14.el7ost.noarch
openstack-neutron-bigswitch-lldp-2015.1.38-1.el7ost.noarch
openstack-neutron-openvswitch-2015.1.2-14.el7ost.noarch
openstack-neutron-metering-agent-2015.1.2-14.el7ost.noarch
openstack-neutron-2015.1.2-14.el7ost.noarch
python-neutron-lbaas-2015.1.2-2.el7ost.noarch
python-neutronclient-2.4.0-2.el7ost.noarch
openstack-neutron-common-2015.1.2-14.el7ost.noarch

How reproducible: 100%


Steps to Reproduce:
1. Install RHOSP 7 (latest puddle) on HA setup (3 controllers, 2 compute)
2. Run the above tempest tests

Actual results: Tests are failing


Expected results: All tests passed successfully


Additional info: Below is a link to the CI job which identified the errors

Comment 2 Hynek Mlnarik 2016-06-24 11:31:01 UTC

The primary cause of the issue is that DHCP ports are created while processing subnet_delete [1]. The bug was fixed in Liberty by retrying the operation [2].

[1] https://bugs.launchpad.net/neutron/+bug/1357055/comments/36
[2] https://review.openstack.org/#/c/171848/
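
For illustration only, here is a minimal sketch of the "retry the operation" idea. It is not the code from [2]. The two exception classes are local stand-ins for the real oslo.db and SQLAlchemy exceptions so that the snippet runs on its own, and retry_transaction and delete_subnet are hypothetical names.

```python
import functools
import time


class DBDeadlock(Exception):
    """Local stand-in for oslo_db.exception.DBDeadlock (assumption)."""


class StaleDataError(Exception):
    """Local stand-in for sqlalchemy.orm.exc.StaleDataError (assumption)."""


# Errors that indicate a lost race rather than a real failure.
RETRIABLE = (DBDeadlock, StaleDataError)


def retry_transaction(max_attempts=3, delay=0.0):
    """Re-run the wrapped callable when a retriable DB error is raised."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except RETRIABLE:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the original error
                    time.sleep(delay)
        return wrapper
    return decorator


# Hypothetical usage: the first two attempts lose the race, the third succeeds.
calls = {"count": 0}


@retry_transaction(max_attempts=3)
def delete_subnet(subnet_id):
    calls["count"] += 1
    if calls["count"] < 3:
        raise StaleDataError("UPDATE statement on table 'ports' "
                             "expected to update 1 row(s); 0 were matched.")
    return "deleted %s" % subnet_id


result = delete_subnet("subnet-1")
```

The whole operation is re-run from the start, so the wrapped function has to be safe to execute more than once.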

Comment 4 Hynek Mlnarik 2016-06-24 12:41:40 UTC

The bug is also triggered by a database deadlock, which is resolved in [1]. I am including the relevant commit.

[1] https://review.openstack.org/#/c/191540/

Comment 10 Arie Bregman 2016-07-18 07:09:53 UTC

3 controllers

Comment 15 Arie Bregman 2016-07-21 08:17:15 UTC

The issue hasn't been resolved.
It works for 2 controllers, but not for 3. We should identify why there is an internal server error when using 3 controllers.

Comment 16 Hynek Mlnarik 2016-07-21 08:48:24 UTC

(In reply to Arie Bregman from comment #15)
> The issue hasn't been resolved.
> It works for 2 controllers, but not for 3. We should identify why there is
> an internal server error when using 3 controllers.

See comments 2 and 3. The reason is a race between DHCP port creation and subnet deletion, caused by the tempest tests, combined with IP address pool starvation. The proper fix is to correct the tempest tests or to adjust the number of controller nodes to match the number of available IP addresses.
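
A rough sketch of the pool arithmetic behind the starvation argument, assuming one DHCP port per controller on each network. The addresses and the helper names pool_size and dhcp_ports_fit are made up for illustration and are not taken from the failing tests.

```python
import ipaddress


def pool_size(first_ip, last_ip):
    """Number of addresses in an inclusive allocation pool."""
    first = ipaddress.ip_address(first_ip)
    last = ipaddress.ip_address(last_ip)
    return int(last) - int(first) + 1


def dhcp_ports_fit(first_ip, last_ip, dhcp_ports, other_ports=0):
    """True if the pool can hold every DHCP port plus any other ports."""
    return pool_size(first_ip, last_ip) >= dhcp_ports + other_ports


# Hypothetical two-address pool: enough for 2 DHCP ports, too small for 3.
fits_with_two = dhcp_ports_fit("10.0.0.2", "10.0.0.3", dhcp_ports=2)
fits_with_three = dhcp_ports_fit("10.0.0.2", "10.0.0.3", dhcp_ports=3)
```

Under these assumptions, a pool that is just large enough for two controllers runs out of addresses when a third one tries to allocate its DHCP port.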