Bug 1552545

Summary: OVN fails to update port to active, fails spawning VMs
Product: Red Hat OpenStack
Component: python-networking-ovn
Version: 13.0 (Queens)
Target Release: 13.0 (Queens)
Target Milestone: beta
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: AutomationBlocker, Triaged
Reporter: Eran Kuris <ekuris>
Assignee: Lucas Alvares Gomes <lmartins>
QA Contact: Eran Kuris <ekuris>
CC: abregman, amuller, apevec, bcafarel, dalvarez, jschluet, lhh, lmartins, majopela, nyechiel
Hardware: Unspecified
OS: Unspecified
Fixed In Version: python-networking-ovn-4.0.1-0.20180420150809.c7c16d4.el7ost
Last Closed: 2018-06-27 13:35:05 UTC
Type: Bug

Description Eran Kuris 2018-03-07 10:15:59 UTC
Description of problem:
1. neutron_tempest_plugin.scenario.test_security_groups.NetworkSecGroupTest.test_multiple_ports_secgroup_inheritance
2. neutron_tempest_plugin.scenario.test_security_groups.NetworkSecGroupTest.test_remote_group

The above tests failed with the following error:
AZ:availability_zone": "nova", "config_drive": "", "status": "ACTIVE", "updated": "2018-03-05T23:21:09Z", "hostId": "248b78ea19890439f433401639f2717b69fa20188da4a97001e9db00", "OS-SRV-USG:terminated_at": null, "key_name": "tempest-keypair-test-729596425", "name": "tempest-server-test-2070747699", "created": "2018-03-05T23:15:55Z", "tenant_id": "af49717e8c9946d9928a431aa1a2d3af", "os-extended-volumes:volumes_attached": [], "metadata": {}}}
2018-03-05 18:21:11,646 12141 INFO     [tempest.lib.common.rest_client] Request (NetworkSecGroupTest:_run_cleanups): 404 GET http://10.0.0.108:8774/v2.1/servers/cddaeaa2-9cf8-48a4-b30f-f6fc4c4926d3 0.076s
2018-03-05 18:21:11,646 12141 DEBUG    [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
    Response - Headers: {'status': '404', u'content-length': '111', 'content-location': 'http://10.0.0.108:8774/v2.1/servers/cddaeaa2-9cf8-48a4-b30f-f6fc4c4926d3', u'x-compute-request-id': 'req-8bb8f904-3c55-419f-bf9c-92dedc8e63d4', u'vary': 'OpenStack-API-Version,X-OpenStack-Nova-API-Version', u'server': 'Apache', u'openstack-api-version': 'compute 2.1', u'connection': 'close', u'x-openstack-nova-api-version': '2.1', u'date': 'Mon, 05 Mar 2018 23:21:11 GMT', u'content-type': 'application/json; charset=UTF-8', u'x-openstack-request-id': 'req-8bb8f904-3c55-419f-bf9c-92dedc8e63d4'}
        Body: {"itemNotFound": {"message": "Instance cddaeaa2-9cf8-48a4-b30f-f6fc4c4926d3 could not be found.", "code": 404}}

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/neutron_tempest_plugin/scenario/test_security_groups.py", line 284, in test_multiple_ports_secgroup_inheritance
    ports=ports)
  File "/usr/lib/python2.7/site-packages/neutron_tempest_plugin/scenario/test_security_groups.py", line 66, in create_vm_testing_sec_grp
    const.SERVER_STATUS_ACTIVE)
  File "/usr/lib/python2.7/site-packages/tempest/common/waiters.py", line 76, in wait_for_server_status
    server_id=server_id)
tempest.exceptions.BuildErrorException: Server f754f76d-7bb5-4ef0-9aea-96c1ab6aebf0 failed to build and is in ERROR status
Details: {u'message': u'Build of instance f754f76d-7bb5-4ef0-9aea-96c1ab6aebf0 aborted: Failed to allocate the network(s), not rescheduling.', u'code': 500, u'created': u'2018-03-05T23:21:07Z'}

Version-Release number of selected component (if applicable):
OSP13 - puddle  2018-03-02.2 

How reproducible:
always

Steps to Reproduce:
1. Run the CI job.

Actual results:
failed

Expected results:
pass

Additional info:
logs: 
https://drive.google.com/open?id=1oFVLWVojdnunb-P-Qg3nIl4PASJyNiPr

Comment 4 Eran Kuris 2018-04-17 09:13:36 UTC
cat /etc/yum.repos.d/latest-installed 
13   -p 2018-04-10.2

Manual steps to reproduce on OVN-HA with a DVR and VLAN configuration:

1. openstack network create net-64-2
2. openstack subnet create --subnet-range 10.0.2.0/24 --network net-64-2 --dhcp subnet_4_2
3. openstack network create net-64-1
4. openstack subnet create --subnet-range 10.0.1.0/24 --network net-64-1 --dhcp subnet_4_1
5. openstack router create Router_eNet
6. openstack router add subnet Router_eNet subnet_4_2
7. openstack router add subnet Router_eNet subnet_4_1
8. openstack router set --external-gateway nova Router_eNet

nova-compute then logs the following error:
2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [req-54f38f84-ef69-4e5c-832e-62a4bde2c526 dd50544ada704bd28354e42a36b9137d a3b8b186235a4d1cbffc9305765e9888 - default default] [instance: 84d16a42-8ab3-4b30-836b-c161d95ddd82] Instance failed to spawn: VirtualInterfaceCreateException: Virtual Interface creation failed
2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance: 84d16a42-8ab3-4b30-836b-c161d95ddd82] Traceback (most recent call last):
2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance: 84d16a42-8ab3-4b30-836b-c161d95ddd82]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2238, in _build_resources
2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance: 84d16a42-8ab3-4b30-836b-c161d95ddd82]     yield resources
2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance: 84d16a42-8ab3-4b30-836b-c161d95ddd82]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2022, in _build_and_run_instance
2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance: 84d16a42-8ab3-4b30-836b-c161d95ddd82]     block_device_info=block_device_info)
2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance: 84d16a42-8ab3-4b30-836b-c161d95ddd82]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 3066, in spawn
2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance: 84d16a42-8ab3-4b30-836b-c161d95ddd82]     destroy_disks_on_failure=True)
2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance: 84d16a42-8ab3-4b30-836b-c161d95ddd82]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5518, in _create_domain_and_network
2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance: 84d16a42-8ab3-4b30-836b-c161d95ddd82]     raise exception.VirtualInterfaceCreateException()
2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance: 84d16a42-8ab3-4b30-836b-c161d95ddd82] VirtualInterfaceCreateException: Virtual Interface creation failed

Comment 5 Lucas Alvares Gomes 2018-04-17 15:11:20 UTC
(In reply to Eran Kuris from comment #4)
> cat /etc/yum.repos.d/latest-installed 
> 13   -p 2018-04-10.2
> 
> Manual steps to reproduce on OVN-HA with DVR and VLAN configuration: 
> 
> 1.openstack network create net-64-2
> 2.openstack subnet create --subnet-range 10.0.2.0/24  --network net-64-2
> --dhcp subnet_4_2
> 3.openstack network create net-64-1
> 4.openstack subnet create --subnet-range 10.0.1.0/24  --network net-64-1
> --dhcp subnet_4_1
> 5.openstack router create Router_eNet
> 6.openstack router add subnet Router_eNet subnet_4_2
> 7.openstack router add subnet Router_eNet subnet_4_1
> 8.openstack router set --external-gateway nova Router_eNet
> 
> 
> in nova-compute there is error:
> 2018-04-17 07:16:13.354 1 ERROR nova.compute.manager
> [req-54f38f84-ef69-4e5c-832e-62a4bde2c526 dd50544ada704bd28354e42a36b9137d
> a3b8b186235a4d1cbffc9305765e9888 - default default] [instance:
> 84d16a42-8ab3-4b30-836b-c161d95ddd82] Instance failed to spawn:
> VirtualInterfaceCreateException: Virtual Interface creation failed
> 2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance:
> 84d16a42-8ab3-4b30-836b-c161d95ddd82] Traceback (most recent call last):
> 2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance:
> 84d16a42-8ab3-4b30-836b-c161d95ddd82]   File
> "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2238, in
> _build_resources
> 2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance:
> 84d16a42-8ab3-4b30-836b-c161d95ddd82]     yield resources
> 2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance:
> 84d16a42-8ab3-4b30-836b-c161d95ddd82]   File
> "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2022, in
> _build_and_run_instance
> 2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance:
> 84d16a42-8ab3-4b30-836b-c161d95ddd82]    
> block_device_info=block_device_info)
> 2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance:
> 84d16a42-8ab3-4b30-836b-c161d95ddd82]   File
> "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 3066,
> in spawn
> 2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance:
> 84d16a42-8ab3-4b30-836b-c161d95ddd82]     destroy_disks_on_failure=True)
> 2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance:
> 84d16a42-8ab3-4b30-836b-c161d95ddd82]   File
> "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5518,
> in _create_domain_and_network
> 2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance:
> 84d16a42-8ab3-4b30-836b-c161d95ddd82]     raise
> exception.VirtualInterfaceCreateException()
> 2018-04-17 07:16:13.354 1 ERROR nova.compute.manager [instance:
> 84d16a42-8ab3-4b30-836b-c161d95ddd82] VirtualInterfaceCreateException:
> Virtual Interface creation failed

Adding more to Eran's comment here.

I was investigating this problem in Eran's environment today and noticed that the VirtualInterfaceCreateException was raised because of a timeout while waiting for the Neutron port to become UP [0].

Tracing the Neutron logs to figure out why the port remained DOWN, I found this error on one of the controller nodes, referring to the port UUID of the instance I was tracing [1].

The IDL from the OVSDB monitor was raising [1], indicating that it did not hold a valid lock [2]. That error also prevented the set_port_state_up() method from being invoked [3], which would have marked the Neutron provisioning block as completed [4] and changed the port status to UP.

Note that only one controller received the notification to set the status of that specific port to UP: I grepped the other two controllers for the "OVN reports status up for port: <port uuid>" log message, but got no output.
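That per-controller check can be sketched as a self-contained snippet. The log file below is a synthetic stand-in created just for illustration (on a real controller you would grep the neutron server log, whose path depends on the deployment); only the "OVN reports status up for port:" message format comes from the logs quoted above.

```shell
# Synthetic stand-in for the neutron server log; only the message format
# is taken from the real log line quoted above.
port_uuid="84d16a42-8ab3-4b30-836b-c161d95ddd82"
log=$(mktemp)
echo "INFO OVN reports status up for port: ${port_uuid}" > "$log"

# The per-controller check: did this node ever report the port as UP?
if grep -q "OVN reports status up for port: ${port_uuid}" "$log"; then
    found=yes
else
    found=no
fi
echo "port UP notification: ${found}"
rm -f "$log"
```

On the affected environment, only one of the three controllers produced any match, which is what pointed at the lock handling below.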

This seems to be a problem with the locking in the Python code for OVSDB. This upstream patch might be relevant for this bug: https://review.openstack.org/#/c/560295/

[0] https://github.com/openstack/nova/blob/stable/queens/nova/virt/libvirt/driver.py#L5580-L5593

[1] http://paste.openstack.org/show/719365/

[2] An OVSDB lock is used to make sure only one instance of the OVSDB monitor is active in the entire cluster

[3] https://github.com/openstack/networking-ovn/blob/df880435d3a7e98aaa8a0ba3a0a9b81286129586/networking_ovn/ovsdb/ovsdb_monitor.py#L106

[4] https://github.com/openstack/networking-ovn/blob/df880435d3a7e98aaa8a0ba3a0a9b81286129586/networking_ovn/ml2/mech_driver.py#L707-L711
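The locking behaviour described above can be modeled with a short sketch. This is a simplified, hypothetical illustration, not the actual networking-ovn classes or API: the FakeIdl, FakeDriver, and OvnMonitor names are invented, and only the general mechanism (one lock holder reports port status back to Neutron) comes from the comments above.

```python
# Simplified, hypothetical sketch of the OVSDB-monitor locking described
# above -- class and method names are illustrative, not the real
# networking-ovn API. Only the monitor instance that holds the
# cluster-wide OVSDB lock may report port status back to Neutron; if a
# notification is processed without a valid lock, the event is dropped,
# the provisioning block never completes, and Nova times out waiting for
# the port to become UP.

class FakeIdl:
    """Stand-in for the OVSDB IDL connection; `has_lock` models whether
    this instance currently owns the named OVSDB lock."""
    def __init__(self, has_lock):
        self.has_lock = has_lock


class FakeDriver:
    """Stand-in for the mechanism driver that completes the Neutron
    provisioning block and flips the port status to UP."""
    def __init__(self):
        self.ports_up = []

    def set_port_status_up(self, port_id):
        self.ports_up.append(port_id)


class OvnMonitor:
    def __init__(self, idl, driver):
        self.idl = idl
        self.driver = driver

    def notify_port_up(self, port_id):
        if not self.idl.has_lock:
            # Bug scenario: the notification is dropped here, so the
            # driver is never told about the port and it stays DOWN.
            return False
        self.driver.set_port_status_up(port_id)
        return True


driver = FakeDriver()
# A controller without the lock drops the event:
assert OvnMonitor(FakeIdl(False), driver).notify_port_up("p1") is False
# The lock holder completes the provisioning step:
assert OvnMonitor(FakeIdl(True), driver).notify_port_up("p1") is True
print(driver.ports_up)  # -> ['p1']
```

The upstream fix referenced in comment 7 addresses the case where the lock state seen by the notification handler was stale, which in this toy model corresponds to `has_lock` being wrong at the time `notify_port_up()` runs.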

Comment 7 Lucas Alvares Gomes 2018-04-18 08:07:41 UTC
Hi Eran,

This patch should fix the problem and unblock you: https://review.openstack.org/#/c/561995

I will backport it to OSP 13.0 as soon as we get the fix merged! It was a tricky problem...

Comment 8 Jon Schlueter 2018-04-19 15:25:57 UTC
The patch is merged on master and has been proposed to stable/queens upstream.

Comment 17 errata-xmlrpc 2018-06-27 13:35:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086