Bug 1285879

Summary: Race condition puts ovs agent in resync
Product: Red Hat OpenStack Reporter: Brent Eagles <beagles>
Component: openstack-neutron Assignee: lpeer <lpeer>
Status: CLOSED DUPLICATE QA Contact: Ofer Blaut <oblaut>
Severity: high Docs Contact:
Priority: high    
Version: 7.0 (Kilo) CC: amuller, chrisw, nyechiel, yeylon
Target Milestone: ---   
Target Release: 8.0 (Liberty)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-12-17 02:58:39 UTC Type: ---

Description Brent Eagles 2015-11-26 20:25:22 UTC
Cloned from Launchpad bug 1499488.

Description:

The following code is from neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent.treat_devices_added_or_updated():

        devices_details_list = (
            self.plugin_rpc.get_devices_details_list_and_failed_devices(
                self.context,
                devices,
                self.agent_id,
                self.conf.host))
        if devices_details_list.get('failed_devices'):
            #TODO(rossella_s) handle better the resync in next patches,
            # this is just to preserve the current behavior
            raise DeviceListRetrievalError(devices=devices)

        devices = devices_details_list.get('devices')
        vif_by_id = self.int_br.get_vifs_by_ids(
            [vif['device'] for vif in devices])

The race window lies between the calls to get_devices_details_list_and_failed_devices() and get_vifs_by_ids().  If a VM is deleted in that window, its OVS port goes away and get_vifs_by_ids() raises an exception.  The exception propagates to the exception handler in rpc_loop, which puts the agent in resync, so the next rpc_loop iteration rescans ALL ports.  On a highly scaled system, this resync can take many minutes, during which all new plug requests time out.
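
To make the cost of that resync concrete, here is a minimal, self-contained sketch of the behaviour described above. It is not the agent's code: PortLookupError, process_ports() and run_iterations() are hypothetical names invented for illustration, and the loop is a heavily simplified stand-in for rpc_loop. It only shows how one port disappearing mid-iteration widens the next iteration from the changed ports to every known port.

```python
class PortLookupError(Exception):
    """Hypothetical error: a port vanished after it was listed."""


def process_ports(ports_to_process, existing_ports):
    """Process the given ports; fail if any of them no longer exists."""
    for port in ports_to_process:
        if port not in existing_ports:
            raise PortLookupError(port)
    return len(ports_to_process)


def run_iterations(all_ports, changed_per_iteration, deleted_mid_iteration):
    """Return how many ports each simulated loop iteration had to scan."""
    scanned = []
    resync = False
    existing = set(all_ports)
    for changed, deleted in zip(changed_per_iteration, deleted_mid_iteration):
        # After a failure, the work widens from the changed ports to ALL ports.
        work = sorted(existing) if resync else list(changed)
        resync = False
        scanned.append(len(work))
        # Simulate a VM being deleted between listing and lookup.
        existing -= set(deleted)
        try:
            process_ports(work, existing)
        except PortLookupError:
            resync = True
    return scanned


if __name__ == "__main__":
    ports = ["p%d" % i for i in range(1000)]
    print(run_iterations(ports, [["p1", "p2"], ["p3"], ["p4"]],
                         [["p2"], [], []]))
```

In this toy model, a single deletion during the first iteration makes the second iteration scan every remaining port instead of the one port that actually changed.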

get_vifs_by_ids() was added under this patch: https://review.openstack.org/#/c/186734/

The exception is raised for the missing port because the new get_vifs_by_ids() method does not pass if_exists=True in its call to get_ports_attributes().  A grep of that file shows that every other call to get_ports_attributes() passes if_exists=True.

I believe the fix is to simply start passing if_exists=True in get_vifs_by_ids.
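
As an illustration of why the flag matters, the sketch below models the difference between a strict lookup and an if_exists-style lookup. FakeBridge and lookup_ports() are hypothetical stand-ins written for this example, not neutron's bridge class or the real get_ports_attributes() signature. Only the name of the if_exists flag is taken from the report, and the assumption is that it causes missing ports to be skipped rather than reported as errors.

```python
class FakeBridge:
    """Hypothetical in-memory stand-in for an OVS bridge (illustration only)."""

    def __init__(self, ports):
        # Maps port name -> interface id.
        self.ports = dict(ports)

    def lookup_ports(self, names, if_exists=False):
        """Return attributes for the named ports.

        With if_exists=False a missing port raises RuntimeError.
        With if_exists=True a missing port is silently skipped.
        """
        results = []
        for name in names:
            if name not in self.ports:
                if if_exists:
                    continue
                raise RuntimeError("no such port: %s" % name)
            results.append({"name": name, "iface-id": self.ports[name]})
        return results


if __name__ == "__main__":
    bridge = FakeBridge({"tap1": "id-1", "tap2": "id-2"})
    wanted = ["tap1", "tap2"]
    del bridge.ports["tap2"]  # the VM is deleted before the lookup runs

    print(bridge.lookup_ports(wanted, if_exists=True))
    try:
        bridge.lookup_ports(wanted)
    except RuntimeError as exc:
        print("strict lookup failed:", exc)
```

With the tolerant lookup, the caller simply sees fewer results and can treat the deleted port as gone, rather than having the whole iteration abort.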

Specification URL (additional information):

https://bugs.launchpad.net/neutron/+bug/1499488

Comment 1 Assaf Muller 2015-12-17 02:58:39 UTC
Will be resolved via OSP 8 rebase before GA.

*** This bug has been marked as a duplicate of bug 1289994 ***