Bug 1312450

Summary: openvswitch-agent timing out on "ovs-vsctl set port" operations
Product: Red Hat OpenStack
Component: puppet-neutron
Version: 6.0 (Juno)
Target Release: 12.0 (Pike)
Reporter: Matt Flusche <mflusche>
Assignee: Brent Eagles <beagles>
QA Contact: nlevinki <nlevinki>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Keywords: Triaged, ZStream
Doc Type: Bug Fix
Type: Bug
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-11-16 17:36:06 UTC
CC: amuller, beagles, chrisw, ihrachys, jjoyce, jlibosva, jschluet, mflusche, nyechiel, rcernin, sclewis, scorcora, slinaber, srevivo, tvignaud, twilson

Description Matt Flusche 2016-02-26 17:29:10 UTC
Description of problem:
On servers with a relatively high number of neutron net namespaces, during restart we are seeing the following in openvswitch-agent.log:

Command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', 'set', 'Port', 'ha-f1ff1111-1f', 'tag=236']
Exit code: 242
Stdout: ''
Stderr: '2016-02-25T23:18:18Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n'

On a system containing 1022 namespaces, this error appeared 52 times during startup (each for a different port); however, the ports that triggered the error appear to have been configured correctly despite it.
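Exit code 242 corresponds to ovs-vsctl terminating itself on SIGALRM (signal 14) when its `--timeout` expires. Since the underlying OVSDB transaction can already have been committed by that point (which matches the observation that the ports end up configured correctly), one mitigation would be to treat that exit code as retriable. A minimal sketch of that pattern, assuming a hypothetical helper (this is not neutron's actual code):

```python
import subprocess

ALARM_EXIT = 242  # ovs-vsctl exits with 242 after its alarm (SIGALRM, signal 14) fires


def run_with_retry(cmd, attempts=3):
    """Run a command, retrying only when it exits with the ovs-vsctl
    alarm-timeout code. Returns the final exit code."""
    rc = ALARM_EXIT
    for _ in range(attempts):
        rc = subprocess.run(cmd, capture_output=True).returncode
        if rc != ALARM_EXIT:
            break  # success, or a real (non-timeout) failure
    return rc


# Example matching the log above: retry the tag assignment for the HA port.
# rc = run_with_retry(["ovs-vsctl", "--timeout=10",
#                      "set", "Port", "ha-f1ff1111-1f", "tag=236"])
```

Whether a retry is actually safe depends on the operation being idempotent; `set Port ... tag=N` is, but not every vsctl command would be.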

Version-Release number of selected component (if applicable):
openvswitch-2.4.0-1.el7.x86_64
openstack-neutron-openvswitch-2014.2.3-26.el7ost.rhbz1281583.noarch
python-neutron-2014.2.3-26.el7ost.rhbz1281583.noarch


How reproducible:
I believe the customer can reproduce this on demand.

Steps to Reproduce:
1. Restart the neutron server hosting the l3-agent, dhcp-agent, and lbaas-agent.

Actual results:
Exit code: 242
Stdout: ''
Stderr: '2016-02-25T23:18:18Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n'

Expected results:
No error

Additional info:
Despite the error, the OVS port does appear to get set correctly:


        Port "ha-f1ff1111-1f"
            tag: 236
            Interface "ha-f1ff1111-1f"
                type: internal

Comment 13 Terry Wilson 2016-04-13 19:56:58 UTC
> I think we want to increase the default timeout so other customers will not be affected by this. What do you think Terry?

I'd be ok with increasing the timeout, though 10 seconds already seems like a long time for an ovs-vsctl execution. I'd like to see some profiling to find out what is taking so long.

Terry

Comment 14 Assaf Muller 2017-02-13 21:59:14 UTC
(In reply to Terry Wilson from comment #13)
> > I think we want to increase the default timeout so other customers will not be affected by this. What do you think Terry?
> 
> I'd be ok with increasing the timeout; though 10 seconds seems like a long
> time for an ovs-vsctl execution. I'd like to see some profiling to see what
> is taking so long.

This came up again in https://access.redhat.com/support/cases/#/case/01787838.

I suspect that the original default of 10 was not chosen via scientific methods, and that 20 would be equally arbitrary, but it would let our customers avoid these types of issues. I'd suggest we either:

1) Bump the default significantly.
2) Stick with 10 but improve the error handling so that, when we do hit a timeout, DEBUG-level logs dump profiling and hypervisor CPU consumption data to help us understand why we hit it.

Obviously (1) is significantly more feasible :)
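Option (2) could look something like the following sketch: wrap the vsctl call and, when the alarm-timeout exit code comes back, log a snapshot of host load at DEBUG level. Helper names here are hypothetical, and a load average is only a rough stand-in for real profiling data:

```python
import logging
import os
import shlex
import subprocess
import time

LOG = logging.getLogger(__name__)
ALARM_EXIT = 242  # ovs-vsctl exits 242 when its internal alarm (SIGALRM) fires


def host_load_snapshot():
    """Collect a small diagnostic snapshot to log when a timeout occurs."""
    return {
        "timestamp": time.time(),
        "loadavg": os.getloadavg(),  # 1/5/15-minute load averages
        "cpu_count": os.cpu_count(),
    }


def run_vsctl(args, timeout=10):
    """Run ovs-vsctl, logging host state at DEBUG level on a timeout."""
    cmd = ["ovs-vsctl", f"--timeout={timeout}"] + list(args)
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    if proc.returncode == ALARM_EXIT:
        LOG.debug("ovs-vsctl timed out after %.1fs: %s; host state: %s",
                  elapsed, shlex.join(cmd), host_load_snapshot())
    return proc.returncode
```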


Comment 15 Ihar Hrachyshka 2017-11-07 20:12:59 UTC
BTW, I proposed renaming the option: https://review.openstack.org/#/c/518391/ so any fix applied here should account for the new name.

Comment 16 Ihar Hrachyshka 2017-11-07 20:15:03 UTC
What's the expected fix here? A neutron bump? A bump via tripleo/puppet? No bump but better error handling? Nothing at all?

Comment 17 Brent Eagles 2017-11-16 17:28:06 UTC
This keeps falling to the back of my queue. It should be a simple fix to puppet-neutron; I'll take a look.

Comment 18 Jakub Libosvar 2017-11-16 17:36:06 UTC
We've switched to the native OVSDB interface. If you still experience timeout issues in a newer OSP release, please feel free to re-open this bug.
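The "native OVSDB interface" means the agent speaks the OVSDB JSON-RPC protocol directly (via the python-ovs library) instead of forking an ovs-vsctl process for every operation, which removes the per-call fork/exec overhead behind these timeouts. A hedged sketch of the relevant openvswitch_agent.ini settings follows; option names are from Pike-era neutron (note comment 15's rename of ovs_vsctl_timeout to ovsdb_timeout), so verify them against your release:

```ini
[ovs]
# Talk to ovsdb-server directly over JSON-RPC instead of exec'ing ovs-vsctl
ovsdb_interface = native
# OVSDB connection endpoint (default shown)
ovsdb_connection = tcp:127.0.0.1:6640
# Timeout for OVSDB commands, in seconds (formerly ovs_vsctl_timeout)
ovsdb_timeout = 10
```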