Bug 1312450 - openvswitch-agent timing out on "ovs-vsctl set port" operations
Summary: openvswitch-agent timing out on "ovs-vsctl set port" operations
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-neutron
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 12.0 (Pike)
Assignee: Brent Eagles
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-02-26 17:29 UTC by Matt Flusche
Modified: 2017-11-16 17:36 UTC
CC List: 16 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-16 17:36:06 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 425623 0 None ABANDONED Increase vsctl/ovsdb timeout 2020-09-02 20:53:15 UTC
Red Hat Knowledge Base (Solution) 2939941 0 None None None 2017-02-22 10:05:24 UTC

Description Matt Flusche 2016-02-26 17:29:10 UTC
Description of problem:
On servers with a relatively large number of neutron network namespaces, we see the following in openvswitch-agent.log during agent restart:

Command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', 'set', 'Port', 'ha-f1ff1111-1f', 'tag=236']
Exit code: 242
Stdout: ''
Stderr: '2016-02-25T23:18:18Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n'

On a system containing 1022 namespaces, this error was seen 52 times during startup (for different ports); however, the ports that triggered the error appear to have been configured correctly anyway.
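The failure above is ovs-vsctl's own --timeout alarm firing (the "Alarm clock" / signal 14 in the log), not the operation itself failing. The deadline semantics can be sketched with Python's subprocess timeout on a stand-in command; this is illustrative only, not neutron code, and `sleep` stands in for a slow ovsdb transaction:

```python
import subprocess

def run_with_deadline(cmd, deadline):
    """Run cmd, giving up after `deadline` seconds, roughly as ovs-vsctl
    does internally when invoked with --timeout (on expiry it terminates
    on SIGALRM, which appears as 'Alarm clock' in the agent log)."""
    try:
        subprocess.run(cmd, timeout=deadline, check=True)
        return "ok"
    except subprocess.TimeoutExpired:
        return "timed out"

print(run_with_deadline(["sleep", "0.1"], 5))    # fast call: ok
print(run_with_deadline(["sleep", "5"], 0.5))    # slow call: timed out
```

Note that, just as in this sketch, the deadline only bounds how long the caller waits; the underlying ovsdb change may still land, which matches the observation that the ports end up configured correctly.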

Version-Release number of selected component (if applicable):
openvswitch-2.4.0-1.el7.x86_64
openstack-neutron-openvswitch-2014.2.3-26.el7ost.rhbz1281583.noarch
python-neutron-2014.2.3-26.el7ost.rhbz1281583.noarch


How reproducible:
I believe the customer can reproduce this on demand.

Steps to Reproduce:
1. Restart the neutron server running the l3-agent, dhcp-agent, and lbaas-agent.

Actual results:
Exit code: 242
Stdout: ''
Stderr: '2016-02-25T23:18:18Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n'

Expected results:
No error

Additional info:
Despite the timeout, the OVS port does appear to get set correctly:


        Port "ha-f1ff1111-1f"
            tag: 236
            Interface "ha-f1ff1111-1f"
                type: internal

Comment 13 Terry Wilson 2016-04-13 19:56:58 UTC
> I think we want to increase the default timeout so other customers will not be affected by this. What do you think Terry?

I'd be OK with increasing the timeout, though 10 seconds already seems like a long time for an ovs-vsctl execution. I'd like to see some profiling to find out what is taking so long.

Terry

Comment 14 Assaf Muller 2017-02-13 21:59:14 UTC
(In reply to Terry Wilson from comment #13)
> > I think we want to increase the default timeout so other customers will not be affected by this. What do you think Terry?
> 
> I'd be ok with increasing the timeout; though 10 seconds seems like a long
> time for an ovs-vsctl execution. I'd like to see some profiling to see what
> is taking so long.

This came up again in https://access.redhat.com/support/cases/#/case/01787838.

I suspect that the original default of 10 was not chosen scientifically, and that 20 would be equally arbitrary, but it would help our customers avoid this type of issue. I'd suggest we either:

1) Bump the default significantly.
2) Stick with 10 but improve the error handling, so that when we do hit a timeout, DEBUG-level logs dump profiling and hypervisor CPU consumption data to help us understand why we hit it.

Obviously (1) is significantly more feasible :)

> 
> Terry
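Option (1) amounts to a one-line change in the OVS agent configuration on each node. A sketch of what the bump would look like (the `[ovs]` section and `ovs_vsctl_timeout` option are what the Juno-era agent reads; the file path varies by release, and the value 20 is the arbitrary doubling discussed above):

```ini
# /etc/neutron/plugins/ml2/openvswitch_agent.ini (path varies by release)
[ovs]
# Default is 10; doubling it is arbitrary, but avoids the SIGALRM exits
# seen on hosts with ~1000 namespaces.
ovs_vsctl_timeout = 20
```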

Comment 15 Ihar Hrachyshka 2017-11-07 20:12:59 UTC
BTW, I proposed renaming the option: https://review.openstack.org/#/c/518391/ — so any fix applied here should account for the new name.

Comment 16 Ihar Hrachyshka 2017-11-07 20:15:03 UTC
What's the expected fix here? A neutron bump? A bump via tripleo/puppet? No bump but better error handling? Nothing at all?

Comment 17 Brent Eagles 2017-11-16 17:28:06 UTC
This keeps falling to the back of my queue. It should be a simple fix to puppet-neutron; I'll take a look.

Comment 18 Jakub Libosvar 2017-11-16 17:36:06 UTC
We've switched to the OVSDB native interface. If you still experience timeout issues on a newer OSP release, please feel free to reopen this bug.
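For reference, the switch Jakub mentions is itself driven by agent configuration. A sketch of the relevant options as I recall them from later in-tree agent configs (treat the exact names and availability as assumptions to verify against your release):

```ini
[ovs]
# Use the native python-ovs OVSDB client instead of shelling out to ovs-vsctl.
ovsdb_interface = native
# Timeout (seconds) for ovsdb-backed commands; this is the option the
# upstream rename in comment 15 replaced ovs_vsctl_timeout with.
ovsdb_timeout = 10
```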

