Bug 1565508

Summary: Scalability problems with security group rules.
Product: Red Hat OpenStack
Reporter: Jiří Mencák <jmencak>
Component: openstack-neutron
Assignee: Assaf Muller <amuller>
Status: CLOSED CURRENTRELEASE
QA Contact: Toni Freger <tfreger>
Severity: medium
Priority: high
Version: 11.0 (Ocata)
CC: amuller, bcafarel, bhaley, chrisw, jlibosva, jmencak, njohnston, srevivo
Keywords: ZStream
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: aos-scalability-39
Last Closed: 2019-04-18 15:46:28 UTC
Type: Bug
Attachments:
- openvswitch-agent.log and some environment information
- ovs-vsctl show on the controller-0

Description Jiří Mencák 2018-04-10 07:43:28 UTC
Created attachment 1419724 [details]
openvswitch-agent.log and some environment information

Description of problem:
During a default installation of OpenShift on OSP (https://github.com/openshift/openshift-ansible/tree/master/playbooks/openstack) on an OSP 11 deployment with 3 controllers and 38 computes, networking problems begin to surface at around ~1000 VMs.  At this point there are frequent SSH session disconnects from both the controllers and the VMs.  The dmesg output on the controllers shows a lot of "net_ratelimit: XYZ callbacks suppressed" messages, and there are failures in openvswitch-agent.log.
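For reference, a rough sketch of the checks used to spot this on a controller (the agent log path is the usual OSP default and may differ on other setups):

  # kernel rate-limiting messages from the choked network stack
  dmesg | grep net_ratelimit

  # count report-state / RPC timeout failures in the OVS agent log
  grep -cE 'Failed reporting state|MessagingTimeout' /var/log/neutron/openvswitch-agent.log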

Version-Release number of selected component (if applicable):
OSP 11

How reproducible:
Always, when attempting a default OCP on OSP install in the environment described above.

Steps to Reproduce:
1. Try a similar OCP on OSP scale-up.  I believe that by increasing the number of security group rules this problem will become apparent even sooner than at ~1000 VMs (a rough sketch for inflating the rule count is below).
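A minimal sketch of one way to inflate the security group rule count and hit the problem sooner (the "scale-test" group name and the port range are only illustrative; exact client flags may vary by release):

  # create a throwaway group and add a few hundred per-port TCP rules
  openstack security group create scale-test
  for port in $(seq 20000 20500); do
      openstack security group rule create --ingress --protocol tcp \
          --dst-port "$port" --remote-ip 0.0.0.0/0 scale-test
  done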

Actual results:
Networking failures, an effectively DoS-ed kernel network stack on the controllers, and errors in openvswitch-agent.log:

AMQP server on overcloud-controller-1.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: 43592

2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 311, in _report_state
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     True)
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 87, in report_state
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     return method(context, 'report_state', **kwargs)
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=self.retry)
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 97, in _send
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     timeout=timeout, retry=retry)
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 578, in send
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=retry)
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 567, in _send
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     result = self._waiter.wait(msg_id, timeout)
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     message = self.waiters.get(msg_id, timeout=timeout)
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 347, in get
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     'to message ID %s' % msg_id)
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID d12f23c014f84fb4a033e39c8e1d21e7

Expected results:
None of the problems described above.

Additional info:
Workaround: By making the security group rules very permissive, I was able to go beyond ~2100 VMs in the same environment.  There is also a high number of iptables rules on the computes.
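For the record, a rough sketch of what "very permissive" means here and of how the iptables rule count was checked (the project's "default" group and the neutron-openvswi chain prefix of the iptables_hybrid firewall driver are assumptions about this deployment):

  # open a group wide instead of carrying many fine-grained rules
  openstack security group rule create --ingress --protocol tcp  --dst-port 1:65535 --remote-ip 0.0.0.0/0 default
  openstack security group rule create --ingress --protocol udp  --dst-port 1:65535 --remote-ip 0.0.0.0/0 default
  openstack security group rule create --ingress --protocol icmp --remote-ip 0.0.0.0/0 default

  # count Neutron-managed iptables rules on a compute node
  iptables-save | grep -c -- '-A neutron-openvswi'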

https://bugs.launchpad.net/neutron/+bug/1432858

Comment 2 Jakub Libosvar 2018-04-10 11:09:39 UTC
The errors from the OVS agent stem from the agent being unable to talk to the RabbitMQ bus because the network stack is choked. A few questions:

Do you observe any process such as ovs-vswitchd spiking in CPU utilization? We have been seeing such issues lately.

Do you use l2 population?

What does the host networking on the controllers look like - is the management network (API calls, rabbitmq) using OVS bridges? You can provide that information by issuing the "ovs-vsctl show" command; "ip a" output would be helpful too.

(In reply to jmencak from comment #0)
> Additional info:
> Workaround: By making the security group rules very permissive, I was
> able to go beyond ~2100 VMs in the same environment.  There is also a high
> number of iptables rules on the computes.

Does that mean the compute nodes also suffer networking issues, or is that just an observation?
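To gather the networking information I asked for above, something like this on a controller would be enough (standard commands, nothing OSP-specific):

  ovs-vsctl show                          # bridge and port layout
  ip a                                    # host interfaces and addresses
  ps -o pid,pcpu,comm -C ovs-vswitchd     # quick look at ovs-vswitchd CPU usage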

Comment 3 Jiří Mencák 2018-04-12 05:57:36 UTC
Created attachment 1420702 [details]
ovs-vsctl show on the controller-0

Adding the ovs-vsctl show output from controller-0.  I had to reinstall OpenStack, but the deployment should be exactly the same.  As for the compute nodes suffering the same problem, this still needs to be verified, but the SSH disconnects from the controllers were more frequent.

Comment 8 Nate Johnston 2019-04-11 16:15:56 UTC
Picking this back up; can you try this on Rocky?  There have been multiple improvements in security group efficiency in recent months.  Thanks!

Comment 11 Bernard Cafarelli 2019-04-18 15:46:28 UTC
Closing as per comment #8: there have been many optimization-related changes in recent releases (including to security groups), so currently supported releases should behave much better in this situation.