Created attachment 1419724 [details]
openvswitch-agent.log and some environment information

Description of problem:
During a default installation of OpenShift on OSP
(https://github.com/openshift/openshift-ansible/tree/master/playbooks/openstack)
on an OSP 11 deployment with 3 controllers and 38 computes, networking problems begin to surface at around ~1000 VMs. At this point there are frequent ssh-session disconnects from the controllers and the VMs. The dmesg output on the controllers shows a lot of "net_ratelimit: XYZ callbacks suppressed" messages and there are failures in openvswitch-agent.log.

Version-Release number of selected component (if applicable):
OSP 11

How reproducible:
Always when trying a default OCP on OSP install in the environment described above.

Steps to Reproduce:
1. Try a similar OCP on OSP scaleup. I believe that by increasing the number of security group rules this problem will become apparent even sooner than at ~1000 VMs.

Actual results:
Networking failures, the kernel effectively DoS-ed on the controllers, and errors in openvswitch-agent.log:

AMQP server on overcloud-controller-1.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: 43592
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 311, in _report_state
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     True)
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 87, in report_state
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     return method(context, 'report_state', **kwargs)
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=self.retry)
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 97, in _send
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     timeout=timeout, retry=retry)
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 578, in send
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=retry)
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 567, in _send
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     result = self._waiter.wait(msg_id, timeout)
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     message = self.waiters.get(msg_id, timeout=timeout)
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 347, in get
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     'to message ID %s' % msg_id)
2018-04-10 04:02:24.930 146411 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID d12f23c014f84fb4a033e39c8e1d21e7

Expected results:
None of the problems described above.

Additional info:
Workaround: By making the security group rules very permissive, I was able to go beyond ~2100 VMs in the same environment. There is also a high number of iptables rules on the computes.
https://bugs.launchpad.net/neutron/+bug/1432858
The errors from the OVS agent indicate that the agent can't talk to the RabbitMQ bus because the network stack is being choked. A few questions:

Can you observe any process like ovs-vswitchd spiking in CPU utilization? We've been having issues with that lately.

Do you use l2 population?

What does the host networking on the controllers look like - is the management network (API calls, rabbitmq) using OVS bridges? You can provide this information by issuing the "ovs-vsctl show" command. "ip a" would be helpful too.

(In reply to jmencak from comment #0)
> Additional info:
> Workaround: By making the security group rules very permissive, I was
> able to go beyond ~2100 VMs in the same environment. There is also a high
> number of iptables rules on the computes.

Does this mean that the compute nodes also suffer networking issues, or is it just an observation?
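For reference, the information above can be gathered on a controller roughly like this (the config path assumes a TripleO/OSP deployment and may differ):

  # Is ovs-vswitchd spinning on CPU?
  ps -C ovs-vswitchd -o pid,%cpu,%mem,cmd

  # Is l2population enabled for the OVS agent?
  # (the option may live in openvswitch_agent.ini or ml2_conf.ini)
  grep -ri l2_population /etc/neutron/plugins/ml2/

  # Host networking layout
  ovs-vsctl show
  ip a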
Created attachment 1420702 [details]
ovs-vsctl show on controller-0

Adding the output of "ovs-vsctl show" from controller-0. I had to reinstall OpenStack, but the deployment should be exactly the same.

As for the compute nodes suffering the same problem, this needs to be verified, but the ssh disconnects from the controllers were more frequent.
Picking this back up; can you try this on Rocky? There have been multiple improvements in security group efficiency in recent months. Thanks!
Closing as per comment #8. There have been many optimization-related changes in recent releases (including to security groups), so currently supported releases should behave much better in this situation.