Description of problem:

A port failed to bind on a router's external gateway interface. Digging in, we found that it failed to bind because the OVS agent was dead. The OVS agent debug logs suggest that a neutron rootwrap daemon is holding the agent's OpenFlow listening socket, so the agent cannot start correctly. The workaround is to kill the rootwrap process, restart neutron-openvswitch-agent, then clear the router gateway and reattach it; see the notes and the command sketch in the additional info section below. We need to figure out why this keeps happening and fix it.

Version-Release number of selected component (if applicable):
openstack-neutron-9.2.0-2.el7ost.noarch
openstack-neutron-openvswitch-9.2.0-2.el7ost.noarch

How reproducible:
Keeps happening; exact trigger unknown.

Steps to Reproduce:
1. Unknown

Actual results:
Port binding fails, which prevents pinging floating IPs.

Expected results:
Open vSwitch agent up and no port binding failures.

Additional info:

### Port binding failed due to dead OVS agent:

2017-05-08 16:02:19.921 580770 WARNING neutron.plugins.ml2.drivers.mech_agent [req-533fe3ea-77c8-40fc-b58f-4ab050c81720 - - - - -] Refusing to bind port 544d8ec2-d351-4e49-baf1-8b67b6fb482a to dead agent: {'binary': u'neutron-openvswitch-agent', 'description': None, 'admin_state_up': True, 'heartbeat_timestamp': datetime.datetime(2017, 4, 14, 0, 9, 37), 'availability_zone': None, 'alive': False, 'topic': u'N/A', 'host': u'overcloud-controller-2.localdomain', 'agent_type': u'Open vSwitch agent', 'resource_versions': {u'SubPort': u'1.0', u'QosPolicy': u'1.3', u'Trunk': u'1.0'}, 'created_at': datetime.datetime(2016, 10, 6, 17, 56, 19), 'started_at': datetime.datetime(2017, 4, 11, 1, 29, 5), 'id': u'1afbdf75-05d0-4f10-bab1-380c3ce846bc', 'configurations': {u'ovs_hybrid_plug': True, u'in_distributed_mode': False, u'datapath_type': u'system', u'vhostuser_socket_dir': u'/var/run/openvswitch', u'tunneling_ip': u'192.168.3.17', u'arp_responder_enabled': False, u'devices': 44, u'ovs_capabilities': {u'datapath_types': [u'netdev', u'system'], u'iface_types': [u'geneve', u'gre', u'internal', u'ipsec_gre', u'lisp', u'patch', u'stt', u'system', u'tap', u'vxlan']}, u'log_agent_heartbeats': False, u'l2_population': False, u'tunnel_types': [u'vxlan'], u'extensions': [u'qos'], u'enable_distributed_routing': False, u'bridge_mappings': {u'datacentre': u'br-ex'}}}

2017-05-08 16:02:19.921 580770 ERROR neutron.plugins.ml2.managers [req-533fe3ea-77c8-40fc-b58f-4ab050c81720 - - - - -] Failed to bind port 544d8ec2-d351-4e49-baf1-8b67b6fb482a on host overcloud-controller-2.localdomain for vnic_type normal using segments [{'segmentation_id': 3291, 'physical_network': u'datacentre', 'id': u'867a59b9-d4d8-42c9-bda8-a1f54cccab88', 'network_type': u'vlan'}]

### agent-list (affected agent):

| 1afbdf75-05d0-4f10-bab1-380c3ce846bc | Open vSwitch agent | overcloud-controller-2.localdomain |                   | xxx   | True           | neutron-openvswitch-agent |

### Debug OVS agent logs:

2017-05-10 18:22:00.481 874615 ERROR ryu.lib.hub [-] hub: uncaught exception: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 54, in _launch
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ryu/controller/controller.py", line 97, in __call__
    self.ofp_ssl_listen_port)
  File "/usr/lib/python2.7/site-packages/ryu/controller/controller.py", line 120, in server_loop
    datapath_connection_factory)
  File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 117, in __init__
    self.server = eventlet.listen(listen_info)
  File "/usr/lib/python2.7/site-packages/eventlet/convenience.py", line 43, in listen
    sock.bind(addr)
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 98] Address already in use

2017-05-10 18:22:00.482 874615 DEBUG neutron.agent.linux.utils [-] Exit code: 0 execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:146
2017-05-10 18:22:01.123 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): AddBridgeCommand(datapath_type=system, may_exist=True, name=br-int) do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:98
2017-05-10 18:22:01.124 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:125
2017-05-10 18:22:01.124 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): SetFailModeCommand(bridge=br-int, mode=secure) do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:98
2017-05-10 18:22:01.124 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:125
2017-05-10 18:22:01.125 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): DbSetCommand(table=Bridge, col_values=(('protocols', ['OpenFlow10', 'OpenFlow13']),), record=br-int) do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:98
2017-05-10 18:22:01.125 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:125
2017-05-10 18:22:01.126 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): SetControllerCommand(bridge=br-int, targets=['tcp:127.0.0.1:6633']) do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:98
2017-05-10 18:22:01.270 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): DbGetCommand(column=controller, table=Bridge, record=br-int) do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:98
2017-05-10 18:22:01.270 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:125
2017-05-10 18:22:01.271 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): DbSetCommand(table=Controller, col_values=(('connection_mode', 'out-of-band'),), record=aa338024-faea-45ef-b00e-2e5e02e85597) do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:98
2017-05-10 18:22:01.384 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): DbGetCommand(column=datapath_id, table=Bridge, record=br-int) do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:98
2017-05-10 18:22:01.385 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:125
2017-05-10 18:22:01.385 874615 INFO neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_bridge [-] Bridge br-int has datapath-ID 0000b2e9944d644f

### Non-working controller:

[root@overcloud-controller-2 neutron]# netstat -tulpn | grep 6633
tcp       51      0 127.0.0.1:6633          0.0.0.0:*               LISTEN      170970/sudo
[root@overcloud-controller-2 neutron]# ps aux | grep 170970
root      170970  0.0  0.0 193332  2792 ?        S    Apr11   0:00 sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf

### Working system:
[root@overcloud-controller-1 ~]# netstat -tulpn | grep 6633
tcp        0      0 127.0.0.1:6633          0.0.0.0:*               LISTEN      275273/python2
[root@overcloud-controller-1 ~]# ps aux | grep 275273
neutron   275273  5.6  0.0 394056 99824 ?       Ss   Apr19 1770:07 /usr/bin/python2 /usr/bin/neutron-openvswitch-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-openvswitch-agent --log-file /var/log/neutron/openvswitch-agent.log

#### After killing rootwrap on controller-2 to allow neutron-openvswitch-agent to start up correctly:

+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| 0b42df33-b5dc-4ffe-b452-9d4eaf2f2701 | L3 agent           | overcloud-controller-1.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| 1afbdf75-05d0-4f10-bab1-380c3ce846bc | Open vSwitch agent | overcloud-controller-2.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
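### Workaround in command form:

For reference, the workaround described above as commands. This is a sketch, not a verified runbook: <router> and <external-network> are placeholders, and the systemd service name assumes a standard OSP deployment.

# Find the process holding the agent's OpenFlow port (127.0.0.1:6633) and
# confirm it is an orphaned rootwrap daemon, then kill it and restart the agent.
netstat -tulpn | grep 6633
ps aux | grep <pid-from-netstat>
kill <pid-from-netstat>
systemctl restart neutron-openvswitch-agent

# Clear and reattach the router's external gateway so the port binds again.
neutron router-gateway-clear <router>
neutron router-gateway-set <router> <external-network>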
This bugzilla has been removed from the release and needs to be reviewed and triaged for another target release.
This seems related to https://bugzilla.redhat.com/show_bug.cgi?id=1425507; the rootwrap issue was fixed in OSP 11 and should be backported to OSP 10. Assigning to Ihar for the backport and further triage.
In the OVS agent logs we can see that the agent was stopped, but it crashed on shutdown because of https://bugs.launchpad.net/neutron/+bug/1589746. We need a new ryu (4.4+) to fix this crash. We also need a new oslo.rootwrap that will clean up orphans. And we need a neutron packaging workaround to kill orphans on package update (it won't solve the issue completely, but it's a reasonable thing to do, and at least fixes updates between minor versions of OSP 10).
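For reference: an orphaned rootwrap daemon is reparented to init (PID 1) once the agent that spawned it dies, while apparently still holding the listening socket it inherited from the agent, which is consistent with the netstat output above where PID 170970 (a "sudo neutron-rootwrap-daemon") is the listener on 6633. A hypothetical one-liner to spot such orphans on a node, for illustration only:

# List neutron-rootwrap-daemon processes whose parent is PID 1, i.e. whose
# spawning agent is gone. Illustrative only; not part of any shipped patch.
ps -eo pid,ppid,args | awk '$2 == 1 && /neutron-rootwrap-daemon/'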
The agent crashed and left orphans because of the old ryu version. That ryu bug is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1397017.
This bug is really mostly in external components: we need a newer ryu and backports in python-oslo-rootwrap. I put the relevant dependency bugs in the Depends On field. I am still leaving this bug on openstack-neutron because there is a small thing we can do here: backport a patch that cleans up those orphaned rootwrap daemons on package update (sketched below). It won't help in general, but it will at least cover agent restarts triggered by package upgrades. I put ryu in the Fixed In Version field in addition to openstack-neutron for reference; the actual ryu version bump is tracked in a separate rhbz.
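To illustrate the idea (this is not the actual backported patch, just a rough sketch of the kind of cleanup a spec-file scriptlet such as %posttrans could run on package update):

# Hypothetical cleanup sketch: on package update, kill rootwrap daemons
# orphaned by a crashed agent so the restarted agent can re-bind
# 127.0.0.1:6633. The real patch may differ.
for pid in $(ps -eo pid,ppid,args | awk '$2 == 1 && /neutron-rootwrap-daemon/ {print $1}'); do
    kill "$pid" 2>/dev/null || :
done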
It's not clear why I don't get flags here. Added rhos-flags.
Thanks, but no, you shouldn't move this bug to ryu. This bug will track a package-level workaround for upgrades, while the dependency bugs will track ryu and oslo.rootwrap.
FYI, we have another customer hitting the same issue.
Verified on OSP10 z3 with openstack-neutron-9.3.1-2.el7ost.noarch.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1594
*** Bug 1393972 has been marked as a duplicate of this bug. ***
*** Bug 1467496 has been marked as a duplicate of this bug. ***