Bug 1450223 - Port failed to bind because openvswitch agent dies.
Summary: Port failed to bind because openvswitch agent dies.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z3
Target Release: 10.0 (Newton)
Assignee: Ihar Hrachyshka
QA Contact: Eran Kuris
URL:
Whiteboard:
Duplicates: 1393972 1467496
Depends On: 1397017 1451082 1456476
Blocks:
Reported: 2017-05-11 21:56 UTC by Jeremy
Modified: 2023-02-22 23:02 UTC (History)

Fixed In Version: openstack-neutron-9.2.0-12.el7ost python-ryu-4.9-2.1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1451082
Environment:
Last Closed: 2017-06-28 15:31:11 UTC
Target Upstream Version:
Embargoed:


Links
| System                 | ID             | Private | Priority | Status       | Summary                                   | Last Updated            |
| OpenStack gerrit       | 464708         | 0       | 'None'   | MERGED       | Release new stable oslo.rootwrap releases | 2020-11-05 17:31:55 UTC |
| RDO                    | 6648           | 0       | None     | None         | None                                      | 2017-05-15 17:41:04 UTC |
| Red Hat Issue Tracker  | OSP-4633       | 0       | None     | None         | None                                      | 2022-08-16 13:00:18 UTC |
| Red Hat Product Errata | RHBA-2017:1594 | 0       | normal   | SHIPPED_LIVE | openstack-neutron bug fix advisory        | 2017-06-28 19:13:28 UTC |

Description Jeremy 2017-05-11 21:56:36 UTC
Description of problem:

We noticed a port failing to bind on a router's external gateway interface. Digging in, we found that it failed to bind because the OVS agent was dead. The OVS debug logs suggest that a neutron rootwrap process is holding the agent's socket (127.0.0.1:6633), so the agent cannot start correctly. The workaround is to kill the rootwrap process, restart neutron-openv, clear the router gateway, and reattach it. See the notes in the Additional info section below. We need to figure out why this keeps happening and fix it.


Version-Release number of selected component (if applicable):
openstack-neutron-9.2.0-2.el7ost.noarch  
openstack-neutron-openvswitch-9.2.0-2.el7ost.noarch   
How reproducible:
Seems to keep happening.

Steps to Reproduce:
1. unknown
2.
3.

Actual results:

The port fails to bind, which prevents pinging floating IPs.
Expected results:
openvswitch agent up and no port binding failures.



Additional info:

/notes
###port binding failed due to dead ovs agent.
2017-05-08 16:02:19.921 580770 WARNING neutron.plugins.ml2.drivers.mech_agent [req-533fe3ea-77c8-40fc-b58f-4ab050c81720 - - - - -] Refusing to bind port 544d8ec2-d351-4e49-baf1-8b67b6fb482a to dead agent: {'binary': u'neutron-openvswitch-agent', 'description': None, 'admin_state_up': True, 'heartbeat_timestamp': datetime.datetime(2017, 4, 14, 0, 9, 37), 'availability_zone': None, 'alive': False, 'topic': u'N/A', 'host': u'overcloud-controller-2.localdomain', 'agent_type': u'Open vSwitch agent', 'resource_versions': {u'SubPort': u'1.0', u'QosPolicy': u'1.3', u'Trunk': u'1.0'}, 'created_at': datetime.datetime(2016, 10, 6, 17, 56, 19), 'started_at': datetime.datetime(2017, 4, 11, 1, 29, 5), 'id': u'1afbdf75-05d0-4f10-bab1-380c3ce846bc', 'configurations': {u'ovs_hybrid_plug': True, u'in_distributed_mode': False, u'datapath_type': u'system', u'vhostuser_socket_dir': u'/var/run/openvswitch', u'tunneling_ip': u'192.168.3.17', u'arp_responder_enabled': False, u'devices': 44, u'ovs_capabilities': {u'datapath_types': [u'netdev', u'system'], u'iface_types': [u'geneve', u'gre', u'internal', u'ipsec_gre', u'lisp', u'patch', u'stt', u'system', u'tap', u'vxlan']}, u'log_agent_heartbeats': False, u'l2_population': False, u'tunnel_types': [u'vxlan'], u'extensions': [u'qos'], u'enable_distributed_routing': False, u'bridge_mappings': {u'datacentre': u'br-ex'}}}
2017-05-08 16:02:19.921 580770 ERROR neutron.plugins.ml2.managers [req-533fe3ea-77c8-40fc-b58f-4ab050c81720 - - - - -] Failed to bind port 544d8ec2-d351-4e49-baf1-8b67b6fb482a on host overcloud-controller-2.localdomain for vnic_type normal using segments [{'segmentation_id': 3291, 'physical_network': u'datacentre', 'id': u'867a59b9-d4d8-42c9-bda8-a1f54cccab88', 'network_type': u'vlan'}]
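
The "dead agent" verdict above follows from the age of the agent's last heartbeat. A minimal sketch of that check, assuming a down-time threshold of 75 seconds (believed to be Neutron's default agent_down_time; verify it against the deployment's neutron.conf). The function name is hypothetical and the two timestamps are copied from the log lines above, ignoring any time zone difference between them:

```python
from datetime import datetime, timedelta

# Assumption: 75 seconds is the configured agent_down_time; check neutron.conf.
AGENT_DOWN_TIME = timedelta(seconds=75)


def agent_is_down(heartbeat, now, down_time=AGENT_DOWN_TIME):
    """Return True if the last heartbeat is older than down_time."""
    return now - heartbeat > down_time


# 'heartbeat_timestamp' reported for the Open vSwitch agent in the WARNING line.
heartbeat = datetime(2017, 4, 14, 0, 9, 37)
# Timestamp of the "Refusing to bind port" WARNING line itself.
refused_at = datetime(2017, 5, 8, 16, 2, 19)

print(agent_is_down(heartbeat, refused_at))
print("heartbeat age:", refused_at - heartbeat)
```

With a last heartbeat more than three weeks old, any threshold measured in seconds or minutes marks the agent as down, which is why the ML2 mechanism driver refuses the binding.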


###agent-list
| 1afbdf75-05d0-4f10-bab1-380c3ce846bc | Open vSwitch agent | overcloud-controller-2.localdomain |                   | xxx   | True           | neutron-openvswitch-agent |


###debug ovs logs
2017-05-10 18:22:00.481 874615 ERROR ryu.lib.hub [-] hub: uncaught exception: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 54, in _launch
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ryu/controller/controller.py", line 97, in __call__
    self.ofp_ssl_listen_port)
  File "/usr/lib/python2.7/site-packages/ryu/controller/controller.py", line 120, in server_loop
    datapath_connection_factory)
  File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 117, in __init__
    self.server = eventlet.listen(listen_info)
  File "/usr/lib/python2.7/site-packages/eventlet/convenience.py", line 43, in listen
    sock.bind(addr)
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 98] Address already in use

2017-05-10 18:22:00.482 874615 DEBUG neutron.agent.linux.utils [-] Exit code: 0 execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:146
2017-05-10 18:22:01.123 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): AddBridgeCommand(datapath_type=system, may_exist=True, name=br-int) do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:98
2017-05-10 18:22:01.124 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:125
2017-05-10 18:22:01.124 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): SetFailModeCommand(bridge=br-int, mode=secure) do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:98
2017-05-10 18:22:01.124 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:125
2017-05-10 18:22:01.125 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): DbSetCommand(table=Bridge, col_values=(('protocols', ['OpenFlow10', 'OpenFlow13']),), record=br-int) do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:98
2017-05-10 18:22:01.125 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:125

2017-05-10 18:22:01.126 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): SetControllerCommand(bridge=br-int, targets=['tcp:127.0.0.1:6633']) do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:98

2017-05-10 18:22:01.270 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): DbGetCommand(column=controller, table=Bridge, record=br-int) do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:98
2017-05-10 18:22:01.270 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:125
2017-05-10 18:22:01.271 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): DbSetCommand(table=Controller, col_values=(('connection_mode', 'out-of-band'),), record=aa338024-faea-45ef-b00e-2e5e02e85597) do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:98
2017-05-10 18:22:01.384 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): DbGetCommand(column=datapath_id, table=Bridge, record=br-int) do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:98
2017-05-10 18:22:01.385 874615 DEBUG neutron.agent.ovsdb.impl_idl [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py:125
2017-05-10 18:22:01.385 874615 INFO neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_bridge [-] Bridge br-int has datapath-ID 0000b2e9944d644f
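
The "[Errno 98] Address already in use" in the traceback above is what a second listener gets when it tries to bind an address that another process is already listening on. A minimal sketch that reproduces the same error class with plain sockets; it uses an OS-assigned loopback port rather than 6633, and the helper names are hypothetical, not part of ryu or eventlet:

```python
import errno
import socket


def try_listen(addr):
    """Bind and listen on addr, closing the socket if that fails."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(addr)
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def bind_conflict_errno():
    """Listen on a free loopback port, then try to bind the same address again.

    Returns the errno raised by the second bind, or None if it succeeded.
    """
    first = try_listen(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    addr = first.getsockname()
    try:
        second = try_listen(addr)
    except OSError as exc:
        return exc.errno
    else:
        second.close()
        return None
    finally:
        first.close()


if __name__ == "__main__":
    print(bind_conflict_errno() == errno.EADDRINUSE)
```

In this bug the first listener is not another agent but a leftover process that still holds the listening socket, so the restarted agent plays the role of the second bind.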

### non working controller.
[root@overcloud-controller-2 neutron]# netstat -tulpn |grep 6633
tcp       51      0 127.0.0.1:6633          0.0.0.0:*               LISTEN      170970/sudo
[root@overcloud-controller-2 neutron]#


[root@overcloud-controller-2 neutron]# ps aux |grep 170970
root      170970  0.0  0.0 193332  2792 ?        S    Apr11   0:00 sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf




### working system..
[root@overcloud-controller-1 ~]# netstat -tulpn |grep 6633
tcp        0      0 127.0.0.1:6633          0.0.0.0:*               LISTEN      275273/python2
[root@overcloud-controller-1 ~]#

[root@overcloud-controller-1 ~]# ps aux|grep 275273
neutron   275273  5.6  0.0 394056 99824 ?        Ss   Apr19 1770:07 /usr/bin/python2 /usr/bin/neutron-openvswitch-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-openvswitch-agent --log-file /var/log/neutron/openvswitch-agent.log
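
The comparison between the two controllers can be automated. The following is a rough sketch, assuming the `netstat -tulpn` column layout shown above (run as root so the PID/Program column is populated); the function name is hypothetical:

```python
def listener_owner(netstat_output, port):
    """Return (pid, program) of the process listening on port, or None.

    Expects lines in the `netstat -tulpn` layout:
    proto recv-q send-q local-address foreign-address state pid/program
    """
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 7 or fields[5] != "LISTEN":
            continue
        if fields[3].rsplit(":", 1)[-1] != str(port):
            continue
        pid, _, program = fields[6].partition("/")
        # netstat prints "-" instead of a PID when it lacks permission.
        return (int(pid) if pid.isdigit() else None, program)
    return None


broken = "tcp       51      0 127.0.0.1:6633          0.0.0.0:*               LISTEN      170970/sudo"
working = "tcp        0      0 127.0.0.1:6633          0.0.0.0:*               LISTEN      275273/python2"

print(listener_owner(broken, 6633))
print(listener_owner(working, 6633))
```

On a healthy node the owner is the agent's python2 process. An owner of sudo, as on controller-2, points at a leftover rootwrap daemon holding the port.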


#### kill rootwrap on controller2 to allow neutron-openv to start up correctly.

+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| 0b42df33-b5dc-4ffe-b452-9d4eaf2f2701 | L3 agent           | overcloud-controller-1.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| 1afbdf75-05d0-4f10-bab1-380c3ce846bc | Open vSwitch agent | overcloud-controller-2.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |

Comment 1 Red Hat Bugzilla Rules Engine 2017-05-11 21:56:44 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 4 Assaf Muller 2017-05-15 13:55:37 UTC
This seems related to https://bugzilla.redhat.com/show_bug.cgi?id=1425507. The rootwrap issue was fixed in 11 and should be backported to 10. Assigning to Ihar for the backport and further triage.

Comment 7 Ihar Hrachyshka 2017-05-15 18:14:14 UTC
In the OVS agent logs, we can see that the agent was stopped, but crashed because of https://bugs.launchpad.net/neutron/+bug/1589746. We need a new ryu (4.4+) to fix this crash. We also need a new oslo.rootwrap that will clean up orphans. And we need a neutron packaging workaround to kill orphans on package update (it won't solve the issue completely, but it's a reasonable way to at least fix updates between minor versions of OSP10).

Comment 8 Ihar Hrachyshka 2017-05-15 18:21:41 UTC
The agent crashed leaving orphans because of old ryu. This bug is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1397017

Comment 9 Ihar Hrachyshka 2017-05-15 18:34:12 UTC
This bug is really mostly in external components: we need a newer ryu and backports in python-oslo-rootwrap. I put the relevant dependency bugs in the Depends On field. I am still leaving this bug on openstack-neutron because there is a small thing we can do here: backport a patch that will clean up those orphaned rootwrap daemons on package update. It won't help in general, but it will at least cover agent restarts triggered by package upgrades.

I put ryu in Fixed in Version field in addition to openstack-neutron for reference. The actual bug tracking ryu version bump is in a separate rhbz.

Comment 10 Ihar Hrachyshka 2017-05-15 20:07:48 UTC
Not clear why I don't get flags here. Added rhos-flags.

Comment 13 Ihar Hrachyshka 2017-05-16 20:54:09 UTC
Thanks but no, you shouldn't move this bug to ryu. This bug will track a package level workaround for upgrades, while dependency bugs will track ryu and oslo.rootwrap.

Comment 14 Jeremy 2017-05-17 15:09:13 UTC
FYI: another customer is hitting the same issue.

Comment 16 Eran Kuris 2017-06-01 06:07:47 UTC
Verified on OSP10Z3 openstack-neutron-9.3.1-2.el7ost.noarch

Comment 18 errata-xmlrpc 2017-06-28 15:31:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1594

Comment 19 Jakub Libosvar 2017-07-03 14:35:47 UTC
*** Bug 1393972 has been marked as a duplicate of this bug. ***

Comment 20 Bob Fournier 2017-07-20 13:14:03 UTC
*** Bug 1467496 has been marked as a duplicate of this bug. ***

