Description of problem:

During the OSP10->OSP11 upgrade, compute nodes lose network connectivity. As a result the upgrade process gets stuck, because nova-compute is not able to start: it cannot reach the rabbitmq servers running on the controller nodes.

This is the compute node upgrade output: http://paste.openstack.org/show/598878/

From what I can tell, the issue appears to be related to openvswitch:

[root@overcloud-novacompute-1 ~]# tail -f /var/log/openvswitch/ovs-vswitchd.log
2017-02-14T19:00:59.068Z|05074|rconn|WARN|br-tun<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:00:59.068Z|05075|rconn|WARN|br-infra<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:07.067Z|05076|rconn|WARN|br-ex<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:07.067Z|05077|rconn|WARN|br-int<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:07.067Z|05078|rconn|WARN|br-tun<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:07.067Z|05079|rconn|WARN|br-infra<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:15.067Z|05080|rconn|WARN|br-ex<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:15.067Z|05081|rconn|WARN|br-int<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:15.067Z|05082|rconn|WARN|br-tun<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:15.067Z|05083|rconn|WARN|br-infra<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:23.067Z|05084|rconn|WARN|br-ex<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:23.067Z|05085|rconn|WARN|br-int<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:23.067Z|05086|rconn|WARN|br-tun<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2017-02-14T19:01:23.067Z|05087|rconn|WARN|br-infra<->tcp:127.0.0.1:6633: connection failed (Connection refused)

The interface used for reaching the rabbitmq servers (vlan200) is part of the br-infra bridge:

[root@overcloud-novacompute-1 ~]# ovs-vsctl list-ports br-infra
eth1
phy-br-infra
vlan200

neutron-openvswitch-agent is stopped:

[root@overcloud-novacompute-1 ~]# systemctl status neutron-openvswitch-agent
● neutron-openvswitch-agent.service - OpenStack Neutron Open vSwitch Agent
   Loaded: loaded (/usr/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Tue 2017-02-14 16:15:08 UTC; 2h 48min ago
 Main PID: 44934 (code=exited, status=0/SUCCESS)

Feb 13 09:25:37 overcloud-novacompute-1 systemd[1]: Started OpenStack Neutron Open vSwitch Agent.
Feb 13 09:25:38 overcloud-novacompute-1 neutron-openvswitch-agent[44934]: Guru meditation now registers SIGUSR1 and SIGUSR2 by default for backward compatibility. SIGUSR1 will no longer be registered in a future release, s...erate reports.
Feb 13 09:25:39 overcloud-novacompute-1 neutron-openvswitch-agent[44934]: Option "verbose" from group "DEFAULT" is deprecated for removal. Its value may be silently ignored in the future.
Feb 13 09:25:39 overcloud-novacompute-1 neutron-openvswitch-agent[44934]: Option "rpc_backend" from group "DEFAULT" is deprecated for removal. Its value may be silently ignored in the future.
Feb 13 09:25:41 overcloud-novacompute-1 neutron-openvswitch-agent[44934]: Option "notification_driver" from group "DEFAULT" is deprecated. Use option "driver" from group "oslo_messaging_notifications".
Feb 13 09:25:41 overcloud-novacompute-1 sudo[45004]: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
Feb 13 09:25:41 overcloud-novacompute-1 ovs-vsctl[45011]: ovs|00001|vsctl|INFO|Called as /bin/ovs-vsctl --timeout=10 --oneline --format=json -- --id=@manager create Manager "target=\"ptcp:6640:127.0.0.1\"" -- add Open_vS...options @manager
Feb 13 09:25:47 overcloud-novacompute-1 sudo[45195]: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ovsdb-client monitor Interface name,ofport,external_ids --format=json
Feb 14 16:15:07 overcloud-novacompute-1 systemd[1]: Stopping OpenStack Neutron Open vSwitch Agent...
Feb 14 16:15:08 overcloud-novacompute-1 systemd[1]: Stopped OpenStack Neutron Open vSwitch Agent.
Hint: Some lines were ellipsized, use -l to show in full.

Version-Release number of selected component (if applicable):

In OSP10 we have the following openvswitch packages:
python-openvswitch-2.5.0-2.el7.noarch
openvswitch-2.5.0-2.el7.x86_64
openstack-neutron-openvswitch-9.1.2-0.20170128064429.42853ea.el7.centos.noarch

In OSP11 it looks like they get upgraded to 2.6.
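A quick way to compare what is actually installed on a node before and after the upgrade is to query the RPM database directly (a minimal sketch; the exact package set and versions will differ per puddle):

# List the OVS- and ryu-related packages installed on the node
rpm -qa | grep -Ei 'openvswitch|python-ryu' | sort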
Created attachment 1256172 [details]
compute upgrade log
Fix proposed upstream by Marios: https://review.openstack.org/#/c/436990/ . Flipping back to the DF DFG.
After applying patch 436990 I am still seeing sporadic issues with compute nodes losing network connectivity. I suspect this is a different issue, as the messages in ovs-vswitchd.log now show a 'connection timed out' error. The process listening on 6633 is sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf.

[root@overcloud-compute-1 heat-admin]# tail -f /var/log/openvswitch/ovs-vswitchd.log
2017-03-13T14:24:40.594Z|00094|rconn|INFO|br-tun<->tcp:127.0.0.1:6633: connecting...
2017-03-13T14:24:40.594Z|00095|rconn|INFO|br-infra<->tcp:127.0.0.1:6633: connecting...
2017-03-13T14:24:44.593Z|00096|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connection timed out
2017-03-13T14:24:44.593Z|00097|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: continuing to retry connections in the background but suppressing further logging
2017-03-13T14:24:44.594Z|00098|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connection timed out
2017-03-13T14:24:44.594Z|00099|rconn|INFO|br-int<->tcp:127.0.0.1:6633: continuing to retry connections in the background but suppressing further logging
2017-03-13T14:24:44.594Z|00100|rconn|INFO|br-tun<->tcp:127.0.0.1:6633: connection timed out
2017-03-13T14:24:44.594Z|00101|rconn|INFO|br-tun<->tcp:127.0.0.1:6633: continuing to retry connections in the background but suppressing further logging
2017-03-13T14:24:44.594Z|00102|rconn|INFO|br-infra<->tcp:127.0.0.1:6633: connection timed out
2017-03-13T14:24:44.594Z|00103|rconn|INFO|br-infra<->tcp:127.0.0.1:6633: continuing to retry connections in the background but suppressing further logging
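To confirm which process is holding the local OpenFlow port that ovs-vswitchd keeps retrying, something like the following can be run on the affected node (a minimal sketch; 6633 is the port shown in the log above):

# Show the listener bound to the OpenFlow port the bridges connect to
ss -ltnp | grep ':6633'

# Alternative view, listing the owning process per socket
lsof -nP -iTCP:6633 -sTCP:LISTEN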
I noticed a few things based on a quick review:

* It's upgrading to ovs 2.6.1-8, while it should have been 2.6.1-10.
* It's using --nopostun, but it should not be needed unless you can't have a service restart at all.
* OVS can't communicate with the local controller (127.0.0.1), so it sounds like another agent is having a problem.
* With bridges in secure mode and without a controller, OVS will not allow flows to pass, causing connectivity issues on those bridges.

Please clarify.
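For reference, the secure-mode behaviour described above can be verified with standard OVS commands on one of the affected bridges (a minimal sketch using br-infra as an example; the same applies to br-ex, br-int and br-tun):

# 'secure' means ovs-vswitchd will not fall back to a default NORMAL flow
# when the OpenFlow controller is unreachable
ovs-vsctl get-fail-mode br-infra

# Show which flows (if any) are still installed on the bridge
ovs-ofctl dump-flows br-infra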
(In reply to Flavio Leitner from comment #6)
> I noticed a few things based on a quick review:
> * It's upgrading to ovs 2.6.1-8, while it should have been 2.6.1-10.

This is what we have in the latest OSP10 puddle: openvswitch-2.6.1-8.git20161206.el7fdb.x86_64.rpm

> * It's using --nopostun, but it should not be needed unless you can't have a
> service restart at all.
> * OVS can't communicate with the local controller (127.0.0.1), so it sounds
> like another agent is having a problem.
> * With bridges in secure mode and without a controller, OVS will not allow
> flows to pass, causing connectivity issues on those bridges.
> 
> Please clarify.

From the compute nodes' upgrade perspective we're doing it via a single command, 'upgrade-non-controller.sh --upgrade $node', so the upgrade process should be transparent to the user. Please let me know if there's anything you want me to check on the compute node during the upgrade process, while the connectivity is lost.
(In reply to Flavio Leitner from comment #6)
> I noticed a few things based on a quick review:
> * It's upgrading to ovs 2.6.1-8, while it should have been 2.6.1-10.
> * It's using --nopostun, but it should not be needed unless you can't have a
> service restart at all.
> * OVS can't communicate with the local controller (127.0.0.1), so it sounds
> like another agent is having a problem.
> * With bridges in secure mode and without a controller, OVS will not allow
> flows to pass, causing connectivity issues on those bridges.
> 
> Please clarify.

Mike Burns and I talked about this on IRC; OSP 10 and 11 will get 2.6.1-10 in the next puddle.
I looked yesterday at the environment Marius had, and he was able to reproduce the issue. The cause is the python-ryu version in use, which contains a bug (see LP 1589746). An error during neutron-openvswitch-agent shutdown left port 6633 open. The next start attempt of the ovs agent fails to bind to that port, and hence the ovs agent can't connect to the OpenFlow controller. This leads to the missing NORMAL action flows on the bridge (this is something I don't understand: why did the flows disappear, given that the bridges are in secure mode).
This is an issue in ryu; I'm taking this BZ.
I am able to consistently reproduce this issue on environments with a higher number of compute nodes. From the openvswitch-agent.log it looks like it's the ryu bug:

2017-03-28 23:37:41.954 43811 ERROR ryu.lib.hub [-] hub: uncaught exception: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 54, in _launch
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 545, in close
    # This semaphore prevents parallel execution of this function,
  File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 528, in uninstantiate
    def uninstantiate(self, name):
KeyError: 'ofctl_service'

Moreover, this issue appears to affect not only overcloud upgrades but undercloud upgrades as well, preventing operations such as adding overcloud nodes. Please see bug 1436729 and bug 1432028 for reference.
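To check which python-ryu build is installed on a node (and whether it already carries the fix for LP 1589746), the package can be queried directly (a minimal sketch; the fixed build version depends on the channel in use):

# Show the installed python-ryu build
rpm -q python-ryu

# Look for a reference to the fix in the package changelog
rpm -q --changelog python-ryu | head -n 20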
Created attachment 1267264 [details]
openvsiwtch-agent.log

Adding the openvsiwtch-agent.log
(In reply to Marius Cornea from comment #16)
> Created attachment 1267264 [details]
> openvsiwtch-agent.log
> 
> Adding the openvsiwtch-agent.log

Could one workaround be killing all the neutron-rootwrap-daemon processes once we have stopped the agents on the node?

That should be OK as long as all the other dataplane-related neutron services are already down. Even if any of them is still running, it will automatically relaunch the neutron-rootwrap-daemon as necessary; only if the service is running and executing commands at that moment could an ongoing command fail.
(In reply to Miguel Angel Ajo from comment #17)
> Could one workaround be killing all the neutron-rootwrap-daemon processes
> once we have stopped the agents on the node?

Btw, this was proposed by @dalvarez before. I didn't remember that we had independent "rootwrap-daemon" binary names per service, so I thought we risked killing other services' rootwrap-daemons, but since it's only neutron's it's more manageable IMO.
(In reply to Miguel Angel Ajo from comment #17)
> (In reply to Marius Cornea from comment #16)
> > Created attachment 1267264 [details]
> > openvsiwtch-agent.log
> > 
> > Adding the openvsiwtch-agent.log
> 
> Could one workaround be killing all the neutron-rootwrap-daemon processes
> once we have stopped the agents on the node?

FWIW, killing the neutron-rootwrap-daemon and restarting neutron-openvswitch-agent manually allowed me to recover connectivity and unstick the upgrade process.
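For reference, that manual recovery boils down to something like the following on the affected compute node (a rough sketch, assuming only the openvswitch agent is deployed there; not an official procedure):

# Kill the stale rootwrap daemon that is still holding the OpenFlow port
pkill -f neutron-rootwrap-daemon

# Restart the agent so it can bind to the port again and reinstall the flows
systemctl restart neutron-openvswitch-agent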
(In reply to Marius Cornea from comment #19)
> (In reply to Miguel Angel Ajo from comment #17)
> > (In reply to Marius Cornea from comment #16)
> > > Created attachment 1267264 [details]
> > > openvsiwtch-agent.log
> > > 
> > > Adding the openvsiwtch-agent.log
> > 
> > Could one workaround be killing all the neutron-rootwrap-daemon processes
> > once we have stopped the agents on the node?
> 
> FWIW, killing the neutron-rootwrap-daemon and restarting
> neutron-openvswitch-agent manually allowed me to recover connectivity and
> unstick the upgrade process.

Marius, for an automated (I hope) fail-proof solution we may need:

1) bringing down the neutron-openvswitch-agent (along with neutron-l3-agent or neutron-dhcp-agent)
2) killing all the neutron-rootwrap-daemons
3) updating the packages
4) starting the services again.

The ordering is important because, if you kill the daemon while a service is still running, that service could start the neutron-rootwrap-daemon again before you try to restart the service. A rough sketch of this sequence follows below.
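A rough shell sketch of that sequence (assuming only the openvswitch agent runs on the compute node; the package list in step 3 is illustrative and depends on the upgrade step):

# 1) Stop the neutron agents first, so nothing respawns the rootwrap daemon
systemctl stop neutron-openvswitch-agent

# 2) Kill any leftover rootwrap daemons still holding the OpenFlow port
pkill -f neutron-rootwrap-daemon || true

# 3) Update the packages (illustrative package set)
yum -y update openstack-neutron-openvswitch python-ryu

# 4) Start the services again
systemctl start neutron-openvswitch-agent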
*** Bug 1436729 has been marked as a duplicate of this bug. ***
(In reply to Miguel Angel Ajo from comment #20)
> (In reply to Marius Cornea from comment #19)
> > (In reply to Miguel Angel Ajo from comment #17)
> > > (In reply to Marius Cornea from comment #16)
> > > > Created attachment 1267264 [details]
> > > > openvsiwtch-agent.log
> > > > 
> > > > Adding the openvsiwtch-agent.log
> > > 
> > > Could one workaround be killing all the neutron-rootwrap-daemon processes
> > > once we have stopped the agents on the node?
> > 
> > FWIW, killing the neutron-rootwrap-daemon and restarting
> > neutron-openvswitch-agent manually allowed me to recover connectivity and
> > unstick the upgrade process.
> 
> Marius, for an automated (I hope) fail-proof solution we may need:
> 
> 1) bringing down the neutron-openvswitch-agent (along with neutron-l3-agent
> or neutron-dhcp-agent)
> 2) killing all the neutron-rootwrap-daemons
> 3) updating the packages
> 4) starting the services again.
> 
> The ordering is important because, if you kill the daemon while a service is
> still running, that service could start the neutron-rootwrap-daemon again
> before you try to restart the service.

Since the host IP addresses are static and not related to neutron, how will the suggested fix solve it?

See:
https://bugzilla.redhat.com/show_bug.cgi?id=1371840#c7
https://bugzilla.redhat.com/show_bug.cgi?id=1371840#c17
(In reply to Ofer Blaut from comment #22)
> Since the host IP addresses are static and not related to neutron, how will
> the suggested fix solve it?
> 
> See:
> https://bugzilla.redhat.com/show_bug.cgi?id=1371840#c7
> https://bugzilla.redhat.com/show_bug.cgi?id=1371840#c17

I don't get the question. The fix is not related to IP addresses. The ryu app always binds to localhost port 6633 by default (this can be changed in the config file). Once the rootwrap-daemon holding the port is killed, the port can be used by the new neutron-openvswitch-agent process.
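For reference, the address and port the native (ryu) OpenFlow driver binds to are controlled by the agent's of_listen_address and of_listen_port options (defaults 127.0.0.1 and 6633). A quick way to check what a given node is using (a sketch; the config file path is the one typically used on these nodes and may differ per deployment):

# Show the OpenFlow listener options configured for the agent's native driver;
# no output means the defaults (127.0.0.1:6633) are in effect
grep -E '^of_listen_(address|port)' /etc/neutron/plugins/ml2/openvswitch_agent.ini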
Adjusting the patch to point to stable/ocata. Not sure if it helps, but it's cleaner anyway.
Pushing back to ON_DEV as the RDO patch has not landed yet.
Adding an alternate workaround at the RPM level which will fix it.
*** Bug 1434484 has been marked as a duplicate of this bug. ***
I haven't been able to reproduce the issue reported in the initial report with the latest build.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1245