Bug 1425507
Summary: OSP10 -> OSP11 upgrade: compute nodes lose network connectivity and upgrade gets stuck
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Jakub Libosvar <jlibosva>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Priority: urgent
Version: 11.0 (Ocata)
CC: amuller, aschultz, bfournie, ccamacho, chrisw, dbecker, dsneddon, fleitner, jcoufal, jlibosva, jschluet, lruzicka, majopela, mandreou, mburns, mcornea, morazi, nyechiel, oblaut, rbartal, rhel-osp-director-maint, samccann, sasha, sathlang, skramaja, srevivo, yprokule
Target Milestone: rc
Keywords: Bugfix, Triaged
Target Release: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-neutron-10.0.0-12.el7ost, openstack-tripleo-heat-templates-6.0.0-0.5.el7ost
Doc Type: Bug Fix
Doc Text:
When the neutron-openvswitch-agent service was stopped, the shutdown sometimes took too long to complete gracefully and the process was killed by systemd. In that case, a running neutron-rootwrap-daemon remained on the system, preventing the neutron-openvswitch-agent service from restarting.
The problem has been fixed: an RPM scriptlet now detects the orphaned neutron-rootwrap-daemon and terminates it. As a result, the neutron-openvswitch-agent service starts and restarts successfully.
Last Closed: 2017-05-17 20:01:25 UTC
Type: Bug
Attachments: compute upgrade log (attachment 1256172), openvsiwtch-agent.log (attachment 1267264)
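The Doc Text above describes the fix as an RPM scriptlet that detects and terminates an orphaned neutron-rootwrap-daemon. The shipped scriptlet is not reproduced in this bug, so the following is only a minimal sketch of the idea; the unit and process names come from this report, everything else is an assumption.

```bash
# Hypothetical cleanup sketch, NOT the scriptlet shipped in
# openstack-neutron-10.0.0-12.el7ost. It only illustrates the idea from the
# Doc Text: if the agent is down but its rootwrap daemon survived, kill it
# so the next agent start can bind its OpenFlow port again.

if ! systemctl is-active --quiet neutron-openvswitch-agent; then
    # -f matches against the full command line, so only neutron's rootwrap
    # daemon (started with /etc/neutron/rootwrap.conf) is targeted.
    if pgrep -f 'neutron-rootwrap-daemon /etc/neutron/rootwrap.conf' >/dev/null; then
        pkill -9 -f 'neutron-rootwrap-daemon /etc/neutron/rootwrap.conf'
    fi
fi
```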
Description (Marius Cornea, 2017-02-21 15:27:55 UTC)
Created attachment 1256172 [details]
compute upgrade log
---

Fix upstream: https://review.openstack.org/#/c/436990/ (by Marios); flipping back to the DF DFG.

---

After applying patch 436990 I am still seeing sporadic issues with compute nodes losing network connectivity. I suspect this is a different issue, as the messages in ovs-vswitchd.log now show a 'connection timed out' error. The process listening on 6633 is sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf.

[root@overcloud-compute-1 heat-admin]# tail -f /var/log/openvswitch/ovs-vswitchd.log
2017-03-13T14:24:40.594Z|00094|rconn|INFO|br-tun<->tcp:127.0.0.1:6633: connecting...
2017-03-13T14:24:40.594Z|00095|rconn|INFO|br-infra<->tcp:127.0.0.1:6633: connecting...
2017-03-13T14:24:44.593Z|00096|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connection timed out
2017-03-13T14:24:44.593Z|00097|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: continuing to retry connections in the background but suppressing further logging
2017-03-13T14:24:44.594Z|00098|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connection timed out
2017-03-13T14:24:44.594Z|00099|rconn|INFO|br-int<->tcp:127.0.0.1:6633: continuing to retry connections in the background but suppressing further logging
2017-03-13T14:24:44.594Z|00100|rconn|INFO|br-tun<->tcp:127.0.0.1:6633: connection timed out
2017-03-13T14:24:44.594Z|00101|rconn|INFO|br-tun<->tcp:127.0.0.1:6633: continuing to retry connections in the background but suppressing further logging
2017-03-13T14:24:44.594Z|00102|rconn|INFO|br-infra<->tcp:127.0.0.1:6633: connection timed out
2017-03-13T14:24:44.594Z|00103|rconn|INFO|br-infra<->tcp:127.0.0.1:6633: continuing to retry connections in the background but suppressing further logging

---

I noticed a few things based on a quick review:
* It's upgrading to ovs 2.6.1-8, while it should have been 2.6.1-10.
* It's using --nopostun, but that should not be needed unless you can't have a service restart at all.
* OVS can't communicate with the local controller (127.0.0.1), so it sounds like another agent is having a problem.
* With bridges in secure mode and without a controller, OVS will not allow flows to pass, causing connectivity issues on those bridges.

Please clarify.

---

(In reply to Flavio Leitner from comment #6)
> * It's upgrading to ovs 2.6.1-8, while it should have been 2.6.1-10.

This is what we have in the latest OSP10 puddle: openvswitch-2.6.1-8.git20161206.el7fdb.x86_64.rpm

> * It's using --nopostun, but that should not be needed unless you can't have a service restart at all.
> * OVS can't communicate with the local controller (127.0.0.1), so it sounds like another agent is having a problem.
> * With bridges in secure mode and without a controller, OVS will not allow flows to pass, causing connectivity issues on those bridges.
>
> Please clarify.

From the compute node upgrade perspective we do it via a single command, 'upgrade-non-controller.sh --upgrade $node', so the upgrade process should be transparent to the user. Please let me know if there's anything you want me to check on the compute node during the upgrade process, while connectivity is lost.
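A few checks that could be run on an affected compute node while connectivity is lost, following Flavio's points above. This is a hedged sketch: the bridge names and the port number are taken from the logs in this bug, and the commands are assumed to be available on the node.

```bash
# Which process is holding the local OpenFlow port the bridges try to reach?
ss -tlnp 'sport = :6633'

# Are the bridges in secure fail-mode (no forwarding without a controller)?
for br in br-ex br-int br-tun br-infra; do
    echo -n "$br fail_mode: "
    ovs-vsctl get-fail-mode "$br"
done

# Is the agent itself running, or did only its rootwrap daemon survive?
systemctl status neutron-openvswitch-agent --no-pager
pgrep -af neutron-rootwrap-daemon
```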
---

(In reply to Flavio Leitner from comment #6)
> * It's upgrading to ovs 2.6.1-8, while it should have been 2.6.1-10.

Mike Burns and I talked about this on IRC; OSP 10 and 11 will get 2.6.1-10 in the next puddle.

---

I looked yesterday at the environment Marius had, and he was able to reproduce. The cause is the python-ryu version in use, which contains a bug (see LP 1589746). An error during neutron-openvswitch-agent shutdown left port 6633 open. The next start attempt of the ovs-agent then fails to bind to that port, so the ovs-agent cannot connect to the OpenFlow controller, which leads to missing NORMAL action flows on the bridge (this is something I don't understand: why did the flows disappear when the bridges are in secure mode?). This is an issue in ryu; I'm taking this BZ.

---

I am able to consistently reproduce this issue on environments with a higher number of compute nodes. From the openvswitch-agent.log it looks like it's the ryu bug:

2017-03-28 23:37:41.954 43811 ERROR ryu.lib.hub [-] hub: uncaught exception: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 54, in _launch
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 545, in close
    # This semaphore prevents parallel execution of this function,
  File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 528, in uninstantiate
    def uninstantiate(self, name):
KeyError: 'ofctl_service'

Moreover, this issue appears to affect not only overcloud upgrades but the undercloud upgrade as well, preventing operations such as adding overcloud nodes. Please see bug 1436729 and bug 1432028 for reference.

---

Created attachment 1267264 [details]
openvsiwtch-agent.log
Adding the openvsiwtch-agent.log
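Given the failure mode Jakub describes above (ryu dies on shutdown, the rootwrap daemon keeps port 6633 bound, and the restarted agent never reprograms the bridges), one way to confirm a node is in this state is to look at the flow tables directly. A small sketch under those assumptions:

```bash
# On a healthy node the bridges carry flows installed by the agent
# (for example entries ending in "actions=NORMAL"); on a broken node the
# secure-mode bridges are left without usable flows, so traffic stops.
for br in br-ex br-int br-tun br-infra; do
    echo "=== $br ==="
    ovs-ofctl dump-flows "$br"
done
```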
---

(In reply to Marius Cornea from comment #16)
> Created attachment 1267264 [details]
> openvsiwtch-agent.log
>
> Adding the openvsiwtch-agent.log

Could one workaround be killing all the neutron-rootwrap-daemon processes once we have stopped the agents on the node? That should be OK as long as all the other dataplane-related neutron services are already down. Even if any of them is still running, they will automatically relaunch the neutron-rootwrap-daemon as necessary; only if the service is running and executing commands could an ongoing command fail.

---

(In reply to Miguel Angel Ajo from comment #17)
> Could one workaround be killing all the neutron-rootwrap-daemon processes
> once we have stopped the agents on the node?

By the way, this was proposed by @dalvarez before. I didn't remember whether we have independent "rootwrap-daemon" binary names per service, and I thought we had the risk of killing other services' rootwrap-daemons, but if it's only neutron, it's more manageable IMO.

---

(In reply to Miguel Angel Ajo from comment #17)
> Could one workaround be killing all the neutron-rootwrap-daemon processes
> once we have stopped the agents on the node?

FWIW, killing the neutron-rootwrap-daemon and restarting neutron-openvswitch-agent manually allowed me to recover connectivity and unstick the upgrade process.

---

(In reply to Marius Cornea from comment #19)
> FWIW, killing the neutron-rootwrap-daemon and restarting
> neutron-openvswitch-agent manually allowed me to recover connectivity and
> unstick the upgrade process.

Marius, for an automated (and hopefully fail-proof) solution we may need to:

1) bring down the neutron-openvswitch-agent (along with neutron-l3-agent or neutron-dhcp-agent)
2) kill all the neutron-rootwrap-daemons
3) update the packages
4) start the services again

This is important because, if you only kill the daemon, a still-running service could start the neutron-rootwrap-daemon again before you try to restart it.
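A minimal sketch of the sequence proposed in comment #20, assuming the node runs these exact service names (on compute nodes only neutron-openvswitch-agent may be present) and that updating openstack-neutron and openvswitch is what the upgrade step amounts to; this is an illustration, not the tooling used by upgrade-non-controller.sh.

```bash
set -eu

# 1) Stop the dataplane agents first so nothing can respawn the rootwrap daemon.
for svc in neutron-openvswitch-agent neutron-l3-agent neutron-dhcp-agent; do
    systemctl stop "$svc" 2>/dev/null || true   # agent may not exist on this node
done

# 2) Kill any neutron-rootwrap-daemon left behind; it may still hold port 6633.
pkill -9 -f neutron-rootwrap-daemon || true

# 3) Update the packages (package set assumed for illustration).
yum -y update 'openstack-neutron*' openvswitch

# 4) Start the services again.
for svc in neutron-openvswitch-agent neutron-l3-agent neutron-dhcp-agent; do
    systemctl start "$svc" 2>/dev/null || true
done
```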
*** Bug 1436729 has been marked as a duplicate of this bug. ***

---

(In reply to Miguel Angel Ajo from comment #20)
> Marius, for an automated (and hopefully fail-proof) solution we may need to:
> 1) bring down the neutron-openvswitch-agent (along with neutron-l3-agent or neutron-dhcp-agent)
> 2) kill all the neutron-rootwrap-daemons
> 3) update the packages
> 4) start the services again

Since the host IP addresses are static and not related to neutron, how will the suggested fix solve it?

See:
https://bugzilla.redhat.com/show_bug.cgi?id=1371840#c7
https://bugzilla.redhat.com/show_bug.cgi?id=1371840#c17

---

(In reply to Ofer Blaut from comment #22)
> Since the host IP addresses are static and not related to neutron, how will
> the suggested fix solve it?

I don't get the question. The fix is not related to IP addresses. The ryu app always binds to localhost port 6640 (this can be configured in the config file). Once the rootwrap-daemon holding the port is killed, the port can be used by the new neutron-openvswitch-agent process.

---

Adjusting the patch to point to stable/ocata. Not sure if it helps, but it's cleaner anyway.

---

Pushing back to ON_DEV as the RDO patch has not landed yet.

---

Adding an alternate workaround at the RPM level which will fix it.

---

*** Bug 1434484 has been marked as a duplicate of this bug. ***

---

I haven't been able to reproduce the issue reported in the initial report with the latest build.

---

*** Bug 1436729 has been marked as a duplicate of this bug. ***

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245