Bug 1425507
Summary: OSP10 -> OSP11 upgrade: compute nodes lose network connectivity and upgrade gets stuck
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Jakub Libosvar <jlibosva>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Priority: urgent
Version: 11.0 (Ocata)
CC: amuller, aschultz, bfournie, ccamacho, chrisw, dbecker, dsneddon, fleitner, jcoufal, jlibosva, jschluet, lruzicka, majopela, mandreou, mburns, mcornea, morazi, nyechiel, oblaut, rbartal, rhel-osp-director-maint, samccann, sasha, sathlang, skramaja, srevivo, yprokule
Target Milestone: rc
Keywords: Bugfix, Triaged
Target Release: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-neutron-10.0.0-12.el7ost, openstack-tripleo-heat-templates-6.0.0-0.5.el7ost
Doc Type: Bug Fix
Doc Text:
When the neutron-openvswitch-agent service was stopped, the shutdown sometimes took too long to complete gracefully and the process was killed by systemd. In that case, a running neutron-rootwrap-daemon remained on the system, preventing the neutron-openvswitch-agent service from restarting.
The problem has been fixed: an RPM scriptlet now detects the orphaned neutron-rootwrap-daemon and terminates it. As a result, the neutron-openvswitch-agent service starts and restarts successfully.
Last Closed: 2017-05-17 20:01:25 UTC
Type: Bug
Attachments: compute upgrade log (attachment 1256172), openvsiwtch-agent.log (attachment 1267264)
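The Doc Text above describes the fix as an RPM scriptlet that detects and terminates an orphaned neutron-rootwrap-daemon. The shipped scriptlet is not reproduced in this bug, so the following is only a minimal sketch of the idea; the unit and process names come from this report, everything else is an assumption.

```bash
# Hypothetical cleanup sketch, NOT the scriptlet shipped in
# openstack-neutron-10.0.0-12.el7ost. It only illustrates the idea from the
# Doc Text: if the agent is down but its rootwrap daemon survived, kill it
# so the next agent start can bind its OpenFlow port again.

if ! systemctl is-active --quiet neutron-openvswitch-agent; then
    # -f matches against the full command line, so only neutron's rootwrap
    # daemon (started with /etc/neutron/rootwrap.conf) is targeted.
    if pgrep -f 'neutron-rootwrap-daemon /etc/neutron/rootwrap.conf' >/dev/null; then
        pkill -9 -f 'neutron-rootwrap-daemon /etc/neutron/rootwrap.conf'
    fi
fi
```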
Description (Marius Cornea, 2017-02-21 15:27:55 UTC)
Created attachment 1256172 [details]
compute upgrade log
---

Fix upstream: https://review.openstack.org/#/c/436990/ (by Marios); flipping back to the DF DFG.

---

After applying patch 436990 I am still seeing sporadic issues with compute nodes losing network connectivity. I suspect this is a different issue, as the messages in ovs-vswitchd.log now show a 'connection timed out' error. The process listening on 6633 is sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf.

[root@overcloud-compute-1 heat-admin]# tail -f /var/log/openvswitch/ovs-vswitchd.log
2017-03-13T14:24:40.594Z|00094|rconn|INFO|br-tun<->tcp:127.0.0.1:6633: connecting...
2017-03-13T14:24:40.594Z|00095|rconn|INFO|br-infra<->tcp:127.0.0.1:6633: connecting...
2017-03-13T14:24:44.593Z|00096|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connection timed out
2017-03-13T14:24:44.593Z|00097|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: continuing to retry connections in the background but suppressing further logging
2017-03-13T14:24:44.594Z|00098|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connection timed out
2017-03-13T14:24:44.594Z|00099|rconn|INFO|br-int<->tcp:127.0.0.1:6633: continuing to retry connections in the background but suppressing further logging
2017-03-13T14:24:44.594Z|00100|rconn|INFO|br-tun<->tcp:127.0.0.1:6633: connection timed out
2017-03-13T14:24:44.594Z|00101|rconn|INFO|br-tun<->tcp:127.0.0.1:6633: continuing to retry connections in the background but suppressing further logging
2017-03-13T14:24:44.594Z|00102|rconn|INFO|br-infra<->tcp:127.0.0.1:6633: connection timed out
2017-03-13T14:24:44.594Z|00103|rconn|INFO|br-infra<->tcp:127.0.0.1:6633: continuing to retry connections in the background but suppressing further logging

---

I noticed a few things based on a quick review:
* It's upgrading to ovs 2.6.1-8, while it should have been 2.6.1-10.
* It's using --nopostun, but that should not be needed unless you can't have a service restart at all.
* OVS can't communicate with the local controller (127.0.0.1), so it sounds like another agent is having a problem.
* With bridges in secure mode and without a controller, OVS will not allow flows to pass, causing connectivity issues on those bridges.

Please clarify.

---

(In reply to Flavio Leitner from comment #6)
> * It's upgrading to ovs 2.6.1-8, while it should have been 2.6.1-10.

This is what we have in the latest OSP10 puddle: openvswitch-2.6.1-8.git20161206.el7fdb.x86_64.rpm

> * It's using --nopostun, but that should not be needed unless you can't have a service restart at all.
> * OVS can't communicate with the local controller (127.0.0.1), so it sounds like another agent is having a problem.
> * With bridges in secure mode and without a controller, OVS will not allow flows to pass, causing connectivity issues on those bridges.
>
> Please clarify.

From the compute node upgrade perspective we do it via a single command, 'upgrade-non-controller.sh --upgrade $node', so the upgrade process should be transparent to the user. Please let me know if there's anything you want me to check on the compute node during the upgrade process, while connectivity is lost.
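A few checks that could be run on an affected compute node while connectivity is lost, following Flavio's points above. This is a hedged sketch: the bridge names and the port number are taken from the logs in this bug, and the commands are assumed to be available on the node.

```bash
# Which process is holding the local OpenFlow port the bridges try to reach?
ss -tlnp 'sport = :6633'

# Are the bridges in secure fail-mode (no forwarding without a controller)?
for br in br-ex br-int br-tun br-infra; do
    echo -n "$br fail_mode: "
    ovs-vsctl get-fail-mode "$br"
done

# Is the agent itself running, or did only its rootwrap daemon survive?
systemctl status neutron-openvswitch-agent --no-pager
pgrep -af neutron-rootwrap-daemon
```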
---

(In reply to Flavio Leitner from comment #6)
> * It's upgrading to ovs 2.6.1-8, while it should have been 2.6.1-10.

Mike Burns and I talked about this on IRC; OSP 10 and 11 will get 2.6.1-10 in the next puddle.

---

I looked yesterday at the environment Marius had, and he was able to reproduce. The cause is the python-ryu version in use, which contains a bug (see LP 1589746). An error during neutron-openvswitch-agent shutdown left port 6633 open. The next start attempt of the ovs-agent then fails to bind to that port, so the ovs-agent cannot connect to the OpenFlow controller, which leads to missing NORMAL action flows on the bridge (this is something I don't understand: why did the flows disappear when the bridges are in secure mode?). This is an issue in ryu; I'm taking this BZ.

---

I am able to consistently reproduce this issue on environments with a higher number of compute nodes. From the openvswitch-agent.log it looks like it's the ryu bug:

2017-03-28 23:37:41.954 43811 ERROR ryu.lib.hub [-] hub: uncaught exception: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 54, in _launch
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 545, in close
    # This semaphore prevents parallel execution of this function,
  File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 528, in uninstantiate
    def uninstantiate(self, name):
KeyError: 'ofctl_service'

Moreover, this issue appears to affect not only overcloud upgrades but the undercloud upgrade as well, preventing operations such as adding overcloud nodes. Please see bug 1436729 and bug 1432028 for reference.

---

Created attachment 1267264 [details]
openvsiwtch-agent.log
Adding the openvsiwtch-agent.log
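Given the failure mode Jakub describes above (ryu dies on shutdown, the rootwrap daemon keeps port 6633 bound, and the restarted agent never reprograms the bridges), one way to confirm a node is in this state is to look at the flow tables directly. A small sketch under those assumptions:

```bash
# On a healthy node the bridges carry flows installed by the agent
# (for example entries ending in "actions=NORMAL"); on a broken node the
# secure-mode bridges are left without usable flows, so traffic stops.
for br in br-ex br-int br-tun br-infra; do
    echo "=== $br ==="
    ovs-ofctl dump-flows "$br"
done
```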
---

(In reply to Marius Cornea from comment #16)
> Created attachment 1267264 [details]
> openvsiwtch-agent.log
>
> Adding the openvsiwtch-agent.log

Could one workaround be killing all the neutron-rootwrap-daemon processes once we have stopped the agents on the node? That should be OK as long as all the other dataplane-related neutron services are already down. Even if any of them is still running, they will automatically relaunch the neutron-rootwrap-daemon as necessary; only if the service is running and executing commands could an ongoing command fail.

---

(In reply to Miguel Angel Ajo from comment #17)
> Could one workaround be killing all the neutron-rootwrap-daemon processes
> once we have stopped the agents on the node?

By the way, this was proposed by @dalvarez before. I didn't remember whether we have independent "rootwrap-daemon" binary names per service, and I thought we had the risk of killing other services' rootwrap-daemons, but if it's only neutron, it's more manageable IMO.

---

(In reply to Miguel Angel Ajo from comment #17)
> Could one workaround be killing all the neutron-rootwrap-daemon processes
> once we have stopped the agents on the node?

FWIW, killing the neutron-rootwrap-daemon and restarting neutron-openvswitch-agent manually allowed me to recover connectivity and unstick the upgrade process.

---

(In reply to Marius Cornea from comment #19)
> FWIW, killing the neutron-rootwrap-daemon and restarting
> neutron-openvswitch-agent manually allowed me to recover connectivity and
> unstick the upgrade process.

Marius, for an automated (and hopefully fail-proof) solution we may need to:

1) bring down the neutron-openvswitch-agent (along with neutron-l3-agent or neutron-dhcp-agent)
2) kill all the neutron-rootwrap-daemons
3) update the packages
4) start the services again

This is important because, if you only kill the daemon, a still-running service could start the neutron-rootwrap-daemon again before you try to restart it.
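A minimal sketch of the sequence proposed in comment #20, assuming the node runs these exact service names (on compute nodes only neutron-openvswitch-agent may be present) and that updating openstack-neutron and openvswitch is what the upgrade step amounts to; this is an illustration, not the tooling used by upgrade-non-controller.sh.

```bash
set -eu

# 1) Stop the dataplane agents first so nothing can respawn the rootwrap daemon.
for svc in neutron-openvswitch-agent neutron-l3-agent neutron-dhcp-agent; do
    systemctl stop "$svc" 2>/dev/null || true   # agent may not exist on this node
done

# 2) Kill any neutron-rootwrap-daemon left behind; it may still hold port 6633.
pkill -9 -f neutron-rootwrap-daemon || true

# 3) Update the packages (package set assumed for illustration).
yum -y update 'openstack-neutron*' openvswitch

# 4) Start the services again.
for svc in neutron-openvswitch-agent neutron-l3-agent neutron-dhcp-agent; do
    systemctl start "$svc" 2>/dev/null || true
done
```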
*** Bug 1436729 has been marked as a duplicate of this bug. ***

---

(In reply to Miguel Angel Ajo from comment #20)
> Marius, for an automated (and hopefully fail-proof) solution we may need to:
> 1) bring down the neutron-openvswitch-agent (along with neutron-l3-agent or neutron-dhcp-agent)
> 2) kill all the neutron-rootwrap-daemons
> 3) update the packages
> 4) start the services again

Since the host IP addresses are static and not related to neutron, how will the suggested fix solve it?

See:
https://bugzilla.redhat.com/show_bug.cgi?id=1371840#c7
https://bugzilla.redhat.com/show_bug.cgi?id=1371840#c17

---

(In reply to Ofer Blaut from comment #22)
> Since the host IP addresses are static and not related to neutron, how will
> the suggested fix solve it?

I don't get the question. The fix is not related to IP addresses. The ryu app always binds to localhost port 6640 (this can be configured in the config file). Once the rootwrap-daemon holding the port is killed, the port can be used by the new neutron-openvswitch-agent process.

---

Adjusting the patch to point to stable/ocata. Not sure if it helps, but it's cleaner anyway.

---

Pushing back to ON_DEV as the RDO patch has not landed yet.

---

Adding an alternate workaround at the RPM level which will fix it.

---

*** Bug 1434484 has been marked as a duplicate of this bug. ***

---

I haven't been able to reproduce the issue reported in the initial report with the latest build.

---

*** Bug 1436729 has been marked as a duplicate of this bug. ***

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245