Bug 1387498
| Summary: | Stop and start openvswitch does not start neutron-openvswitch-agent, restart does | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Sadique Puthen <sputhenp> |
| Component: | openstack-neutron | Assignee: | Assaf Muller <amuller> |
| Status: | CLOSED DUPLICATE | QA Contact: | Toni Freger <tfreger> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 10.0 (Newton) | CC: | amuller, chrisw, ihrachys, mschuppe, nyechiel, sputhenp, srevivo, vaggarwa, vcojot |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-12-12 09:02:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Comment 1
Ihar Hrachyshka
2016-10-21 12:15:27 UTC
The origin of this bug is that the OVS agent should be resilient to OVS crashing. We know that stopping and starting the OVS service via systemd does not start the OVS agent for the reasons Ihar explained above. Sadique, can you test that killing the OVS processes behaves correctly, that OVS is started again by systemd, and that the OVS agent recovers from OVS crashing correctly?

(In reply to Assaf Muller from comment #2)
> The origin of this bug is that the OVS agent should be resilient to OVS
> crashing. We know that stopping and starting the OVS service via systemd
> does not start the OVS agent for the reasons Ihar explained above. Sadique,
> can you test that killing the OVS processes behaves correctly, that OVS is
> started again by systemd, and that the OVS agent recovers from OVS crashing
> correctly?

I tested this and I can see that crashing the process by killing the pid of openvswitch does not restart openvswitch at all, which is not good. Has the openvswitch service been configured to auto-restart on crash? I can't see Restart= set in the systemd unit file. When I tried to set it as below:

Restart=always
RestartSec=3

I got the following:

Oct 21 16:49:08 overcloud-novacompute-1.localdomain systemd[1]: openvswitch.service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.

So we need to do a crash and recovery test for openvswitch. I am using osp-10 beta for this test.

(In reply to Sadique Puthen from comment #3)
> (In reply to Assaf Muller from comment #2)
> > The origin of this bug is that the OVS agent should be resilient to OVS
> > crashing. We know that stopping and starting the OVS service via systemd
> > does not start the OVS agent for the reasons Ihar explained above. Sadique,
> > can you test that killing the OVS processes behaves correctly, that OVS is
> > started again by systemd, and that the OVS agent recovers from OVS crashing
> > correctly?
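The Restart= refusal above follows from the unit's Type=oneshot: systemd only honors a restart policy for long-running service types. Purely as an illustration (not a supported Red Hat configuration), a restart policy would have to target the actual long-running daemon unit rather than the oneshot wrapper. The drop-in path and the `ovs-vswitchd.service` unit name below are assumptions about the host's packaging:

```ini
# Hypothetical drop-in: /etc/systemd/system/ovs-vswitchd.service.d/restart.conf
# (assumes the host ships a long-running ovs-vswitchd.service; the
# Type=oneshot openvswitch.service wrapper rejects Restart= outright)
[Service]
Restart=always
RestartSec=3
```

After adding such a drop-in, `systemctl daemon-reload` would be needed for it to take effect.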
> I tested this and I can see that crashing the process by killing the pid of
> openvswitch does not restart openvswitch at all, which is not good. Has the
> openvswitch service been configured to auto-restart on crash? I can't see
> Restart= set in the systemd unit file. When I tried to set it as below:
>
> Restart=always
> RestartSec=3
>
> I got the following:
>
> Oct 21 16:49:08 overcloud-novacompute-1.localdomain systemd[1]:
> openvswitch.service has Restart= setting other than no, which isn't allowed
> for Type=oneshot services. Refusing.
>
> So we need to do a crash and recovery test for openvswitch. I am using
> osp-10 beta for this test.

I spoke too soon. I just did "kill pid", which didn't restart the openvswitch process. Now I did "kill -11 pid". The latter, which is more equivalent to a crash, in fact restarts openvswitch and reinstates all flows. Assaf, I would like to reopen this, as Flavio already clarified that there should not be any difference between stopping and starting and restarting openvswitch. Agree?

Assaf, IHAC who also came up with the same query:

~~~
The following sequence on one of the controllers:

* pcs cluster standby $(hostname -s)
* wait a few minutes
* pcs cluster stop
* reboot

results in a configuration where I cannot ping the interfaces connected to the OVS bridge. In this situation the openvswitch service is started and running (active) while neutron-openvswitch-agent is disabled and inactive.

These are the flows after reboot:

#####################
[root@controller-1 ~]# uptime
 10:08:02 up 1 min, 1 user, load average: 1.35, 0.78, 0.30
[root@controller-1 ~]# ovs-ofctl dump-flows br-ex
NXST_FLOW reply (xid=0x4):
#####################

If I think about it a bit more, the following is probably the problem: Pacemaker/Corosync talk to the other controllers via an interface that is managed by openvswitch, which itself is controlled by neutron-openvswitch-agent, which will not be started because Pacemaker/Corosync cannot talk to their HA partners.
If I manually start neutron-openvswitch-agent then the flows are restored and I can ping the various interfaces connected to br-ex. Then I can also "pcs cluster start" and "pcs cluster unstandby $(hostname -s)" and continue using the cloud.

If this theory is correct, then the documentation should probably be updated to discourage users from using an OVS-managed interface as the management interface of the overcloud.
~~~

I guess it's better to reopen this bug. Kindly let us know your thoughts on it.

A workaround for this is to add an "ovs-ofctl add-flow" to /etc/rc.local so that the basic flow gets added and the node can join the cluster on boot:

# cat /etc/rc.local
~~~
#!/bin/bash
ovs-ofctl add-flow br-ex priority=0,actions=normal
touch /var/lock/subsys/local
~~~

Make sure that rc.local can be run:

# chmod +x /etc/rc.d/rc.local

From tests and feedback, the network got restored when the system came up after:

$ pcs cluster stop overcloud-controller-X
$ reboot

Ok, what Vikrant and I were looking at seems to be a duplicate of bug 1386299, and the root cause is that the bridge mode has changed to secure mode: https://bugzilla.redhat.com/show_bug.cgi?id=1386299#c46

As mentioned in bug 1386299, the solution would be to set br-ex to fail_mode=standalone in the OVS_EXTRA in /etc/sysconfig/network-scripts/ifcfg-br-ex.

Ok, my last update was not correct. For OSP 9 and OSP 8 there will be a change to neutron to revert the OVS agent change that put bridges in secure mode. BZ 1387498 tracks the fix for OSP8.

Fixed in openstack-neutron-7.2.0-3.el7ost, which is at the moment on QA.

Correction: BZ 1394894 tracks the fix for OSP8.

I'm a bit lost; in light of the secure bridge mode issue handled for OSP 8, 9 and 10 in separate RHBZs, is there any merit to this RHBZ, or any action expected from Engineering? I think we can close this BZ as a duplicate of one of the others.

*** This bug has been marked as a duplicate of bug 1394890 ***
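For reference, the fail_mode=standalone change mentioned for bug 1386299 is carried in the bridge's ifcfg file via OVS_EXTRA. This is a sketch only; the surrounding keys are the usual initscripts OVS bridge settings and must be adapted to the actual deployment:

~~~
# Sketch of /etc/sysconfig/network-scripts/ifcfg-br-ex (adapt to your deployment)
DEVICE=br-ex
DEVICETYPE=ovs
TYPE=OVSBridge
ONBOOT=yes
BOOTPROTO=none
# Keep the bridge forwarding in standalone mode even when no controller
# (neutron-openvswitch-agent) has connected yet:
OVS_EXTRA="set bridge br-ex fail_mode=standalone"
~~~

With fail_mode=standalone, the bridge falls back to normal MAC-learning switching when no OpenFlow controller is present, which is why the rc.local workaround above becomes unnecessary.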