Bug 1387498 - Stop and start openvswitch does not start neutron-openvswitch-agent, restart does
Summary: Stop and start openvswitch does not start neutron-openvswitch-agent, restart ...
Keywords:
Status: CLOSED DUPLICATE of bug 1394890
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Assaf Muller
QA Contact: Toni Freger
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-10-21 06:01 UTC by Sadique Puthen
Modified: 2020-04-15 14:45 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-12-12 09:02:31 UTC
Target Upstream Version:
Embargoed:



Comment 1 Ihar Hrachyshka 2016-10-21 12:15:27 UTC
I think it's expected behaviour. If you start the OVS agent, it correctly also starts openvswitch, because the agent 'Requires=' it. When you stop the openvswitch service, it correctly stops the OVS agent because, again, the agent 'Requires=' openvswitch to run.

But later, when you start the openvswitch service, there is no way for the system to deduce that what you really meant was not 'just start the openvswitch service' but 'start the openvswitch service AND all services that require it'. So you really do need to start the agent itself.
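The dependency described above can be sketched as a unit-file fragment. This is a hypothetical illustration of the relevant directives, not the exact unit file shipped in OSP: with `Requires=`, stopping the required unit propagates to the requiring unit, and an explicit restart of the required unit is propagated too, but a plain start of the required unit never starts the units that Require it.

```ini
# Hypothetical sketch (not the exact shipped unit file):
# /usr/lib/systemd/system/neutron-openvswitch-agent.service
[Unit]
# Starting the agent pulls in openvswitch, and stopping or restarting
# openvswitch stops or restarts the agent. The reverse direction
# (plain "systemctl start openvswitch") does NOT start the agent,
# which matches the behaviour in this bug's title.
Requires=openvswitch.service
After=openvswitch.service
```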

Comment 2 Assaf Muller 2016-10-21 12:20:19 UTC
The origin of this bug is that the OVS agent should be resilient to OVS crashing. We know that stopping and starting the OVS service via systemd does not start the OVS agent for the reasons Ihar explained above. Sadique, can you test that killing the OVS processes behaves correctly, that OVS is started again by systemd, and that the OVS agent recovers from OVS crashing correctly?

Comment 3 Sadique Puthen 2016-10-21 16:57:57 UTC
(In reply to Assaf Muller from comment #2)
> The origin of this bug is that the OVS agent should be resilient to OVS
> crashing. We know that stopping and starting the OVS service via systemd
> does not start the OVS agent for the reasons Ihar explained above. Sadique,
> can you test that killing the OVS processes behaves correctly, that OVS is
> started again by systemd, and that the OVS agent recovers from OVS crashing
> correctly?

I tested this, and I can see that crashing the process by killing the pid of openvswitch does not restart openvswitch at all, which is not good. Has the openvswitch service been configured to auto-restart on crash? I can't see Restart= set in the systemd unit file. When I tried to set it as below,

Restart=always
RestartSec=3

I got the following:

Oct 21 16:49:08 overcloud-novacompute-1.localdomain systemd[1]: openvswitch.service has Restart= setting other than no, which isn't allowed for Type=oneshot services. Refusing.

So we need to do a crash-and-recovery test for openvswitch. I am using the OSP 10 beta for this test.
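For reference, the attempted override could be placed in a drop-in file; the path below is a hypothetical example of what was tried. systemd refuses it because openvswitch.service is Type=oneshot, and oneshot services only allow Restart=no:

```ini
# Hypothetical drop-in, e.g.
# /etc/systemd/system/openvswitch.service.d/restart.conf
# systemd rejects this ("Refusing.") because openvswitch.service is
# Type=oneshot, and Restart= other than "no" is not allowed there.
[Service]
Restart=always
RestartSec=3
```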

Comment 4 Sadique Puthen 2016-10-21 17:08:05 UTC
(In reply to Sadique Puthen from comment #3)
> (In reply to Assaf Muller from comment #2)
> > The origin of this bug is that the OVS agent should be resilient to OVS
> > crashing. We know that stopping and starting the OVS service via systemd
> > does not start the OVS agent for the reasons Ihar explained above. Sadique,
> > can you test that killing the OVS processes behaves correctly, that OVS is
> > started again by systemd, and that the OVS agent recovers from OVS crashing
> > correctly?
> 
> I tested this and I can see that crashing the process by killing the pid of
> openvswitch does not even restart the openvswitch at all which is not good.
> Has openvswitch service been configured to auto restart on crash? I can't
> see Restart= set in systemd unit file. When I tried to set as below.
> 
> Restart=always
> RestartSec=3
> 
> I got below.
> 
> Oct 21 16:49:08 overcloud-novacompute-1.localdomain systemd[1]:
> openvswitch.service has Restart= setting other than no, which isn't allowed
> for Type=oneshot services. Refusing.
> 
> So we need to do crash and recovery test for openvswitch. I am using osp-10
> beta for this test.

I spoke too soon. I just did "kill pid", which didn't restart the openvswitch process. Now I did "kill -11 pid". The latter, which is more equivalent to a crash, in fact restarts openvswitch and reinstates all flows.

Comment 5 Assaf Muller 2016-10-21 17:09:53 UTC
Closing as per comments 1, 2 and 4.

Comment 6 Sadique Puthen 2016-10-27 20:27:40 UTC
Assaf,

I would like to reopen this, as Flavio already clarified that there should not be any difference between stopping-then-starting and restarting openvswitch. Agree?

Comment 7 VIKRANT 2016-11-02 11:33:09 UTC
Assaf,

IHAC (I have a customer) who also came up with the same query:

~~~
The following sequence on one of the controllers

* pcs cluster standby $(hostname -s)
* wait a few minutes
* pcs cluster stop
* reboot

results in a configuration where I cannot ping the interfaces connected to the OVS bridge. In this situation the openvswitch service is started and running (active) while neutron-openvswitch-agent is disabled and inactive.

These are the flows after reboot:
#####################
[root@controller-1 ~]# uptime
 10:08:02 up 1 min,  1 user,  load average: 1.35, 0.78, 0.30

[root@controller-1 ~]# ovs-ofctl dump-flows br-ex
NXST_FLOW reply (xid=0x4):
#####################

Thinking about it a bit more, the following is probably the problem: Pacemaker/Corosync talk to the other controllers via an interface that is managed by openvswitch, whose flows are in turn programmed by neutron-openvswitch-agent; the agent will not be started because Pacemaker/Corosync cannot talk to their HA partners.

If I manually start neutron-openvswitch-agent then the flows are restored and I can ping the various interfaces connected to br-ex. Then I can also "pcs cluster start" and "pcs cluster unstandby $(hostname -s)" and continue using the cloud.

If this theory is correct, then the documentation should probably be updated to discourage users from using an OVS-managed interface as the management interface of the overcloud.
~~~

I guess it's better to reopen this bug. Kindly let us know your thoughts on it.

Comment 9 Martin Schuppert 2016-11-21 08:20:59 UTC
A workaround for this is to add an "ovs-ofctl add-flow" to /etc/rc.local so that the basic flow gets added and the node can join the cluster on boot:

# cat /etc/rc.local 

~~~ 
#!/bin/bash 
 
ovs-ofctl add-flow br-ex priority=0,actions=normal 
touch /var/lock/subsys/local 
~~~ 
Make sure that rc.local is executable:

# chmod +x /etc/rc.d/rc.local 

From tests and feedback, with this the network got restored when the system came up:

$ pcs cluster stop overcloud-controller-X 
$ reboot

Comment 10 Martin Schuppert 2016-11-21 09:07:05 UTC
Ok, what Vikrant and I were looking at seems to be a duplicate of bug 1386299, and the root cause is that the bridge fail mode has changed to secure:

https://bugzilla.redhat.com/show_bug.cgi?id=1386299#c46

As mentioned in bug 1386299, the solution would be to set br-ex to fail_mode=standalone via OVS_EXTRA in /etc/sysconfig/network-scripts/ifcfg-br-ex
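A sketch of what such an ifcfg file could look like. Only the OVS_EXTRA fail_mode setting comes from this bug; the remaining lines are a typical OVS bridge ifcfg layout and are illustrative assumptions that may differ per deployment:

```ini
# Hypothetical /etc/sysconfig/network-scripts/ifcfg-br-ex sketch
# (only the OVS_EXTRA line is taken from this bug; the rest is a
# common OVS bridge ifcfg layout and may differ in your deployment)
DEVICE=br-ex
DEVICETYPE=ovs
TYPE=OVSBridge
ONBOOT=yes
BOOTPROTO=none
OVS_EXTRA="set bridge br-ex fail_mode=standalone"
```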

Comment 11 Martin Schuppert 2016-11-21 09:43:39 UTC
Ok, my last update was not correct. For OSP 9 and OSP 8 there will be a change to neutron to revert the OVS agent change that put bridges in secure mode.

BZ 1387498 tracks the fix for OSP 8. Fixed in openstack-neutron-7.2.0-3.el7ost, which is currently on QA.

Comment 12 Martin Schuppert 2016-11-21 09:56:00 UTC
Correction: BZ 1394894 tracks the fix for OSP 8.

Comment 13 Assaf Muller 2016-12-09 21:51:58 UTC
I'm a bit lost: in light of the secure bridge mode issue handled for OSP 8, 9 and 10 in separate RHBZs, is there any merit to this RHBZ, or is any action expected from Engineering?

Comment 14 Martin Schuppert 2016-12-12 09:02:31 UTC
I think we can close this BZ as duplicate of one of the others.

*** This bug has been marked as a duplicate of bug 1394890 ***

