Bug 1568268

Summary: Executing ovs commands using ovs-vsctl causes a deadlock sporadically

Product: [oVirt] vdsm
Component: Core
Version: 4.30.0
Status: CLOSED NEXTRELEASE
Severity: high
Priority: high
Reporter: Edward Haas <edwardh>
Assignee: Edward Haas <edwardh>
QA Contact: Meni Yakove <myakove>
CC: aconole, amarchuk, bugs, edwardh, fleitner, mburman
Target Milestone: ovirt-4.2.3
Target Release: ---
Keywords: TestOnly
Flags: rule-engine: ovirt-4.2+
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
oVirt Team: Network
Type: Bug
Last Closed: 2018-04-30 08:46:34 UTC

Attachments:
  Logs from the jenkins slave
  Logs from the mock which runs in the jenkins slave
  Test run console with the gdb trace

Description Edward Haas 2018-04-17 05:37:53 UTC
Description of problem:
When VDSM networking configures an OVS network, it uses the ovs-vsctl tool.
Since OVS 2.9, we frequently see the VDSM tests getting stuck on ovs-vsctl execution.

Version-Release number of selected component (if applicable):
OVS 2.9.0

How reproducible:
Sporadically, mainly seen on CI runs.

Steps to Reproduce:
1. Run VDSM unit/integration tests.


Expected results:

1. ovs-vsctl should not get blocked at all; the system should be healthy in the tests. [OVS domain]
2. If there is indeed a problem with the backend service, a timeout needs to be put in place so supervdsmd will not get stuck (see the sketch after this list). [VDSM domain]
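
As an illustration of the VDSM-domain item above, a minimal sketch of a client-side timeout around ovs-vsctl; the helper name, the 10-second value, and the subprocess approach are assumptions for illustration, not the actual VDSM code.

    import subprocess

    def run_ovs_vsctl(*args, timeout=10):
        # Hypothetical guard: bound the call both with ovs-vsctl's own
        # --timeout option and with a subprocess-level timeout, so a hung
        # ovs-vswitchd cannot block the caller (e.g. supervdsmd) forever.
        cmd = ['ovs-vsctl', '--timeout=%d' % timeout] + list(args)
        try:
            return subprocess.check_output(
                cmd, stderr=subprocess.STDOUT, timeout=timeout + 5)
        except subprocess.TimeoutExpired:
            # Surface the hang as an error instead of deadlocking.
            raise RuntimeError('ovs-vsctl timed out: %s' % ' '.join(cmd))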

Comment 1 Edward Haas 2018-04-17 11:32:28 UTC
Created attachment 1423002 [details]
Logs from the jenkins slave

Comment 2 Edward Haas 2018-04-17 11:34:27 UTC
Created attachment 1423003 [details]
Logs from the mock which runs in the jenkins slave

Note: OVS is started from within the mock by the test fixture.

Comment 3 Edward Haas 2018-04-17 11:41:02 UTC
Created attachment 1423004 [details]
Test run console with the gdb trace

This is the Jenkins job console, showing the trace generated by GDB when the tests are getting stuck.

Comment 4 Edward Haas 2018-04-17 11:56:08 UTC
Additional info:
- The tests are running in a mock on a Jenkins slave.
- The openvswitch service is started/stopped by calling "/usr/share/openvswitch/scripts/ovs-ctl", because systemd is not available in the mock (a sketch of such an invocation follows below).

The problem has not been seen with OVS 2.7.3.
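
For reference, a minimal sketch of how a fixture might start and stop OVS through ovs-ctl when systemd is unavailable; this is an assumed illustration, not the exact fixture code (the --system-id=random option is what ovs-ctl documents for a first start).

    import subprocess

    OVS_CTL = '/usr/share/openvswitch/scripts/ovs-ctl'

    def start_ovs():
        # Starts ovsdb-server and ovs-vswitchd without systemd, as is done
        # inside the mock.
        subprocess.check_call([OVS_CTL, 'start', '--system-id=random'])

    def stop_ovs():
        subprocess.check_call([OVS_CTL, 'stop'])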

Comment 5 Flavio Leitner 2018-04-17 13:56:54 UTC
Hi,

We have ovsdb-server.log, but not ovs-vswitchd.log which could indicate that the service is not running.  Unfortunately I can't help much as there is almost no OVS related information in the logs provided.

We need the journalctl -b0, systemd services status, ovs-vswitchd.log, list of contents in /run/openvswitch, ps -eLf, netstat -an, selinuxenabled, dmesg,  etc...

The ovs-vsctl command by default uses the unix socket in /run/openvswitch to talk to the daemon (ovsdb-server) and then waits for the ovs-vswitchd daemon to reconfigure itself. If the daemon has a problem or is not running, the command will hang unless you pass --no-wait. So, the next step is to find out the status of the whole OVS service, and that's why I am requesting the information above. A sosreport would be very helpful, as I don't think you run OVS services inside of the mock.

I suspect that this is related to OVS running as non-root user since new installations will run as 'openvswitch:openvswitch' but you need to pass that as a parameter to ovs-ctl.

Comment 6 Aaron Conole 2018-04-17 14:21:16 UTC
I don't see the ovs-vswitchd logs in /var/log/openvswitch - are you sure it was started? If so, is it possible that the collection script forgot to grab that information? ovs-vsctl will wait for ovs-vswitchd to acknowledge the database change after detecting it from ovsdb-server, so the two are somewhat inter-dependent. That can be avoided by passing either the --no-wait or the --timeout flag to ovs-vsctl (but it means you'll need to poll to see whether the change has taken effect).
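
To make the difference concrete, a small sketch (assumed, not taken from the VDSM tests) of the default blocking call versus the --no-wait variant; the bridge argument and the --may-exist option are just for illustration.

    import subprocess

    def add_bridge_blocking(bridge):
        # Default behaviour: ovs-vsctl commits the change via ovsdb-server
        # and then waits for ovs-vswitchd to acknowledge it; if ovs-vswitchd
        # is down, this call hangs.
        subprocess.check_call(['ovs-vsctl', '--may-exist', 'add-br', bridge])

    def add_bridge_no_wait(bridge):
        # --no-wait returns as soon as ovsdb-server commits the record; the
        # caller must verify later (e.g. by polling) that ovs-vswitchd
        # actually applied the change.
        subprocess.check_call(
            ['ovs-vsctl', '--no-wait', '--may-exist', 'add-br', bridge])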

It shouldn't be required to run as the openvswitch:openvswitch user (it is only recommended for security purposes). I don't see any related AVCs or obvious permission issues (indeed, the ovsdb-server log is owned by root, so they're running that way).

Comment 7 Edward Haas 2018-04-17 14:24:08 UTC
(In reply to Flavio Leitner from comment #5)
> Hi,
> 
> We have ovsdb-server.log, but not ovs-vswitchd.log which could indicate that
> the service is not running.  Unfortunately I can't help much as there is
> almost no OVS related information in the logs provided.

Could it be that not all services have really started, even though the start command returned? We already know about the async nature of such things.
Perhaps, in previous versions, the init was faster and we did not encounter it then?
I guess I can try adding a delay of 2 seconds between the start of the service and the running of the tests, but I would prefer to have a theory supporting this possibility before we try it.

> 
> We need the journalctl -b0, systemd services status, ovs-vswitchd.log, list
> of contents in /run/openvswitch, ps -eLf, netstat -an, selinuxenabled,
> dmesg,  etc...

We have a problem collecting this info, as the job cleans everything up and leaves no traces of anything.
The only way to do it would be to enter a Jenkins slave and run the whole thing manually.

> 
> The ovs-vsctl command by default uses the unix socket in /run/openvswitch
> to talk to the daemon (ovsdb-server) and then waits for the ovs-vswitchd
> daemon to reconfigure itself. If the daemon has a problem or is not
> running, the command will hang unless you pass --no-wait. So, the next step
> is to find out the status of the whole OVS service, and that's why I am
> requesting the information above. A sosreport would be very helpful, as I
> don't think you run OVS services inside of the mock.

OVS is installed on the host and we start it from within the mock.

> 
> I suspect that this is related to OVS running as non-root user since new
> installations will run as 'openvswitch:openvswitch' but you need to pass
> that as a parameter to ovs-ctl.

Inside the mock we have root access. We create bonds and play a lot with other networking entities. It also works 4 out of 5 times on average; if it were a privilege issue, I would expect it to reoccur consistently.

Comment 8 Aaron Conole 2018-04-17 14:38:01 UTC
(In reply to Edward Haas from comment #7)
> (In reply to Flavio Leitner from comment #5)
> > Hi,
> > 
> > We have ovsdb-server.log, but not ovs-vswitchd.log which could indicate that
> > the service is not running.  Unfortunately I can't help much as there is
> > almost no OVS related information in the logs provided.
> 
> Could it be that not all services have really started, even though the
> start command returned? We already know about the async nature of such
> things.
> Perhaps, in previous versions, the init was faster and we did not
> encounter it then?
> I guess I can try adding a delay of 2 seconds between the start of the
> service and the running of the tests, but I would prefer to have a theory
> supporting this possibility before we try it.

I would suggest using something like --timeout=10 instead of inserting a sleep.  That way, once ovs-vswitchd is up and running all the vsctl commands will succeed quickly, but until that point, you'll wait 10s.

Comment 9 Edward Haas 2018-04-17 14:43:32 UTC
(In reply to Aaron Conole from comment #8)
> 
> I would suggest using something like --timeout=10 instead of inserting a
> sleep.  That way, once ovs-vswitchd is up and running all the vsctl commands
> will succeed quickly, but until that point, you'll wait 10s.

We added 5 sec on our master branch; it did not help.
Perhaps once it has already started, it can no longer recover? Perhaps with a retry option it would have worked.
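
For completeness, a retry-with-timeout wrapper could look roughly like the sketch below; the attempt count, timeout, and delay values are arbitrary assumptions, not something the patches necessarily implement.

    import subprocess
    import time

    def vsctl_with_retry(args, attempts=3, timeout=5, delay=2):
        # Retry a bounded ovs-vsctl call a few times, giving ovs-vswitchd a
        # chance to come up (or be restarted) between attempts.
        for _ in range(attempts):
            rc = subprocess.call(
                ['ovs-vsctl', '--timeout=%d' % timeout] + list(args))
            if rc == 0:
                return
            time.sleep(delay)
        raise RuntimeError(
            'ovs-vsctl failed after %d attempts: %s'
            % (attempts, ' '.join(args)))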

Comment 10 Anton Marchukov 2018-04-27 09:25:03 UTC
This bug has an unmerged change, http://gerrit.ovirt.org/90457, on the master branch; however, the 4.2 changes are merged. We usually enforce that master is merged first in order not to lose the change there (alternatively, it is possible to clone the bug and leave only the relevant changes there). Hence this bug was left in MODIFIED.

If it is OK to move to ON_QA for 4.2, please move it, but we need to make sure this change is not lost in master.

Comment 11 Edward Haas 2018-04-28 18:47:52 UTC
(In reply to Anton Marchukov from comment #10)
> This bug has an unmerged change, http://gerrit.ovirt.org/90457, on the
> master branch; however, the 4.2 changes are merged. We usually enforce that
> master is merged first in order not to lose the change there
> (alternatively, it is possible to clone the bug and leave only the relevant
> changes there). Hence this bug was left in MODIFIED.
> 
> If it is OK to move to ON_QA for 4.2, please move it, but we need to make
> sure this change is not lost in master.

The patches that have been merged to 4.2 are already merged in master; I have linked them to the BZ manually. The additional patch which is now on master was added only at a later stage.

The deadlock is solved, so we can pass this to QA.

Comment 12 Michael Burman 2018-04-29 05:15:52 UTC
Edy hi,
We haven't seen it in QE with OVS 2.9.
Please test this in your CI environment, thanks.

Comment 13 Meni Yakove 2018-04-30 08:46:34 UTC
No regression caused by these patches.

Comment 14 Edward Haas 2018-04-30 09:30:27 UTC
I think this needs to be verified.
If ovs-vswitchd is killed, attempts to change the OVS configuration will block supervdsm until someone or something restarts it.

This is relevant when we have an OVS-based cluster.
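
One possible way to verify the bounded behaviour manually (an assumed procedure, not taken from the bug): stop only ovs-vswitchd and confirm that a configuration call returns an error within the timeout instead of blocking supervdsm indefinitely.

    import subprocess

    def check_hang_is_bounded(bridge='verify-br0'):
        # Simulate the failure mode: ask ovs-vswitchd to exit gracefully,
        # leaving ovsdb-server running.
        subprocess.check_call(['ovs-appctl', '-t', 'ovs-vswitchd', 'exit'])
        # With a bounded timeout the configuration call should fail with a
        # non-zero exit status once the timeout expires, rather than hang.
        rc = subprocess.call(
            ['ovs-vsctl', '--timeout=5', '--may-exist', 'add-br', bridge])
        assert rc != 0, 'expected ovs-vsctl to fail fast, not hang'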