Description of problem:
When VDSM networking configures an OVS network, it uses the ovs-vsctl tool. Since OVS 2.9, we frequently see the VDSM tests getting stuck on ovs-vsctl execution.

Version-Release number of selected component (if applicable):
OVS 2.9.0

How reproducible:
Sporadically, mainly seen on CI runs.

Steps to Reproduce:
1. Run VDSM unit/integration tests.

Expected results:
1. ovs-vsctl should not get blocked at all; the system should be healthy in the tests. [OVS domain]
2. If there is indeed a problem with the backend service, a timeout needs to be put in place so supervdsmd will not get stuck. [VDSM domain]
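The VDSM-domain expectation above (a hard timeout so supervdsmd cannot block forever on ovs-vsctl) could be sketched roughly as below. This is a minimal illustration only, not VDSM's actual code: the `run_with_timeout` helper and the 10-second value are assumptions made for the example.

```python
import subprocess

def run_with_timeout(cmd, timeout):
    """Run a command, raising RuntimeError if it does not finish
    within `timeout` seconds, instead of blocking the caller forever."""
    try:
        res = subprocess.run(cmd, capture_output=True, text=True,
                             timeout=timeout)
    except subprocess.TimeoutExpired:
        raise RuntimeError("command timed out: %s" % " ".join(cmd))
    return res.stdout

# Hypothetical usage in supervdsmd: combine ovs-vsctl's own --timeout
# with a hard subprocess deadline (slightly larger, as a safety margin):
#   run_with_timeout(["ovs-vsctl", "--timeout=10", "add-br", "br0"],
#                    timeout=12)
```

With this shape, even if ovs-vswitchd is hung and ovs-vsctl never returns, the caller gets an exception after the deadline rather than blocking indefinitely.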
Created attachment 1423002 [details] Logs from the jenkins slave
Created attachment 1423003 [details] Logs from the mock which runs on the Jenkins slave. Note: OVS is started from within the mock by the test fixture.
Created attachment 1423004 [details] Test run console with the GDB trace. This is the Jenkins job console, showing the trace generated by GDB when the tests get stuck.
Additional info:
- The tests run in a mock on a Jenkins slave.
- The openvswitch service is started/stopped by calling "/usr/share/openvswitch/scripts/ovs-ctl" (because systemd is not available in the mock).

The problem has not been seen with OVS 2.7.3.
Hi,

We have ovsdb-server.log, but not ovs-vswitchd.log, which could indicate that the service is not running. Unfortunately I can't help much, as there is almost no OVS-related information in the logs provided.

We need journalctl -b0, the systemd service statuses, ovs-vswitchd.log, a listing of the contents of /run/openvswitch, ps -eLf, netstat -an, selinuxenabled, dmesg, etc.

The ovs-vsctl command by default uses the unix socket in /run/openvswitch to talk to the daemon (ovsdb-server), and then it waits for the ovs-vswitchd daemon to reconfigure itself. If that daemon has a problem or is not running, the command will hang unless you pass --no-wait. So the next step is to find out the status of the whole OVS service, and that's why I am requesting the information above. A sosreport would be very helpful, as I don't think you run the OVS services inside of the mock.

I suspect that this is related to OVS running as a non-root user, since new installations will run as 'openvswitch:openvswitch', but you need to pass that as a parameter to ovs-ctl.
I don't see the ovs-vswitchd logs in /var/log/openvswitch - are you sure it was started? If so, is it possible that the collection script forgot to grab that information?

ovs-vsctl will wait for ovs-vswitchd to acknowledge the database change after it detects it from ovsdb-server, so the two are somewhat inter-dependent. That can be avoided by passing either the --no-wait or the --timeout flag to ovs-vsctl (but it means you'll need to poll to see whether the change has taken effect).

It shouldn't be required to run as the openvswitch:openvswitch user (that is only recommended for security purposes). I don't see any related AVCs or obvious permission issues (indeed, the ovsdb-server log is owned by root, so the services are running that way).
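The polling that --no-wait makes necessary could look roughly like the sketch below. It is an illustration, not code from VDSM or OVS; the `wait_until` helper and the `list_bridges` name in the comment are assumptions made for the example.

```python
import time

def wait_until(predicate, timeout=10, interval=0.5):
    """Poll `predicate` until it returns True or `timeout` seconds
    pass; return False on timeout instead of blocking forever."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one final check at the deadline

# Hypothetical usage after `ovs-vsctl --no-wait add-br br0`
# (list_bridges is an assumed helper querying ovsdb):
#   assert wait_until(lambda: "br0" in list_bridges())
```

The trade-off discussed in the thread is visible here: --no-wait keeps ovs-vsctl itself from hanging, but shifts the responsibility for confirming the change onto the caller.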
(In reply to Flavio Leitner from comment #5)
> Hi,
>
> We have ovsdb-server.log, but not ovs-vswitchd.log which could indicate that
> the service is not running. Unfortunately I can't help much as there is
> almost no OVS related information in the logs provided.

Could it be that not all services have really started, even though the start command returned? We already know about the async nature of such things. Perhaps, in previous versions, the init was faster and we did not encounter it then?
I guess I can try adding a delay of 2 sec between the start of the service and the running of the tests. But I would prefer to have a theory supporting this possibility before we try it.

> We need the journalctl -b0, systemd services status, ovs-vswitchd.log, list
> of contents in /run/openvswitch, ps -eLf, netstat -an, selinuxenabled,
> dmesg, etc...

We have a problem collecting this info, as the job cleans everything up and no traces of anything are left. The only way to do it would be to enter a Jenkins slave and run the whole thing manually.

> ovs-vsctl command by default will use the unix socket in /run/openvswitch to
> talk with the daemon (ovsdb-server) and then it will wait ovs-vswitchd
> daemon to reconfigure itself. If the daemon has a problem or it is not
> running, the command will hang unless you pass --no-wait. So, the next step
> is to find out about the whole OVS service status and that's why I am
> requesting the information above. A sosreport would be very helpful as I
> don't think you run OVS services inside of the mock.

OVS is installed on the host and we start it from within the mock.

> I suspect that this is related to OVS running as non-root user since new
> installations will run as 'openvswitch:openvswitch' but you need to pass
> that as a parameter to ovs-ctl.

Inside the mock we have root access. We create bonds and play a lot with other networking entities.
It also works 4 out of 5 times on average; if it were a privilege issue, I would expect it to reoccur consistently.
(In reply to Edward Haas from comment #7)
> (In reply to Flavio Leitner from comment #5)
> > We have ovsdb-server.log, but not ovs-vswitchd.log which could indicate that
> > the service is not running. Unfortunately I can't help much as there is
> > almost no OVS related information in the logs provided.
>
> Could it be that not all services have really started, although the
> start command returned? We already know about the async nature of such
> things.
> Perhaps, in previous versions, the init was faster and we did not
> encounter it then?
> I guess I can try adding a delay of 2 sec between the start of the service
> and the running of the tests. But I would prefer to have a theory
> supporting this possibility before we try it.

I would suggest using something like --timeout=10 instead of inserting a sleep. That way, once ovs-vswitchd is up and running, all the vsctl commands will succeed quickly, but until that point, you'll wait at most 10s.
(In reply to Aaron Conole from comment #8)
> I would suggest using something like --timeout=10 instead of inserting a
> sleep. That way, once ovs-vswitchd is up and running all the vsctl commands
> will succeed quickly, but until that point, you'll wait 10s.

We added a 5 sec timeout on our master branch; it did not help. Perhaps once the command has already started and hung, it can no longer recover? Perhaps with a retry option it would have worked.
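The retry option suggested here could be sketched as below. This is only an illustration of the idea, not the patch that was actually merged; the `run_with_retries` helper, its parameter names, and the values in the comment are assumptions.

```python
import subprocess
import time

def run_with_retries(cmd, attempts=3, timeout=5, delay=0.2):
    """Run `cmd`, retrying up to `attempts` times on failure or
    timeout, so a transient hiccup in the backend service does not
    abort the operation permanently."""
    last_err = None
    for _ in range(attempts):
        try:
            return subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=timeout, check=True)
        except (subprocess.TimeoutExpired,
                subprocess.CalledProcessError) as err:
            last_err = err
            time.sleep(delay)
    raise last_err

# Hypothetical usage for the case discussed in this comment,
# combining per-attempt --timeout with a few retries:
#   run_with_retries(["ovs-vsctl", "--timeout=5", "add-br", "br0"])
```

Each retry spawns a fresh ovs-vsctl process, which sidesteps the concern above that an already-hung invocation may never recover on its own.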
This bug has an unmerged change http://gerrit.ovirt.org/90457 on the master branch; however, the 4.2 changes are merged. We usually enforce that master is merged first in order not to lose the change there (alternatively, it is possible to clone the bug and leave only the relevant changes there). Hence this bug was left on MODIFIED.

If it is ok to move to ON_QA for 4.2, please move it, but we need to make sure this change is not lost on master.
(In reply to Anton Marchukov from comment #10)
> This patch has an unmerged change http://gerrit.ovirt.org/90457 on master
> branch, however 4.2 changes are merged. We usually enforce master to be
> merged first in order not to lose the change there (alternatively it is
> possible to clone bug and leave only relevant changes there). Hence this bug
> was left on MODIFIED.
>
> If it is ok to move to ON_QA for 4.2 please move it, but we need to make
> sure this change is not lost in master.

The patches which have been merged to 4.2 are already merged in master; I have linked them to the BZ manually. The additional patch which is now on master was added only at a later stage.

The deadlock is solved, so we can pass this to QA.
Edy hi,
We haven't seen it in QE with OVS 2.9. Please test this in your CI env, thanks)
No regression caused by these patches.
I think this needs to be verified. If ovs-vswitchd is killed, attempts to change the OVS configuration will block supervdsm until someone or something restarts it. This is relevant when we have an OVS-based cluster.