Bug 1419917 - VDSM fails to report capabilities if openvswitch stopped while Vdsm is running
Summary: VDSM fails to report capabilities if openvswitch stopped while Vdsm is running
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.19.4
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-4.1.2
Target Release: 4.19.11
Assignee: Edward Haas
QA Contact: Meni Yakove
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-07 12:02 UTC by Evgheni Dereveanchin
Modified: 2019-03-04 16:48 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-23 08:14:50 UTC
oVirt Team: Network
rule-engine: ovirt-4.1+
rule-engine: ovirt-4.2+


Attachments
sosreport from affected host (12.05 MB, application/x-xz)
2017-02-07 12:07 UTC, Evgheni Dereveanchin


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 73425 0 master MERGED net: Refresh OVS service state on connectivity error 2017-03-11 18:42:36 UTC
oVirt gerrit 73975 0 ovirt-4.1 MERGED net: Properly initialize ConfigNetworkError exception 2017-03-23 13:27:14 UTC
oVirt gerrit 73976 0 ovirt-4.1 MERGED net: Refresh OVS service state on connectivity error 2017-03-27 12:38:22 UTC

Description Evgheni Dereveanchin 2017-02-07 12:02:41 UTC
Description of problem:
After updating VDSM to 4.18.4 a host failed to activate as Host.getCapabilities was failing with the following error:

2017-02-06 19:35:43,798 ERROR (jsonrpc/2) [jsonrpc.JsonRpcServer] Internal server error (__init__:552)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 547, in _handle_request
    res = method(**params)
  File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 202, in _dynamicMethod
    result = fn(*methodArgs)
  File "/usr/share/vdsm/API.py", line 1378, in getCapabilities
    c = caps.get()
  File "/usr/lib/python2.7/site-packages/vdsm/host/caps.py", line 166, in get
    net_caps = supervdsm.getProxy().network_caps()
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm.py", line 53, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm.py", line 51, in <lambda>
    **kwargs)
  File "<string>", line 2, in network_caps
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
ConfigNetworkError: (21, 'Executing commands failed: ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)')
2017-02-06 19:35:43,802 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getCapabilities failed (error -32603) in 0.08 seconds (__init__:515)


Version-Release number of selected component (if applicable):
vdsm 4.19.4-1.el7.centos

How reproducible:
after 4.0->4.1 VDSM upgrade with no host reboot

Steps to Reproduce:
1. install 4.0 environment
2. upgrade engine to 4.1
3. put 4.0 host into maintenance
4. add 4.1 repo and run "yum update"
5. activate host in the Engine

Actual results:
Host goes unresponsive as it fails to report capabilities

Expected results:
Host goes up if openvswitch networks are not used in its cluster. If openvswitch is used, the host goes into the NonOperational state.

Additional info:
a reboot of the host seems to fix this

Comment 1 Evgheni Dereveanchin 2017-02-07 12:07:09 UTC
Created attachment 1248374 [details]
sosreport from affected host

Attached full sosreport of affected hypervisor. The timestamp of the upgrade and activation attempt is "2017-02-06 19:35"

Comment 2 Dan Kenigsberg 2017-02-08 09:37:30 UTC
Have you set the cluster switch type to OvS (experimental!)?

Comment 3 Evgheni Dereveanchin 2017-02-08 09:53:42 UTC
Hi Dan! No, the cluster has LEGACY networking set, hence I was surprised OVS was even queried. It's OK to get capabilities and see what's installed, but VDSM shouldn't fail if a component is missing. Same as, let's say, Gluster: if it's not installed the host still works, but goes NonOperational if put into a data center that uses Gluster.

Comment 4 Edward Haas 2017-02-09 07:15:52 UTC
(In reply to Evgheni Dereveanchin from comment #3)
> Hi Dan! No, the cluster has LEGACY networking set hence I was surprised ovs
> was even queried. It's OK to get capabilities and see what's installed but
> VDSM shouldn't fail if a component is missing - same as let's say for
> Gluster - if it's not installed the host will work but will go
> NonOperational if put into a datacenter that uses Gluster.

We are actually checking if the component is missing before trying to query it in this case.
I think something happened after 19:35:34 which stopped openvswitch, as we can see it was running just fine:
MainProcess|jsonrpc/1::DEBUG::2017-02-06 19:35:34,392::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-1 /bin/systemctl status openvswitch.service (cwd None)
MainProcess|jsonrpc/1::DEBUG::2017-02-06 19:35:34,402::commands::93::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|jsonrpc/1::DEBUG::2017-02-06 19:35:34,403::vsctl::57::root::(commit) Executing commands: /usr/bin/ovs-vsctl --oneline --format=json -- list Bridge -- list Port -- list Interface
MainProcess|jsonrpc/1::DEBUG::2017-02-06 19:35:34,403::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-1 /usr/bin/ovs-vsctl --oneline --format=json -- list Bridge -- list Port -- list Interface (cwd None)
MainProcess|jsonrpc/1::DEBUG::2017-02-06 19:35:34,411::commands::93::root::(execCmd) SUCCESS: <err> = '2017-02-06T18:35:34Z|00001|ovsdb_idl|WARN|Interface table in Open_vSwitch database lacks mtu_request column (database needs upgrade?)\n2017-02-06T18:35:34Z|00002|ovsdb_idl|WARN|Port table in Open_vSwitch database lacks protected column (database needs upgrade?)\n2017-02-06T18:35:34Z|00003|ovsdb_idl|WARN|Interface table in Open_vSwitch database lacks mtu_request column (database needs upgrade?)\n2017-02-06T18:35:34Z|00004|ovsdb_idl|WARN|Port table in Open_vSwitch database lacks protected column (database needs upgrade?)\n'; <rc> = 0

We currently cache the response of "is openvswitch available", to reduce load. If OVS is stopped in the middle, while VDSM is running, we have a problem.
This was actually taken care of by systemd, when we required OVS (VDSM would go down if OVS was stopped), but now it is a soft requirement.
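The staleness problem described above can be sketched in a few lines. This is an illustrative model, not VDSM's actual code; the class and names are hypothetical:

```python
class ServiceState(object):
    """Caches a service liveness probe for the life of the process.

    A minimal sketch of the bug scenario: the probe runs once, so a
    service stopped later goes unnoticed until a command against it fails.
    """

    def __init__(self, probe):
        self._probe = probe   # callable returning True if the service is up
        self._cached = None

    def is_running(self):
        if self._cached is None:        # probe only on the first call...
            self._cached = self._probe()
        return self._cached             # ...so later answers may be stale


# Demonstrate: the service stops after the first check,
# but the cached answer stays True.
state = {'up': True}
ovs = ServiceState(lambda: state['up'])
assert ovs.is_running() is True
state['up'] = False                     # openvswitch stopped under us
assert ovs.is_running() is True         # stale cache; ovs-vsctl would now fail
```

With a stale True, VDSM proceeds to run `ovs-vsctl`, which then fails with the `db.sock` connection error seen in the traceback.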

Comment 5 Edward Haas 2017-02-09 07:38:53 UTC
I think it makes sense to expect the OVS service to be stable during VDSM lifetime.

@danken, what was the reason for https://gerrit.ovirt.org/#/c/68074 ?
Please share your opinion on this one.

Comment 6 Dan Kenigsberg 2017-02-09 11:52:11 UTC
We don't promote usage of OvS in ovirt, so there is no need to pull it in when most users do not need it. That's the motivation for https://gerrit.ovirt.org/#/c/68074

Evgheni, why was openvswitch started? Any idea why it was stopped?

Comment 7 Yaniv Kaul 2017-02-12 16:21:20 UTC
(In reply to Edward Haas from comment #5)
> I think it makes sense to expect the OVS service to be stable during VDSM
> lifetime.

If we don't need it, users may opt to stop and disable it. Just like mom (and unlike iSCSI and NFS services, btw - would be nice to be able to stop and disable them - but I think we somehow depend on them!)

> 
> @danken, what was the reason for https://gerrit.ovirt.org/#/c/68074 ?
> Please share your opinion on this one.

Comment 8 Edward Haas 2017-02-13 06:53:01 UTC
(In reply to Yaniv Kaul from comment #7)
> 
> If we don't need it, users may opt to stop and disable it. Just like mom
> (and unlike iSCSI and NFS services, btw - would be nice to be able to stop
> and disable them - but I think we somehow depend on them!)

We are fine if users stop or disable OVS, but we currently do not support this being done while VDSM is running.

Comment 9 Dan Kenigsberg 2017-02-15 09:24:49 UTC
Yaniv, the proper process of playing with services that Vdsm uses (e.g OvS, NetworkManager, libvirtd, possibly even mom) is to turn them on/off when vdsmd is off.

I don't think that it is urgent to change this now for OvS, which is not even recommended to be used right now.

Comment 10 Evgheni Dereveanchin 2017-02-15 14:17:09 UTC
Just to share my thoughts on this: I reported it not because I think VDSM should monitor OVS health (though that would not be bad) but because it crashed. Thanks for pointing out that OVS was still working upon VDSM init at 2017-02-06 19:35:34.

Looking through the provided sosreport I see the following in /var/log/messages:

Feb  6 19:35:31 ovirt-host2 systemd: Starting Virtual Desktop Server Manager...
...
Feb  6 19:35:37 ovirt-host2 systemd: Reloading.
...
Feb  6 19:35:37 ovirt-host2 systemd: Stopping Open vSwitch...
Feb  6 19:35:37 ovirt-host2 systemd: Starting Open vSwitch...
Feb  6 19:35:37 ovirt-host2 systemd: Started Open vSwitch.
Feb  6 19:35:37 ovirt-host2 ovs-ctl: ovsdb-server is already running.
Feb  6 19:35:37 ovirt-host2 ovs-ctl: Enabling remote OVSDB managers [  OK  ]
Feb  6 19:35:37 ovirt-host2 systemd: Stopping Open vSwitch...
Feb  6 19:35:37 ovirt-host2 systemd: Stopped Open vSwitch.
Feb  6 19:35:37 ovirt-host2 ovs-ctl: Killing ovsdb-server (12732) [  OK  ]
Feb  6 19:35:37 ovirt-host2 systemd: Stopped Open vSwitch Database Unit.
...
Feb  6 19:35:43 ovirt-host2 journal: ovirt-ha-broker mgmt_bridge.MgmtBridge ERROR Failed to getVdsCapabilities: (21, 'Executing commands failed: ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)')

So systemd is restarting openvswitch after VDSM has started, which looks like what is causing the issue. Can we fix it on our side with some kind of dependency, or maybe move it to another team?
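For reference, a dependency along these lines is roughly how it used to work before the hard requirement was relaxed (see comment 4 and gerrit 68074). This is a hypothetical drop-in, not the fix that was merged, and ordering alone would not cover an openvswitch restart while vdsmd is already running:

```ini
# /etc/systemd/system/vdsmd.service.d/openvswitch.conf (hypothetical)
[Unit]
# Start openvswitch first, and pull it in when vdsmd starts.
After=openvswitch.service
Wants=openvswitch.service
```

A `Requires=`/`BindsTo=` dependency would also stop vdsmd when openvswitch stops, which is exactly the coupling that was dropped because most users do not need OVS.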

Comment 11 Edward Haas 2017-02-16 09:03:19 UTC
One suggestion raised was: On failure, update the cached status of OVS service.

Assuming OVS is initially up, if VDSM ever fails due to OVS being down, it will update the cached state to down.

This suggestion resolves this bug scenario, but introduces a problem if the OVS service comes up later, or if OVS was in use and we now no longer report caps for it.

We could partially solve this by checking whether OVS networks are supposed to be defined; if so, and OVS shows as down, we can refresh its state cache.

Comment 12 Dan Kenigsberg 2017-02-27 12:08:31 UTC
We have just seen another case of this (on hera08). Let us look into this sooner.

Comment 13 Dan Kenigsberg 2017-04-26 13:50:04 UTC
In order to verify:
- Please start openvswitch, and then restart vdsm+supervdsm.
- See that all is well.
- Stop openvswitch
- See that there might be a single getCaps failure, but after it, all is still well.

Comment 14 Michael Burman 2017-04-30 12:41:28 UTC
Not "might be": there is indeed a single getCapabilities failure on the first attempt -

[root@silver-vdsb ~]# vdsm-client Host getCapabilities
vdsm-client: Command Host.getCapabilities with args {} failed:
(code=-32603, message=(32, 'ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)'))


Verified on - vdsm-4.19.11-1.el7ev.x86_64

