Bug 986635 - No dependency between “Vdsmd” and “SuperVdsmd” daemons
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
3.3.0
x86_64 Linux
unspecified Severity urgent
: ---
: 3.3.0
Assigned To: Yaniv Bronhaim
yeylon@redhat.com
infra
:
Depends On:
Blocks:
Reported: 2013-07-21 03:49 EDT by vvyazmin@redhat.com
Modified: 2016-04-18 02:52 EDT (History)
12 users

See Also:
Fixed In Version: is16
Doc Type: Bug Fix
Doc Text:
There is now a dependency between vdsmd and supervdsmd. The supervdsmd service needs to be up before vdsmd starts up.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-21 11:29:18 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm (4.92 MB, application/x-gzip)
2013-09-12 03:17 EDT, vvyazmin@redhat.com
no flags
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm (2.36 MB, application/x-gzip)
2013-10-20 07:40 EDT, vvyazmin@redhat.com
no flags


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 17195 None None None Never
oVirt gerrit 19462 None None None Never

Description vvyazmin@redhat.com 2013-07-21 03:49:43 EDT
Description of problem:
No dependency between “Vdsmd” and “SuperVdsmd” daemons

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS5 environment:

RHEVM: rhevm-3.3.0-0.7.master.el6ev.noarch
VDSM: vdsm-4.11.0-121.git082925a.el6.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1.
2. 
3.

Actual results:
The “Vdsmd” service can run without any dependency on the “SuperVdsmd” daemon.
The “Vdsmd” service can run even though “SuperVdsmd” is down or failed to start.

Expected results:
The “Vdsmd” daemon should fail to start or run if the “SuperVdsmd” service fails to start or run, and vice versa. Both services, “Vdsmd” and “SuperVdsmd”, should depend on each other.

Impact on user:
Actions that require the root user (the “SuperVdsmd” service) fail.

Workaround:
Kill the “Vdsmd” service and start it again, verifying that both “Vdsmd” and “SuperVdsmd” are running.

Additional info:

/var/log/ovirt-engine/engine.log

/var/log/vdsm/vdsm.log
Comment 1 vvyazmin@redhat.com 2013-07-21 04:07:02 EDT
RHEVM 3.3 - IS6 environment:

RHEVM: rhevm-3.3.0-0.9.master.el6ev.noarch
VDSM: vdsm-4.12.0-rc1.12.git8ee6885.el6.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64
Comment 2 Aharon Canan 2013-07-21 06:27:50 EDT
This is by design, not a bug.
The bug, if any, would be why supervdsm is down or why the two daemons are not communicating.

Yaniv - please approve.
Comment 3 Yaniv Bronhaim 2013-07-23 07:49:34 EDT
The dependency exists: supervdsm depends on libvirt, and vdsm depends on a few services, supervdsm and libvirt among them.

When vdsmd starts, supervdsm starts automatically; starting supervdsm explicitly does not mean that vdsmd will also start.

When vdsm dies, supervdsm also dies under systemd. With initctl we don't have such a dependency: when vdsm dies, supervdsm can keep running and vdsm will reconnect to it.

When supervdsm dies, the next vdsm call to supervdsm raises an exception in vdsm, vdsm reconnects to supervdsm, and the next call should work properly. That is the expected behavior as we implemented it right now. If that leads to bugs, please let me know. I attached the relevant fixes for those issues.
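The reconnect behavior described here ("exception on the first call after a crash, the next call works") can be sketched roughly as follows. This is an illustrative Python sketch, not vdsm's actual proxy code; the names SupervdsmProxy, call, and the connect factory are all hypothetical:

```python
# Hypothetical sketch of a reconnect-on-failure proxy, in the spirit of the
# vdsm/supervdsm behavior described above. None of these names are vdsm's
# real API; this only illustrates the pattern.
class SupervdsmProxy:
    def __init__(self, connect):
        self._connect = connect      # factory returning a fresh connection
        self._conn = connect()       # initial connection to the daemon

    def call(self, verb, *args):
        try:
            return self._conn(verb, *args)
        except ConnectionError:
            # The daemon died: reconnect so the *next* call succeeds,
            # but still surface the failure to this caller.
            self._conn = self._connect()
            raise
```

The design choice mirrored here is that the first call after a crash is allowed to fail; it is the trigger for reconnection, so the retry succeeds against the respawned daemon.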
Comment 4 vvyazmin@redhat.com 2013-09-12 03:17:19 EDT
Created attachment 796661 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm
Comment 5 vvyazmin@redhat.com 2013-09-12 03:17:51 EDT
Failed, tested on RHEVM 3.3 - IS13 environment:

RHEVM:  rhevm-3.3.0-0.19.master.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.13-1.el6ev.noarch
VDSM:  vdsm-4.12.0-105.git0da1561.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.355.el6_4.7.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64


Steps to Reproduce:
1. Kill SuperVDSMd 
2. Call VDSM (iScsiScan command)

Actual results:
SuperVDSMd stays in the “not running” state

Expected results:
SuperVDSMd should start



logs attached
Comment 6 Yaniv Bronhaim 2013-09-12 03:39:42 EDT
What do you mean by killing supervdsmd? Stopping the service? How would it start? I don't understand the problem; please provide the exact commands that you ran -

For example:
1. kill -9 [pid]
2. vdsClient ... 

Currently in 3.3 the expected behavior is:

When vdsmd starts, supervdsmd also starts.
When supervdsmd starts, only libvirtd starts too.
When vdsmd is restarted, supervdsmd is also restarted.

Only under systemd, when supervdsmd is restarted, vdsmd is also restarted (due to the systemd dependency mechanism). Otherwise, after a reset of supervdsmd we must restart vdsmd manually; or, after the first failed call to supervdsmd, vdsmd will restart itself and reconnect to the new instance of supervdsmd.
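The systemd side of this behavior could be expressed with unit dependency directives along these lines. This is a hedged sketch of the kind of stanzas involved, not the actual unit file vdsm ships:

```ini
# Illustrative vdsmd.service fragment (not vdsm's real unit file).
# Requires= makes vdsmd fail to start if supervdsmd cannot start, and
# propagates supervdsmd restarts to vdsmd; After= orders the startup.
[Unit]
Description=Virtual Desktop Server Manager
Requires=supervdsmd.service libvirtd.service
After=supervdsmd.service libvirtd.service
```

Under initscripts/initctl there is no equivalent mechanism, which is why the reconnect logic described above has to compensate there.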


Your description doesn't sound like a bug. Please verify that you're in one of the scenarios above.
Comment 7 vvyazmin@redhat.com 2013-09-22 22:34:14 EDT
1. I have a working environment, with VDSMd and SuperVDSMd running - a clear day:
service vdsmd status --> VDS daemon server is running
service supervdsmd status --> Super VDSM daemon server is running

2. Run command :
kill -9 supervdsmd
VDSMd - running 
SuperVDSMd - not running 

3. Call a VDSM command that needs SuperVDSMd - vdsmd does not restart itself and does not reconnect to a new instance of supervdsmd
VDSMd - running 
SuperVDSMd - not running
Comment 8 Yaniv Bronhaim 2013-09-23 08:28:19 EDT
You're right. Due to the new inline parameters that are passed to vdsm and supervdsm when the process starts, the respawn script had to change. The attached new patch fixes it. Now, when the supervdsmd process is killed, you'll see a new instance start automatically, as expected.
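The respawn wrapper mentioned here (visible later in this bug as /usr/share/vdsm/respawn in the ps output) essentially re-launches its child whenever the child exits. A minimal sketch of that core loop, with a retry limit standing in for the real script's --minlifetime/--masterpid handling, might look like this. The function name and options are illustrative, not the real script's interface:

```shell
#!/bin/sh
# Sketch of a respawn wrapper in the spirit of vdsm's respawn helper.
# Real vdsm respawn takes --minlifetime, --daemon, --masterpid, etc.;
# this only shows the core idea: restart the child whenever it exits.
respawn() {
    cmd="$1"        # command to keep alive
    max_tries="$2"  # give up after this many launches
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        $cmd || true                 # run the child; tolerate non-zero exit
        tries=$((tries + 1))
        echo "child exited; respawn attempt $tries"
    done
}

# 'false' exits immediately, so this demonstrates three respawns in a row.
respawn "false" 3
```

The bug fixed in this comment was that the real wrapper re-launched the child with a stale command line once new inline parameters were introduced; the loop itself was sound.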
Comment 9 vvyazmin@redhat.com 2013-10-20 07:39:59 EDT
Failed, tested on RHEVM 3.3 - IS18 environment:
Tested on FCP Data Centers

Host OS: RHEL 6.5

RHEVM:  rhevm-3.3.0-0.25.beta1.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.15-1.el6ev.noarch
VDSM:  vdsm-4.13.0-0.2.beta1.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-27.el6.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.412.el6.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64


1. I have a working environment, with VDSMd and SuperVDSMd running - a clear day:
service vdsmd status --> VDS daemon server is running
service supervdsmd status --> Super VDSM daemon server is running

2. Run command :
kill -9 supervdsmd
VDSMd - running 
SuperVDSMd - not running 

3. Call a VDSM command that needs SuperVDSMd - vdsmd does not restart itself and does not reconnect to a new instance of supervdsmd
VDSMd - running 
SuperVDSMd - not running

server.log

2013-10-20 14:34:28,769 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-7) HostName = tigris01.scl.lab.tlv.redhat.com
2013-10-20 14:34:28,769 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-7) Command GetDeviceListVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: ()
2013-10-20 14:34:28,769 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-7) FINISH, GetDeviceListVDSCommand, log id: 41c5db2a
2013-10-20 14:34:28,769 ERROR [org.ovirt.engine.core.bll.storage.GetDeviceListQuery] (ajp-/127.0.0.1:8702-7) Query GetDeviceListQuery failed. Exception message is VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: () (Failed with error BlockDeviceActionError and code 600)
2013-10-20 14:34:48,800 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-3) Failed in GetDeviceListVDS method
2013-10-20 14:34:48,801 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-3) Error code BlockDeviceActionError and error message VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: ()
2013-10-20 14:34:48,801 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-3) Command org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand return value

LUNListReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=600, mMessage=Error block device action: ()]]

2013-10-20 14:34:48,801 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-3) HostName = tigris01.scl.lab.tlv.redhat.com
2013-10-20 14:34:48,801 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-3) Command GetDeviceListVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: ()




Logs attached
Comment 10 vvyazmin@redhat.com 2013-10-20 07:40:39 EDT
Created attachment 814206 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm
Comment 11 Yaniv Bronhaim 2013-10-27 10:13:46 EDT
Hey,
Sorry, but on step 2 I lost you:

"""
2. Run command :
kill -9 supervdsmd
VDSMd - running 
SuperVDSMd - not running 
"""

When I do that, supervdsmd is running afterwards:

$ ps aux | grep supervdsm

root     10123  0.0  0.0  11300   828 pts/0    S<   16:09   0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid

root     10125  1.0  0.2 532092 23284 pts/0    S<l  16:09   0:00 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid

$  kill -9 10125

ybronhei@bronhaim-dell1:~/Projects/vdsm ((ca75f2b...))$ ps aux | grep supervdsm
root     10123  0.0  0.0  11304   920 pts/0    S<   16:09   0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
root     10835 12.0  0.2 380532 22972 pts/0    S<l  16:09   0:00 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid

$  service supervdsmd status
Super VDSM daemon server is running

Now, when calling a supervdsm verb, the first call fails and the second works:

$  vdsClient -s 0 getVdsHardwareInfo
Failed to read hardware information

$  vdsClient -s 0 getVdsHardwareInfo
	systemFamily = 'Not Specified'
	systemManufacturer = 'Dell Inc.'
	systemProductName = 'OptiPlex 780'
	systemSerialNumber = 'J2NJX4J'
	systemUUID = '44454c4c-3200-104e-804a-cac04f58344a'
	systemVersion = 'Not Specified'

As expected.
Comment 12 vvyazmin@redhat.com 2013-10-27 10:43:56 EDT
[root@tigris02 ~]#  ps aux | grep supervdsm
root     11375  0.0  0.0  11340   816 ?        S<   09:06   0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
root     11378  0.0  0.0 781768 28000 ?        S<l  09:06   0:01 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
qemu     14643  0.0  0.0      0     0 ?        Z<   14:09   0:00 [supervdsmServer] <defunct>
root     16794  0.0  0.0 103256   896 pts/1    S+   14:37   0:00 grep supervdsm


[root@tigris02 ~]# kill -9 11378


[root@tigris02 ~]#  ps aux | grep supervdsm
root     11375  0.0  0.0  11344   920 ?        S<   09:06   0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
root     17493 18.0  0.0 378064 22148 ?        S<l  14:39   0:00 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
root     17502  0.0  0.0 103256   896 pts/1    S+   14:39   0:00 grep supervdsm


The SuperVDSM service starts again with a new PID.

Tested on FCP Data Centers
Verified, tested on RHEVM 3.3 - IS20 environment:

Host OS: RHEL 6.5

RHEVM:  rhevm-3.3.0-0.28.beta1.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.17-1.el6ev.noarch
VDSM:  vdsm-4.13.0-0.5.beta1.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-29.el6.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.414.el6.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64
Comment 15 Charlie 2013-11-27 19:26:34 EST
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.
Comment 17 errata-xmlrpc 2014-01-21 11:29:18 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html
