Description of problem:
No dependency between the "vdsmd" and "supervdsmd" daemons.

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS5 environment:
RHEVM: rhevm-3.3.0-0.7.master.el6ev.noarch
VDSM: vdsm-4.11.0-121.git082925a.el6.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1.
2.
3.

Actual results:
The "vdsmd" service can run with no dependency on the "supervdsmd" daemon: "vdsmd" keeps running even when "supervdsmd" is down or has failed to start.

Expected results:
"vdsmd" should fail to start or run if "supervdsmd" failed to start or run, and vice versa. The two services should depend on each other.

Impact on user:
Actions that require root privileges (the "supervdsmd" service) fail.

Workaround:
Kill all "vdsmd" processes and start the service again, verifying that both "vdsmd" and "supervdsmd" are running.

Additional info:
/var/log/ovirt-engine/engine.log
/var/log/vdsm/vdsm.log
RHEVM 3.3 - IS6 environment:
RHEVM: rhevm-3.3.0-0.9.master.el6ev.noarch
VDSM: vdsm-4.12.0-rc1.12.git8ee6885.el6.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64
This is the design, not a bug. The actual bug may be why supervdsm is down, or why the two daemons are not communicating. Yaniv - please approve.
The dependency exists: supervdsmd depends on libvirtd, and vdsmd depends on a few services, supervdsmd and libvirtd among them. When starting vdsmd, supervdsmd starts automatically; starting supervdsmd explicitly does not mean that vdsmd will also start. Under systemd, when vdsmd dies, supervdsmd dies as well. With initctl we don't have such a dependency: when vdsmd dies, supervdsmd can still run and vdsmd will reconnect to it. When supervdsmd dies, the next vdsmd call to supervdsmd raises an exception in vdsmd, vdsmd reconnects to supervdsmd, and the next call should work properly. That is the expected behavior as we implemented it right now; if it leads to bugs, please let me know. I attached the relevant fixes for those issues.
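The reconnect-on-failure behavior described above can be sketched as follows. This is a minimal illustration under assumed names (SupervdsmProxy, PeerDied, and the connect factory are all hypothetical), not the actual vdsm implementation:

```python
# Sketch of the reconnect-on-failure behavior described above.
# All names here are hypothetical; this is an illustration, not vdsm code.

class PeerDied(Exception):
    """Raised when the supervdsm side of the connection has gone away."""

class SupervdsmProxy:
    def __init__(self, connect):
        self._connect = connect  # factory returning a live connection
        self._conn = connect()

    def call(self, func):
        try:
            return func(self._conn)
        except PeerDied:
            # The peer died: reconnect so the *next* call can succeed,
            # but surface the failure to the caller this time.
            self._conn = self._connect()
            raise
```

In this model the first call after supervdsmd dies raises an exception, and the retry succeeds over the fresh connection.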
Created attachment 796661 [details]
Logs: rhevm, vdsm, libvirt, thread dump, supervdsm
Failed, tested on RHEVM 3.3 - IS13 environment:
RHEVM: rhevm-3.3.0-0.19.master.el6ev.noarch
PythonSDK: rhevm-sdk-python-3.3.0.13-1.el6ev.noarch
VDSM: vdsm-4.12.0-105.git0da1561.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.7.x86_64
SANLOCK: sanlock-2.8-1.el6.x86_64

Steps to Reproduce:
1. Kill supervdsmd
2. Call VDSM (iScsiScan command)

Actual results:
supervdsmd stays in "not running" state

Expected results:
supervdsmd should start

Logs attached.
What do you mean by killing supervdsmd? Stopping the service? Then how would it start? I don't understand the problem; please provide the exact commands that you run, for example:
1. kill -9 [pid]
2. vdsClient ...

Currently in 3.3 the expected behavior is:
- When vdsmd starts, supervdsmd starts too.
- When supervdsmd starts, only libvirtd starts too.
- When vdsmd is restarted, supervdsmd is restarted too.
- Only under systemd, when supervdsmd is restarted, vdsmd is restarted as well (due to the systemd dependency mechanism). Otherwise, after a reset of supervdsmd we have to restart vdsmd manually, or, after the first failed call to supervdsmd, vdsmd will restart itself and reconnect to the new instance of supervdsmd.

Your description doesn't sound like a bug; please verify that you're in one of the above scenarios.
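The systemd dependency mechanism mentioned above could be expressed with unit directives along these lines. This is a hedged sketch of the idea only, not the unit file actually shipped by vdsm:

```ini
# vdsmd.service dependency sketch (illustrative, not the shipped unit)
[Unit]
# Requires= + After= : supervdsmd is started before vdsmd, and an
# explicit stop or restart of supervdsmd propagates to vdsmd, which
# is what restarts vdsmd under systemd in the scenario above.
Requires=supervdsmd.service
After=supervdsmd.service
```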
1. I have a working environment, vdsmd and supervdsmd running - clear day:
   service vdsmd status --> VDS daemon server is running
   service supervdsmd status --> Super VDSM daemon server is running
2. Run command: kill -9 supervdsmd
   vdsmd - running
   supervdsmd - not running
3. Call a VDSM command that needs supervdsmd - vdsmd does not restart itself and does not reconnect to a new instance of supervdsmd:
   vdsmd - running
   supervdsmd - not running
You're right. Due to new inline parameters that are sent to vdsm and supervdsm when starting the processes, the respawn script had to change. The attached new patch fixes it: now, when killing the supervdsmd process, you'll see a new instance start automatically, as expected.
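The respawn wrapper's role here can be sketched as follows. This is a simplified illustration of the general respawn-loop idea only; the function name and minlifetime handling are assumptions, and the real /usr/share/vdsm/respawn is a shell script with daemonization and pidfile handling that this sketch omits:

```python
import subprocess
import time

def respawn(argv, minlifetime=10):
    """Keep restarting argv whenever it exits.

    Illustration of the respawn-loop idea only, not the actual
    /usr/share/vdsm/respawn script. Note that argv is forwarded
    unchanged on every restart, which is the point the fix above
    addresses: the wrapper must pass the daemon's new inline
    parameters through to each respawned instance.
    """
    while True:
        start = time.monotonic()
        subprocess.call(argv)  # run the daemon in the foreground
        if time.monotonic() - start < minlifetime:
            # The child died too quickly: treat it as a startup
            # failure and give up instead of respawning forever.
            return 1
```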
Failed, tested on RHEVM 3.3 - IS18 environment:
Tested on FCP Data Centers
Host OS: RHEL 6.5
RHEVM: rhevm-3.3.0-0.25.beta1.el6ev.noarch
PythonSDK: rhevm-sdk-python-3.3.0.15-1.el6ev.noarch
VDSM: vdsm-4.13.0-0.2.beta1.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-27.el6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.412.el6.x86_64
SANLOCK: sanlock-2.8-1.el6.x86_64

1. I have a working environment, vdsmd and supervdsmd running - clear day:
   service vdsmd status --> VDS daemon server is running
   service supervdsmd status --> Super VDSM daemon server is running
2. Run command: kill -9 supervdsmd
   vdsmd - running
   supervdsmd - not running
3. Call a VDSM command that needs supervdsmd - vdsmd does not restart itself and does not reconnect to a new instance of supervdsmd:
   vdsmd - running
   supervdsmd - not running

server.log:
2013-10-20 14:34:28,769 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-7) HostName = tigris01.scl.lab.tlv.redhat.com
2013-10-20 14:34:28,769 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-7) Command GetDeviceListVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: ()
2013-10-20 14:34:28,769 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-7) FINISH, GetDeviceListVDSCommand, log id: 41c5db2a
2013-10-20 14:34:28,769 ERROR [org.ovirt.engine.core.bll.storage.GetDeviceListQuery] (ajp-/127.0.0.1:8702-7) Query GetDeviceListQuery failed. Exception message is VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: () (Failed with error BlockDeviceActionError and code 600)
2013-10-20 14:34:48,800 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-3) Failed in GetDeviceListVDS method
2013-10-20 14:34:48,801 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-3) Error code BlockDeviceActionError and error message VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: ()
2013-10-20 14:34:48,801 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-3) Command org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand return value LUNListReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=600, mMessage=Error block device action: ()]]
2013-10-20 14:34:48,801 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-3) HostName = tigris01.scl.lab.tlv.redhat.com
2013-10-20 14:34:48,801 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-3) Command GetDeviceListVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: ()

Logs attached.
Created attachment 814206 [details]
Logs: rhevm, vdsm, libvirt, thread dump, supervdsm
Hey, sorry, but on step 2 you lost me:
"""
2. Run command : kill -9 supervdsmd
VDSMd - running
SuperVDSMd - not running
"""
When I do that, supervdsmd is running afterwards:

$ ps aux | grep supervdsm
root 10123 0.0 0.0 11300 828 pts/0 S< 16:09 0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
root 10125 1.0 0.2 532092 23284 pts/0 S<l 16:09 0:00 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
$ kill -9 10125
$ ps aux | grep supervdsm
root 10123 0.0 0.0 11304 920 pts/0 S< 16:09 0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
root 10835 12.0 0.2 380532 22972 pts/0 S<l 16:09 0:00 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
$ service supervdsmd status
Super VDSM daemon server is running

Now, when calling a supervdsm verb, the first call mismatches and the second works:

$ vdsClient -s 0 getVdsHardwareInfo
Failed to read hardware information
$ vdsClient -s 0 getVdsHardwareInfo
systemFamily = 'Not Specified'
systemManufacturer = 'Dell Inc.'
systemProductName = 'OptiPlex 780'
systemSerialNumber = 'J2NJX4J'
systemUUID = '44454c4c-3200-104e-804a-cac04f58344a'
systemVersion = 'Not Specified'

As expected.
[root@tigris02 ~]# ps aux | grep supervdsm
root 11375 0.0 0.0 11340 816 ? S< 09:06 0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
root 11378 0.0 0.0 781768 28000 ? S<l 09:06 0:01 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
qemu 14643 0.0 0.0 0 0 ? Z< 14:09 0:00 [supervdsmServer] <defunct>
root 16794 0.0 0.0 103256 896 pts/1 S+ 14:37 0:00 grep supervdsm
[root@tigris02 ~]# kill -9 11378
[root@tigris02 ~]# ps aux | grep supervdsm
root 11375 0.0 0.0 11344 920 ? S< 09:06 0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
root 17493 18.0 0.0 378064 22148 ? S<l 14:39 0:00 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
root 17502 0.0 0.0 103256 896 pts/1 S+ 14:39 0:00 grep supervdsm

The supervdsmd service starts again with a new PID.

Tested on FCP Data Centers.

Verified, tested on RHEVM 3.3 - IS20 environment:
Host OS: RHEL 6.5
RHEVM: rhevm-3.3.0-0.28.beta1.el6ev.noarch
PythonSDK: rhevm-sdk-python-3.3.0.17-1.el6ev.noarch
VDSM: vdsm-4.13.0-0.5.beta1.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-29.el6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.414.el6.x86_64
SANLOCK: sanlock-2.8-1.el6.x86_64
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise, to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as "the bug doesn't present anymore.")

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format, please refer to:
https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes

Thanks in advance.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0040.html