Description of problem:
No dependency between the "vdsmd" and "supervdsmd" daemons.

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS5 environment:
RHEVM: rhevm-3.3.0-0.7.master.el6ev.noarch
VDSM: vdsm-4.11.0-121.git082925a.el6.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1.
2.
3.

Actual results:
The "vdsmd" service can run with no dependency on the "supervdsmd" daemon: "vdsmd" keeps running even when "supervdsmd" is down or has failed to start.

Expected results:
"vdsmd" should fail to start or run if "supervdsmd" failed to start or run, and vice versa. The two services should depend on each other.

Impact on user:
Actions that require root privileges (the "supervdsmd" service) fail.

Workaround:
Kill all "vdsmd" processes and start the service again, verifying that both "vdsmd" and "supervdsmd" are running.

Additional info:
/var/log/ovirt-engine/engine.log
/var/log/vdsm/vdsm.log
RHEVM 3.3 - IS6 environment:
RHEVM: rhevm-3.3.0-0.9.master.el6ev.noarch
VDSM: vdsm-4.12.0-rc1.12.git8ee6885.el6.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64
This is the design, not a bug. The actual bug may be why supervdsm is down, or why the two daemons are not communicating. Yaniv - please approve.
The dependency exists: supervdsmd depends on libvirtd, and vdsmd depends on a few services, supervdsmd and libvirtd among them. When starting vdsmd, supervdsmd starts automatically; starting supervdsmd explicitly does not mean that vdsmd will also start. Under systemd, when vdsmd dies, supervdsmd dies as well. With initctl we don't have such a dependency: when vdsmd dies, supervdsmd can still run and vdsmd will reconnect to it. When supervdsmd dies, the next vdsmd call to supervdsmd raises an exception in vdsmd, vdsmd reconnects to supervdsmd, and the next call should work properly. That is the expected behavior as we implemented it right now; if it leads to bugs, please let me know. I attached the relevant fixes for those issues.
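The reconnect-on-failure behavior described above can be sketched as follows. This is a minimal illustration under assumed names (SupervdsmProxy, PeerDied, and the connect factory are all hypothetical), not the actual vdsm implementation:

```python
# Sketch of the reconnect-on-failure behavior described above.
# All names here are hypothetical; this is an illustration, not vdsm code.

class PeerDied(Exception):
    """Raised when the supervdsm side of the connection has gone away."""

class SupervdsmProxy:
    def __init__(self, connect):
        self._connect = connect  # factory returning a live connection
        self._conn = connect()

    def call(self, func):
        try:
            return func(self._conn)
        except PeerDied:
            # The peer died: reconnect so the *next* call can succeed,
            # but surface the failure to the caller this time.
            self._conn = self._connect()
            raise
```

In this model the first call after supervdsmd dies raises an exception, and the retry succeeds over the fresh connection.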
Created attachment 796661 [details]
Logs: rhevm, vdsm, libvirt, thread dump, supervdsm
Failed, tested on RHEVM 3.3 - IS13 environment:
RHEVM: rhevm-3.3.0-0.19.master.el6ev.noarch
PythonSDK: rhevm-sdk-python-3.3.0.13-1.el6ev.noarch
VDSM: vdsm-4.12.0-105.git0da1561.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.7.x86_64
SANLOCK: sanlock-2.8-1.el6.x86_64

Steps to Reproduce:
1. Kill supervdsmd
2. Call VDSM (iScsiScan command)

Actual results:
supervdsmd stays in "not running" state

Expected results:
supervdsmd should start

Logs attached.
What do you mean by killing supervdsmd? Stopping the service? Then how would it start? I don't understand the problem; please provide the exact commands that you run, for example:
1. kill -9 [pid]
2. vdsClient ...

Currently in 3.3 the expected behavior is:
- When vdsmd starts, supervdsmd starts too.
- When supervdsmd starts, only libvirtd starts too.
- When vdsmd is restarted, supervdsmd is restarted too.
- Only under systemd, when supervdsmd is restarted, vdsmd is restarted as well (due to the systemd dependency mechanism). Otherwise, after a reset of supervdsmd we have to restart vdsmd manually, or, after the first failed call to supervdsmd, vdsmd will restart itself and reconnect to the new instance of supervdsmd.

Your description doesn't sound like a bug; please verify that you're in one of the above scenarios.
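The systemd dependency mechanism mentioned above could be expressed with unit directives along these lines. This is a hedged sketch of the idea only, not the unit file actually shipped by vdsm:

```ini
# vdsmd.service dependency sketch (illustrative, not the shipped unit)
[Unit]
# Requires= + After= : supervdsmd is started before vdsmd, and an
# explicit stop or restart of supervdsmd propagates to vdsmd, which
# is what restarts vdsmd under systemd in the scenario above.
Requires=supervdsmd.service
After=supervdsmd.service
```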
1. I have a working environment, vdsmd and supervdsmd running - clear day:
   service vdsmd status --> VDS daemon server is running
   service supervdsmd status --> Super VDSM daemon server is running
2. Run command: kill -9 supervdsmd
   vdsmd - running
   supervdsmd - not running
3. Call a VDSM command that needs supervdsmd - vdsmd does not restart itself and does not reconnect to a new instance of supervdsmd:
   vdsmd - running
   supervdsmd - not running
You're right. Due to new inline parameters that are sent to vdsm and supervdsm when starting the processes, the respawn script had to change. The attached new patch fixes it: now, when killing the supervdsmd process, you'll see a new instance start automatically, as expected.
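The respawn wrapper's role here can be sketched as follows. This is a simplified illustration of the general respawn-loop idea only; the function name and minlifetime handling are assumptions, and the real /usr/share/vdsm/respawn is a shell script with daemonization and pidfile handling that this sketch omits:

```python
import subprocess
import time

def respawn(argv, minlifetime=10):
    """Keep restarting argv whenever it exits.

    Illustration of the respawn-loop idea only, not the actual
    /usr/share/vdsm/respawn script. Note that argv is forwarded
    unchanged on every restart, which is the point the fix above
    addresses: the wrapper must pass the daemon's new inline
    parameters through to each respawned instance.
    """
    while True:
        start = time.monotonic()
        subprocess.call(argv)  # run the daemon in the foreground
        if time.monotonic() - start < minlifetime:
            # The child died too quickly: treat it as a startup
            # failure and give up instead of respawning forever.
            return 1
```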
Failed, tested on RHEVM 3.3 - IS18 environment:
Tested on FCP Data Centers
Host OS: RHEL 6.5
RHEVM: rhevm-3.3.0-0.25.beta1.el6ev.noarch
PythonSDK: rhevm-sdk-python-3.3.0.15-1.el6ev.noarch
VDSM: vdsm-4.13.0-0.2.beta1.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-27.el6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.412.el6.x86_64
SANLOCK: sanlock-2.8-1.el6.x86_64

1. I have a working environment, vdsmd and supervdsmd running - clear day:
   service vdsmd status --> VDS daemon server is running
   service supervdsmd status --> Super VDSM daemon server is running
2. Run command: kill -9 supervdsmd
   vdsmd - running
   supervdsmd - not running
3. Call a VDSM command that needs supervdsmd - vdsmd does not restart itself and does not reconnect to a new instance of supervdsmd:
   vdsmd - running
   supervdsmd - not running

server.log:
2013-10-20 14:34:28,769 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-7) HostName = tigris01.scl.lab.tlv.redhat.com
2013-10-20 14:34:28,769 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-7) Command GetDeviceListVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: ()
2013-10-20 14:34:28,769 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-7) FINISH, GetDeviceListVDSCommand, log id: 41c5db2a
2013-10-20 14:34:28,769 ERROR [org.ovirt.engine.core.bll.storage.GetDeviceListQuery] (ajp-/127.0.0.1:8702-7) Query GetDeviceListQuery failed. Exception message is VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: () (Failed with error BlockDeviceActionError and code 600)
2013-10-20 14:34:48,800 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-3) Failed in GetDeviceListVDS method
2013-10-20 14:34:48,801 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-3) Error code BlockDeviceActionError and error message VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: ()
2013-10-20 14:34:48,801 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-3) Command org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand return value LUNListReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=600, mMessage=Error block device action: ()]]
2013-10-20 14:34:48,801 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-3) HostName = tigris01.scl.lab.tlv.redhat.com
2013-10-20 14:34:48,801 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-3) Command GetDeviceListVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: ()

Logs attached.
Created attachment 814206 [details]
Logs: rhevm, vdsm, libvirt, thread dump, supervdsm
Hey, sorry, but on step 2 you lost me:
"""
2. Run command : kill -9 supervdsmd
VDSMd - running
SuperVDSMd - not running
"""
When I do that, supervdsmd is running afterwards:

$ ps aux | grep supervdsm
root 10123 0.0 0.0 11300 828 pts/0 S< 16:09 0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
root 10125 1.0 0.2 532092 23284 pts/0 S<l 16:09 0:00 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
$ kill -9 10125
$ ps aux | grep supervdsm
root 10123 0.0 0.0 11304 920 pts/0 S< 16:09 0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
root 10835 12.0 0.2 380532 22972 pts/0 S<l 16:09 0:00 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
$ service supervdsmd status
Super VDSM daemon server is running

Now, when calling a supervdsm verb, the first call mismatches and the second works:

$ vdsClient -s 0 getVdsHardwareInfo
Failed to read hardware information
$ vdsClient -s 0 getVdsHardwareInfo
systemFamily = 'Not Specified'
systemManufacturer = 'Dell Inc.'
systemProductName = 'OptiPlex 780'
systemSerialNumber = 'J2NJX4J'
systemUUID = '44454c4c-3200-104e-804a-cac04f58344a'
systemVersion = 'Not Specified'

As expected.
[root@tigris02 ~]# ps aux | grep supervdsm
root 11375 0.0 0.0 11340 816 ? S< 09:06 0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
root 11378 0.0 0.0 781768 28000 ? S<l 09:06 0:01 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
qemu 14643 0.0 0.0 0 0 ? Z< 14:09 0:00 [supervdsmServer] <defunct>
root 16794 0.0 0.0 103256 896 pts/1 S+ 14:37 0:00 grep supervdsm
[root@tigris02 ~]# kill -9 11378
[root@tigris02 ~]# ps aux | grep supervdsm
root 11375 0.0 0.0 11344 920 ? S< 09:06 0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
root 17493 18.0 0.0 378064 22148 ? S<l 14:39 0:00 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
root 17502 0.0 0.0 103256 896 pts/1 S+ 14:39 0:00 grep supervdsm

The supervdsmd service starts again with a new PID.

Tested on FCP Data Centers.

Verified, tested on RHEVM 3.3 - IS20 environment:
Host OS: RHEL 6.5
RHEVM: rhevm-3.3.0-0.28.beta1.el6ev.noarch
PythonSDK: rhevm-sdk-python-3.3.0.17-1.el6ev.noarch
VDSM: vdsm-4.13.0-0.5.beta1.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-29.el6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.414.el6.x86_64
SANLOCK: sanlock-2.8-1.el6.x86_64
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise, to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as "the bug doesn't present anymore.")

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format, please refer to:
https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes

Thanks in advance.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0040.html