Created attachment 669287 [details]
logs

Description of problem:
After updating vdsm on my host I noticed that the SPM failed to reinitialize. To reproduce, I pkilled vdsm on the SPM host; vdsm then fails to initialize and is stuck on "initializing" until we restart it manually. There are other hosts in the DC; the SPM host becomes non-operational but remains the SPM.

****** Also reproduced by blocking the master storage domain on the SPM. ******

Version-Release number of selected component (if applicable):
sf2
vdsm-4.10.2-2.0.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. pkill vdsm on the SPM, or block the master storage domain from the SPM using iptables (I used iSCSI storage)
2.
3.

Actual results:
vdsm fails to start until we restart it manually

Expected results:
super vdsm should start vdsm

Additional info: logs

[root@gold-vdsd ~]# service vdsmd status
VDS daemon server is running
[root@gold-vdsd ~]# vdsClient -s 0 getSpmStatus afcde1c5-6022-4077-ab06-2beed7e5e404
Failed to initialize storage
[root@gold-vdsd ~]#
From the logs I can see that the startup is stuck after KSM initialization. This is post storage init (and a separate thread anyway), so I don't see what this has to do with storage. Unfortunately the flow doesn't have enough logging to tell where it hung.

KsmMonitor::DEBUG::2012-12-26 19:08:41,155::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/service ksm start' (cwd None)
KsmMonitor::DEBUG::2012-12-26 19:08:41,197::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ''; <rc> = 0
MainThread::INFO::2012-12-26 19:13:55,206::vmChannels::135::vds::(stop) VM channels listener was stopped.
Dafna - how did you pkill vdsm? What signal did you use?
(In reply to comment #2)
> Dafna - how did you pkill vdsm? What signal did you use?

I did not issue any specific signal (the default), just "pkill vdsm". But please note that you can also reproduce this by blocking the master domain (no manual interference with the pid).
When you run "pkill vdsm" it sends SIGTERM to vdsm, and vdsm restarts itself automatically.

This is another new issue that could be solved with this patch: http://gerrit.ovirt.org/#/c/9691/

Because super vdsm was not serving vdsm fast enough after the reset, the init of HSM threw an exception and we didn't recover from that. You can see it here:

starting vdsm
MainThread::INFO::2012-12-26 19:08:40,204::vdsm::88::vds::(run) I am the actual vdsm 4.10-2.0 gold-vdsd.qa.lab.tlv.redhat.com (2.6.32-348.el6.x86_64)
...
MainThread::ERROR::2012-12-26 19:08:40,891::clientIF::260::vds::(_initIRS) Error initializing IRS
Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 258, in _initIRS
    self.irs = Dispatcher(HSM())
  File "/usr/share/vdsm/storage/hsm.py", line 346, in __init__
    if not multipath.isEnabled():
  File "/usr/share/vdsm/storage/multipath.py", line 89, in isEnabled
    mpathconf = svdsm.readMultipathConf()
  File "/usr/share/vdsm/supervdsm.py", line 76, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 66, in <lambda>
    getattr(self._supervdsmProxy._svdsm, self._funcName)(*args,
AttributeError: 'ProxyCaller' object has no attribute 'readMultipathConf'
MainThread::WARNING::2012-12-26 19:08:40,976::clientIF::197::vds::(_prepareMOM) MOM initialization failed and fall back to KsmMonitor

Super vdsm didn't function, so we failed the initialization and stayed in that state; the only way to recover is to restart vdsm afterwards. To fix it we need to consider changing the way we initialize super vdsm on startup, or handling supervdsm failures without failing the initialization.

Because the call that fails here is readMultipathConf, I need to know whether we want a workaround here even when svdsm cannot operate, or whether we want to keep failing the initialization as we do now, treating svdsm failures as critical errors that demand a restart.
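To illustrate the race described above, here is a minimal sketch (not the actual vdsm code) of how a supervdsm proxy call could be retried while the proxy is still re-registering its methods after a restart, instead of letting the AttributeError abort HSM initialization. The helper name call_with_retry, the exception class, and the retry policy are all hypothetical:

```python
import time


class SupervdsmNotReady(Exception):
    """Raised when the supervdsm proxy cannot serve a call after retries."""


def call_with_retry(proxy_call, retries=3, delay=0.0):
    # Hypothetical helper: retry a supervdsm proxy call that fails with
    # AttributeError while the proxy is still (re)registering its methods,
    # instead of letting the error propagate out of HSM.__init__.
    for attempt in range(retries):
        try:
            return proxy_call()
        except AttributeError:
            if attempt == retries - 1:
                raise SupervdsmNotReady("proxy not serving calls")
            time.sleep(delay)


# Simulate a proxy that only becomes ready on the third call.
state = {"calls": 0}


def flaky_read_multipath_conf():
    state["calls"] += 1
    if state["calls"] < 3:
        raise AttributeError("'ProxyCaller' object has no attribute "
                             "'readMultipathConf'")
    return ["# multipath.conf contents"]


print(call_with_retry(flaky_read_multipath_conf))
```

With the simulated proxy, the first two calls fail as in the traceback above, and the third succeeds, so initialization would proceed instead of sticking in the failed state.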
In my opinion, during initialization we should handle any exception that is not critical and can be raised without failing the process. In this case we should catch the svdsm exception inside multipath.isEnabled and return False, instead of letting the exception propagate.
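A minimal sketch of that suggestion, assuming isEnabled decides based on a marker in multipath.conf (the "RHEV REVISION" marker and the function shape here are assumptions, not the actual vdsm implementation):

```python
def is_enabled(read_conf):
    """Return True only when multipath.conf is readable and carries the
    vdsm marker; degrade to False when supervdsm cannot serve the call."""
    try:
        conf = read_conf()
    except Exception:
        # The suggested change: a supervdsm failure no longer propagates
        # out of HSM.__init__ and aborts the whole storage initialization.
        return False
    # Assumed marker vdsm writes into its managed multipath.conf.
    return any("RHEV REVISION" in line for line in conf)


def broken_proxy():
    # Mimics the failure mode from the traceback above.
    raise AttributeError("'ProxyCaller' object has no attribute "
                         "'readMultipathConf'")


print(is_enabled(broken_proxy))                        # degrades to False
print(is_enabled(lambda: ["# RHEV REVISION 0.9"]))     # marker found: True
```

The design trade-off is that a transient svdsm outage now looks the same as "multipath not configured", so the caller must be prepared to reconfigure or re-check later.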
From the way it looks, all we need is the first part of http://gerrit.ovirt.org/#/c/10491 to solve the "AttributeError: 'ProxyCaller' object has no attribute 'readMultipathConf'" issue.
I also suggest this fix: http://gerrit.ovirt.org/#/c/10508/. The readMultipathConf call runs during hsm.__init__, and if it fails I'm not sure that is the right reason to fail HSM initialization. That said, the bug here is in svdsm, and http://gerrit.ovirt.org/#/c/10491 fixes it.
*** Bug 903673 has been marked as a duplicate of this bug. ***
*** Bug 885747 has been marked as a duplicate of this bug. ***
Verified on vdsm-4.10.2-8.0.el6ev.x86_64
3.2 has been released