Red Hat Bugzilla – Bug 890365
3.2 - vdsm: Error initializing IRS after vdsm crash
Last modified: 2016-02-10 14:41:40 EST
Created attachment 669287
Description of problem:
After I updated vdsm on my host, I noticed that the SPM failed to reinitialize.
To reproduce, I ran pkill on vdsm on the SPM; the host then fails to initialize
and stays stuck on initializing until vdsm is restarted manually.
I have other hosts in the DC; the SPM becomes non-operational but remains SPM.
******Also reproduced by blocking master storage domain on the SPM. ********
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Run pkill vdsm on the SPM, or block the master storage domain from the SPM using iptables (I used iSCSI storage).
Actual results:
vdsm fails to start until it is restarted manually.
Expected results:
Super vdsm should start vdsm.
Additional info: logs
[root@gold-vdsd ~]# service vdsmd status
VDS daemon server is running
[root@gold-vdsd ~]# vdsClient -s 0 getSpmStatus afcde1c5-6022-4077-ab06-2beed7e5e404
Failed to initialize storage
From the logs I can see that the startup is stuck after ksm initialization.
This is post storage init (and a separate thread anyway) so I don't see what this has to do with storage.
Unfortunately the flow doesn't have enough logging to tell where it hung.
KsmMonitor::DEBUG::2012-12-26 19:08:41,155::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/service ksm start' (cwd None)
KsmMonitor::DEBUG::2012-12-26 19:08:41,197::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ''; <rc> = 0
MainThread::INFO::2012-12-26 19:13:55,206::vmChannels::135::vds::(stop) VM channels listener was stopped.
Dafna, how did you pkill vdsm? Which signal did you use?
(In reply to comment #2)
> Dafna, how did you pkill vdsm? Which signal did you use?
I did not specify a signal; I just ran pkill vdsm, which uses the default.
Please note that you can also reproduce this by blocking the master domain (no manual interference with the process).
When you run "pkill vdsm", it sends SIGTERM to vdsm, and vdsm restarts itself automatically.
This is another new issue that could be solved with this patch: http://gerrit.ovirt.org/#/c/9691/
Because super vdsm was not serving vdsm fast enough after the reset, the HSM init threw an exception and we did not recover from it.
You can see that here:
MainThread::INFO::2012-12-26 19:08:40,204::vdsm::88::vds::(run) I am the actual vdsm 4.10-2.0 gold-vdsd.qa.lab.tlv.redhat.com (2.6.32-348.el6.x86_64)
MainThread::ERROR::2012-12-26 19:08:40,891::clientIF::260::vds::(_initIRS) Error initializing IRS
Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 258, in _initIRS
    self.irs = Dispatcher(HSM())
  File "/usr/share/vdsm/storage/hsm.py", line 346, in __init__
    if not multipath.isEnabled():
  File "/usr/share/vdsm/storage/multipath.py", line 89, in isEnabled
    mpathconf = svdsm.readMultipathConf()
  File "/usr/share/vdsm/supervdsm.py", line 76, in __call__
  File "/usr/share/vdsm/supervdsm.py", line 66, in <lambda>
AttributeError: 'ProxyCaller' object has no attribute 'readMultipathConf'
MainThread::WARNING::2012-12-26 19:08:40,976::clientIF::197::vds::(_prepareMOM) MOM initialization failed and fall back to KsmMonitor
Super vdsm was not functioning, so the initialization failed and vdsm stayed in that state. The only way to recover is to restart vdsm afterwards.
To fix it, we need to consider either changing the way we initialize super vdsm on startup or handling supervdsm failures without failing the initialization.
Because the call that fails here is readMultipathConf, I need to know whether we want a workaround here even when svdsm cannot operate, or whether we want to keep failing the initialization as it does now and treat svdsm failures as critical errors that require a restart.
In my opinion, during initialization we should handle any exception that is not critical without failing the process. In this case, the multipath.isEnabled function should catch the svdsm exception and return false instead of propagating it.
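A minimal sketch of the "catch and return false" idea, for illustration only. It is not the vdsm code and not the content of any of the linked patches. The ProxyCaller and ReadyProxy classes are hypothetical stand-ins for the supervdsm proxy, is_enabled is a simplified stand-in for multipath.isEnabled, and treating any non-empty configuration as "enabled" is an assumption.

```python
class ProxyCaller(object):
    """Hypothetical stand-in for a supervdsm proxy that is not ready yet."""

    def __getattr__(self, name):
        # Mimics the failure seen in the traceback: no remote method available.
        raise AttributeError(
            "'ProxyCaller' object has no attribute %r" % name)


class ReadyProxy(object):
    """Hypothetical stand-in for a proxy that is serving calls."""

    def readMultipathConf(self):
        return ["# example multipath.conf line"]


def is_enabled(svdsm):
    """Simplified stand-in for multipath.isEnabled.

    Returns False instead of raising when the proxy call fails, so the
    caller's initialization is not aborted by a non-critical error.
    """
    try:
        mpathconf = svdsm.readMultipathConf()
    except AttributeError:
        # Assumption: a proxy that cannot serve the call means "not enabled".
        return False
    # Assumption: any non-empty configuration counts as enabled here.
    return bool(mpathconf)
```

With this shape, is_enabled(ProxyCaller()) returns False rather than raising, so the caller can continue its startup. The trade-off is that a genuine supervdsm failure is hidden at this point and has to be surfaced somewhere else.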
From the way it looks, all we need is the first part of http://gerrit.ovirt.org/#/c/10491 to solve the "AttributeError: 'ProxyCaller' object has no attribute 'readMultipathConf'" issue.
(In reply to comment #4)
I also suggest this fix: http://gerrit.ovirt.org/#/c/10508/. The readMultipathConf call runs during hsm.__init__, and if it fails I'm not sure that is the right reason to fail HSM initialization.
Although, the bug here is in svdsm and http://gerrit.ovirt.org/#/c/10491 fixes it.
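For readers following the "not serving fast enough" diagnosis, here is a rough sketch of one generic way to tolerate a proxy that becomes ready a little late: retry the call a few times before giving up. This is an assumption-laden illustration and not what the gerrit changes do. The names call_when_ready and FlakyProxy, and the retries and delay parameters, are hypothetical.

```python
import time


def call_when_ready(proxy, name, retries=5, delay=0.1):
    """Call proxy.<name>(), retrying while the method is not available yet.

    Re-raises the last AttributeError once all retries are used up.
    """
    for attempt in range(retries):
        try:
            return getattr(proxy, name)()
        except AttributeError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)


class FlakyProxy(object):
    """Hypothetical proxy whose methods appear only after a few lookups."""

    def __init__(self, ready_after):
        self._lookups = 0
        self._ready_after = ready_after

    def __getattr__(self, name):
        # Only invoked for names not found on the instance, e.g. remote calls.
        self._lookups += 1
        if self._lookups <= self._ready_after:
            raise AttributeError(
                "'ProxyCaller' object has no attribute %r" % name)
        return lambda: "multipath.conf contents"
```

For example, call_when_ready(FlakyProxy(2), "readMultipathConf", delay=0) succeeds on the third attempt, while a proxy that never becomes ready still raises after the last retry, so a real failure is not swallowed.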
*** Bug 903673 has been marked as a duplicate of this bug. ***
*** Bug 885747 has been marked as a duplicate of this bug. ***
verified on vdsm-4.10.2-8.0.el6ev.x86_64
3.2 has been released