Bug 890365 - 3.2 - vdsm: Error initializing IRS after vdsm crash
Summary: 3.2 - vdsm: Error initializing IRS after vdsm crash
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.2.0
Assignee: Yaniv Bronhaim
QA Contact: Dafna Ron
URL:
Whiteboard: infra
Duplicates: 885747 903673
Depends On:
Blocks: 896506
 
Reported: 2012-12-26 17:47 UTC by Dafna Ron
Modified: 2018-11-30 20:27 UTC
CC: 9 users

Fixed In Version: vdsm-4.10.2-7.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
logs (1.52 MB, application/x-gzip), 2012-12-26 17:47 UTC, Dafna Ron

Description Dafna Ron 2012-12-26 17:47:24 UTC
Created attachment 669287 [details]
logs

Description of problem:

After I updated vdsm on my host, I noticed that the SPM failed to reinitialize.
To reproduce, I pkilled vdsm on the SPM; the host then fails to initialize
and is stuck on "initializing" until we restart vdsm manually.
I have other hosts in the DC - the SPM host becomes non-operational but remains the SPM.

****** Also reproduced by blocking the master storage domain on the SPM. ******


Version-Release number of selected component (if applicable):

sf2
vdsm-4.10.2-2.0.el6.x86_64

How reproducible:

100%

Steps to Reproduce:
1. pkill vdsm on the SPM, or block the master storage domain from the SPM using iptables (I used iSCSI storage)
  
Actual results:

vdsm fails to finish initializing until we restart it manually

Expected results:

supervdsm should start vdsm

Additional info: logs

[root@gold-vdsd ~]# service vdsmd status
VDS daemon server is running
[root@gold-vdsd ~]# vdsClient -s 0 getSpmStatus afcde1c5-6022-4077-ab06-2beed7e5e404
Failed to initialize storage
[root@gold-vdsd ~]#

Comment 1 Ayal Baron 2012-12-27 13:24:14 UTC
From the logs I can see that the startup is stuck after ksm initialization.
This is post storage init (and a separate thread anyway) so I don't see what this has to do with storage.
Unfortunately the flow doesn't have enough logging to tell where it hung.

KsmMonitor::DEBUG::2012-12-26 19:08:41,155::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/service ksm start' (cwd None)
KsmMonitor::DEBUG::2012-12-26 19:08:41,197::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ''; <rc> = 0
MainThread::INFO::2012-12-26 19:13:55,206::vmChannels::135::vds::(stop) VM channels listener was stopped.

Comment 2 Barak 2012-12-27 13:26:35 UTC
Dafna - how did you pkill vdsm? What signal?

Comment 3 Dafna Ron 2012-12-27 13:49:56 UTC
(In reply to comment #2)
> Dafna - how did you pkill vdsm? What signal?

I did not issue any specific signal (the default), just pkill vdsm.
But please note that you can also reproduce it by blocking the master domain (no manual user interference with the pid).

Comment 4 Yaniv Bronhaim 2012-12-30 13:22:51 UTC
When you run "pkill vdsm", it sends SIGTERM to vdsm, and vdsm restarts itself automatically.
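
For illustration only - this is not vdsm's actual respawn code, just a generic sketch of a daemon restarting itself on SIGTERM, with os.execv as an assumed mechanism:

# Generic sketch only; NOT vdsm's code. A daemon that catches SIGTERM
# and re-execs itself, which is roughly the behaviour described above.
import os
import signal
import sys

def _respawn(signum, frame):
    # Replace the current process image with a fresh copy of ourselves.
    os.execv(sys.executable, [sys.executable] + sys.argv)

signal.signal(signal.SIGTERM, _respawn)
signal.pause()  # block until a signal arrives (Unix only)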

This is another new issue that could be solved by that patch: http://gerrit.ovirt.org/#/c/9691/

Because supervdsm was not serving vdsm fast enough after the restart, the HSM init threw an exception and we did not recover from it.

You can see that here:
starting vdsm
MainThread::INFO::2012-12-26 19:08:40,204::vdsm::88::vds::(run) I am the actual vdsm 4.10-2.0 gold-vdsd.qa.lab.tlv.redhat.com (2.6.32-348.el6.x86_64)
...
MainThread::ERROR::2012-12-26 19:08:40,891::clientIF::260::vds::(_initIRS) Error initializing IRS
Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 258, in _initIRS
    self.irs = Dispatcher(HSM())
  File "/usr/share/vdsm/storage/hsm.py", line 346, in __init__
    if not multipath.isEnabled():
  File "/usr/share/vdsm/storage/multipath.py", line 89, in isEnabled
    mpathconf = svdsm.readMultipathConf()
  File "/usr/share/vdsm/supervdsm.py", line 76, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 66, in <lambda>
    getattr(self._supervdsmProxy._svdsm, self._funcName)(*args,
AttributeError: 'ProxyCaller' object has no attribute 'readMultipathConf'
MainThread::WARNING::2012-12-26 19:08:40,976::clientIF::197::vds::(_prepareMOM) MOM initialization failed and fall back to KsmMonitor

supervdsm was not functioning, so we failed the initialization and stayed in that state. The only way to recover is to restart vdsm afterwards.

To fix it, we need to consider either changing the way we initialize supervdsm on startup (see the retry sketch below), or handling supervdsm failures without failing the initialization.
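
For the first option, a minimal sketch (hypothetical helper name, not actual vdsm code) of retrying a supervdsm proxy call during startup instead of failing on the first error:

import time

def call_svdsm_with_retry(func, retries=5, delay=1.0):
    """Call a supervdsm proxy method, retrying while the proxy warms up."""
    for attempt in range(retries):
        try:
            return func()
        except AttributeError:
            # The proxy is not serving yet (e.g. right after a supervdsm
            # restart); wait and retry instead of aborting initialization.
            if attempt == retries - 1:
                raise
            time.sleep(delay)

hsm.__init__ could then call, for example, call_svdsm_with_retry(svdsm.readMultipathConf) instead of the bare proxy call.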

Because the call that fails here is readMultipathConf, I need to know whether we want a workaround even when svdsm cannot operate, or whether we want to keep failing the initialization as it does now and treat svdsm failures as critical errors that demand a restart.

In my opinion, during initialization we should handle any non-critical exception without failing the whole process; in this case we should catch the svdsm exception in multipath.isEnabled and return False instead of propagating it (a sketch follows).
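
A minimal sketch of that suggestion (the assumed shape of multipath.isEnabled; the real function has more logic around it):

def isEnabled():
    # Sketch of the suggested change in storage/multipath.py; svdsm and
    # log are assumed to be the module-level proxy and logger there.
    try:
        mpathconf = svdsm.readMultipathConf()
    except Exception:
        # supervdsm is not serving (e.g. the AttributeError above); per
        # the suggestion, return False rather than propagate the error.
        log.warning("could not read multipath.conf via supervdsm",
                    exc_info=True)
        return False
    # ... the original checks on mpathconf continue here ...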

Comment 6 Ayal Baron 2012-12-30 15:17:37 UTC
(In reply to comment #4)
From the way it looks, all we need is the first part of http://gerrit.ovirt.org/#/c/10491 to solve the "AttributeError: 'ProxyCaller' object has no attribute 'readMultipathConf'" issue.

Comment 7 Yaniv Bronhaim 2012-12-31 16:44:32 UTC
I also suggest this fix: http://gerrit.ovirt.org/#/c/10508/. The readMultipathConf call runs during hsm.__init__, and if it fails I'm not sure that is the right reason to fail HSM initialization.
However, the actual bug here is in svdsm, and http://gerrit.ovirt.org/#/c/10491 fixes it.

Comment 8 Yaniv Bronhaim 2013-01-27 10:17:49 UTC
*** Bug 903673 has been marked as a duplicate of this bug. ***

Comment 11 Yaniv Bronhaim 2013-02-17 10:40:34 UTC
*** Bug 885747 has been marked as a duplicate of this bug. ***

Comment 12 Dafna Ron 2013-02-20 17:32:05 UTC
verified on vdsm-4.10.2-8.0.el6ev.x86_64

Comment 14 Itamar Heim 2013-06-11 08:25:06 UTC
3.2 has been released

