Bug 1006203 - no SPM failover after SPM lost connection to storage
Summary: no SPM failover after SPM lost connection to storage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assignee: Yaniv Bronhaim
QA Contact: Elad
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2013-09-10 08:36 UTC by Aharon Canan
Modified: 2016-02-10 19:32 UTC (History)
CC List: 10 users

Fixed In Version: is16
Doc Type: Bug Fix
Doc Text:
After the Storage Pool Manager (SPM) lost its connection to the storage and became non-operational, the expected behavior was for another host to take its place as the SPM, but this did not happen. This was because SuperVdsm was not passing kwargs to the fuser call. This has now been fixed, so when the SPM becomes non-operational, a failover takes place.
Clone Of:
Environment:
Last Closed: 2014-01-21 16:15:36 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
logs (1.86 MB, application/x-gzip), 2013-09-10 08:37 UTC, Aharon Canan


Links
Red Hat Product Errata RHBA-2014:0040 (normal, SHIPPED_LIVE): vdsm bug fix and enhancement update, last updated 2014-01-21 20:26:21 UTC
oVirt gerrit 19250, last updated: Never

Description Aharon Canan 2013-09-10 08:36:57 UTC
Description of problem:
After the SPM loses its connection to the storage, the other host does not become the SPM in its place.

Version-Release number of selected component (if applicable):
is12

How reproducible:
100%

Steps to Reproduce:
1. Have 2 hosts with 2 storage domains (SDs).
2. Block storage connectivity to the SPM host (I removed the host from the storage group on the VNX).

Actual results:
The SPM host becomes non-operational, but the other host does not become the SPM in its place.

Expected results:
The other host should become the SPM.

Additional info:

Comment 1 Aharon Canan 2013-09-10 08:37:57 UTC
Created attachment 795900 [details]
logs

Comment 2 Ayal Baron 2013-09-15 12:32:43 UTC
vdsm on the SPM node was restarted properly [1] but failed to start up due to a failure in unmounting (looks like a regression introduced by http://gerrit.ovirt.org/13779).

This means that the engine will not start the SPM on the new host without fencing the first host.

It looks like supervdsmServer is not passing kwargs to fuser.
If so, the patch should be:
-    def fuser(self, *args):
-        return fuser.fuser(*args)
+    def fuser(self, *args, **kwargs):
+        return fuser.fuser(*args, **kwargs)
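For illustration only, here is a minimal standalone sketch (not the actual vdsm/supervdsm code) of why a wrapper that forwards only *args rejects keyword arguments such as mountPoint=True, raising the same TypeError shown in the traceback below. The real failure crosses the supervdsm multiprocessing proxy; this sketch collapses that into a plain local call, and the fuser() stub is hypothetical:

def fuser(path, mountPoint=False):
    # Stand-in for the real fuser helper; it just echoes its arguments.
    return (path, mountPoint)

class BrokenProxy(object):
    # Mirrors the pre-patch wrapper: keyword arguments are not accepted.
    def fuser(self, *args):
        return fuser(*args)

class FixedProxy(object):
    # Mirrors the patched wrapper: keyword arguments are forwarded.
    def fuser(self, *args, **kwargs):
        return fuser(*args, **kwargs)

print(FixedProxy().fuser('/some/mount/point', mountPoint=True))
# -> ('/some/mount/point', True)

try:
    BrokenProxy().fuser('/some/mount/point', mountPoint=True)
except TypeError as e:
    print(e)
# -> fuser() got an unexpected keyword argument 'mountPoint'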



[1] MainThread::DEBUG::2013-09-10 11:07:28,050::vdsm::45::vds::(sigtermHandler) Received signal 15
MainThread::DEBUG::2013-09-10 11:07:28,050::clientIF::232::vds::(prepareForShutdown) cannot run prepareForShutdown twice
MainThread::INFO::2013-09-10 11:07:28,735::vdsm::101::vds::(run) (PID: 17959) I am the actual vdsm 4.12.0-92.gita04386d.el6ev camel-vdsc.qa.lab.tlv.redhat.com (2.6.32-358.el6.x86_64)
MainThread::DEBUG::2013-09-10 11:07:30,322::resourceManager::420::ResourceManager::(registerNamespace) Registering namespace 'Storage'
MainThread::DEBUG::2013-09-10 11:07:30,322::threadPool::35::Misc.ThreadPool::(__init__) Enter - numThreads: 10.0, waitTimeout: 3, maxTasks: 500.0
MainThread::WARNING::2013-09-10 11:07:30,326::fileUtils::167::Storage.fileUtils::(createdir) Dir /rhev/data-center/mnt already exists
MainThread::DEBUG::2013-09-10 11:07:30,328::sp::387::Storage.StoragePool::(cleanupMasterMount) unmounting /rhev/data-center/mnt/blockSD/3a260f93-26e5-4aeb-9854-a7ccb6fba54b/master
MainThread::DEBUG::2013-09-10 11:07:30,918::mount::226::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n /bin/umount /rhev/data-center/mnt/blockSD/3a260f93-26e5-4aeb-9854-a7ccb6fba54b/master' (cwd None)
MainThread::DEBUG::2013-09-10 11:07:30,933::supervdsm::77::SuperVdsmProxy::(_connect) Trying to connect to Super Vdsm
MainThread::ERROR::2013-09-10 11:07:30,949::clientIF::260::vds::(_initIRS) Error initializing IRS
Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 258, in _initIRS
    self.irs = Dispatcher(HSM())
  File "/usr/share/vdsm/storage/hsm.py", line 346, in __init__
    sp.StoragePool.cleanupMasterMount()
  File "/usr/share/vdsm/storage/sp.py", line 389, in cleanupMasterMount
    blockSD.BlockStorageDomain.doUnmountMaster(master)
  File "/usr/share/vdsm/storage/blockSD.py", line 1181, in doUnmountMaster
    pids = svdsmp.fuser(masterMount.fs_file, mountPoint=True)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in fuser
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
TypeError: fuser() got an unexpected keyword argument 'mountPoint'
MainThread::INFO::2013-09-10 11:07:31,015::momIF::47::MOM::(__init__) Starting up MOM
MainThread::INFO::2013-09-10 11:07:31,047::vmChannels::187::vds::(settimeout) Setting channels' timeout to 30 seconds.
clientIFinit::DEBUG::2013-09-10 11:07:31,047::libvirtconnection::124::libvirtconnection::(get) trying to connect libvirt
VM Channels Listener::INFO::2013-09-10 11:07:31,049::vmChannels::170::vds::(run) Starting VM channels listener thread

Comment 3 Yaniv Bronhaim 2013-09-15 12:42:49 UTC
http://gerrit.ovirt.org/19250 - please help verify and review your suggestion.

thanks.

Comment 8 Elad 2013-10-15 10:47:31 UTC
The host becomes non-operational and an SPM failover takes place after the SPM loses its connectivity to the storage.

Verified on RHEVM3.3 is18

Comment 10 Charlie 2013-11-28 00:32:09 UTC
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to 
minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 11 errata-xmlrpc 2014-01-21 16:15:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html

