Bug 1017207

Summary: [BLOCKED][vdsm] vdsm fails to unmount the master when losing connectivity to master domain
Product: [oVirt] ovirt-engine
Reporter: Raz Tamir <ratamir>
Component: General
Assignee: Liron Aravot <laravot>
Status: CLOSED WONTFIX
QA Contact: Aharon Canan <acanan>
Severity: high
Docs Contact:
Priority: unspecified
Version: ---
CC: amureini, bugs, fsimonce, laravot, lpeer, rbalakri, Rhev-m-bugs, scohen, tnisan, yeylon, ylavi
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-28 14:29:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1058022, 1080379    
Bug Blocks:    
Attachments: engine + vdsm logs

Description Raz Tamir 2013-10-09 12:47:51 UTC
Created attachment 809891 [details]
engine + vdsm logs

Description of problem:
When executing an action that involves storage (e.g. create snapshot, make template, etc.) and blocking connectivity from the SPM (only) to the master storage domain using iptables immediately after the action starts, the engine runs SpmStopVDSCommand with no success.

- Power Management disabled for all hosts 


The master domain should become inactive and ReconstructMasterDomainCommand should be called.
The master domain's status changes to inactive as expected, but ReconstructMasterDomainCommand does not succeed.


Version-Release number of selected component (if applicable):
rhevm-3.3.0-0.24.master.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:

Setup: 1 iSCSI DC, 2 hosts, 2 storage domains (each on a different server) and 1 VM.

1. Perform any action that involves the VM's storage (e.g. remove template, create snapshot, remove vm or disks)

2. Immediately after you run the action, block connectivity from the SPM to the master storage domain using iptables (see the sketch below)
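
For illustration, a minimal Python sketch of one way to block and restore the connectivity in step 2; the storage server address and iSCSI port are placeholder assumptions, not values from this setup:
------------------------------------------------------------------
# Minimal sketch (not part of the original report): block and restore
# SPM -> master-domain traffic with iptables. The address and port are
# placeholder assumptions; substitute the server actually backing the
# master storage domain.
import subprocess

MASTER_STORAGE_IP = "192.0.2.10"  # hypothetical iSCSI server address
ISCSI_PORT = "3260"               # default iSCSI target port

RULE = ["OUTPUT", "-d", MASTER_STORAGE_IP,
        "-p", "tcp", "--dport", ISCSI_PORT, "-j", "DROP"]

def block_master_domain():
    """Drop outgoing traffic from the SPM host to the master domain's server."""
    subprocess.check_call(["iptables", "-A"] + RULE)

def unblock_master_domain():
    """Remove the DROP rule added above to restore connectivity."""
    subprocess.check_call(["iptables", "-D"] + RULE)

if __name__ == "__main__":
    block_master_domain()
------------------------------------------------------------------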

Actual results:
The flow gets stuck with no actual result.
From the engine log it is obvious that the engine is in a continuous loop, without stopping the SPM.

Expected results:
The action performed in step 1 should fail gracefully.

Additional info:

Comment 1 Liron Aravot 2013-10-29 12:10:36 UTC
ratamir,

When the connectivity is blocked between the SPM and the master domain, the SPM self-fences (vdsm restarts automatically) - you can see that in the logs.

The problem here is that vdsm fails to unmount the master domain, which causes it to fail to start over and over:

------------------------------------------------------------------
MainThread::INFO::2013-10-09 14:43:34,360::vdsm::101::vds::(run) (PID: 12005) I am the actual vdsm 4.12.0-170.el6ev gold-vdsd.qa.lab.tlv.redhat.com (2.6.32-420.el6.x86_64)
....
-4315-ad56-00d54573146b/master failed - Error
 unmounting master storage domain: 'masterdir=/rhev/data-center/mnt/blockSD/429ac11e-828e-4315-ad56-00d54573146b/master, rc=1'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 390, in cleanupMasterMount
    blockSD.BlockStorageDomain.doUnmountMaster(master)
  File "/usr/share/vdsm/storage/blockSD.py", line 1256, in doUnmountMaster
    raise se.StorageDomainMasterUnmountError(masterdir, 1)
StorageDomainMasterUnmountError: Error unmounting master storage domain: 'masterdir=/rhev/data-center/mnt/blockSD/429ac11e-828e-4315-ad56-00d54573146b/master, rc=1'
MainThread::INFO::2013-10-09 14:43:35,832::vdsm::101::vds::(run) (PID: 12035) I am the actual vdsm 4.12.0-170.el6ev gold-vdsd.qa.lab.tlv.redhat.com (2.6.32-420.el6.x86_64)
MainThread::DEBUG::2013-10-09 14:43:36,138::resourceManager::420::ResourceManager::(registerNamespace) Registering namespace 'Storage'
...
MainThread::ERROR::2013-10-09 14:43:37,139::utils::918::root::(panic) Panic: unmount /rhev/data-center/mnt/blockSD/429ac11e-828e-4315-ad56-00d54573146b/master failed - Error
 unmounting master storage domain: 'masterdir=/rhev/data-center/mnt/blockSD/429ac11e-828e-4315-ad56-00d54573146b/master, rc=1'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 390, in cleanupMasterMount
    blockSD.BlockStorageDomain.doUnmountMaster(master)
  File "/usr/share/vdsm/storage/blockSD.py", line 1256, in doUnmountMaster
    raise se.StorageDomainMasterUnmountError(masterdir, 1)
StorageDomainMasterUnmountError: Error unmounting master storage domain: 'masterdir=/rhev/data-center/mnt/blockSD/429ac11e-828e-4315-ad56-00d54573146b/master, rc=1'
MainThread::INFO::2013-10-09 14:58:37,248::vdsm::101::vds::(run) (PID: 12479) I am the actual vdsm 4.12.0-170.el6ev gold-vdsd.qa.lab.tlv.redhat.com (2.6.32-420.el6.x86_64)
MainThread::DEBUG::2013-10-09 14:58:37,630::resourceManager::420::ResourceManager::(registerNamespace) Registering namespace 'Storage'
----------------------------------------------------------------------

In the meanwhile, the engine has no connection with vdsm, so the host moves to Non Responsive/Connecting (in this flow it switches between those statuses) and SPM stop/SPM start is not executed, as we "wait to know" what happened to the former SPM:

--------------------------------------------------------------------------------
09 14:41:45,943 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-51) SPM Init: could not find reported vds or not up - pool:testDC vds_spm_id: 2
2013-10-09 14:41:45,945 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-51) SPM selection - vds seems as spm gold-vdsd
2013-10-09 14:41:45,946 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (DefaultQuartzScheduler_Worker-51) START, SpmStopVDSCommand(HostName = gold-vdsd, HostId = c73348ae-8dbc-4abf-ad2b-75f874c91c6d, storagePoolId = f705d4d5-e0b5-464d-bf0b-a2285dfbb5ce), log id: 7fb212bc
2013-10-09 14:41:45,946 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (DefaultQuartzScheduler_Worker-51) SpmStopVDSCommand:: vds gold-vdsd is in Connecting status - not performing spm stop, pool id f705d4d5-e0b5-464d-bf0b-a2285dfbb5ce


2013-10-09 14:42:58,808 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (pool-5-thread-42) ResourceManager::vdsNotResponding entered for Host c73348ae-8dbc-4abf-ad2b-75f874c91c6d, gold-vdsd.qa.lab.tlv.redhat.com
2013-10-09 14:42:58,829 INFO  [org.ovirt.engine.core.bll.SshSoftFencingCommand] (pool-5-thread-42) Running command: SshSoftFencingCommand internal: true. Entities affected :  ID: c73348ae-8dbc-4abf-ad2b-75f874c91c6d Type: VDS
2013-10-09 14:42:59,208 INFO  [org.ovirt.engine.core.bll.SshSoftFencingCommand] (pool-5-thread-42) Executing SSH Soft Fencing command on host gold-vdsd.qa.lab.tlv.redhat.com
2013-10-09 14:43:01,818 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-59) Command GetCapabilitiesVDS execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused
------------------------------------------------------------------------------

After vdsm on the previous SPM successfully starts (15 minutes later), the engine's connection with that host is restored and the other host is selected as SPM; the tasks are stopped and the create snapshot command ends with failure:

---------------------------------------------------
2013-10-09 14:58:49,415 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-88) START, SpmStartVDSCommand(HostName = gold-vds
c, HostId = 6c3f59b4-20a0-44d5-92ba-c9f10bc6a565, storagePoolId = f705d4d5-e0b5-464d-bf0b-a2285dfbb5ce, prevId=2, prevLVER=10, storagePoolFormatType=V3, recoveryMode=Manual,
 SCSIFencing=false), log id: 660a1e1f

k::EndCommandAction [within thread] context: Attempting to EndAction CreateAllSnapshotsFromVm, executionIndex: 0
2013-10-09 14:59:03,560 ERROR [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (pool-5-thread-50) Ending command with failure: org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand
2013-10-09 14:59:03,563 ERROR [org.ovirt.engine.core.bll.CreateSnapshotCommand] (pool-5-thread-50) [40805ead] Ending command with failure: org.ovirt.engine.core.bll.CreateSnapshotCommand

---------------------------------------------------------------


Therefore I'm moving this bug to vdsm so that the issue can be looked at - why vdsm couldn't restart successfully in this scenario.
The engine behaviour could be inspected as well (when the host is in Connecting status vs. Non Responsive, when engine commands are called based on the host status in the SPM start flow, etc.) - but that is not the issue of this bug.

Please next time include vdsm logs from all hosts, specify which log is from which host, and reduce the log length; there are many irrelevant runs in the attached log, making it hard to debug and trace. Please narrow it down as much as possible - thanks!


Ayal, please take a look - possibly this was already handled in a different bug.
thanks

Comment 2 Raz Tamir 2013-10-29 16:44:29 UTC
Liron, 
OK, next time the logs will be more specific to the issue.

(In reply to Liron Aravot from comment #1)
> Please next time include vdsm logs from all hosts, specify which log is from
> which host, and reduce the log length [...]

Comment 4 Allon Mureinik 2013-11-13 15:36:24 UTC
The bottom line, among all these comments, is this:
"vdsm fails to unmount the master domain, which causes it to fail to start over and over"

Fede, can you take a look please?

Comment 5 Federico Simoncelli 2013-11-15 13:35:28 UTC
As long as the master fs is mounted, it is not safe to start the SPM anywhere else. In this respect vdsm is extra cautious and exits right away if the umount command fails (at startup).

We can try to come up with a solution where we keep retrying in vdsm and meanwhile return some specific error about the host/masterfs status, but in the end it would result in the same behavior, as it still wouldn't be safe to start the SPM anywhere else.

In this specific case there were also 2 processes still holding the mount in use, so it would have been even more risky to try and start the SPM on another host.
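
For illustration only (not vdsm code), a small sketch of how one could list the processes still holding the master mount; the mount path is a placeholder:
------------------------------------------------------------------
# Illustration only, not taken from vdsm: list processes still holding the
# master mount open, using the standard `fuser` utility. The mount path is
# a placeholder.
import subprocess

MASTER_DIR = "/rhev/data-center/mnt/blockSD/<sdUUID>/master"  # placeholder

def mount_holders(masterdir=MASTER_DIR):
    """Return fuser's report of processes using the mount point (empty if none)."""
    result = subprocess.run(["fuser", "-vm", masterdir],
                            capture_output=True, text=True)
    # fuser exits non-zero when nothing is using the mount; its verbose table
    # goes to stderr and the bare PID list to stdout.
    return (result.stdout + result.stderr) if result.returncode == 0 else ""

if __name__ == "__main__":
    print(mount_holders() or "no processes are holding the mount")
------------------------------------------------------------------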

A (drastic) solution is to force the host to self-fence after a certain number of retries.
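
To make that proposal concrete, a rough sketch of the retry-then-self-fence idea; this is not actual vdsm code, and the helper names, retry counts and panic hook are hypothetical:
------------------------------------------------------------------
# Rough sketch of the proposal above - retry the master-fs umount a bounded
# number of times and self-fence only once retries are exhausted. Not vdsm
# code; names, the retry policy and the panic hook are hypothetical.
import subprocess
import time

MASTER_DIR = "/rhev/data-center/mnt/blockSD/<sdUUID>/master"  # placeholder
MAX_RETRIES = 5
RETRY_DELAY = 10  # seconds between attempts

def panic(msg):
    raise SystemExit("Panic: " + msg)

def try_unmount(masterdir=MASTER_DIR):
    """Attempt one umount; True on success."""
    return subprocess.call(["umount", masterdir]) == 0

def cleanup_master_mount():
    for attempt in range(1, MAX_RETRIES + 1):
        if try_unmount():
            return
        # Here vdsm could report a specific "master fs still mounted" error to
        # the engine instead of exiting immediately.
        time.sleep(RETRY_DELAY)
    # Retries exhausted: the mount may still be live, so it is not safe to let
    # another host become SPM - fall back to self-fencing the host.
    panic("unmount of %s failed after %d attempts" % (MASTER_DIR, MAX_RETRIES))
------------------------------------------------------------------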

Comment 6 Ayal Baron 2013-11-17 10:28:52 UTC
(In reply to Federico Simoncelli from comment #5)
> A (drastic) solution is to force the host to self-fence after a certain
> number of retries.

An alternative which we're already pursuing is to remove the need for the master fs to begin with...

In any event, this is not short term.

Comment 8 Ayal Baron 2014-02-23 12:58:21 UTC
Sean, this is also pending the "get rid of the SPM" work (we need to block this bug on it).

Comment 10 Sandro Bonazzola 2015-10-26 12:50:09 UTC
This is an automated message. oVirt 3.6.0 RC3 has been released and GA is targeted for next week, Nov 4th 2015.
Please review this bug and, if it is not a blocker, postpone it to a later release.
All bugs not postponed by the GA release will be automatically re-targeted to:

- 3.6.1 if severity >= high
- 4.0 if severity < high

Comment 11 Red Hat Bugzilla Rules Engine 2015-11-30 19:00:26 UTC
Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 12 Allon Mureinik 2016-03-28 14:29:36 UTC
Closing old bugs.

If you think it needs fixing, please explain why and reopen.