Bug 954169

Summary: refreshStoragePool fails if domain is upgraded from format v1 to format v3
Product: Red Hat Enterprise Virtualization Manager Reporter: Dafna Ron <dron>
Component: vdsmAssignee: Nir Soffer <nsoffer>
Status: CLOSED CURRENTRELEASE QA Contact: Aharon Canan <acanan>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.2.0CC: amureini, bazulay, fsimonce, iheim, jkt, lpeer, scohen, tnisan, yeylon
Target Milestone: ---Keywords: Triaged
Target Release: 3.5.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: ALPHA2 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-07-06 14:21:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1142923, 1156165    
Attachments:
Description Flags
logs none

Description Dafna Ron 2013-04-21 14:00:30 UTC
Created attachment 738261 [details]
logs

Description of problem:

vdsm reports the domain is partially inaccessible if we put the domain in maintenance during upgrade.

Version-Release number of selected component (if applicable):

sf13.1

How reproducible:

100%

Steps to Reproduce:
1. start upgrade of pool from v1 (3.0) to v3 (3.2)
2. put master domain in maintenance
3.
 
Actual results:

we report that the domain is partially inaccessible

Expected results:

we should complete the upgrade before marking domain in maintenance

Additional info: logs

I think it might be related to RefreshStoragePool which is sent during upgrade: 

2013-04-21 16:54:11,446 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-50) [3df2feeb] Command RefreshStoragePoolVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to RefreshStoragePoolVDS, error = Unknown pool id, pool not connected: ('7f13928f-5e07-4971-a7ac-44ca747c4467',)
2013-04-21 16:54:11,446 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RefreshStoragePoolVDSCommand] (pool-4-thread-50) [3df2feeb] FINISH, RefreshStoragePoolVDSCommand, log id: 744980b1
2013-04-21 16:54:11,446 ERROR [org.ovirt.engine.core.bll.storage.AfterDeactivateSingleAsyncOperation] (pool-4-thread-50) [3df2feeb] Failed to refresh storagePool. Host cougar01 to storage pool iSCSI. Exception: {3}: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to RefreshStoragePoolVDS, error = Unknown pool id, pool not connected: ('7f13928f-5e07-4971-a7ac-44ca747c4467',)
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:167) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.AfterDeactivateSingleAsyncOperation.execute(AfterDeactivateSingleAsyncOperation.java:32) [engine-bll.jar:]
        at org.ovirt.engine.core.utils.SyncronizeNumberOfAsyncOperations$AsyncOpThread.call(SyncronizeNumberOfAsyncOperations.java:42) [engine-utils.jar:]
        at org.ovirt.engine.core.utils.SyncronizeNumberOfAsyncOperations$AsyncOpThread.call(SyncronizeNumberOfAsyncOperations.java:31) [engine-utils.jar:]
        at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalCallable.call(ThreadPoolUtil.java:99) [engine-utils.jar:]
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) [rt.jar:1.7.0_09-icedtea]
        at java.util.concurrent.FutureTask.run(FutureTask.java:166) [rt.jar:1.7.0_09-icedtea]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_09-icedtea]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_09-icedtea]
        at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_09-icedtea]

2013-04-21 16:54:11,448 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RefreshStoragePoolVDSCommand] (pool-4-thread-40) [3df2feeb] FINISH, RefreshStoragePoolVDSCommand, log id: 3ee32507
2013-04-21 16:54:11,487 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (pool-4-thread-43) [3df2feeb] START, DisconnectStorageServerVDSCommand(HostName = cougar02, HostId = 8d6fddbc-dde5-474c-a436-f26936a4deb0, storagePoolId = 7f13928f-5e07-4971-a7ac-44ca747c4467, storageType = ISCSI, connectionList = [{ id: 7a709e03-1538-434c-9a45-7b747e5184ea, connection: 10.35.64.10, iqn: Dafna-03, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 72b717c1

Comment 1 Ayal Baron 2013-04-22 08:50:28 UTC
Fede, please take a look what's going on here.

Comment 2 Federico Simoncelli 2013-04-29 16:03:42 UTC
This issue is not related to the domain being moved to maintenance even though it's probably a valid reproducer.

Basically if we send a refreshStoragePool to the HSMs right after the upgrade was completed, the metadata switch (lv-based vs tag-based) in the domain cache didn't happen yet and the master domain fails the validation (empty metadata).

 @public
 def refreshStoragePool(self, spUUID, msdUUID,
                        masterVersion, options=None):
     ...
     pool = self.getPool(spUUID)

     try:
         self.validateSdUUID(msdUUID)
         pool.refresh(msdUUID, masterVersion)
     except:
         ...

The problem here is that before the refresh (pool.refresh) which would trigger the metadata switch there's a "validateSdUUID" call for the master domain which runs against the cached domain.

Traceback found on the HSM:

Thread-77::ERROR::2013-04-21 16:54:11,467::task::850::TaskManager.Task::(_setError) Task=`b6e59a23-2acd-4856-9082-86b47aa09cb9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 41, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 807, in refreshStoragePool
    self.validateSdUUID(msdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 254, in validateSdUUID
    sdDom.validate()
  File "/usr/share/vdsm/storage/blockSD.py", line 816, in validate
    raise se.StorageDomainAccessError(self.sdUUID)
StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: ('dd713883-398a-45db-ae2a-ec6de8751874',)

Where:

 def validate(self):
     ...
     if not len(self.getMetadata()):
         raise se.StorageDomainAccessError(self.sdUUID)

The metadata is empty (len == 0) because it has been just moved to the vg tags.

Looking at the SPM logs we have the prove that the domain was just upgraded:

Thread-91::DEBUG::2013-04-21 16:53:41,053::formatConverter::336::Storage.v3DomainConverter::(v3DomainConverter) Switching the cluster lock for domain dd713883-398a-45db-ae2a-ec6de8751874

The solution for this will probably involve removing (or somehow improving) the preliminary validateSdUUID call.

Comment 3 Ayal Baron 2013-05-01 09:08:35 UTC
The failed refresh will cause the host to disconnect from the pool and then host recovery will reconnect it so although the problem is real, the impact is not that great.

Comment 5 Ayal Baron 2014-02-16 09:33:46 UTC
Fede, does the flow change with getting rid of pool metadata changes?

Comment 6 Nir Soffer 2014-02-16 16:02:28 UTC
(In reply to Ayal Baron from comment #5)
> Fede, does the flow change with getting rid of pool metadata changes?

This may not fail when using the new memory based pool backend, but since we must support the old backend, we must fix this.

Comment 7 Ayal Baron 2014-02-16 16:13:56 UTC
(In reply to Nir Soffer from comment #6)
> (In reply to Ayal Baron from comment #5)
> > Fede, does the flow change with getting rid of pool metadata changes?
> 
> This may not fail when using the new memory based pool backend, but since we
> must support the old backend, we must fix this.

actually considering the impact is not very big (this sorts itself out automatically after a while), if this will not be relevant with the memory based pool backend then it will not really be worth the trouble.
Fede?

Comment 8 Federico Simoncelli 2014-02-17 10:04:52 UTC
(In reply to Ayal Baron from comment #7)
> (In reply to Nir Soffer from comment #6)
> > (In reply to Ayal Baron from comment #5)
> > > Fede, does the flow change with getting rid of pool metadata changes?
> > 
> > This may not fail when using the new memory based pool backend, but since we
> > must support the old backend, we must fix this.
> 
> actually considering the impact is not very big (this sorts itself out
> automatically after a while), if this will not be relevant with the memory
> based pool backend then it will not really be worth the trouble.
> Fede?

I agree that the impact is not big but the fix should be easy.

If understand correctly the problem is that we try to validate the master domain (on HSM) before refreshing and the check fails because the metadata was moved somewhere else.

Obviously when a refreshStoragePool is sent we cannot try to validate anything before satisfying the refresh request.

I think that the fix may be as easy as moving the validation after the refresh (or even remove it completely).

Comment 9 Ayal Baron 2014-02-17 13:44:18 UTC
(In reply to Federico Simoncelli from comment #8)
> (In reply to Ayal Baron from comment #7)
> > (In reply to Nir Soffer from comment #6)
> > > (In reply to Ayal Baron from comment #5)
> > > > Fede, does the flow change with getting rid of pool metadata changes?
> > > 
> > > This may not fail when using the new memory based pool backend, but since we
> > > must support the old backend, we must fix this.
> > 
> > actually considering the impact is not very big (this sorts itself out
> > automatically after a while), if this will not be relevant with the memory
> > based pool backend then it will not really be worth the trouble.
> > Fede?
> 
> I agree that the impact is not big but the fix should be easy.
> 
> If understand correctly the problem is that we try to validate the master
> domain (on HSM) before refreshing and the check fails because the metadata
> was moved somewhere else.
> 
> Obviously when a refreshStoragePool is sent we cannot try to validate
> anything before satisfying the refresh request.
> 
> I think that the fix may be as easy as moving the validation after the
> refresh (or even remove it completely).

If this is indeed the case then by all means, let's get a patch.