Created attachment 738261 [details]
logs

Description of problem:
vdsm reports the domain as partially inaccessible if we put the domain in maintenance during upgrade.

Version-Release number of selected component (if applicable):
sf13.1

How reproducible:
100%

Steps to Reproduce:
1. start upgrade of pool from v1 (3.0) to v3 (3.2)
2. put master domain in maintenance
3.

Actual results:
we report that the domain is partially inaccessible

Expected results:
we should complete the upgrade before marking the domain in maintenance

Additional info:
logs

I think it might be related to RefreshStoragePool, which is sent during upgrade:

2013-04-21 16:54:11,446 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-50) [3df2feeb] Command RefreshStoragePoolVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to RefreshStoragePoolVDS, error = Unknown pool id, pool not connected: ('7f13928f-5e07-4971-a7ac-44ca747c4467',)
2013-04-21 16:54:11,446 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RefreshStoragePoolVDSCommand] (pool-4-thread-50) [3df2feeb] FINISH, RefreshStoragePoolVDSCommand, log id: 744980b1
2013-04-21 16:54:11,446 ERROR [org.ovirt.engine.core.bll.storage.AfterDeactivateSingleAsyncOperation] (pool-4-thread-50) [3df2feeb] Failed to refresh storagePool. Host cougar01 to storage pool iSCSI.
Exception: {3}: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to RefreshStoragePoolVDS, error = Unknown pool id, pool not connected: ('7f13928f-5e07-4971-a7ac-44ca747c4467',)
	at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:167) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.storage.AfterDeactivateSingleAsyncOperation.execute(AfterDeactivateSingleAsyncOperation.java:32) [engine-bll.jar:]
	at org.ovirt.engine.core.utils.SyncronizeNumberOfAsyncOperations$AsyncOpThread.call(SyncronizeNumberOfAsyncOperations.java:42) [engine-utils.jar:]
	at org.ovirt.engine.core.utils.SyncronizeNumberOfAsyncOperations$AsyncOpThread.call(SyncronizeNumberOfAsyncOperations.java:31) [engine-utils.jar:]
	at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalCallable.call(ThreadPoolUtil.java:99) [engine-utils.jar:]
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) [rt.jar:1.7.0_09-icedtea]
	at java.util.concurrent.FutureTask.run(FutureTask.java:166) [rt.jar:1.7.0_09-icedtea]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_09-icedtea]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_09-icedtea]
	at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_09-icedtea]
2013-04-21 16:54:11,448 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RefreshStoragePoolVDSCommand] (pool-4-thread-40) [3df2feeb] FINISH, RefreshStoragePoolVDSCommand, log id: 3ee32507
2013-04-21 16:54:11,487 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (pool-4-thread-43) [3df2feeb] START, DisconnectStorageServerVDSCommand(HostName = cougar02, HostId = 8d6fddbc-dde5-474c-a436-f26936a4deb0, storagePoolId = 7f13928f-5e07-4971-a7ac-44ca747c4467, storageType = ISCSI, connectionList = [{ id: 7a709e03-1538-434c-9a45-7b747e5184ea, connection: 10.35.64.10, iqn: Dafna-03, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 72b717c1
Fede, please take a look at what's going on here.
This issue is not related to the domain being moved to maintenance, even though that is probably a valid reproducer.

Basically, if we send a refreshStoragePool to the HSMs right after the upgrade has completed, the metadata switch (lv-based vs tag-based) in the domain cache hasn't happened yet, and the master domain fails the validation (empty metadata).

    @public
    def refreshStoragePool(self, spUUID, msdUUID, masterVersion, options=None):
        ...
        pool = self.getPool(spUUID)
        try:
            self.validateSdUUID(msdUUID)
            pool.refresh(msdUUID, masterVersion)
        except:
            ...

The problem here is that before the refresh (pool.refresh), which would trigger the metadata switch, there's a validateSdUUID call for the master domain which runs against the cached domain.

Traceback found on the HSM:

Thread-77::ERROR::2013-04-21 16:54:11,467::task::850::TaskManager.Task::(_setError) Task=`b6e59a23-2acd-4856-9082-86b47aa09cb9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 41, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 807, in refreshStoragePool
    self.validateSdUUID(msdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 254, in validateSdUUID
    sdDom.validate()
  File "/usr/share/vdsm/storage/blockSD.py", line 816, in validate
    raise se.StorageDomainAccessError(self.sdUUID)
StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: ('dd713883-398a-45db-ae2a-ec6de8751874',)

Where:

    def validate(self):
        ...
        if not len(self.getMetadata()):
            raise se.StorageDomainAccessError(self.sdUUID)

The metadata is empty (len == 0) because it has just been moved to the vg tags.
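The failure mode above can be reproduced in isolation. A minimal sketch (the class and names below are stand-ins, not vdsm code): after the v1 -> v3 upgrade moves the metadata from the metadata LV to VG tags, the HSM's stale cached domain object sees an empty mapping, so the same emptiness check that blockSD.validate() performs raises before the refresh can rebuild the cache.

```python
class StorageDomainAccessError(Exception):
    """Stand-in for se.StorageDomainAccessError."""
    pass


class CachedBlockDomain:
    """Hypothetical stand-in for the cached lv-based domain object."""

    def __init__(self, sdUUID, metadata):
        self.sdUUID = sdUUID
        self._metadata = metadata

    def getMetadata(self):
        return self._metadata

    def validate(self):
        # Mirrors the check quoted from blockSD.py: empty metadata is
        # treated as an inaccessible domain.
        if not len(self.getMetadata()):
            raise StorageDomainAccessError(self.sdUUID)


# Before the upgrade, the cached object still sees the lv-based metadata.
dom = CachedBlockDomain("dd713883-398a-45db-ae2a-ec6de8751874", {"VERSION": 0})
dom.validate()  # passes

# The upgrade moves the metadata to VG tags; the stale lv-based reader
# now sees nothing, so validate() raises before refresh can fix the cache.
dom._metadata = {}
failed = False
try:
    dom.validate()
except StorageDomainAccessError:
    failed = True
assert failed
```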
Looking at the SPM logs, we have proof that the domain had just been upgraded:

Thread-91::DEBUG::2013-04-21 16:53:41,053::formatConverter::336::Storage.v3DomainConverter::(v3DomainConverter) Switching the cluster lock for domain dd713883-398a-45db-ae2a-ec6de8751874

The solution for this will probably involve removing (or somehow improving) the preliminary validateSdUUID call.
The failed refresh will cause the host to disconnect from the pool, and host recovery will then reconnect it; so although the problem is real, the impact is not that great.
Fede, does the flow change with getting rid of pool metadata changes?
(In reply to Ayal Baron from comment #5)
> Fede, does the flow change with getting rid of pool metadata changes?

This may not fail when using the new memory based pool backend, but since we must support the old backend, we must fix this.
(In reply to Nir Soffer from comment #6)
> (In reply to Ayal Baron from comment #5)
> > Fede, does the flow change with getting rid of pool metadata changes?
>
> This may not fail when using the new memory based pool backend, but since we
> must support the old backend, we must fix this.

Actually, considering the impact is not very big (this sorts itself out automatically after a while), if this will not be relevant with the memory based pool backend then it will not really be worth the trouble. Fede?
(In reply to Ayal Baron from comment #7)
> (In reply to Nir Soffer from comment #6)
> > (In reply to Ayal Baron from comment #5)
> > > Fede, does the flow change with getting rid of pool metadata changes?
> >
> > This may not fail when using the new memory based pool backend, but since we
> > must support the old backend, we must fix this.
>
> actually considering the impact is not very big (this sorts itself out
> automatically after a while), if this will not be relevant with the memory
> based pool backend then it will not really be worth the trouble.
> Fede?

I agree that the impact is not big, but the fix should be easy.

If I understand correctly, the problem is that we try to validate the master domain (on the HSM) before refreshing, and the check fails because the metadata was moved somewhere else.

Obviously, when a refreshStoragePool is sent, we cannot try to validate anything before satisfying the refresh request.

I think that the fix may be as easy as moving the validation after the refresh (or even removing it completely).
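The proposed reordering can be sketched as follows. This is a self-contained illustration, not the actual vdsm patch: the cache/refresh objects and function names below are invented, with refresh() simulating the domain-cache rebuild that picks up the tag-based metadata written by the upgrade.

```python
class StorageDomainAccessError(Exception):
    """Stand-in for se.StorageDomainAccessError."""
    pass


class FakeDomainCache:
    """Hypothetical domain cache: starts out stale right after the upgrade."""

    def __init__(self):
        self.metadata = {}  # stale lv-based reader sees nothing

    def refresh(self):
        # Simulates the metadata switch: re-read from the VG tags.
        self.metadata = {"VERSION": 3}

    def validate(self):
        if not self.metadata:
            raise StorageDomainAccessError("master domain")


def refresh_pool_current(cache):
    # Current order: validate against the stale cache, then refresh.
    cache.validate()
    cache.refresh()


def refresh_pool_proposed(cache):
    # Proposed order: rebuild the cache first, then (optionally) validate.
    cache.refresh()
    cache.validate()


# The current order fails on the stale cache.
current_ok = True
try:
    refresh_pool_current(FakeDomainCache())
except StorageDomainAccessError:
    current_ok = False

# The proposed order completes without raising.
refresh_pool_proposed(FakeDomainCache())

assert not current_ok
```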
(In reply to Federico Simoncelli from comment #8) > (In reply to Ayal Baron from comment #7) > > (In reply to Nir Soffer from comment #6) > > > (In reply to Ayal Baron from comment #5) > > > > Fede, does the flow change with getting rid of pool metadata changes? > > > > > > This may not fail when using the new memory based pool backend, but since we > > > must support the old backend, we must fix this. > > > > actually considering the impact is not very big (this sorts itself out > > automatically after a while), if this will not be relevant with the memory > > based pool backend then it will not really be worth the trouble. > > Fede? > > I agree that the impact is not big but the fix should be easy. > > If understand correctly the problem is that we try to validate the master > domain (on HSM) before refreshing and the check fails because the metadata > was moved somewhere else. > > Obviously when a refreshStoragePool is sent we cannot try to validate > anything before satisfying the refresh request. > > I think that the fix may be as easy as moving the validation after the > refresh (or even remove it completely). If this is indeed the case then by all means, let's get a patch.