Created attachment 602365 [details]
logs

Description of problem:
After a storage failure, the system sends a reconstruct master, and I sent a reinitialize to a second domain (also a reconstruct). As a result, the reinitialize failed in the backend and the whole action was rolled back, removing the second domain I had added (via reinitialize) from the pool. vdsm, however, has two master domains under the same pool. We are unable to recover from this state, since we get the wrong master domain in the backend and vdsm hits a sanlock error: AcquireLockFailure: Cannot obtain lock.

Version-Release number of selected component (if applicable):
si13

How reproducible:
100%

Steps to Reproduce:
1. 1 domain and 1 host, iSCSI. Block connectivity to the domain from the vds.
2. Once we get "can't find master domain", remove the iptables block from the vds.
3. Reinitialize storage.

Actual results:
The backend gets an exception on MarkPoolInReconstructModeVDSCommand and rolls back the reinitialize (in the backend only, leaving only the original domain in the pool in the db). Both actions continue in vdsm, which creates two master domains on the vds, plus a domain which is not connected to the pool in the backend db but is connected to the pool in vdsm. We cannot recover; vdsm reports a sanlock error: AcquireLockFailure: Cannot obtain lock.

Expected results:
We should block one of the actions (whichever started first).

Additional info:
Both vdsm and backend logs are attached.

vdsm log:

e0a63f-a60e-4bb6-8394-80e3043c6204::INFO::2012-08-05 17:35:50,086::safelease::211::SANLock::(acquire) Acquiring cluster lock for domain e3c8df23-d94e-4d9f-ba77-6114c76776af (id: 1)
1de0a63f-a60e-4bb6-8394-80e3043c6204::ERROR::2012-08-05 17:35:50,087::task::853::TaskManager.Task::(_setError) Task=`1de0a63f-a60e-4bb6-8394-80e3043c6204`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 320, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 255, in startSpm
    self.masterDomain.acquireClusterLock(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 439, in acquireClusterLock
    self._clusterLock.acquire(hostID)
  File "/usr/share/vdsm/storage/safelease.py", line 228, in acquire
    "Cannot acquire cluster lock", str(e))
AcquireLockFailure: Cannot obtain lock: "id=e3c8df23-d94e-4d9f-ba77-6114c76776af, rc=17, out=Cannot acquire cluster lock, err=(17, 'Sanlock resource not acquired', 'File exists')"
1de0a63f-a60e-4bb6-8394-80e3043c6204::DEBUG::2012-08-05 17:35:50,088::task::872::TaskManager.Task::(_run) Task=`1de0a63f-a60e-4bb6-8394-80e3043c6204`::Task._run: 1de0a63f-a60e-4bb6-8394-80e3043c6204 () {} failed - stopping task
1de0a63f-a60e-4bb6-8394-80e3043c6204::DEBUG::2012-08-05 17:35:50,088::task::1199::TaskManager.Task::(stop) Task=`1de0a63f-a60e-4bb6-8394-80e3043c6204`::stopping in state running (force False)
1de0a63f-a60e-4bb6-8394-80e3043c6204::DEBUG::2012-08-05 17:35:50,089::task::978::TaskManager.Task::(_decref) Task=`1de0a63f-a60e-4bb6-8394-80e3043c6204`::ref 1 aborting True
1de0a63f-a60e-4bb6-8394-80e3043c6204::DEBUG::2012-08-05 17:35:50,089::task::903::TaskManager.Task::(_runJobs) Task=`1de0a63f-a60e-4bb6-8394-80e3043c6204`::abortin

backend (engine) log:

2012-08-05 15:10:01,269 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-44) [58baf64e] clear domain error-timers for pool ddc2020c-c909-4214-89ff-b3cdde436183
2012-08-05 15:10:01,270 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-4-thread-44) [58baf64e] FINISH, MarkPoolInReconstructModeVDSCommand, log id: 34105a9a
2012-08-05 15:10:01,272 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (pool-4-thread-44) [58baf64e] START, DisconnectStoragePoolVDSCommand(vdsId = f0e829f6-dcbe-11e1-b93d-001a4a169741, storagePoolId = ddc2020c-c909-4214-89ff-b3cdde436183, vds_spm_id = 1, masterDomainId = 00000000-0000-0000-0000-000000000000, masterVersion = 0), log id: 504d243b
2012-08-05 15:10:01,303 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (pool-4-thread-44) [58baf64e] FINISH, DisconnectStoragePoolVDSCommand, log id: 504d243b
2012-08-05 15:10:01,308 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (pool-4-thread-44) [58baf64e] Cannot perform logging: java.util.UnknownFormatConversionException: Conversion = 'i'
    at java.util.Formatter$FormatSpecifier.conversion(Formatter.java:2646) [rt.jar:1.7.0_05-icedtea]
    at java.util.Formatter$FormatSpecifier.<init>(Formatter.java:2675) [rt.jar:1.7.0_05-icedtea]
    at java.util.Formatter.parse(Formatter.java:2528) [rt.jar:1.7.0_05-icedtea]
    at java.util.Formatter.format(Formatter.java:2469) [rt.jar:1.7.0_05-icedtea]
    at java.util.Formatter.format(Formatter.java:2423) [rt.jar:1.7.0_05-icedtea]
    at java.lang.String.format(String.java:2845) [rt.jar:1.7.0_05-icedtea]
    at org.ovirt.engine.core.common.vdscommands.ReconstructMasterVDSCommandParameters.toString(ReconstructMasterVDSCommandParameters.java:86) [engine-common.jar:]
    at org.ovirt.engine.core.vdsbroker.VDSCommandBase.toString(VDSCommandBase.java:49) [engine-vdsbroker.jar:]
    at java.text.MessageFormat.subformat(MessageFormat.java:1271) [rt.jar:1.7.0_05-icedtea]
    at java.text.MessageFormat.format(MessageFormat.java:860) [rt.jar:1.7.0_05-icedtea]
    at java.text.Format.format(Format.java:157) [rt.jar:1.7.0_05-icedtea]
    at java.text.MessageFormat.format(MessageFormat.java:836) [rt.jar:1.7.0_05-icedtea]
    at org.ovirt.engine.core.utils.log.Log.transform(Log.java:159) [engine-utils.jar:]
    at org.ovirt.engine.core.utils.log.Log.infoFormat(Log.java:120) [engine-utils.jar:]
    at org.ovirt.engine.core.utils.log.LoggedUtils.log(LoggedUtils.java:162) [engine-utils.jar:]
    at org.ovirt.engine.core.utils.log.LoggedUtils.logEntry(LoggedUtils.java:80) [engine-utils.jar:]
    at org.ovirt.engine.core.dal.VdcCommandBase.Execute(VdcCommandBase.java:34) [engine-dal.jar:]
    at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:390) [engine-vdsbroker.jar:]
    at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:35) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:1322) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand.reconstructMaster(ReconstructMasterDomainCommand.java:138) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand.executeCommand(ReconstructMasterDomainCommand.java:149) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand.executeCommand(RecoveryStoragePoolCommand.java:103) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.ExecuteWithoutTransaction(CommandBase.java:810) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:901) [engine-bll.jar:]

vdsm state (note both domains report role = Master in the same pool):

[root@blond-vdsh tmp]# vdsClient -s 0 getStorageDomainsList
7a4c3779-ff28-49ea-9b75-76e34c1c0fc7
e3c8df23-d94e-4d9f-ba77-6114c76776af

[root@blond-vdsh tmp]# vdsClient -s 0 getStorageDomainInfo 7a4c3779-ff28-49ea-9b75-76e34c1c0fc7
	uuid = 7a4c3779-ff28-49ea-9b75-76e34c1c0fc7
	vguuid = caaAuF-pzB1-ZFVP-9iiE-pEiq-GADI-gui5Qu
	lver = -1
	state = OK
	version = 3
	role = Master
	pool = ['ddc2020c-c909-4214-89ff-b3cdde436183']
	spm_id = -1
	type = ISCSI
	class = Data
	master_ver = 0
	name = Dafna-si13-02

[root@blond-vdsh tmp]# vdsClient -s 0 getStorageDomainInfo e3c8df23-d94e-4d9f-ba77-6114c76776af
	uuid = e3c8df23-d94e-4d9f-ba77-6114c76776af
	vguuid = o6KZEf-FY7m-t0o0-fSIi-OGHH-Kh6P-hEaif3
	lver = 3
	state = OK
	version = 3
	role = Master
	pool = ['ddc2020c-c909-4214-89ff-b3cdde436183']
	spm_id = 1
	type = ISCSI
	class = Data
	master_ver = 1
	name = Dafna-si13-01
Expected results: this attempt should be blocked, as RHEVM shouldn't allow parallel reconstruct attempts. Please see the related bug opened for VDSM: https://bugzilla.redhat.com/show_bug.cgi?id=845839. Note that the engine does the right thing in trying to reconstruct the pool, though spmStart is failing due to an issue in sanlock (vdsm bug).
This is an error flow plus a race between an automatic action and a user-initiated action; removing beta1 blocker.
Proposed http://gerrit.ovirt.org/#/c/7964/ as a fix.
Correct patch: http://gerrit.ovirt.org/#/c/7982/
The problem is the following: Recovery triggered a reconstruct, the reconstruct succeeded, and vdsm has a master, but Recovery was rolled back on the engine side - a split: engine and vdsm have different masters. The rollback was caused by a failure during the update of the VMs' OVFs.

Place in the logs.

Success of Reconstruct:

2012-08-05 15:10:34,215 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (pool-4-thread-44) [58baf64e] FINISH, ReconstructMasterVDSCommand, log

Rollback of Recovery:

2012-08-05 15:10:46,238 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-44) [58baf64e] Error code VMPathNotExists and error message IRSGenericException: IRSErrorException: Failed to UpdateVMVDS, error = Cannot found VMs directory: ('/rhev/data-center/mnt/blockSD/7a4c3779-ff28-49ea-9b75-76e34c1c0fc7/master/vms',)
2012-08-05 15:10:46,238 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-44) [58baf64e] IrsBroker::Failed::UpdateVMVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to UpdateVMVDS, error = Cannot found VMs directory: ('/rhev/data-center/mnt/blockSD/7a4c3779-ff28-49ea-9b75-76e34c1c0fc7/master/vms',)
2012-08-05 15:10:46,242 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (pool-4-thread-44) [58baf64e] FINISH, UpdateVMVDSCommand, log id: 60af7e71
2012-08-05 15:10:46,243 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-4-thread-44) [58baf64e] START, MarkPoolInReconstructModeVDSCommand(storagePoolId = ddc2020c-c909-4214-89ff-b3cdde436183, ignoreFailoverLimit = false, compatabilityVersion = null, reconstructMarkAction = ClearCache), log id: 43157064
2012-08-05 15:10:46,243 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-44) [58baf64e] clearing cache for problematic entities in pool ddc2020c-c909-4214-89ff-b3cdde436183
2012-08-05 15:10:46,243 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-4-thread-44) [58baf64e] FINISH, MarkPoolInReconstructModeVDSCommand, log id: 43157064
2012-08-05 15:10:46,243 ERROR [org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand] (pool-4-thread-44) [58baf64e] Command org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.irsbroker.IRSErrorException: IRSGenericException: IRSErrorException: Failed to UpdateVMVDS, error = Cannot found VMs directory: ('/rhev/data-center/mnt/blockSD/7a4c3779-ff28-49ea-9b75-76e34c1c0fc7/master/vms',)
2012-08-05 15:10:46,288 INFO [org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand] (pool-4-thread-44) [58baf64e] Command [id=9af8b87c-a4ba-4159-9027-9f8b453ec4f1]: Compensating CHANGED_ENTITY of org.ovirt.engine.core.common.businessentities.storage_domain_static; snapshot: id=e3c8df23-d94e-4d9f-ba77-6114c76776af.
2012-08-05 15:10:46,298 INFO [org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand] (pool-4-thread-44) [58baf64e] Command [id=9af8b87c-a4ba-4159-9027-9f8b453ec4f1]: Compensating CHANGED_ENTITY of org.ovirt.engine.core.common.businessentities.storage_domain_static; snapshot: id=7a4c3779-ff28-49ea-9b75-76e34c1c0fc7.
2012-08-05 15:10:46,302 INFO [org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand] (pool-4-thread-44) [58baf64e] Command [id=9af8b87c-a4ba-4159-9027-9f8b453ec4f1]: Compensating NEW_ENTITY_ID of org.ovirt.engine.core.common.businessentities.storage_pool_iso_map; snapshot: storagePoolId = ddc2020c-c909-4214-89ff-b3cdde436183, storageId = 7a4c3779-ff28-49ea-9b75-76e34c1c0fc7.
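For context, the "Compensating CHANGED_ENTITY / NEW_ENTITY_ID" lines above are the engine's compensation mechanism undoing its own database changes. A minimal sketch of that idea, with hypothetical names (not the actual CommandBase/CompensationContext code): before each change the command records an undo action, and when executeCommand() throws, the recorded snapshots are replayed in reverse. Crucially, this rolls back the engine database only; vdsm has already switched masters by then, which is exactly how the two sides ended up disagreeing.

    // Sketch only; hypothetical names, not the real engine classes.
    import java.util.ArrayDeque;
    import java.util.Deque;

    class CompensationSketch {
        // Undo actions recorded before each DB change, replayed in reverse on failure.
        private final Deque<Runnable> undoStack = new ArrayDeque<Runnable>();

        void beforeChange(Runnable undo) {
            undoStack.push(undo);
        }

        void execute(Runnable command) {
            try {
                command.run(); // e.g. the UpdateVM step that failed in the logs
            } catch (RuntimeException e) {
                // Engine-side rollback only: anything vdsm changed meanwhile
                // (such as the new master domain) is NOT undone here.
                while (!undoStack.isEmpty()) {
                    undoStack.pop().run();
                }
                throw e;
            }
        }
    }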
mkublin, one command is the ReconstructMasterDomainCommand and one is RecoveryStoragePoolCommand (which calls reconstruct's execute() method directly and does not use the runAction mechanism). The issue does seem to be related to two reconstructs on the same pool running simultaneously, so this patch should solve the issue, and I'd like pre-integration to verify it.
(In reply to comment #8)
> mkublin, one command is the ReconstructMasterDomainCommand and one is
> RecoveryStoragePoolCommand (which calls reconstruct's execute() method
> directly and does not use the runAction mechanism). The issue does seem to
> be related to two reconstructs on the same pool running simultaneously, so
> this patch should solve the issue, and I'd like pre-integration to verify it.

Prove it with logs.
2012-08-05 15:09:50,079 INFO [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (QuartzScheduler_Worker-58) [417a872b] Running command: ReconstructMasterDomainCommand internal: true. Entities affected : ID: e3c8df23-d94e-4d9f-ba77-6114c76776af Type: Storage
2012-08-05 15:09:50,127 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (QuartzScheduler_Worker-58) [417a872b] START, MarkPoolInReconstructModeVDSCommand(storagePoolId = ddc2020c-c909-4214-89ff-b3cdde436183, ignoreFailoverLimit = false, compatabilityVersion = null, reconstructMarkAction = ClearCache), log id: 10aa801c
2012-08-05 15:09:50,128 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-58) [417a872b] clearing cache for problematic entities in pool ddc2020c-c909-4214-89ff-b3cdde436183
2012-08-05 15:09:50,128 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (QuartzScheduler_Worker-58) [417a872b] FINISH, MarkPoolInReconstructModeVDSCommand, log id: 10aa801c
2012-08-05 15:09:59,583 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-75) domain e3c8df23-d94e-4d9f-ba77-6114c76776af in problem. vds: blond-vdsh
2012-08-05 15:10:00,001 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-77) Checking autorecoverable hosts
2012-08-05 15:10:00,003 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-77) Autorecovering 0 hosts
2012-08-05 15:10:00,003 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-77) Checking autorecoverable hosts done
2012-08-05 15:10:00,003 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-77) Checking autorecoverable storage domains
2012-08-05 15:10:00,003 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-77) Autorecovering 0 storage domains
2012-08-05 15:10:00,003 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-77) Checking autorecoverable storage domains done
2012-08-05 15:10:00,175 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-76) hostFromVds::selectedVds - blond-vdsh, spmStatus Unknown_Pool, storage pool iSCSI
2012-08-05 15:10:00,193 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (QuartzScheduler_Worker-76) START, ConnectStoragePoolVDSCommand(vdsId = f0e829f6-dcbe-11e1-b93d-001a4a169741, storagePoolId = ddc2020c-c909-4214-89ff-b3cdde436183, vds_spm_id = 1, masterDomainId = e3c8df23-d94e-4d9f-ba77-6114c76776af, masterVersion = 1), log id: 57b2a23
2012-08-05 15:10:00,483 INFO [org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand] (pool-4-thread-44) [58baf64e] Running command: RecoveryStoragePoolCommand internal: false. Entities aff
2012-08-05 15:09:50,127 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (QuartzScheduler_Worker-58) [417a872b] START, MarkPoolInReconstructModeVDSCommand(storagePoolId = ddc2020c-c909-4214-89ff-b3cdde436183, ignoreFailoverLimit = false, compatabilityVersion = null, reconstructMarkAction = ClearCache), log id: 10aa801c
2012-08-05 15:09:50,128 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-58) [417a872b] clearing cache for problematic entities in pool ddc2020c-c909-4214-89ff-b3cdde436183
2012-08-05 15:09:50,128 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (QuartzScheduler_Worker-58) [417a872b] FINISH, MarkPoolInReconstructModeVDSCommand, log id: 10aa801c

This means that it finished:

    finally {
        // reset cache and mark reconstruct for pool as finished
        Backend.getInstance()
                .getResourceManager()
                .RunVdsCommand(
                        VDSCommandType.MarkPoolInReconstructMode,
                        new MarkPoolInReconstructModeVDSCommandParameters(
                                getStoragePoolId().getValue(),
                                ReconstructMarkAction.ClearCache));
    }

Where is the second one running?
mkublin, the storage pool is loaded within the canDoAction() method, and the print is done after canDoAction() has finished. So basically, while thread A is executing the reconstruct, thread B can load the storage pool with stale values - so a race can occur and cause this bug.
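To illustrate the direction of the fix - a minimal sketch only, with hypothetical names (the actual change is the gerrit patch above): serialize reconstruct per storage pool, so a second caller is rejected instead of racing on stale pool state.

    // Sketch: at most one in-flight reconstruct per pool; hypothetical names.
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.locks.ReentrantLock;

    class ReconstructGuard {
        private static final ConcurrentMap<String, ReentrantLock> POOL_LOCKS =
                new ConcurrentHashMap<String, ReentrantLock>();

        // Runs the action only if no other reconstruct is in flight for this
        // pool; returns false (instead of racing) otherwise.
        static boolean tryReconstruct(String storagePoolId, Runnable reconstruct) {
            ReentrantLock fresh = new ReentrantLock();
            ReentrantLock existing = POOL_LOCKS.putIfAbsent(storagePoolId, fresh);
            ReentrantLock lock = (existing != null) ? existing : fresh;
            if (!lock.tryLock()) {
                return false; // another reconstruct already holds the pool lock
            }
            try {
                reconstruct.run();
                return true;
            } finally {
                lock.unlock();
            }
        }
    }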
(In reply to comment #12)
> mkublin, the storage pool is loaded within the canDoAction() method, and the
> print is done after canDoAction() has finished. So basically, while thread A
> is executing the reconstruct, thread B can load the storage pool with stale
> values - so a race can occur and cause this bug.

Like I said, a race can occur, but that was not the reason for this bug. Preventing the race will not solve the problem described in this bug. You cannot provide some random solution for a bug; you should solve the problem described in the bug.
After reconstruct completes on the two different threads, connect and refresh commands are sent to the hosts (from the log) - note the different master domain id and different version:

2012-08-05 15:10:41,245 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.RefreshStoragePoolVDSCommand] (pool-4-thread-44) [58baf64e] START, RefreshStoragePoolVDSCommand(vdsId = f0e829f6-dcbe-11e1-b93d-001a4a169741, storagePoolId = ddc2020c-c909-4214-89ff-b3cdde436183, masterStorageDomainId=7a4c3779-ff28-49ea-9b75-76e34c1c0fc7, msterVersion=2), log id: 2fdb5609
2012-08-05 15:10:46,172 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.RefreshStoragePoolVDSCommand] (pool-4-thread-45) [35d639c1] START, RefreshStoragePoolVDSCommand(vdsId = f0e829f6-dcbe-11e1-b93d-001a4a169741, storagePoolId = ddc2020c-c909-4214-89ff-b3cdde436183, masterStorageDomainId=7a4c3779-ff28-49ea-9b75-76e34c1c0fc7, msterVersion=1), log id: 77a4c559

What causes the UpdateVM to fail? We need to verify on the vdsm side, as we pass the storagePoolId, but I'm pretty sure this mixture causes it, since a single reconstruct doesn't cause the issue.
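One way to picture why the second, older refresh is harmful - a sketch only, under the assumption that masterVersion should only move forward (hypothetical names, not engine or vdsm code): a consumer could simply drop any refresh carrying a masterVersion older than the last one it applied.

    // Sketch: reject out-of-order master versions; hypothetical names.
    import java.util.concurrent.atomic.AtomicInteger;

    class MasterVersionTracker {
        private final AtomicInteger lastApplied = new AtomicInteger(0);

        // Accepts a refresh only if it carries a newer masterVersion than the
        // last one applied; e.g. version 1 arriving after version 2 is dropped,
        // as in the second log line above.
        boolean acceptRefresh(int masterVersion) {
            while (true) {
                int current = lastApplied.get();
                if (masterVersion <= current) {
                    return false; // stale refresh
                }
                if (lastApplied.compareAndSet(current, masterVersion)) {
                    return true;
                }
            }
        }
    }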
(In reply to comment #14)
> After reconstruct completes on the two different threads, connect and
> refresh commands are sent to the hosts (from the log) - note the different
> master domain id and different version

2012-08-05 15:10:41,245 != 2012-08-05 15:10:46,172 (41 != 46). One finished, then the next started.
http://gerrit.ovirt.org/#/c/8390/
Merged Ida51b027b8078555ffab5b80d2ffe54572fa6cca
Verified on si24.