Bug 888259 - engine: storage pool recovery fails if reinitialize is initiated during SetStoragePoolStatusCommand
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.1.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.2.0
Assignee: Liron Aravot
QA Contact: Dafna Ron
URL:
Whiteboard: storage
Depends On: 917548
Blocks: 948448
 
Reported: 2012-12-18 12:12 UTC by Dafna Ron
Modified: 2016-02-10 20:18 UTC (History)
10 users

Fixed In Version: sf16
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
logs (1.63 MB, application/x-gzip)
2012-12-18 12:12 UTC, Dafna Ron
logs (950.26 KB, application/x-gzip)
2013-03-12 17:01 UTC, Dafna Ron


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 13709 0 None None None Never

Description Dafna Ron 2012-12-18 12:12:50 UTC
Created attachment 665465 [details]
logs

Description of problem:

During reinitialize of the DC (which sends reconstructMaster to VDSM), the engine sends SetStoragePoolStatusCommand, which sends connectStoragePool to VDSM - the reconstruct fails since the pool is connected.
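The race can be sketched as follows. This is a hypothetical illustration, not the actual engine code; the class and method names are invented. Serializing the reconstruct path and the connect path on a per-pool lock is one way to avoid the connect seeing a half-switched master:

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the race: SetStoragePoolStatusCommand issues
// connectStoragePool while a reconstruct is still switching the master
// domain. A per-pool lock serializes the two paths. Names are invented.
public class PoolGuard {
    private final ReentrantLock poolLock = new ReentrantLock();
    private String masterDomainId;

    public PoolGuard(String initialMaster) {
        this.masterDomainId = initialMaster;
    }

    // Invented stand-in for the reconstruct flow: switches the master.
    public void reconstructMaster(String newMaster) {
        poolLock.lock();
        try {
            masterDomainId = newMaster;
        } finally {
            poolLock.unlock();
        }
    }

    // Invented stand-in for connectStoragePool: fails on a stale master id,
    // mirroring the "Cannot find master domain" error in the logs below.
    public boolean connectStoragePool(String expectedMaster) {
        poolLock.lock();
        try {
            return masterDomainId.equals(expectedMaster);
        } finally {
            poolLock.unlock();
        }
    }
}
```

Without the lock, a connect racing the reconstruct can carry the old master id and be rejected, which is the failure described here.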

Version-Release number of selected component (if applicable):

si25.1

How reproducible:

race

Steps to Reproduce:
1. with nfs storage, block connectivity to the master domain
2. reinitialize storage during SetStoragePoolStatusCommand
Actual results:

SetStoragePoolStatusCommand sends connectStoragePool during reconstruct.
We fail with a wrong master domain error.

Expected results:

We should not send connectStoragePool during reconstruct.

Additional info: logs

Comment 1 Ayal Baron 2012-12-26 06:59:50 UTC
Liron,
Is this covered by the fix for bug 882837 ?

Comment 2 Liron Aravot 2013-01-01 14:59:13 UTC
Summary:
1. The described scenario isn't that relevant, as simultaneous reconstructs won't occur now; we will have connectStoragePool but no reconstruct failures.

2. We have other issues here that do need to be taken care of: initiation of recovery, and possible races between automatic timers and recovery/reconstruct.
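The timer/recovery race in point 2 can be illustrated with a guard flag. This is a minimal, invented sketch under the assumption that a single in-flight recovery should suppress timer-triggered reconstructs; the engine's real synchronization differs:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch: a periodic timer task skips its automatic reconstruct
// while a user-initiated recovery is in flight. Names are invented.
public class RecoveryGuard {
    private final AtomicBoolean recoveryInProgress = new AtomicBoolean(false);

    // Called when a recovery starts; returns false if one is already running.
    public boolean tryStartRecovery() {
        return recoveryInProgress.compareAndSet(false, true);
    }

    public void finishRecovery() {
        recoveryInProgress.set(false);
    }

    // The timer-triggered path checks the flag before reconstructing.
    public boolean timerMayReconstruct() {
        return !recoveryInProgress.get();
    }
}
```

The compareAndSet also prevents two recoveries from starting concurrently, which is the simultaneous-reconstruct case point 1 says is already handled.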

-----------------------------------------------------------------------------


What happened here is as follows:

The GetStoragePoolInfo command runs every few seconds by timer; it triggers ConnectStoragePool and can trigger a Reconstruct.
example:

2012-12-16 17:48:28,975 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-4) [3f2d4db] hostFromVds::selectedVds - pink-vds2, spmStatus Unknown_Pool, storage pool TEST-DC
2012-12-16 17:48:29,030 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (QuartzScheduler_Worker-4) [3f2d4db] START, ConnectStoragePoolVDSCommand(HostName = pink-vds2, HostId = 8ed2188e-4795-11e2-b963-001122334455, storagePoolId = 03bf4661-cd9b-4581-a18f-ed7f7be148dd, vds_spm_id = 1, masterDomainId = e589c31b-f266-4a87-a833-fad240da8257, 

(QuartzScheduler_Worker-4) [3f2d4db] Command ConnectStoragePoolVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=03bf4661-cd9b-4581-a18f-ed7f7be148dd, msdUUID=e589c31b-f266-4a87-a833-fad240da8257'
2012-12-16 17:48:31,257 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (QuartzScheduler_Worker-4) [3f2d4db] FINISH, ConnectStoragePoolVDSCommand, log id: 5217d889
2012-12-16 17:48:31,257 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-4) [3f2d4db] IrsBroker::Failed::GetStoragePoolInfoVDS
2012-12-16 17:48:31,257 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-4) [3f2d4db] Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=03bf4661-cd9b-4581-a18f-ed7f7be148dd, msdUUID=e589c31b-f266-4a87-a833-fad240da8257'
2012-12-16 17:48:31,461 INFO  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (QuartzScheduler_Worker-4) [275de760] Running command: ReconstructMasterDomainCommand internal: true. Entities affected :  ID: 





now the user runs RecoveryStoragePoolCommand:

2012-12-16 17:48:43,989 INFO  [org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand] (pool-3-thread-47) [4a41bda1] Running command: RecoveryStoragePoolCommand

2012-12-16 17:48:44,426 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (pool-3-thread-47) [2d5562a8] START, ReconstructMasterVDSComman
2012-12-16 17:49:08,100 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (pool-3-thread-47) [2d5562a8] FINISH, ReconstructMasterVDSCommand, log id: 13d16ea7


After https://bugzilla.redhat.com/show_bug.cgi?id=882837 was merged, reconstruct shouldn't run simultaneously - Dafna, if RecoveryStoragePool runs, no other reconstruct should run meanwhile or afterwards.



There seems to be another issue here which is not related to the described bug
- races between reconstruct/recovery and running timers, which should be resolved.
Example: the recovery completed successfully, but we initiate another reconstruct and SPM stop because of what seems to be stale data (the storage pool was loaded with version 1 from the DB, while the reconstruct had updated it to 2).

2012-12-16 17:49:35,721 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-58) [3fb35ac1] Master domain version is not in sync between DB and VDSM. Domain st marked as master, but the version in DB: 1 and in VDSM: 2

2012-12-16 17:49:35,721 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-58) [3fb35ac1] Master domain version is not in sync between DB and VDSM. Domain st marked as master, but the version in DB: 1 and in VDSM: 2
2012-12-16 17:49:39,596 INFO  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] Running command: ReconstructMasterDomainCommand internal: true. Entities affected :  ID: 2863bd4d-3f70-4ae8-b186-bf58088e28bd Type: Storage
2012-12-16 17:49:41,898 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-81) [42958589] Storage Domain 2863bd4d-3f70-4ae8-b186-bf58088e28bd:st was reported by Host pink-vds2 as Active in Pool 03bf4661-cd9b-4581-a18f-ed7f7be148dd, moving to active status
2012-12-16 17:49:43,739 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] START, ResetIrsVDSCommand( storagePoolId = 03bf4661-cd9b-4581-a18f-ed7f7be148dd, ignoreFailoverLimit = false, compatabilityVersion = null, hostName = pink-vds2.qa.lab.tlv.redhat.com, vdsId = 8ed2188e-4795-11e2-b963-001122334455, ignoreStopFailed = true), log id: 841cc0c
2012-12-16 17:49:43,741 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] START, SpmStopVDSCommand(HostName = pink-vds2, HostId = 8ed2188e-4795-11e2-b963-001122334455, storagePoolId = 03bf4661-cd9b-4581-a18f-ed7f7be148dd), log id: 15407735
2012-12-16 17:49:43,762 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] SpmStopVDSCommand::Stopping SPM on vds pink-vds2, pool id 03bf4661-cd9b-4581-a18f-ed7f7be148dd
2012-12-16 17:49:43,993 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] FINISH, SpmStopVDSCommand, log id: 15407735
2012-12-16 17:49:46,396 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] FINISH, ResetIrsVDSCommand, log id: 841cc0c
2012-12-16 17:49:48,516 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] START, MarkPoolInReconstructModeVDSCommand( storagePoolId = 03bf4661-cd9b-4581-a18f-ed7f7be148dd, ignoreFailoverLimit = false, compatabilityVersion = null, reconstructMarkAction = ClearCache), log id: 25576d88
2012-12-16 17:49:48,516 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] clearing cache for problematic entities in pool 03bf4661-cd9b-4581-a18f-ed7f7be148dd
2012-12-16 17:49:48,516 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] FINISH,
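The stale-data symptom above boils down to a version comparison: the engine compares the pool's master version loaded from the DB with the version VDSM reports, and a stale DB copy (1 vs 2) trips the warning. A simplified sketch of that check, with invented names (the real IrsBrokerCommand logic is more involved):

```java
// Simplified sketch of the master-version consistency check behind the
// "Master domain version is not in sync" warning in the log above.
// Names are invented for illustration.
public class MasterVersionCheck {
    public static boolean inSync(int dbVersion, int vdsmVersion) {
        return dbVersion == vdsmVersion;
    }

    public static String describe(int dbVersion, int vdsmVersion) {
        if (inSync(dbVersion, vdsmVersion)) {
            return "in sync";
        }
        // Stale DB data (e.g. DB: 1, VDSM: 2) produces the warning and can
        // trigger an unnecessary reconstruct and SPM stop.
        return "Master domain version is not in sync between DB and VDSM."
                + " DB: " + dbVersion + ", VDSM: " + vdsmVersion;
    }
}
```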

Comment 3 Ayal Baron 2013-01-01 15:49:39 UTC
(In reply to comment #2)
> summary: 
> 1.the described scenario isn't that relevant as simultenous reconstruct
> won't occur now, we will have conncetStoragePool but no reconstructs failurs.

Please update title to reflect remaining issues.

Comment 4 Liron Aravot 2013-01-08 16:45:28 UTC
patch:
http://gerrit.ovirt.org/#/c/10764/

related patch with fix to recovery command:
http://gerrit.ovirt.org/#/c/10549/

Comment 6 Liron Aravot 2013-02-27 12:18:48 UTC
Should be OK right now - in very rare cases we might have an extra reconstruct when performing recovery. Moving this to MODIFIED.

Comment 7 Dafna Ron 2013-03-12 17:00:06 UTC
tested on sf10. 
I am still seeing two reconstruct threads: 

2013-03-12 06:28:30,696 INFO  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-3-thread-49) [abd2112] Running command: ReconstructMasterDomainCommand internal: true. Entities affected :  ID: b37a4aef-09dd-4a4e-ae


2013-03-12 06:29:40,892 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-3-thread-46) START, MarkPoolInReconstructModeVDSCommand( storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFai


We get an error from VDSM failing to verify the master domain metadata (MD):

StorageDomainMasterError: Error validating master storage domain: ('MD read error',)


engine side: 

2013-03-12 06:30:31,165 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-22) [4ba73fa1] Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)

system reconstruct is failing with NPE: 

2013-03-12 06:32:40,897 ERROR [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (pool-3-thread-50) [d6f3ed3] Exception during process of events for pool 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, error is org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException
2013-03-12 06:32:40,896 ERROR [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (pool-3-thread-49) [d6f3ed3] Failed at submitEventSync, for pool 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7 with exception java.util.concurrent.ExecutionException: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException: java.util.concurrent.ExecutionException: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252) [rt.jar:1.7.0_09-icedtea]
        at java.util.concurrent.FutureTask.get(FutureTask.java:111) [rt.jar:1.7.0_09-icedtea]
        at org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor.submitEventSync(EventQueueMonitor.java:54) [engine-bll.jar:]
        at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source) [:1.7.0_09-icedtea]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_09-icedtea]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_09-icedtea]
        at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ejb3.component.invocationmetrics.ExecutionTimeInterceptor.processInvocation(ExecutionTimeInterceptor.java:43) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ejb3.component.singleton.SingletonComponentInstanceAssociationInterceptor.processInvocation(SingletonComponentInstanceAssociationInterceptor.java:53) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInNoTx(CMTTxInterceptor.java:209) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:361) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]


When AutoRecovery runs, we roll back the reconstruct and get a wrong-master-domain-or-version error on the previously Non-Operational host.

moving back to devel with new logs

Comment 8 Dafna Ron 2013-03-12 17:01:27 UTC
Created attachment 709079 [details]
logs

Comment 9 Liron Aravot 2013-03-17 11:11:14 UTC
Dafna, there are no two reconstruct threads:
2013-03-12 06:28:30,696 INFO  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-3-thread-49) [abd2112] Running command: ReconstructMasterDomainCommand internal: true. Entities affected :  ID: b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5 Type: Storage
2013-03-12 06:28:30,711 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-3-thread-49) [abd2112] START, MarkPoolInReconstructModeVDSCommand( storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFailoverLimit = false, compatabilityVersion = null, reconstructMarkAction = ClearCache), log id: 79c3e89a
2013-03-12 06:28:30,714 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-3-thread-49) [abd2112] No string for RECONSTRUCT_MASTER_FAILED type. Use default Log


This is where the first reconstruct ended - later on a different reconstruct begins:

2013-03-12 06:29:40,892 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-3-thread-46) START, MarkPoolInReconstructModeVDSCommand( storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFailoverLimit = false, compatabilityVersion = null, reconstructMarkAction = ClearJobs), log id: 81f8f7


2013-03-12 06:32:40,896 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-3-thread-46) START, MarkPoolInReconstructModeVDSCommand( storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFailoverLimit = false, compatabilityVersion = null, reconstructMarkAction = ClearCache), log id: 7b41bc52
2013-03-12 06:32:40,896 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-3-thread-46) clearing cache for problematic entities in pool 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7

the stacktrace that you pasted is a failure with TimeoutException.

There is an NPE in the code; it seems to be caused by a race: Recovery failed (compensation ran) and a host activation loaded domain
0789a306-8ad8-4655-8644-960809561139 as the master, while in the meanwhile the compensation returned b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5 to be the master.
This issue should have been solved in 3.3 (it's a different bug).

Dafna, as it's not the initial bug, I think that we need to close this one and open a bug for the present issue, then decide if we want it for 3.2.

Comment 10 Ayal Baron 2013-03-17 11:38:33 UTC
(In reply to comment #9)
> Dafna, there are no two reconstruct threads:
> 2013-03-12 06:28:30,696 INFO 
> [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand]
> (pool-3-thread-49) [abd2112] Running command: ReconstructMasterDomainCommand
>  internal: true. Entities affected :  ID:
> b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5 Type: Storage
> 2013-03-12 06:28:30,711 INFO 
> [org.ovirt.engine.core.vdsbroker.irsbroker.
> MarkPoolInReconstructModeVDSCommand] (pool-3-thread-49) [abd2112] START,
> MarkPoolInReconstructModeVD
> SCommand( storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7,
> ignoreFailoverLimit = false, compatabilityVersion = null,
> reconstructMarkAction = ClearCache), log id: 79c3e8
> 9a
> 2013-03-12 06:28:30,714 INFO 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (pool-3-thread-49) [abd2112] No string for RECONSTRUCT_MASTER_FAILED typ
> e. Use default Log
> 
> 
> this is where the first reconstruct ended - later on different reconstruct
> begins: 
> 
> 2013-03-12 06:29:40,892 INFO 
> [org.ovirt.engine.core.vdsbroker.irsbroker.
> MarkPoolInReconstructModeVDSCommand] (pool-3-thread-46) START,
> MarkPoolInReconstructModeVDSCommand( storagePoolId =
> 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFailoverLimit = false,
> compatabilityVersion = null, reconstructMarkAction = ClearJobs), log id:
> 81f8f7
> 
> 
> 2013-03-12 06:32:40,896 INFO 
> [org.ovirt.engine.core.vdsbroker.irsbroker.
> MarkPoolInReconstructModeVDSCommand] (pool-3-thread-46) START,
> MarkPoolInReconstructModeVDSCommand( storagePoolId =
> 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFailoverLimit = false,
> compatabilityVersion = null, reconstructMarkAction = ClearCache), log id:
> 7b41bc52
> 2013-03-12 06:32:40,896 INFO 
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (pool-3-thread-46) clearing cache for problematic entities in pool
> 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7
> 
> the stacktrace that you pasted is a failure with TimeoutException.
> 
> There is NPE in the code, it's seems to be caused by a race when Recovery
> failed (compensation runs) and a host activation loaded domain 
> 0789a306-8ad8-4655-8644-960809561139 as the master while in the meanwhile
> the compensation returned b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5 to be the
> master -
> this issue should have been solved in 3.3 (it's a different bug).
> 
> Dafna, as it's not the initial bug - I think that we need to close this one
> and open a bug for the present issue , then decide if we want it for 3.2.

Ack. please close this and file a new one.

Comment 11 Dafna Ron 2013-03-17 14:31:11 UTC
Actually:

we did try to block the 2nd reconstruct, but it fails:

2013-03-12 06:26:03,801 ERROR [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (QuartzScheduler_Worker-77) Failed at submitEventSync, for pool 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7 with exception java.util.concurrent.CancellationException: java.util.concurrent.CancellationException


which will cause:

2013-03-12 06:26:30,127 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-100) Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)


and:

2013-03-12 06:32:45,677 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-3-thread-46) Command org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand return value 
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=324, mMessage=Wrong Master domain or its version: 'SD=0789a306-8ad8-4655-8644-960809561139, pool=5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7']]
2013-03-12 06:32:45,677 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-3-thread-46) HostName = gold-vdsc


Since we still have a race here which causes the exact same issue, I cannot verify this bug.

Comment 12 Liron Aravot 2013-03-17 14:53:43 UTC
Dafna, I'll address each part of your comment. First, the logs in your comment are from 3 minutes before you ran recovery and before the logs that you pasted in the previous comment.

1. "we did try to block the 2ed reconstruct but are failing" - the log that you pasted is related to domain monitoring process being rejected, not to reconstruct (full line copied)


2013-03-12 06:26:03,801 INFO  [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (pool-3-thread-49) [57862efb] The following operation Event [eventType=DOMAINMONITORING, storagePoolId=5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, domainId=null, vdsId=83834e1f-9e60-41b5-a9cc-16460a8a2fe2] was cancelled, because of recosntruct was run before
2013-03-12 06:26:03,801 ERROR [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (QuartzScheduler_Worker-77) Failed at submitEventSync, for pool 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7 with exception java.util.concurrent.CancellationException: java.util.concurrent.CancellationException
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:250)

2. The second log is a failure in accessing the master domain, which causes a reconstruct and is fine.

2013-03-12 06:26:30,127 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-100) Command SpmStatusVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2013-03-12 06:26:30,127 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-100) hostFromVds::selectedVds - gold-vdsd, spmStatus returned null!
2013-03-12 06:26:30,127 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-100) IrsBroker::Failed::GetStoragePoolInfoVDS
2013-03-12 06:26:30,127 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-100) Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2013-03-12 06:26:30,149 INFO  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-3-thread-49) [2d52ff64] Running command: Reconst

3. These logs are from 3 minutes before you ran recovery and before the logs that you pasted in the previous comment.


The initial issue of this bug (reconstructs running simultaneously) was resolved - if they are running simultaneously, please paste the log; in the logs that were pasted it doesn't occur.
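The CancellationException discussed in point 1 matches the pattern of cancelling queued monitoring events when a reconstruct event is submitted for the pool. A minimal sketch with invented names (not the EventQueueMonitor implementation), assuming reconstruct supersedes pending monitoring events:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.FutureTask;

// Minimal sketch: when a RECONSTRUCT event arrives, pending
// DOMAINMONITORING events for the pool are cancelled; a caller blocked in
// FutureTask.get() then sees a CancellationException, as in the log.
public class EventQueueSketch {
    enum EventType { DOMAINMONITORING, RECONSTRUCT }

    static final class Event {
        final EventType type;
        final FutureTask<Void> task;
        Event(EventType type, Runnable body) {
            this.type = type;
            this.task = new FutureTask<>(body, null);
        }
    }

    private final Deque<Event> pending = new ArrayDeque<>();

    public FutureTask<Void> submit(EventType type, Runnable body) {
        Event e = new Event(type, body);
        if (type == EventType.RECONSTRUCT) {
            // Reconstruct supersedes queued monitoring events for the pool.
            pending.removeIf(p -> {
                if (p.type == EventType.DOMAINMONITORING) {
                    p.task.cancel(false);
                    return true;
                }
                return false;
            });
        }
        pending.add(e);
        return e.task;
    }
}
```

Under this model the cancellation is benign housekeeping, not a failed reconstruct, which is the distinction being drawn in the comment.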

Comment 13 Dafna Ron 2013-03-17 15:43:26 UTC
I ran this scenario several times that day (you have a lot of reconstructs), and I am looking at the correct time, since when I opened the bug I looked at it live and posted the correct entries from the log.
Since both reconstructs were running simultaneously, this issue is not solved.
Once again, I am testing the scenario, which yields the same results; hence I cannot verify the bug.

Comment 14 Ayal Baron 2013-03-20 10:09:50 UTC
...

Comment 17 Liron Aravot 2013-03-25 07:45:04 UTC
Ayal, there are two reinitialize executions in the log - one at ~6:29 and one at ~6:00.
The pasted logs were of the second one (~6:29), so I'll refer to it first and then to the one at ~6:00.

1. ~6:29: recovery fails because of a timeout when performing disconnect storage pool (this is the one quoted in the pasted logs):
2013-03-12 06:32:40,895 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (pool-3-thread-46) XML RPC error in command DisconnectStoragePoolVDS ( HostName = gold-vdsd ), the error was: java.util.concurrent.TimeoutException, null


Caused by: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:169) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:1591) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand.reconstructMaster(ReconstructMasterDomainCommand.java:110) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand.executeCommand(ReconstructMasterDomainCommand.java:139) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand.executeReconstruct(RecoveryStoragePoolCommand.java:91) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand$1.call(RecoveryStoragePoolCommand.java:128) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand$1.call(RecoveryStoragePoolCommand.java:104) [engine-bll.jar:]




2. ~6:00: the first recovery fails because of a failure in disconnectStoragePool, as the SPM was started in the meanwhile:
Caused by: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DisconnectStoragePoolVDS, error = Operation not allowed while SPM is active: ('5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7',)
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:169) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:1591) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand.reconstructMaster(ReconstructMasterDomainCommand.java:110) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand.executeCommand(ReconstructMasterDomainCommand.java:139) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand.executeReconstruct(RecoveryStoragePoolCommand.java:91) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand$1.call(RecoveryStoragePoolCommand.java:128) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand$1.call(RecoveryStoragePoolCommand.java:104) [engine-bll.jar:]

Comment 18 Ayal Baron 2013-03-25 09:12:17 UTC
Liron, there are 2 things here that I do not understand:
1. Can the timeout issue be reproduced consistently? If so, why are we timing out? (failure of the second reconstruct)

2. Why are we trying to start the spm while reconstruct is running? (failure of the first reconstruct)

Comment 19 Ayal Baron 2013-03-28 08:55:29 UTC
After discussing with Liron, it is not clear to us what the end state of the system is and what the repro steps are to reach this state.
Haim, do we have clear reproduction steps for this?
Does the system recover from this state or not?

Comment 20 Haim 2013-03-29 21:02:51 UTC
(In reply to comment #19)
> After discussing with Liron, it is not clear to us what the end state of the
> system is and what the repro steps are to reach this state.
> Haim, do we have clear reproduction steps for this?
> Does the system recover from this state or not?

Yes we have, see comment #1.
No, it doesn't recover (wrong master domain or version).

Comment 23 Allon Mureinik 2013-05-02 09:13:36 UTC
Change-Id I6418b9d2826146e1e4ceff4341c6d7cd3a0024af (referenced in the external tracker) should solve this issue too.

Comment 24 Dafna Ron 2013-05-13 13:00:50 UTC
verified on sf16

Comment 25 Itamar Heim 2013-06-11 08:32:24 UTC
3.2 has been released


