Bug 888259

Summary:

engine: storage pool recovery fails if reinitialize is initiated during SetStoragePoolStatusCommand

Product:

Red Hat Enterprise Virtualization Manager

Reporter:

Dafna Ron <dron>

Component:

ovirt-engine

Assignee:

Liron Aravot <laravot>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Dafna Ron <dron>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

3.1.0

CC:

abaron, amureini, dyasny, hateya, iheim, lpeer, Rhev-m-bugs, sgrinber, yeylon, ykaul

Target Milestone:

---

Target Release:

3.2.0

Hardware:

x86_64

OS:

Linux

Whiteboard:

storage

Fixed In Version:

sf16

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

Storage

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

917548

Bug Blocks:

948448

Attachments:

Description	Flags
logs	none
logs	none

Description Dafna Ron 2012-12-18 12:12:50 UTC

Created attachment 665465 [details]
logs

Description of problem:

during reinitialize DC which sends reconstruct master to vdsm, engine sends SetStoragePoolStatusCommand which sends connectStoragePool to vdsm - reconstruct fails since pool is connected. 

Version-Release number of selected component (if applicable):

si25.1

How reproducible:

race

Steps to Reproduce:
1. with nfs storage, block connectivity to the master domain
2. reinitialize storage during SetStoragePoolStatusCommand
3.
  
Actual results:

SetStoragePoolStatusCommand sends connectStoragePool during reconstruct. 
we fail with wrong master domain error

Expected results:

we should not send connectStoragePool during reconstruct. 

Additional info: logs

Comment 1 Ayal Baron 2012-12-26 06:59:50 UTC

Liron,
Is this covered by the fix for bug 882837 ?

Comment 2 Liron Aravot 2013-01-01 14:59:13 UTC

summary: 
1.the described scenario isn't that relevant as simultenous reconstruct won't occur now, we will have conncetStoragePool but no reconstructs failurs.

2. we have other issues here that do need to be taken care of : initiation of recovery, possible races between automatic timers and recovery/reconstruct.

----------------------------------------------------------------------------
-----------------------------------------------------------------------------


what happend here is as follows:

GetStoragePoolInfo commands runs every few seconds by timer, triggers ConnectStoragePool and can triggers Reconstruct.
example:

2012-12-16 17:48:28,975 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-4) [3f2d4db] hostFromVds::selectedVds - pink-vds2, spmStatus Unknown_Pool, storage pool TEST-DC
2012-12-16 17:48:29,030 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (QuartzScheduler_Worker-4) [3f2d4db] START, ConnectStoragePoolVDSCommand(HostName = pink-vds2, HostId = 8ed2188e-4795-11e2-b963-001122334455, storagePoolId = 03bf4661-cd9b-4581-a18f-ed7f7be148dd, vds_spm_id = 1, masterDomainId = e589c31b-f266-4a87-a833-fad240da8257, 

(QuartzScheduler_Worker-4) [3f2d4db] Command ConnectStoragePoolVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=03bf4661-cd9b-4581-a18f-ed7f7be148dd, msdUUID=e589c31b-f266-4a87-a833-fad240da8257'
2012-12-16 17:48:31,257 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (QuartzScheduler_Worker-4) [3f2d4db] FINISH, ConnectStoragePoolVDSCommand, log id: 5217d889
2012-12-16 17:48:31,257 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-4) [3f2d4db] IrsBroker::Failed::GetStoragePoolInfoVDS
2012-12-16 17:48:31,257 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-4) [3f2d4db] Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=03bf4661-cd9b-4581-a18f-ed7f7be148dd, msdUUID=e589c31b-f266-4a87-a833-fad240da8257'
2012-12-16 17:48:31,461 INFO  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (QuartzScheduler_Worker-4) [275de760] Running command: ReconstructMasterDomainCommand internal: true. Entities affected :  ID: 





now the user runs RecoveryStoragePoolCommand:

2012-12-16 17:48:43,989 INFO  [org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand] (pool-3-thread-47) [4a41bda1] Running command: RecoveryStoragePoolCommand

2012-12-16 17:48:44,426 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (pool-3-thread-47) [2d5562a8] START, ReconstructMasterVDSComman
2012-12-16 17:49:08,100 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (pool-3-thread-47) [2d5562a8] FINISH, ReconstructMasterVDSCommand, log id: 13d16ea7


after https://bugzilla.redhat.com/show_bug.cgi?id=882837 was merged reconstruct shouldn't run simultaneously - Dafna, so if RecoveryStoragePool runs no other reconstruct should run meanwhile or after.



there seems to be another issues here which is not related to the described bug
- races between reconstruct/recovery and running timers which should be resolved -
example: the recovery has completed succesfully, but we initate another reconstruct and spm stop because of what seems to be stale data (loaded storage pool with version 1 from the db, while the reconstruct has updated it to 2).

2012-12-16 17:49:35,721 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-58) [3fb35ac1] Master domain version is not in sync between DB and VDSM. Domain st marked as master, but the version in DB: 1 and in VDSM: 2

2012-12-16 17:49:35,721 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-58) [3fb35ac1] Master domain version is not in sync between DB and VDSM. Domain st marked as master, but the version in DB: 1 and in VDSM: 2
2012-12-16 17:49:39,596 INFO  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] Running command: ReconstructMasterDomainCommand internal: true. Entities affected :  ID: 2863bd4d-3f70-4ae8-b186-bf58088e28bd Type: Storage
2012-12-16 17:49:41,898 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-81) [42958589] Storage Domain 2863bd4d-3f70-4ae8-b186-bf58088e28bd:st was reported by Host pink-vds2 as Active in Pool 03bf4661-cd9b-4581-a18f-ed7f7be148dd, moving to active status
2012-12-16 17:49:43,739 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] START, ResetIrsVDSCommand( storagePoolId = 03bf4661-cd9b-4581-a18f-ed7f7be148dd, ignoreFailoverLimit = false, compatabilityVersion = null, hostName = pink-vds2.qa.lab.tlv.redhat.com, vdsId = 8ed2188e-4795-11e2-b963-001122334455, ignoreStopFailed = true), log id: 841cc0c
2012-12-16 17:49:43,741 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] START, SpmStopVDSCommand(HostName = pink-vds2, HostId = 8ed2188e-4795-11e2-b963-001122334455, storagePoolId = 03bf4661-cd9b-4581-a18f-ed7f7be148dd), log id: 15407735
2012-12-16 17:49:43,762 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] SpmStopVDSCommand::Stopping SPM on vds pink-vds2, pool id 03bf4661-cd9b-4581-a18f-ed7f7be148dd
2012-12-16 17:49:43,993 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] FINISH, SpmStopVDSCommand, log id: 15407735
2012-12-16 17:49:46,396 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] FINISH, ResetIrsVDSCommand, log id: 841cc0c
2012-12-16 17:49:48,516 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] START, MarkPoolInReconstructModeVDSCommand( storagePoolId = 03bf4661-cd9b-4581-a18f-ed7f7be148dd, ignoreFailoverLimit = false, compatabilityVersion = null, reconstructMarkAction = ClearCache), log id: 25576d88
2012-12-16 17:49:48,516 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] clearing cache for problematic entities in pool 03bf4661-cd9b-4581-a18f-ed7f7be148dd
2012-12-16 17:49:48,516 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] FINISH,

Comment 3 Ayal Baron 2013-01-01 15:49:39 UTC

(In reply to comment #2)
> summary: 
> 1.the described scenario isn't that relevant as simultenous reconstruct
> won't occur now, we will have conncetStoragePool but no reconstructs failurs.

Please update title to reflect remaining issues.

Comment 4 Liron Aravot 2013-01-08 16:45:28 UTC

patch:
http://gerrit.ovirt.org/#/c/10764/

related patch with fix to recovery command:
http://gerrit.ovirt.org/#/c/10549/

Comment 6 Liron Aravot 2013-02-27 12:18:48 UTC

should be ok right now - in very rare cases we might have an extra reconstruct when performing recovery, moving this to modified.

Comment 7 Dafna Ron 2013-03-12 17:00:06 UTC

tested on sf10. 
I am still seeing two reconstruct threads: 

2013-03-12 06:28:30,696 INFO  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-3-thread-49) [abd2112] Running command: ReconstructMasterDomainCommand internal: true. Entities affected :  ID: b37a4aef-09dd-4a4e-ae


2013-03-12 06:29:40,892 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-3-thread-46) START, MarkPoolInReconstructModeVDSCommand( storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFai


we get an error from vdsm failing to verify MD: 

StorageDomainMasterError: Error validating master storage domain: ('MD read error',)


engine side: 

2013-03-12 06:30:31,165 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-22) [4ba73fa1] Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)

system reconstruct is failing with NPE: 

2013-03-12 06:32:40,897 ERROR [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (pool-3-thread-50) [d6f3ed3] Exception during process of events for pool 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, error is org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException
2013-03-12 06:32:40,896 ERROR [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (pool-3-thread-49) [d6f3ed3] Failed at submitEventSync, for pool 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7 with exception java.util.concurrent.ExecutionException: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException: java.util.concurrent.ExecutionException: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252) [rt.jar:1.7.0_09-icedtea]
        at java.util.concurrent.FutureTask.get(FutureTask.java:111) [rt.jar:1.7.0_09-icedtea]
        at org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor.submitEventSync(EventQueueMonitor.java:54) [engine-bll.jar:]
        at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source) [:1.7.0_09-icedtea]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_09-icedtea]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_09-icedtea]
        at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ejb3.component.invocationmetrics.ExecutionTimeInterceptor.processInvocation(ExecutionTimeInterceptor.java:43) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ejb3.component.singleton.SingletonComponentInstanceAssociationInterceptor.processInvocation(SingletonComponentInstanceAssociationInterceptor.java:53) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInNoTx(CMTTxInterceptor.java:209) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:361) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]


when AutoRecovery runs we are rolling back on the reconstruct and getting wrong master domain or version in the previously Non-Operational host.

moving back to devel with new logs

Comment 8 Dafna Ron 2013-03-12 17:01:27 UTC

Created attachment 709079 [details]
logs

Comment 9 Liron Aravot 2013-03-17 11:11:14 UTC

Dafna, there are no two reconstruct threads:
2013-03-12 06:28:30,696 INFO  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-3-thread-49) [abd2112] Running command: ReconstructMasterDomainCommand
 internal: true. Entities affected :  ID: b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5 Type: Storage
2013-03-12 06:28:30,711 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-3-thread-49) [abd2112] START, MarkPoolInReconstructModeVD
SCommand( storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFailoverLimit = false, compatabilityVersion = null, reconstructMarkAction = ClearCache), log id: 79c3e8
9a
2013-03-12 06:28:30,714 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-3-thread-49) [abd2112] No string for RECONSTRUCT_MASTER_FAILED typ
e. Use default Log


this is where the first reconstruct ended - later on different reconstruct begins: 

2013-03-12 06:29:40,892 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-3-thread-46) START, MarkPoolInReconstructModeVDSCommand( storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFailoverLimit = false, compatabilityVersion = null, reconstructMarkAction = ClearJobs), log id: 81f8f7


2013-03-12 06:32:40,896 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-3-thread-46) START, MarkPoolInReconstructModeVDSCommand( storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFailoverLimit = false, compatabilityVersion = null, reconstructMarkAction = ClearCache), log id: 7b41bc52
2013-03-12 06:32:40,896 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-3-thread-46) clearing cache for problematic entities in pool 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7

the stacktrace that you pasted is a failure with TimeoutException.

There is NPE in the code, it's seems to be caused by a race when Recovery failed (compensation runs) and a host activation loaded domain 
0789a306-8ad8-4655-8644-960809561139 as the master while in the meanwhile the compensation returned b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5 to be the master -
this issue should have been solved in 3.3 (it's a different bug).

Dafna, as it's not the initial bug - I think that we need to close this one and open a bug for the present issue , then decide if we want it for 3.2.

Comment 10 Ayal Baron 2013-03-17 11:38:33 UTC

(In reply to comment #9)
> Dafna, there are no two reconstruct threads:
> 2013-03-12 06:28:30,696 INFO 
> [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand]
> (pool-3-thread-49) [abd2112] Running command: ReconstructMasterDomainCommand
>  internal: true. Entities affected :  ID:
> b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5 Type: Storage
> 2013-03-12 06:28:30,711 INFO 
> [org.ovirt.engine.core.vdsbroker.irsbroker.
> MarkPoolInReconstructModeVDSCommand] (pool-3-thread-49) [abd2112] START,
> MarkPoolInReconstructModeVD
> SCommand( storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7,
> ignoreFailoverLimit = false, compatabilityVersion = null,
> reconstructMarkAction = ClearCache), log id: 79c3e8
> 9a
> 2013-03-12 06:28:30,714 INFO 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (pool-3-thread-49) [abd2112] No string for RECONSTRUCT_MASTER_FAILED typ
> e. Use default Log
> 
> 
> this is where the first reconstruct ended - later on different reconstruct
> begins: 
> 
> 2013-03-12 06:29:40,892 INFO 
> [org.ovirt.engine.core.vdsbroker.irsbroker.
> MarkPoolInReconstructModeVDSCommand] (pool-3-thread-46) START,
> MarkPoolInReconstructModeVDSCommand( storagePoolId =
> 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFailoverLimit = false,
> compatabilityVersion = null, reconstructMarkAction = ClearJobs), log id:
> 81f8f7
> 
> 
> 2013-03-12 06:32:40,896 INFO 
> [org.ovirt.engine.core.vdsbroker.irsbroker.
> MarkPoolInReconstructModeVDSCommand] (pool-3-thread-46) START,
> MarkPoolInReconstructModeVDSCommand( storagePoolId =
> 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFailoverLimit = false,
> compatabilityVersion = null, reconstructMarkAction = ClearCache), log id:
> 7b41bc52
> 2013-03-12 06:32:40,896 INFO 
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (pool-3-thread-46) clearing cache for problematic entities in pool
> 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7
> 
> the stacktrace that you pasted is a failure with TimeoutException.
> 
> There is NPE in the code, it's seems to be caused by a race when Recovery
> failed (compensation runs) and a host activation loaded domain 
> 0789a306-8ad8-4655-8644-960809561139 as the master while in the meanwhile
> the compensation returned b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5 to be the
> master -
> this issue should have been solved in 3.3 (it's a different bug).
> 
> Dafna, as it's not the initial bug - I think that we need to close this one
> and open a bug for the present issue , then decide if we want it for 3.2.

Ack. please close this and file a new one.

Comment 11 Dafna Ron 2013-03-17 14:31:11 UTC

actually:

we did try to block the 2ed reconstruct but are failing:

2013-03-12 06:26:03,801 ERROR [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (QuartzScheduler_Worker-77) Failed at submitEventSync, for pool 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7 with exception java.util.concurrent.Cancellation
Exception: java.util.concurrent.CancellationException


which will cause:

2013-03-12 06:26:30,127 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-100) Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)


and:

2013-03-12 06:32:45,677 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-3-thread-46) Command org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand return value 
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=324, mMessage=Wrong Master domain or its version: 'SD=0789a306-8ad8-4655-8644-960809561139, pool=5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7']]
2013-03-12 06:32:45,677 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-3-thread-46) HostName = gold-vdsc


since we still have a race here which cause the same exact issue I cannot verify this bug.

Comment 12 Liron Aravot 2013-03-17 14:53:43 UTC

Dafna, I'll comment by each part of your comment - first, the logs in your comment are 3 minutes before you ran recovery and before the logs that you pasted in the previous comment.

1. "we did try to block the 2ed reconstruct but are failing" - the log that you pasted is related to domain monitoring process being rejected, not to reconstruct (full line copied)


2013-03-12 06:26:03,801 INFO  [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (pool-3-thread-49) [57862efb] The following operation Event [eventType=DOMAINMONITORIN
G, storagePoolId=5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, domainId=null, vdsId=83834e1f-9e60-41b5-a9cc-16460a8a2fe2] was cancelled, because of recosntruct was run before
2013-03-12 06:26:03,801 ERROR [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (QuartzScheduler_Worker-77) Failed at submitEventSync, for pool 5e1e9d7a-ba64-48cd-84b
1-a7e3e67829b7 with exception java.util.concurrent.CancellationException: java.util.concurrent.CancellationException
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:250) 

2. second log is a failure in accessing the mater domain, which causes to reconstruct and is fine
kipped.
2013-03-12 06:26:30,127 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-100) Command SpmStatusVDS execution failed. Exception: IRSNoMasterDoma
inException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2013-03-12 06:26:30,127 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-100) hostFromVds::selectedVds - gold-vdsd, spmStatus retur
ned null!
2013-03-12 06:26:30,127 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-100) IrsBroker::Failed::GetStoragePoolInfoVDS
2013-03-12 06:26:30,127 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-100) Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2013-03-12 06:26:30,149 INFO  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-3-thread-49) [2d52ff64] Running command: Reconst

3. this logs are 3 minutes before you ran recovery and before the logs that you pasted in the previous comment


the initial issue of this bug was resolved (reconstructs run simultaneously) - if we have them running simultaneously,  please paste the log, in the logs that were pasted it doesn't occur.

Comment 13 Dafna Ron 2013-03-17 15:43:26 UTC

I ran this scenario several times that day (you have a lot of reconstruct) and I am looking at the correct time since when I opened the bug I looked at it live and posted the correct entries in the log.
since both reconstructs were running simultaneously this issue is not solved...
once again, I am testing the scenario which yields the same results hence I cannot verify the bug.

Comment 14 Ayal Baron 2013-03-20 10:09:50 UTC

...

Comment 17 Liron Aravot 2013-03-25 07:45:04 UTC

Ayal, there are two reinitialize executions in the log - one at 6:29~ and one at 6:00~.
the pasted logs were of the second one (6:29~), so i'll refer to it first and then to the one at 6:00~

1. 6:29~ recovery fails because of timeout when performing disconnect storage pool (this is the one quoted in the pasted logs):
2013-03-12 06:32:40,895 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (pool-3-thread-46) XML RPC error in command DisconnectStoragePoolVDS ( HostName = 
gold-vdsd ), the error was: java.util.concurrent.TimeoutException, null 


Caused by: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:169) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:1591) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand.reconstructMaster(ReconstructMasterDomainCommand.java:110) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand.executeCommand(ReconstructMasterDomainCommand.java:139) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand.executeReconstruct(RecoveryStoragePoolCommand.java:91) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand$1.call(RecoveryStoragePoolCommand.java:128) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand$1.call(RecoveryStoragePoolCommand.java:104) [engine-bll.jar:]




2. 6:00~ first recovery fails because of failure in disconnectStoragePool as spm was started meanwhile:
Caused by: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DisconnectStoragePoolVDS, error = Operation not allowed while SPM is active: ('5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7',)
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:169) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:1591) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand.reconstructMaster(ReconstructMasterDomainCommand.java:110) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand.executeCommand(ReconstructMasterDomainCommand.java:139) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand.executeReconstruct(RecoveryStoragePoolCommand.java:91) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand$1.call(RecoveryStoragePoolCommand.java:128) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand$1.call(RecoveryStoragePoolCommand.java:104) [engine-bll.jar:]

Comment 18 Ayal Baron 2013-03-25 09:12:17 UTC

Liron, there are 2 things here that I do not understand:
1. Can the timeout issue be reproduced consistently? if so, why are we timing out? (failure of the second reconstruct)

2. Why are we trying to start the spm while reconstruct is running? (failure of the first reconstruct)

Comment 19 Ayal Baron 2013-03-28 08:55:29 UTC

After discussing with Liron, it is not clear to us what the end state of the system is and what the repro steps are to reach this state.
Haim, do we have clear reproduction steps for this?
Does the system recover from this state or not?

Comment 20 Haim 2013-03-29 21:02:51 UTC

(In reply to comment #19)
> After discussing with Liron, it is not clear to us what the end state of the
> system is and what the repro steps are to reach this state.
> Haim, do we have clear reproduction steps for this?
> Does the system recover from this state or not?

Yes we have, see comment #1.
No, it doesn't recover (wrong master domain or version).

Comment 23 Allon Mureinik 2013-05-02 09:13:36 UTC

Change-Id I6418b9d2826146e1e4ceff4341c6d7cd3a0024af (referenced in the external tracker) should solve this issue too.

Comment 24 Dafna Ron 2013-05-13 13:00:50 UTC

verified on sf16

Comment 25 Itamar Heim 2013-06-11 08:32:24 UTC

3.2 has been released

Comment 26 Itamar Heim 2013-06-11 08:32:26 UTC

3.2 has been released

Comment 27 Itamar Heim 2013-06-11 08:33:31 UTC

3.2 has been released