Bug 888259
Summary: engine: storage pool recovery fails if reinitialize is initiated during SetStoragePoolStatusCommand
Product: Red Hat Enterprise Virtualization Manager
Reporter: Dafna Ron <dron>
Component: ovirt-engine
Assignee: Liron Aravot <laravot>
Status: CLOSED CURRENTRELEASE
QA Contact: Dafna Ron <dron>
Severity: high
Priority: unspecified
Version: 3.1.0
CC: abaron, amureini, dyasny, hateya, iheim, lpeer, Rhev-m-bugs, sgrinber, yeylon, ykaul
Target Milestone: ---
Target Release: 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version: sf16
Doc Type: Bug Fix
Type: Bug
oVirt Team: Storage
Bug Depends On: 917548
Bug Blocks: 948448
Liron, is this covered by the fix for bug 882837?

Summary:
1. The described scenario isn't that relevant, as simultaneous reconstructs won't occur now; we will have connectStoragePool but no reconstruct failures.
2. We have other issues here that do need to be taken care of: initiation of recovery, and possible races between automatic timers and recovery/reconstruct.

-----------------------------------------------------------------------------

What happened here is as follows: GetStoragePoolInfo runs every few seconds by timer, triggers ConnectStoragePool, and can trigger Reconstruct. Example:

2012-12-16 17:48:28,975 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-4) [3f2d4db] hostFromVds::selectedVds - pink-vds2, spmStatus Unknown_Pool, storage pool TEST-DC
2012-12-16 17:48:29,030 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (QuartzScheduler_Worker-4) [3f2d4db] START, ConnectStoragePoolVDSCommand(HostName = pink-vds2, HostId = 8ed2188e-4795-11e2-b963-001122334455, storagePoolId = 03bf4661-cd9b-4581-a18f-ed7f7be148dd, vds_spm_id = 1, masterDomainId = e589c31b-f266-4a87-a833-fad240da8257,
(QuartzScheduler_Worker-4) [3f2d4db] Command ConnectStoragePoolVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=03bf4661-cd9b-4581-a18f-ed7f7be148dd, msdUUID=e589c31b-f266-4a87-a833-fad240da8257'
2012-12-16 17:48:31,257 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (QuartzScheduler_Worker-4) [3f2d4db] FINISH, ConnectStoragePoolVDSCommand, log id: 5217d889
2012-12-16 17:48:31,257 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-4) [3f2d4db] IrsBroker::Failed::GetStoragePoolInfoVDS
2012-12-16 17:48:31,257 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-4) [3f2d4db] Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=03bf4661-cd9b-4581-a18f-ed7f7be148dd, msdUUID=e589c31b-f266-4a87-a833-fad240da8257'
2012-12-16 17:48:31,461 INFO [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (QuartzScheduler_Worker-4) [275de760] Running command: ReconstructMasterDomainCommand internal: true. Entities affected : ID:

Now the user runs RecoveryStoragePoolCommand:

2012-12-16 17:48:43,989 INFO [org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand] (pool-3-thread-47) [4a41bda1] Running command: RecoveryStoragePoolCommand
2012-12-16 17:48:44,426 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (pool-3-thread-47) [2d5562a8] START, ReconstructMasterVDSComman
2012-12-16 17:49:08,100 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (pool-3-thread-47) [2d5562a8] FINISH, ReconstructMasterVDSCommand, log id: 13d16ea7

After https://bugzilla.redhat.com/show_bug.cgi?id=882837 was merged, reconstruct shouldn't run simultaneously - Dafna, so if RecoveryStoragePool runs, no other reconstruct should run meanwhile or afterwards.
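The invariant discussed above - at most one reconstruct per storage pool at a time, whether timer-triggered or user-initiated - can be sketched with a per-pool lock. This is a minimal illustrative sketch, not the engine's actual implementation; the class and method names (PoolReconstructGate, runReconstruct) are invented for this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: serialize reconstruct operations per storage pool so a
// timer-triggered reconstruct and a user-initiated recovery can never overlap.
public class PoolReconstructGate {
    // One lock object per storage pool id, created lazily.
    private static final Map<String, Object> POOL_LOCKS = new ConcurrentHashMap<>();

    // Runs the given action under the pool's lock; a second caller for the
    // same pool blocks until the first reconstruct finishes.
    public static void runReconstruct(String poolId, Runnable action) {
        Object lock = POOL_LOCKS.computeIfAbsent(poolId, id -> new Object());
        synchronized (lock) {
            action.run();
        }
    }
}
```

In the scenario from the log, the Quartz worker's reconstruct and the user's RecoveryStoragePoolCommand would both call runReconstruct for pool 03bf4661-... and the second would simply wait instead of racing.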
There seems to be another issue here which is not related to the described bug - races between reconstruct/recovery and running timers - which should be resolved. Example: the recovery completed successfully, but we initiate another reconstruct and SPM stop because of what seems to be stale data (the storage pool was loaded with version 1 from the DB, while the reconstruct had already updated it to 2).

2012-12-16 17:49:35,721 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-58) [3fb35ac1] Master domain version is not in sync between DB and VDSM. Domain st marked as master, but the version in DB: 1 and in VDSM: 2
2012-12-16 17:49:35,721 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-58) [3fb35ac1] Master domain version is not in sync between DB and VDSM. Domain st marked as master, but the version in DB: 1 and in VDSM: 2
2012-12-16 17:49:39,596 INFO [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] Running command: ReconstructMasterDomainCommand internal: true.
Entities affected : ID: 2863bd4d-3f70-4ae8-b186-bf58088e28bd Type: Storage
2012-12-16 17:49:41,898 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-81) [42958589] Storage Domain 2863bd4d-3f70-4ae8-b186-bf58088e28bd:st was reported by Host pink-vds2 as Active in Pool 03bf4661-cd9b-4581-a18f-ed7f7be148dd, moving to active status
2012-12-16 17:49:43,739 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] START, ResetIrsVDSCommand( storagePoolId = 03bf4661-cd9b-4581-a18f-ed7f7be148dd, ignoreFailoverLimit = false, compatabilityVersion = null, hostName = pink-vds2.qa.lab.tlv.redhat.com, vdsId = 8ed2188e-4795-11e2-b963-001122334455, ignoreStopFailed = true), log id: 841cc0c
2012-12-16 17:49:43,741 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] START, SpmStopVDSCommand(HostName = pink-vds2, HostId = 8ed2188e-4795-11e2-b963-001122334455, storagePoolId = 03bf4661-cd9b-4581-a18f-ed7f7be148dd), log id: 15407735
2012-12-16 17:49:43,762 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] SpmStopVDSCommand::Stopping SPM on vds pink-vds2, pool id 03bf4661-cd9b-4581-a18f-ed7f7be148dd
2012-12-16 17:49:43,993 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] FINISH, SpmStopVDSCommand, log id: 15407735
2012-12-16 17:49:46,396 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] FINISH, ResetIrsVDSCommand, log id: 841cc0c
2012-12-16 17:49:48,516 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] START, MarkPoolInReconstructModeVDSCommand( storagePoolId = 03bf4661-cd9b-4581-a18f-ed7f7be148dd, ignoreFailoverLimit = false, compatabilityVersion = null, reconstructMarkAction = ClearCache), log id: 25576d88
2012-12-16 17:49:48,516 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] clearing cache for problematic entities in pool 03bf4661-cd9b-4581-a18f-ed7f7be148dd
2012-12-16 17:49:48,516 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (QuartzScheduler_Worker-58) [6cb6cfb0] FINISH,

(In reply to comment #2)
> summary:
> 1.the described scenario isn't that relevant as simultenous reconstruct
> won't occur now, we will have conncetStoragePool but no reconstructs failurs.

Please update the title to reflect the remaining issues.

patch: http://gerrit.ovirt.org/#/c/10764/
related patch with fix to the recovery command: http://gerrit.ovirt.org/#/c/10549/

Should be OK right now - in very rare cases we might have an extra reconstruct when performing recovery. Moving this to MODIFIED.

Tested on sf10. I am still seeing two reconstruct threads:

2013-03-12 06:28:30,696 INFO [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-3-thread-49) [abd2112] Running command: ReconstructMasterDomainCommand internal: true.
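The "Master domain version is not in sync between DB and VDSM" warning quoted above boils down to a comparison of two version numbers: the master version cached in the engine DB against the one VDSM reports after the reconstruct bumped it. A minimal sketch of that staleness check (illustrative names, not the engine's actual code):

```java
// Hypothetical sketch of the master-version staleness check behind the
// "version in DB: 1 and in VDSM: 2" warning: a mismatch means the engine's
// cached pool data is stale and must be resynced (or a reconstruct retried).
public class MasterVersionCheck {
    // Returns true when the DB copy of the master version is out of sync
    // with what VDSM reports for the pool's master domain.
    public static boolean versionOutOfSync(int dbMasterVersion, int vdsmMasterVersion) {
        return dbMasterVersion != vdsmMasterVersion;
    }
}
```

In the log above, versionOutOfSync(1, 2) holds, so the engine concludes its in-memory pool (loaded before the recovery finished) is stale and triggers the extra reconstruct and SPM stop.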
Entities affected : ID: b37a4aef-09dd-4a4e-ae
2013-03-12 06:29:40,892 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-3-thread-46) START, MarkPoolInReconstructModeVDSCommand( storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFai

We get an error from vdsm failing to verify MD:

StorageDomainMasterError: Error validating master storage domain: ('MD read error',)

Engine side:

2013-03-12 06:30:31,165 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-22) [4ba73fa1] Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)

System reconstruct is failing with NPE:

2013-03-12 06:32:40,897 ERROR [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (pool-3-thread-50) [d6f3ed3] Exception during process of events for pool 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, error is org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException
2013-03-12 06:32:40,896 ERROR [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (pool-3-thread-49) [d6f3ed3] Failed at submitEventSync, for pool 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7 with exception java.util.concurrent.ExecutionException: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException:
java.util.concurrent.ExecutionException: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252) [rt.jar:1.7.0_09-icedtea]
at java.util.concurrent.FutureTask.get(FutureTask.java:111) [rt.jar:1.7.0_09-icedtea]
at org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor.submitEventSync(EventQueueMonitor.java:54) [engine-bll.jar:]
at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source) [:1.7.0_09-icedtea]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_09-icedtea]
at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_09-icedtea]
at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation.jar:1.1.1.Final-redhat-2]
at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
at org.jboss.as.ejb3.component.invocationmetrics.ExecutionTimeInterceptor.processInvocation(ExecutionTimeInterceptor.java:43) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21) [jboss-invocation.jar:1.1.1.Final-redhat-2]
at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
at org.jboss.as.ejb3.component.singleton.SingletonComponentInstanceAssociationInterceptor.processInvocation(SingletonComponentInstanceAssociationInterceptor.java:53) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInNoTx(CMTTxInterceptor.java:209) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:361) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]

When AutoRecovery runs, we are rolling back on the reconstruct and getting a wrong master domain or version on the previously Non-Operational host.

Moving back to devel with new logs.

Created attachment 709079 [details]
logs
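The EventQueueMonitor failures above involve the engine's per-pool event queue, where queued domain-monitoring events are dropped once a reconstruct runs (the cancellation also shows up later in this bug as "was cancelled, because of recosntruct was run before"). A small sketch of that idea - monitoring events queued before a reconstruct are cancelled rather than processed, since the reconstruct invalidates the data they were based on. This mirrors the concept only; class and method names are invented, not the engine's EventQueue code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a per-pool event queue in which a RECONSTRUCT event
// supersedes (cancels) any DOMAINMONITORING events still waiting in the queue.
public class PoolEventQueue {
    public enum EventType { DOMAINMONITORING, RECONSTRUCT }

    public static final class Event {
        final EventType type;
        boolean cancelled;
        Event(EventType type) { this.type = type; }
        public boolean isCancelled() { return cancelled; }
    }

    private final Deque<Event> pending = new ArrayDeque<>();

    // Queues an event; a reconstruct cancels and evicts stale monitoring events.
    public Event submit(EventType type) {
        Event e = new Event(type);
        if (type == EventType.RECONSTRUCT) {
            for (Event p : pending) {
                if (p.type == EventType.DOMAINMONITORING) {
                    p.cancelled = true; // caller observes a CancellationException-like outcome
                }
            }
            pending.removeIf(p -> p.cancelled);
        }
        pending.add(e);
        return e;
    }
}
```

A caller synchronously waiting on a cancelled monitoring event would see it fail, which is the benign counterpart of the CancellationException in the logs.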
Dafna, there are no two reconstruct threads:

2013-03-12 06:28:30,696 INFO [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-3-thread-49) [abd2112] Running command: ReconstructMasterDomainCommand internal: true. Entities affected : ID: b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5 Type: Storage
2013-03-12 06:28:30,711 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-3-thread-49) [abd2112] START, MarkPoolInReconstructModeVDSCommand( storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFailoverLimit = false, compatabilityVersion = null, reconstructMarkAction = ClearCache), log id: 79c3e89a
2013-03-12 06:28:30,714 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-3-thread-49) [abd2112] No string for RECONSTRUCT_MASTER_FAILED type. Use default Log

This is where the first reconstruct ended - later on, a different reconstruct begins:

2013-03-12 06:29:40,892 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-3-thread-46) START, MarkPoolInReconstructModeVDSCommand( storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFailoverLimit = false, compatabilityVersion = null, reconstructMarkAction = ClearJobs), log id: 81f8f7
2013-03-12 06:32:40,896 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand] (pool-3-thread-46) START, MarkPoolInReconstructModeVDSCommand( storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFailoverLimit = false, compatabilityVersion = null, reconstructMarkAction = ClearCache), log id: 7b41bc52
2013-03-12 06:32:40,896 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-3-thread-46) clearing cache for problematic entities in pool 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7

The stack trace that you pasted is a failure with TimeoutException.

There is an NPE in the code; it seems to be caused by a race where Recovery failed (compensation runs) and a host activation loaded domain 0789a306-8ad8-4655-8644-960809561139 as the master, while in the meanwhile the compensation returned b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5 to be the master - this issue should have been solved in 3.3 (it's a different bug).

Dafna, as it's not the initial bug - I think that we need to close this one and open a bug for the present issue, then decide if we want it for 3.2.

(In reply to comment #9)
> Dafna, there are no two reconstruct threads:
> 2013-03-12 06:28:30,696 INFO
> [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand]
> (pool-3-thread-49) [abd2112] Running command: ReconstructMasterDomainCommand
> internal: true. Entities affected : ID:
> b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5 Type: Storage
> 2013-03-12 06:28:30,711 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand]
> (pool-3-thread-49) [abd2112] START, MarkPoolInReconstructModeVDSCommand(
> storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7,
> ignoreFailoverLimit = false, compatabilityVersion = null,
> reconstructMarkAction = ClearCache), log id: 79c3e89a
> 2013-03-12 06:28:30,714 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (pool-3-thread-49) [abd2112] No string for RECONSTRUCT_MASTER_FAILED type.
> Use default Log
>
> this is where the first reconstruct ended - later on different reconstruct
> begins:
>
> 2013-03-12 06:29:40,892 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand]
> (pool-3-thread-46) START, MarkPoolInReconstructModeVDSCommand(
> storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7,
> ignoreFailoverLimit = false, compatabilityVersion = null,
> reconstructMarkAction = ClearJobs), log id: 81f8f7
> 2013-03-12 06:32:40,896 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.MarkPoolInReconstructModeVDSCommand]
> (pool-3-thread-46) START, MarkPoolInReconstructModeVDSCommand(
> storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7,
> ignoreFailoverLimit = false, compatabilityVersion = null,
> reconstructMarkAction = ClearCache), log id: 7b41bc52
> 2013-03-12 06:32:40,896 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (pool-3-thread-46) clearing cache for problematic entities in pool
> 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7
>
> the stacktrace that you pasted is a failure with TimeoutException.
>
> There is NPE in the code, it's seems to be caused by a race when Recovery
> failed (compensation runs) and a host activation loaded domain
> 0789a306-8ad8-4655-8644-960809561139 as the master while in the meanwhile
> the compensation returned b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5 to be the
> master -
> this issue should have been solved in 3.3 (it's a different bug).
>
> Dafna, as it's not the initial bug - I think that we need to close this one
> and open a bug for the present issue , then decide if we want it for 3.2.

Ack. Please close this and file a new one.
Actually, we did try to block the 2nd reconstruct but are failing:

2013-03-12 06:26:03,801 ERROR [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (QuartzScheduler_Worker-77) Failed at submitEventSync, for pool 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7 with exception java.util.concurrent.CancellationException: java.util.concurrent.CancellationException

which will cause:

2013-03-12 06:26:30,127 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-100) Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)

and:

2013-03-12 06:32:45,677 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-3-thread-46) Command org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand return value StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=324, mMessage=Wrong Master domain or its version: 'SD=0789a306-8ad8-4655-8644-960809561139, pool=5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7']]
2013-03-12 06:32:45,677 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-3-thread-46) HostName = gold-vdsc

Since we still have a race here which causes the same exact issue, I cannot verify this bug.

Dafna, I'll comment on each part of your comment - first, the logs in your comment are 3 minutes before you ran recovery and before the logs that you pasted in the previous comment.

1. "we did try to block the 2nd reconstruct but are failing" - the log that you pasted is related to the domain monitoring process being rejected, not to reconstruct (full line copied):

2013-03-12 06:26:03,801 INFO [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (pool-3-thread-49) [57862efb] The following operation Event [eventType=DOMAINMONITORING, storagePoolId=5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, domainId=null, vdsId=83834e1f-9e60-41b5-a9cc-16460a8a2fe2] was cancelled, because of recosntruct was run before
2013-03-12 06:26:03,801 ERROR [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (QuartzScheduler_Worker-77) Failed at submitEventSync, for pool 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7 with exception java.util.concurrent.CancellationException: java.util.concurrent.CancellationException
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:250)

2. The second log is a failure in accessing the master domain, which causes a reconstruct and is fine:

2013-03-12 06:26:30,127 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-100) Command SpmStatusVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2013-03-12 06:26:30,127 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-100) hostFromVds::selectedVds - gold-vdsd, spmStatus returned null!
2013-03-12 06:26:30,127 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-100) IrsBroker::Failed::GetStoragePoolInfoVDS
2013-03-12 06:26:30,127 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-100) Exception: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2013-03-12 06:26:30,149 INFO [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-3-thread-49) [2d52ff64] Running command: Reconst

3. These logs are 3 minutes before you ran recovery and before the logs that you pasted in the previous comment.

The initial issue of this bug - reconstructs running simultaneously - was resolved; if we have them running simultaneously, please paste the log. In the logs that were pasted it doesn't occur.

I ran this scenario several times that day (you have a lot of reconstructs) and I am looking at the correct time, since when I opened the bug I looked at it live and posted the correct entries in the log. Since both reconstructs were running simultaneously, this issue is not solved... once again, I am testing the scenario which yields the same results, hence I cannot verify the bug.

...

Ayal, there are two reinitialize executions in the log - one at 6:29~ and one at 6:00~. The pasted logs were of the second one (6:29~), so I'll refer to it first and then to the one at 6:00~.

1.
6:29~ recovery fails because of a timeout when performing disconnect storage pool (this is the one quoted in the pasted logs):

2013-03-12 06:32:40,895 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (pool-3-thread-46) XML RPC error in command DisconnectStoragePoolVDS ( HostName = gold-vdsd ), the error was: java.util.concurrent.TimeoutException, null
Caused by: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException
at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:169) [engine-bll.jar:]
at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:1591) [engine-bll.jar:]
at org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand.reconstructMaster(ReconstructMasterDomainCommand.java:110) [engine-bll.jar:]
at org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand.executeCommand(ReconstructMasterDomainCommand.java:139) [engine-bll.jar:]
at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand.executeReconstruct(RecoveryStoragePoolCommand.java:91) [engine-bll.jar:]
at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand$1.call(RecoveryStoragePoolCommand.java:128) [engine-bll.jar:]
at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand$1.call(RecoveryStoragePoolCommand.java:104) [engine-bll.jar:]

2. 6:00~ the first recovery fails because of a failure in disconnectStoragePool, as the SPM was started in the meanwhile:

Caused by: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DisconnectStoragePoolVDS, error = Operation not allowed while SPM is active: ('5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7',)
at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:169) [engine-bll.jar:]
at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:1591) [engine-bll.jar:]
at org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand.reconstructMaster(ReconstructMasterDomainCommand.java:110) [engine-bll.jar:]
at org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand.executeCommand(ReconstructMasterDomainCommand.java:139) [engine-bll.jar:]
at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand.executeReconstruct(RecoveryStoragePoolCommand.java:91) [engine-bll.jar:]
at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand$1.call(RecoveryStoragePoolCommand.java:128) [engine-bll.jar:]
at org.ovirt.engine.core.bll.storage.RecoveryStoragePoolCommand$1.call(RecoveryStoragePoolCommand.java:104) [engine-bll.jar:]

Liron, there are two things here that I do not understand:
1. Can the timeout issue be reproduced consistently? If so, why are we timing out? (failure of the second reconstruct)
2. Why are we trying to start the SPM while reconstruct is running? (failure of the first reconstruct)

After discussing with Liron, it is not clear to us what the end state of the system is and what the repro steps are to reach this state.
Haim, do we have clear reproduction steps for this?
Does the system recover from this state or not?
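The TimeoutException in item 1 comes from bounding a blocking VDS call. A minimal sketch of that pattern - wrap the call in a Future and wait with a deadline, surfacing TimeoutException to the caller - is below. This is an illustrative stand-in (BoundedVdsCall and its timeout are invented), not the engine's actual VdsBrokerCommand code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: run a VDS call (stand-in Callable) with a hard deadline,
// so a hung host surfaces as TimeoutException instead of blocking forever -
// the same exception class seen in the DisconnectStoragePoolVDS failure above.
public class BoundedVdsCall {
    public static <T> T call(Callable<T> vdsCall, long timeoutMs) throws TimeoutException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<T> f = pool.submit(vdsCall);
            return f.get(timeoutMs, TimeUnit.MILLISECONDS); // throws TimeoutException on deadline
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow(); // interrupt the still-running call, if any
        }
    }
}
```

In the engine's case the reconstruct then fails with this exception wrapped in a VDSNetworkException, which is what Ayal's question 1 is probing: whether the host genuinely hangs or the deadline is simply too short.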
(In reply to comment #19)
> After discussing with Liron, it is not clear to us what the end state of the
> system is and what the repro steps are to reach this state.
> Haim, do we have clear reproduction steps for this?
> Does the system recover from this state or not?

Yes, we have - see comment #1. No, it doesn't recover (wrong master domain or version).

Change-Id I6418b9d2826146e1e4ceff4341c6d7cd3a0024af (referenced in the external tracker) should solve this issue too.

Verified on sf16.

3.2 has been released.
Created attachment 665465 [details]
logs

Description of problem:
During reinitialize DC, which sends reconstruct master to vdsm, the engine sends SetStoragePoolStatusCommand, which sends connectStoragePool to vdsm - the reconstruct fails since the pool is connected.

Version-Release number of selected component (if applicable):
si25.1

How reproducible:
race

Steps to Reproduce:
1. with nfs storage, block connectivity to the master domain
2. reinitialize storage during SetStoragePoolStatusCommand

Actual results:
SetStoragePoolStatusCommand sends connectStoragePool during reconstruct. We fail with a wrong master domain error.

Expected results:
We should not send connectStoragePool during reconstruct.

Additional info:
logs
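The expected behavior above - don't send connectStoragePool while a reconstruct is in flight - amounts to a per-pool guard that SetStoragePoolStatus would consult before connecting. The sketch below is purely illustrative (the flag registry and names are invented, not the engine's real mechanism, which in practice also needs to be race-free with the reconstruct itself):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the expected fix: track pools with a reconstruct in
// flight, and have the status-change path skip connectStoragePool for them.
public class ReconstructGuard {
    private static final Set<String> POOLS_IN_RECONSTRUCT = ConcurrentHashMap.newKeySet();

    public static void markReconstructStarted(String poolId) {
        POOLS_IN_RECONSTRUCT.add(poolId);
    }

    public static void markReconstructFinished(String poolId) {
        POOLS_IN_RECONSTRUCT.remove(poolId);
    }

    // Returns false (skip the connect) while a reconstruct is running,
    // avoiding the "wrong master domain" failure from the bug description.
    public static boolean mayConnectStoragePool(String poolId) {
        return !POOLS_IN_RECONSTRUCT.contains(poolId);
    }
}
```

With such a guard, the reinitialize flow would mark the pool before sending reconstruct master to vdsm, and SetStoragePoolStatus would hold off on connectStoragePool until the mark is cleared.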