Created attachment 1273021 [details]
engine & vdsm logs

Description of problem:
After upgrading a DC + cluster from 4.0 to 4.1 and restarting the engine, Reconstruct Master Domain for the Data Center fails.

Version-Release number of selected component (if applicable):
Engine: 4.1.1.8
VDSM: 4.19.10.1

How reproducible:
Hard to reproduce, roughly 1 in 5 runs or worse.

Steps to Reproduce:
Main automation test flow:
1. Create a DC + cluster on v3
2. Upgrade the cluster from v3 to v4
3. Restart the engine during the cluster upgrade
4. Verify that the cluster is upgraded to v4

Automation TEARDOWN:
5. After checking that the DC is in "active" state, remove the DC

Actual results:
After the engine restart, Reconstruct Master Domain for the Data Center fails.

Expected results:
After the engine restart, Reconstruct Master Domain for the Data Center should succeed.

Additional info:

Timeline from the automation log:

2017-04-19 19:34:11,016 - MainThread - root - INFO - Update cluster cluster_upgrade_4_0_to_4_1 with {'compare': False, 'version': '4.1'}
2017-04-19 19:34:11,305 - MainThread - clusters - INFO - Using Correlation-Id: clusters_update_ada4c825-9dd9-41be
2017-04-19 19:34:11,756 - MainThread - clusters - INFO - cluster was updated
2017-04-19 19:34:29,311 - MainThread - Engine - INFO - [10.35.161.118] GET http://storage-ge-04.scl.lab.tlv.redhat.com/ovirt-engine/services/health
2017-04-19 19:34:29,668 - MainThread - Engine - ERROR - [10.35.161.118] failed to get content of health page: HTTP Error 503: Service Unavailable
2017-04-19 19:34:50,287 - MainThread - Engine - ERROR - [10.35.161.118] failed to get content of health page: HTTP Error 404: Not Found
2017-04-19 19:35:10,313 - MainThread - Engine - DEBUG - Skipping duplicate log-messages...
2017-04-19 19:35:15,887 - MainThread - Engine - INFO - [10.35.161.118] CODE: 200, OK
2017-04-19 19:35:15,888 - MainThread - Engine - INFO - [10.35.161.118] Content: DB Up!Welcome to Health Status!
2017-04-19 19:35:19,397 - MainThread - Engine - INFO - [10.35.161.118] GET http://storage-ge-04.scl.lab.tlv.redhat.com/ovirt-engine/services/health
2017-04-19 19:35:19,523 - MainThread - art.hl_lib.dcs - INFO - Wait for the Data center to become active
2017-04-19 19:35:20,817 - MainThread - datacenters - INFO - Waiting for query `name=dc_upgrade_4_0_to_4_1 and status=up` and event_id None up to 180 seconds, sampling every 10 second.
2017-04-19 19:35:21,077 - MainThread - root - INFO - Waits until SPM gets elected in DataCenter dc_upgrade_4_0_to_4_1 with {'sleep': 5, 'timeout': 300}
2017-04-19 19:35:21,086 - MainThread - root - INFO - Locate and return SPM or HSM host from specific datacenter dc_upgrade_4_0_to_4_1 with {'spm': True}
2017-04-19 19:35:22,159 - MainThread - root - INFO - Checking Storage Pool Manager (SPM) status of the host host_mixed_1.
2017-04-19 19:35:22,166 - MainThread - root - INFO - Get host object by host_name host_mixed_1. with {'attribute': 'name'}
2017-04-19 19:35:22,703 - MainThread - art.ll_lib.hosts - INFO - check_host_spm_status - SPM Status of host host_mixed_1 is: spm
2017-04-19 19:35:23,473 - MainThread - art.hl_lib.dcs - INFO - Waiting up to 120 seconds for sd upgrade_4_0_to_4_1glusterfs0 to be active
2017-04-19 19:35:24,061 - MainThread - root - INFO - Get Cluster cluster_upgrade_4_0_to_4_1 compatibility version
2017-04-19 19:35:24,311 - MainThread - art.logging - INFO - Status: passed PASSED
2017-04-19 19:35:24,313 - MainThread - art.logging - INFO - ================================================================================
2017-04-19 19:35:24,313 - MainThread - art.logging - INFO - TEARDOWN <TestCaseFunction 'test_restart_engine_during_upgrade_cluster'>
2017-04-19 19:35:24,314 - MainThread - pytest.art.matrix - INFO - The storage type reset to None
2017-04-19 19:35:26,708 - MainThread - rhevmtests.storage.helpers - INFO - Checking if domain upgrade_4_0_to_4_1glusterfs0 is active in dc dc_upgrade_4_0_to_4_1
2017-04-19 19:35:27,376 - MainThread - storagedomains - INFO - Domain upgrade_4_0_to_4_1glusterfs0 in dc dc_upgrade_4_0_to_4_1 is unknown
2017-04-19 19:35:27,376 - MainThread - art.ll_lib.dcs - INFO - Remove datacenter dc_upgrade_4_0_to_4_1
2017-04-19 19:35:27,643 - MainThread - datacenters - INFO - Using Correlation-Id: datacenters_delete_e74bc52a-1294-4256
2017-04-19 19:35:28,386 - MainThread - api_utils - ERROR - Failed to delete element NOT as expected: Status: 409 Reason: Conflict Detail: [Cannot remove an active Data Center.]

From Engine:

2017-04-19 19:34:11,580+03 INFO [org.ovirt.engine.core.bll.UpdateClusterCommand] (default task-4) [clusters_update_ada4c825-9dd9-41be] Lock Acquired to object 'EngineLock:{exclusiveLocks='[]', sharedLocks='[]'}'
2017-04-19 19:34:11,695+03 INFO [org.ovirt.engine.core.bll.UpdateClusterCommand] (default task-4) [clusters_update_ada4c825-9dd9-41be] Running command: UpdateClusterCommand internal: false. Entities affected : ID: 15767467-bcab-418c-a314-78ebaf093a75 Type: ClusterAction group EDIT_CLUSTER_CONFIGURATION with role type ADMIN
2017-04-19 19:34:11,725+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-4) [clusters_update_ada4c825-9dd9-41be] EVENT_ID: USER_UPDATE_CLUSTER(811), Correlation ID: clusters_update_ada4c825-9dd9-41be, Call Stack: null, Custom Event ID: -1, Message: Host cluster cluster_upgrade_4_0_to_4_1 was updated by admin@internal-authz
2017-04-19 19:34:11,726+03 INFO [org.ovirt.engine.core.bll.UpdateClusterCommand] (default task-4) [clusters_update_ada4c825-9dd9-41be] Lock freed to object 'EngineLock:{exclusiveLocks='[]', sharedLocks='[]'}'
2017-04-19 19:34:12,265+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (DefaultQuartzScheduler9) [655f8748] FINISH, GetHostJobsVDSCommand, return: {}, log id: 15775af4
2017-04-19 19:34:12,273+03 INFO [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler9) [655f8748] Command AmendVolume id: '36d67e16-2469-4785-b910-46c95adf4566': attempting to determine the job status by polling the entity.
2017-04-19 19:34:12,276+03 ERROR [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler9) [655f8748] Command AmendVolume id: '36d67e16-2469-4785-b910-46c95adf4566': failed to poll the command entity
2017-04-19 19:34:12,276+03 INFO [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler9) [655f8748] Command AmendVolume id: '36d67e16-2469-4785-b910-46c95adf4566': couldn't get the status of job 'b4f15b65-1695-452f-a5fe-82f3c11d7c19' on host 'host_mixed_1' (id: '0ba63f2d-06b2-4392-82ce-7abd60614b2a'), assuming it's still running
2017-04-19 19:34:15,914+03 ERROR [org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap] (DefaultQuartzScheduler2) [655f8748] Exception occured during response decomposition: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) [rt.jar:1.8.0_121]
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) [rt.jar:1.8.0_121]
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) [rt.jar:1.8.0_121]
    at org.ovirt.vdsm.jsonrpc.client.internal.BatchCall.get(BatchCall.java:53) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.internal.Call.get(Call.java:48) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.internal.Call.get(Call.java:19) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.lazyEval(FutureMap.java:112) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.get(FutureMap.java:247) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VMInfoListReturn.<init>(VMInfoListReturn.java:15) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.getAllVmStats(JsonRpcVdsServer.java:369) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand.executeVdsBrokerCommand(GetAllVmStatsVDSCommand.java:22) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:111) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73) [vdsbroker.jar:]
    at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:407) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.monitoring.VmsStatisticsFetcher.poll(VmsStatisticsFetcher.java:29) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.monitoring.VmsListFetcher.fetch(VmsListFetcher.java:56) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:37) [vdsbroker.jar:]
    at sun.reflect.GeneratedMethodAccessor345.invoke(Unknown Source) [:1.8.0_121]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_121]
    at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_121]
    at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
    at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:]
    at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_121]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_121]
    at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_121]
2017-04-19 19:35:40,998+03 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler1) [4dc8be4] IrsBroker::Failed::GetStoragePoolInfoVDS: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=e23a7d84-49b5-47c0-b5ea-0204648d1dd4, msdUUID=a7ba26e7-1ff4-437f-bd5e-690ef13a107d'
2017-04-19 19:35:41,171+03 INFO [org.ovirt.engine.core.bll.storage.pool.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-6-thread-22) [5c58f757] Running command: ReconstructMasterDomainCommand internal: true. Entities affected : ID: a7ba26e7-1ff4-437f-bd5e-690ef13a107d Type: Storage
2017-04-19 19:35:41,258+03 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (org.ovirt.thread.pool-6-thread-22) [5c58f757] START, ResetIrsVDSCommand( ResetIrsVDSCommandParameters:{runAsync='true', storagePoolId='e23a7d84-49b5-47c0-b5ea-0204648d1dd4', ignoreFailoverLimit='false', vdsId='0ba63f2d-06b2-4392-82ce-7abd60614b2a', ignoreStopFailed='true'}), log id: 7148035d
2017-04-19 19:35:41,268+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-6-thread-22) [5c58f757] START, SpmStopVDSCommand(HostName = host_mixed_1, SpmStopVDSCommandParameters:{runAsync='true', hostId='0ba63f2d-06b2-4392-82ce-7abd60614b2a', storagePoolId='e23a7d84-49b5-47c0-b5ea-0204648d1dd4'}), log id: 2c512330
2017-04-19 19:35:41,299+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-6-thread-22) [5c58f757] SpmStopVDSCommand::Stopping SPM on vds 'host_mixed_1', pool id 'e23a7d84-49b5-47c0-b5ea-0204648d1dd4'
2017-04-19 19:35:47,260+03 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler7) [655f8748] Command 'AmendImageGroupVolumes' (id: '59a83751-3637-48fd-844e-c284a95e9e85') waiting on child command id: '36d67e16-2469-4785-b910-46c95adf4566' type:'AmendVolume' to complete
2017-04-19 19:35:47,286+03 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler7) [7f18269b] Command 'AmendImageGroupVolumes' (id: 'f99e15c6-2099-43f3-a77b-528c782d58cb') waiting on child command id: '33d9d5ec-bf29-450b-8026-e102bd9fa859' type:'AmendVolume' to complete
2017-04-19 19:35:47,313+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (DefaultQuartzScheduler7) [7f18269b] START, GetHostJobsVDSCommand(HostName = host_mixed_1, GetHostJobsVDSCommandParameters:{runAsync='true', hostId='0ba63f2d-06b2-4392-82ce-7abd60614b2a', type='storage', jobIds='[a9c7a685-a4c7-4943-a66c-d52f71646020]'}), log id: 7941ea3c
2017-04-19 19:35:47,315+03 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages Broken pipe
2017-04-19 19:35:47,327+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler9) [655f8748] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_1, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='0ba63f2d-06b2-4392-82ce-7abd60614b2a'})' execution failed: VDSGenericException: VDSNetworkException: Broken pipe
2017-04-19 19:35:47,339+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler9) [655f8748] Failed to fetch vms info for host 'host_mixed_1' - skipping VMs monitoring.
2017-04-19 19:35:47,349+03 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-23) [655f8748] Host 'host_mixed_1' is not responding.
2017-04-19 19:35:47,369+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler7) [7f18269b] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command GetHostJobsVDS failed: Broken pipe
2017-04-19 19:35:47,370+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (DefaultQuartzScheduler7) [7f18269b] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand' return value 'HostJobsReturn:{status='Status [code=5022, message=Broken pipe]'}'
2017-04-19 19:35:47,370+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (DefaultQuartzScheduler7) [7f18269b] HostName = host_mixed_1
2017-04-19 19:35:47,374+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (DefaultQuartzScheduler7) [7f18269b] Command 'GetHostJobsVDSCommand(HostName = host_mixed_1, GetHostJobsVDSCommandParameters:{runAsync='true', hostId='0ba63f2d-06b2-4392-82ce-7abd60614b2a', type='storage', jobIds='[a9c7a685-a4c7-4943-a66c-d52f71646020]'})' execution failed: VDSGenericException: VDSNetworkException: Broken pipe
2017-04-19 19:35:47,374+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (DefaultQuartzScheduler7) [7f18269b] FINISH, GetHostJobsVDSCommand, log id: 7941ea3c
2017-04-19 19:35:47,374+03 WARN [org.ovirt.engine.core.bll.StorageJobCallback] (DefaultQuartzScheduler7) [7f18269b] Command AmendVolume id: '33d9d5ec-bf29-450b-8026-e102bd9fa859': Failed to poll the job 'a9c7a685-a4c7-4943-a66c-d52f71646020' on host 'host_mixed_1' (id: '0ba63f2d-06b2-4392-82ce-7abd60614b2a'), will retry soon
2017-04-19 19:35:47,377+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-22) [5c58f757] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command SpmStopVDS failed: Broken pipe
2017-04-19 19:35:47,377+03 WARN [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-6-thread-22) [5c58f757] Could not stop spm of pool 'e23a7d84-49b5-47c0-b5ea-0204648d1dd4' on vds '0ba63f2d-06b2-4392-82ce-7abd60614b2a': VDSGenericException: VDSNetworkException: Broken pipe

VDSM:

2017-04-19 19:35:40,840+0300 ERROR (jsonrpc/3) [storage.Dispatcher] {'status': {'message': "Cannot find master domain: 'spUUID=e23a7d84-49b5-47c0-b5ea-0204648d1dd4, msdUUID=a7ba26e7-1ff4-437f-bd5e-690ef13a107d'", 'code': 304}} (dispatcher:77)
2017-04-19 19:35:40,841+0300 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call StoragePool.connect failed (error 304) in 0.36 seconds (__init__:515)
2017-04-19 19:35:41,167+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 dd if=/rhev/data-center/e23a7d84-49b5-47c0-b5ea-0204648d1dd4/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None) (commands:69)
2017-04-19 19:35:41,182+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] FAILED: <err> = "dd: failed to open '/rhev/data-center/e23a7d84-49b5-47c0-b5ea-0204648d1dd4/mastersd/dom_md/inbox': No such file or directory\n"; <rc> = 1 (commands:93)
2017-04-19 19:35:41,183+0300 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] Error checking for mail (storage_mailbox:788)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 786, in run
    self._checkForMail()
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 735, in _checkForMail
    "Could not read mailbox: %s" % self._inbox)
2017-04-19 19:35:41,369+0300 ERROR (jsonrpc/5) [root] Panic: Unrecoverable errors during SPM stop process. (panic:28)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 436, in stopSpm
    self.masterDomain.releaseClusterLock()
  File "/usr/share/vdsm/storage/sdc.py", line 50, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 53, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 136, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 153, in _findDomain
    return findMethod(sdUUID)
  File "/usr/share/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/glusterSD.py", line 51, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('a7ba26e7-1ff4-437f-bd5e-690ef13a107d',)
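For context on the final frame of the traceback above: judging from the stack (this is an illustrative reading, not a claim about the exact VDSM sources), `GlusterStorageDomain.findDomainPath` scans the gluster mount area for a directory named after the sdUUID and raises `StorageDomainDoesNotExist` when nothing matches, which is what happens here because the domain's mount is gone by the time SPM stop runs. A minimal sketch of that lookup pattern, with hypothetical function and path names:

```python
import glob
import os


class StorageDomainDoesNotExist(Exception):
    """Raised when no mounted path contains the requested domain UUID."""


def find_domain_path(sd_uuid, mount_root="/rhev/data-center/mnt/glusterSD"):
    # Look for <mount_root>/<server:_volume>/<sd_uuid>. If the mount
    # disappeared (e.g. mid-teardown after an engine restart), nothing
    # matches and the lookup fails hard, as seen in the VDSM panic.
    pattern = os.path.join(mount_root, "*", sd_uuid)
    for path in glob.iglob(pattern):
        if os.path.isdir(path):
            return path
    raise StorageDomainDoesNotExist(sd_uuid)
```

With no matching directory under the mount root, the function raises rather than returning a stale path, which is why `releaseClusterLock()` cannot proceed and the SPM stop process panics.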
As discussed with Liron, this bug was opened for deeper investigation. The issue occurred multiple times in different automation tests with the same scenario (upgrade DC + cluster, then engine restart) on different storage domain types (iSCSI, NFS, Gluster).
Note: this is unrelated to the cluster upgrade; the reconstruct happens because of the validation process of the storage domain removal. The actual issue is that reconstruction is triggered during the validation process, when it shouldn't be.
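To illustrate the note above in general terms: the problem pattern is a validation step that kicks off recovery (reconstruct) as a side effect, instead of merely reporting state and leaving recovery to the caller. A hypothetical sketch of the two shapes, with all names invented for illustration (this is not the actual engine code):

```python
class FakePool:
    """Illustrative stand-in for a storage pool whose master may be down."""

    def __init__(self, master_ok):
        self.master_ok = master_ok
        self.reconstruct_calls = 0

    def is_master_reachable(self):
        return self.master_ok

    def reconstruct_master(self):
        self.reconstruct_calls += 1


def validate_removal_with_side_effect(pool):
    # Problematic pattern: validation triggers recovery when the master
    # domain is unreachable, so merely *checking* mutates pool state.
    if not pool.is_master_reachable():
        pool.reconstruct_master()  # side effect inside a validation step
        return False
    return True


def validate_removal_pure(pool):
    # Pure check: only reports state; any recovery decision stays with
    # the caller, so validation can be run safely at any time.
    return pool.is_master_reachable()
```

In the failing teardown, the removal request's validation ran while the master domain was unreachable after the engine restart, and the side-effecting shape meant a reconstruct was launched even though the caller only wanted a yes/no answer.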
This patch is contained in the ovirt-engine-4.1.3.3 tag.
Verified at 4.1.3.4.