Description of problem: Data Centers periodically became non responsive on large environment which has 200 DCs + Clusters + Fake Hosts + Storage Domains Initial setup has 1 DC, 1 Cluster, 2 Hosts, 1 SD and 200 VMs As result hosts became non responsive too and data center is not functional for some period. engine.log 2014-08-14 03:41:22,222 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] Command HSMGetAllTasksStatusesVDSCommand(HostName = TEST_HOST_182, HostId = fdbbfc91-da72-43f9-8e6c-719be61503c6) execution failed. Exception: IRSNonOperationalException: IRSGenericException: IRSErrorException: IRSNonOperationalException: Not SPM 2014-08-14 03:41:22,273 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] hostFromVds::selectedVds - TEST_HOST_182, spmStatus Free, storage pool TEST_DC_182 2014-08-14 03:41:22,284 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] starting spm on vds TEST_HOST_182, storage pool TEST_DC_182, prevId -1, LVER 208 2014-08-14 03:41:22,284 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] START, SpmStartVDSCommand(HostName = TEST_HOST_182, HostId = fdbbfc91-da72-43f9-8e6c-719be61503c6, storagePoolId = f18c3df6-7d93-4542-8e36-324ba9e27e92, prevId=-1, prevLVER=208, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log id: 25a60b75 2014-08-14 03:41:22,342 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] spmStart polling started: taskId = 87dad494-8562-4eff-86c5-6b8494106fff 2014-08-14 03:41:23,351 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] Failed in HSMGetTaskStatusVDS method 2014-08-14 03:41:23,351 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] spmStart polling ended: taskId = 87dad494-8562-4eff-86c5-6b8494106fff task status = finished 2014-08-14 03:41:23,351 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] Start SPM Task failed - result: cleanSuccess, message: VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = SPM_MailMonitor create failed - inbox '/rhev/data-center/f18c3df6-7d93-4542-8e36-324ba9e27e92/mastersd/dom_md/inbox' does not exist, code = 100 2014-08-14 03:41:23,400 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] spmStart polling ended, spm status: Free 2014-08-14 03:41:23,400 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] START, HSMClearTaskVDSCommand(HostName = TEST_HOST_182, HostId = fdbbfc91-da72-43f9-8e6c-719be61503c6, taskId=87dad494-8562-4eff-86c5-6b8494106fff), log id: 58a825aa 2014-08-14 03:41:23,404 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-45) Domain f35f9a3a-9846-455c-97e9-1fe964c38405:TEST_SD_171 recovered from problem. vds: TEST_HOST_171 2014-08-14 03:41:23,404 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-45) Domain f35f9a3a-9846-455c-97e9-1fe964c38405:TEST_SD_171 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer. 2014-08-14 03:41:23,450 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] FINISH, HSMClearTaskVDSCommand, log id: 58a825aa 2014-08-14 03:41:23,450 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] FINISH, SpmStartVDSCommand, return: org.ovirt.engine.core.common.businessentities.SpmStatusResult@4c04121c, log id: 25a60b75 2014-08-14 03:41:23,450 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-28) [1f411341] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: f18c3df6-7d93-4542-8e36-324ba9e27e92 Type: StoragePool 2014-08-14 03:41:23,476 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-28) [1f411341] Correlation ID: 1f411341, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center TEST_DC_182. Setting status to Non Responsive. 2014-08-14 03:41:23,484 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-28) [1f411341] IrsBroker::Failed::GetStoragePoolInfoVDS due to: IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed vdsm.log c9957912-7794-465a-8f9d-3f42e59322fb::INFO::2014-08-14 03:03:40,758::clusterlock::279::SANLock::(release) Releasing cluster lock for domain 494d1109-12e5-4080-a513-9403c0935d7c c9957912-7794-465a-8f9d-3f42e59322fb::DEBUG::2014-08-14 03:03:41,299::clusterlock::289::SANLock::(release) Cluster lock for domain 494d1109-12e5-4080-a513-9403c0935d7c successfully released c9957912-7794-465a-8f9d-3f42e59322fb::ERROR::2014-08-14 03:03:41,300::task::866::TaskManager.Task::(_setError) Task=`c9957912-7794-465a-8f9d-3f42e59322fb`::Unexpected error Traceback (most recent call last): File "/usr/share/vdsm/storage/task.py", line 873, in _run return fn(*args, **kargs) File "/usr/share/vdsm/storage/task.py", line 334, in run return self.cmd(*self.argslist, **self.argsdict) File "/usr/share/vdsm/storage/sp.py", line 316, in startSpm maxHostID) File "/usr/share/vdsm/storage/storage_mailbox.py", line 544, in __init__ "not exist" % repr(self._inbox)) RuntimeError: SPM_MailMonitor create failed - inbox '/rhev/data-center/f18c3df6-7d93-4542-8e36-324ba9e27e92/mastersd/dom_md/inbox' does not exist c9957912-7794-465a-8f9d-3f42e59322fb::DEBUG::2014-08-14 03:03:41,302::task::885::TaskManager.Task::(_run) Task=`c9957912-7794-465a-8f9d-3f42e59322fb`::Task._run: c9957912-7794-465a-8f9d-3f42e59322 fb () {} failed - stopping task Version-Release number of selected component (if applicable): RHEVM Version: 3.4.1-0.25.el6ev OS Version: RHEL - 6Server - 6.5.0.1.el6 Kernel Version: 2.6.32 - 431.el6.x86_64 KVM Version: 0.12.1.2 - 2.415.el6_5.9 LIBVIRT Version: libvirt-0.10.2-29.el6_5.7 VDSM Version: vdsm-4.14.7-2.el6ev How reproducible: Create large environment with more that 180 DCs/Clusters/Hosts/SDs Steps to Reproduce: 1. 2. 3. Actual results: DC and Hosts became non responsive Expected results: All is up and running Additional info:
Created attachment 926719 [details] Screenshot
Created attachment 926734 [details] vdsm log
Created attachment 926735 [details] engine log
I don't understand. Please provide more info on the setup. How many hosts, how many VMs per host. Where is the failure - VDSM or engine?
Current setup has: 200 Test Data Centers where each contains Clusters, Fake Host and Storage Domain. To create that setup I use another Data Center with 2 Hosts and 200 VMs with qemu. Test Data Centers and Hosts periodically became Non responsive. I see errors in both log files engine.log and vdsm.log
Closing old bugs, as per Itamar's guidlines. If you think this bug is worth fixing, please feel free to reopen.