Bug 1130067 - [scale] Data Centers periodically became non responsive on large environment: 200 DCs
Summary: [scale] Data Centers periodically became non responsive on large environment:...
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.4.1-1
Hardware: x86_64
OS: Linux
Target Milestone: ovirt-3.6.3
: 3.6.0
Assignee: Nir Soffer
QA Contact: Aharon Canan
Whiteboard: storage
Depends On:
TreeView+ depends on / blocked
Reported: 2014-08-14 09:13 UTC by Yuri Obshansky
Modified: 2016-03-10 06:22 UTC (History)
17 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2015-04-19 15:42:38 UTC
oVirt Team: Storage
Target Upstream Version:

Attachments (Terms of Use)
Screenshot (329.35 KB, image/jpeg)
2014-08-14 09:14 UTC, Yuri Obshansky
no flags Details
vdsm log (662.43 KB, application/x-xz)
2014-08-14 10:23 UTC, Yuri Obshansky
no flags Details
engine log (10.64 MB, application/x-gzip)
2014-08-14 10:28 UTC, Yuri Obshansky
no flags Details

Description Yuri Obshansky 2014-08-14 09:13:43 UTC
Description of problem:
Data Centers periodically became non responsive on large environment which has 200 DCs + Clusters + Fake Hosts + Storage Domains 
Initial setup has 1 DC, 1 Cluster, 2 Hosts, 1 SD and 200 VMs
As result hosts became non responsive too
and data center is not functional for some period.

2014-08-14 03:41:22,222 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] Command HSMGetAllTasksStatusesVDSCommand(HostName = TEST_HOST_182, HostId = fdbbfc91-da72-43f9-8e6c-719be61503c6) execution failed. Exception: IRSNonOperationalException: IRSGenericException: IRSErrorException: IRSNonOperationalException: Not SPM
2014-08-14 03:41:22,273 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] hostFromVds::selectedVds - TEST_HOST_182, spmStatus Free, storage pool TEST_DC_182
2014-08-14 03:41:22,284 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] starting spm on vds TEST_HOST_182, storage pool TEST_DC_182, prevId -1, LVER 208
2014-08-14 03:41:22,284 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] START, SpmStartVDSCommand(HostName = TEST_HOST_182, HostId = fdbbfc91-da72-43f9-8e6c-719be61503c6, storagePoolId = f18c3df6-7d93-4542-8e36-324ba9e27e92, prevId=-1, prevLVER=208, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log id: 25a60b75
2014-08-14 03:41:22,342 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] spmStart polling started: taskId = 87dad494-8562-4eff-86c5-6b8494106fff
2014-08-14 03:41:23,351 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] Failed in HSMGetTaskStatusVDS method
2014-08-14 03:41:23,351 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] spmStart polling ended: taskId = 87dad494-8562-4eff-86c5-6b8494106fff task status = finished
2014-08-14 03:41:23,351 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] Start SPM Task failed - result: cleanSuccess, message: VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = SPM_MailMonitor create failed - inbox '/rhev/data-center/f18c3df6-7d93-4542-8e36-324ba9e27e92/mastersd/dom_md/inbox' does not exist, code = 100
2014-08-14 03:41:23,400 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] spmStart polling ended, spm status: Free
2014-08-14 03:41:23,400 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] START, HSMClearTaskVDSCommand(HostName = TEST_HOST_182, HostId = fdbbfc91-da72-43f9-8e6c-719be61503c6, taskId=87dad494-8562-4eff-86c5-6b8494106fff), log id: 58a825aa
2014-08-14 03:41:23,404 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-45) Domain f35f9a3a-9846-455c-97e9-1fe964c38405:TEST_SD_171 recovered from problem. vds: TEST_HOST_171
2014-08-14 03:41:23,404 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-45) Domain f35f9a3a-9846-455c-97e9-1fe964c38405:TEST_SD_171 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2014-08-14 03:41:23,450 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] FINISH, HSMClearTaskVDSCommand, log id: 58a825aa
2014-08-14 03:41:23,450 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] FINISH, SpmStartVDSCommand, return: org.ovirt.engine.core.common.businessentities.SpmStatusResult@4c04121c, log id: 25a60b75
2014-08-14 03:41:23,450 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-28) [1f411341] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: f18c3df6-7d93-4542-8e36-324ba9e27e92 Type: StoragePool
2014-08-14 03:41:23,476 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-28) [1f411341] Correlation ID: 1f411341, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center TEST_DC_182. Setting status to Non Responsive.
2014-08-14 03:41:23,484 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-28) [1f411341] IrsBroker::Failed::GetStoragePoolInfoVDS due to: IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed

c9957912-7794-465a-8f9d-3f42e59322fb::INFO::2014-08-14 03:03:40,758::clusterlock::279::SANLock::(release) Releasing cluster lock for domain 494d1109-12e5-4080-a513-9403c0935d7c
c9957912-7794-465a-8f9d-3f42e59322fb::DEBUG::2014-08-14 03:03:41,299::clusterlock::289::SANLock::(release) Cluster lock for domain 494d1109-12e5-4080-a513-9403c0935d7c successfully released
c9957912-7794-465a-8f9d-3f42e59322fb::ERROR::2014-08-14 03:03:41,300::task::866::TaskManager.Task::(_setError) Task=`c9957912-7794-465a-8f9d-3f42e59322fb`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 334, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 316, in startSpm
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 544, in __init__
    "not exist" % repr(self._inbox))
RuntimeError: SPM_MailMonitor create failed - inbox '/rhev/data-center/f18c3df6-7d93-4542-8e36-324ba9e27e92/mastersd/dom_md/inbox' does not exist
c9957912-7794-465a-8f9d-3f42e59322fb::DEBUG::2014-08-14 03:03:41,302::task::885::TaskManager.Task::(_run) Task=`c9957912-7794-465a-8f9d-3f42e59322fb`::Task._run: c9957912-7794-465a-8f9d-3f42e59322
fb () {} failed - stopping task

Version-Release number of selected component (if applicable):
RHEVM Version: 3.4.1-0.25.el6ev
OS Version: RHEL - 6Server -
Kernel Version: 2.6.32 - 431.el6.x86_64
KVM Version: - 2.415.el6_5.9
LIBVIRT Version: libvirt-0.10.2-29.el6_5.7
VDSM Version: vdsm-4.14.7-2.el6ev

How reproducible:
Create large environment with more that 180 DCs/Clusters/Hosts/SDs

Steps to Reproduce:

Actual results:
DC and Hosts became non responsive

Expected results:
All is up and running

Additional info:

Comment 1 Yuri Obshansky 2014-08-14 09:14:58 UTC
Created attachment 926719 [details]

Comment 2 Yuri Obshansky 2014-08-14 10:23:23 UTC
Created attachment 926734 [details]
vdsm log

Comment 3 Yuri Obshansky 2014-08-14 10:28:38 UTC
Created attachment 926735 [details]
engine log

Comment 4 Liran Zelkha 2014-08-14 12:47:44 UTC
I don't understand. Please provide more info on the setup. How many hosts, how many VMs per host. Where is the failure - VDSM or engine?

Comment 5 Yuri Obshansky 2014-08-14 12:52:51 UTC
Current setup has:
200 Test Data Centers where each contains Clusters, Fake Host and Storage Domain.
To create that setup I use another Data Center with 2 Hosts and 200 VMs with qemu.
Test Data Centers and Hosts periodically became Non responsive. 
I see errors in both log files engine.log and vdsm.log

Comment 8 Allon Mureinik 2015-04-19 15:42:38 UTC
Closing old bugs, as per Itamar's guidlines.
If you think this bug is worth fixing, please feel free to reopen.

Note You need to log in before you can comment on or make changes to this bug.