Bug 1130067

Summary: [scale] Data Centers periodically become non-responsive in a large environment: 200 DCs
Product: Red Hat Enterprise Virtualization Manager
Reporter: Yuri Obshansky <yobshans>
Component: vdsm
Assignee: Nir Soffer <nsoffer>
Status: CLOSED WONTFIX
QA Contact: Aharon Canan <acanan>
Severity: high
Priority: medium
Version: 3.4.1-1
CC: amureini, bazulay, cshao, ecohen, gklein, gouyang, hadong, huiwa, iheim, leiwang, lpeer, lzelkha, scohen, yaniwang, ycui, yeylon, yobshans
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: storage
Doc Type: Bug Fix
Story Points: ---
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Storage
Cloudforms Team: ---
Last Closed: 2015-04-19 15:42:38 UTC
Attachments:
  Screenshot (no flags)
  vdsm log (no flags)
  engine log (no flags)

Description Yuri Obshansky 2014-08-14 09:13:43 UTC
Description of problem:
Data Centers periodically become non-responsive in a large environment containing 200 DCs plus matching Clusters, Fake Hosts and Storage Domains.
The initial setup had 1 DC, 1 Cluster, 2 Hosts, 1 SD and 200 VMs.
As a result, the hosts become non-responsive as well, and the affected Data Center is not functional for some period of time.

engine.log 
2014-08-14 03:41:22,222 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] Command HSMGetAllTasksStatusesVDSCommand(HostName = TEST_HOST_182, HostId = fdbbfc91-da72-43f9-8e6c-719be61503c6) execution failed. Exception: IRSNonOperationalException: IRSGenericException: IRSErrorException: IRSNonOperationalException: Not SPM
2014-08-14 03:41:22,273 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] hostFromVds::selectedVds - TEST_HOST_182, spmStatus Free, storage pool TEST_DC_182
2014-08-14 03:41:22,284 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] starting spm on vds TEST_HOST_182, storage pool TEST_DC_182, prevId -1, LVER 208
2014-08-14 03:41:22,284 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] START, SpmStartVDSCommand(HostName = TEST_HOST_182, HostId = fdbbfc91-da72-43f9-8e6c-719be61503c6, storagePoolId = f18c3df6-7d93-4542-8e36-324ba9e27e92, prevId=-1, prevLVER=208, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log id: 25a60b75
2014-08-14 03:41:22,342 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] spmStart polling started: taskId = 87dad494-8562-4eff-86c5-6b8494106fff
2014-08-14 03:41:23,351 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] Failed in HSMGetTaskStatusVDS method
2014-08-14 03:41:23,351 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] spmStart polling ended: taskId = 87dad494-8562-4eff-86c5-6b8494106fff task status = finished
2014-08-14 03:41:23,351 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] Start SPM Task failed - result: cleanSuccess, message: VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = SPM_MailMonitor create failed - inbox '/rhev/data-center/f18c3df6-7d93-4542-8e36-324ba9e27e92/mastersd/dom_md/inbox' does not exist, code = 100
2014-08-14 03:41:23,400 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] spmStart polling ended, spm status: Free
2014-08-14 03:41:23,400 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] START, HSMClearTaskVDSCommand(HostName = TEST_HOST_182, HostId = fdbbfc91-da72-43f9-8e6c-719be61503c6, taskId=87dad494-8562-4eff-86c5-6b8494106fff), log id: 58a825aa
2014-08-14 03:41:23,404 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-45) Domain f35f9a3a-9846-455c-97e9-1fe964c38405:TEST_SD_171 recovered from problem. vds: TEST_HOST_171
2014-08-14 03:41:23,404 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-45) Domain f35f9a3a-9846-455c-97e9-1fe964c38405:TEST_SD_171 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2014-08-14 03:41:23,450 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] FINISH, HSMClearTaskVDSCommand, log id: 58a825aa
2014-08-14 03:41:23,450 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [5881dde1] FINISH, SpmStartVDSCommand, return: org.ovirt.engine.core.common.businessentities.SpmStatusResult@4c04121c, log id: 25a60b75
2014-08-14 03:41:23,450 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-28) [1f411341] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: f18c3df6-7d93-4542-8e36-324ba9e27e92 Type: StoragePool
2014-08-14 03:41:23,476 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-28) [1f411341] Correlation ID: 1f411341, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center TEST_DC_182. Setting status to Non Responsive.
2014-08-14 03:41:23,484 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-28) [1f411341] IrsBroker::Failed::GetStoragePoolInfoVDS due to: IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed
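
The path in the SpmStart failure above comes straight from the log: the candidate SPM host cannot find the mailbox file under the master storage domain of pool f18c3df6-7d93-4542-8e36-324ba9e27e92. A minimal diagnostic sketch (not part of the product; the helper name is hypothetical) for checking those files on a host could look like this:

import os

POOL_ID = "f18c3df6-7d93-4542-8e36-324ba9e27e92"   # pool UUID from the log above
MASTER_MD = os.path.join("/rhev/data-center", POOL_ID, "mastersd", "dom_md")

def check_mailbox_files(md_dir=MASTER_MD):
    # Report whether the SPM mailbox files that the SpmStart flow expects exist.
    for name in ("inbox", "outbox"):
        path = os.path.join(md_dir, name)
        if os.path.exists(path):
            print("%s: present, %d bytes" % (path, os.path.getsize(path)))
        else:
            print("%s: MISSING - SpmStart will fail as in the log above" % path)

if __name__ == "__main__":
    check_mailbox_files()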

vdsm.log
c9957912-7794-465a-8f9d-3f42e59322fb::INFO::2014-08-14 03:03:40,758::clusterlock::279::SANLock::(release) Releasing cluster lock for domain 494d1109-12e5-4080-a513-9403c0935d7c
c9957912-7794-465a-8f9d-3f42e59322fb::DEBUG::2014-08-14 03:03:41,299::clusterlock::289::SANLock::(release) Cluster lock for domain 494d1109-12e5-4080-a513-9403c0935d7c successfully released
c9957912-7794-465a-8f9d-3f42e59322fb::ERROR::2014-08-14 03:03:41,300::task::866::TaskManager.Task::(_setError) Task=`c9957912-7794-465a-8f9d-3f42e59322fb`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 334, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 316, in startSpm
    maxHostID)
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 544, in __init__
    "not exist" % repr(self._inbox))
RuntimeError: SPM_MailMonitor create failed - inbox '/rhev/data-center/f18c3df6-7d93-4542-8e36-324ba9e27e92/mastersd/dom_md/inbox' does not exist
c9957912-7794-465a-8f9d-3f42e59322fb::DEBUG::2014-08-14 03:03:41,302::task::885::TaskManager.Task::(_run) Task=`c9957912-7794-465a-8f9d-3f42e59322fb`::Task._run: c9957912-7794-465a-8f9d-3f42e59322fb () {} failed - stopping task
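
The traceback shows SPM_MailMonitor.__init__ in storage_mailbox.py refusing to start because the inbox file is missing. A rough reconstruction of that check, inferred only from the traceback above (the actual VDSM source differs in detail), looks like this:

import os

class SPM_MailMonitor(object):
    # Sketch inferred from the traceback, not the verbatim VDSM code.
    def __init__(self, pool_id, max_host_id):
        self._inbox = "/rhev/data-center/%s/mastersd/dom_md/inbox" % pool_id
        if not os.path.exists(self._inbox):
            # Raised during startSpm(); the engine then rolls the SPM start
            # back and the Data Center stays Non Responsive.
            raise RuntimeError("SPM_MailMonitor create failed - inbox %s does "
                               "not exist" % repr(self._inbox))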

Version-Release number of selected component (if applicable):
RHEVM Version: 3.4.1-0.25.el6ev
OS Version: RHEL - 6Server - 6.5.0.1.el6
Kernel Version: 2.6.32 - 431.el6.x86_64
KVM Version: 0.12.1.2 - 2.415.el6_5.9
LIBVIRT Version: libvirt-0.10.2-29.el6_5.7
VDSM Version: vdsm-4.14.7-2.el6ev

How reproducible:
Create a large environment with more than 180 DCs/Clusters/Hosts/SDs.
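
One possible way to script such a setup against the engine's REST API is sketched below. The endpoint paths and XML payloads follow the oVirt 3.x API but are assumptions to be verified against the installed engine; the URL, credentials, and CPU type are placeholders, and the fake-host and storage-domain steps are omitted.

import requests

ENGINE = "https://rhevm.example.com/api"      # placeholder engine URL
AUTH = ("admin@internal", "password")         # placeholder credentials
HEADERS = {"Content-Type": "application/xml"}

def create_dc_and_cluster(session, index):
    # Create one test Data Center and a Cluster attached to it.
    dc_name = "TEST_DC_%d" % index
    dc_xml = "<data_center><name>%s</name><local>false</local></data_center>" % dc_name
    session.post(ENGINE + "/datacenters", data=dc_xml,
                 headers=HEADERS, verify=False).raise_for_status()

    cl_xml = ("<cluster><name>TEST_CL_%d</name>"
              "<data_center><name>%s</name></data_center>"
              "<cpu id='Intel Conroe Family'/></cluster>" % (index, dc_name))
    session.post(ENGINE + "/clusters", data=cl_xml,
                 headers=HEADERS, verify=False).raise_for_status()

with requests.Session() as s:
    s.auth = AUTH
    for i in range(1, 201):
        create_dc_and_cluster(s, i)
        # Fake hosts and storage domains would be added per DC in the same
        # fashion (omitted; they depend on the fake-host tooling used by QE).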

Steps to Reproduce:
1.
2.
3.

Actual results:
DCs and hosts become non-responsive.

Expected results:
All DCs and hosts remain up and responsive.

Additional info:

Comment 1 Yuri Obshansky 2014-08-14 09:14:58 UTC
Created attachment 926719 [details]
Screenshot

Comment 2 Yuri Obshansky 2014-08-14 10:23:23 UTC
Created attachment 926734 [details]
vdsm log

Comment 3 Yuri Obshansky 2014-08-14 10:28:38 UTC
Created attachment 926735 [details]
engine log

Comment 4 Liran Zelkha 2014-08-14 12:47:44 UTC
I don't understand. Please provide more information on the setup: how many hosts, and how many VMs per host? Where is the failure - in VDSM or in the engine?

Comment 5 Yuri Obshansky 2014-08-14 12:52:51 UTC
The current setup has:
200 test Data Centers, each containing a Cluster, a Fake Host and a Storage Domain.
To create that setup, I use another Data Center with 2 Hosts and 200 qemu VMs.
The test Data Centers and Hosts periodically become Non Responsive.
I see errors in both log files, engine.log and vdsm.log.
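
To quantify how often each test Data Center flaps, engine.log can be scanned for the audit message shown in the excerpt above ("Invalid status on Data Center ... Setting status to Non Responsive"). A small counting helper, assuming that exact message format, could look like:

import re
from collections import Counter

# Matches the audit-log line format from the engine.log excerpt above.
NON_RESPONSIVE = re.compile(r"Invalid status on Data Center (TEST_DC_\d+)\. "
                            r"Setting status to Non Responsive")

def count_non_responsive(path="engine.log"):
    counts = Counter()
    with open(path) as log:
        for line in log:
            match = NON_RESPONSIVE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    for dc, hits in count_non_responsive().most_common(20):
        print("%s went Non Responsive %d times" % (dc, hits))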

Comment 8 Allon Mureinik 2015-04-19 15:42:38 UTC
Closing old bugs, as per Itamar's guidelines.
If you think this bug is worth fixing, please feel free to reopen.