1168540 – [scale] RHEV-M with remote database is not functional, every hour DC crashes.

Bug 1168540 - [scale] RHEV-M with remote database is not functional, every hour DC crashes.

Summary: [scale] RHEV-M with remote database is not functional, every hour DC crashes.

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.5.0
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	3.5.0
Assignee:	Nobody
QA Contact:
Docs Contact:
URL:
Whiteboard:	infra
Depends On:
Blocks:	rhev35rcblocker rhev35gablocker
TreeView+	depends on / blocked

Reported:	2014-11-27 09:21 UTC by Yuri Obshansky
Modified:	2016-02-10 19:17 UTC (History)
CC List:	14 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2014-12-01 16:42:24 UTC
oVirt Team:	Infra
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
engine.log (809.81 KB, application/x-gzip) 2014-11-27 09:21 UTC, Yuri Obshansky	no flags	Details
vdsm.log (4.02 MB, application/x-gzip) 2014-11-27 09:22 UTC, Yuri Obshansky	no flags	Details
screenshot errors (251.68 KB, image/jpeg) 2014-11-27 09:23 UTC, Yuri Obshansky	no flags	Details
View All

Description Yuri Obshansky 2014-11-27 09:21:14 UTC

Description of problem:
Problem detected during local/remote database benchmark test.
Load test on RHEV-M (200 VMs) with local database passed. 
RHEV-M with remote database is not functional already with 100 VMs.
Every hour DC crashes and started reconstruct
Important: Host created with XMLRPC protocol
Errors:
engine.log
2014-11-27 10:23:38,683 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] (DefaultQuartzScheduler_Worker-28) [65813a3b] Failed in HSMGetTaskStatusVDS method
....
2014-11-27 10:23:38,683 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [65813a3b] Start SPM Task failed - result: cleanSuccess, messa
ge: VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = Storage domain does not exist, code = 358
......
2014-11-27 10:23:45,379 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-28) [3859571f] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: 58e1de38-4a77-476c-b917-a097096b9f38 Type: StoragePool
2014-11-27 10:23:45,390 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-28) [3859571f] Correlation ID: 3859571f, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center DC-REAL. Setting status to Non Responsive.
2014-11-27 10:23:45,393 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-28) [3859571f] IrsBroker::Failed::GetStoragePoolInfoVDS due to: IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed
2014-11-27 10:23:56,001 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-40) [7116263e] Command org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand return value 
TaskStatusListReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=654, mMessage=Not SPM]]

vdsm.log:
845d40b4-9eaf-44b8-b09c-971534b080a3::ERROR::2014-11-27 10:10:42,032::task::645::Storage.TaskManager.Task::(_loadMetaFile) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 624, in _loadMetaFile
    for line in getProcPool().readLines(filename):
  File "/usr/share/vdsm/storage/outOfProcess.py", line 329, in readLines
    return ioproc.readlines(path)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 454, in readlines
    return self.readfile(path, direct).splitlines()
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 442, in readfile
    "direct": direct}, self.timeout)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 351, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 2] No such file or directory
Thread-14::DEBUG::2014-11-27 10:10:42,033::__init__::232::IOProcess::(_processLogs) (null)|Receiving request...
845d40b4-9eaf-44b8-b09c-971534b080a3::ERROR::2014-11-27 10:10:42,061::taskManager::222::Storage.TaskManager::(loadDumpedTasks) taskManager: Skipping directory: 6fb087a8-c5a7-4eed-bf38-f7395624c3ec
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/taskManager.py", line 214, in loadDumpedTasks
    t = Task.loadTask(store, taskID)
  File "/usr/share/vdsm/storage/task.py", line 1152, in loadTask
    t._load(store, ext)
  File "/usr/share/vdsm/storage/task.py", line 739, in _load
    self._loadJobMetaFile(taskDir, jn)
  File "/usr/share/vdsm/storage/task.py", line 689, in _loadJobMetaFile
    self._loadMetaFile(taskFile, self.jobs[n], Job.fields)
  File "/usr/share/vdsm/storage/task.py", line 646, in _loadMetaFile
    raise se.TaskMetaDataLoadError(filename)
TaskMetaDataLoadError: Can't load Task Metadata: ('/rhev/data-center/58e1de38-4a77-476c-b917-a097096b9f38/mastersd/master/tasks/6fb087a8-c5a7-4eed-bf38-f7395624c3ec/6fb087a8-c5a7-4eed-bf38-f7395624c3ec.job.0',)

Thread-346::ERROR::2014-11-27 10:19:10,341::task::866::Storage.TaskManager.Task::(_setError) Task=`c9bf74d9-4ebd-4d3d-afee-f478ed9534d9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3099, in getVolumeSize
    apparentsize = str(dom.getVSize(imgUUID, volUUID))
  File "/usr/share/vdsm/storage/fileSD.py", line 316, in getVSize
    return self.oop.os.stat(volPath).st_size
  File "/usr/share/vdsm/storage/outOfProcess.py", line 241, in stat
    return self._iop.stat(path)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 374, in stat
    resdict = self._sendCommand("stat", {"path": path}, self.timeout)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 345, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out


Version-Release number of selected component (if applicable):
RHEV-M 3.5.0-0.17.beta.el6ev
RHEL - 6Server - 6.6.0.2.el6
libvirt-0.10.2-46.el6_6.1
vdsm-4.16.7.1-1.el6ev
Hardware:
RHEV-M server:
8 x Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
16 G RAM

Host server:
64 x Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (32 cores)
400 G RAM

Remote Database server:
Virtual machine
8 x Westmere E56xx/L56xx/X56xx (Nehalem-C)
16 G RAM

How reproducible:
Install RHEV-M
1 DC, 1 Cluster, 1 Host, 200 VMs, Netapp Storage Domain


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Examine network using iperf tool:

from engine to remote db
[  3] local 10.35.3.13 port 53722 connected with 10.35.17.28 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes   935 Mbits/sec

from engine to host
[  3] local 10.35.3.13 port 56820 connected with 10.35.19.152 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.03 GBytes   887 Mbits/sec

> from remote db to engine
> [  3] local 10.35.17.28 port 34569 connected with 10.35.3.13 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  1.05 GBytes   900 Mbits/sec
>
> from host to engine
> [  3] local 10.35.19.152 port 57585 connected with 10.35.3.13 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec   980 MBytes   822 Mbits/sec

Comment 1 Yuri Obshansky 2014-11-27 09:21:53 UTC

Created attachment 961917 [details]
engine.log

Comment 2 Yuri Obshansky 2014-11-27 09:22:41 UTC

Created attachment 961918 [details]
vdsm.log

Comment 3 Yuri Obshansky 2014-11-27 09:23:23 UTC

Created attachment 961919 [details]
screenshot errors

Comment 4 Piotr Kliczewski 2014-11-27 09:32:05 UTC

It seems that the bug #1102147 is related.

Comment 5 Yuri Obshansky 2014-11-27 09:40:13 UTC

VM profile: 1 G RAM, 2 CPUs, 10G disk

Comment 6 Liron Aravot 2014-11-27 10:14:47 UTC

The problem shoudn't be related to remote db,
the repoStats is called every  minutes and lastCheck returns very high values, which i assume is related to the heavy load the host is under.
you can increase the MaxStorageVdsTimeoutCheckSec config value so the domain won't be considered as "problematic" under that report results.
(read also https://bugzilla.redhat.com/show_bug.cgi?id=1102147#c89)

Thread-10375::INFO::2014-11-27 10:15:13,996::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'6a70504d-00df-4869-8695-0f4420
53f6b6': {'code': 0, 'version': 0, 'acquired': True, 'delay': '0.00100719', 'lastCheck': '0.5', 'valid': True}, '8bf73417-ab48-40db-8590-514f64fc75b0': {'code
': 0, 'version': 3, 'acquired': True, 'delay': '0.000568427', 'lastCheck': '69.8', 'valid': True}}


Thread-10375::INFO::2014-11-27 10:17:46,504::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'6a70504d-00df-4869-8695-0f442053f6b6': {'code': 0, 'version': 0, 'acquired': True, 'delay': '0.00186739', 'lastCheck': '50.9', 'valid': True}, '8bf73417-ab48-40db-8590-514f64fc75b0': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000568427', 'lastCheck': '222.3', 'valid': True}}

Comment 7 Yuri Obshansky 2014-11-27 10:18:11 UTC

So, why I can load host with 200 VMs using local database?
As I mentioned above, test passed successfully.
How can you describe that?

Comment 9 Michal Skrivanek 2014-11-28 11:11:17 UTC

(In reply to Yuri Obshansky from comment #7)
> So, why I can load host with 200 VMs using local database?
> As I mentioned above, test passed successfully.

Can you describe the differences between those 2 runs? E.g. is this the same HW? Is storage shared with anyone/anything? Same scripts were used to start all those VMs in both cases? Was there any difference in the rate they have been created and/or started? What is the load on host(s) with 100 VMs, highest CPU, highest I/O? Are the guests installed/doing anything?

Thanks,
michal

Comment 10 Liron Aravot 2014-12-01 13:37:42 UTC

(In reply to Yuri Obshansky from comment #7)
> So, why I can load host with 200 VMs using local database?
> As I mentioned above, test passed successfully.
> How can you describe that?

Hi Yuri,
I performed RCA for the log of the failure, as i mentioned - in this log there's no evidence for the DB as a cause for the issue, the problem is between the host-storage and the vm/host monitoring coupling.

When there will be logs of a succefull run, we'll be able to compare the differences and to analyze why it succeeded,
as it seems to me - the DB isn't the cause here, and it was matter of luck.

Comment 12 Yuri Obshansky 2014-12-01 16:42:24 UTC

Hi, I tried reproduce it several times unsuccessfully.
Looks like you are right and this problem is not related to remote database.
Let's close it.

Note You need to log in before you can comment on or make changes to this bug.