Bug 1168540

Summary: [scale] RHEV-M with remote database is not functional, every hour DC crashes.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Yuri Obshansky <yobshans>
Component: ovirt-engine
Assignee: Nobody <nobody>
Status: CLOSED NOTABUG
Severity: urgent
Priority: unspecified
Version: 3.5.0
CC: ecohen, gklein, iheim, laravot, lpeer, lsurette, michal.skrivanek, oourfali, pkliczew, pstehlik, rbalakri, Rhev-m-bugs, yeylon, yobshans
Target Milestone: ---
Target Release: 3.5.0
Keywords: Triaged
Hardware: x86_64
OS: Linux
Whiteboard: infra
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2014-12-01 16:42:24 UTC
Type: Bug
Regression: ---
oVirt Team: Infra
Cloudforms Team: ---
Bug Blocks: 1164308, 1164311
Attachments:
  engine.log (flags: none)
  vdsm.log (flags: none)
  screenshot errors (flags: none)

Description Yuri Obshansky 2014-11-27 09:21:14 UTC
Description of problem:
Problem detected during a local/remote database benchmark test.
The load test on RHEV-M (200 VMs) with a local database passed.
RHEV-M with a remote database is not functional even with 100 VMs:
every hour the DC crashes and reconstruction starts.
Important: the host was added using the XML-RPC protocol.
Errors:
engine.log
2014-11-27 10:23:38,683 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] (DefaultQuartzScheduler_Worker-28) [65813a3b] Failed in HSMGetTaskStatusVDS method
....
2014-11-27 10:23:38,683 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [65813a3b] Start SPM Task failed - result: cleanSuccess, message: VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = Storage domain does not exist, code = 358
......
2014-11-27 10:23:45,379 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-28) [3859571f] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: 58e1de38-4a77-476c-b917-a097096b9f38 Type: StoragePool
2014-11-27 10:23:45,390 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-28) [3859571f] Correlation ID: 3859571f, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center DC-REAL. Setting status to Non Responsive.
2014-11-27 10:23:45,393 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-28) [3859571f] IrsBroker::Failed::GetStoragePoolInfoVDS due to: IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed
2014-11-27 10:23:56,001 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-40) [7116263e] Command org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand return value 
TaskStatusListReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=654, mMessage=Not SPM]]

vdsm.log:
845d40b4-9eaf-44b8-b09c-971534b080a3::ERROR::2014-11-27 10:10:42,032::task::645::Storage.TaskManager.Task::(_loadMetaFile) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 624, in _loadMetaFile
    for line in getProcPool().readLines(filename):
  File "/usr/share/vdsm/storage/outOfProcess.py", line 329, in readLines
    return ioproc.readlines(path)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 454, in readlines
    return self.readfile(path, direct).splitlines()
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 442, in readfile
    "direct": direct}, self.timeout)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 351, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 2] No such file or directory
Thread-14::DEBUG::2014-11-27 10:10:42,033::__init__::232::IOProcess::(_processLogs) (null)|Receiving request...
845d40b4-9eaf-44b8-b09c-971534b080a3::ERROR::2014-11-27 10:10:42,061::taskManager::222::Storage.TaskManager::(loadDumpedTasks) taskManager: Skipping directory: 6fb087a8-c5a7-4eed-bf38-f7395624c3ec
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/taskManager.py", line 214, in loadDumpedTasks
    t = Task.loadTask(store, taskID)
  File "/usr/share/vdsm/storage/task.py", line 1152, in loadTask
    t._load(store, ext)
  File "/usr/share/vdsm/storage/task.py", line 739, in _load
    self._loadJobMetaFile(taskDir, jn)
  File "/usr/share/vdsm/storage/task.py", line 689, in _loadJobMetaFile
    self._loadMetaFile(taskFile, self.jobs[n], Job.fields)
  File "/usr/share/vdsm/storage/task.py", line 646, in _loadMetaFile
    raise se.TaskMetaDataLoadError(filename)
TaskMetaDataLoadError: Can't load Task Metadata: ('/rhev/data-center/58e1de38-4a77-476c-b917-a097096b9f38/mastersd/master/tasks/6fb087a8-c5a7-4eed-bf38-f7395624c3ec/6fb087a8-c5a7-4eed-bf38-f7395624c3ec.job.0',)

Thread-346::ERROR::2014-11-27 10:19:10,341::task::866::Storage.TaskManager.Task::(_setError) Task=`c9bf74d9-4ebd-4d3d-afee-f478ed9534d9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3099, in getVolumeSize
    apparentsize = str(dom.getVSize(imgUUID, volUUID))
  File "/usr/share/vdsm/storage/fileSD.py", line 316, in getVSize
    return self.oop.os.stat(volPath).st_size
  File "/usr/share/vdsm/storage/outOfProcess.py", line 241, in stat
    return self._iop.stat(path)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 374, in stat
    resdict = self._sendCommand("stat", {"path": path}, self.timeout)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 345, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out


Version-Release number of selected component (if applicable):
RHEV-M 3.5.0-0.17.beta.el6ev
RHEL - 6Server - 6.6.0.2.el6
libvirt-0.10.2-46.el6_6.1
vdsm-4.16.7.1-1.el6ev
Hardware:
RHEV-M server:
8 x Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
16 G RAM

Host server:
64 x Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (32 cores)
400 G RAM

Remote Database server:
Virtual machine
8 x Westmere E56xx/L56xx/X56xx (Nehalem-C)
16 G RAM

How reproducible:
Install RHEV-M
1 DC, 1 Cluster, 1 Host, 200 VMs, Netapp Storage Domain


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Network bandwidth examined using the iperf tool:

from engine to remote db
[  3] local 10.35.3.13 port 53722 connected with 10.35.17.28 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes   935 Mbits/sec

from engine to host
[  3] local 10.35.3.13 port 56820 connected with 10.35.19.152 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.03 GBytes   887 Mbits/sec

from remote db to engine
[  3] local 10.35.17.28 port 34569 connected with 10.35.3.13 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.05 GBytes   900 Mbits/sec

from host to engine
[  3] local 10.35.19.152 port 57585 connected with 10.35.3.13 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   980 MBytes   822 Mbits/sec

Comment 1 Yuri Obshansky 2014-11-27 09:21:53 UTC
Created attachment 961917 [details]
engine.log

Comment 2 Yuri Obshansky 2014-11-27 09:22:41 UTC
Created attachment 961918 [details]
vdsm.log

Comment 3 Yuri Obshansky 2014-11-27 09:23:23 UTC
Created attachment 961919 [details]
screenshot errors

Comment 4 Piotr Kliczewski 2014-11-27 09:32:05 UTC
It seems that bug #1102147 is related.

Comment 5 Yuri Obshansky 2014-11-27 09:40:13 UTC
VM profile: 1 G RAM, 2 CPUs, 10G disk

Comment 6 Liron Aravot 2014-11-27 10:14:47 UTC
The problem shouldn't be related to the remote DB:
repoStats is called periodically and lastCheck returns very high values, which I assume is related to the heavy load the host is under.
You can increase the MaxStorageVdsTimeoutCheckSec config value so the domain won't be considered "problematic" based on those report results.
(Read also https://bugzilla.redhat.com/show_bug.cgi?id=1102147#c89.)

Thread-10375::INFO::2014-11-27 10:15:13,996::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'6a70504d-00df-4869-8695-0f442053f6b6': {'code': 0, 'version': 0, 'acquired': True, 'delay': '0.00100719', 'lastCheck': '0.5', 'valid': True}, '8bf73417-ab48-40db-8590-514f64fc75b0': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000568427', 'lastCheck': '69.8', 'valid': True}}


Thread-10375::INFO::2014-11-27 10:17:46,504::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'6a70504d-00df-4869-8695-0f442053f6b6': {'code': 0, 'version': 0, 'acquired': True, 'delay': '0.00186739', 'lastCheck': '50.9', 'valid': True}, '8bf73417-ab48-40db-8590-514f64fc75b0': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000568427', 'lastCheck': '222.3', 'valid': True}}
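The two responses above show the second domain's lastCheck climbing from 69.8 to 222.3 seconds even though the delay itself stays tiny. A minimal sketch of the kind of threshold check implied, assuming a helper of our own and an assumed 30-second default for MaxStorageVdsTimeoutCheckSec (neither is taken from engine code; the value can typically be raised with engine-config, e.g. engine-config -s MaxStorageVdsTimeoutCheckSec=<seconds>, followed by an engine restart):

# Illustrative sketch only: flag storage domains whose repoStats lastCheck
# exceeds a timeout, mirroring how a domain ends up treated as "problematic".
# The 30-second default and this helper are assumptions, not engine code.
MAX_STORAGE_VDS_TIMEOUT_CHECK_SEC = 30  # assumed default, configurable in the engine

def problematic_domains(repo_stats, timeout=MAX_STORAGE_VDS_TIMEOUT_CHECK_SEC):
    """Return UUIDs of domains whose last successful check is older than `timeout`."""
    return [sd_uuid for sd_uuid, stats in repo_stats.items()
            if float(stats.get('lastCheck', 0)) > timeout or not stats.get('valid', True)]

# Example using the second repoStats response logged above:
repo_stats = {
    '6a70504d-00df-4869-8695-0f442053f6b6': {'code': 0, 'version': 0, 'acquired': True,
                                             'delay': '0.00186739', 'lastCheck': '50.9', 'valid': True},
    '8bf73417-ab48-40db-8590-514f64fc75b0': {'code': 0, 'version': 3, 'acquired': True,
                                             'delay': '0.000568427', 'lastCheck': '222.3', 'valid': True},
}
print(problematic_domains(repo_stats))  # both UUIDs exceed the assumed 30s threshold

Under that assumption both domains in the 10:17:46 response would be flagged, which is consistent with the storage being treated as non-responsive despite the low per-check delay.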

Comment 7 Yuri Obshansky 2014-11-27 10:18:11 UTC
So why can I load the host with 200 VMs when using a local database?
As I mentioned above, that test passed successfully.
How do you explain that?

Comment 9 Michal Skrivanek 2014-11-28 11:11:17 UTC
(In reply to Yuri Obshansky from comment #7)
> So why can I load the host with 200 VMs when using a local database?
> As I mentioned above, that test passed successfully.

Can you describe the differences between those two runs? E.g., is this the same hardware? Is the storage shared with anyone/anything? Were the same scripts used to start all those VMs in both cases? Was there any difference in the rate at which they were created and/or started? What is the load on the host(s) with 100 VMs, the highest CPU, the highest I/O? Are the guests installed/doing anything?

Thanks,
michal

Comment 10 Liron Aravot 2014-12-01 13:37:42 UTC
(In reply to Yuri Obshansky from comment #7)
> So why can I load the host with 200 VMs when using a local database?
> As I mentioned above, that test passed successfully.
> How do you explain that?

Hi Yuri,
I performed an RCA on the log of the failure. As I mentioned, this log shows no evidence that the DB is the cause of the issue; the problem lies in the coupling between host-storage monitoring and VM/host monitoring.

Once there are logs of a successful run, we'll be able to compare the differences and analyze why it succeeded.
As it seems to me, the DB isn't the cause here, and the earlier success was a matter of luck.

Comment 12 Yuri Obshansky 2014-12-01 16:42:24 UTC
Hi, I tried to reproduce it several times, unsuccessfully.
It looks like you are right and this problem is not related to the remote database.
Let's close it.