Description of problem:
The problem was detected during a local/remote database benchmark test. A load test on RHEV-M (200 VMs) with a local database passed. RHEV-M with a remote database is not functional even with 100 VMs: every hour the DC crashes and starts a reconstruct.

Important: the host was created with the XMLRPC protocol.

Errors:

engine.log:

2014-11-27 10:23:38,683 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] (DefaultQuartzScheduler_Worker-28) [65813a3b] Failed in HSMGetTaskStatusVDS method
....
2014-11-27 10:23:38,683 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-28) [65813a3b] Start SPM Task failed - result: cleanSuccess, message: VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = Storage domain does not exist, code = 358
......
2014-11-27 10:23:45,379 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-28) [3859571f] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: 58e1de38-4a77-476c-b917-a097096b9f38 Type: StoragePool
2014-11-27 10:23:45,390 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-28) [3859571f] Correlation ID: 3859571f, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center DC-REAL. Setting status to Non Responsive.
2014-11-27 10:23:45,393 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-28) [3859571f] IrsBroker::Failed::GetStoragePoolInfoVDS due to: IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed
2014-11-27 10:23:56,001 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-40) [7116263e] Command org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand return value
TaskStatusListReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=654, mMessage=Not SPM]]

vdsm.log:

845d40b4-9eaf-44b8-b09c-971534b080a3::ERROR::2014-11-27 10:10:42,032::task::645::Storage.TaskManager.Task::(_loadMetaFile) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 624, in _loadMetaFile
    for line in getProcPool().readLines(filename):
  File "/usr/share/vdsm/storage/outOfProcess.py", line 329, in readLines
    return ioproc.readlines(path)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 454, in readlines
    return self.readfile(path, direct).splitlines()
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 442, in readfile
    "direct": direct}, self.timeout)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 351, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 2] No such file or directory
Thread-14::DEBUG::2014-11-27 10:10:42,033::__init__::232::IOProcess::(_processLogs) (null)|Receiving request...

845d40b4-9eaf-44b8-b09c-971534b080a3::ERROR::2014-11-27 10:10:42,061::taskManager::222::Storage.TaskManager::(loadDumpedTasks) taskManager: Skipping directory: 6fb087a8-c5a7-4eed-bf38-f7395624c3ec
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/taskManager.py", line 214, in loadDumpedTasks
    t = Task.loadTask(store, taskID)
  File "/usr/share/vdsm/storage/task.py", line 1152, in loadTask
    t._load(store, ext)
  File "/usr/share/vdsm/storage/task.py", line 739, in _load
    self._loadJobMetaFile(taskDir, jn)
  File "/usr/share/vdsm/storage/task.py", line 689, in _loadJobMetaFile
    self._loadMetaFile(taskFile, self.jobs[n], Job.fields)
  File "/usr/share/vdsm/storage/task.py", line 646, in _loadMetaFile
    raise se.TaskMetaDataLoadError(filename)
TaskMetaDataLoadError: Can't load Task Metadata: ('/rhev/data-center/58e1de38-4a77-476c-b917-a097096b9f38/mastersd/master/tasks/6fb087a8-c5a7-4eed-bf38-f7395624c3ec/6fb087a8-c5a7-4eed-bf38-f7395624c3ec.job.0',)

Thread-346::ERROR::2014-11-27 10:19:10,341::task::866::Storage.TaskManager.Task::(_setError) Task=`c9bf74d9-4ebd-4d3d-afee-f478ed9534d9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3099, in getVolumeSize
    apparentsize = str(dom.getVSize(imgUUID, volUUID))
  File "/usr/share/vdsm/storage/fileSD.py", line 316, in getVSize
    return self.oop.os.stat(volPath).st_size
  File "/usr/share/vdsm/storage/outOfProcess.py", line 241, in stat
    return self._iop.stat(path)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 374, in stat
    resdict = self._sendCommand("stat", {"path": path}, self.timeout)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 345, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

Version-Release number of selected component (if applicable):
RHEV-M 3.5.0-0.17.beta.el6ev
RHEL - 6Server - 6.6.0.2.el6
libvirt-0.10.2-46.el6_6.1
vdsm-4.16.7.1-1.el6ev

Hardware:
RHEV-M server: 8 x Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16 G RAM
Host server: 64 x Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (32 cores), 400 G RAM
Remote database server: virtual machine, 8 x Westmere E56xx/L56xx/X56xx (Nehalem-C), 16 G RAM

How reproducible:
Install RHEV-M: 1 DC, 1 Cluster, 1 Host, 200 VMs, NetApp Storage Domain

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Network examined using the iperf tool:

from engine to remote db
[  3] local 10.35.3.13 port 53722 connected with 10.35.17.28 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes  935 Mbits/sec

from engine to host
[  3] local 10.35.3.13 port 56820 connected with 10.35.19.152 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.03 GBytes  887 Mbits/sec

from remote db to engine
[  3] local 10.35.17.28 port 34569 connected with 10.35.3.13 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.05 GBytes  900 Mbits/sec

from host to engine
[  3] local 10.35.19.152 port 57585 connected with 10.35.3.13 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   980 MBytes  822 Mbits/sec
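For reference, the bandwidth lines above are standard iperf (version 2) client output (note the default port 5001); a minimal sketch of how such a measurement is run, assuming iperf is installed on both ends and using the IPs quoted above:

  # on the receiving end, e.g. the engine (10.35.3.13)
  iperf -s

  # on the sending end, e.g. the remote DB server
  iperf -c 10.35.3.13 -t 10

Each client run prints an Interval/Transfer/Bandwidth line like the ones quoted above.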
Created attachment 961917 [details] engine.log
Created attachment 961918 [details] vdsm.log
Created attachment 961919 [details] screenshot errors
It seems that bug #1102147 is related.
VM profile: 1 G RAM, 2 CPUs, 10 G disk
The problem shouldn't be related to the remote DB: repoStats is called every minute and lastCheck returns very high values, which I assume is related to the heavy load the host is under. You can increase the MaxStorageVdsTimeoutCheckSec config value (see the sketch after the log excerpt below) so the domain won't be considered "problematic" given those reported results (read also https://bugzilla.redhat.com/show_bug.cgi?id=1102147#c89).

Thread-10375::INFO::2014-11-27 10:15:13,996::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'6a70504d-00df-4869-8695-0f442053f6b6': {'code': 0, 'version': 0, 'acquired': True, 'delay': '0.00100719', 'lastCheck': '0.5', 'valid': True}, '8bf73417-ab48-40db-8590-514f64fc75b0': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000568427', 'lastCheck': '69.8', 'valid': True}}
Thread-10375::INFO::2014-11-27 10:17:46,504::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'6a70504d-00df-4869-8695-0f442053f6b6': {'code': 0, 'version': 0, 'acquired': True, 'delay': '0.00186739', 'lastCheck': '50.9', 'valid': True}, '8bf73417-ab48-40db-8590-514f64fc75b0': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000568427', 'lastCheck': '222.3', 'valid': True}}
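A sketch of how such a config change is typically applied, using the engine-config tool on the RHEV-M machine (the value 90 is only an example; pick one appropriate for your load):

  # show the current value
  engine-config -g MaxStorageVdsTimeoutCheckSec

  # raise it, e.g. to 90 seconds
  engine-config -s MaxStorageVdsTimeoutCheckSec=90

  # the engine must be restarted for the change to take effect
  service ovirt-engine restart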
So why can I load the host with 200 VMs using a local database? As I mentioned above, that test passed successfully. How do you explain that?
(In reply to Yuri Obshansky from comment #7)
> So why can I load the host with 200 VMs using a local database?
> As I mentioned above, that test passed successfully.

Can you describe the differences between those 2 runs? E.g. is this the same HW? Is the storage shared with anyone/anything? Were the same scripts used to start all those VMs in both cases? Was there any difference in the rate at which they were created and/or started? What is the load on the host(s) with 100 VMs - highest CPU, highest I/O (a suggestion on how to capture these follows below)? Are the guests installed/doing anything?

Thanks,
michal
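Regarding capturing the load numbers asked about above, a minimal sketch using standard RHEL 6 tools on the host while the VMs are running (iostat requires the sysstat package; the intervals and counts are just examples):

  top -b -n 1 | head -20   # snapshot of overall CPU load and the top consumers
  iostat -x 5 3            # extended per-device I/O statistics, 3 samples 5s apart
  vmstat 5 3               # run queue length, memory pressure, I/O wait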
(In reply to Yuri Obshansky from comment #7)
> So why can I load the host with 200 VMs using a local database?
> As I mentioned above, that test passed successfully.
> How do you explain that?

Hi Yuri,
I performed an RCA on the log of the failure. As I mentioned, there is no evidence in this log that the DB is a cause of the issue; the problem is between the host-storage path and the VM/host monitoring coupling. When there are logs of a successful run, we will be able to compare the differences and analyze why it succeeded. As it seems to me, the DB isn't the cause here, and the earlier success was a matter of luck.
Hi, I tried to reproduce it several times, unsuccessfully. It looks like you are right and this problem is not related to the remote database. Let's close it.