Created attachment 956683 [details] logs from engine and host Description of problem: after some time starting a VM is very slow, interestingly the qemu process is executed immediately but in RHEVM UI the VM is still down and starting task is active. Then there's an event 'Failed to run VM $vm...' and then another VM with 'Invalid status on DC... Error: Network error during communication with the Host) [why is it _H_ost??]. checking the logs it seems that something fails and then it is unknown what happened and later it is (wrongly?) detected as network issue. of course there is _NO_ network issue. - see ERRORs flow... [root@jb-rhevm35 ~]# sed -n '/2014-11-12 11:3/,$p' /var/log/ovirt-engine/engine.log | grep ERROR 2014-11-12 11:34:17,338 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ResumeBrokerVDSCommand] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] Failed in ResumeBrokerVDS method, for vds: dell-r210ii-13; host: 10.34.62.205 2014-11-12 11:34:17,351 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ResumeBrokerVDSCommand] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] Command ResumeBrokerVDSCommand(HostName = dell-r210ii-13, HostId = 160be2d3-02bc-407a-9688-8a7edd8e4fd7, vmId=c2524851-4625-4719-ab49-8bb7a539333a) execution failed. Exception: ClassCastException: java.lang.Boolean cannot be cast to java.util.Map 2014-11-12 11:34:17,352 ERROR [org.ovirt.engine.core.vdsbroker.ResumeVDSCommand] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] VDS::pause Failed resume vm c2524851-4625-4719-ab49-8bb7a539333a in vds = 160be2d3-02bc-407a-9688-8a7edd8e4fd7 : dell-r210ii-13, error = java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.util.Map 2014-11-12 11:34:17,355 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] Command org.ovirt.engine.core.bll.RunVmCommand throw Vdc Bll exception. With error message VdcBLLException: java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.util.Map (Failed with error ENGINE and code 5001) 2014-11-12 11:34:17,377 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] Correlation ID: 318cd8f9, Job ID: 2c02c537-d6fb-4427-8312-6bb1dab86edf, Call Stack: null, Custom Event ID: -1, Message: Failed to resume VM test (Host: dell-r210ii-13, User: admin). 2014-11-12 11:37:48,571 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp-/127.0.0.1:8702-1) Command CreateVDSCommand(HostName = dell-r210ii-13, HostId = 160be2d3-02bc-407a-9688-8a7edd8e4fd7, vmId=c2524851-4625-4719-ab49-8bb7a539333a, vm=VM [test]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues 2014-11-12 11:37:48,572 ERROR [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (ajp-/127.0.0.1:8702-1) VDS::create Failed creating vm test in vds = 160be2d3-02bc-407a-9688-8a7edd8e4fd7 : dell-r210ii-13 error = org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues 2014-11-12 11:37:48,573 ERROR [org.ovirt.engine.core.bll.RunVmOnceCommand] (ajp-/127.0.0.1:8702-1) Command org.ovirt.engine.core.bll.RunVmOnceCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022) 2014-11-12 11:37:48,577 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-91) Command SpmStatusVDSCommand(HostName = dell-r210ii-13, HostId = 160be2d3-02bc-407a-9688-8a7edd8e4fd7, storagePoolId = 00000002-0002-0002-0002-000000000068) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Vds timeout occured 2014-11-12 11:37:48,607 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-91) [48f7f0cc] Running command: SetStoragePoolStatusCommand(Error = VDS_NETWORK_ERROR, AuditLogType = SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC_WITH_ERROR, Status = NonResponsive, VdsId = null, StoragePoolId = 00000002-0002-0002-0002-000000000068, ForceDelete = false) internal: true. Entities affected : ID: 00000002-0002-0002-0002-000000000068 Type: StoragePool 2014-11-12 11:37:48,573 DEBUG [org.ovirt.engine.core.bll.RunVmOnceCommand] (ajp-/127.0.0.1:8702-1) Command org.ovirt.engine.core.bll.RunVmOnceCommand throw Vdc Bll exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022) 2014-11-12 11:37:48,640 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-1) Correlation ID: 343e5408, Job ID: 218516c4-5d0b-489d-9b2d-da00a37e4bf7, Call Stack: null, Custom Event ID: -1, Message: Failed to run VM test (User: admin). - qemu-kvm # ps -eo start,comm | grep '[q]emu-kvm' 10:34:48 qemu-kvm - exceptions 2014-11-12 11:34:17,355 DEBUG [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] Command org.ovirt.engine.core.bll.RunVmCommand throw Vdc Bll exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.util.Map (Failed with error ENGINE and code 5001) at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:117) [bll.jar:] at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:] at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunAsyncVdsCommand(VDSBrokerFrontendImpl.java:48) [bll.jar:] at org.ovirt.engine.core.bll.RunVmCommand.resumeVm(RunVmCommand.java:208) [bll.jar:] at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:333) [bll.jar:] at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:275) [bll.jar:] at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:100) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1168) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1307) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1932) [bll.jar:] at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:] at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:] at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1331) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:344) [bll.jar:] at org.ovirt.engine.core.bll.MultipleActionsRunner.executeValidatedCommand(MultipleActionsRunner.java:191) [bll.jar:] at org.ovirt.engine.core.bll.MultipleActionsRunner.runCommands(MultipleActionsRunner.java:158) [bll.jar:] at org.ovirt.engine.core.bll.SortedMultipleActionsRunnerBase.runCommands(SortedMultipleActionsRunnerBase.java:20) [bll.jar:] at org.ovirt.engine.core.bll.MultipleActionsRunner$2.run(MultipleActionsRunner.java:167) [bll.jar:] at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:90) [utils.jar:] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_71] at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_71] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_71] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_71] at java.lang.Thread.run(Thread.java:745) [rt.jar:1.7.0_71] ... 2014-11-12 11:37:48,573 DEBUG [org.ovirt.engine.core.bll.RunVmOnceCommand] (ajp-/127.0.0.1:8702-1) Command org.ovirt.engine.core.bll.RunVmOnceCommand throw Vdc Bll exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022) at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:117) [bll.jar:] at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:] at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunAsyncVdsCommand(VDSBrokerFrontendImpl.java:48) [bll.jar:] at org.ovirt.engine.core.bll.RunVmCommand.createVm(RunVmCommand.java:508) [bll.jar:] at org.ovirt.engine.core.bll.RunVmCommand.runVm(RunVmCommand.java:230) [bll.jar:] at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:347) [bll.jar:] at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:275) [bll.jar:] at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:100) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1168) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1307) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1932) [bll.jar:] at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:] at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:] at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1331) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:344) [bll.jar:] at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:430) [bll.jar:] Version-Release number of selected component (if applicable): vt9 / vdsm-4.16.7.3-1.el6ev.x86_64 / rhevm-backend-3.5.0-0.19.beta.el6ev.noarch / rhevh66 - 20141107.0.el6ev - libvirt-0.10.2-46.el6_6.1.x86_64 & qemu-kvm-rhev-0.12.1.2-2.445.el6.x86_64 How reproducible: unknown Steps to Reproduce: 1. it happens :) (after some time it just happens) 2. start vm or start paused vm 3. Actual results: task takes cca 3 mins but in fact qemu-kvm process is executed immediately Expected results: status of vm should be starting as qemu-kvm process is executed Additional info: no denials in audit.log (if relevant), can following be relevant? Nov 12 10:34:07 dell-r210ii-13 vdsm vm.Vm WARNING vmId=`c2524851-4625-4719-ab49-8bb7a539333a`::_readPauseCode unsupported by libvirt vm Nov 12 10:34:49 dell-r210ii-13 vdsm vm.Vm WARNING vmId=`c2524851-4625-4719-ab49-8bb7a539333a`::_readPauseCode unsupported by libvirt vm
It looks like there was true returned whereas code expects a dictionary. JsonRpcServer::DEBUG::2014-11-12 10:34:17,328::__init__::504::jsonrpc.JsonRpcServer::(serve_requests) Waiting for request libvirtEventLoop::DEBUG::2014-11-12 10:34:17,332::vm::5464::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`c2524851-4625-4719-ab49-8bb7a539333a`::event Resumed detail 0 opaque None Thread-45391::DEBUG::2014-11-12 10:34:17,333::__init__::498::jsonrpc.JsonRpcServer::(_serveRequest) Return 'VM.cont' in bridge with True Seems like virt issue.
VDSM's try/finally blocks in pause() and resume() skips the "usual" return value resulting in a surprise on the engine side
(In reply to Michal Skrivanek from comment #2) > VDSM's try/finally blocks in pause() and resume() skips the "usual" return > value resulting in a surprise on the engine side It seems there is something more underneath. Attaching some vdsm logs and /var/log/messages excerpt from the affected host
Created attachment 956788 [details] vdsm logs and messages
Easiest way to trigger this issue seems to do Run Once -> start in Pause mode and then resume vm still investigating why VDSM logs are silent. libvirt debug logs aren't helping either, yet.
The root cause is the schema mismatch between schema. On the basis of incorrect schema, VDSM incorrectly maps the return value to No value, and this confuses engine. Apparently JSONRPC and XMLRPC were using different code paths here: XMLRPC had hardcoded mapping, JSONRPC relies on schema. Other flows may be affected.
http://gerrit.ovirt.org/#/c/35105 fixes this issue - among other things
this causes many exceptions in the engine.log, unpredictable behavior, though limited to the two specific verbs (it seems). Fix is low risk and ready hence proposing fixing for 3.5 GA
*** Bug 1163751 has been marked as a duplicate of this bug. ***
*** Bug 1167315 has been marked as a duplicate of this bug. ***
Verified, bug could not reproduce. if same issue will occur bug will be reopened. Verified using vt13.1: engine: 3.5.0-0.23.beta.el6ev host: vdsm-4.16.8.1-2.el7ev.x86_64 libvirt-1.1.1-29.el7_0.4.x86_64 sanlock-3.1.0-2.el7.x86_64 qemu-kvm-rhev-1.5.3-60.el7_0.11.x86_64
RHEV-M 3.5.0 has been released