Bug 1163089

Summary: Starting of VM is slow - error = java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.util.Map
Product: Red Hat Enterprise Virtualization Manager Reporter: Jiri Belka <jbelka>
Component: ovirt-engineAssignee: Francesco Romani <fromani>
Status: CLOSED CURRENTRELEASE QA Contact: Nisim Simsolo <nsimsolo>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 3.5.0CC: bazulay, dougsland, ebenahar, ecohen, gklein, iheim, lpeer, lsurette, michal.skrivanek, ofrenkel, oourfali, pbrilla, pkliczew, rbalakri, Rhev-m-bugs, sherold, yeylon
Target Milestone: ---   
Target Release: 3.5.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: virt
Fixed In Version: vt12 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-02-17 08:28:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1164308, 1164311    
Attachments:
Description Flags
logs from engine and host
none
vdsm logs and messages none

Description Jiri Belka 2014-11-12 11:13:32 UTC
Created attachment 956683 [details]
logs from engine and host

Description of problem:
after some time starting a VM is very slow, interestingly the qemu process is executed immediately but in RHEVM UI the VM is still down and starting task is active. Then there's an event 'Failed to run VM $vm...' and then another VM with 'Invalid status on DC... Error: Network error during communication with the Host) [why is it _H_ost??].

checking the logs it seems that something fails and then it is unknown what happened and later it is (wrongly?) detected as network issue. of course there is _NO_ network issue.

- see ERRORs flow...

[root@jb-rhevm35 ~]# sed -n '/2014-11-12 11:3/,$p' /var/log/ovirt-engine/engine.log | grep ERROR
2014-11-12 11:34:17,338 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ResumeBrokerVDSCommand] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] Failed in ResumeBrokerVDS method, for vds: dell-r210ii-13; host: 10.34.62.205
2014-11-12 11:34:17,351 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ResumeBrokerVDSCommand] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] Command ResumeBrokerVDSCommand(HostName = dell-r210ii-13, HostId = 160be2d3-02bc-407a-9688-8a7edd8e4fd7, vmId=c2524851-4625-4719-ab49-8bb7a539333a) execution failed. Exception: ClassCastException: java.lang.Boolean cannot be cast to java.util.Map
2014-11-12 11:34:17,352 ERROR [org.ovirt.engine.core.vdsbroker.ResumeVDSCommand] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] VDS::pause Failed resume vm c2524851-4625-4719-ab49-8bb7a539333a in vds = 160be2d3-02bc-407a-9688-8a7edd8e4fd7 : dell-r210ii-13, error = java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.util.Map
2014-11-12 11:34:17,355 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] Command org.ovirt.engine.core.bll.RunVmCommand throw Vdc Bll exception. With error message VdcBLLException: java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.util.Map (Failed with error ENGINE and code 5001)
2014-11-12 11:34:17,377 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] Correlation ID: 318cd8f9, Job ID: 2c02c537-d6fb-4427-8312-6bb1dab86edf, Call Stack: null, Custom Event ID: -1, Message: Failed to resume VM test (Host: dell-r210ii-13, User: admin).
2014-11-12 11:37:48,571 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp-/127.0.0.1:8702-1) Command CreateVDSCommand(HostName = dell-r210ii-13, HostId = 160be2d3-02bc-407a-9688-8a7edd8e4fd7, vmId=c2524851-4625-4719-ab49-8bb7a539333a, vm=VM [test]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2014-11-12 11:37:48,572 ERROR [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (ajp-/127.0.0.1:8702-1) VDS::create Failed creating vm test in vds = 160be2d3-02bc-407a-9688-8a7edd8e4fd7 : dell-r210ii-13 error = org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2014-11-12 11:37:48,573 ERROR [org.ovirt.engine.core.bll.RunVmOnceCommand] (ajp-/127.0.0.1:8702-1) Command org.ovirt.engine.core.bll.RunVmOnceCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
2014-11-12 11:37:48,577 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-91) Command SpmStatusVDSCommand(HostName = dell-r210ii-13, HostId = 160be2d3-02bc-407a-9688-8a7edd8e4fd7, storagePoolId = 00000002-0002-0002-0002-000000000068) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Vds timeout occured
2014-11-12 11:37:48,607 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-91) [48f7f0cc] Running command: SetStoragePoolStatusCommand(Error = VDS_NETWORK_ERROR, AuditLogType = SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC_WITH_ERROR, Status = NonResponsive, VdsId = null, StoragePoolId = 00000002-0002-0002-0002-000000000068, ForceDelete = false) internal: true. Entities affected :  ID: 00000002-0002-0002-0002-000000000068 Type: StoragePool
2014-11-12 11:37:48,573 DEBUG [org.ovirt.engine.core.bll.RunVmOnceCommand] (ajp-/127.0.0.1:8702-1) Command org.ovirt.engine.core.bll.RunVmOnceCommand throw Vdc Bll exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
2014-11-12 11:37:48,640 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-1) Correlation ID: 343e5408, Job ID: 218516c4-5d0b-489d-9b2d-da00a37e4bf7, Call Stack: null, Custom Event ID: -1, Message: Failed to run VM test (User: admin).

- qemu-kvm

# ps -eo start,comm | grep '[q]emu-kvm'
10:34:48 qemu-kvm

- exceptions

2014-11-12 11:34:17,355 DEBUG [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] Command org.ovirt.engine.core.bll.RunVmCommand throw Vdc Bll exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.util.Map (Failed with error ENGINE and code 5001)
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:117) [bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunAsyncVdsCommand(VDSBrokerFrontendImpl.java:48) [bll.jar:]
        at org.ovirt.engine.core.bll.RunVmCommand.resumeVm(RunVmCommand.java:208) [bll.jar:]
        at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:333) [bll.jar:]
        at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:275) [bll.jar:]
        at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:100) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1168) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1307) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1932) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1331) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:344) [bll.jar:]
        at org.ovirt.engine.core.bll.MultipleActionsRunner.executeValidatedCommand(MultipleActionsRunner.java:191) [bll.jar:]
        at org.ovirt.engine.core.bll.MultipleActionsRunner.runCommands(MultipleActionsRunner.java:158) [bll.jar:]
        at org.ovirt.engine.core.bll.SortedMultipleActionsRunnerBase.runCommands(SortedMultipleActionsRunnerBase.java:20) [bll.jar:]
        at org.ovirt.engine.core.bll.MultipleActionsRunner$2.run(MultipleActionsRunner.java:167) [bll.jar:]
        at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:90) [utils.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_71]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_71]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_71]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_71]
        at java.lang.Thread.run(Thread.java:745) [rt.jar:1.7.0_71]

...
2014-11-12 11:37:48,573 DEBUG [org.ovirt.engine.core.bll.RunVmOnceCommand] (ajp-/127.0.0.1:8702-1) Command org.ovirt.engine.core.bll.RunVmOnceCommand throw Vdc Bll exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:117) [bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunAsyncVdsCommand(VDSBrokerFrontendImpl.java:48) [bll.jar:]
        at org.ovirt.engine.core.bll.RunVmCommand.createVm(RunVmCommand.java:508) [bll.jar:]
        at org.ovirt.engine.core.bll.RunVmCommand.runVm(RunVmCommand.java:230) [bll.jar:]
        at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:347) [bll.jar:]
        at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:275) [bll.jar:]
        at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:100) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1168) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1307) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1932) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1331) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:344) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:430) [bll.jar:]

Version-Release number of selected component (if applicable):
vt9 / vdsm-4.16.7.3-1.el6ev.x86_64 / rhevm-backend-3.5.0-0.19.beta.el6ev.noarch / rhevh66 - 20141107.0.el6ev - libvirt-0.10.2-46.el6_6.1.x86_64 & qemu-kvm-rhev-0.12.1.2-2.445.el6.x86_64

How reproducible:
unknown

Steps to Reproduce:
1. it happens :) (after some time it just happens)
2. start vm or start paused vm
3.

Actual results:
task takes cca 3 mins but in fact qemu-kvm process is executed immediately

Expected results:
status of vm should be starting as qemu-kvm process is executed

Additional info:
no denials in audit.log (if relevant), can following be relevant?

Nov 12 10:34:07 dell-r210ii-13 vdsm vm.Vm WARNING vmId=`c2524851-4625-4719-ab49-8bb7a539333a`::_readPauseCode unsupported by libvirt vm
Nov 12 10:34:49 dell-r210ii-13 vdsm vm.Vm WARNING vmId=`c2524851-4625-4719-ab49-8bb7a539333a`::_readPauseCode unsupported by libvirt vm

Comment 1 Piotr Kliczewski 2014-11-12 14:17:52 UTC
It looks like there was true returned whereas code expects a dictionary.

JsonRpcServer::DEBUG::2014-11-12 10:34:17,328::__init__::504::jsonrpc.JsonRpcServer::(serve_requests) Waiting for request
libvirtEventLoop::DEBUG::2014-11-12 10:34:17,332::vm::5464::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`c2524851-4625-4719-ab49-8bb7a539333a`::event Resumed detail 0 opaque None
Thread-45391::DEBUG::2014-11-12 10:34:17,333::__init__::498::jsonrpc.JsonRpcServer::(_serveRequest) Return 'VM.cont' in bridge with True

Seems like virt issue.

Comment 2 Michal Skrivanek 2014-11-12 14:46:19 UTC
VDSM's try/finally blocks in pause() and resume() skips the "usual" return value resulting in a surprise on the engine side

Comment 3 Francesco Romani 2014-11-12 15:40:23 UTC
(In reply to Michal Skrivanek from comment #2)
> VDSM's try/finally blocks in pause() and resume() skips the "usual" return
> value resulting in a surprise on the engine side

It seems there is something more underneath. Attaching some vdsm logs and /var/log/messages excerpt from the affected host

Comment 4 Francesco Romani 2014-11-12 15:42:05 UTC
Created attachment 956788 [details]
vdsm logs and messages

Comment 5 Francesco Romani 2014-11-12 17:17:08 UTC
Easiest way to trigger this issue seems to do
Run Once -> start in Pause mode
and then resume vm

still investigating why VDSM logs are silent.
libvirt debug logs aren't helping either, yet.

Comment 6 Francesco Romani 2014-11-13 07:44:44 UTC
The root cause is the schema mismatch between schema. On the basis of incorrect schema, VDSM incorrectly maps the return value to No value, and this confuses engine. Apparently JSONRPC and XMLRPC were using different code paths here: XMLRPC had hardcoded mapping, JSONRPC relies on schema.

Other flows may be affected.

Comment 7 Francesco Romani 2014-11-13 12:28:48 UTC
http://gerrit.ovirt.org/#/c/35105 fixes this issue - among other things

Comment 8 Michal Skrivanek 2014-11-18 07:35:00 UTC
this causes many exceptions in the engine.log, unpredictable behavior, though limited to the two specific verbs (it seems). Fix is low risk and ready hence proposing fixing for 3.5 GA

Comment 9 Omer Frenkel 2014-11-19 10:19:22 UTC
*** Bug 1163751 has been marked as a duplicate of this bug. ***

Comment 10 Michal Skrivanek 2014-11-24 13:58:15 UTC
*** Bug 1167315 has been marked as a duplicate of this bug. ***

Comment 11 Nisim Simsolo 2014-12-16 09:26:28 UTC
Verified, bug could not reproduce.
if same issue will occur bug will be reopened.
Verified using vt13.1:
engine:  3.5.0-0.23.beta.el6ev
host: vdsm-4.16.8.1-2.el7ev.x86_64
libvirt-1.1.1-29.el7_0.4.x86_64
sanlock-3.1.0-2.el7.x86_64
qemu-kvm-rhev-1.5.3-60.el7_0.11.x86_64

Comment 12 Omer Frenkel 2015-02-17 08:28:21 UTC
RHEV-M 3.5.0 has been released