Bug 1163089 - Starting of VM is slow - error = java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.util.Map
Summary: Starting of VM is slow - error = java.lang.ClassCastException: java.lang.Bool...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.0
Hardware: x86_64
OS: Linux
unspecified
urgent
Target Milestone: ---
: 3.5.0
Assignee: Francesco Romani
QA Contact: Nisim Simsolo
URL:
Whiteboard: virt
: 1163751 1167315 (view as bug list)
Depends On:
Blocks: rhev35rcblocker rhev35gablocker
TreeView+ depends on / blocked
 
Reported: 2014-11-12 11:13 UTC by Jiri Belka
Modified: 2015-09-22 13:10 UTC (History)
17 users (show)

Fixed In Version: vt12
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-17 08:28:21 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
logs from engine and host (15.40 MB, application/x-tar)
2014-11-12 11:13 UTC, Jiri Belka
no flags Details
vdsm logs and messages (2.00 MB, application/x-tar)
2014-11-12 15:42 UTC, Francesco Romani
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 35105 0 master MERGED vm: state change missing return value 2020-05-13 16:14:29 UTC
oVirt gerrit 35150 0 ovirt-3.5 MERGED vm: state change missing return value 2020-05-13 16:14:29 UTC

Description Jiri Belka 2014-11-12 11:13:32 UTC
Created attachment 956683 [details]
logs from engine and host

Description of problem:
after some time starting a VM is very slow, interestingly the qemu process is executed immediately but in RHEVM UI the VM is still down and starting task is active. Then there's an event 'Failed to run VM $vm...' and then another VM with 'Invalid status on DC... Error: Network error during communication with the Host) [why is it _H_ost??].

checking the logs it seems that something fails and then it is unknown what happened and later it is (wrongly?) detected as network issue. of course there is _NO_ network issue.

- see ERRORs flow...

[root@jb-rhevm35 ~]# sed -n '/2014-11-12 11:3/,$p' /var/log/ovirt-engine/engine.log | grep ERROR
2014-11-12 11:34:17,338 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ResumeBrokerVDSCommand] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] Failed in ResumeBrokerVDS method, for vds: dell-r210ii-13; host: 10.34.62.205
2014-11-12 11:34:17,351 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ResumeBrokerVDSCommand] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] Command ResumeBrokerVDSCommand(HostName = dell-r210ii-13, HostId = 160be2d3-02bc-407a-9688-8a7edd8e4fd7, vmId=c2524851-4625-4719-ab49-8bb7a539333a) execution failed. Exception: ClassCastException: java.lang.Boolean cannot be cast to java.util.Map
2014-11-12 11:34:17,352 ERROR [org.ovirt.engine.core.vdsbroker.ResumeVDSCommand] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] VDS::pause Failed resume vm c2524851-4625-4719-ab49-8bb7a539333a in vds = 160be2d3-02bc-407a-9688-8a7edd8e4fd7 : dell-r210ii-13, error = java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.util.Map
2014-11-12 11:34:17,355 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] Command org.ovirt.engine.core.bll.RunVmCommand throw Vdc Bll exception. With error message VdcBLLException: java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.util.Map (Failed with error ENGINE and code 5001)
2014-11-12 11:34:17,377 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] Correlation ID: 318cd8f9, Job ID: 2c02c537-d6fb-4427-8312-6bb1dab86edf, Call Stack: null, Custom Event ID: -1, Message: Failed to resume VM test (Host: dell-r210ii-13, User: admin).
2014-11-12 11:37:48,571 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp-/127.0.0.1:8702-1) Command CreateVDSCommand(HostName = dell-r210ii-13, HostId = 160be2d3-02bc-407a-9688-8a7edd8e4fd7, vmId=c2524851-4625-4719-ab49-8bb7a539333a, vm=VM [test]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2014-11-12 11:37:48,572 ERROR [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (ajp-/127.0.0.1:8702-1) VDS::create Failed creating vm test in vds = 160be2d3-02bc-407a-9688-8a7edd8e4fd7 : dell-r210ii-13 error = org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2014-11-12 11:37:48,573 ERROR [org.ovirt.engine.core.bll.RunVmOnceCommand] (ajp-/127.0.0.1:8702-1) Command org.ovirt.engine.core.bll.RunVmOnceCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
2014-11-12 11:37:48,577 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-91) Command SpmStatusVDSCommand(HostName = dell-r210ii-13, HostId = 160be2d3-02bc-407a-9688-8a7edd8e4fd7, storagePoolId = 00000002-0002-0002-0002-000000000068) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Vds timeout occured
2014-11-12 11:37:48,607 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-91) [48f7f0cc] Running command: SetStoragePoolStatusCommand(Error = VDS_NETWORK_ERROR, AuditLogType = SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC_WITH_ERROR, Status = NonResponsive, VdsId = null, StoragePoolId = 00000002-0002-0002-0002-000000000068, ForceDelete = false) internal: true. Entities affected :  ID: 00000002-0002-0002-0002-000000000068 Type: StoragePool
2014-11-12 11:37:48,573 DEBUG [org.ovirt.engine.core.bll.RunVmOnceCommand] (ajp-/127.0.0.1:8702-1) Command org.ovirt.engine.core.bll.RunVmOnceCommand throw Vdc Bll exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
2014-11-12 11:37:48,640 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-1) Correlation ID: 343e5408, Job ID: 218516c4-5d0b-489d-9b2d-da00a37e4bf7, Call Stack: null, Custom Event ID: -1, Message: Failed to run VM test (User: admin).

- qemu-kvm

# ps -eo start,comm | grep '[q]emu-kvm'
10:34:48 qemu-kvm

- exceptions

2014-11-12 11:34:17,355 DEBUG [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-7-thread-44) [318cd8f9] Command org.ovirt.engine.core.bll.RunVmCommand throw Vdc Bll exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.util.Map (Failed with error ENGINE and code 5001)
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:117) [bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunAsyncVdsCommand(VDSBrokerFrontendImpl.java:48) [bll.jar:]
        at org.ovirt.engine.core.bll.RunVmCommand.resumeVm(RunVmCommand.java:208) [bll.jar:]
        at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:333) [bll.jar:]
        at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:275) [bll.jar:]
        at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:100) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1168) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1307) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1932) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1331) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:344) [bll.jar:]
        at org.ovirt.engine.core.bll.MultipleActionsRunner.executeValidatedCommand(MultipleActionsRunner.java:191) [bll.jar:]
        at org.ovirt.engine.core.bll.MultipleActionsRunner.runCommands(MultipleActionsRunner.java:158) [bll.jar:]
        at org.ovirt.engine.core.bll.SortedMultipleActionsRunnerBase.runCommands(SortedMultipleActionsRunnerBase.java:20) [bll.jar:]
        at org.ovirt.engine.core.bll.MultipleActionsRunner$2.run(MultipleActionsRunner.java:167) [bll.jar:]
        at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:90) [utils.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_71]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_71]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_71]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_71]
        at java.lang.Thread.run(Thread.java:745) [rt.jar:1.7.0_71]

...
2014-11-12 11:37:48,573 DEBUG [org.ovirt.engine.core.bll.RunVmOnceCommand] (ajp-/127.0.0.1:8702-1) Command org.ovirt.engine.core.bll.RunVmOnceCommand throw Vdc Bll exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:117) [bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunAsyncVdsCommand(VDSBrokerFrontendImpl.java:48) [bll.jar:]
        at org.ovirt.engine.core.bll.RunVmCommand.createVm(RunVmCommand.java:508) [bll.jar:]
        at org.ovirt.engine.core.bll.RunVmCommand.runVm(RunVmCommand.java:230) [bll.jar:]
        at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:347) [bll.jar:]
        at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:275) [bll.jar:]
        at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:100) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1168) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1307) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1932) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1331) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:344) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:430) [bll.jar:]

Version-Release number of selected component (if applicable):
vt9 / vdsm-4.16.7.3-1.el6ev.x86_64 / rhevm-backend-3.5.0-0.19.beta.el6ev.noarch / rhevh66 - 20141107.0.el6ev - libvirt-0.10.2-46.el6_6.1.x86_64 & qemu-kvm-rhev-0.12.1.2-2.445.el6.x86_64

How reproducible:
unknown

Steps to Reproduce:
1. it happens :) (after some time it just happens)
2. start vm or start paused vm
3.

Actual results:
task takes cca 3 mins but in fact qemu-kvm process is executed immediately

Expected results:
status of vm should be starting as qemu-kvm process is executed

Additional info:
no denials in audit.log (if relevant), can following be relevant?

Nov 12 10:34:07 dell-r210ii-13 vdsm vm.Vm WARNING vmId=`c2524851-4625-4719-ab49-8bb7a539333a`::_readPauseCode unsupported by libvirt vm
Nov 12 10:34:49 dell-r210ii-13 vdsm vm.Vm WARNING vmId=`c2524851-4625-4719-ab49-8bb7a539333a`::_readPauseCode unsupported by libvirt vm

Comment 1 Piotr Kliczewski 2014-11-12 14:17:52 UTC
It looks like there was true returned whereas code expects a dictionary.

JsonRpcServer::DEBUG::2014-11-12 10:34:17,328::__init__::504::jsonrpc.JsonRpcServer::(serve_requests) Waiting for request
libvirtEventLoop::DEBUG::2014-11-12 10:34:17,332::vm::5464::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`c2524851-4625-4719-ab49-8bb7a539333a`::event Resumed detail 0 opaque None
Thread-45391::DEBUG::2014-11-12 10:34:17,333::__init__::498::jsonrpc.JsonRpcServer::(_serveRequest) Return 'VM.cont' in bridge with True

Seems like virt issue.

Comment 2 Michal Skrivanek 2014-11-12 14:46:19 UTC
VDSM's try/finally blocks in pause() and resume() skips the "usual" return value resulting in a surprise on the engine side

Comment 3 Francesco Romani 2014-11-12 15:40:23 UTC
(In reply to Michal Skrivanek from comment #2)
> VDSM's try/finally blocks in pause() and resume() skips the "usual" return
> value resulting in a surprise on the engine side

It seems there is something more underneath. Attaching some vdsm logs and /var/log/messages excerpt from the affected host

Comment 4 Francesco Romani 2014-11-12 15:42:05 UTC
Created attachment 956788 [details]
vdsm logs and messages

Comment 5 Francesco Romani 2014-11-12 17:17:08 UTC
Easiest way to trigger this issue seems to do
Run Once -> start in Pause mode
and then resume vm

still investigating why VDSM logs are silent.
libvirt debug logs aren't helping either, yet.

Comment 6 Francesco Romani 2014-11-13 07:44:44 UTC
The root cause is the schema mismatch between schema. On the basis of incorrect schema, VDSM incorrectly maps the return value to No value, and this confuses engine. Apparently JSONRPC and XMLRPC were using different code paths here: XMLRPC had hardcoded mapping, JSONRPC relies on schema.

Other flows may be affected.

Comment 7 Francesco Romani 2014-11-13 12:28:48 UTC
http://gerrit.ovirt.org/#/c/35105 fixes this issue - among other things

Comment 8 Michal Skrivanek 2014-11-18 07:35:00 UTC
this causes many exceptions in the engine.log, unpredictable behavior, though limited to the two specific verbs (it seems). Fix is low risk and ready hence proposing fixing for 3.5 GA

Comment 9 Omer Frenkel 2014-11-19 10:19:22 UTC
*** Bug 1163751 has been marked as a duplicate of this bug. ***

Comment 10 Michal Skrivanek 2014-11-24 13:58:15 UTC
*** Bug 1167315 has been marked as a duplicate of this bug. ***

Comment 11 Nisim Simsolo 2014-12-16 09:26:28 UTC
Verified, bug could not reproduce.
if same issue will occur bug will be reopened.
Verified using vt13.1:
engine:  3.5.0-0.23.beta.el6ev
host: vdsm-4.16.8.1-2.el7ev.x86_64
libvirt-1.1.1-29.el7_0.4.x86_64
sanlock-3.1.0-2.el7.x86_64
qemu-kvm-rhev-1.5.3-60.el7_0.11.x86_64

Comment 12 Omer Frenkel 2015-02-17 08:28:21 UTC
RHEV-M 3.5.0 has been released


Note You need to log in before you can comment on or make changes to this bug.