Created attachment 943187 [details]
engine, vdsm and qemu logs

Description of problem:
Start VM operation is reported as failed although the VM was started successfully. It took 3 minutes for the VM to actually start on the host.

Version-Release number of selected component (if applicable):
rhev3.5 vt4
Host: RHEL6.5
vdsm-4.16.5-2.el6ev.x86_64
libvirt-0.10.2-29.el6_5.12.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.14.x86_64
rhevm-3.5.0-0.13.beta.el6ev.noarch

How reproducible:
Not always

Steps to Reproduce:
1. On a local DC, with 1 host
2. Create a VM and start it.

Actual results:
The create command is called:

2014-10-01 21:48:52,463 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand]

VDSM gets the request:

Thread-11759::DEBUG::2014-10-01 21:48:54,789::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.create'

The qemu process starts on the host. From the qemu log:

2014-10-01 18:48:54.888+0000: starting up

After 3 minutes, the operation is reported as failed by the engine:

2014-10-01 21:51:52,700 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (org.ovirt.thread.pool-7-thread-37) [388a73f2] HostName = green-vdsc
2014-10-01 21:51:52,700 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (org.ovirt.thread.pool-7-thread-37) [388a73f2] Command CreateVDSCommand(HostName = green-vdsc, HostId = c14677f4-ffb6-4f08-b3b2-112f0c60e918, vmId=2cd9533d-c322-429e-acad-9d59fe34f76f, vm=VM [lo1]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues'
2014-10-01 21:51:52,700 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (org.ovirt.thread.pool-7-thread-37) [388a73f2] FINISH, CreateVDSCommand, log id: 530b7000
2014-10-01 21:51:52,700 ERROR [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (org.ovirt.thread.pool-7-thread-37) [388a73f2] VDS::create Failed creating vm lo1 in vds = c14677f4-ffb6-4f08-b3b2-112f0c60e918 : green-vdsc error = org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues'

Expected results:
The engine shouldn't report the operation as failed if it succeeded.

Additional info:
engine, vdsm and qemu logs
CORRECTION:
Instead of: "Took 3 minutes for the VM to actually start on the host"
It should be: "The VM was started immediately on the host"
Martin, can you please take a look? I think it happens because of the following errors in VDSM:

Thread-11762::DEBUG::2014-10-01 21:48:55,465::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 74 edom: 10 level: 2 message: argument unsupported: QEMU driver does not support <metadata> element
Thread-11762::ERROR::2014-10-01 21:48:55,466::vm::373::vm.Vm::(_sampleCpuTune) vmId=`2cd9533d-c322-429e-acad-9d59fe34f76f`::libvirt does not support metadata
No, it isn't; this is just an annoying message that will go away in CentOS 6.6. We handle it gracefully (and this comes from the stats thread, btw).
Seems like you lost the response from VDSM; the message is lost. How come there is no connection error, at least on the VDSM side? It behaves like UDP… is this proper json-rpc behavior?
The communication is based on TCP. Looking at the logs I can see weird behavior on the engine side: the engine did nothing for 3 minutes.

Can you please provide logs with engine debug enabled? Please set debug for the org.ovirt logger.
(In reply to Piotr Kliczewski from comment #5)
> The communication is based on TCP. Looking at the logs I can see that there
> is weird behavior in the engine side. The engine did nothing for 3 mins.
>
> Can you please provide logs with engine debug enabled. Please set debug for
> org.ovirt logger.

I'm unable to reproduce the issue, therefore I can't provide the requested information.
From the provided information I am not able to understand what the issue is. Are you ok with closing this bug and reopening it if we have more information?
Makes sense. Please reopen if it happens again. Check the engine's behavior: maybe a shortage of disk space, maybe some other stress condition, maybe the network is flaky...
Created attachment 949941 [details]
logs-23.10

We've encountered the bug again on another setup. Uploading the logs. I don't have an upstream engine build from sources, so I can't check it in debug mode. In any case, this is certainly a serious bug which has to be fixed.

2014-10-23 17:18:10,551 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp-/127.0.0.1:8702-11) [3d1695e6] Command CreateVDSCommand(HostName = camel-vdsb, HostId = 0707087c-d7b0-4b12-8f73-fc8b83cd42e8, vmId=32e8a529-6e21-4b73-a803-e21dfca56cfd, vm=VM [vm_snap]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
I can see that it is failing, but I am not really sure what caused the failure. Today we merged the jsonrpc version update [1]. Thanks to this change we have a number of fixes and improvements. Now, when it fails, we will see what caused the failure thanks to [2]. Can you please test with the latest 3.5?

[1] http://gerrit.ovirt.org/#/c/34433/
[2] http://gerrit.ovirt.org/#/c/33888/
*** Bug 1156049 has been marked as a duplicate of this bug. ***
*** Bug 1154393 has been marked as a duplicate of this bug. ***
Moving to infra, as this seems like a general communication/infra issue. Please move back to virt if you think otherwise.
We noticed that the response from vm.create is updated after it is returned, which causes a RuntimeError ('dictionary changed size during iteration') while marshaling it to JSON.
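For context, the error above is Python's standard complaint when a dict is mutated while something is iterating over it, which is exactly what happens when another thread updates the response dict while the JSON encoder walks it. A minimal sketch (hypothetical names, not VDSM code; the mutation is done inline to make the race deterministic):

```python
# The vm.create response dict, as a stand-in (hypothetical values).
response = {"status": "ok", "vmId": "2cd9533d"}

try:
    for key in response:          # what the JSON encoder effectively does
        response["late"] = True   # what the concurrent update does
except RuntimeError as e:
    print(e)  # dictionary changed size during iteration
```

In a real run the update comes from another thread, so the failure is intermittent, which matches the "not always" reproducibility reported here.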
I think the correct fix is: http://gerrit.ovirt.org/34147
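For reference, the general defensive pattern for this class of bug (a sketch of the idea only; see the linked patch for the actual fix) is to take a snapshot of the response before handing it to the encoder, so later mutations cannot race with marshaling:

```python
import copy
import json
import threading

_lock = threading.Lock()

def marshal_response(response):
    # Hypothetical helper, not the code from the linked patch:
    # deep-copy the dict under a lock, then encode the private
    # snapshot. Threads that keep mutating `response` afterwards
    # can no longer trip up the JSON encoder.
    with _lock:
        snapshot = copy.deepcopy(response)
    return json.dumps(snapshot)
```

The lock only needs to be shared with whatever code mutates the response; the deep copy alone already removes the iteration hazard.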
*** Bug 1163073 has been marked as a duplicate of this bug. ***
Patches merged on the relevant branches.
Verified many times using the following builds:
rhevm-3.5.0-0.21.el6ev.noarch
qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64
libvirt-0.10.2-46.el6_6.2.x86_64
sanlock-2.8-1.el6.x86_64
vdsm-4.16.8.1-2.el6ev.x86_64