Bug 1143968 - [scale] jsonrpc: restapi: timeout while starting a large number of VMs
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.1
Assignee: Francesco Romani
QA Contact: Pavel Stehlik
URL:
Whiteboard: virt
Depends On:
Blocks: 1148583
 
Reported: 2014-09-18 10:46 UTC by Francesco Romani
Modified: 2016-02-10 19:49 UTC
15 users

Fixed In Version: ovirt-3.5.1_rc1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-01-21 16:03:15 UTC
oVirt Team: Virt


Attachments
scripts and logs of engine and vdsm (313.01 KB, application/octet-stream)
2014-09-18 10:46 UTC, Francesco Romani
simpler script to mass start VMs sequentially (1.67 KB, text/plain)
2014-09-18 10:47 UTC, Francesco Romani
more logs in the same scenario/testcase, but engine's host timezone fixed (1.83 MB, application/octet-stream)
2014-09-18 11:50 UTC, Francesco Romani


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 34813 master MERGED vm: safe(r) Vm.conf update in creation thread Never
oVirt gerrit 34938 ovirt-3.5 MERGED vm: safe(r) Vm.conf update in creation thread Never

Description Francesco Romani 2014-09-18 10:46:55 UTC
Created attachment 938834 [details]
scripts and logs of engine and vdsm

Description of problem:
As an addendum to the oVirt 3.5 test day (http://www.ovirt.org/OVirt_3.5_TestDay) I was running benchmarks against jsonrpc, using the oVirt SDK, which in turn uses the REST API.

One test involves starting 100 VMs. The creation fails with a timeout:

2014-09-18 05:08:05,922 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp--127.0.0.1-8702-2) [4bf1c540] START, CreateVDSCommand(HostName = toji, HostId = 99cdddd1-b34c-45ef-966a-f1459e104284, vmId=5ecc37cf-386f-4363-a9fd-7ed4312c60b4, vm=VM [XS_C000]), log id: 3d0d1f2
2014-09-18 05:08:05,987 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp--127.0.0.1-8702-2) [4bf1c540] org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand kvmEnable=true,keyboardLayout=en-us,nice=0,displayNetwork=ovirtmgmt,copyPasteEnable=true,timeOffset=0,transparentHugePages=true,vmId=5ecc37cf-386f-4363-a9fd-7ed4312c60b4,acpiEnable=true,custom={device_0f11a461-7a40-45d7-a056-19a05c5dfe1edevice_abfbc445-8c08-4c24-b459-749ab7f64411=VmDevice {vmId=5ecc37cf-386f-4363-a9fd-7ed4312c60b4, deviceId=abfbc445-8c08-4c24-b459-749ab7f64411, device=unix, type=CHANNEL, bootOrder=0, specParams={}, address={bus=0, controller=0, type=virtio-serial, port=1}, managed=false, plugged=true, readOnly=false, deviceAlias=channel0, customProperties={}, snapshotId=null}, device_0f11a461-7a40-45d7-a056-19a05c5dfe1edevice_abfbc445-8c08-4c24-b459-749ab7f64411device_8d8ee7e2-cc0e-44ac-8102-cc53717fc740=VmDevice {vmId=5ecc37cf-386f-4363-a9fd-7ed4312c60b4, deviceId=8d8ee7e2-cc0e-44ac-8102-cc53717fc740, device=unix, type=CHANNEL, bootOrder=0, specParams={}, address={bus=0, controller=0, type=virtio-serial, port=2}, managed=false, plugged=true, readOnly=false, deviceAlias=channel1, customProperties={}, snapshotId=null}, device_0f11a461-7a40-45d7-a056-19a05c5dfe1edevice_abfbc445-8c08-4c24-b459-749ab7f64411device_8d8ee7e2-cc0e-44ac-8102-cc53717fc740device_223bdfb0-9c7a-4a06-80af-681c42bc7c64=VmDevice {vmId=5ecc37cf-386f-4363-a9fd-7ed4312c60b4, deviceId=223bdfb0-9c7a-4a06-80af-681c42bc7c64, device=spicevmc, type=CHANNEL, bootOrder=0, specParams={}, address={bus=0, controller=0, type=virtio-serial, port=3}, managed=false, plugged=true, readOnly=false, deviceAlias=channel2, customProperties={}, snapshotId=null}, device_0f11a461-7a40-45d7-a056-19a05c5dfe1e=VmDevice {vmId=5ecc37cf-386f-4363-a9fd-7ed4312c60b4, deviceId=0f11a461-7a40-45d7-a056-19a05c5dfe1e, device=ide, type=CONTROLLER, bootOrder=0, specParams={}, address={slot=0x01, bus=0x00, domain=0x0000, type=pci, 
function=0x1}, managed=false, plugged=true, readOnly=false, deviceAlias=ide0, customProperties={}, snapshotId=null}},spiceSslCipherSuite=DEFAULT,memSize=16,smp=1,emulatedMachine=rhel6.5.0,vmType=kvm,memGuaranteedSize=16,display=qxl,smartcardEnable=false,bootMenuEnable=false,spiceSecureChannels=smain,sinputs,scursor,splayback,srecord,sdisplay,susbredir,ssmartcard,smpCoresPerSocket=1,numaTune={mode=preferred},maxVCpus=16,devices=[{address={bus=0x00, domain=0x0000, type=pci, slot=0x02, function=0x0}, specParams={vram=32768, heads=1}, device=qxl, type=video, deviceId=938a44fb-3af7-4a35-adba-d490bdda8f54}, {shared=false, iface=ide, index=2, address={unit=0, bus=1, target=0, controller=0, type=drive}, specParams={path=}, path=, device=cdrom, type=disk, readonly=true, deviceId=0d35fe8f-98d2-4c4a-8315-51b75fbba576}, {shared=false, index=0, volumeID=851aeb5b-83fd-4135-94e4-5891a8269b9f, propagateErrors=off, format=cow, type=disk, iface=virtio, bootOrder=1, address={bus=0x00, domain=0x0000, type=pci, slot=0x05, function=0x0}, domainID=81d8571c-0871-471f-930b-08685a502d3c, imageID=6a204b39-bebc-48ce-8683-cea8681addc9, specParams={}, optional=false, device=disk, poolID=00000002-0002-0002-0002-000000000064, readonly=false, deviceId=6a204b39-bebc-48ce-8683-cea8681addc9}, {address={bus=0x00, domain=0x0000, type=pci, slot=0x06, function=0x0}, specParams={model=virtio}, device=memballoon, type=balloon, deviceId=f25d3601-3289-4c4e-aa0c-eb2816ca1901}, {index=0, model=virtio-scsi, address={bus=0x00, domain=0x0000, type=pci, slot=0x03, function=0x0}, specParams={}, device=scsi, type=controller, deviceId=9e330005-350e-48fa-b4b6-a6ee18188f4c}, {address={bus=0x00, domain=0x0000, type=pci, slot=0x04, function=0x0}, specParams={}, device=virtio-serial, type=controller, deviceId=29714f6f-4172-4900-aa12-64249611db12}],vmName=XS_C000,cpuType=SandyBridge,fileTransferEnable=true
2014-09-18 05:11:06,446 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp--127.0.0.1-8702-2) [4bf1c540] Command org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand return value
 org.ovirt.engine.core.vdsbroker.vdsbroker.OneVmReturnForXmlRpc@7435de93
2014-09-18 05:11:06,447 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp--127.0.0.1-8702-2) [4bf1c540] HostName = toji
2014-09-18 05:11:06,460 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp--127.0.0.1-8702-2) [4bf1c540] Command CreateVDSCommand(HostName = toji, HostId = 99cdddd1-b34c-45ef-966a-f1459e104284, vmId=5ecc37cf-386f-4363-a9fd-7ed4312c60b4, vm=VM [XS_C000]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues'
2014-09-18 05:11:06,462 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp--127.0.0.1-8702-2) [4bf1c540] FINISH, CreateVDSCommand, log id: 3d0d1f2
2014-09-18 05:11:06,463 ERROR [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (ajp--127.0.0.1-8702-2) [4bf1c540] VDS::create Failed creating vm XS_C000 in vds = 99cdddd1-b34c-45ef-966a-f1459e104284 : toji error = org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues'
2014-09-18 05:11:06,471 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) Connecting to /192.168.1.48
2014-09-18 05:11:06,472 INFO  [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (ajp--127.0.0.1-8702-2) [4bf1c540] FINISH, CreateVmVDSCommand, return: Down, log id: 3220ef08
2014-09-18 05:11:06,473 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (ajp--127.0.0.1-8702-2) [4bf1c540] Lock freed to object EngineLock [exclusiveLocks= key: 5ecc37cf-386f-4363-a9fd-7ed4312c60b4 value: VM

On VDSM, however, creation succeeds (see the attached logs).

If I switch the host to XML-RPC, everything works as expected.

Version-Release number of selected component (if applicable):
# rpm -qa | grep ovirt-engine
ovirt-engine-setup-plugin-ovirt-engine-common-3.5.0-0.0.master.20140911085446.gite1c5ffd.fc19.noarch
ovirt-engine-backend-3.5.0-0.0.master.20140911085446.gite1c5ffd.fc19.noarch
ovirt-engine-websocket-proxy-3.5.0-0.0.master.20140911085446.gite1c5ffd.fc19.noarch
ovirt-engine-webadmin-portal-3.5.0-0.0.master.20140911085446.gite1c5ffd.fc19.noarch
ovirt-engine-lib-3.5.0-0.0.master.20140911085446.gite1c5ffd.fc19.noarch
ovirt-engine-extensions-api-impl-3.5.0-0.0.master.20140911085446.gite1c5ffd.fc19.noarch
ovirt-engine-tools-3.5.0-0.0.master.20140911085446.gite1c5ffd.fc19.noarch
ovirt-engine-setup-base-3.5.0-0.0.master.20140911085446.gite1c5ffd.fc19.noarch
ovirt-engine-setup-plugin-websocket-proxy-3.5.0-0.0.master.20140911085446.gite1c5ffd.fc19.noarch
ovirt-engine-dbscripts-3.5.0-0.0.master.20140911085446.gite1c5ffd.fc19.noarch
ovirt-engine-setup-plugin-ovirt-engine-3.5.0-0.0.master.20140911085446.gite1c5ffd.fc19.noarch
ovirt-engine-restapi-3.5.0-0.0.master.20140911085446.gite1c5ffd.fc19.noarch
ovirt-engine-userportal-3.5.0-0.0.master.20140911085446.gite1c5ffd.fc19.noarch
ovirt-engine-3.5.0-0.0.master.20140911085446.gite1c5ffd.fc19.noarch
ovirt-engine-sdk-python-3.5.0.7-1.fc19.noarch
ovirt-engine-setup-3.5.0-0.0.master.20140911085446.gite1c5ffd.fc19.noarch
ovirt-engine-cli-3.5.0.5-0.1.20140908.git284eb0d.fc19.noarch

TOji> 12:06:53 root [~]$ rpm -qa | grep vdsm
vdsm-yajsonrpc-4.16.4-0.el6.noarch
vdsm-python-4.16.4-0.el6.noarch
vdsm-4.16.4-0.el6.x86_64
vdsm-python-zombiereaper-4.16.4-0.el6.noarch
vdsm-cli-4.16.4-0.el6.noarch
vdsm-jsonrpc-4.16.4-0.el6.noarch
vdsm-debuginfo-4.16.4-0.el6.x86_64
vdsm-xmlrpc-4.16.4-0.el6.noarch

How reproducible:
Initially it worked as expected; now the failure happens constantly in my environment,
although at random points in the startup sequence.
I can't fully rule out a misconfiguration of my setup.

Steps to Reproduce:
1. create a VM with 16 MB ram, 1 cpu, 1 GB HDD
2. make it a template
3. create 100 VMs from it
4. start them sequentially through rest API

See the attached scripts for reference; 'observe.py' could be used as a template.

Actual results:
At random points, the sequence hangs with a timeout error.

Expected results:
All the VMs start correctly.

Additional info:
VDSM runs on RHEL6.5 on bare metal
Engine runs on a F19 virtual machine with bridged networking, through libvirt.

Comment 1 Francesco Romani 2014-09-18 10:47:58 UTC
Created attachment 938835 [details]
simpler script to mass start VMs sequentially

Comment 2 Francesco Romani 2014-09-18 10:50:58 UTC
VDSM logs are on Rome time (UTC+1), while Engine is on NY time (UTC-5?).
I apologize for the inconvenience.

Comment 3 Piotr Kliczewski 2014-09-18 11:40:48 UTC
I checked the logs and here is my observation:

Vdsm replies to VM.create in a few milliseconds, but the engine times out after not receiving the answer for 3 minutes. It seems we have a bottleneck in the engine code. My first guess is ResponseWorker.

Francesco, can you run jprofiler and take snapshot(s) during your test?
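The suspected bottleneck can be illustrated with a toy single-threaded response worker. This is only a sketch of the hypothesized failure mode, not engine code; all class and function names here are hypothetical:

```python
import queue
import threading
import time

class ResponseWorker(object):
    """Toy model of a single response-dispatch thread: if per-message
    handling is slow, replies that vdsm sent in milliseconds pile up in
    the queue and the caller's timeout can fire anyway."""

    def __init__(self, handle_seconds):
        self._queue = queue.Queue()
        self._handle_seconds = handle_seconds
        self.delivered = []
        self._thread = threading.Thread(target=self._run)
        self._thread.daemon = True
        self._thread.start()

    def submit(self, response):
        # Responses arrive "instantly" from the wire.
        self._queue.put(response)

    def _run(self):
        while True:
            response = self._queue.get()
            time.sleep(self._handle_seconds)  # simulate slow handling
            self.delivered.append(response)

def last_delivery_delay(n_responses, handle_seconds):
    """Time until the last response is delivered: roughly
    n_responses * handle_seconds, however fast the replies arrived."""
    worker = ResponseWorker(handle_seconds)
    for i in range(n_responses):
        worker.submit(i)
    start = time.time()
    while len(worker.delivered) < n_responses:
        time.sleep(0.001)
    return time.time() - start
```

The point of the sketch: end-to-end latency grows linearly with queue depth even when each individual vdsm reply is fast, which matches vdsm answering in ~15 ms while the engine still times out.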

Comment 4 Francesco Romani 2014-09-18 11:46:21 UTC
(In reply to Piotr Kliczewski from comment #3)

> Francesco can you run jprofiler and take a snapshot(s) during your test?

Sure thing! Are there instructions somewhere I can follow?
I failed to find relevant references on the wiki.

Comment 5 Francesco Romani 2014-09-18 11:50:54 UTC
Created attachment 938860 [details]
more logs in the same scenario/testcase, but engine's host timezone fixed

fixed engine timezone to match hypervisor one: both on UTC+1 (Europe/Rome)

around 13:09 I switched communication from JSON-rpc to XML-rpc. Worked as expected after that.

Comment 6 Francesco Romani 2014-09-18 11:52:36 UTC
More observations:
- using the REST API, stop and status seem to work as nicely as with XML-RPC
- using webadmin, stop seems to work fine.

More important: the same timeout behaviour also happened randomly when using webadmin in my environment.

Comment 7 Piotr Kliczewski 2014-09-18 11:54:04 UTC
I am not aware of any instructions on the wiki but you can take a look here [1].
I am happy to assist.

[1] https://www.ej-technologies.com/products/jprofiler/overview.html

Comment 8 Barak 2014-09-21 10:35:18 UTC
Just to make sure I understand the timeouts:

- on the client (the one running the API calls), did you get a timeout?
- on the engine, I see the timeout in the logs above.

I wonder at what stage the API call to run a VM returns to the client (not sure this is related ... just making sure I understand the flow).

Juan, I assume the client gets a response after the engine gets the response from vdsm? Or before that?

Comment 9 Barak 2014-09-21 11:05:18 UTC
Francesco,

Just to be on the safe side ....
1 - on what env did you run this test (how many hypervisors)?
2 - what was the actual time it took the vdsm to spawn the VM (synchronous part of execution)?

Comment 10 Juan Hernández 2014-09-22 08:21:34 UTC
The RESTAPI invokes the RunVm command, and then it may or may not wait, depending on what the command returns and what the user requested. If the result of the command contains async tasks and the user requested async behavior, then the RESTAPI will wait for the async tasks to complete, polling them once per second an unlimited number of times.

If the result doesn't return async tasks, or the user didn't request async behavior, then the RESTAPI will return the result to the user immediately after the RunVm command finishes.
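The once-per-second, unbounded polling described above can be sketched roughly as follows. This is a minimal illustration; the function and parameter names are hypothetical stand-ins, not the engine's actual API:

```python
import time

def wait_for_tasks(poll_task_statuses, interval=1.0, max_polls=None):
    """Poll async task statuses until all report 'finished'.

    poll_task_statuses: callable returning a list of status strings
    (a stand-in for the engine's GetTasksStatusesByTasksIDs query).
    max_polls=None mirrors the unlimited polling described above; a
    finite value is an addition here, for testability.
    Returns the number of extra polls that were needed.
    """
    polls = 0
    while True:
        statuses = poll_task_statuses()
        if all(s == 'finished' for s in statuses):
            return polls
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise RuntimeError('tasks did not complete in time')
        time.sleep(interval)
```

With no upper bound on polls, a client blocked in this loop waits exactly as long as the engine-side command takes, which is why an engine-side stall surfaces as a client-side hang.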

Comment 11 Barak 2014-09-22 09:40:56 UTC
(In reply to Juan Hernández from comment #10)
> The RESTAPI invokes the RunVm command, and then it may wait or not depending
> on what the command returns and what the user requested. If the result of
> the command contains async tasks and the user requested async behavior then
> the RESTAPI will wait for the async tasks to complete, polling them once per
> second an unlimited number of times.
> 
> If the result doesn't return async tasks or the user didn't request async
> behavior then the RESTAPI will return the result to the user immediately
> after the RunVm command finishes.

IIUC, in this case does it mean that the engine will communicate with VDSM while the client waits for a response, and will wait until the synchronous part of run VM has executed and returned a response to the engine?

Comment 12 Juan Hernández 2014-09-22 09:48:10 UTC
It means that the RESTAPI will repeatedly call the GetTasksStatusesByTasksIDs query, while the client is waiting for a response. I don't know if that query involves communication to VDSM.

Comment 13 Barak 2014-09-22 11:21:35 UTC
(In reply to Juan Hernández from comment #12)
> It means that the RESTAPI will repeatedly call the
> GetTasksStatusesByTasksIDs query, while the client is waiting for a
> response. I don't know if that query involves communication to VDSM.

Running a VM is not a defined async task ... but it is async by nature.
Omer, can you please answer the question in comment #11?

Comment 14 Omer Frenkel 2014-09-22 12:16:54 UTC
Yes, the client waits during the execute phase, which is the sync part of run VM (not stateless), where the engine calls the create verb in vdsm.

Comment 15 Barak 2014-09-22 12:51:57 UTC
Francesco - please answer the questions in comment #9

Comment 16 Francesco Romani 2014-09-22 15:55:52 UTC
Sorry, I'm on PTO the whole week and I check mail sporadically...

The issue I'm facing should be reproducible with this simple script

https://bugzilla.redhat.com/attachment.cgi?id=938835

Or maybe even with this one

=== cut here ===
#!/usr/bin/python

import sys
import time
from ovirtsdk.xml import params
from ovirtsdk.api import API


api = API(url='http://192.168.1.64:8080/api',
          username='admin@internal',
          password='admin')

if len(sys.argv) != 4:
    sys.stderr.write('usage: %s tmpl start stop\n' % (
                     sys.argv[0]))
    sys.exit(1)

tmpl = sys.argv[1]
start = int(sys.argv[2])
stop = int(sys.argv[3])

for i in range(start, stop):
    name = '%s_C%03i' % (tmpl, i)
    print 'Starting VM: %s' % name
    try:
        api.vms.get(name).start()
        time.sleep(1)
    except Exception as exc:
        print (exc)
=== cut here ===

(both will require minimal editing to set the right API parameters)
and the information I'm adding here

(In reply to Barak from comment #9)
> Francesco,
> 
> Just to be on the safe side ....
> 1 - on what env did you run this test (how many hypervisors) ?

The environment is:

* Hypervisor host: CentOS7 running on an AMD Athlon 5350 (http://www.cpu-world.com/CPUs/Jaguar/AMD-Athlon%205350%20-%20AD5350JAH44HM.html) with 8 GB RAM. Latest VDSM from 4.16.x (RPM from the last test day); the rest of the virt stack is the stock QEMU stack.
* Engine: last 3.5.0 RC (RPM from the last test day) running on a virtual machine with 2 VCPUs and 8 GB RAM. Real HW is a Core i7 3770 with 16 GB RAM. Both guest and real host are up-to-date F19.

Tests were run against one hypervisor at a time (of the two configured).
I had 200 VMs, each with 16 MB RAM, 1 VCPU, and a 1 GB HDD on NFS, no OS installed.
The test was run with the first 100 VMs. Please note that I sometimes saw failures even after starting just the first couple of VMs.

> 2 - what was the actual time it took the vdsm to spawn the VM (synchronous
> part of execution)?

Considering the log I attached here:
https://bugzilla.redhat.com/show_bug.cgi?id=1143968#c5

We can see that I received a timeout after a few calls. I'm attaching one example below, but all of them look the same, returning - if I read the logs correctly - in ~15 ms.

Thread-163::DEBUG::2014-09-18 13:00:25,714::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.create' in bridge with {u'vmParams': {u'acpiEnable': u'true', u'emulatedMachine': u'rhel6.5.0', u'vmId': u'5ecc37cf-386f-4363-a9fd-7ed4312c60b4', u'memGuaranteedSize': 16, u'transparentHugePages': u'true', u'spiceSslCipherSuite': u'DEFAULT', u'cpuType': u'SandyBridge', u'smp': u'1', u'numaTune': {u'mode': u'preferred'}, u'custom': {u'device_23a9f0c4-0b62-4ba6-98be-eeeb5162762ddevice_f53382bb-09ed-40c6-bf8f-9744df943ba1': u'VmDevice {vmId=5ecc37cf-386f-4363-a9fd-7ed4312c60b4, deviceId=f53382bb-09ed-40c6-bf8f-9744df943ba1, device=unix, type=CHANNEL, bootOrder=0, specParams={}, address={bus=0, controller=0, type=virtio-serial, port=1}, managed=false, plugged=true, readOnly=false, deviceAlias=channel0, customProperties={}, snapshotId=null}', u'device_23a9f0c4-0b62-4ba6-98be-eeeb5162762ddevice_f53382bb-09ed-40c6-bf8f-9744df943ba1device_835cd3a8-6002-4590-9d18-9797bdd1afaf': u'VmDevice {vmId=5ecc37cf-386f-4363-a9fd-7ed4312c60b4, deviceId=835cd3a8-6002-4590-9d18-9797bdd1afaf, device=unix, type=CHANNEL, bootOrder=0, specParams={}, address={bus=0, controller=0, type=virtio-serial, port=2}, managed=false, plugged=true, readOnly=false, deviceAlias=channel1, customProperties={}, snapshotId=null}', u'device_23a9f0c4-0b62-4ba6-98be-eeeb5162762ddevice_f53382bb-09ed-40c6-bf8f-9744df943ba1device_835cd3a8-6002-4590-9d18-9797bdd1afafdevice_addb1082-34f7-4c9f-ad55-863e33f26408': u'VmDevice {vmId=5ecc37cf-386f-4363-a9fd-7ed4312c60b4, deviceId=addb1082-34f7-4c9f-ad55-863e33f26408, device=spicevmc, type=CHANNEL, bootOrder=0, specParams={}, address={bus=0, controller=0, type=virtio-serial, port=3}, managed=false, plugged=true, readOnly=false, deviceAlias=channel2, customProperties={}, snapshotId=null}', u'device_23a9f0c4-0b62-4ba6-98be-eeeb5162762d': u'VmDevice {vmId=5ecc37cf-386f-4363-a9fd-7ed4312c60b4, deviceId=23a9f0c4-0b62-4ba6-98be-eeeb5162762d, device=ide, type=CONTROLLER, 
bootOrder=0, specParams={}, address={slot=0x01, bus=0x00, domain=0x0000, type=pci, function=0x1}, managed=false, plugged=true, readOnly=false, deviceAlias=ide0, customProperties={}, snapshotId=null}'}, u'vmType': u'kvm', u'memSize': 16, u'smpCoresPerSocket': u'1', u'vmName': u'XS_C000', u'nice': u'0', u'bootMenuEnable': u'false', u'copyPasteEnable': u'true', u'smartcardEnable': u'false', u'fileTransferEnable': u'true', u'keyboardLayout': u'en-us', u'kvmEnable': u'true', u'displayNetwork': u'ovirtmgmt', u'devices': [{u'device': u'qxl', u'specParams': {u'vram': u'32768', u'heads': u'1'}, u'type': u'video', u'deviceId': u'938a44fb-3af7-4a35-adba-d490bdda8f54', u'address': {u'slot': u'0x02', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}}, {u'index': u'2', u'iface': u'ide', u'address': {u'bus': u'1', u'controller': u'0', u'type': u'drive', u'target': u'0', u'unit': u'0'}, u'specParams': {u'path': u''}, u'readonly': u'true', u'deviceId': u'0d35fe8f-98d2-4c4a-8315-51b75fbba576', u'path': u'', u'device': u'cdrom', u'shared': u'false', u'type': u'disk'}, {u'index': 0, u'iface': u'virtio', u'format': u'cow', u'bootOrder': u'1', u'poolID': u'00000002-0002-0002-0002-000000000064', u'volumeID': u'851aeb5b-83fd-4135-94e4-5891a8269b9f', u'imageID': u'6a204b39-bebc-48ce-8683-cea8681addc9', u'specParams': {}, u'readonly': u'false', u'domainID': u'81d8571c-0871-471f-930b-08685a502d3c', u'optional': u'false', u'deviceId': u'6a204b39-bebc-48ce-8683-cea8681addc9', u'address': {u'slot': u'0x05', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'disk', u'shared': u'false', u'propagateErrors': u'off', u'type': u'disk'}, {u'device': u'memballoon', u'specParams': {u'model': u'virtio'}, u'type': u'balloon', u'deviceId': u'f25d3601-3289-4c4e-aa0c-eb2816ca1901', u'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}}, {u'index': u'0', u'specParams': {}, u'deviceId': 
u'9e330005-350e-48fa-b4b6-a6ee18188f4c', u'address': {u'slot': u'0x03', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'scsi', u'model': u'virtio-scsi', u'type': u'controller'}, {u'device': u'virtio-serial', u'specParams': {}, u'type': u'controller', u'deviceId': u'29714f6f-4172-4900-aa12-64249611db12', u'address': {u'slot': u'0x04', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}}], u'timeOffset': u'0', u'maxVCpus': u'16', u'spiceSecureChannels': u'smain,sinputs,scursor,splayback,srecord,sdisplay,susbredir,ssmartcard', u'display': u'qxl'}, u'vmID': u'5ecc37cf-386f-4363-a9fd-7ed4312c60b4'}
Thread-163::DEBUG::2014-09-18 13:00:25,751::__init__::498::jsonrpc.JsonRpcServer::(_serveRequest) Return 'VM.create' in bridge with {u'acpiEnable': u'true', u'emulatedMachine': u'rhel6.5.0', 'pid': '0', u'memGuaranteedSize': 16, u'transparentHugePages': u'true', 'displaySecurePort': '-1', u'spiceSslCipherSuite': u'DEFAULT', u'cpuType': u'SandyBridge', u'smp': u'1', u'numaTune': {u'mode': u'preferred'}, u'custom': {u'device_23a9f0c4-0b62-4ba6-98be-eeeb5162762ddevice_f53382bb-09ed-40c6-bf8f-9744df943ba1': u'VmDevice {vmId=5ecc37cf-386f-4363-a9fd-7ed4312c60b4, deviceId=f53382bb-09ed-40c6-bf8f-9744df943ba1, device=unix, type=CHANNEL, bootOrder=0, specParams={}, address={bus=0, controller=0, type=virtio-serial, port=1}, managed=false, plugged=true, readOnly=false, deviceAlias=channel0, customProperties={}, snapshotId=null}', u'device_23a9f0c4-0b62-4ba6-98be-eeeb5162762ddevice_f53382bb-09ed-40c6-bf8f-9744df943ba1device_835cd3a8-6002-4590-9d18-9797bdd1afaf': u'VmDevice {vmId=5ecc37cf-386f-4363-a9fd-7ed4312c60b4, deviceId=835cd3a8-6002-4590-9d18-9797bdd1afaf, device=unix, type=CHANNEL, bootOrder=0, specParams={}, address={bus=0, controller=0, type=virtio-serial, port=2}, managed=false, plugged=true, readOnly=false, deviceAlias=channel1, customProperties={}, snapshotId=null}', u'device_23a9f0c4-0b62-4ba6-98be-eeeb5162762ddevice_f53382bb-09ed-40c6-bf8f-9744df943ba1device_835cd3a8-6002-4590-9d18-9797bdd1afafdevice_addb1082-34f7-4c9f-ad55-863e33f26408': u'VmDevice {vmId=5ecc37cf-386f-4363-a9fd-7ed4312c60b4, deviceId=addb1082-34f7-4c9f-ad55-863e33f26408, device=spicevmc, type=CHANNEL, bootOrder=0, specParams={}, address={bus=0, controller=0, type=virtio-serial, port=3}, managed=false, plugged=true, readOnly=false, deviceAlias=channel2, customProperties={}, snapshotId=null}', u'device_23a9f0c4-0b62-4ba6-98be-eeeb5162762d': u'VmDevice {vmId=5ecc37cf-386f-4363-a9fd-7ed4312c60b4, deviceId=23a9f0c4-0b62-4ba6-98be-eeeb5162762d, device=ide, type=CONTROLLER, bootOrder=0, specParams={}, 
address={slot=0x01, bus=0x00, domain=0x0000, type=pci, function=0x1}, managed=false, plugged=true, readOnly=false, deviceAlias=ide0, customProperties={}, snapshotId=null}'}, u'vmType': u'kvm', u'memSize': 16, u'smpCoresPerSocket': u'1', u'vmName': u'XS_C000', u'nice': u'0', 'status': 'WaitForLaunch', u'bootMenuEnable': u'false', u'vmId': u'5ecc37cf-386f-4363-a9fd-7ed4312c60b4', u'copyPasteEnable': u'true', 'displayIp': '192.168.1.48', 'displayPort': '-1', u'smartcardEnable': u'false', u'spiceSecureChannels': u'smain,sinputs,scursor,splayback,srecord,sdisplay,susbredir,ssmartcard', u'fileTransferEnable': u'true', 'nicModel': 'rtl8139,pv', u'keyboardLayout': u'en-us', u'kvmEnable': u'true', u'displayNetwork': u'ovirtmgmt', u'devices': [{u'specParams': {u'vram': u'32768', u'heads': u'1'}, 'deviceType': u'video', u'deviceId': u'938a44fb-3af7-4a35-adba-d490bdda8f54', u'address': {u'slot': u'0x02', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'qxl', u'type': u'video'}, {u'index': u'2', u'iface': u'ide', u'address': {u'bus': u'1', u'controller': u'0', u'type': u'drive', u'target': u'0', u'unit': u'0'}, u'specParams': {u'path': u''}, u'readonly': u'true', 'deviceType': u'disk', u'deviceId': u'0d35fe8f-98d2-4c4a-8315-51b75fbba576', u'path': u'', u'device': u'cdrom', u'shared': u'false', u'type': u'disk'}, {u'index': 0, u'iface': u'virtio', 'reqsize': '0', u'format': u'cow', 'deviceType': u'disk', u'bootOrder': u'1', u'poolID': u'00000002-0002-0002-0002-000000000064', u'volumeID': u'851aeb5b-83fd-4135-94e4-5891a8269b9f', u'imageID': u'6a204b39-bebc-48ce-8683-cea8681addc9', u'specParams': {}, u'readonly': u'false', u'domainID': u'81d8571c-0871-471f-930b-08685a502d3c', u'optional': u'false', u'deviceId': u'6a204b39-bebc-48ce-8683-cea8681addc9', u'address': {u'slot': u'0x05', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'disk', u'shared': u'false', u'propagateErrors': u'off', u'type': 
u'disk'}, {u'specParams': {u'model': u'virtio'}, 'deviceType': u'balloon', u'deviceId': u'f25d3601-3289-4c4e-aa0c-eb2816ca1901', u'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'memballoon', u'type': u'balloon'}, {u'index': u'0', u'specParams': {}, 'deviceType': u'controller', u'deviceId': u'9e330005-350e-48fa-b4b6-a6ee18188f4c', u'address': {u'slot': u'0x03', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'scsi', u'model': u'virtio-scsi', u'type': u'controller'}, {u'specParams': {}, 'deviceType': u'controller', u'deviceId': u'29714f6f-4172-4900-aa12-64249611db12', u'address': {u'slot': u'0x04', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'virtio-serial', u'type': u'controller'}], u'timeOffset': u'0', u'maxVCpus': u'16', 'clientIp': '', u'display': u'qxl'}

Comment 17 Oved Ourfali 2014-10-23 10:48:49 UTC
We're not sure the issue still happens.
Piotr - please try to reproduce, and sync with Francesco. If needed, either provide a fix, or close it.

Comment 18 Francesco Romani 2014-10-23 10:56:25 UTC
Will retry once I'm sure all the relevant patches for the 3.5.x branch are in, both on the Engine and on the VDSM side.

Comment 19 Piotr Kliczewski 2014-10-24 12:20:36 UTC
I tried on my local env and do not see this issue.

Can you please check with the latest master?

Comment 20 Piotr Kliczewski 2014-10-29 15:34:35 UTC
We noticed that the response from VM.create is updated after being returned, which causes a RuntimeError ('dictionary changed size during iteration') during marshalling to JSON.

Comment 21 Francesco Romani 2014-10-29 15:38:25 UTC
turns out that json-rpc (and possibly recent changes needed for live merge) just exposed an old flow in the Vm creation path.

Core issue is that we update Vm.conf in an unsafe manner, without properly locking Vm.conf.

Eventually, this was spotted in the VDSM logs:

    Thread-281::ERROR::2014-10-28 16:31:20,971::__init__::244::root::(sendReply)
    Traceback (most recent call last):
      File "/usr/lib/python2.6/site-packages/yajsonrpc/__init__.py", line 242, in sendReply
        encodedObjects.append(response.encode())
      File "/usr/lib/python2.6/site-packages/yajsonrpc/__init__.py", line 140, in encode
        return json.dumps(res, 'utf-8')
      File "/usr/lib64/python2.6/json/__init__.py", line 237, in dumps
        **kw).encode(obj)
      File "/usr/lib64/python2.6/json/encoder.py", line 367, in encode
        chunks = list(self.iterencode(o))
      File "/usr/lib64/python2.6/json/encoder.py", line 309, in _iterencode
        for chunk in self._iterencode_dict(o, markers):
      File "/usr/lib64/python2.6/json/encoder.py", line 275, in _iterencode_dict
        for chunk in self._iterencode(value, markers):
      File "/usr/lib64/python2.6/json/encoder.py", line 309, in _iterencode
        for chunk in self._iterencode_dict(o, markers):
      File "/usr/lib64/python2.6/json/encoder.py", line 275, in _iterencode_dict
        for chunk in self._iterencode(value, markers):
      File "/usr/lib64/python2.6/json/encoder.py", line 306, in _iterencode
        for chunk in self._iterencode_list(o, markers):
      File "/usr/lib64/python2.6/json/encoder.py", line 204, in _iterencode_list
        for chunk in self._iterencode(value, markers):
      File "/usr/lib64/python2.6/json/encoder.py", line 309, in _iterencode
        for chunk in self._iterencode_dict(o, markers):
      File "/usr/lib64/python2.6/json/encoder.py", line 247, in _iterencode_dict
        for key, value in items:
    RuntimeError: dictionary changed size during iteration


We don't know yet which part of the creation flow is changing Vm.conf.
The nesting of the calls suggests the device handling is indeed the root cause.
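The failure mode in the traceback above, and the general direction of the fix (snapshot under a lock before serializing), can be sketched as follows. The class and method names are hypothetical illustrations, not VDSM code; the actual fix is in the gerrit patches linked in this bug:

```python
import copy
import json
import threading

def encode_while_mutating():
    """Deterministic repro: a dict whose size changes while we iterate
    over it raises RuntimeError - the same error the JSON encoder hit
    in the traceback above, since the encoder iterates the dict too."""
    conf = {'vmId': '5ecc37cf', 'devices': [{'device': 'qxl'}]}
    try:
        for key in conf:
            conf['status'] = 'WaitForLaunch'  # mutation mid-iteration
        return None
    except RuntimeError as exc:
        return str(exc)

class VmConf(object):
    """Sketch of the safe pattern: updates and serialization both take a
    lock, and serialization works on a snapshot, so the creation thread
    can keep updating the live dict without racing the encoder."""

    def __init__(self, conf):
        self._lock = threading.Lock()
        self._conf = conf

    def update(self, **kwargs):
        with self._lock:
            self._conf.update(kwargs)

    def encode(self):
        with self._lock:
            snapshot = copy.deepcopy(self._conf)
        # The encoder now iterates a private copy, immune to later updates.
        return json.dumps(snapshot)
```

Note that when the encoding thread dies with this RuntimeError, the reply is never sent, so the engine sees only silence and reports a generic "Message timeout" - which is consistent with vdsm logging a successful VM.create while the engine times out.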

Comment 22 Francesco Romani 2014-10-30 09:03:25 UTC
(In reply to Francesco Romani from comment #21)
> turns out that json-rpc (and possibly recent changes needed for live merge)
> just exposed an old flow in the Vm creation path.

s/flow/flaw/

Preliminary verification suggests that http://gerrit.ovirt.org/#/c/34599 solves the issue - or at the very least makes it much, much less frequent. However, 34599 is not really suitable for merging (yet), as it moves potentially expensive storage calls into the sync part.

Comment 23 Omer Frenkel 2014-10-30 10:02:12 UTC
Is bug 1154389 a duplicate of this one?

Comment 24 Francesco Romani 2014-10-30 15:44:06 UTC
(In reply to Omer Frenkel from comment #23)
> is bug 1154389 duplicate of this one?

Seems related, but I'm not sure it's the same issue. Needs separate investigation.

Comment 25 Carlos Mestre González 2014-12-10 10:50:30 UTC
Just to update this: I've seen this issue in one of our automated storage tests creating only 5 VMs in vt12; it hasn't run in vt13.

Comment 26 Sandro Bonazzola 2015-01-15 14:25:45 UTC
This is an automated message: 
This bug should be fixed in oVirt 3.5.1 RC1; moving to QA.

Comment 27 Sandro Bonazzola 2015-01-21 16:03:15 UTC
oVirt 3.5.1 has been released. If problems still persist, please make note of it in this bug report.

