Bug 1102701 - [Scale] - there is no correlation of the VM status between the engine and VDSM
Summary: [Scale] - there is no correlation of the VM status between the engine and VDSM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Nir Soffer
QA Contact: Yuri Obshansky
URL:
Whiteboard: storage
Depends On: 1083771
Blocks: rhev3.5beta 1156165
 
Reported: 2014-05-29 12:39 UTC by Eldad Marciano
Modified: 2016-02-10 20:24 UTC
CC List: 12 users

Fixed In Version: vt1.3, 4.16.0-1.el6_5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-16 13:41:11 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
vdsm.log (2.23 MB, application/zip), 2014-06-02 15:32 UTC, Eldad Marciano

Description Eldad Marciano 2014-05-29 12:39:54 UTC
Description of problem:

On a scale deployment, there is no correlation of the VM status between the engine and VDSM.

The engine reports the VM as started ("VM test_0093 was started by admin@internal (Host: host01-rack06.scale.openstack.engineering.redhat.com)"):

2014-05-29 11:50:08,357 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (pool-4-thread-4) [551fce46] START, CreateVDSCommand(HostName = host01-rack06.scale.openstack.engineering.redhat.com, HostId = c9f64077-87fe-45b5-80d9-1f133d38ce7b, vmId=a284c1f1-36f9-4e26-94bc-f0b464b75053, vm=VM [test_0093]), log id: d4befe7
2014-05-29 11:50:12,463 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (pool-4-thread-4) [551fce46] org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand spiceSslCipherSuite=DEFAULT,memSize=256,kvmEnable=true,smp=1,vmType=kvm,emulatedMachine=rhel6.5.0,keyboardLayout=en-us,memGuaranteedSize=1024,pitReinjection=false,nice=0,display=qxl,smartcardEnable=false,smpCoresPerSocket=1,spiceSecureChannels=smain,sinputs,scursor,splayback,srecord,sdisplay,susbredir,ssmartcard,timeOffset=0,transparentHugePages=true,vmId=a284c1f1-36f9-4e26-94bc-f0b464b75053,devices=[Ljava.util.HashMap;@258e5204,acpiEnable=true,vmName=test_0093,cpuType=SandyBridge,custom={device_164795dd-1798-42bc-96b8-54b1d1e9e8dcdevice_78974fc0-0bbe-4e6b-a483-f4ac2271daa9device_7b881cce-b72d-4d3c-a1df-61dab04ebe99device_7fb377d4-0a53-43c7-9a4a-f794d5c10a9adevice_8007dc3d-30ee-4391-b544-27e8992f3d77=VmDevice {vmId=a284c1f1-36f9-4e26-94bc-f0b464b75053, deviceId=8007dc3d-30ee-4391-b544-27e8992f3d77, device=virtio-serial, type=CONTROLLER, bootOrder=0, specParams={}, address={bus=0x00, domain=0x0000, type=pci, slot=0x05, function=0x0}, managed=false, plugged=true, readOnly=false, deviceAlias=virtio-serial0, customProperties={}, snapshotId=null}, device_164795dd-1798-42bc-96b8-54b1d1e9e8dc=VmDevice {vmId=a284c1f1-36f9-4e26-94bc-f0b464b75053, deviceId=164795dd-1798-42bc-96b8-54b1d1e9e8dc, device=spicevmc, type=CHANNEL, bootOrder=0, specParams={}, address={port=3, bus=0, controller=0, type=virtio-serial}, managed=false, plugged=true, readOnly=false, deviceAlias=channel2, customProperties={}, snapshotId=null}, device_164795dd-1798-42bc-96b8-54b1d1e9e8dcdevice_78974fc0-0bbe-4e6b-a483-f4ac2271daa9device_7b881cce-b72d-4d3c-a1df-61dab04ebe99=VmDevice {vmId=a284c1f1-36f9-4e26-94bc-f0b464b75053, deviceId=7b881cce-b72d-4d3c-a1df-61dab04ebe99, device=ide, type=CONTROLLER, bootOrder=0, specParams={}, address={bus=0x00, domain=0x0000, type=pci, slot=0x01, function=0x1}, managed=false, plugged=true, readOnly=false, deviceAlias=ide0, customProperties={}, snapshotId=null}, device_164795dd-1798-42bc-96b8-54b1d1e9e8dcdevice_78974fc0-0bbe-4e6b-a483-f4ac2271daa9device_7b881cce-b72d-4d3c-a1df-61dab04ebe99device_7fb377d4-0a53-43c7-9a4a-f794d5c10a9a=VmDevice {vmId=a284c1f1-36f9-4e26-94bc-f0b464b75053, deviceId=7fb377d4-0a53-43c7-9a4a-f794d5c10a9a, device=unix, type=CHANNEL, bootOrder=0, specParams={}, address={port=2, bus=0, controller=0, type=virtio-serial}, managed=false, plugged=true, readOnly=false, deviceAlias=channel1, customProperties={}, snapshotId=null}, device_164795dd-1798-42bc-96b8-54b1d1e9e8dcdevice_78974fc0-0bbe-4e6b-a483-f4ac2271daa9=VmDevice {vmId=a284c1f1-36f9-4e26-94bc-f0b464b75053, deviceId=78974fc0-0bbe-4e6b-a483-f4ac2271daa9, device=unix, type=CHANNEL, bootOrder=0, specParams={}, address={port=1, bus=0, controller=0, type=virtio-serial}, managed=false, plugged=true, readOnly=false, deviceAlias=channel0, customProperties={}, snapshotId=null}}
2014-05-29 11:50:12,464 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (pool-4-thread-4) [551fce46] FINISH, CreateVDSCommand, log id: d4befe7
2014-05-29 11:50:12,472 INFO  [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (pool-4-thread-4) [551fce46] FINISH, CreateVmVDSCommand, return: WaitForLaunch, log id: 3e1c4f87
2014-05-29 11:50:12,472 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-4) [551fce46] Lock freed to object EngineLock [exclusiveLocks= key: a284c1f1-36f9-4e26-94bc-f0b464b75053 value: VM
, sharedLocks= ]
2014-05-29 11:50:12,480 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-4-thread-4) [551fce46] Correlation ID: 551fce46, Job ID: efde582c-bae3-443c-aa35-44197541d219, Call Stack: null, Custom Event ID: -1, Message: VM test_0093 was started by admin@internal (Host: host01-rack06.scale.openstack.engineering.redhat.com).



On the VDSM side:
there is no KVM process with the matching VM name (ps -ef | grep -i 'test_0093').
vdsm.log only reports the XML-RPC call for that VM.


I have also checked the UI, and it displays the VM as 'Waiting to launch' for over 33 minutes.
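
A minimal sketch of that process check (illustration only, not part of the original report; assumes a Linux host with Python 3):

# Hypothetical helper: the scripted equivalent of `ps -ef | grep -i 'test_0093'`
# used above to verify that no qemu-kvm process exists for the VM on the host.
import subprocess

def vm_process_exists(vm_name):
    # Return True if any process command line contains the VM name.
    out = subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout
    return any(vm_name.lower() in line.lower() for line in out.splitlines())

print(vm_process_exists("test_0093"))  # False while the VM is stuck in 'Waiting to launch'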

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Create 100 VMs and start them.

Actual results:
After ~50 VMs are running, the remaining VMs do not start and stay in 'Waiting to launch' for over 33 minutes.

Expected results:
All VMs should start within the required time.

Additional info:
The host also displayed the VM as running (as we can see in the engine log, the engine thinks the VM is active).
Refer to bug https://bugzilla.redhat.com/show_bug.cgi?id=1102664.

Comment 1 Martin Sivák 2014-05-29 13:12:41 UTC
WaitForLaunch means that VDSM was asked to start the VM, but no report about the VM has come back yet.

So it is expected that the numbers do not agree.
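
A minimal sketch of the state handling described above (hypothetical names, not actual ovirt-engine code): the engine keeps the VM in WaitForLaunch from the moment CreateVDSCommand is sent until the first status report for that VM comes back from VDSM.

# Hypothetical model of the engine-side VM view described in comment 1;
# the real logic lives in ovirt-engine, this is only an illustration.
class EngineVmView:
    def __init__(self, vm_id):
        self.vm_id = vm_id
        self.status = "Down"

    def create_requested(self):
        # CreateVDSCommand was sent to VDSM; no report has come back yet.
        self.status = "WaitForLaunch"

    def on_vdsm_report(self, vdsm_status):
        # The first status report from VDSM replaces the provisional state.
        self.status = vdsm_status

view = EngineVmView("a284c1f1-36f9-4e26-94bc-f0b464b75053")
view.create_requested()
assert view.status == "WaitForLaunch"   # engine and host can disagree here
view.on_vdsm_report("Up")
assert view.status == "Up"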

Comment 2 Martin Sivák 2014-05-29 15:22:46 UTC
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3085, in getVolumeSize
    apparentsize = str(volClass.getVSize(dom, imgUUID, volUUID, bs=1))
  File "/usr/share/vdsm/storage/fileVolume.py", line 418, in getVSize
    return int(sdobj.oop.os.stat(volPath).st_size / bs)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 312, in callCrabRPCFunction
    raise Exception("No free file handlers in pool")
Exception: No free file handlers in pool

I very much suspect this (repeated) exception to be the reason for the VMs to stay in WaitForLaunch for such a long time.

Comment 3 Martin Sivák 2014-05-29 15:41:25 UTC
I'm CCing storage people, as this is outside the scope of the SLA team.

Comment 4 Allon Mureinik 2014-05-30 11:40:58 UTC
Fede/Nir, weren't we working on removing this?

Comment 5 Eldad Marciano 2014-06-02 15:30:04 UTC
100% reproduced while moving a host with 100+ VMs to maintenance.
Still under investigation.
Adding logs and findings ASAP.

Comment 6 Eldad Marciano 2014-06-02 15:32:37 UTC
Created attachment 901478 [details]
vdsm.log

grep for 'No free file handlers in pool'

Comment 7 Nir Soffer 2014-06-09 11:11:46 UTC
(In reply to Allon Mureinik from comment #4)
> Fede/Nir, weren't we working on removing this?

I was not working on removing this.

Comment 9 Nir Soffer 2014-06-11 10:17:59 UTC
This is a limit of the current implementation: there are up to 10 file handlers per domain, and if you try to run too many operations concurrently you will get this failure.

Will be fixed by bug 1083771.
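
To illustrate the limit described above, a minimal sketch (hypothetical; the real implementation is /usr/share/vdsm/storage/remoteFileHandler.py) of a fixed-size handler pool that fails once every handler is busy:

# Hypothetical bounded pool: with at most 10 handlers per domain, a request
# arriving while all handlers are in use fails with the error seen in the
# traceback in comment 2.
import threading

HANDLERS_PER_DOMAIN = 10

class HandlerPool:
    def __init__(self, size=HANDLERS_PER_DOMAIN):
        self._free = size
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            if self._free == 0:
                raise Exception("No free file handlers in pool")
            self._free -= 1

    def release(self):
        with self._lock:
            self._free += 1

pool = HandlerPool()
for _ in range(HANDLERS_PER_DOMAIN):
    pool.acquire()              # 10 concurrent operations exhaust the pool
try:
    pool.acquire()              # the 11th request fails immediately
except Exception as exc:
    print(exc)                  # -> No free file handlers in pool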

Comment 12 Yuri Obshansky 2014-12-23 15:24:38 UTC
Bug verified on:
- RHEV-M 3.5.0-0.22.el6ev
- RHEL - 6Server - 6.6.0.2.el6
- libvirt-0.10.2-46.el6_6.1
- vdsm-4.16.7.6-1.el6ev
The issue did not reproduce.
Moving to VERIFIED.

