Bug 1251899 - VM stuck in wait for launch after update vm IO Threads value
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assigned To: Tomas Jelinek
QA Contact: Israel Pinto
Depends On:
Blocks:
 
Reported: 2015-08-10 05:14 EDT by Israel Pinto
Modified: 2016-03-09 14:43 EST
CC List: 14 users

See Also:
Fixed In Version: 3.6.0-11
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-09 14:43:44 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
mgoldboi: Triaged+


Attachments
logs (877.07 KB, application/x-bzip)
2015-08-10 05:15 EDT, Israel Pinto
no flags


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 44631 master MERGED core: check the num of IO Threads Never
oVirt gerrit 44676 master MERGED frontend: check the num of IO Threads Never
oVirt gerrit 45088 ovirt-engine-3.6 MERGED core: check the num of IO Threads Never
oVirt gerrit 45089 ovirt-engine-3.6 MERGED frontend: check the num of IO Threads Never

Description Israel Pinto 2015-08-10 05:14:21 EDT
Description of problem:
VM is stuck in 'Wait for launch'.
On one host the vdsm log shows the error:
'Error fetching vm stats'


Version-Release number of selected component (if applicable):
3.6.0-5
3.6.0-0.0.master.20150804111407.git122a3a0.el6
VDSM: vdsm-4.17.0-1239.git6575e3f.el7.noarch

How reproducible:
Not all the time

Steps to Reproduce:
1. Create a new VM and start it
2. Stop the VM, edit the VM: change the IO thread number to 12345
3. Start the VM, wait for ~15 min and shut down the VM
4. Start the VM again, and see the error on the engine and on the second host (10.35.161.158):
'The vm start process failed'.
5. Kill the VM process on the host (10.35.161.158); the process output (below) contains a lot of IO threads
6. Start the VM again; the VM is up.

Actual results:
VM is stuck in 'Wait for launch' and the VM status doesn't change.

Expected results:
VM up

Additional info:
From host 10.35.161.159 vdsm log:
JsonRpc (StompReactor)::DEBUG::2015-08-10 11:13:14,670::stompreactor::236::Broker.StompAdapter::(handle_frame) Handling message <StompFrame command=u'SEND'>
JsonRpcServer::DEBUG::2015-08-10 11:13:14,671::__init__::540::jsonrpc.JsonRpcServer::(serve_requests) Waiting for request
Thread-13754::ERROR::2015-08-10 11:13:14,672::vm::1416::virt.vm::(_getRunningVmStats) vmId=`b88b433c-7eeb-4236-83f9-1997c9f9d0eb`::Error fetching vm stats
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 1413, in _getRunningVmStats
    vm_sample.interval)
  File "/usr/share/vdsm/virt/vmstats.py", line 42, in produce
    networks(vm, stats, first_sample, last_sample, interval)
  File "/usr/share/vdsm/virt/vmstats.py", line 217, in networks
    if nic.name.startswith('hostdev'):
AttributeError: name
Thread-13754::DEBUG::2015-08-10 11:13:14,673::stompreactor::305::yajsonrpc.StompServer::(send) Sending response


From host 10.35.161.158:
Thread-32703::ERROR::2015-08-10 11:37:01,763::vm::786::virt.vm::(_startUnderlyingVm) vmId=`b88b433c-7eeb-4236-83f9-1997c9f9d0eb`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 731, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 1896, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3427, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: Requested operation is not valid: domain 'vm_with_ui' is already active
Thread-32703::INFO::2015-08-10 11:37:01,768::vm::1305::virt.vm::(setDownStatus) vmId=`b88b433c-7eeb-4236-83f9-1997c9f9d0eb`::Changed state to Down: Requested operation is not valid: domain 'vm_with_ui' is already active (code=1)


VM process output: ps -ef | grep qemu
qemu      5155     1 96 11:18 ?        00:22:45 /usr/libexec/qemu-kvm -name vm_with_ui -S -machine pc-i440fx-rhel7.1.0,accel=kvm,usb=off -cpu Conroe -m 4096 -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -object iothread,id=iothread1 -object iothread,id=iothread2 -object iothread,id=iothread3 -object iothread,id=iothread4 -object iothread,id=iothread5 -object iothread,id=iothread6 -object iothread,id=iothread7 -object iothread,id=iothread8 -object iothread,id=iothread9 -object iothread,id=iothread10 -object iothread,id=iothread11 -object iothread,id=iothread12 -object iothread,id=iothread13 -object iothread,id=iothread14 -object iothread,id=iothread15 -object iothread,id=iothread16 -object iothread,id=iothread17 -object iothread,id=iothread18 -object iothread,id=iothread19 -object iothread,id=iothread20 -object iothread,id=iothread21 -object iothread,id=iothread22 -object iothread,id=iothread23 -object iothread,id=iothread24 -object iothread,id=iothread25 -object iothread,id=iothread26 -object iothread,id=iothread27 -object iothread,id=iothread28 -object iothread,id=iothread29 -object iothread,id=iothread30 -object iothread,id=iothread31 -object iothread,id=iothread32 -object iothread,id=iothread33 -object iothread,id=iothread34 -object iothread,id=iothread35 -object iothread,id=iothread36 -object iothread,id=iothread37 -object iothread,id=iothread38 -object iothread,id=iothread39 -object iothread,id=iothread40 -object iothread,id=iothread41 -object iothread,id=iothread42 -object iothread,id=iothread43 -object iothread,id=iothread44 -object iothread,id=iothread45 -object iothread,id=iothread46 -object iothread,id=iothread47 -object iothread,id=iothread48 -object iothread,id=iothread49 -object iothread,id=iothread50 -object iothread,id=iothread51 -object iothread,id=iothread52 -object iothread,id=iothread53 -object iothread,id=iothread54 -object iothread,id=iothread55 -object iothread,id=iothread56 -object iothread,id=iothread57 -object iothread,id=iothread58 -object iothread,id=iothread59 -object iothread,id=iothread60 -object iothread,id=iothread61 -object iothread,id=iothread62 -object iothread,id=iothread63 -object iothread,id=iothread64 -object iothread,id=iothread65 -object iothread,id=iothread66 -object iothread,id=iothread67 -object iothread,id=iothread68 -object iothread,id=iothread69 -object iothread,id=iothread70 -object iothread,id=iothread71 -object iothread,id=iothread72 -object iothread,id=iothread73 -object iothread,id=iothread74 -object iothread,id=iothread75 -object iothread,id=iothread76 -object iothread,id=iothread77 -object iothread,id=iothread78 -object iothread,id=iothread79 -object iothread,id=iothread80 -object iothread,id=iothread81 -object iothread,id=iothread82 -object iothread,id=iothread83 -object iothread,id=iothread84 -object iothread,id=iothread85 -object iothread,id=iothread86 -object iothread,id=iothread87 -object iothread,id=iothread88 -object iothread,id=iothread89 -object iothread,id=iothread90 -object iothread,id=iothread91 -object iothread,id=iothread92 -object iothread,id=iothread93 -object iothread,id=iothread94 -object iothread,id=iothread95 -object iothread,id=iothread96 -object iothread,id=iothread97 -object iothread,id=iothread98 -object iothread,id=iothread99 -object iothread,id=iothread100 -object iothread,id=iothread101 -object iothread,id=iothread102 -object iothread,id=iothread103 -object iothread,id=iothread104 -object iothread,id=iothread105 -object iothread,id=iothread106 -object iothread,id=iothread107 -object 
iothread,id=iothread108 -object iothread,id=iothread109 -object iothread,id=iothread110 -object iothread,id=iothread111 -object iothread,id=iothread112 -object iothread,id=iothread113 -object iothread,id=iothread114 -object iothread,id=iothread115 -object iothread,id=iothread116 -object iothread,id=iothread117 -object iothread,id=iothread118 -object iothread,id=iothread119 -object iothread,id=iothread120 -object iothread,id=iothread121 -object iothread,id=iothread122 -object iothread,id=iothread123 -object iothread,id=iothread124 -object iothread,id=iothread125 -object iothread,id=iothread

Attached hosts and engine logs.
Comment 1 Israel Pinto 2015-08-10 05:15:38 EDT
Created attachment 1060987 [details]
logs
Comment 2 Tomas Jelinek 2015-08-10 07:35:22 EDT
@Francesco I would say the root cause can be this:

  File "/usr/share/vdsm/virt/vm.py", line 1413, in _getRunningVmStats
    vm_sample.interval)
  File "/usr/share/vdsm/virt/vmstats.py", line 42, in produce
    networks(vm, stats, first_sample, last_sample, interval)
  File "/usr/share/vdsm/virt/vmstats.py", line 217, in networks
    if nic.name.startswith('hostdev'):

Do you think it could explain the symptoms in this bug?
Comment 3 Francesco Romani 2015-08-10 07:55:03 EDT
(In reply to Tomas Jelinek from comment #2)
> @Francesco I would say the root cause can be this:
> 
>   File "/usr/share/vdsm/virt/vm.py", line 1413, in _getRunningVmStats
>     vm_sample.interval)
>   File "/usr/share/vdsm/virt/vmstats.py", line 42, in produce
>     networks(vm, stats, first_sample, last_sample, interval)
>   File "/usr/share/vdsm/virt/vmstats.py", line 217, in networks
>     if nic.name.startswith('hostdev'):
> 
> Do you think it could explain the symptoms in this bug?

This surely causes events to Engine to be missed, and a lack of statistics reported later on.

But a call to getAllVmStats should still report the status correctly as Up.

Israel, can you please provide the output of 'vdsClient -s 0 getAllVmStats', run on the hypervisor host, when you reproduce the issue?

As a side note, I believe that "12345" iothreads is WAY too high to be meaningful. I see many periodic threads being replaced; this is a symptom of a sluggish or stuck libvirt, which can in turn be caused by a sluggish or stuck QEMU.

Shouldn't we cap the number of iothreads?
Comment 4 Francesco Romani 2015-08-10 08:21:17 EDT
(In reply to Francesco Romani from comment #3)
> (In reply to Tomas Jelinek from comment #2)
> > @Francesco I would say the root cause can be this:
> > 
> >   File "/usr/share/vdsm/virt/vm.py", line 1413, in _getRunningVmStats
> >     vm_sample.interval)
> >   File "/usr/share/vdsm/virt/vmstats.py", line 42, in produce
> >     networks(vm, stats, first_sample, last_sample, interval)
> >   File "/usr/share/vdsm/virt/vmstats.py", line 217, in networks
> >     if nic.name.startswith('hostdev'):
> > 
> > Do you think it could explain the symptoms in this bug?
> 
> This surely causes missing events to be sent to Engine, and lack of
> statistics reported later on.
> 
> But a call of getAllVmStats should report status correctly Up.
> 
> Isreal, can you please provide the output of the
> 
> 'vdsClient -s 0 getAllVmStats', run from the hypervisor host, when you
> reproduce the issue?
> 
> As a side note, I believe that "12345" iothreads is WAY too high to be
> meaningful. I see many periodic thread being replaced, this is a symptom of
> sluggish or stuck libvirt, which can in turn be caused by sluggish or stuck
> QEMU.
> 
> Should'nt we cap the amount of iothreads?

Closer inspection of the logs suggests that libvirt/qemu just takes a very long time to boot when configured with "12345" iothreads. This can explain why libvirt is not responding and why the network device name is missing: VDSM has not yet heard back from libvirt, so the VM is still half baked and in an inconsistent state.

The main (and only) bug here is that there is no cap on iothreads. I can't think of any scenario in which such a high amount is desirable. I think we should cap it either in VDSM or, better, in Engine.

Or we can file a scalability bug against QEMU, if this value indeed makes sense.
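
For reference, a minimal sketch of the kind of guard that would let the stats path tolerate this half-baked state; this is not the merged fix (which caps the IO thread count in the engine), and the helper/attribute names (getNicDevices, nic.name) are only approximations of the vdsm code quoted in the traceback above:

def networks(vm, stats, first_sample, last_sample, interval):
    # Sketch only: skip NICs that are not fully defined yet (VDSM has not
    # heard back from libvirt), instead of letting nic.name raise the
    # AttributeError seen in the traceback above.
    for nic in vm.getNicDevices():
        if getattr(nic, 'name', None) is None:
            continue
        if nic.name.startswith('hostdev'):
            continue
        # ... collect per-NIC statistics as in the original code ...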
Comment 5 Tomas Jelinek 2015-08-10 09:06:20 EDT
ok, so after a discussion with Francesco we have agreed that:
- we should make an engine-config param to set the max number of IO Threads per VM
- this value will then be used by WA/UP and the backend to validate the input
- the default max value will be 4

@Michal - do you agree?
Comment 6 Michal Skrivanek 2015-08-10 09:26:47 EDT
let's add a simple threshold to filter out nonsense values. 0-127 should be enough for everyone and shouldn't affect qemu/host too much
Comment 7 Tomas Jelinek 2015-08-20 04:39:25 EDT
The fix was just to add a check that the number of iothreads cannot be set too high.
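
For illustration only: the merged patches implement this check in ovirt-engine (see the gerrit trackers above), so the following is just a minimal Python sketch of the validation idea, using the 0-127 threshold suggested in comment 6 and a hypothetical constant name:

MAX_IO_THREADS = 127  # hypothetical cap, per the threshold suggested in comment 6

def validate_io_threads(num_io_threads):
    # Reject nonsense values up front, so a VM can never be started with
    # e.g. 12345 iothread objects on the qemu command line.
    if not (0 <= num_io_threads <= MAX_IO_THREADS):
        raise ValueError("IO threads must be between 0 and %d, got %d"
                         % (MAX_IO_THREADS, num_io_threads))

With such a check, validate_io_threads(12345) fails the update instead of the request ever reaching VDSM.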
Comment 8 Israel Pinto 2015-09-08 07:17:11 EDT
Verified with:
Red Hat Enterprise Virtualization Manager Version: 3.6.0-0.13.master.el6
VDSM: vdsm-4.17.5-1.el7ev

Update IO threads to 12 - Pass
Update IO threads to 127 - Pass
Update IO threads to 12345 (update was not allowed) - Pass
Comment 11 errata-xmlrpc 2016-03-09 14:43:44 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html
