Bug 1251899 - VM stuck in wait for launch after update vm IO Threads value
Summary: VM stuck in wait for launch after update vm IO Threads value
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
low
high
Target Milestone: ovirt-3.6.0-rc
: 3.6.0
Assignee: Tomas Jelinek
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-08-10 09:14 UTC by Israel Pinto
Modified: 2016-03-09 19:43 UTC (History)
14 users (show)

Fixed In Version: 3.6.0-11
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-09 19:43:44 UTC
oVirt Team: Virt
Target Upstream Version:
mgoldboi: Triaged+


Attachments (Terms of Use)
logs (877.07 KB, application/x-bzip)
2015-08-10 09:15 UTC, Israel Pinto
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:0362 normal SHIPPED_LIVE vdsm 3.6.0 bug fix and enhancement update 2016-03-09 23:49:32 UTC
oVirt gerrit 44631 master MERGED core: check the num of IO Threads 2020-04-24 18:19:12 UTC
oVirt gerrit 44676 master MERGED frontend: check the num of IO Threads 2020-04-24 18:19:12 UTC
oVirt gerrit 45088 ovirt-engine-3.6 MERGED core: check the num of IO Threads 2020-04-24 18:19:12 UTC
oVirt gerrit 45089 ovirt-engine-3.6 MERGED frontend: check the num of IO Threads 2020-04-24 18:19:12 UTC

Description Israel Pinto 2015-08-10 09:14:21 UTC
Description of problem:
VM is stuck in 'Wait for Launch'
On one host the vdsm log shows the error:
'Error fetching vm stats'


Version-Release number of selected component (if applicable):
3.6.0-5
3.6.0-0.0.master.20150804111407.git122a3a0.el6
VDSM:vdsm-4.17.0-1239.git6575e3f.el7.noarch

How reproducible:
Not all the time

Steps to Reproduce:
1. Create a new VM and start it
2. Stop the VM, edit the VM: change the IO thread number to 12345
3. Start the VM, wait for ~15 min and shut down the VM
4. Start the VM again, and see the error on the engine
and on the second host (10.35.161.158):
'The vm start process failed'.
5. Kill the VM process on host (10.35.161.158); the process output (below) contains a lot of IO threads
6. Start the VM again; the VM is up.

Actual results:
VM is stuck in 'Wait for Launch' and the VM status does not change.

Expected results:
VM up

Additional info:
From host 10.35.161.159 vdsm log:
JsonRpc (StompReactor)::DEBUG::2015-08-10 11:13:14,670::stompreactor::236::Broker.StompAdapter::(handle_frame) Handling message <StompFrame command=u'SEND'>
JsonRpcServer::DEBUG::2015-08-10 11:13:14,671::__init__::540::jsonrpc.JsonRpcServer::(serve_requests) Waiting for request
Thread-13754::ERROR::2015-08-10 11:13:14,672::vm::1416::virt.vm::(_getRunningVmStats) vmId=`b88b433c-7eeb-4236-83f9-1997c9f9d0eb`::Error fetching vm stats
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 1413, in _getRunningVmStats
    vm_sample.interval)
  File "/usr/share/vdsm/virt/vmstats.py", line 42, in produce
    networks(vm, stats, first_sample, last_sample, interval)
  File "/usr/share/vdsm/virt/vmstats.py", line 217, in networks
    if nic.name.startswith('hostdev'):
AttributeError: name
Thread-13754::DEBUG::2015-08-10 11:13:14,673::stompreactor::305::yajsonrpc.StompServer::(send) Sending response
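The AttributeError above can be illustrated in isolation. The following is a minimal sketch of the failure mode and a defensive guard; the Nic class and helper are hypothetical stand-ins, not the actual vdsm classes:

```python
# Minimal illustration of the AttributeError seen in vmstats.networks().
# Nic here is a hypothetical stand-in: while the VM is still being created,
# the device object may not yet have a usable 'name' attribute.

class Nic(object):
    """Stand-in for a vdsm network device object (hypothetical)."""
    def __init__(self, name=None):
        if name is not None:
            self.name = name  # attribute absent entirely on a half-created VM

def is_hostdev(nic):
    # Defensive variant of the failing line:
    #     if nic.name.startswith('hostdev'):
    # getattr() with a default avoids the AttributeError when 'name'
    # has not been set yet.
    return getattr(nic, 'name', '').startswith('hostdev')

print(is_hostdev(Nic('hostdev0')))  # True
print(is_hostdev(Nic()))            # False, instead of raising AttributeError
```

This only papers over the symptom; as the later comments conclude, the underlying cause was the uncapped IO-thread count keeping the domain half-initialized.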


From host 10.35.161.158:
Thread-32703::ERROR::2015-08-10 11:37:01,763::vm::786::virt.vm::(_startUnderlyingVm) vmId=`b88b433c-7eeb-4236-83f9-1997c9f9d0eb`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 731, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 1896, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3427, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: Requested operation is not valid: domain 'vm_with_ui' is already active
Thread-32703::INFO::2015-08-10 11:37:01,768::vm::1305::virt.vm::(setDownStatus) vmId=`b88b433c-7eeb-4236-83f9-1997c9f9d0eb`::Changed state to Down: Requested operation is not valid: domain 'vm_with_ui' is already active (code=1)


VM process output: ps -ef | grep qemu
qemu      5155     1 96 11:18 ?        00:22:45 /usr/libexec/qemu-kvm -name vm_with_ui -S -machine pc-i440fx-rhel7.1.0,accel=kvm,usb=off -cpu Conroe -m 4096 -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -object iothread,id=iothread1 -object iothread,id=iothread2 -object iothread,id=iothread3 -object iothread,id=iothread4 -object iothread,id=iothread5 -object iothread,id=iothread6 -object iothread,id=iothread7 -object iothread,id=iothread8 -object iothread,id=iothread9 -object iothread,id=iothread10 -object iothread,id=iothread11 -object iothread,id=iothread12 -object iothread,id=iothread13 -object iothread,id=iothread14 -object iothread,id=iothread15 -object iothread,id=iothread16 -object iothread,id=iothread17 -object iothread,id=iothread18 -object iothread,id=iothread19 -object iothread,id=iothread20 -object iothread,id=iothread21 -object iothread,id=iothread22 -object iothread,id=iothread23 -object iothread,id=iothread24 -object iothread,id=iothread25 -object iothread,id=iothread26 -object iothread,id=iothread27 -object iothread,id=iothread28 -object iothread,id=iothread29 -object iothread,id=iothread30 -object iothread,id=iothread31 -object iothread,id=iothread32 -object iothread,id=iothread33 -object iothread,id=iothread34 -object iothread,id=iothread35 -object iothread,id=iothread36 -object iothread,id=iothread37 -object iothread,id=iothread38 -object iothread,id=iothread39 -object iothread,id=iothread40 -object iothread,id=iothread41 -object iothread,id=iothread42 -object iothread,id=iothread43 -object iothread,id=iothread44 -object iothread,id=iothread45 -object iothread,id=iothread46 -object iothread,id=iothread47 -object iothread,id=iothread48 -object iothread,id=iothread49 -object iothread,id=iothread50 -object iothread,id=iothread51 -object iothread,id=iothread52 -object iothread,id=iothread53 -object iothread,id=iothread54 -object iothread,id=iothread55 -object iothread,id=iothread56 -object iothread,id=iothread57 -object 
iothread,id=iothread58 -object iothread,id=iothread59 -object iothread,id=iothread60 -object iothread,id=iothread61 -object iothread,id=iothread62 -object iothread,id=iothread63 -object iothread,id=iothread64 -object iothread,id=iothread65 -object iothread,id=iothread66 -object iothread,id=iothread67 -object iothread,id=iothread68 -object iothread,id=iothread69 -object iothread,id=iothread70 -object iothread,id=iothread71 -object iothread,id=iothread72 -object iothread,id=iothread73 -object iothread,id=iothread74 -object iothread,id=iothread75 -object iothread,id=iothread76 -object iothread,id=iothread77 -object iothread,id=iothread78 -object iothread,id=iothread79 -object iothread,id=iothread80 -object iothread,id=iothread81 -object iothread,id=iothread82 -object iothread,id=iothread83 -object iothread,id=iothread84 -object iothread,id=iothread85 -object iothread,id=iothread86 -object iothread,id=iothread87 -object iothread,id=iothread88 -object iothread,id=iothread89 -object iothread,id=iothread90 -object iothread,id=iothread91 -object iothread,id=iothread92 -object iothread,id=iothread93 -object iothread,id=iothread94 -object iothread,id=iothread95 -object iothread,id=iothread96 -object iothread,id=iothread97 -object iothread,id=iothread98 -object iothread,id=iothread99 -object iothread,id=iothread100 -object iothread,id=iothread101 -object iothread,id=iothread102 -object iothread,id=iothread103 -object iothread,id=iothread104 -object iothread,id=iothread105 -object iothread,id=iothread106 -object iothread,id=iothread107 -object iothread,id=iothread108 -object iothread,id=iothread109 -object iothread,id=iothread110 -object iothread,id=iothread111 -object iothread,id=iothread112 -object iothread,id=iothread113 -object iothread,id=iothread114 -object iothread,id=iothread115 -object iothread,id=iothread116 -object iothread,id=iothread117 -object iothread,id=iothread118 -object iothread,id=iothread119 -object iothread,id=iothread120 -object iothread,id=iothread121 
-object iothread,id=iothread122 -object iothread,id=iothread123 -object iothread,id=iothread124 -object iothread,id=iothread125 -object iothread,id=iothread

attached hosts and engine log.

Comment 1 Israel Pinto 2015-08-10 09:15:38 UTC
Created attachment 1060987 [details]
logs

Comment 2 Tomas Jelinek 2015-08-10 11:35:22 UTC
@Francesco I would say the root cause can be this:

  File "/usr/share/vdsm/virt/vm.py", line 1413, in _getRunningVmStats
    vm_sample.interval)
  File "/usr/share/vdsm/virt/vmstats.py", line 42, in produce
    networks(vm, stats, first_sample, last_sample, interval)
  File "/usr/share/vdsm/virt/vmstats.py", line 217, in networks
    if nic.name.startswith('hostdev'):

Do you think it could explain the symptoms in this bug?

Comment 3 Francesco Romani 2015-08-10 11:55:03 UTC
(In reply to Tomas Jelinek from comment #2)
> @Francesco I would say the root cause can be this:
> 
>   File "/usr/share/vdsm/virt/vm.py", line 1413, in _getRunningVmStats
>     vm_sample.interval)
>   File "/usr/share/vdsm/virt/vmstats.py", line 42, in produce
>     networks(vm, stats, first_sample, last_sample, interval)
>   File "/usr/share/vdsm/virt/vmstats.py", line 217, in networks
>     if nic.name.startswith('hostdev'):
> 
> Do you think it could explain the symptoms in this bug?

This surely causes missing events to be sent to Engine, and lack of statistics reported later on.

But a call of getAllVmStats should report status correctly Up.

Israel, can you please provide the output of the

'vdsClient -s 0 getAllVmStats', run from the hypervisor host, when you reproduce the issue?

As a side note, I believe that "12345" iothreads is WAY too high to be meaningful. I see many periodic threads being replaced, which is a symptom of sluggish or stuck libvirt, which can in turn be caused by sluggish or stuck QEMU.

Shouldn't we cap the number of iothreads?

Comment 4 Francesco Romani 2015-08-10 12:21:17 UTC
(In reply to Francesco Romani from comment #3)
> (In reply to Tomas Jelinek from comment #2)
> > @Francesco I would say the root cause can be this:
> > 
> >   File "/usr/share/vdsm/virt/vm.py", line 1413, in _getRunningVmStats
> >     vm_sample.interval)
> >   File "/usr/share/vdsm/virt/vmstats.py", line 42, in produce
> >     networks(vm, stats, first_sample, last_sample, interval)
> >   File "/usr/share/vdsm/virt/vmstats.py", line 217, in networks
> >     if nic.name.startswith('hostdev'):
> > 
> > Do you think it could explain the symptoms in this bug?
> 
> This surely causes missing events to be sent to Engine, and lack of
> statistics reported later on.
> 
> But a call of getAllVmStats should report status correctly Up.
> 
> Isreal, can you please provide the output of the
> 
> 'vdsClient -s 0 getAllVmStats', run from the hypervisor host, when you
> reproduce the issue?
> 
> As a side note, I believe that "12345" iothreads is WAY too high to be
> meaningful. I see many periodic thread being replaced, this is a symptom of
> sluggish or stuck libvirt, which can in turn be caused by sluggish or stuck
> QEMU.
> 
> Should'nt we cap the amount of iothreads?

Closer inspection of the logs suggests that libvirt/qemu simply takes a very long time to boot when configured with "12345" iothreads. This explains why libvirt is not responding and why the network device name is missing:
VDSM has not yet heard back from libvirt, so the VM is still half-baked and in an inconsistent state.

The main (and only) bug here is that there is no cap for iothreads. I can't think of any scenario in which such a high number is desirable. I think we should cap it either in VDSM or -better- in Engine.

Or we can file a scalability bug against QEMU -if this value indeed makes sense.

Comment 5 Tomas Jelinek 2015-08-10 13:06:20 UTC
ok, so after a discussion with Francesco we have agreed that:
- we should add an engine-config param to set the max number of IO Threads per VM
- this value will then be used by WA/UP and the backend to validate the input
- the default max value will be 4

@Michal - do you agree?

Comment 6 Michal Skrivanek 2015-08-10 13:26:47 UTC
let's add a simple threshold to filter out nonsense values; 0-127 should be enough for everyone and shouldn't affect qemu/host too much

Comment 7 Tomas Jelinek 2015-08-20 08:39:25 UTC
the fix was simply to add a check that the number of iothreads is not too high.
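A minimal sketch of such a validation check, using the 127 cap proposed in comment 6 (the actual fix lives in the oVirt engine Java code; the function and constant names here are hypothetical, for illustration only):

```python
# Illustrative sketch of the engine-side IO-thread validation described above.
# MAX_IO_THREADS mirrors the configurable cap discussed in comments 5 and 6;
# the real implementation is in the oVirt engine (Java), so all names here
# are hypothetical.

MAX_IO_THREADS = 127

def validate_io_threads(num_of_io_threads):
    """Return (valid, message) for a requested IO-thread count."""
    if num_of_io_threads < 0:
        return False, "number of IO threads cannot be negative"
    if num_of_io_threads > MAX_IO_THREADS:
        return False, ("number of IO threads exceeds the maximum (%d)"
                       % MAX_IO_THREADS)
    return True, "ok"

print(validate_io_threads(12))     # accepted
print(validate_io_threads(12345))  # rejected, matching the verification in comment 8
```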

Comment 8 Israel Pinto 2015-09-08 11:17:11 UTC
Verified with:
Red Hat Enterprise Virtualization Manager Version: 3.6.0-0.13.master.el6
VDSM: vdsm-4.17.5-1.el7ev 

update IO threads to 12 - Pass
update IO threads to 127 - Pass
update IO threads to 12345 (update was rejected) - Pass

Comment 11 errata-xmlrpc 2016-03-09 19:43:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html

