Description of problem:
VM is stuck in 'Wait for launch' on one host; the vdsm log shows the error 'Error fetching vm stats'.

Version-Release number of selected component (if applicable):
3.6.0-5
3.6.0-0.0.master.20150804111407.git122a3a0.el6
VDSM: vdsm-4.17.0-1239.git6575e3f.el7.noarch

How reproducible:
Not all the time

Steps to Reproduce:
1. Create a new VM and start it
2. Stop the VM, edit it, and change the IO thread number to 12345
3. Start the VM, wait ~15 minutes, and shut it down
4. Start the VM again and see the error on the engine and on the second host (10.35.161.158): 'The vm start process failed'
5. Kill the VM process on host 10.35.161.158; the process output (below) contains a large number of IO threads
6. Start the VM again; the VM comes up

Actual results:
VM is stuck in 'Wait for launch' and its status does not change.

Expected results:
VM is up.

Additional info:

From host 10.35.161.159, vdsm log:

JsonRpc (StompReactor)::DEBUG::2015-08-10 11:13:14,670::stompreactor::236::Broker.StompAdapter::(handle_frame) Handling message <StompFrame command=u'SEND'>
JsonRpcServer::DEBUG::2015-08-10 11:13:14,671::__init__::540::jsonrpc.JsonRpcServer::(serve_requests) Waiting for request
Thread-13754::ERROR::2015-08-10 11:13:14,672::vm::1416::virt.vm::(_getRunningVmStats) vmId=`b88b433c-7eeb-4236-83f9-1997c9f9d0eb`::Error fetching vm stats
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 1413, in _getRunningVmStats
    vm_sample.interval)
  File "/usr/share/vdsm/virt/vmstats.py", line 42, in produce
    networks(vm, stats, first_sample, last_sample, interval)
  File "/usr/share/vdsm/virt/vmstats.py", line 217, in networks
    if nic.name.startswith('hostdev'):
AttributeError: name
Thread-13754::DEBUG::2015-08-10 11:13:14,673::stompreactor::305::yajsonrpc.StompServer::(send) Sending response

From host 10.35.161.158:

Thread-32703::ERROR::2015-08-10 11:37:01,763::vm::786::virt.vm::(_startUnderlyingVm) vmId=`b88b433c-7eeb-4236-83f9-1997c9f9d0eb`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 731, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 1896, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3427, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: Requested operation is not valid: domain 'vm_with_ui' is already active
Thread-32703::INFO::2015-08-10 11:37:01,768::vm::1305::virt.vm::(setDownStatus) vmId=`b88b433c-7eeb-4236-83f9-1997c9f9d0eb`::Changed state to Down: Requested operation is not valid: domain 'vm_with_ui' is already active (code=1)

VM process output:
ps -ef | grep qemu
qemu      5155     1 96 11:18 ?
00:22:45 /usr/libexec/qemu-kvm -name vm_with_ui -S -machine pc-i440fx-rhel7.1.0,accel=kvm,usb=off -cpu Conroe -m 4096 -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -object iothread,id=iothread1 -object iothread,id=iothread2 -object iothread,id=iothread3 -object iothread,id=iothread4 -object iothread,id=iothread5 -object iothread,id=iothread6 -object iothread,id=iothread7 -object iothread,id=iothread8 -object iothread,id=iothread9 -object iothread,id=iothread10 -object iothread,id=iothread11 -object iothread,id=iothread12 -object iothread,id=iothread13 -object iothread,id=iothread14 -object iothread,id=iothread15 -object iothread,id=iothread16 -object iothread,id=iothread17 -object iothread,id=iothread18 -object iothread,id=iothread19 -object iothread,id=iothread20 -object iothread,id=iothread21 -object iothread,id=iothread22 -object iothread,id=iothread23 -object iothread,id=iothread24 -object iothread,id=iothread25 -object iothread,id=iothread26 -object iothread,id=iothread27 -object iothread,id=iothread28 -object iothread,id=iothread29 -object iothread,id=iothread30 -object iothread,id=iothread31 -object iothread,id=iothread32 -object iothread,id=iothread33 -object iothread,id=iothread34 -object iothread,id=iothread35 -object iothread,id=iothread36 -object iothread,id=iothread37 -object iothread,id=iothread38 -object iothread,id=iothread39 -object iothread,id=iothread40 -object iothread,id=iothread41 -object iothread,id=iothread42 -object iothread,id=iothread43 -object iothread,id=iothread44 -object iothread,id=iothread45 -object iothread,id=iothread46 -object iothread,id=iothread47 -object iothread,id=iothread48 -object iothread,id=iothread49 -object iothread,id=iothread50 -object iothread,id=iothread51 -object iothread,id=iothread52 -object iothread,id=iothread53 -object iothread,id=iothread54 -object iothread,id=iothread55 -object iothread,id=iothread56 -object iothread,id=iothread57 -object iothread,id=iothread58 -object iothread,id=iothread59 -object iothread,id=iothread60 -object iothread,id=iothread61 -object iothread,id=iothread62 -object iothread,id=iothread63 -object iothread,id=iothread64 -object iothread,id=iothread65 -object iothread,id=iothread66 -object iothread,id=iothread67 -object iothread,id=iothread68 -object iothread,id=iothread69 -object iothread,id=iothread70 -object iothread,id=iothread71 -object iothread,id=iothread72 -object iothread,id=iothread73 -object iothread,id=iothread74 -object iothread,id=iothread75 -object iothread,id=iothread76 -object iothread,id=iothread77 -object iothread,id=iothread78 -object iothread,id=iothread79 -object iothread,id=iothread80 -object iothread,id=iothread81 -object iothread,id=iothread82 -object iothread,id=iothread83 -object iothread,id=iothread84 -object iothread,id=iothread85 -object iothread,id=iothread86 -object iothread,id=iothread87 -object iothread,id=iothread88 -object iothread,id=iothread89 -object iothread,id=iothread90 -object iothread,id=iothread91 -object iothread,id=iothread92 -object iothread,id=iothread93 -object iothread,id=iothread94 -object iothread,id=iothread95 -object iothread,id=iothread96 -object iothread,id=iothread97 -object iothread,id=iothread98 -object iothread,id=iothread99 -object iothread,id=iothread100 -object iothread,id=iothread101 -object iothread,id=iothread102 -object iothread,id=iothread103 -object iothread,id=iothread104 -object iothread,id=iothread105 -object iothread,id=iothread106 -object iothread,id=iothread107 -object iothread,id=iothread108 -object 
iothread,id=iothread109 -object iothread,id=iothread110 -object iothread,id=iothread111 -object iothread,id=iothread112 -object iothread,id=iothread113 -object iothread,id=iothread114 -object iothread,id=iothread115 -object iothread,id=iothread116 -object iothread,id=iothread117 -object iothread,id=iothread118 -object iothread,id=iothread119 -object iothread,id=iothread120 -object iothread,id=iothread121 -object iothread,id=iothread122 -object iothread,id=iothread123 -object iothread,id=iothread124 -object iothread,id=iothread125 -object iothread,id=iothread

Attached hosts and engine logs.
Created attachment 1060987 [details] logs
@Francesco I would say the root cause can be this:

  File "/usr/share/vdsm/virt/vm.py", line 1413, in _getRunningVmStats
    vm_sample.interval)
  File "/usr/share/vdsm/virt/vmstats.py", line 42, in produce
    networks(vm, stats, first_sample, last_sample, interval)
  File "/usr/share/vdsm/virt/vmstats.py", line 217, in networks
    if nic.name.startswith('hostdev'):

Do you think it could explain the symptoms in this bug?
(In reply to Tomas Jelinek from comment #2)
> @Francesco I would say the root cause can be this:
>
>   File "/usr/share/vdsm/virt/vm.py", line 1413, in _getRunningVmStats
>     vm_sample.interval)
>   File "/usr/share/vdsm/virt/vmstats.py", line 42, in produce
>     networks(vm, stats, first_sample, last_sample, interval)
>   File "/usr/share/vdsm/virt/vmstats.py", line 217, in networks
>     if nic.name.startswith('hostdev'):
>
> Do you think it could explain the symptoms in this bug?

This surely causes missing events to be sent to Engine, and a lack of statistics reported later on.

But a call to getAllVmStats should still report the status correctly as Up.

Isreal, can you please provide the output of

'vdsClient -s 0 getAllVmStats'

run from the hypervisor host, when you reproduce the issue?

As a side note, I believe that "12345" iothreads is WAY too high to be meaningful. I see many periodic threads being replaced; this is a symptom of sluggish or stuck libvirt, which can in turn be caused by sluggish or stuck QEMU.

Shouldn't we cap the amount of iothreads?
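For context, the check that blows up in the traceback above lives in vmstats.networks(), which walks the VM's NIC devices and assumes each one already has its 'name' attribute filled in by the time stats are collected. The following is only a minimal, hypothetical sketch (not the actual vdsm code; getNicDevices() and the stats bookkeeping are assumed) of how a defensive guard around the missing attribute would avoid the AttributeError:

def networks(vm, stats, first_sample, last_sample, interval):
    # Hypothetical sketch only - illustrates the guard, not the real patch.
    for nic in vm.getNicDevices():  # assumed helper returning NIC device objects
        name = getattr(nic, 'name', None)
        if name is None:
            # libvirt has not reported this NIC yet (VM still starting up);
            # skip it instead of raising "AttributeError: name"
            continue
        if name.startswith('hostdev'):
            # hostdev NICs expose no usable traffic counters; skip them
            continue
        # ... collect rx/tx counters for this NIC into stats ...

As discussed below, the fix that was actually pursued caps the iothread count rather than patching this stats code path.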
(In reply to Francesco Romani from comment #3)
> (In reply to Tomas Jelinek from comment #2)
> > @Francesco I would say the root cause can be this:
> >
> >   File "/usr/share/vdsm/virt/vm.py", line 1413, in _getRunningVmStats
> >     vm_sample.interval)
> >   File "/usr/share/vdsm/virt/vmstats.py", line 42, in produce
> >     networks(vm, stats, first_sample, last_sample, interval)
> >   File "/usr/share/vdsm/virt/vmstats.py", line 217, in networks
> >     if nic.name.startswith('hostdev'):
> >
> > Do you think it could explain the symptoms in this bug?
>
> This surely causes missing events to be sent to Engine, and a lack of
> statistics reported later on.
>
> But a call to getAllVmStats should still report the status correctly as Up.
>
> Isreal, can you please provide the output of
>
> 'vdsClient -s 0 getAllVmStats'
>
> run from the hypervisor host, when you reproduce the issue?
>
> As a side note, I believe that "12345" iothreads is WAY too high to be
> meaningful. I see many periodic threads being replaced; this is a symptom
> of sluggish or stuck libvirt, which can in turn be caused by sluggish or
> stuck QEMU.
>
> Shouldn't we cap the amount of iothreads?

Closer inspection of the logs suggests that libvirt/qemu simply takes a very long time to boot when configured with "12345" iothreads. This can explain why libvirt is not responding and why the network device name is missing: VDSM has not yet heard back from libvirt, so the VM is still half-baked and in an inconsistent state.

The main (and only) bug here is that there is no cap on iothreads. I can't think of any scenario in which such a high amount is desirable. I think we should cap it either in VDSM or -better- in Engine. Or we can file a scalability bug against QEMU - if this value indeed makes sense.
ok, so after a discussion with Francesco we have agreed that:
- we should add an engine-config parameter to set the maximum number of IO threads per VM
- this value will then be used by WA/UP and the backend to validate the input
- the default max value will be 4

@Michal - do you agree?
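Purely to illustrate this proposal (the engine backend is Java; this Python-style sketch, with assumed names such as the 'MaxIoThreadsPerVm' key and validate_io_threads(), only shows the intended flow of reading a configurable maximum and validating user input against it):

def validate_io_threads(requested, config):
    # 'MaxIoThreadsPerVm' is the hypothetical engine-config key proposed
    # above; 4 is the proposed default maximum.
    max_io_threads = config.get('MaxIoThreadsPerVm', 4)
    if requested > max_io_threads:
        raise ValueError(
            'IO thread count %d exceeds the configured maximum of %d'
            % (requested, max_io_threads))
    return requested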
let's add a simple threshold to filter out nonsense values. 0-127 should be enough for everyone and shouldn't affect qemu/host too much
The fix was simply to add a check that the number of iothreads is not too high.
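A minimal sketch of the kind of check described here, assuming the 0-127 range suggested above (the function and constant names are illustrative, not taken from the actual patch):

MAX_IOTHREADS = 127  # upper bound suggested in the discussion above

def validate_iothreads(num_iothreads):
    # Reject nonsense values such as 12345 before the VM definition
    # ever reaches libvirt/qemu.
    if not 0 <= num_iothreads <= MAX_IOTHREADS:
        raise ValueError(
            'Invalid number of IO threads: %d (allowed range: 0-%d)'
            % (num_iothreads, MAX_IOTHREADS))
    return num_iothreads

The verification in the next comment (12 and 127 accepted, 12345 rejected) is consistent with such a bounds check; the snippet above only illustrates the idea.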
Verified with:
Red Hat Enterprise Virtualization Manager Version: 3.6.0-0.13.master.el6
VDSM: vdsm-4.17.5-1.el7ev

update IO threads to 12 - Pass
update IO threads to 127 - Pass
update IO threads to 12345 (the update was not allowed) - Pass
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0362.html