Description of problem: When you use an external provider there is a race between VM boot time and network configuration time. We need to resolve this to allow API testing of external providers.
Can you please help reproduce this with Neutron? We need some fix for 3.6 and a better one for 4.0.
after_vm_start is not shipped in the rpm, and therefore the VM is stuck in the paused state. Nothing waits the 15 seconds and then resumes it.
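For context, the missing after_vm_start piece is what would unpause the domain once the external provider has had time to wire the port. A minimal, hedged sketch of such a hook, assuming the VM UUID is exposed to the hook environment as 'vmId' and that the plain libvirt Python bindings are available; the 15-second delay and names here are illustrative, not the shipped hook:

#!/usr/bin/python
# Illustrative sketch only, not the shipped 50_openstacknet after_vm_start hook.
# Assumes: the hook runs on the host with access to the local libvirt daemon,
# and the VM UUID is exported in the hook environment as 'vmId'.
import os
import time

import libvirt

RESUME_DELAY = 15  # seconds; gives the external provider time to plug the port


def main():
    vm_uuid = os.environ.get('vmId')
    if not vm_uuid:
        return
    time.sleep(RESUME_DELAY)
    conn = libvirt.open('qemu:///system')
    try:
        dom = conn.lookupByUUIDString(vm_uuid)
        state, _ = dom.state()
        if state == libvirt.VIR_DOMAIN_PAUSED:
            dom.resume()  # continue the VM that before_device_create left paused
    finally:
        conn.close()


if __name__ == '__main__':
    main()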
*** Bug 1357386 has been marked as a duplicate of this bug. ***
Should this be MODIFIED?
Hi, when starting the VM it first goes to the 'not responding' state (question mark), and after 15-30 seconds the VM continues to start up. Is this the expected behavior?
In virsh it is reported as paused for a few seconds and then as running.
'Not responding' is an ugly VM state in which qemu does not respond to Vdsm. This issue is unrelated to this bug, but it is worth researching - can you share vdsm.log and engine.log from the 'not responding' time?
Created attachment 1195053 [details] logs
From 17:19:00 to 17:20:20 in the logs
I did not observe 'not responding' on my setup; it was usually showing 'starting' in most cases (only sometimes did I notice the paused state). I will look into this.
Arik, can you tell why the engine deemed the VM non-responding? I see nothing in vdsm.log (comment 8) hinting at that.
(In reply to Dan Kenigsberg from comment #11)
The only trigger for setting a VM status to non-responding is the engine getting the VM from VDSM with a negative value of 'monitorResponse'. I believe the attached vdsm log does not cover the relevant period of time. The VM is set to non-responding by the monitoring not long after its creation:

2016-08-28 11:21:08,775 ... [5b802702] VM v2 was started by admin@internal (Host: silver-vdsa.qa.lab.tlv.redhat.com).
2016-08-28 11:21:14,972 ... Fetched 1 VMs from VDS 'b477701d-6559-41b1-ae96-2b5dfa6091f6'
2016-08-28 11:21:14,978 ... VM '7ebb5925-e76f-4d2f-a148-7671c313fe84'(v2) moved from 'WaitForLaunch' --> 'NotResponding'
2016-08-28 11:21:15,378 ... Message: VM v2 is not responding.

The VM creation was done at ~11:21:08 and its correlation id was 5b802702. The VDSM log (a) does not cover this time (but there might be some offset between the engine and the host); and (b) I can't find this correlation id in the VDSM log. Michael, can you please double check that it is the right VDSM log?
(In reply to Arik from comment #12)
Ah ok, I looked at the wrong VM. We're probably talking about VM 'sr-vm2', which became non-responding on:

2016-08-28 17:19:15,856 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-54) [] VM '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'WaitForLaunch' --> 'NotResponding'

And in the VDSM log I see:

jsonrpc.Executor/1::WARNING::2016-08-28 17:19:10,129::vm::5177::virt.vm::(_setUnresponsiveIfTimeout) vmId=`19bafd0e-ef84-45f9-ad33-a28d79bfdbad`::monitor become unresponsive (command timeout, age=4326160.58)

(there is a time offset between the engine and that host for sure, btw)
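To make the two log lines above easier to read together, here is a rough, hedged sketch of the logic they imply rather than the actual vdsm or engine code: vdsm's _setUnresponsiveIfTimeout marks the monitor as unresponsive once the last libvirt reply is older than a timeout, and the engine moves any VM reported with a negative monitorResponse to NotResponding. The timeout value and field names below are placeholders.

import time

RESPONSE_TIMEOUT = 60  # seconds; illustrative value, not vdsm's actual setting


def monitor_response(last_monitor_reply):
    # vdsm side: 0 while the qemu monitor answers, -1 once it looks stuck
    age = time.time() - last_monitor_reply
    if age > RESPONSE_TIMEOUT:
        return -1
    return 0


def engine_vm_status(monitor_response_value, current_status):
    # engine side: a negative monitorResponse is the only NotResponding trigger
    if monitor_response_value < 0:
        return 'NotResponding'
    return current_status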
Yes, it is the right log; there is no offset between the host and the engine. Attaching new logs, 11:36:00-11:38:00, for (sr-vm2):

2016-08-30 11:36:17,718 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-32) [627a0728] VM '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'WaitForLaunch' --> 'NotResponding'

2016-08-30 11:36:26,766 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (ForkJoinPool-1-worker-4) [] VM '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'NotResponding' --> 'Up'
Created attachment 1195765 [details] New Logs
(In reply to Michael Burman from comment #14)
> Yes it is the right logs, no offset between host and engine.. Attaching new
> logs 11:36:00-11:38:00 for (sr-vm2)
>
> 2016-08-30 11:36:17,718 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
> (DefaultQuartzScheduler_Worker-32) [627a0728] VM
> '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'WaitForLaunch'
> --> 'NotResponding'
>
> 2016-08-30 11:36:26,766 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
> (ForkJoinPool-1-worker-4) [] VM
> '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'NotResponding'
> --> 'Up'

This seems to happen because the VM is taking (relatively) long to boot. We knew that could happen (https://gerrit.ovirt.org/#/c/61310/ - toward the end of the commit message). Is this an issue?
mburman, if Francesco's suggestion is right, you should be seeing this state change for every VM that stays more than 60 seconds in WaitForLaunch, not only in VMs that are paused by the external network provider hook. Is that the case?
(In reply to Dan Kenigsberg from comment #17)
> mburman, if Francesco's suggestion is right, you should be seeing this state
> change for every VM that stays more than 60 seconds in WaitForLaunch, not
> only in VM that are paused by the external network provider hook. Is that
> the case?

After a deeper look, we could also have a race in startup (not related to patch 61310), which is more likely to happen with this hook deployed; or, the other way around, this race is rarely seen in regular boots. I've posted a *draft* to tackle this race: https://gerrit.ovirt.org/#/c/63008

Please note: we could also have a similar race on shutdown.
(In reply to Dan Kenigsberg from comment #17)
> mburman, if Francesco's suggestion is right, you should be seeing this state
> change for every VM that stays more than 60 seconds in WaitForLaunch, not
> only in VM that are paused by the external network provider hook. Is that
> the case?

Don't know, I never saw this state before.
Do you always see the non-responding state when the VM is paused? Would you edit /usr/libexec/vdsm/hooks/before_device_create/50_openstacknet to have

PAUSE_FLAG = 0  # libvirt.VIR_DOMAIN_START_PAUSED

and retry?
I think so - it goes from 'waiting for launch' > 'not responding' > 'up'. When editing /usr/libexec/vdsm/hooks/before_device_create/50_openstacknet to have

PAUSE_FLAG = 0  # libvirt.VIR_DOMAIN_START_PAUSED

as you asked, it goes from 'powering up' > 'up'.
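To recap the toggle used in the check above: the hook takes the flag from libvirt so the domain is created suspended, and setting it to 0 simply disables the paused start. A simplified, hedged fragment (the surrounding hook code is omitted and may differ from the shipped file):

import libvirt

# As shipped: start the domain suspended so the external provider can finish
# configuring the port before the guest boots.
PAUSE_FLAG = libvirt.VIR_DOMAIN_START_PAUSED

# The edit used for the check above neutralizes the flag instead:
# PAUSE_FLAG = 0  # libvirt.VIR_DOMAIN_START_PAUSED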
Ok, so this bug can be verified. It's up to you / virt QE / virt dev whether a separate bug regarding the intermediate 'non responding' state (introduced in 4.0.4) should be opened.
Ok, so based on the last comment this bug is verified on 3.6.9-0.1.el6 and vdsm-4.17.34-1.el7ev.noarch.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-1925.html