Bug 1314371

Summary: Resolve race when using external network providers hook
Product: Red Hat Enterprise Virtualization Manager Reporter: Yaniv Lavi <ylavi>
Component: vdsm Assignee: Marcin Mirecki <mmirecki>
Status: CLOSED ERRATA QA Contact: Meni Yakove <myakove>
Severity: high Docs Contact:
Priority: medium    
Version: 3.6.0 CC: ahadas, bazulay, bgraveno, bugs, danken, fromani, gklein, lsurette, mburman, mmirecki, myakove, srevivo, ycui, ykaul
Target Milestone: ovirt-3.6.9 Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
This update adds functionality that enables the hook implementer to change the virtual machine state from start-up to paused, and later resume it. This ensures that operations requiring virtual machines to be started in the paused state can be completed. For example, to attach an external NIC, the virtual machine should not be started until plugging of the NIC is complete, both on the virtual machine and in the entire NIC infrastructure outside the virtual machine.

The paused state can be set in the following two hooks:
- before_device_create
- before_vm_start

Resuming the virtual machine can be done in the following hooks:
- after_device_create
- after_vm_start

The flags that control the virtual machine state can be read and set using the following functions in the hooking library:
- load_vm_launch_flags_from_file(vm_id): get the current state of the flags, including changes made by other hooks.
- dump_vm_launch_flags_to_file(vm_id, flags): write the new state of the startup flags; all subsequent hooks will see the modified state of the flags.

A minimal usage sketch appears after the Description of problem below.
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-09-21 18:07:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Network RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1314375    
Attachments:
Description    Flags
logs           none
New Logs       none

Description Yaniv Lavi 2016-03-03 13:12:58 UTC
Description of problem:
When you use an external network provider, there is a race between VM boot time and network configuration time. We need to resolve this to allow external provider API testing.
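
To illustrate the fix described in the Doc Text above, here is a minimal sketch of a hook pair using the launch-flag functions from the hooking library. It is a sketch only: the function bodies, and the use of a direct libvirt connection for the resume, are illustrative assumptions rather than a description of the shipped openstacknet hooks; vm_id stands for the VM UUID as made available to the hook.

import libvirt
import hooking


def request_paused_start(vm_id):
    # before_device_create / before_vm_start:
    # OR the paused-start flag into the launch flags so vdsm creates the
    # domain paused; later hooks will see the modified flags.
    flags = hooking.load_vm_launch_flags_from_file(vm_id)
    flags |= libvirt.VIR_DOMAIN_START_PAUSED
    hooking.dump_vm_launch_flags_to_file(vm_id, flags)


def resume_when_wired(vm_id):
    # after_device_create / after_vm_start:
    # once the external NIC plumbing is in place, resume the guest.
    # Opening a libvirt connection directly is an assumption of this sketch.
    flags = hooking.load_vm_launch_flags_from_file(vm_id)
    if flags & libvirt.VIR_DOMAIN_START_PAUSED:
        conn = libvirt.open('qemu:///system')
        try:
            conn.lookupByUUIDString(vm_id).resume()
        finally:
            conn.close()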

Comment 1 Yaniv Lavi 2016-03-03 13:14:16 UTC
Can you please help reproduce with Neutron?

We need some fix for 3.6 and a better one for 4.0.

Comment 2 Michael Burman 2016-07-18 08:23:35 UTC
after_vm_start is not shipped in the rpm, and therefore the VM is stuck in the paused state. Nothing waits for 15 seconds and then continues it.

Comment 3 Yaniv Lavi 2016-07-20 08:17:33 UTC
*** Bug 1357386 has been marked as a duplicate of this bug. ***

Comment 4 Yaniv Lavi 2016-07-25 08:56:34 UTC
Should this be MODIFIED?

Comment 5 Michael Burman 2016-08-28 08:31:12 UTC
Hi

When starting the VM, it first goes to the 'not responding' state (question mark); after 15-30 seconds the VM continues to start up.
Is this the expected behavior?

Comment 6 Michael Burman 2016-08-28 09:06:36 UTC
In virsh it is reported as paused for a few seconds and then running.

Comment 7 Dan Kenigsberg 2016-08-28 14:04:29 UTC
'not responding' is an ugly VM state, where qemu does not respond to Vdsm. This issue is unrelated to this bug, but it is worth researching - can you share vdsm.log and engine.log from the 'not responding' time?

Comment 8 Michael Burman 2016-08-28 14:24:37 UTC
Created attachment 1195053 [details]
logs

Comment 9 Michael Burman 2016-08-28 14:26:18 UTC
From 17:19:00 to 17:20:20 in the logs

Comment 10 Marcin Mirecki 2016-08-29 08:33:37 UTC
I did not observe 'not responding' on my setup; it was usually showing 'starting' (only sometimes did I notice the paused state).
I will look into this.

Comment 11 Dan Kenigsberg 2016-08-29 12:03:45 UTC
Arik, can you tell why Engine deemed the VM as non-responding? I see nothing in vdsm.log (comment 8) hinting at that.

Comment 12 Arik 2016-08-30 08:27:57 UTC
(In reply to Dan Kenigsberg from comment #11)
The only trigger for setting VM status to non-responding is when the engine gets the VM from VDSM with a negative value of 'monitorResponse'.

I believe the attached vdsm log does not cover the relevant period of time.
The VM is set by the monitoring to non-responding not long after its creation:
2016-08-28 11:21:08,775 ... [5b802702]  VM v2 was started by admin@internal (Host: silver-vdsa.qa.lab.tlv.redhat.com).
2016-08-28 11:21:14,972 ... Fetched 1 VMs from VDS 'b477701d-6559-41b1-ae96-2b5dfa6091f6'
2016-08-28 11:21:14,978 ... VM '7ebb5925-e76f-4d2f-a148-7671c313fe84'(v2) moved from 'WaitForLaunch' --> 'NotResponding'
2016-08-28 11:21:15,378 ... Message: VM v2 is not responding.

The VM creation was done at ~11:21:08 and its correlation id was 5b802702.
The VDSM log (a) does not cover this time (though there might be some offset between the engine and the host); and (b) I can't find this correlation id in the VDSM log.

Michael, can you please double check that it is the right VDSM log?

Comment 13 Arik 2016-08-30 08:39:46 UTC
(In reply to Arik from comment #12)
Ah, OK, I looked at the wrong VM. We're probably talking about VM 'sr-vm2', which became non-responding on:
2016-08-28 17:19:15,856 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-54) [] VM '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'WaitForLaunch' --> 'NotResponding'
And in VDSM log I see:
jsonrpc.Executor/1::WARNING::2016-08-28 17:19:10,129::vm::5177::virt.vm::(_setUnresponsiveIfTimeout) vmId=`19bafd0e-ef84-45f9-ad33-a28d79bfdbad`::monitor become unresponsive (command timeout, age=4326160.58)

(there's a time offset between the engine and that host for sure, btw)
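
For context, a simplified, hedged sketch of the mechanism described above - vdsm's _setUnresponsiveIfTimeout reporting a negative 'monitorResponse' that the engine maps to 'NotResponding'. The parameter names and the 60-second default are assumptions of this sketch, not the exact vdsm/engine code.

def set_unresponsive_if_timeout(stats, last_monitor_reply_age, timeout=60):
    # If qemu has not answered a monitor command for longer than the timeout,
    # report a negative 'monitorResponse' in the VM stats; the engine turns
    # that into the 'NotResponding' state (see the VmAnalyzer log lines above).
    if last_monitor_reply_age > timeout:
        stats['monitorResponse'] = -1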

Comment 14 Michael Burman 2016-08-30 08:42:45 UTC
Yes, these are the right logs, and there is no offset between host and engine. Attaching new logs, 11:36:00-11:38:00, for sr-vm2:

2016-08-30 11:36:17,718 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-32) [627a0728] VM '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'WaitForLaunch' --> 'NotResponding'

2016-08-30 11:36:26,766 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (ForkJoinPool-1-worker-4) [] VM '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'NotResponding' --> 'Up'

Comment 15 Michael Burman 2016-08-30 08:44:09 UTC
Created attachment 1195765 [details]
New Logs

Comment 16 Francesco Romani 2016-08-30 11:16:08 UTC
(In reply to Michael Burman from comment #14)
> Yes it is the right logs, no offset between host and engine.. Attaching new
> logs 11:36:00-11:38:00 for (sr-vm2)
> 
> 2016-08-30 11:36:17,718 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
> (DefaultQuartzScheduler_Worker-32) [627a0728] VM
> '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'WaitForLaunch'
> --> 'NotResponding'
> 
> 2016-08-30 11:36:26,766 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
> (ForkJoinPool-1-worker-4) [] VM
> '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'NotResponding'
> --> 'Up'

This seems to be because the VM is taking a (relatively) long time to boot. We knew that could happen (https://gerrit.ovirt.org/#/c/61310/ - see toward the end of the commit message).
Is this an issue?

Comment 17 Dan Kenigsberg 2016-08-30 11:41:38 UTC
mburman, if Francesco's suggestion is right, you should be seeing this state change for every VM that stays more than 60 seconds in WaitForLaunch, not only in VMs that are paused by the external network provider hook. Is that the case?

Comment 18 Francesco Romani 2016-08-30 11:58:57 UTC
(In reply to Dan Kenigsberg from comment #17)
> mburman, if Francesco's suggestion is right, you should be seeing this state
> change for every VM that stays more than 60 seconds in WaitForLaunch, not
> only in VMs that are paused by the external network provider hook. Is that
> the case?

After a deeper look, we could also have a race in startup (not related to patch 61310), which is more likely to happen with this hook deployed;
or, put the other way around, this race is rarely seen in regular boots.

I've posted a *draft* to tackle this race: https://gerrit.ovirt.org/#/c/63008

please note: we could also have a similar race on shutdown.

Comment 19 Michael Burman 2016-08-30 14:28:20 UTC
(In reply to Dan Kenigsberg from comment #17)
> mburman, if Francesco's suggestion is right, you should be seeing this state
> change for every VM that stays more than 60 seconds in WaitForLaunch, not
> only in VMs that are paused by the external network provider hook. Is that
> the case?

I don't know; I never saw this state before.

Comment 20 Dan Kenigsberg 2016-08-31 07:37:36 UTC
Do you always see the non-responding state when paused?

Would you edit /usr/libexec/vdsm/hooks/before_device_create/50_openstacknet
to have
  PAUSE_FLAG = 0  # libvirt.VIR_DOMAIN_START_PAUSED
and retry?
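
For context, a hedged sketch of what that edit presumably toggles inside the hook; the real 50_openstacknet internals may differ. With PAUSE_FLAG set to 0, the OR below becomes a no-op and the VM is never asked to start paused.

import libvirt
import hooking

# Illustration only; the shipped 50_openstacknet hook may be structured differently.
PAUSE_FLAG = libvirt.VIR_DOMAIN_START_PAUSED  # set to 0 to disable the paused start


def set_launch_flags(vm_id):
    # OR the pause flag into the launch flags read from file.
    flags = hooking.load_vm_launch_flags_from_file(vm_id)
    flags |= PAUSE_FLAG  # with PAUSE_FLAG = 0 this leaves the flags unchanged
    hooking.dump_vm_launch_flags_to_file(vm_id, flags)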

Comment 21 Michael Burman 2016-08-31 08:09:44 UTC
I think so.. it's going from 'waiting for launch' > 'Not responding' > 'UP'.


When editing /usr/libexec/vdsm/hooks/before_device_create/50_openstacknet
to have
  PAUSE_FLAG = 0  # libvirt.VIR_DOMAIN_START_PAUSED
as you asked, it goes from 'powering up' > 'up'.

Comment 22 Dan Kenigsberg 2016-08-31 10:20:11 UTC
OK, so this bug can be verified. It's up to you / virt QE / virt dev whether a separate bug regarding the intermediate 'non responding' state (introduced in 4.0.4) should be opened.

Comment 23 Michael Burman 2016-08-31 11:31:26 UTC
OK, so based on the last comment this bug is verified on 3.6.9-0.1.el6 and vdsm-4.17.34-1.el7ev.noarch.

Comment 25 errata-xmlrpc 2016-09-21 18:07:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1925.html