Bug 1314371 - Resolve race when using external network providers hook
Summary: Resolve race when using external network providers hook
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-3.6.9
Assignee: Marcin Mirecki
QA Contact: Meni Yakove
URL:
Whiteboard:
Duplicates: 1357386 (view as bug list)
Depends On:
Blocks: 1314375
 
Reported: 2016-03-03 13:12 UTC by Yaniv Lavi
Modified: 2016-09-21 18:07 UTC (History)
14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
This update adds functionality that enables the hook implementer to change the virtual machine state from start-up to paused, and later resume it. This ensures that operations which require virtual machines to be started in the paused state can be completed. For example, to attach an external NIC, the virtual machine should not be started until the NIC is fully plugged on the virtual machine and the NIC infrastructure outside the virtual machine is in place.
The paused state can be set in the following two hooks:
- before_device_create
- before_vm_start
Resuming the virtual machine can be done in the following hooks:
- after_device_create
- after_vm_start
The flags that control the virtual machine state can be read and set using the following functions in the hooking library:
- load_vm_launch_flags_from_file(vm_id): gets the current state of the flags, including changes made by other hooks.
- dump_vm_launch_flags_to_file(vm_id, flags): writes the new state of the startup flags; all subsequent hooks will see the modified state of the flags.
(An illustrative sketch of a hook using these functions follows the header fields below.)
Clone Of:
Environment:
Last Closed: 2016-09-21 18:07:02 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:
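
An illustrative sketch of the Doc Text above: a before_device_create hook requesting a paused start through the launch-flag functions. It assumes vdsm exposes the VM UUID to hooks in the vmId environment variable; this is a sketch, not a verbatim copy of the shipped 50_openstacknet hook.

  #!/usr/bin/python
  # Sketch: request a paused VM start from a before_device_create hook.
  import os

  import libvirt

  import hooking  # vdsm's hook helper library

  vm_id = os.environ['vmId']  # assumption: vdsm hands hooks the VM UUID here

  # Read the current launch flags so changes made by earlier hooks are kept.
  flags = hooking.load_vm_launch_flags_from_file(vm_id)

  # Request a paused start; an after_vm_start hook resumes the VM later.
  flags |= libvirt.VIR_DOMAIN_START_PAUSED
  hooking.dump_vm_launch_flags_to_file(vm_id, flags)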


Attachments
logs (545.60 KB, application/x-gzip)
2016-08-28 14:24 UTC, Michael Burman
no flags Details
New Logs (640.62 KB, application/x-gzip)
2016-08-30 08:44 UTC, Michael Burman
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1925 0 normal SHIPPED_LIVE vdsm 3.6.9 bug fix and enhancement update 2016-09-21 21:58:32 UTC
oVirt gerrit 57395 0 master ABANDONED virt: Adding 'delayedLaunch' to vm start 2016-05-17 09:45:55 UTC
oVirt gerrit 57538 0 master MERGED virt: Collect vm start flags from device hooks when running vm 2016-06-09 07:24:32 UTC
oVirt gerrit 58810 0 master NEW hooks: openstack hook should wait for nic activation and unpause vm 2016-06-10 07:52:22 UTC
oVirt gerrit 58881 0 ovirt-4.0 MERGED virt: Collect vm start flags from device hooks when running vm 2016-06-13 12:43:04 UTC
oVirt gerrit 58898 0 ovirt-3.6 MERGED virt: Collect vm start flags from device hooks when running vm 2016-06-29 14:13:28 UTC
oVirt gerrit 58917 0 master MERGED hooks: openstack hook should wait a while for nic activation 2016-06-29 08:58:20 UTC
oVirt gerrit 59057 0 ovirt-4.0 MERGED tests: hooks: fix monkeypatching 2016-06-13 15:27:48 UTC
oVirt gerrit 59077 0 ovirt-3.6 MERGED tests: hooks: fix monkeypatching 2016-06-29 14:13:38 UTC
oVirt gerrit 59919 0 ovirt-4.0 NEW hooks: openstack hook should wait a while for nic activation 2016-06-29 10:29:16 UTC
oVirt gerrit 60896 0 master MERGED openstacknet hook: include after_vm_start hook in rpm 2016-07-19 05:51:03 UTC
oVirt gerrit 60897 0 ovirt-4.0 MERGED openstacknet hook: include after_vm_start hook in rpm 2016-07-20 08:05:27 UTC
oVirt gerrit 61070 0 ovirt-3.6 MERGED openstacknet hook: include after_vm_start hook in rpm 2016-07-20 11:11:50 UTC

Description Yaniv Lavi 2016-03-03 13:12:58 UTC
Description of problem:
When you use an external network provider, there is a race between VM boot and the time the network is configured. We need to resolve this to allow external provider API testing.

Comment 1 Yaniv Lavi 2016-03-03 13:14:16 UTC
Can you please help reproduce with Neutron?

We need some fix for 3.6 and a better one for 4.0.

Comment 2 Michael Burman 2016-07-18 08:23:35 UTC
after_vm_start is not shipped in the rpm, and therefore the VM is stuck in the paused state: nothing waits 15 seconds and then resumes it.
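
For context, the missing hook's job is roughly: wait a short while (the 15 seconds mentioned above) for NIC activation, then resume the paused domain. A minimal sketch, assuming vdsm's libvirtconnection helper and the vmId environment variable (the shipped after_vm_start hook may differ):

  #!/usr/bin/python
  # Sketch: resume a VM that was started paused, after giving the
  # external network provider time to plug the NIC.
  import os
  import time

  import libvirt
  from vdsm import libvirtconnection

  vm_id = os.environ['vmId']
  dom = libvirtconnection.get().lookupByUUIDString(vm_id)

  deadline = time.time() + 15  # the wait budget mentioned above
  while time.time() < deadline and dom.state()[0] == libvirt.VIR_DOMAIN_PAUSED:
      time.sleep(1)  # a real hook would poll the provider's port status here

  if dom.state()[0] == libvirt.VIR_DOMAIN_PAUSED:
      dom.resume()  # continue the boot even if the budget ran out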

Comment 3 Yaniv Lavi 2016-07-20 08:17:33 UTC
*** Bug 1357386 has been marked as a duplicate of this bug. ***

Comment 4 Yaniv Lavi 2016-07-25 08:56:34 UTC
Should this be MODIFIED?

Comment 5 Michael Burman 2016-08-28 08:31:12 UTC
Hi

When starting the VM, it first goes to the 'not responding' state (question mark); after 15-30 seconds the VM continues to start up.
Is this the expected behavior?

Comment 6 Michael Burman 2016-08-28 09:06:36 UTC
In virsh it is reported as paused for a few seconds and then as running.

Comment 7 Dan Kenigsberg 2016-08-28 14:04:29 UTC
'Not responding' is an ugly VM state, in which qemu does not respond to Vdsm. This issue is unrelated to this bug, but it is worth researching - can you share the vdsm.log and engine.log covering the 'not responding' time?

Comment 8 Michael Burman 2016-08-28 14:24:37 UTC
Created attachment 1195053 [details]
logs

Comment 9 Michael Burman 2016-08-28 14:26:18 UTC
From 17:19:00 to 17:20:20 in the logs

Comment 10 Marcin Mirecki 2016-08-29 08:33:37 UTC
I did not observe 'not responding' on my setup; it was usually showing 'starting' in most cases (only sometimes did I notice the paused state).
I will look into this.

Comment 11 Dan Kenigsberg 2016-08-29 12:03:45 UTC
Arik, can you tell why the Engine deemed the VM non-responding? I see nothing in vdsm.log (comment 8) hinting at that.

Comment 12 Arik 2016-08-30 08:27:57 UTC
(In reply to Dan Kenigsberg from comment #11)
The only trigger for setting the VM status to non-responding is when the engine gets the VM from VDSM with a negative value of 'monitorResponse'.

I believe the attached vdsm log does not cover the relevant period of time.
The VM is set by the monitoring to non-responding not long after its creation:
2016-08-28 11:21:08,775 ... [5b802702]  VM v2 was started by admin@internal (Host: silver-vdsa.qa.lab.tlv.redhat.com).
2016-08-28 11:21:14,972 ... Fetched 1 VMs from VDS 'b477701d-6559-41b1-ae96-2b5dfa6091f6'
2016-08-28 11:21:14,978 ... VM '7ebb5925-e76f-4d2f-a148-7671c313fe84'(v2) moved from 'WaitForLaunch' --> 'NotResponding'
2016-08-28 11:21:15,378 ... Message: VM v2 is not responding.

The VM creation was done at ~11:21:08 and its correlation id was 5b802702.
The VDSM log (a) does not cover this time (though there might be some offset between the engine and the host), and (b) I can't find this correlation id in the VDSM log.

Michael, can you please double check that it is the right VDSM log?

Comment 13 Arik 2016-08-30 08:39:46 UTC
(In reply to Arik from comment #12)
Ah ok, I looked at the wrong VM. We're probably talking about VM 'sr-vm2', which became non-responding at:
2016-08-28 17:19:15,856 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-54) [] VM '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'WaitForLaunch' --> 'NotResponding'
And in VDSM log I see:
jsonrpc.Executor/1::WARNING::2016-08-28 17:19:10,129::vm::5177::virt.vm::(_setUnresponsiveIfTimeout) vmId=`19bafd0e-ef84-45f9-ad33-a28d79bfdbad`::monitor become unresponsive (command timeout, age=4326160.58)

(there's a time offset between the engine and that host for sure, btw)
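
Tying the two logs together: the engine flips a VM to 'NotResponding' when VDSM reports a negative 'monitorResponse' (comment 12), and VDSM reports that when the qemu monitor has not answered within a timeout, which is what the _setUnresponsiveIfTimeout warning records. An illustrative sketch of that kind of check (the names below are hypothetical, not vdsm's actual code):

  # Hypothetical sketch of a monitor-response timeout check.
  def set_unresponsive_if_timeout(vm, now, timeout=60):
      age = now - vm.last_monitor_response_time
      if age > timeout:
          # The engine maps a negative monitorResponse to 'NotResponding'.
          vm.monitor_response = -1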

Comment 14 Michael Burman 2016-08-30 08:42:45 UTC
Yes, these are the right logs; there is no offset between the host and the engine. Attaching new logs, 11:36:00-11:38:00, for sr-vm2.

2016-08-30 11:36:17,718 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-32) [627a0728] VM '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'WaitForLaunch' --> 'NotResponding'

2016-08-30 11:36:26,766 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (ForkJoinPool-1-worker-4) [] VM '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'NotResponding' --> 'Up'

Comment 15 Michael Burman 2016-08-30 08:44:09 UTC
Created attachment 1195765 [details]
New Logs

Comment 16 Francesco Romani 2016-08-30 11:16:08 UTC
(In reply to Michael Burman from comment #14)
> Yes it is the right logs, no offset between host and engine.. Attaching new
> logs 11:36:00-11:38:00 for (sr-vm2)
> 
> 2016-08-30 11:36:17,718 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
> (DefaultQuartzScheduler_Worker-32) [627a0728] VM
> '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'WaitForLaunch'
> --> 'NotResponding'
> 
> 2016-08-30 11:36:26,766 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
> (ForkJoinPool-1-worker-4) [] VM
> '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'NotResponding'
> --> 'Up'

This seems to be because the VM is taking (relatively) long to boot. We knew that could happen (https://gerrit.ovirt.org/#/c/61310/ - toward the end of the commit message).
Is this an issue?

Comment 17 Dan Kenigsberg 2016-08-30 11:41:38 UTC
mburman, if Francesco's suggestion is right, you should be seeing this state change for every VM that stays more than 60 seconds in WaitForLaunch, not only in VM that are paused by the external network provider hook. Is that the case?

Comment 18 Francesco Romani 2016-08-30 11:58:57 UTC
(In reply to Dan Kenigsberg from comment #17)
> mburman, if Francesco's suggestion is right, you should be seeing this state
> change for every VM that stays more than 60 seconds in WaitForLaunch, not
> only in VM that are paused by the external network provider hook. Is that
> the case?

After a deeper look, we could also have a race in startup (not related to patch 61310), which is more likely to happen with this hook deployed;
or, the other way around, this race is rarely seen in regular boots.

I've posted a *draft* to tackle this race: https://gerrit.ovirt.org/#/c/63008

please note: we could also have a similar race on shutdown.

Comment 19 Michael Burman 2016-08-30 14:28:20 UTC
(In reply to Dan Kenigsberg from comment #17)
> mburman, if Francesco's suggestion is right, you should be seeing this state
> change for every VM that stays more than 60 seconds in WaitForLaunch, not
> only in VM that are paused by the external network provider hook. Is that
> the case?

I don't know, I never saw this state before.

Comment 20 Dan Kenigsberg 2016-08-31 07:37:36 UTC
Do you always see the non-responding state when the VM is paused?

Would you edit /usr/libexec/vdsm/hooks/before_device_create/50_openstacknet
to have
  PAUSE_FLAG = 0  # libvirt.VIR_DOMAIN_START_PAUSED
and retry?
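
To spell out why this experiment works (a toy demonstration, not the hook's actual code): the hook ORs PAUSE_FLAG into the launch flags it writes back, so zeroing the constant turns the pause request into a no-op.

  import libvirt

  flags = 0  # whatever earlier hooks left in the launch-flags file

  PAUSE_FLAG = libvirt.VIR_DOMAIN_START_PAUSED
  assert (flags | PAUSE_FLAG) != flags  # paused start requested

  PAUSE_FLAG = 0  # the suggested edit
  assert (flags | PAUSE_FLAG) == flags  # no-op: the VM boots running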

Comment 21 Michael Burman 2016-08-31 08:09:44 UTC
I think so.. it goes from 'WaitForLaunch' > 'NotResponding' > 'Up'.


When editing /usr/libexec/vdsm/hooks/before_device_create/50_openstacknet
to have
  PAUSE_FLAG = 0  # libvirt.VIR_DOMAIN_START_PAUSED
as you asked, it goes from 'PoweringUp' > 'Up'.

Comment 22 Dan Kenigsberg 2016-08-31 10:20:11 UTC
Ok, so this bug can be verified. It's up to you / virt QE / virt dev whether a separate bug should be opened regarding the intermediate 'non responding' state (introduced in 4.0.4).

Comment 23 Michael Burman 2016-08-31 11:31:26 UTC
Ok, so based on the last comment, this bug is verified on 3.6.9-0.1.el6 and vdsm-4.17.34-1.el7ev.noarch.

Comment 25 errata-xmlrpc 2016-09-21 18:07:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1925.html

