Bug 1314371
Summary: Resolve race when using external network providers hook

Product: Red Hat Enterprise Virtualization Manager | Reporter: Yaniv Lavi <ylavi>
Component: vdsm | Assignee: Marcin Mirecki <mmirecki>
Status: CLOSED ERRATA | QA Contact: Meni Yakove <myakove>
Severity: high | Docs Contact:
Priority: medium
Version: 3.6.0 | CC: ahadas, bazulay, bgraveno, bugs, danken, fromani, gklein, lsurette, mburman, mmirecki, myakove, srevivo, ycui, ykaul
Target Milestone: ovirt-3.6.9 | Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: | Doc Type: Bug Fix
Doc Text:
This update adds functionality that enables a hook implementer to change the virtual machine state from start-up to paused, and later resume it.
This ensures that operations requiring a virtual machine to start in the paused state can complete. For example, when attaching an external NIC, the virtual machine should not start running until the NIC has been plugged into the virtual machine and the entire NIC infrastructure outside the virtual machine is in place.
The paused state can be set in the following two hooks:
- before_device_create
- before_vm_start
Resuming the virtual machine can be done in the following hooks:
- after_device_create
- after_vm_start
The flags that control the virtual machine state can be read and set using the following functions in the hooking library (a usage sketch follows the field list below):
- load_vm_launch_flags_from_file(vm_id)
  Gets the current state of the flags, including changes made by other hooks.
- dump_vm_launch_flags_to_file(vm_id, flags)
  Writes the new state of the launch flags; all subsequent hooks will see the modified state.
Story Points: ---
Clone Of: | Environment:
Last Closed: 2016-09-21 18:07:02 UTC | Type: Bug
Regression: --- | Mount Type: ---
Documentation: --- | CRM:
Verified Versions: | Category: ---
oVirt Team: Network | RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- | Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1314375
Attachments: logs (1195053), New Logs (1195765)
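As a usage sketch for the launch-flags API documented in the Doc Text above: the hooking functions below are the documented ones, but this is an illustration, not the shipped 50_openstacknet hook, and how the VM UUID reaches the hook (shown here as a hypothetical `vmId` environment variable) is an assumption.

```python
# before_device_create sketch: ask vdsm to create the VM paused, so the
# external NIC can be fully wired up before the guest starts running.
import os

import libvirt
import hooking  # vdsm hooking library, available to vdsm hook scripts

# Assumption: the VM UUID is exposed to the hook; 'vmId' is a hypothetical
# environment variable name used here for illustration only.
vm_id = os.environ['vmId']

# Read the flags as left by any earlier hooks, set the pause bit, write back.
flags = hooking.load_vm_launch_flags_from_file(vm_id)
flags |= libvirt.VIR_DOMAIN_START_PAUSED
hooking.dump_vm_launch_flags_to_file(vm_id, flags)
```

```python
# after_device_create sketch: the NIC infrastructure is in place, so clear
# the pause bit; per the Doc Text, this is one of the hooks in which the
# VM may be resumed.
import os

import libvirt
import hooking

vm_id = os.environ['vmId']  # same hypothetical variable as above

flags = hooking.load_vm_launch_flags_from_file(vm_id)
if flags & libvirt.VIR_DOMAIN_START_PAUSED:
    hooking.dump_vm_launch_flags_to_file(
        vm_id, flags & ~libvirt.VIR_DOMAIN_START_PAUSED)
```

Clearing the bit in an after_* hook is, per the Doc Text, what allows the virtual machine to be resumed once the external NIC wiring is complete.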
Description
Yaniv Lavi, 2016-03-03 13:12:58 UTC
Can you please help reproduce with Neutron? We need some fix for 3.6 and a better one for 4.0.

after_vm_start is not shipped in the rpm, and therefore the VM is stuck in the paused state. Nothing waits for 15 seconds to continue it.

*** Bug 1357386 has been marked as a duplicate of this bug. ***

Should this be MODIFIED?

Hi, when starting the VM, it first goes to the 'not responding' state (question mark); after 15-30 seconds the VM continues to start up. Is this the expected behavior? In virsh it is reported as paused for a few seconds and then running.

'not responding' is an ugly VM state, where qemu does not respond to Vdsm. This issue is unrelated to this bug, but it is worth researching. Can you share vdsm.log and engine.log from the 'not responding' time?

Created attachment 1195053 [details]
logs
From 17:19:00 to 17:20:20 in the logs.

I did not observe 'not responding' on my setup; it was usually showing 'starting' in most cases (only sometimes noticing the paused state). I will look at this.

Arik, can you tell why Engine deemed the VM non-responding? I see nothing in vdsm.log (comment 8) hinting at that.

(In reply to Dan Kenigsberg from comment #11)
The only trigger for setting a VM's status to non-responding is when the engine gets the VM from VDSM with a negative value of 'monitorResponse'. I believe the attached vdsm log does not cover the relevant period of time. The VM is set by the monitoring to non-responding not long after its creation:

2016-08-28 11:21:08,775 ... [5b802702] VM v2 was started by admin@internal (Host: silver-vdsa.qa.lab.tlv.redhat.com).
2016-08-28 11:21:14,972 ... Fetched 1 VMs from VDS 'b477701d-6559-41b1-ae96-2b5dfa6091f6'
2016-08-28 11:21:14,978 ... VM '7ebb5925-e76f-4d2f-a148-7671c313fe84'(v2) moved from 'WaitForLaunch' --> 'NotResponding'
2016-08-28 11:21:15,378 ... Message: VM v2 is not responding.

The VM creation was done at ~11:21:08 and its correlation id was 5b802702. The VDSM log (a) does not cover this time (though there might be some offset between the engine and the host); and (b) I can't find this correlation id in the VDSM log. Michael, can you please double check that it is the right VDSM log?

(In reply to Arik from comment #12)
Ah ok, I looked at the wrong VM. We're probably talking about VM 'sr-vm2' that became non-responding on:

2016-08-28 17:19:15,856 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-54) [] VM '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'WaitForLaunch' --> 'NotResponding'

And in the VDSM log I see:

jsonrpc.Executor/1::WARNING::2016-08-28 17:19:10,129::vm::5177::virt.vm::(_setUnresponsiveIfTimeout) vmId=`19bafd0e-ef84-45f9-ad33-a28d79bfdbad`::monitor become unresponsive (command timeout, age=4326160.58)

(There's a time offset between the engine and that host for sure, btw.)

Yes, it is the right logs; no offset between host and engine. Attaching new logs 11:36:00-11:38:00 for (sr-vm2):

2016-08-30 11:36:17,718 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-32) [627a0728] VM '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'WaitForLaunch' --> 'NotResponding'
2016-08-30 11:36:26,766 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (ForkJoinPool-1-worker-4) [] VM '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'NotResponding' --> 'Up'

Created attachment 1195765 [details]
New Logs
(In reply to Michael Burman from comment #14)
> Yes, it is the right logs; no offset between host and engine. Attaching new
> logs 11:36:00-11:38:00 for (sr-vm2):
>
> 2016-08-30 11:36:17,718 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
> (DefaultQuartzScheduler_Worker-32) [627a0728] VM
> '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'WaitForLaunch'
> --> 'NotResponding'
>
> 2016-08-30 11:36:26,766 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
> (ForkJoinPool-1-worker-4) [] VM
> '19bafd0e-ef84-45f9-ad33-a28d79bfdbad'(sr-vm2) moved from 'NotResponding'
> --> 'Up'

Seems like it is because the VM takes (relatively) long to boot. We knew that could happen (see https://gerrit.ovirt.org/#/c/61310/, toward the end of the commit message). Is this an issue?

mburman, if Francesco's suggestion is right, you should be seeing this state change for every VM that stays more than 60 seconds in WaitForLaunch, not only in VMs that are paused by the external network provider hook. Is that the case?

(In reply to Dan Kenigsberg from comment #17)
> mburman, if Francesco's suggestion is right, you should be seeing this state
> change for every VM that stays more than 60 seconds in WaitForLaunch, not
> only in VMs that are paused by the external network provider hook. Is that
> the case?

After a deeper look, we could also have a race in startup (not related to patch 61310), which is more likely to happen with this hook deployed; or, the other way around, this race is rarely seen in regular boots. I've posted a *draft* to tackle this race: https://gerrit.ovirt.org/#/c/63008
Please note: we could also have a similar race on shutdown.

(In reply to Dan Kenigsberg from comment #17)
> mburman, if Francesco's suggestion is right, you should be seeing this state
> change for every VM that stays more than 60 seconds in WaitForLaunch, not
> only in VMs that are paused by the external network provider hook. Is that
> the case?

Don't know, I never saw this state before.

Do you always see the non-responding state when paused? Would you edit /usr/libexec/vdsm/hooks/before_device_create/50_openstacknet to have PAUSE_FLAG = 0 # libvirt.VIR_DOMAIN_START_PAUSED and retry?

I think so; it's going from 'waiting for launch' > 'not responding' > 'up'. When editing /usr/libexec/vdsm/hooks/before_device_create/50_openstacknet to have PAUSE_FLAG = 0 # libvirt.VIR_DOMAIN_START_PAUSED as you asked, it goes from 'powering up' > 'up'.

Ok, so this bug can be verified. It's up to you / virt QE / virt dev whether a separate bug regarding the intermediate 'non responding' state (introduced in 4.0.4) should be opened.

Ok, so based on the last comment this bug is verified on 3.6.9-0.1.el6 and vdsm-4.17.34-1.el7ev.noarch.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1925.html
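For reference, the PAUSE_FLAG toggle used in the verification comments above amounts to gating the flag manipulation on a module-level constant. Below is a rough sketch of that pattern, under the same assumptions as the earlier example; the shipped 50_openstacknet may structure this differently.

```python
import libvirt
import hooking  # vdsm hooking library

# Set to 0 (as QE did above) to skip the start-paused behavior entirely;
# the VM then boots straight through 'powering up' to 'up'.
PAUSE_FLAG = libvirt.VIR_DOMAIN_START_PAUSED


def set_launch_paused(vm_id):
    # No-op when PAUSE_FLAG is 0, which disables the pause/resume path.
    if PAUSE_FLAG:
        flags = hooking.load_vm_launch_flags_from_file(vm_id)
        hooking.dump_vm_launch_flags_to_file(vm_id, flags | PAUSE_FLAG)
```

With PAUSE_FLAG set to 0 the function does nothing, which matches the 'powering up' > 'up' transition observed once the pause behavior was disabled.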