Bug 1111938 - Suspend VM ends up with VM in down status
Summary: Suspend VM ends up with VM in down status
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Francesco Romani
QA Contact: Ilanit Stein
URL:
Whiteboard: virt
Depends On:
Blocks: rhev3.5beta 1156165
 
Reported: 2014-06-22 10:31 UTC by Ilanit Stein
Modified: 2015-02-17 08:26 UTC
CC List: 13 users

Fixed In Version: vt3
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-17 08:26:21 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:


Attachments
logs (2.55 MB, application/x-bzip2)
2014-06-22 10:32 UTC, Ilanit Stein
no flags


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 29113 0 master MERGED vm: avoid race while setting guest cpu status Never
oVirt gerrit 32462 0 ovirt-3.5 MERGED vm: avoid race while setting guest cpu status Never

Description Ilanit Stein 2014-06-22 10:31:01 UTC
Description of problem:
Suspending an "Up" VM (with a disk attached) ends with the VM in "Down" state.
In the REST API, right after the suspend VM command, the VM status changes to "paused"
and then to "down".
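For context, a minimal reproduction sketch (not part of the original report) of what the flow above looks like against the 3.3-era REST API; the engine URL and credentials are hypothetical, and the VM id is the one from the logs below:

# Reproduction sketch only; ENGINE_API and AUTH are hypothetical placeholders.
import time
import requests

ENGINE_API = "https://rhevm.example.com/api"          # hypothetical
AUTH = ("admin@internal", "password")                 # hypothetical
VM_ID = "3cf81856-814f-49ab-9d63-62e88bf04634"        # VM id from the logs below
HEADERS = {"Content-Type": "application/xml"}

# Trigger the suspend action.
requests.post("%s/vms/%s/suspend" % (ENGINE_API, VM_ID),
              data="<action/>", headers=HEADERS, auth=AUTH, verify=False)

# Poll the VM status: expected to end in "suspended";
# observed in this bug: "paused" and then "down".
for _ in range(30):
    vm_xml = requests.get("%s/vms/%s" % (ENGINE_API, VM_ID),
                          auth=AUTH, verify=False).text
    print(vm_xml)     # inspect <status><state>...</state></status>
    time.sleep(2)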

Version-Release number of selected component (if applicable):
rhevm 3.3.3-0.1000.534.c1e70ae.el6ev
vdsm: 4.13.2-0.17

How reproducible:
Occurred once in a Jenkins automated test, but not in the following build with the same versions.

http://jenkins-ci.eng.lab.tlv.redhat.com/view/0%20Unstable%203.3/job/rhevm_3.3_automation_coretools_two_hosts_restapi_vms_nfs_rest_factory/243/

Actual results:
VM turned to "down" state

Expected results:
VM should move to "Suspended" state.

Additional info:
From engine.log:
2014-06-19 18:13:53,362 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-2) [466] Correlation ID: 466, Job ID: 474a9f3e-57e7-4379-b4a5-d2152c6e19f0, Call Stack: null, Custom Event ID: -1, Message: Suspending VM restvm_basic_operations was initiated by User admin@internal (Host: cinteg18.ci.lab.tlv.redhat.com).

2014-06-19 18:13:56,846 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-65) VM restvm_basic_operations 3cf81856-814f-49ab-9d63-62e88bf04634 moved from SavingState --> Paused

2014-06-19 18:13:59,945 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-69) START, DestroyVDSCommand(HostName = cinteg18.ci.lab.tlv.redhat.com, HostId = ef37e6f5-346b-4e4a-8ce2-7384b2401ded, vmId=3cf81856-814f-49ab-9d63-62e88bf04634, force=false, secondsToWait=0, gracefully=false), log id: 37379689
2014-06-19 18:13:59,975 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-69) FINISH, DestroyVDSCommand, log id: 37379689
2014-06-19 18:13:59,994 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-69) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM restvm_basic_operations is down. Exit message: SaveState succeeded
2014-06-19 18:13:59,996 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-69) vm restvm_basic_operations running in db and not running in vds - add to rerun treatment. vds cinteg18.ci.lab.tlv.redhat.com
2014-06-19 18:14:00,015 INFO  [org.ovirt.engine.core.bll.ProcessDownVmCommand] (pool-4-thread-48) [38bb83c2] Running command: ProcessDownVmCommand internal: true.

Comment 1 Ilanit Stein 2014-06-22 10:32:50 UTC
Created attachment 911133 [details]
logs

Comment 2 Ilanit Stein 2014-06-22 11:16:13 UTC
vdsm.log shows the VM was destroyed following a libvirt error (domain not found); the libvirt log shows nothing at the time of the error:

Thread-545::DEBUG::2014-06-19 18:13:58,020::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no domain with matching uuid '3cf81856-814f-49ab-9d63-62e88bf04634'
Thread-545::DEBUG::2014-06-19 18:13:58,020::vm::2658::vm.Vm::(setDownStatus) vmId=`3cf81856-814f-49ab-9d63-62e88bf04634`::Changed state to Down: SaveState succeeded
Thread-545::DEBUG::2014-06-19 18:13:58,020::vmChannels::205::vds::(unregister) Delete fileno 7 from listener.
Thread-545::DEBUG::2014-06-19 18:13:58,020::sampling::292::vm.Vm::(stop) vmId=`3cf81856-814f-49ab-9d63-62e88bf04634`::Stop statistics collection
. . .
Thread-550::DEBUG::2014-06-19 18:13:59,959::BindingXMLRPC::975::vds::(wrapper) client [10.35.19.10]::call vmDestroy with ('3cf81856-814f-49ab-9d63-62e88bf04634',) {}
Thread-550::INFO::2014-06-19 18:13:59,959::API::324::vds::(destroy) vmContainerLock acquired by vm 3cf81856-814f-49ab-9d63-62e88bf04634
Thread-550::DEBUG::2014-06-19 18:13:59,959::vm::4555::vm.Vm::(destroy) vmId=`3cf81856-814f-49ab-9d63-62e88bf04634`::destroy Called
Thread-550::INFO::2014-06-19 18:13:59,959::vm::4501::vm.Vm::(releaseVm) vmId=`3cf81856-814f-49ab-9d63-62e88bf04634`::Release VM resources
Thread-550::WARNING::2014-06-19 18:13:59,959::vm::1852::vm.Vm::(_set_lastStatus) vmId=`3cf81856-814f-49ab-9d63-62e88bf04634`::trying to set state to Powering down when already Down
Thread-550::DEBUG::2014-06-19 18:13:59,960::sampling::292::vm.Vm::(stop) vmId=`3cf81856-814f-49ab-9d63-62e88bf04634`::Stop statistics collection
Thread-550::DEBUG::2014-06-19 18:13:59,961::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no domain with matching uuid '3cf81856-814f-49ab-9d63-62e88bf04634'
Thread-550::WARNING::2014-06-19 18:13:59,961::clientIF::363::vds::(teardownVolumePath) Drive is not a vdsm image: VOLWM_CHUNK_MB:1024 VOLWM_CHUNK_REPLICATE_MULT:2 VOLWM_FREE_PCT:...
Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 357, in teardownVolumePath
    res = self.irs.teardownImage(drive['domainID'],
  File "/usr/share/vdsm/vm.py", line 1432, in __getitem__
    raise KeyError(key)
KeyError: 'domainID'
Thread-550::DEBUG::2014-06-19 18:13:59,962::task::579::TaskManager.Task::(_updateState) Task=`47b80177-3111-4606-89a1-1d9092ff971d`::moving from state init -> state preparing
Thread-550::INFO::2014-06-19 18:13:59,963::logUtils::44::dispatcher::(wrapper) Run and protect: teardownImage(sdUUID='54848b17-486e-4be7-841f-8a4dde007a3a', spUUID='49614565-76c4-4426-bb7f-311624c6b48d', imgUUID='d62a50e0-1078-4998-8e02-06024b89da17', volUUID=None)
Thread-550::DEBUG::2014-06-19 18:13:59,963::resourceManager::197::ResourceManager.Request::(__init__) ResName=`Storage.54848b17-486e-4be7-841f-8a4dde007a3a`ReqID=`254c0526-68c0-4aa2-a28b-f54168f613f2`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '3319' at 'teardownImage'

Comment 4 Omer Frenkel 2014-06-22 12:57:13 UTC
I think the issue is that the VM moved to Paused after the hibernate call.
The engine moves a VM to Suspended only if it goes from SavingState to Down.
In this case the VM was Paused and then moved to Down, so it didn't fit the suspend flow (even though the exit message is 'SaveState succeeded', the engine doesn't parse or pay any attention to that message).
We can improve this in 3.5 with the new exit reason, but here we need to understand why the VM was Paused at all.
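To illustrate the rule described above, a toy sketch of the decision (the real logic lives in ovirt-engine's Java code; this is only a model):

# Toy model of the rule above, not the actual ovirt-engine (Java) code.
def engine_status_on_down(previous_status):
    # A VM that goes Down is treated as a successful suspend only when its
    # previous status was SavingState.
    if previous_status == "SavingState":
        return "Suspended"
    # Any other previous status (e.g. Paused) does not fit the suspend flow.
    return "Down"

# Expected flow: SavingState -> Down  => Suspended
print(engine_status_on_down("SavingState"))   # Suspended
# Flow seen in the logs: SavingState -> Paused -> Down  => Down
print(engine_status_on_down("Paused"))        # Down, saved state is lost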

Comment 5 Francesco Romani 2014-06-23 11:00:02 UTC
The VM goes Paused because we receive a 'Suspended' event from libvirt:

libvirtEventLoop::DEBUG::2014-06-19 18:13:56,785::vm::5039::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`3cf81856-814f-49ab-9d63-62e88bf04634`::event Suspended detail 0 opaque None
Thread-547::DEBUG::2014-06-19 18:13:56,808::BindingXMLRPC::975::vds::(wrapper) client [10.35.19.10]::call vmGetStats with ('3cf81856-814f-49ab-9d63-62e88bf04634',) {}
Thread-547::DEBUG::2014-06-19 18:13:56,810::BindingXMLRPC::982::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Paused', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'guestFQDN': '', 'pid': '5926', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': u'5901', 'timeOffset': '0', 'hash': '-8697249038581656653', 'balloonInfo': {'balloon_max': '1048576', 'balloon_target': '1048576', 'balloon_cur': '1048576', 'balloon_min': '1048576'}, 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {}, 'vmId': '3cf81856-814f-49ab-9d63-62e88bf04634', 'displayType': 'qxl', 'cpuUser': '0.15', 'disks': {u'hdc': {'readLatency': '0', 'apparentsize': '0', 'writeLatency': '0', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '0', 'writeRate': '0.00'}, u'hda': {'readLatency': '25747', 'apparentsize': '197120', 'writeLatency': '0', 'imageID': 'd62a50e0-1078-4998-8e02-06024b89da17', 'flushLatency': '0', 'readRate': '8.52', 'truesize': '143360', 'writeRate': '0.00'}}, 'monitorResponse': '0', 'statsAge': '0.47', 'elapsedTime': '66', 'vmType': 'kvm', 'cpuSys': '0.60', 'appsList': [], 'guestIPs': ''}]}
libvirtEventLoop::DEBUG::2014-06-19 18:13:56,811::vm::5039::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`3cf81856-814f-49ab-9d63-62e88bf04634`::event Suspended detail 0 opaque None

There is a comment in the VDSM sources that sheds some light on this:

        elif event == libvirt.VIR_DOMAIN_EVENT_SUSPENDED:
            self._guestCpuRunning = False
            if detail == libvirt.VIR_DOMAIN_EVENT_SUSPENDED_PAUSED:
                # Libvirt sometimes send the SUSPENDED/SUSPENDED_PAUSED event
                # after RESUMED/RESUMED_MIGRATED (when VM status is PAUSED
                # when migration completes, see qemuMigrationFinish function).
                # In this case self._dom is None because the function
                # _waitForIncomingMigrationFinish didn't update it yet.

Please note that 'detail 0' in the logs is indeed VIR_DOMAIN_EVENT_SUSPENDED_PAUSED:

enum virDomainEventSuspendedDetailType {
    VIR_DOMAIN_EVENT_SUSPENDED_PAUSED = 0   /* Normal suspend due to admin pause */

(from http://libvirt.org/html/libvirt-libvirt.html#virDomainEventSuspendedDetailType)
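A quick way to confirm the mapping (a sketch, assuming the libvirt Python bindings are available):

import libvirt

# Map numeric 'detail' values of the SUSPENDED lifecycle event back to their
# libvirt constant names.
suspended_details = {
    getattr(libvirt, name): name
    for name in dir(libvirt)
    if name.startswith("VIR_DOMAIN_EVENT_SUSPENDED_")
}
print(suspended_details[0])    # VIR_DOMAIN_EVENT_SUSPENDED_PAUSED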

I cannot yet explain why this happens sporadically, but it doesn't seem to be anything new. What is the clock skew between the VDSM and libvirt logs?

Comment 6 Francesco Romani 2014-06-23 11:13:39 UTC
In addition to the comment above: as part of the normal flow, VDSM sends the 'pause' command to the underlying VM, so this definitely seems to be the normal flow.

Comment 7 Francesco Romani 2014-06-23 11:38:39 UTC
OK, after deeper investigation I found a possible explanation.

As stated above, everything described so far is part of the normal flow. However, due to a particular interleaving of events, VDSM may receive the 'Suspended' notification from libvirt *before* it internally sets the 'Saving State' status.

In that case the VM is reported as 'Paused', which is exactly what happened here, as the logs above demonstrate.

This can happen, for example, when the guest agent is slow to respond to the desktop lock request, which makes this behaviour much more likely; this specific case, however, looks like just an unlucky interleaving of events.
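To make the interleaving concrete, a minimal, hypothetical model of the race (not VDSM code; names are made up for illustration):

class FakeVm(object):
    def __init__(self):
        self.status = "Up"

    def begin_hibernate(self):
        # Suspend flow: mark the VM as saving state, then let libvirt save it.
        self.status = "Saving State"

    def on_suspended_event(self):
        # libvirt SUSPENDED/SUSPENDED_PAUSED handler: if the suspend flow has
        # not yet marked the VM, the stopped guest CPU looks like a plain pause.
        if self.status != "Saving State":
            self.status = "Paused"

    def get_stats(self):
        # What a vmGetStats poll from the engine would observe.
        return self.status

vm = FakeVm()
vm.on_suspended_event()   # the event is delivered before the status switch
print(vm.get_stats())     # "Paused" -- the engine records this transition, so
                          # the later Down is not recognized as a suspend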

Comment 8 Omer Frenkel 2014-06-23 12:27:32 UTC
Lowering priority as this is a race that only sometimes occurs in Jenkins automation.

Comment 9 Michal Skrivanek 2014-06-24 07:11:25 UTC
Decreasing severity as this doesn't affect overall system functionality; only the affected VM is considered Down instead of Suspended (so you lose the saved state on the next Run).

Comment 10 Francesco Romani 2014-06-24 11:26:36 UTC
Candidate patch posted on gerrit.

Comment 13 Ilanit Stein 2014-08-14 09:22:38 UTC
Occurred ONCE in 3.4 automation:

http://jenkins.qa.lab.tlv.redhat.com:8080/view/Compute/view/3.4-git/view/Dashboard/job/3.4-git-compute-virt-reg_vms-nfs/34/

Failing test case:
reg_vms.reg_vms.063-BasicVmActionsREST;basic_vm_actions.BasicVmActionsREST.basic_vm_actions

Comment 14 Eyal Edri 2014-08-17 07:39:58 UTC
Ilanit, can you explain how a 3.4 automation job is related to a 3.5.0 bug?

Comment 15 Ilanit Stein 2014-08-19 08:58:04 UTC
This bug was reported on 3.3 CI automation, and occurred only once.
It also occurred once, recently, on 3.4 automation (the same test case).

It was decided to fix it only for 3.5 because:
1. It is rare.
2. There is no "exit reason" on 3.3/3.4.

If it happens more often, or if we find it has a big impact on some flow, then a 3.3/3.4 fix can be reconsidered.

Comment 16 Eyal Edri 2014-09-10 20:21:38 UTC
Fixed in vt3, moving to ON_QA.
If you believe this bug isn't released in vt3, please report it to rhev-integ.

Comment 17 Ilanit Stein 2014-10-02 08:32:50 UTC
Moving to Verified since this bug has not been seen in 3.5 runs. As this test case keeps running for subsequent 3.5 versions (currently vt4), it will be tracked for each new release.

Comment 18 Omer Frenkel 2015-02-17 08:26:21 UTC
RHEV-M 3.5.0 has been released

