Description of problem:
Suspending an "Up" VM (with a disk) turned the VM to Down state. On the REST API, right after the suspend VM command, the VM status changed to "paused", and then to "down".

Version-Release number of selected component (if applicable):
rhevm: 3.3.3-0.1000.534.c1e70ae.el6ev
vdsm: 4.13.2-0.17

How reproducible:
Occurred once on a Jenkins automated test, but not on the following build with the same versions.
http://jenkins-ci.eng.lab.tlv.redhat.com/view/0%20Unstable%203.3/job/rhevm_3.3_automation_coretools_two_hosts_restapi_vms_nfs_rest_factory/243/

Actual results:
VM turned to "down" state.

Expected results:
VM should turn to "suspended" state.

Additional info:
From engine.log:
2014-06-19 18:13:53,362 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-2) [466] Correlation ID: 466, Job ID: 474a9f3e-57e7-4379-b4a5-d2152c6e19f0, Call Stack: null, Custom Event ID: -1, Message: Suspending VM restvm_basic_operations was initiated by User admin@internal (Host: cinteg18.ci.lab.tlv.redhat.com).
2014-06-19 18:13:56,846 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-65) VM restvm_basic_operations 3cf81856-814f-49ab-9d63-62e88bf04634 moved from SavingState --> Paused
2014-06-19 18:13:59,945 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-69) START, DestroyVDSCommand(HostName = cinteg18.ci.lab.tlv.redhat.com, HostId = ef37e6f5-346b-4e4a-8ce2-7384b2401ded, vmId=3cf81856-814f-49ab-9d63-62e88bf04634, force=false, secondsToWait=0, gracefully=false), log id: 37379689
2014-06-19 18:13:59,975 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-69) FINISH, DestroyVDSCommand, log id: 37379689
2014-06-19 18:13:59,994 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-69) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM restvm_basic_operations is down. Exit message: SaveState succeeded
2014-06-19 18:13:59,996 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-69) vm restvm_basic_operations running in db and not running in vds - add to rerun treatment. vds cinteg18.ci.lab.tlv.redhat.com
2014-06-19 18:14:00,015 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (pool-4-thread-48) [38bb83c2] Running command: ProcessDownVmCommand internal: true.
Created attachment 911133 [details] logs
vdsm.log shows the VM was destroyed due to a libvirt connection error (domain not found); the libvirt log shows nothing at the time of the error:

Thread-545::DEBUG::2014-06-19 18:13:58,020::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no domain with matching uuid '3cf81856-814f-49ab-9d63-62e88bf04634'
Thread-545::DEBUG::2014-06-19 18:13:58,020::vm::2658::vm.Vm::(setDownStatus) vmId=`3cf81856-814f-49ab-9d63-62e88bf04634`::Changed state to Down: SaveState succeeded
Thread-545::DEBUG::2014-06-19 18:13:58,020::vmChannels::205::vds::(unregister) Delete fileno 7 from listener.
Thread-545::DEBUG::2014-06-19 18:13:58,020::sampling::292::vm.Vm::(stop) vmId=`3cf81856-814f-49ab-9d63-62e88bf04634`::Stop statistics collection
. . .
Thread-550::DEBUG::2014-06-19 18:13:59,959::BindingXMLRPC::975::vds::(wrapper) client [10.35.19.10]::call vmDestroy with ('3cf81856-814f-49ab-9d63-62e88bf04634',) {}
Thread-550::INFO::2014-06-19 18:13:59,959::API::324::vds::(destroy) vmContainerLock acquired by vm 3cf81856-814f-49ab-9d63-62e88bf04634
Thread-550::DEBUG::2014-06-19 18:13:59,959::vm::4555::vm.Vm::(destroy) vmId=`3cf81856-814f-49ab-9d63-62e88bf04634`::destroy Called
Thread-550::INFO::2014-06-19 18:13:59,959::vm::4501::vm.Vm::(releaseVm) vmId=`3cf81856-814f-49ab-9d63-62e88bf04634`::Release VM resources
Thread-550::WARNING::2014-06-19 18:13:59,959::vm::1852::vm.Vm::(_set_lastStatus) vmId=`3cf81856-814f-49ab-9d63-62e88bf04634`::trying to set state to Powering down when already Down
Thread-550::DEBUG::2014-06-19 18:13:59,960::sampling::292::vm.Vm::(stop) vmId=`3cf81856-814f-49ab-9d63-62e88bf04634`::Stop statistics collection
Thread-550::DEBUG::2014-06-19 18:13:59,961::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no domain with matching uuid '3cf81856-814f-49ab-9d63-62e88bf04634'
Thread-550::WARNING::2014-06-19 18:13:59,961::clientIF::363::vds::(teardownVolumePath) Drive is not a vdsm image: VOLWM_CHUNK_MB:1024 VOLWM_CHUNK_REPLICATE_MULT:2 VOLWM_FREE_PCT:...
Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 357, in teardownVolumePath
    res = self.irs.teardownImage(drive['domainID'],
  File "/usr/share/vdsm/vm.py", line 1432, in __getitem__
    raise KeyError(key)
KeyError: 'domainID'
Thread-550::DEBUG::2014-06-19 18:13:59,962::task::579::TaskManager.Task::(_updateState) Task=`47b80177-3111-4606-89a1-1d9092ff971d`::moving from state init -> state preparing
Thread-550::INFO::2014-06-19 18:13:59,963::logUtils::44::dispatcher::(wrapper) Run and protect: teardownImage(sdUUID='54848b17-486e-4be7-841f-8a4dde007a3a', spUUID='49614565-76c4-4426-bb7f-311624c6b48d', imgUUID='d62a50e0-1078-4998-8e02-06024b89da17', volUUID=None)
Thread-550::DEBUG::2014-06-19 18:13:59,963::resourceManager::197::ResourceManager.Request::(__init__) ResName=`Storage.54848b17-486e-4be7-841f-8a4dde007a3a`ReqID=`254c0526-68c0-4aa2-a28b-f54168f613f2`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '3319' at 'teardownImage'
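The KeyError in the traceback above happens because the drive being torn down is not a VDSM-managed image, so its dict-like object has no 'domainID' key. A minimal, hypothetical sketch (not the actual VDSM code) of the kind of guard that produces the "Drive is not a vdsm image" warning instead of crashing:

```python
# Hypothetical sketch: only tear down drives that carry all the keys
# identifying a VDSM-managed image; anything else (e.g. a CD-ROM path)
# is skipped, matching the "Drive is not a vdsm image" warning in the log.
def teardown_volume_path(drive):
    required = ("domainID", "poolID", "imageID", "volumeID")
    if not all(key in drive for key in required):
        return "skipped: not a vdsm image"
    # Stand-in for self.irs.teardownImage(...) in clientIF.py
    return "teardownImage(%s)" % drive["domainID"]

print(teardown_volume_path({"path": "/dev/cdrom"}))
print(teardown_volume_path({"domainID": "sd", "poolID": "sp",
                            "imageID": "img", "volumeID": "vol"}))
```

This warning is incidental to the bug itself; it only shows up because destroy runs against a VM that already released its resources.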
I think the issue is that the VM is moved to Paused after the hibernate call. The engine moves a VM to Suspended only if it moves from SavingState to Down; in this case the VM was Paused and then moved to Down, so it didn't fit the suspend flow. (Even though the exit message is 'SaveState succeeded', the engine doesn't parse or pay any attention to that message. We can improve this in 3.5 with the new exit reason.) But first we need to understand why the VM is Paused at all.
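A minimal sketch of the engine-side decision described above (function and state names are illustrative, not the actual engine code): Down only maps to Suspended when the previous status was SavingState, and the exit message is never consulted.

```python
# Illustrative sketch of the transition logic described in the comment above:
# the engine maps Down -> Suspended only when the prior status was SavingState.
def next_engine_status(previous, reported, exit_message):
    # Normal suspend flow: SavingState -> Down means the memory state was saved.
    if previous == "SavingState" and reported == "Down":
        return "Suspended"
    # Bug scenario: the VM slipped into Paused first, so the engine reports
    # plain Down, even though exit_message says "SaveState succeeded"
    # (the engine does not inspect the message at all).
    return reported

print(next_engine_status("SavingState", "Down", "SaveState succeeded"))  # Suspended
print(next_engine_status("Paused", "Down", "SaveState succeeded"))       # Down
```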
The VM goes Paused because we receive a 'Suspended' event from libvirt:

libvirtEventLoop::DEBUG::2014-06-19 18:13:56,785::vm::5039::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`3cf81856-814f-49ab-9d63-62e88bf04634`::event Suspended detail 0 opaque None
Thread-547::DEBUG::2014-06-19 18:13:56,808::BindingXMLRPC::975::vds::(wrapper) client [10.35.19.10]::call vmGetStats with ('3cf81856-814f-49ab-9d63-62e88bf04634',) {}
Thread-547::DEBUG::2014-06-19 18:13:56,810::BindingXMLRPC::982::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Paused', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'guestFQDN': '', 'pid': '5926', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': u'5901', 'timeOffset': '0', 'hash': '-8697249038581656653', 'balloonInfo': {'balloon_max': '1048576', 'balloon_target': '1048576', 'balloon_cur': '1048576', 'balloon_min': '1048576'}, 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {}, 'vmId': '3cf81856-814f-49ab-9d63-62e88bf04634', 'displayType': 'qxl', 'cpuUser': '0.15', 'disks': {u'hdc': {'readLatency': '0', 'apparentsize': '0', 'writeLatency': '0', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '0', 'writeRate': '0.00'}, u'hda': {'readLatency': '25747', 'apparentsize': '197120', 'writeLatency': '0', 'imageID': 'd62a50e0-1078-4998-8e02-06024b89da17', 'flushLatency': '0', 'readRate': '8.52', 'truesize': '143360', 'writeRate': '0.00'}}, 'monitorResponse': '0', 'statsAge': '0.47', 'elapsedTime': '66', 'vmType': 'kvm', 'cpuSys': '0.60', 'appsList': [], 'guestIPs': ''}]}
libvirtEventLoop::DEBUG::2014-06-19 18:13:56,811::vm::5039::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`3cf81856-814f-49ab-9d63-62e88bf04634`::event Suspended detail 0 opaque None

There is a comment in the VDSM sources that sheds some light on this:

    elif event == libvirt.VIR_DOMAIN_EVENT_SUSPENDED:
        self._guestCpuRunning = False
        if detail == libvirt.VIR_DOMAIN_EVENT_SUSPENDED_PAUSED:
            # Libvirt sometimes send the SUSPENDED/SUSPENDED_PAUSED event
            # after RESUMED/RESUMED_MIGRATED (when VM status is PAUSED
            # when migration completes, see qemuMigrationFinish function).
            # In this case self._dom is None because the function
            # _waitForIncomingMigrationFinish didn't update it yet.

Please note that 'detail 0' in the logs is indeed VIR_DOMAIN_EVENT_SUSPENDED_PAUSED:

    enum virDomainEventSuspendedDetailType {
        VIR_DOMAIN_EVENT_SUSPENDED_PAUSED = 0   /* Normal suspend due to admin pause */

(from http://libvirt.org/html/libvirt-libvirt.html#virDomainEventSuspendedDetailType)

I cannot yet explain why this happens sporadically, but this doesn't seem to be anything new. What is the clock skew between the VDSM and libvirt logs?
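A simplified, hypothetical sketch of the lifecycle-event handling quoted above (class and attribute names are illustrative, not the real VDSM `vm.Vm`): a SUSPENDED event with detail 0 marks the guest CPU as stopped and the VM as Paused, regardless of any in-flight hibernate.

```python
# libvirt constants as documented: virDomainEventType SUSPENDED = 3,
# virDomainEventSuspendedDetailType SUSPENDED_PAUSED = 0.
VIR_DOMAIN_EVENT_SUSPENDED = 3
VIR_DOMAIN_EVENT_SUSPENDED_PAUSED = 0

class Vm:
    """Illustrative stand-in for VDSM's vm.Vm lifecycle handling."""
    def __init__(self):
        self.guest_cpu_running = True
        self.status = "Up"

    def on_lifecycle_event(self, event, detail):
        # Mirrors the quoted snippet: any SUSPENDED event stops the guest CPU;
        # detail 0 (SUSPENDED_PAUSED) additionally flips the status to Paused.
        if event == VIR_DOMAIN_EVENT_SUSPENDED:
            self.guest_cpu_running = False
            if detail == VIR_DOMAIN_EVENT_SUSPENDED_PAUSED:
                self.status = "Paused"

vm = Vm()
vm.on_lifecycle_event(VIR_DOMAIN_EVENT_SUSPENDED, 0)  # the "event Suspended detail 0" from the log
print(vm.status)  # Paused
```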
In addition to the comment above: as part of the normal flow, VDSM sends the 'pause' command to the underlying VM, so this definitely seems to be the normal flow.
OK, after deeper investigation I found a possible explanation. As stated above, everything described so far is part of the normal flow. However, due to a particular interleaving of events, VDSM may receive the 'Suspended' notification from libvirt *before* it internally sets the 'Saving State' status. In that case the VM is reported as 'Paused', as demonstrated by the logs above. This is much more likely if, for example, the guest agent is slow to respond to the desktop-lock request; this specific case, however, looks like just a bad interleaving of events.
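The race described above can be sketched as follows (a toy model with hypothetical names, not the actual VDSM code). The assumption modeled here is that the 'Suspended' event only downgrades a VM that is still 'Up'; once 'Saving State' is set, the event is absorbed, so the outcome depends entirely on which side wins the race.

```python
# Toy model of the race: the libvirt 'Suspended' event only flips an 'Up'
# VM to Paused; a VM already in 'Saving State' keeps its status.
def handle_suspended_event(status):
    return "Paused" if status == "Up" else status

# Expected interleaving: the hibernate call sets 'Saving State' before the
# event arrives, so the engine later sees SavingState -> Down (=> Suspended).
print(handle_suspended_event("Saving State"))  # Saving State

# Bad interleaving (this bug): the event is processed while the VM is still
# 'Up', so the engine sees Paused -> Down and treats the VM as plain Down.
print(handle_suspended_event("Up"))  # Paused
```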
Lowering priority, as this is a race that only sometimes occurs in Jenkins automation.
Decreasing severity, as this doesn't affect whole-system functionality; only the affected VM is considered Down instead of Suspended (so you lose the saved state on Run).
candidate patch posted on gerrit
Occurred ONCE on 3.4 automation:
http://jenkins.qa.lab.tlv.redhat.com:8080/view/Compute/view/3.4-git/view/Dashboard/job/3.4-git-compute-virt-reg_vms-nfs/34/
Failing test case: reg_vms.reg_vms.063-BasicVmActionsREST;basic_vm_actions.BasicVmActionsREST.basic_vm_actions
Ilanit, can you explain how a 3.4 automation job is related to a 3.5.0 bug?
This bug was reported on 3.3 CI automation and occurred only once. It also occurred once, recently, on 3.4 automation (same test case). It was decided to fix it only in 3.5, because:
1. It is rare.
2. There is no "exit reason" on 3.3/3.4.
If it happens more often, or if we find it has a big impact on some flow, then a 3.3/3.4 fix can be reconsidered.
Fixed in vt3, moving to ON_QA. If you believe this bug isn't released in vt3, please report to rhev-integ.
Moving to Verified, since this bug is not seen on 3.5 runs. As this test case keeps running for subsequent 3.5 versions (currently vt4), it will be tracked for each new release.
RHEV-M 3.5.0 has been released