Bug 1325938 - VM stays in 'powering down' after stopping VM
Summary: VM stays in 'powering down' after stopping VM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 3.6.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.0.0-beta
Target Release: 4.0.0
Assignee: Arik
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-11 12:56 UTC by Israel Pinto
Modified: 2016-08-01 12:26 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-01 12:26:26 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-4.0.0+
rule-engine: planning_ack+
rule-engine: devel_ack+
mavital: testing_ack+


Attachments
engine_log (2.15 MB, application/zip) - 2016-04-11 12:56 UTC, Israel Pinto
host_1_logs (4.21 MB, application/zip) - 2016-04-11 12:57 UTC, Israel Pinto
host_2_logs (3.65 MB, application/zip) - 2016-04-11 12:58 UTC, Israel Pinto
vm_tab_screenshot (193.67 KB, image/png) - 2016-04-11 13:00 UTC, Israel Pinto


Links
oVirt gerrit 57230 (master, NEW): core: prevent host in maintenance while vm runs on it in db - last updated 2016-05-09 08:46:21 UTC

Description Israel Pinto 2016-04-11 12:56:13 UTC
RHEVM: 3.6.5-2
Host:
OS Version: RHEL - 7.2 - 9.el7_2.1
Kernel Version: 3.10.0 - 327.13.1.el7.x86_64
KVM Version: 2.3.0 - 31.el7_2.10
LIBVIRT Version: libvirt-1.2.17-13.el7_2.4
VDSM Version: vdsm-4.17.25-0.el7ev
SPICE Version: 0.12.4 - 15.el7

The VM stays in 'powering down' and the host is in Maintenance.
We ran the following scenario:
1. Create VM (name: migration_vm, vmId='4e64a544-57d9-470d-bfc2-214f1df85f3f')
2. [Case 1] Migrate the VM (from host_mixed_1 to host_mixed_2) - Case passed
3. [Case 2] Put the host (host_mixed_2) into maintenance - Case failed due to BZ 1303994
    Correlation-Id: hosts_syncAction_b9bbd1b9-574e-4060 
    Engine log:
    2016-04-11 01:06:28,364 INFO  [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-12) [] Received a spice Device without an address when processing VM eef7ef67-8ae9-4400-a842-86630305e0ad devices, skipping device: {tlsPort=5903, specParams={fileTransferEnable=true, copyPasteEnable=true, displayIp=10.35.117.72, displayNetwork=ovirtmgmt}, deviceType=graphics, deviceId=e30429fe-1c4b-41e4-8123-2593f1fae3c1, device=spice, type=graphics, port=5902}
    2016-04-11 01:06:28,425 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-12) [] Rerun VM '4e64a544-57d9-470d-bfc2-214f1df85f3f'. Called from VDS 'host_mixed_2'
    2016-04-11 01:06:28,440 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand] (org.ovirt.thread.pool-6-thread-46) [] START, MigrateStatusVDSCommand(HostName = host_mixed_2, MigrateStatusVDSCommandParameters:{runAsync='true', hostId='bf676270-3a37-463e-9824-0568f1f8b51d', vmId='4e64a544-57d9-470d-bfc2-214f1df85f3f'}), log id: 530ee11d
    2016-04-11 01:06:29,445 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand] (org.ovirt.thread.pool-6-thread-46) [] FINISH, MigrateStatusVDSCommand, log id: 530ee11d
    2016-04-11 01:06:29,489 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-46) [] Correlation ID: 15733274, Job ID: 06a86648-409e-4bd2-949d-91e2a9532693, Call Stack: null, Custom Event ID: -1, Message: Failed to migrate VM migration_vm to Host host_mixed_1 . Trying to migrate to another Host.
    2016-04-11 01:06:29,599 WARN  [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (org.ovirt.thread.pool-6-thread-46) [] CanDoAction of action 'InternalMigrateVm' failed for user admin@internal. Reasons: VAR__ACTION__MIGRATE,VAR__TYPE__VM,VAR__ACTION__MIGRATE,VAR__TYPE__VM,SCHEDULING_NO_HOSTS
    2016-04-11 01:06:29,600 INFO  [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (org.ovirt.thread.pool-6-thread-46) [] Lock freed to object 'EngineLock:{exclusiveLocks='[4e64a544-57d9-470d-bfc2-214f1df85f3f=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName migration_vm>]', sharedLocks='null'}'
    2016-04-11 01:06:29,619 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-46) [] Correlation ID: 15733274, Job ID: 06a86648-409e-4bd2-949d-91e2a9532693, Call Stack: null, Custom Event ID: -1, Message: Migration failed  while Host is in 'preparing for maintenance' state.
      Consider manual intervention: stopping/migrating Vms as Host's state will not
      turn to maintenance while VMs are still running on it.(VM: migration_vm, Source: host_mixed_2, Destination: <UNKNOWN>).
    2016-04-11 01:06:42,482 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-36) [] RefreshVmList VM id 'eef7ef67-8ae9-4400-a842-86630305e0ad' is migrating to VDS 'host_mixed_1' ignoring it in the refresh until migration is done
    
4. [Case 3] Start migration of the VM and cancel it (with migration_vm) - Case passed
    Correlation-Id: vms_syncAction_5bb55d5b-0956-4655
    Engine_log:
    2016-04-11 01:09:49,858 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-3) [vms_syncAction_5bb55d5b-0956-4655] Correlation ID: vms_syncAction_5bb55d5b-0956-4655, Call Stack: null, Custom Event ID: -1, Message: Migration cancelled (VM: migration_vm, Source: host_mixed_2, User: admin@internal).
    2016-04-11 01:09:50,630 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-9) [] Failed to destroy VM '4e64a544-57d9-470d-bfc2-214f1df85f3f' because VM does not exist, ignoring
    2016-04-11 01:09:50,631 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-9) [] FINISH, DestroyVDSCommand, log id: 7e3db1a7
    2016-04-11 01:09:50,634 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (ForkJoinPool-1-worker-9) [] RefreshVmList VM id '4e64a544-57d9-470d-bfc2-214f1df85f3f' status = 'Down' on VDS 'host_mixed_1' ignoring it in the refresh until migration is done
    2016-04-11 01:09:52,889 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-51) [7eb2a483] VM '4e64a544-57d9-470d-bfc2-214f1df85f3f'(migration_vm) moved from 'MigratingFrom' --> 'Up'
    ---- Why rerun the VM? ----
    2016-04-11 01:09:52,889 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-51) [7eb2a483] Adding VM '4e64a544-57d9-470d-bfc2-214f1df85f3f' to re-run list
    2016-04-11 01:09:52,932 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-51) [7eb2a483] Rerun VM '4e64a544-57d9-470d-bfc2-214f1df85f3f'. Called from VDS 'host_mixed_2'

5. TearDown: 
          a. Stop migration_vm - failed to stop, no error in the log
          Correlation-Id: vms_syncAction_99213f1c-149b-4716 
          Engine_log:
          2016-04-11 01:10:06,651 INFO  [org.ovirt.engine.core.bll.StopVmCommand] (ajp-/127.0.0.1:8702-4) [vms_syncAction_99213f1c-149b-4716] Running command: StopVmCommand internal: false. Entities affected :  ID: 4e64a544-57d9-470d-bfc2-214f1df85f3f Type: VMAction group STOP_VM with role type USER
          2016-04-11 01:10:06,668 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (ajp-/127.0.0.1:8702-4) [vms_syncAction_99213f1c-149b-4716] START, DestroyVmVDSCommand( DestroyVmVDSCommandParameters:{runAsync='true', hostId='bf676270-3a37-463e-9824-0568f1f8b51d', vmId='4e64a544-57d9-470d-bfc2-214f1df85f3f', force='false', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 5054f31d
          2016-04-11 01:10:06,694 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ajp-/127.0.0.1:8702-4) [vms_syncAction_99213f1c-149b-4716] START, DestroyVDSCommand(HostName = host_mixed_2, DestroyVmVDSCommandParameters:{runAsync='true', hostId='bf676270-3a37-463e-9824-0568f1f8b51d', vmId='4e64a544-57d9-470d-bfc2-214f1df85f3f', force='false', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 12fa9d89
          2016-04-11 01:10:07,936 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ajp-/127.0.0.1:8702-4) [vms_syncAction_99213f1c-149b-4716] FINISH, DestroyVDSCommand, log id: 12fa9d89
          2016-04-11 01:10:07,948 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (ajp-/127.0.0.1:8702-4) [vms_syncAction_99213f1c-149b-4716] FINISH, DestroyVmVDSCommand, return: PoweringDown, log id: 5054f31d
          2016-04-11 01:10:07,969 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-4) [vms_syncAction_99213f1c-149b-4716] Correlation ID: vms_syncAction_99213f1c-149b-4716, Job ID: 662f1bb6-d6cd-4b9e-96b2-316ec3a18d4e, Call Stack: null, Custom Event ID: -1, Message: VM migration_vm powered off by admin@internal (Host: host_mixed_2) (Reason:  No reason was returned for this operation failure. See logs for further details.).
          2016-04-11 01:10:09,101 WARN  [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-99) [51bfe58b] skipping VM '4e64a544-57d9-470d-bfc2-214f1df85f3f' from this monitoring cycle - the VM data has changed since fetching the data
          b. Remove migration_vm - failed to remove the VM
          Correlation-Id: vms_delete_17e29614-08c1-434f
          Engine_log:
   		  2016-04-11 01:25:11,230 WARN  [org.ovirt.engine.core.bll.RemoveVmCommand] (ajp-/127.0.0.1:8702-4) [vms_delete_17e29614-08c1-434f] CanDoAction of action 'RemoveVm' failed for user admin@internal. Reasons: VAR__ACTION__REMOVE,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_IS_RUNNING
		  2016-04-11 01:25:11,230 INFO  [org.ovirt.engine.core.bll.RemoveVmCommand] (ajp-/127.0.0.1:8702-4) [vms_delete_17e29614-08c1-434f] Lock freed to object 'EngineLock:{exclusiveLocks='[4e64a544-57d9-470d-bfc2-214f1df85f3f=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
		  2016-04-11 01:25:11,246 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (ajp-/127.0.0.1:8702-4) [] Operation Failed: [Cannot remove VM. VM is running.]

After the test finished, the VM was still in 'powering down' and host_mixed_2 was in Maintenance.

Comment 1 Israel Pinto 2016-04-11 12:56:55 UTC
Created attachment 1145991 [details]
engine_log

Comment 2 Israel Pinto 2016-04-11 12:57:41 UTC
Created attachment 1145992 [details]
host_1_logs

Comment 3 Israel Pinto 2016-04-11 12:58:45 UTC
Created attachment 1145993 [details]
host_2_logs

Comment 4 Israel Pinto 2016-04-11 13:00:06 UTC
Created attachment 1145994 [details]
vm_tab_screenshot

Comment 5 Arik 2016-04-12 08:33:25 UTC
The major problem here is that the host was switched to maintenance while the engine thinks a VM is still running on that host.

The engine puts the VM in the powering-down state and expects it to go down or to stop being reported - but the fact that the host switched to maintenance means the host is no longer monitored, and thus the VM will be stuck in this state until the host is activated.

The engine switches the host to maintenance because it thinks no VM is running on the host - this is determined according to the number of VMs reported by VDSM, and VDSM reports that no VM is running on the host, which is true.

So we should probably change the check to count the VMs running on the host according to the *database* instead of the report from VDSM.
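A minimal sketch of that idea in engine-style Java. This is illustrative only, not the actual patch from gerrit 57230; the lookup interface and the getAllRunningForVds() method name are assumptions for the sketch, not confirmed engine names:

    import java.util.List;
    import org.ovirt.engine.core.compat.Guid;

    public class MaintenanceCheckSketch {

        // Hypothetical lookup: the VMs the *database* currently places
        // on the given host, regardless of what VDSM reports.
        public interface RunningVmsLookup {
            List<Guid> getAllRunningForVds(Guid hostId);
        }

        private final RunningVmsLookup vmsLookup;

        public MaintenanceCheckSketch(RunningVmsLookup vmsLookup) {
            this.vmsLookup = vmsLookup;
        }

        // True only when the DB shows no VM on the host. A VM stuck in
        // PoweringDown keeps the host in PreparingForMaintenance instead
        // of letting it switch while the engine still tracks a VM there.
        public boolean canSwitchToMaintenance(Guid hostId) {
            return vmsLookup.getAllRunningForVds(hostId).isEmpty();
        }
    }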

Comment 6 Michal Skrivanek 2016-04-12 14:13:26 UTC
Sounds like a race, rare enough to wait for 4.0. It might have become exposed more often due to the VM status events introduced in 3.6.

Comment 7 Sandro Bonazzola 2016-05-02 09:49:03 UTC
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.

Comment 8 Israel Pinto 2016-07-21 09:01:25 UTC
Verified with:
RHEVM Version: 4.0.2-0.2.rc1.el7ev
Hosts:
OS Version: RHEL - 7.2 - 9.el7_2.1
Kernel Version: 3.10.0 - 327.22.2.el7.x86_64
KVM Version: 2.3.0 - 31.el7_2.16
LIBVIRT Version: libvirt-1.2.17-13.el7_2.5
VDSM Version: vdsm-4.18.5.1-1.el7ev
SPICE Version: 0.12.4 - 15.el7_2.1

Steps with the automation test (which checks that the VM is migrated back to the source host):
One VM, 2 Hosts in Cluster
Loop 3 times:
1. Migrate VM 
2. Put Host into maintenance
(no delay between steps)

Test - PASS
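
For reference, a rough sketch of that loop using the oVirt Java SDK v4. The engine URL, credentials, and the VM/host IDs are placeholders, and the loop omits the status polling a real test would do between steps:

    import org.ovirt.engine.sdk4.Connection;
    import org.ovirt.engine.sdk4.ConnectionBuilder;
    import org.ovirt.engine.sdk4.services.HostService;
    import org.ovirt.engine.sdk4.services.VmService;

    public class MigrateThenMaintenanceSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details.
            Connection connection = ConnectionBuilder.connection()
                    .url("https://engine.example.com/ovirt-engine/api")
                    .user("admin@internal")
                    .password("secret")
                    .insecure(true)
                    .build();

            // Placeholder IDs; resolve real ones via vmsService().list().
            VmService vmService =
                    connection.systemService().vmsService().vmService("VM_ID");
            HostService hostService =
                    connection.systemService().hostsService().hostService("HOST_ID");

            for (int i = 0; i < 3; i++) {
                vmService.migrate().send();      // 1. migrate the VM
                hostService.deactivate().send(); // 2. host to maintenance, no delay
                // ...wait for the VM to settle and the host to reach
                // Maintenance, then reactivate the host for the next round...
                hostService.activate().send();
            }
            connection.close();
        }
    }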

