RHEVM: 3.6.5-2

Host:
OS Version: RHEL - 7.2 - 9.el7_2.1
Kernel Version: 3.10.0-327.13.1.el7.x86_64
KVM Version: 2.3.0-31.el7_2.10
LIBVIRT Version: libvirt-1.2.17-13.el7_2.4
VDSM Version: vdsm-4.17.25-0.el7ev
SPICE Version: 0.12.4-15.el7

The VM stays in 'Powering Down' and the host is in Maintenance.

We ran the following scenario:

1. Create a VM (name: migration_vm, vmId='4e64a544-57d9-470d-bfc2-214f1df85f3f').

2. [Case 1] Migrate the VM from host_mix_1 to host_mix_2 - case passed.

3. [Case 2] Put the host (host_mix_2) into Maintenance - case failed due to BZ 1303994.

Correlation-Id: hosts_syncAction_b9bbd1b9-574e-4060

Engine log:

2016-04-11 01:06:28,364 INFO [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-12) [] Received a spice Device without an address when processing VM eef7ef67-8ae9-4400-a842-86630305e0ad devices, skipping device: {tlsPort=5903, specParams={fileTransferEnable=true, copyPasteEnable=true, displayIp=10.35.117.72, displayNetwork=ovirtmgmt}, deviceType=graphics, deviceId=e30429fe-1c4b-41e4-8123-2593f1fae3c1, device=spice, type=graphics, port=5902}
2016-04-11 01:06:28,425 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-12) [] Rerun VM '4e64a544-57d9-470d-bfc2-214f1df85f3f'. Called from VDS 'host_mixed_2'
2016-04-11 01:06:28,440 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand] (org.ovirt.thread.pool-6-thread-46) [] START, MigrateStatusVDSCommand(HostName = host_mixed_2, MigrateStatusVDSCommandParameters:{runAsync='true', hostId='bf676270-3a37-463e-9824-0568f1f8b51d', vmId='4e64a544-57d9-470d-bfc2-214f1df85f3f'}), log id: 530ee11d
2016-04-11 01:06:29,445 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand] (org.ovirt.thread.pool-6-thread-46) [] FINISH, MigrateStatusVDSCommand, log id: 530ee11d
2016-04-11 01:06:29,489 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-46) [] Correlation ID: 15733274, Job ID: 06a86648-409e-4bd2-949d-91e2a9532693, Call Stack: null, Custom Event ID: -1, Message: Failed to migrate VM migration_vm to Host host_mixed_1 . Trying to migrate to another Host.
2016-04-11 01:06:29,599 WARN [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (org.ovirt.thread.pool-6-thread-46) [] CanDoAction of action 'InternalMigrateVm' failed for user admin@internal. Reasons: VAR__ACTION__MIGRATE,VAR__TYPE__VM,VAR__ACTION__MIGRATE,VAR__TYPE__VM,SCHEDULING_NO_HOSTS
2016-04-11 01:06:29,600 INFO [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (org.ovirt.thread.pool-6-thread-46) [] Lock freed to object 'EngineLock:{exclusiveLocks='[4e64a544-57d9-470d-bfc2-214f1df85f3f=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName migration_vm>]', sharedLocks='null'}'
2016-04-11 01:06:29,619 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-46) [] Correlation ID: 15733274, Job ID: 06a86648-409e-4bd2-949d-91e2a9532693, Call Stack: null, Custom Event ID: -1, Message: Migration failed while Host is in 'preparing for maintenance' state. Consider manual intervention: stopping/migrating Vms as Host's state will not turn to maintenance while VMs are still running on it.(VM: migration_vm, Source: host_mixed_2, Destination: <UNKNOWN>).
2016-04-11 01:06:42,482 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-36) [] RefreshVmList VM id 'eef7ef67-8ae9-4400-a842-86630305e0ad' is migrating to VDS 'host_mixed_1' ignoring it in the refresh until migration is done

4. [Case 3] Start migration of the VM (migration_vm) and cancel it - case passed.

Correlation-Id: vms_syncAction_5bb55d5b-0956-4655

Engine log:

2016-04-11 01:09:49,858 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-3) [vms_syncAction_5bb55d5b-0956-4655] Correlation ID: vms_syncAction_5bb55d5b-0956-4655, Call Stack: null, Custom Event ID: -1, Message: Migration cancelled (VM: migration_vm, Source: host_mixed_2, User: admin@internal).
2016-04-11 01:09:50,630 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-9) [] Failed to destroy VM '4e64a544-57d9-470d-bfc2-214f1df85f3f' because VM does not exist, ignoring
2016-04-11 01:09:50,631 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-9) [] FINISH, DestroyVDSCommand, log id: 7e3db1a7
2016-04-11 01:09:50,634 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (ForkJoinPool-1-worker-9) [] RefreshVmList VM id '4e64a544-57d9-470d-bfc2-214f1df85f3f' status = 'Down' on VDS 'host_mixed_1' ignoring it in the refresh until migration is done
2016-04-11 01:09:52,889 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-51) [7eb2a483] VM '4e64a544-57d9-470d-bfc2-214f1df85f3f'(migration_vm) moved from 'MigratingFrom' --> 'Up'

---- Why rerun the VM? -----------

2016-04-11 01:09:52,889 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-51) [7eb2a483] Adding VM '4e64a544-57d9-470d-bfc2-214f1df85f3f' to re-run list
2016-04-11 01:09:52,932 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-51) [7eb2a483] Rerun VM '4e64a544-57d9-470d-bfc2-214f1df85f3f'. Called from VDS 'host_mixed_2'

5. TearDown:

a. Stop migration_vm - failed to stop; no error in the log.

Correlation-Id: vms_syncAction_99213f1c-149b-4716

Engine log:

2016-04-11 01:10:06,651 INFO [org.ovirt.engine.core.bll.StopVmCommand] (ajp-/127.0.0.1:8702-4) [vms_syncAction_99213f1c-149b-4716] Running command: StopVmCommand internal: false. Entities affected : ID: 4e64a544-57d9-470d-bfc2-214f1df85f3f Type: VMAction group STOP_VM with role type USER
2016-04-11 01:10:06,668 INFO [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (ajp-/127.0.0.1:8702-4) [vms_syncAction_99213f1c-149b-4716] START, DestroyVmVDSCommand( DestroyVmVDSCommandParameters:{runAsync='true', hostId='bf676270-3a37-463e-9824-0568f1f8b51d', vmId='4e64a544-57d9-470d-bfc2-214f1df85f3f', force='false', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 5054f31d
2016-04-11 01:10:06,694 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ajp-/127.0.0.1:8702-4) [vms_syncAction_99213f1c-149b-4716] START, DestroyVDSCommand(HostName = host_mixed_2, DestroyVmVDSCommandParameters:{runAsync='true', hostId='bf676270-3a37-463e-9824-0568f1f8b51d', vmId='4e64a544-57d9-470d-bfc2-214f1df85f3f', force='false', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'}), log id: 12fa9d89
2016-04-11 01:10:07,936 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ajp-/127.0.0.1:8702-4) [vms_syncAction_99213f1c-149b-4716] FINISH, DestroyVDSCommand, log id: 12fa9d89
2016-04-11 01:10:07,948 INFO [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (ajp-/127.0.0.1:8702-4) [vms_syncAction_99213f1c-149b-4716] FINISH, DestroyVmVDSCommand, return: PoweringDown, log id: 5054f31d
2016-04-11 01:10:07,969 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-4) [vms_syncAction_99213f1c-149b-4716] Correlation ID: vms_syncAction_99213f1c-149b-4716, Job ID: 662f1bb6-d6cd-4b9e-96b2-316ec3a18d4e, Call Stack: null, Custom Event ID: -1, Message: VM migration_vm powered off by admin@internal (Host: host_mixed_2) (Reason: No reason was returned for this operation failure. See logs for further details.).
2016-04-11 01:10:09,101 WARN [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-99) [51bfe58b] skipping VM '4e64a544-57d9-470d-bfc2-214f1df85f3f' from this monitoring cycle - the VM data has changed since fetching the data

b. Remove migration_vm - failed to remove the VM.

Correlation-Id: vms_delete_17e29614-08c1-434f

Engine log:

2016-04-11 01:25:11,230 WARN [org.ovirt.engine.core.bll.RemoveVmCommand] (ajp-/127.0.0.1:8702-4) [vms_delete_17e29614-08c1-434f] CanDoAction of action 'RemoveVm' failed for user admin@internal. Reasons: VAR__ACTION__REMOVE,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_IS_RUNNING
2016-04-11 01:25:11,230 INFO [org.ovirt.engine.core.bll.RemoveVmCommand] (ajp-/127.0.0.1:8702-4) [vms_delete_17e29614-08c1-434f] Lock freed to object 'EngineLock:{exclusiveLocks='[4e64a544-57d9-470d-bfc2-214f1df85f3f=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2016-04-11 01:25:11,246 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (ajp-/127.0.0.1:8702-4) [] Operation Failed: [Cannot remove VM. VM is running.]

After the test finished, the VM was in 'Powering Down' and host_mixed_2 was in Maintenance.
Created attachment 1145991 [details] engine_log
Created attachment 1145992 [details] host_1_logs
Created attachment 1145993 [details] host_2_logs
Created attachment 1145994 [details] vm_tab_screenshot
The major problem here is that the host was switched to Maintenance while the engine thinks a VM is still running on that host. The engine puts the VM in the powering-down state and expects it either to go down or to stop being reported - but once the host switches to Maintenance it is no longer monitored, so the VM stays stuck in that state until the host is activated again. The engine switches the host to Maintenance because it believes no VM is running on it, and that is determined from the number of VMs reported by VDSM. VDSM reports that no VM is running on the host, which is true. So we should probably change the check to count the VMs running on the host according to the *database* instead of the report from VDSM.
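The proposed change can be sketched as follows. This is an illustrative model only: the types and method names (VmRecord, readyPerVdsmReport, readyPerDatabase) are hypothetical and not the actual oVirt engine API; it just contrasts the two checks described above.

```java
import java.util.List;

public class MaintenanceCheck {

    enum VmStatus { UP, POWERING_DOWN, MIGRATING_FROM, DOWN }

    // A row of the engine database: which host a VM is placed on and its status.
    record VmRecord(String vmId, String runOnHost, VmStatus status) {}

    // Behaviour described in this report: trust the VM count reported by VDSM.
    static boolean readyPerVdsmReport(int vmsReportedByVdsm) {
        return vmsReportedByVdsm == 0;
    }

    // Proposed behaviour: count the VMs the *database* still places on the
    // host, so a VM stuck in PoweringDown keeps the host out of Maintenance.
    static boolean readyPerDatabase(List<VmRecord> dbVms, String hostId) {
        return dbVms.stream()
                .noneMatch(vm -> hostId.equals(vm.runOnHost())
                        && vm.status() != VmStatus.DOWN);
    }

    public static void main(String[] args) {
        List<VmRecord> db = List.of(new VmRecord(
                "4e64a544-57d9-470d-bfc2-214f1df85f3f",
                "host_mixed_2", VmStatus.POWERING_DOWN));
        // VDSM reports 0 VMs, so the old check lets the host enter Maintenance,
        System.out.println(readyPerVdsmReport(0));                  // true
        // while the database check correctly keeps the host out of Maintenance.
        System.out.println(readyPerDatabase(db, "host_mixed_2"));   // false
    }
}
```

In this sketch the database check blocks Maintenance exactly in the situation reported here: VDSM no longer reports the VM, but the database still has it in 'Powering Down' on the host.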
Sounds like a race, rare enough to wait for 4.0. It might have become more visible due to the VM status events introduced in 3.6.
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.
Verified with:

RHEVM Version: 4.0.2-0.2.rc1.el7ev

Hosts:
OS Version: RHEL - 7.2 - 9.el7_2.1
Kernel Version: 3.10.0-327.22.2.el7.x86_64
KVM Version: 2.3.0-31.el7_2.16
LIBVIRT Version: libvirt-1.2.17-13.el7_2.5
VDSM Version: vdsm-4.18.5.1-1.el7ev
SPICE Version: 0.12.4-15.el7_2.1

Steps, with an automation test (checks that the VM is migrated back to the source host):
One VM, 2 hosts in the cluster.
Loop 3 times:
1. Migrate the VM.
2. Put the host into Maintenance (no delay between steps).

Test - PASS
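The verification loop above can be modeled with a small in-memory sketch. Everything here is an assumption for illustration (two hosts, stub migrate/setMaintenance helpers that mimic the fixed engine behaviour of migrating the VM away before the host reaches Maintenance); it does not call the real engine or SDK.

```java
import java.util.HashSet;
import java.util.Set;

public class MigrationMaintenanceLoop {

    // Toy state: which host the VM runs on, and which hosts are in Maintenance.
    static String vmHost = "host_1";
    static final Set<String> maintenance = new HashSet<>();

    static String other(String host) {
        return host.equals("host_1") ? "host_2" : "host_1";
    }

    // Step 1 of each round: migrate the VM to the other host.
    static void migrate() {
        vmHost = other(vmHost);
    }

    // Step 2, issued immediately after step 1: in the fixed behaviour the
    // engine first migrates the VM off the host, then the host may reach
    // Maintenance instead of being stuck in 'preparing for maintenance'.
    static void setMaintenance(String host) {
        if (host.equals(vmHost)) {
            migrate();
        }
        maintenance.add(host);
    }

    public static void main(String[] args) {
        for (int round = 0; round < 3; round++) {
            String source = vmHost;
            migrate();                  // VM moves to the other host
            String target = vmHost;
            setMaintenance(target);     // should push the VM back to source
            if (!vmHost.equals(source) || !maintenance.contains(target)) {
                throw new AssertionError("VM stuck or host blocked, round " + round);
            }
            maintenance.remove(target); // re-activate the host for the next round
        }
        System.out.println("PASS");
    }
}
```

The check in each round is exactly what the automation test verifies: the VM ends up back on the source host and the other host actually reaches Maintenance.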