Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1505399

Summary: Poweroff the migrating VM will leave VM with migrating state under the engine
Product: [oVirt] ovirt-engine
Reporter: Artyom <alukiano>
Component: BLL.Virt
Assignee: Arik <ahadas>
Status: CLOSED CURRENTRELEASE
QA Contact: Artyom <alukiano>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.1.7.4
CC: ahadas, alukiano, bugs, logank, lveyde, michal.skrivanek, tjelinek, troels
Target Milestone: ovirt-4.1.9-1
Keywords: Triaged
Target Release: ---
Flags: rule-engine: ovirt-4.1+, rule-engine: blocker+
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-02-12 09:59:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1536300
Attachments:
- vdsm and engine logs (flags: none)
- logs and screenshots (flags: none)

Description Artyom 2017-10-23 13:54:10 UTC
Description of problem:
Powering off a migrating VM leaves the VM in the "migrating" state in the engine, even though the VM no longer exists on any host.

Version-Release number of selected component (if applicable):
vdsm-4.19.35-1.el7ev.x86_64
rhevm-4.1.7.4-0.1.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Start the VM
2. Initiate VM migration
3. While the VM is in the "migrating" state, power it off

Actual results:
The VM disappears from the hosts, but it still has the migrating state in the engine

Expected results:
The VM disappears from the hosts, and it must be in the Down state in the engine

Additional info:

Comment 1 Artyom 2017-10-23 13:57:07 UTC
Created attachment 1342216 [details]
vdsm and engine logs

You can start looking from the line:

2017-10-23 09:47:49,588-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-6) [] EVENT_ID: USER_RUN_VM(32), Correlation ID: vms_syncAction_cdabd5f1-a3dd-4dd6, Job ID: f39281ef-b70b-40c0-b281-e3eb5fbe47ad, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: VM golden_env_mixed_virtio_0 started on Host host_mixed_1

Comment 2 Tomas Jelinek 2017-10-24 07:48:22 UTC
So, here is what happens.

1: vm starts migrating, everything is fine

2: StopVmCommand is called; it detects that the VM is being migrated, so it destroys the VM on both the source and the destination

3: the VM will be reported as down by both the source and the destination (this is the situation that confuses VmAnalyzer)

4: first it finds it down on destination (in worker 6):
2017-10-23 09:48:11,451-04 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-6) [] VM 'f3ee19f2-8f89-43a9-ba10-445c10e861f2'(golden_env_mixed_virtio_0) was unexpectedly detected as 'Down' on VDS 'fca7300c-760a-45aa-aa81-a8968fb8abef'(host_mixed_2) (expected on '19d88d3d-7d9f-4d19-8372-b8f1086f309e')

5: then VmAnalyzer finds it down on the source (in worker 8)

6: VmAnalyzer moves it to the Down state (in worker 6). This is still the handling of finding it down on the destination, which means the migration failed. I'm not sure about the exact place, but vm.migratingToVds is set to null somewhere here

7: VmAnalyzer (in worker 8), handling the down on the source, thinks the migration ended successfully, so it tries to hand the VM over to the destination. The destination is now null, so the VM is set to MigratingTo towards a destination that has been set to null:
2017-10-23 09:48:17,984-04 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-8) [] Handing over VM 'f3ee19f2-8f89-43a9-ba10-445c10e861f2'(golden_env_mixed_virtio_0) to Host 'null'. Setting VM to status 'MigratingTo'

The logs end here. In theory, the VM status should change to Down in the next monitoring loop. Is this the case?

Comment 3 Artyom 2017-10-24 08:22:56 UTC
Created attachment 1342592 [details]
logs and screenshots

You can start looking from the engine log (just take into account that the engine uses the EDT timezone and the hosts use IDT):
2017-10-24 04:07:14,382-04 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-15) [] VM 'a22df7a0-b4ec-4fb7-aa6d-aa9572dffdbb'(golden_env_mixed_virtio_1) moved from 'PoweringUp' --> 'Up'
2017-10-24 04:07:14,391-04 DEBUG [org.ovirt.engine.core.bll.scheduling.pending.PendingResourceManager] (ForkJoinPool-1-worker-15) [] Clearing pending resources for VM a22df7a0-b4ec-4fb7-aa6d-aa9572dffdbb
2017-10-24 04:07:14,400-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-15) [] EVENT_ID: USER_RUN_VM(32), Correlation ID: vms_syncAction_d2e4f344-3fd6-418b, Job ID: 831aab32-b23f-4d71-b220-e6fe1737a15f, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: VM golden_env_mixed_virtio_1 started on Host host_mixed_1

Comment 4 Yaniv Kaul 2017-11-16 08:21:46 UTC
Arik - this is marked as blocker for oVirt GA. What is the status of the bug?

Comment 5 Michal Skrivanek 2017-11-20 12:05:36 UTC
need to check for exitReason MigrationSucceeded in VmAnalyzer::proceedDownVm()
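For illustration, a minimal sketch of the proposed check; the enum values follow the names used in this bug, but everything else is a simplified stand-in, not the actual ovirt-engine code:

    // Hedged sketch of the guard proposed for VmAnalyzer::proceedDownVm():
    // only hand the VM over to the destination when the source reports that
    // the migration actually succeeded. All names here are illustrative.
    public class ProceedDownVmSketch {
        enum ExitReason { MigrationSucceeded, AdminShutdown, Error }

        static String proceedDownVm(ExitReason reason) {
            if (reason == ExitReason.MigrationSucceeded) {
                return "hand over to destination";   // migration completed normally
            }
            return "move VM to Down";                // e.g. the user powered it off
        }

        public static void main(String[] args) {
            // A powered-off VM must not be handed over:
            System.out.println(proceedDownVm(ExitReason.AdminShutdown)); // move VM to Down
        }
    }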

Comment 6 Arik 2017-11-20 14:14:34 UTC
The solution is posted.

Comment 7 Arik 2017-11-21 08:02:50 UTC
(In reply to Tomas Jelinek from comment #2)
Let me provide the missing parts:

1: vm starts migrating, everything is fine

2: StopVmCommand is called; it detects that the VM is being migrated, so it destroys the VM on both the source and the destination. ** the destroy command calls resourceManager.removeAsyncRunningVm, which calls MigrateVmCommand#reportCompleted, which clears the migrating_to_vds field **

3: the VM will be reported as down from both source and destination

4: first it finds it down on destination (in worker 6):
2017-10-23 09:48:11,451-04 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-6) [] VM 'f3ee19f2-8f89-43a9-ba10-445c10e861f2'(golden_env_mixed_virtio_0) was unexpectedly detected as 'Down' on VDS 'fca7300c-760a-45aa-aa81-a8968fb8abef'(host_mixed_2) (expected on '19d88d3d-7d9f-4d19-8372-b8f1086f309e')
** The monitoring simply tries to destroy the VM on the destination (which does nothing because it was already destroyed) and it does not update the VM **

5: then VmAnalyzer finds it down on the source (in worker 8)

6: ** the monitoring doesn't move the VM to the Down state **

7: VmAnalyzer (in worker 8), handling the down on the source, thinks the migration ended successfully, so it tries to hand the VM over to the destination. The destination is now null, so the VM is set to MigratingTo towards a destination that has been set to null:
2017-10-23 09:48:17,984-04 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-8) [] Handing over VM 'f3ee19f2-8f89-43a9-ba10-445c10e861f2'(golden_env_mixed_virtio_0) to Host 'null'. Setting VM to status 'MigratingTo'
** the first patch addresses this phase: we won't do the hand-over unless the exit reason is MigrationSuccessful. The second patch addresses the case where we poll before getting the event **

The logs end here. In theory, the VM status should change to Down in the next monitoring loop. Is this the case?
** once the run_on_vds field is set to null, if the VM is not reported by any host, it will remain in its status forever (see the sketch below) **
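For illustration, the race described in comments 2 and 7 can be reduced to the following self-contained sketch; the field names (runOnVds, migratingToVds) follow the comments above, and everything else is a simplified stand-in for the real monitoring code:

    // Hedged simulation of the sequence above: the destroy flow clears
    // migratingToVds (step 2), and the source-side "down" handler then hands
    // the VM over to a destination that is already null (step 7).
    public class MigrationRaceSketch {
        static String runOnVds = "source_host";
        static String migratingToVds = "destination_host";
        static String status = "MigratingFrom";

        // Step 2: StopVmCommand destroys the VM on both hosts; along the way,
        // reportCompleted clears the migration destination.
        static void stopVm() {
            migratingToVds = null;
        }

        // Step 7: worker 8 sees the VM down on the source and, believing the
        // migration succeeded, hands it over to the (now null) destination.
        static void handleDownOnSource() {
            runOnVds = migratingToVds;   // run_on_vds becomes null
            status = "MigratingTo";
        }

        public static void main(String[] args) {
            stopVm();
            handleDownOnSource();
            // No host will ever report a VM whose run_on_vds is null, so the
            // status stays 'MigratingTo' forever, exactly as described above.
            System.out.println("VM is '" + status + "' on host " + runOnVds);
        }
    }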

Comment 8 Arik 2017-11-26 08:24:57 UTC
Don't we want to backport the fix to 4.1.z? The bug was detected on 4.1.7.

Comment 9 Michal Skrivanek 2017-12-14 05:31:48 UTC
*** Bug 1525529 has been marked as a duplicate of this bug. ***

Comment 10 Yaniv Kaul 2017-12-19 09:28:15 UTC
Any idea why this bug is stuck on MODIFIED?

Comment 11 Michal Skrivanek 2017-12-19 10:32:44 UTC
It's not stuck; there's no 4.1.9 yet.

Comment 13 Artyom 2018-01-15 14:09:16 UTC
Checked on rhevm-4.1.9-0.2.el7.noarch; I can see the same problem:

I ran 5 loops of:
1. Start the VM
2. Initiate VM migration
3. While the VM is in the "migrating" state, power it off

In the engine log:
2018-01-15 16:08:38,936+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler6) [3fd2badc] VM 'a22df7a0-b4ec-4fb7-aa6d-aa9572dffdbb'(golden_env_mixed_virtio_1) is running in db and not running on VDS '19d88d3d-7d9f-4d19-8372-b8f1086f309e'(host_mixed_1)
2018-01-15 16:08:38,939+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (DefaultQuartzScheduler6) [3fd2badc] Failed during monitoring vm: a22df7a0-b4ec-4fb7-aa6d-aa9572dffdbb , error is: java.lang.NullPointerException
2018-01-15 16:08:38,940+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (DefaultQuartzScheduler6) [3fd2badc] Exception:: java.lang.NullPointerException
        at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.auditVmMigrationAbort(VmAnalyzer.java:440) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.abortVmMigration(VmAnalyzer.java:432) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.proceedDisappearedVm(VmAnalyzer.java:794) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.analyze(VmAnalyzer.java:135) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.lambda$analyzeVms$1(VmsMonitoring.java:151) [vdsbroker.jar:]
        at java.util.ArrayList.forEach(ArrayList.java:1255) [rt.jar:1.8.0_151]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.analyzeVms(VmsMonitoring.java:146) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.perform(VmsMonitoring.java:109) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:38) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) [:1.8.0_151]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_151]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_151]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:83) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:55) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_151]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_151]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_151]

2018-01-15 16:08:39,291+02 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler3) [5cd919bf] Before acquiring lock 'EngineLock:{exclusiveLocks='[19d88d3d-7d9f-4d19-8372-b8f1086f309e=VDS_INIT]', sharedLocks=''}'
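For illustration, a hedged guess at the NullPointerException above: auditVmMigrationAbort presumably needs the destination host, which is null after the earlier hand-over to 'null'. The sketch below only mimics that shape; none of the names are the engine's actual code:

    // Hedged sketch of how a null destination could produce the NPE seen in
    // VmAnalyzer.auditVmMigrationAbort(); all names are illustrative.
    public class AbortAuditNpeSketch {
        static class Vds {
            private final String name;
            Vds(String name) { this.name = name; }
            String getName() { return name; }
        }

        // Stand-in for a DB lookup: there is no row for a null host id.
        static Vds getVds(String vdsId) {
            return vdsId == null ? null : new Vds("host_mixed_2");
        }

        public static void main(String[] args) {
            String migratingToVds = null;              // cleared during the earlier race
            Vds destination = getVds(migratingToVds);
            // Building the audit message dereferences the missing host:
            System.out.println("Migration aborted, destination: "
                    + destination.getName());          // throws NullPointerException
        }
    }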

Comment 14 RHV bug bot 2018-01-17 16:15:25 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Tag 'ovirt-engine-4.1.9.1' doesn't contain patch 'https://gerrit.ovirt.org/86367']
gitweb: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/tags/ovirt-engine-4.1.9.1

For more info please contact: infra

Comment 15 Artyom 2018-02-01 08:24:54 UTC
Verified on rhevm-4.1.9.2-0.1.el7.noarch