Bug 1505399
| Summary: | Poweroff the migrating VM will leave VM with migrating state under the engine | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Artyom <alukiano> |
| Component: | BLL.Virt | Assignee: | Arik <ahadas> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Artyom <alukiano> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.7.4 | CC: | ahadas, alukiano, bugs, logank, lveyde, michal.skrivanek, tjelinek, troels |
| Target Milestone: | ovirt-4.1.9-1 | Keywords: | Triaged |
| Target Release: | --- | Flags: | rule-engine: ovirt-4.1+, rule-engine: blocker+ |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-02-12 09:59:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1536300 | | |
| Attachments: | | | |
Description (Artyom, 2017-10-23 13:54:10 UTC)
Created attachment 1342216 [details]
vdsm and engine logs
You can start looking from the line:
2017-10-23 09:47:49,588-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-6) [] EVENT_ID: USER_RUN_VM(32), Correlation ID: vms_syncAction_cdabd5f1-a3dd-4dd6, Job ID: f39281ef-b70b-40c0-b281-e3eb5fbe47ad, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: VM golden_env_mixed_virtio_0 started on Host host_mixed_1
So, what happens:

1. The VM starts migrating; everything is fine.
2. StopVmCommand is called, detects that the VM is being migrated, and destroys it on both the source and the destination.
3. The VM is then reported as down from both the source and the destination (this is the situation that confuses VmAnalyzer).
4. First it finds the VM down on the destination (in worker 6):
   2017-10-23 09:48:11,451-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-6) [] VM 'f3ee19f2-8f89-43a9-ba10-445c10e861f2'(golden_env_mixed_virtio_0) was unexpectedly detected as 'Down' on VDS 'fca7300c-760a-45aa-aa81-a8968fb8abef'(host_mixed_2) (expected on '19d88d3d-7d9f-4d19-8372-b8f1086f309e')
5. Then VmAnalyzer finds it down on the source (in worker 8).
6. VmAnalyzer moves it to the Down state (in worker 6). This is still the handling of finding it down on the destination, which means the migration failed. I'm not sure about the exact place, but vm.migratingToVds is set to null somewhere here.
7. VmAnalyzer (in worker 8), handling the down on the source, thinks the migration ended successfully, so it tries to hand the VM over to the destination. The destination is now null, so the VM is set to MigratingTo a destination that has been set to null (see the sketch below):
   2017-10-23 09:48:17,984-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-8) [] Handing over VM 'f3ee19f2-8f89-43a9-ba10-445c10e861f2'(golden_env_mixed_virtio_0) to Host 'null'. Setting VM to status 'MigratingTo'

The logs end here. In theory, in the next monitoring loop the VM status should change to Down. Is this the case?
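To make the failing hand-over in steps 5-7 concrete, here is a minimal, self-contained sketch. The class and field names are hypothetical simplifications, not the actual oVirt code in VmAnalyzer; it only illustrates how a cleared destination turns into "Handing over VM ... to Host 'null'" and a VM stuck in MigratingTo.

```java
// Hypothetical, simplified model of the hand-over in steps 5-7; the real logic
// lives in org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.
enum VmStatus { MIGRATING_FROM, MIGRATING_TO, DOWN }

class VmRecord {
    VmStatus status = VmStatus.MIGRATING_FROM;
    String runOnVds = "19d88d3d";        // source host
    String migratingToVds = "fca7300c";  // destination host
}

class HandOverSketch {

    // The source host reported the VM as Down while the engine thought it was migrating.
    static void proceedDownOnSource(VmRecord vm) {
        // Bug: the exit reason is not checked, so a VM destroyed by StopVmCommand
        // during migration is handled as if the migration completed successfully.
        handOver(vm);
    }

    static void handOver(VmRecord vm) {
        System.out.printf("Handing over VM to Host '%s'. Setting VM to status 'MigratingTo'%n",
                vm.migratingToVds);
        vm.runOnVds = vm.migratingToVds;    // null after the destroy flow
        vm.status = VmStatus.MIGRATING_TO;  // no host reports this VM anymore, so it stays here
    }

    public static void main(String[] args) {
        VmRecord vm = new VmRecord();
        vm.migratingToVds = null;  // cleared while destroying the VM on both hosts (step 2)
        proceedDownOnSource(vm);   // prints: Handing over VM to Host 'null'. ...
    }
}
```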
Created attachment 1342592 [details]
logs and screenshots
You can start looking from the engine log (note that the engine uses the EDT timezone while the hosts use IDT):
2017-10-24 04:07:14,382-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-15) [] VM 'a22df7a0-b4ec-4fb7-aa6d-aa9572dffdbb'(golden_env_mixed_virtio_1) moved from 'PoweringUp' --> 'Up'
2017-10-24 04:07:14,391-04 DEBUG [org.ovirt.engine.core.bll.scheduling.pending.PendingResourceManager] (ForkJoinPool-1-worker-15) [] Clearing pending resources for VM a22df7a0-b4ec-4fb7-aa6d-aa9572dffdbb
2017-10-24 04:07:14,400-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-15) [] EVENT_ID: USER_RUN_VM(32), Correlation ID: vms_syncAction_d2e4f344-3fd6-418b, Job ID: 831aab32-b23f-4d71-b220-e6fe1737a15f, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: VM golden_env_mixed_virtio_1 started on Host host_mixed_1
Arik - this is marked as blocker for oVirt GA. What is the status of the bug?

Need to check for exitReason MigrationSucceeded in VmAnalyzer::proceedDownVm(). Solution is posted.

(In reply to Tomas Jelinek from comment #2) Let me provide the missing parts:

1. The VM starts migrating; everything is fine.
2. StopVmCommand is called, detects that the VM is being migrated, and destroys it on both the source and the destination. ** The destroy command calls resourceManager.removeAsyncRunningVm, which calls MigrateVmCommand#reportCompleted, which clears the migrating_to_vds field. **
3. The VM will be reported as down from both the source and the destination.
4. First it finds the VM down on the destination (in worker 6):
   2017-10-23 09:48:11,451-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-6) [] VM 'f3ee19f2-8f89-43a9-ba10-445c10e861f2'(golden_env_mixed_virtio_0) was unexpectedly detected as 'Down' on VDS 'fca7300c-760a-45aa-aa81-a8968fb8abef'(host_mixed_2) (expected on '19d88d3d-7d9f-4d19-8372-b8f1086f309e')
   ** The monitoring simply tries to destroy the VM on the destination (which does nothing because it was already destroyed) and it does not update the VM. **
5. Then VmAnalyzer finds it down on the source (in worker 8).
6. ** The monitoring doesn't move the VM to the Down state. **
7. VmAnalyzer (in worker 8), handling the down on the source, thinks the migration ended successfully, so it tries to hand the VM over to the destination. The destination is now null, so the VM is set to MigratingTo a destination that has been set to null:
   2017-10-23 09:48:17,984-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-8) [] Handing over VM 'f3ee19f2-8f89-43a9-ba10-445c10e861f2'(golden_env_mixed_virtio_0) to Host 'null'. Setting VM to status 'MigratingTo'
   ** The first patch addresses this phase: we won't do the hand-over unless the exit reason is MigrationSucceeded (see the sketch below). The second patch addresses the case where we poll before getting the event. **

> The logs end here. In theory in the next monitoring loop the VM status should change to Down. Is this the case?

** Once the run_on_vds field is set to null, if the VM is not reported by any host it would remain in its status forever. **

Don't we want to backport the fix to 4.1.z? The bug was detected on 4.1.7.

*** Bug 1525529 has been marked as a duplicate of this bug. ***

Any idea why this bug is stuck on MODIFIED?

It's not stuck, there's no 4.1.9 yet.
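A rough sketch of the fix idea from the analysis above (simplified and hypothetical, not the actual gerrit patches): gate the hand-over in the source-side "down" handling on the MigrationSucceeded exit reason, and move the VM to Down otherwise.

```java
// Hypothetical illustration of the first patch's idea: only hand the VM over
// when VDSM reports the source-side "down" with exit reason MigrationSucceeded.
class ExitReasonSketch {

    enum ExitReason { MIGRATION_SUCCEEDED, DESTROYED, ERROR }
    enum VmStatus { MIGRATING_FROM, MIGRATING_TO, DOWN }

    static class VmRecord {
        VmStatus status = VmStatus.MIGRATING_FROM;
        String migratingToVds;  // may already be null (cleared by the destroy flow)
    }

    static void proceedDownOnSource(VmRecord vm, ExitReason exitReason) {
        if (exitReason == ExitReason.MIGRATION_SUCCEEDED && vm.migratingToVds != null) {
            vm.status = VmStatus.MIGRATING_TO;  // genuine hand-over to the destination
            System.out.println("Handing over VM to " + vm.migratingToVds);
        } else {
            vm.status = VmStatus.DOWN;          // poweroff during migration: no hand-over
            System.out.println("VM moved to Down");
        }
    }

    public static void main(String[] args) {
        VmRecord vm = new VmRecord();
        vm.migratingToVds = null;                       // already cleared by StopVmCommand
        proceedDownOnSource(vm, ExitReason.DESTROYED);  // -> "VM moved to Down"
    }
}
```

The second patch mentioned above (handling a poll that arrives before the down event) is not modeled in this sketch.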
Checked on rhevm-4.1.9-0.2.el7.noarch, I can see the same problem.

Ran 5 loops of:
1. Start the VM
2. Initiate VM migration
3. When the VM is in the "migrating" state, power it off

(A scripted version of these steps is sketched below.)
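For reference, these steps could be scripted roughly as follows with the oVirt v4 Java SDK (ovirt-engine-sdk-java). The engine URL, credentials and VM name are placeholders; error handling, timeouts and host selection are omitted, so treat this as a sketch of the flow rather than a tested automation script.

```java
import org.ovirt.engine.sdk4.Connection;
import org.ovirt.engine.sdk4.ConnectionBuilder;
import org.ovirt.engine.sdk4.services.VmService;
import org.ovirt.engine.sdk4.services.VmsService;
import org.ovirt.engine.sdk4.types.Vm;
import org.ovirt.engine.sdk4.types.VmStatus;

// Sketch of the reproduction loop: start the VM, trigger a migration, and power
// it off as soon as the engine reports it as migrating.
public class ReproduceStopDuringMigration {
    public static void main(String[] args) throws Exception {
        Connection connection = ConnectionBuilder.connection()
                .url("https://engine.example.com/ovirt-engine/api")  // placeholder
                .user("admin@internal")                              // placeholder
                .password("password")                                // placeholder
                .insecure(true)
                .build();

        VmsService vmsService = connection.systemService().vmsService();
        Vm vm = vmsService.list().search("name=golden_env_mixed_virtio_1").send().vms().get(0);
        VmService vmService = vmsService.vmService(vm.id());

        for (int i = 0; i < 5; i++) {
            vmService.start().send();                        // 1. start the VM
            waitForStatus(vmService, VmStatus.UP);
            vmService.migrate().send();                      // 2. initiate migration
            waitForStatus(vmService, VmStatus.MIGRATING);
            vmService.stop().send();                         // 3. power it off mid-migration
            // With the bug present, the VM may never reach Down and instead
            // stays "migrating" in the engine, so this wait can hang.
            waitForStatus(vmService, VmStatus.DOWN);
        }
        connection.close();
    }

    static void waitForStatus(VmService vmService, VmStatus expected) throws InterruptedException {
        while (vmService.get().send().vm().status() != expected) {
            Thread.sleep(1000);
        }
    }
}
```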
In the engine log:
2018-01-15 16:08:38,936+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler6) [3fd2badc] VM 'a22df7a0-b4ec-4fb7-aa6d-aa9572dffdbb'(golden_env_mixed_virtio_1) is running in db and not running on VDS '19d88d3d-7d9f-4d19-8372-b8f1086f309e'(host_mixed_1)
2018-01-15 16:08:38,939+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (DefaultQuartzScheduler6) [3fd2badc] Failed during monitoring vm: a22df7a0-b4ec-4fb7-aa6d-aa9572dffdbb , error is: java.lang.NullPointerException
2018-01-15 16:08:38,940+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (DefaultQuartzScheduler6) [3fd2badc] Exception:: java.lang.NullPointerException
at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.auditVmMigrationAbort(VmAnalyzer.java:440) [vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.abortVmMigration(VmAnalyzer.java:432) [vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.proceedDisappearedVm(VmAnalyzer.java:794) [vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.analyze(VmAnalyzer.java:135) [vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.lambda$analyzeVms$1(VmsMonitoring.java:151) [vdsbroker.jar:]
at java.util.ArrayList.forEach(ArrayList.java:1255) [rt.jar:1.8.0_151]
at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.analyzeVms(VmsMonitoring.java:146) [vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.perform(VmsMonitoring.java:109) [vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:38) [vdsbroker.jar:]
at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) [:1.8.0_151]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_151]
at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_151]
at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:83) [scheduler.jar:]
at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:55) [scheduler.jar:]
at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_151]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_151]
2018-01-15 16:08:39,291+02 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler3) [5cd919bf] Before acquiring lock 'EngineLock:{exclusiveLocks='[19d88d3d-7d9f-4d19-8372-b8f1086f309e=VDS_INIT]', sharedLocks=''}'
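A guess at why the abort path throws here (hypothetical and simplified; the real code is VmAnalyzer#auditVmMigrationAbort, which is not reproduced): the audit message for the aborted migration is built from the destination host, which is resolved through the already-cleared migrating_to_vds, so the lookup yields null and dereferencing it raises the NullPointerException seen in the stack trace.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the failure mode, not the actual oVirt code.
class AbortAuditSketch {
    static class Host {
        final String name;
        Host(String name) { this.name = name; }
    }

    static final Map<String, Host> HOSTS = new HashMap<>();  // vds id -> host

    static String auditVmMigrationAbort(String migratingToVds) {
        // migratingToVds is null, so the lookup finds no destination host...
        Host destination = HOSTS.get(migratingToVds);
        // ...and dereferencing it throws, mirroring VmAnalyzer.java:440 in the trace above.
        return "Migration aborted, destination host: " + destination.name;
    }

    public static void main(String[] args) {
        HOSTS.put("fca7300c-760a-45aa-aa81-a8968fb8abef", new Host("host_mixed_2"));
        auditVmMigrationAbort(null);  // throws NullPointerException
    }
}
```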
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Tag 'ovirt-engine-4.1.9.1' doesn't contain patch 'https://gerrit.ovirt.org/86367']
gitweb: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/tags/ovirt-engine-4.1.9.1
For more info please contact: infra

Verified on rhevm-4.1.9.2-0.1.el7.noarch