Bug 1887174 - Host deactivation failure after "Migration not in progress, code = 47" error
Summary: Host deactivation failure after "Migration not in progress, code = 47" error
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.4.3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.5.0
Target Release: 4.5.0
Assignee: Liran Rotenberg
QA Contact: Polina
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-11 14:41 UTC by Polina
Modified: 2022-04-28 09:26 UTC
CC: 2 users

Fixed In Version: ovirt-engine-4.5.0
Clone Of:
Environment:
Last Closed: 2022-04-28 09:26:34 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.5?


Attachments
engine and vdsm (8.08 MB, application/gzip), 2020-10-11 14:41 UTC, Polina


Links
oVirt gerrit 117446 (master, MERGED): backend: host maintenance incoming migration (last updated 2021-11-04 14:38:53 UTC)

Description Polina 2020-10-11 14:41:31 UTC
Created attachment 1720672 [details]
engine and vdsm

Description of problem:
On receiving a deactivate request (url:/ovirt-engine/api/hosts/{ID}/deactivate), the engine tries to switch the host into maintenance and detects that a migration to that host is in progress.
It sends the CancelMigrate command, which fails on the VDSM side because the migration has already finished (Migration not in progress, code = 47). In such a case the switch to maintenance fails.

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.3.5-0.5.el8ev.noarch

How reproducible: rare; depends on timing. Happened during an automation run.

Steps to Reproduce:
1. A setup with three hosts; 5 VMs run on host1: ['golden_env_mixed_virtio_1_0', 'golden_env_mixed_virtio_1_1', 'golden_env_mixed_virtio_2_0', 'golden_env_mixed_virtio_2_1', 'golden_env_mixed_virtio_3_0']
2. Send the deactivate API request to host3 (a minimal client-side sketch of this request follows after the log excerpts below):

2020-10-07 01:41:01,605 - MainThread - hosts - DEBUG - Action request content is --  url:/ovirt-engine/api/hosts/7e3df6df-646d-4902-a2db-585bde37ddf8/deactivate 
body:
<action>
    <async>false</async>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
</action>

3. At this moment, VM 'ac7c6844-4579-428c-8f95-f26e5b42082a' (golden_env_mixed_virtio_3_0) is migrating to this host (balanced) from host1 (211e8e36-8a09-4e15-a066-46bb4a72aba8):
2020-10-07 01:41:01,781+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-29) [] VM 'ac7c6844-4579-428c-8f95-f26e5b42082a'(golden_env_mixed_virtio_3_0) was unexpectedly detected as 'PoweringUp' on VDS '7e3df6df-646d-4902-a2db-585bde37ddf8'(host_mixed_3) (expected on '211e8e36-8a09-4e15-a066-46bb4a72aba8')

The engine tries to cancel this migration, which fails:

2020-10-07 01:41:04,041+03 ERROR [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] Command 'org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Migration not in progress, code = 47 (Failed with error c and code 47)
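
For reference, the deactivate call from step 2 can be reproduced with a minimal client-side sketch like the one below. The engine URL and credentials are placeholders (assumptions); the API path and XML body are taken verbatim from the request logged above.

import requests

ENGINE = "https://engine.example.com/ovirt-engine/api"    # placeholder engine URL
HOST_ID = "7e3df6df-646d-4902-a2db-585bde37ddf8"          # host3 from the logs above

BODY = """<action>
    <async>false</async>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
</action>"""

# POST /hosts/{id}/deactivate, mirroring the automation request logged above.
resp = requests.post(
    f"{ENGINE}/hosts/{HOST_ID}/deactivate",
    data=BODY,
    headers={"Content-Type": "application/xml", "Accept": "application/xml"},
    auth=("admin@internal", "password"),                  # placeholder credentials
    verify="/etc/pki/ovirt-engine/ca.pem",                # engine CA; adjust for your setup
)
print(resp.status_code)
print(resp.text)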


Actual results:
Switch to maintenance failed.

Expected results:
The maintenance operation must resume and succeed despite the MIGRATION_CANCEL_ERROR.

Additional info (for QE): automation test rhevmtests.compute.sla.scheduler_tests.even_vm_count_distribution.even_vm_count_distribution_test.TestBalancingWithDefaultParameters

Comment 1 Polina 2020-10-11 14:45:05 UTC
Adding Liran's analysis of the logs:

The engine tries to switch a host into maintenance and detects that a migration to it is in progress.
It sends the CancelMigrate command, which fails on the VDSM side because the migration has already finished.
Unfortunately, this fails the switch-to-maintenance command.

At the start of the commands we can clearly see that the engine detects the migration is done, yet it still tries to stop it:
2020-10-07 01:41:01,779+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] START, CancelMigrateVDSCommand(HostName = host_mixed_1, CancelMigrationVDSParameters:{hostId='211e8e36-8a09-4e15-a066-46bb4a72aba8', vmId='ac7c6844-4579-428c-8f95-f26e5b42082a'}), log id: 5a39ecaa
2020-10-07 01:41:01,781+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-29) [] VM 'ac7c6844-4579-428c-8f95-f26e5b42082a'(golden_env_mixed_virtio_3_0) was unexpectedly detected as 'PoweringUp' on VDS '7e3df6df-646d-4902-a2db-585bde37ddf8'(host_mixed_3) (expected on '211e8e36-8a09-4e15-a066-46bb4a72aba8')
2020-10-07 01:41:01,811+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand] (ForkJoinPool-1-worker-29) [] START, MigrateStatusVDSCommand(HostName = host_mixed_3, MigrateStatusVDSCommandParameters:{hostId='7e3df6df-646d-4902-a2db-585bde37ddf8', vmId='ac7c6844-4579-428c-8f95-f26e5b42082a'}), log id: 4bbd8d6a
2020-10-07 01:41:01,823+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand] (ForkJoinPool-1-worker-29) [] FINISH, MigrateStatusVDSCommand, return: , log id: 4bbd8d6a
2020-10-07 01:41:01,934+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-29) [] EVENT_ID: VM_MIGRATION_DONE(63), Migration completed (VM: golden_env_mixed_virtio_3_0, Source: host_mixed_1, Destination: host_mixed_3, Duration: 7 seconds, Total: 7 seconds, Actual downtime: (N/A))
...
2020-10-07 01:41:03,990+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] Failed in 'CancelMigrateVDS' method
2020-10-07 01:41:04,025+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_1 command CancelMigrateVDS failed: Migration not in progress
2020-10-07 01:41:04,038+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand' return value 'StatusOnlyReturn [status=Status [code=47, message=Migration not in progress]]'
2020-10-07 01:41:04,038+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] HostName = host_mixed_1
2020-10-07 01:41:04,041+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] Command 'CancelMigrateVDSCommand(HostName = host_mixed_1, CancelMigrationVDSParameters:{hostId='211e8e36-8a09-4e15-a066-46bb4a72aba8', vmId='ac7c6844-4579-428c-8f95-f26e5b42082a'})' execution failed: VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Migration not in progress, code = 47
2020-10-07 01:41:04,041+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] FINISH, CancelMigrateVDSCommand, return: , log id: 5a39ecaa
2020-10-07 01:41:04,041+03 ERROR [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] Command 'org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Migration not in progress, code = 47 (Failed with error MIGRATION_CANCEL_ERROR and code 47)
2020-10-07 01:41:04,077+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-107) [] Operation Failed: [Migration not in progress]
2020-10-07 01:41:04,090+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-29) [] VM 'ac7c6844-4579-428c-8f95-f26e5b42082a' was reported as Down on VDS '211e8e36-8a09-4e15-a066-46bb4a72aba8'(host_mixed_1)
2020-10-07 01:41:04,091+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-29) [] START, DestroyVDSCommand(HostName = host_mixed_1, DestroyVmVDSCommandParameters:{hostId='211e8e36-8a09-4e15-a066-46bb4a72aba8', vmId='ac7c6844-4579-428c-8f95-f26e5b42082a', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: d9f43eb
2020-10-07 01:41:04,149+03 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-34) [hosts_syncAction_b095665c-3233-4b3a] Command 'MaintenanceNumberOfVdss' id: '3a8a604a-840a-4832-a3ed-8847ac431b49' child commands '[]' executions were completed, status 'FAILED'

I think this is a really rare timing issue, but it can happen. We should somehow resume the maintenance flow when the reason cancelling the migration fails is: Migration not in progress, code = 47.
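
As an illustration only (the real engine code is Java, and the eventual fix was merged via gerrit 117446; all names below are hypothetical), the suggested handling amounts to treating code 47 as a benign outcome during the maintenance flow:

MIGRATION_NOT_IN_PROGRESS = 47   # VDSM status code seen in the logs above

def cancel_migrations_before_maintenance(host_name, migrating_vm_ids, cancel_migrate):
    """Cancel migrations involving the host; tolerate 'Migration not in progress'.

    cancel_migrate(vm_id) is assumed to return a (code, message) pair,
    mirroring the StatusOnlyReturn value visible in the engine log above.
    """
    for vm_id in migrating_vm_ids:
        code, message = cancel_migrate(vm_id)
        if code == 0:
            continue  # migration cancelled as requested
        if code == MIGRATION_NOT_IN_PROGRESS:
            # The migration finished between monitoring and the cancel call;
            # there is nothing left to cancel, so maintenance can proceed.
            continue
        # Any other cancellation failure should still fail the maintenance flow.
        raise RuntimeError(
            f"Cancelling migration of {vm_id} on {host_name} failed: {message} (code {code})")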

Comment 2 Liran Rotenberg 2020-10-11 15:18:14 UTC
Some information from VDSM logs:

Source host(host_mixed_1):
2020-10-07 01:40:54,265+0300 INFO  (jsonrpc/5) [api.virt] START migrate(params={'abortOnError': 'true', 'autoConverge': 'false', 'dst': 'lynx21.lab.eng.tlv2.redhat.com:54321', 'method': 'online', 'vmId': 'ac7c6844-4579-428c-8f95-f26e5b42082a', 'src': 'lynx12.lab.eng.tlv2.redhat.com', 'dstqemu': '10.46.16.36', 'outgoingLimit': 2, 'enableGuestEvents': False, 'maxBandwidth': 62, 'tunneled': 'false', 'encrypted': False, 'compressed': 'false', 'incomingLimit': 2}) from=::ffff:10.46.16.248,34282, flow_id=57523372, vmId=ac7c6844-4579-428c-8f95-f26e5b42082a (api:48)
2020-10-07 01:40:54,266+0300 INFO  (jsonrpc/5) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') using a computed convergence schedule for a legacy migration: {'init': [{'params': ['101'], 'name': 'setDowntime'}], 'stalling': [{'action': {'params': ['104'], 'name': 'setDowntime'}, 'limit': 1}, {'action': {'params': ['120'], 'name': 'setDowntime'}, 'limit': 2}, {'action': {'params': ['189'], 'name': 'setDowntime'}, 'limit': 3}, {'action': {'params': ['500'], 'name': 'setDowntime'}, 'limit': 4}, {'action': {'params': ['500'], 'name': 'setDowntime'}, 'limit': 42}, {'action': {'params': [], 'name': 'abort'}, 'limit': -1}]} (migration:159)
2020-10-07 01:40:54,267+0300 INFO  (jsonrpc/5) [api.virt] FINISH migrate return={'status': {'code': 0, 'message': 'Migration in progress'}, 'progress': 0} from=::ffff:10.46.16.248,34282, flow_id=57523372, vmId=ac7c6844-4579-428c-8f95-f26e5b42082a (api:54)

...
2020-10-07 01:41:01,872+0300 INFO  (jsonrpc/2) [api.virt] START migrateCancel() from=::ffff:10.46.16.248,34282, flow_id=hosts_syncAction_b095665c-3233-4b3a, vmId=ac7c6844-4579-428c-8f95-f26e5b42082a (api:48)
...
2020-10-07 01:41:04,071+0300 WARN  (jsonrpc/2) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') Failed to cancel migration: Requested operation is not valid: domain is not running (vm:2026)
2020-10-07 01:41:04,072+0300 INFO  (jsonrpc/2) [api.virt] FINISH migrateCancel return={'status': {'code': 47, 'message': 'Migration not in progress'}} from=::ffff:10.46.16.248,34282, flow_id=hosts_syncAction_b095665c-3233-4b3a, vmId=ac7c6844-4579-428c-8f95-f26e5b42082a (api:54)
2020-10-07 01:41:04,072+0300 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call VM.migrateCancel failed (error 47) in 2.20 seconds (__init__:312)
2020-10-07 01:41:04,076+0300 INFO  (migsrc/ac7c6844) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') migration took 9 seconds to complete (migration:532)
2020-10-07 01:41:04,077+0300 INFO  (migsrc/ac7c6844) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') Changed state to Down: Migration succeeded (code=4) (vm:1644)
2020-10-07 01:41:04,170+0300 INFO  (migsrc/ac7c6844) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') Stopping connection (guestagent:444)
2020-10-07 01:41:04,177+0300 INFO  (jsonrpc/4) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:10.46.16.248,34282, vmId=ac7c6844-4579-428c-8f95-f26e5b42082a (api:48)
2020-10-07 01:41:04,177+0300 INFO  (jsonrpc/4) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') Release VM resources (vm:4639)
2020-10-07 01:41:04,177+0300 WARN  (jsonrpc/4) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') trying to set state to Powering down when already Down (vm:626)
2020-10-07 01:41:04,177+0300 INFO  (jsonrpc/4) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') Stopping connection (guestagent:444)
2020-10-07 01:41:04,177+0300 INFO  (jsonrpc/4) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') _destroyVmGraceful attempt #0 (vm:4676)
2020-10-07 01:41:04,180+0300 WARN  (jsonrpc/4) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') VM 'ac7c6844-4579-428c-8f95-f26e5b42082a' couldn't be destroyed in libvirt: Requested operation is not valid: domain is not running (vm:4697)

Destination host(host_mixed_3):
2020-10-07 01:40:54,405+0300 INFO  (jsonrpc/6) [api.virt] START migrationCreate
..
2020-10-07 01:40:54,740+0300 INFO  (jsonrpc/6) [api.virt] FINISH migrationCreate return={'status': {'code': 0, 'message': 'Done'}, 'migrationPort': 0, 'params': {'vmId': 'ac7c6844-4579-428c-8f95-f26e5b42082a', 'status': 'Migration Destination',
...
2020-10-07 01:41:00,799+0300 INFO  (libvirt/events) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') CPU running: onResume (vm:5606)
2020-10-07 01:41:00,821+0300 INFO  (vm/ac7c6844) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') Starting connection (guestagent:249)
2020-10-07 01:41:00,823+0300 INFO  (vm/ac7c6844) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') CPU running: domain initializ
...
2020-10-07 01:41:01,663+0300 INFO  (qgapoller/0) [vds] Failed to get guest info for vm=ac7c6844-4579-428c-8f95-f26e5b42082a, error: Guest agent is not responding: QEMU guest agent is not connected (qemuguestagent:566)
2020-10-07 01:41:01,737+0300 DEBUG (check/loop) [storage.check] START check '/rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__2__volume01/d3070a12-fb28-4cf1-9f6c-d7859f409d1b/dom_md/metadata' (delay=0.00) (check:289)
2020-10-07 01:41:01,796+0300 DEBUG (check/loop) [storage.asyncevent] Process <subprocess.Popen object at 0x7f15801d7860> terminated (count=1) (asyncevent:660)
2020-10-07 01:41:01,796+0300 DEBUG (check/loop) [storage.check] FINISH check '/rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__2__volume01/d3070a12-fb28-4cf1-9f6c-d7859f409d1b/dom_md/metadata' (rc=0, elapsed=0.06) (check:359)
2020-10-07 01:41:01,899+0300 INFO  (jsonrpc/7) [api.virt] START getMigrationStatus() from=::ffff:10.46.16.248,39744, vmId=ac7c6844-4579-428c-8f95-f26e5b42082a (api:48)
2020-10-07 01:41:01,899+0300 INFO  (jsonrpc/7) [api.virt] FINISH getMigrationStatus return={'status': {'code': 0, 'message': 'Done'}, 'migrationStats': {'status': {'code': 0, 'message': 'Migration in progress'}, 'progress': 0}} from=::ffff:10.46.16.248,39744, vmId=ac7c6844-4579-428c-8f95-f26e5b42082a (api:54)
2020-10-07 01:41:01,899+0300 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call VM.getMigrationStatus succeeded in 0.00 seconds (__init__:312)

...
2020-10-07 01:41:02,099+0300 INFO  (jsonrpc/5) [api.host] START dumpxmls(vmList=['ac7c6844-4579-428c-8f95-f26e5b42082a']) from=::ffff:10.46.16.248,39744 (api:48)
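
For context, a simplified sketch of the VDSM-side behaviour visible in the source-host log above: VDSM cancels an outgoing migration by aborting the libvirt job, and when the domain has already finished migrating, the libvirt "operation is not valid: domain is not running" error is reported back to the engine as code 47. This is only an approximation of that flow, not the actual VDSM code.

import libvirt

MIGRATION_NOT_IN_PROGRESS = {'code': 47, 'message': 'Migration not in progress'}

def migrate_cancel(dom: libvirt.virDomain) -> dict:
    """Approximation of the migrateCancel flow seen in the log above."""
    try:
        dom.abortJob()  # ask libvirt to abort the outgoing migration job
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_OPERATION_INVALID:
            # The domain already completed the migration and is no longer
            # running on this host, as in the 01:41:04 WARN entry above.
            return {'status': MIGRATION_NOT_IN_PROGRESS}
        raise
    return {'status': {'code': 0, 'message': 'Done'}}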

Comment 3 Polina 2022-04-25 11:58:16 UTC
Verified on ovirt-engine-setup-4.5.0.4-0.1.el8ev.noarch by running the automation test rhevmtests.compute.sla.scheduler_tests.even_vm_count_distribution.even_vm_count_distribution_test.TestBalancingWithDefaultParameters in a loop.

Comment 4 Sandro Bonazzola 2022-04-28 09:26:34 UTC
This bug is included in the oVirt 4.5.0 release, published on April 20th 2022.

Since the problem described in this bug report should be resolved in the oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

