Created attachment 1720672 [details]
engine and vdsm

Description of problem:
On receiving the /ovirt-engine/api/hosts/{ID}/deactivate API request, the engine tries to switch the host into maintenance and detects that it has a migration in progress. It sends the CancelMigrate command, which fails on the VDSM side because the migration has already finished (Migration not in progress, code = 47). The switch to maintenance fails in such a case.

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.3.5-0.5.el8ev.noarch

How reproducible:
Rare; depends on timing. Happened during an automation run.

Steps to Reproduce:
1. Set up three hosts, with 5 VMs running on host1: ['golden_env_mixed_virtio_1_0', 'golden_env_mixed_virtio_1_1', 'golden_env_mixed_virtio_2_0', 'golden_env_mixed_virtio_2_1', 'golden_env_mixed_virtio_3_0']
2. Send the deactivate API request to host3 (a scripted equivalent is sketched after this report):

2020-10-07 01:41:01,605 - MainThread - hosts - DEBUG - Action request content is --
url:/ovirt-engine/api/hosts/7e3df6df-646d-4902-a2db-585bde37ddf8/deactivate
body:
<action>
    <async>false</async>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
</action>

3. At this moment the VM 'ac7c6844-4579-428c-8f95-f26e5b42082a' (golden_env_mixed_virtio_3_0) is being migrated to this host by the balancing policy from host1 (211e8e36-8a09-4e15-a066-46bb4a72aba8):

2020-10-07 01:41:01,781+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-29) [] VM 'ac7c6844-4579-428c-8f95-f26e5b42082a'(golden_env_mixed_virtio_3_0) was unexpectedly detected as 'PoweringUp' on VDS '7e3df6df-646d-4902-a2db-585bde37ddf8'(host_mixed_3) (expected on '211e8e36-8a09-4e15-a066-46bb4a72aba8')

The engine tries to cancel this migration, which fails:

2020-10-07 01:41:04,041+03 ERROR [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] Command 'org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Migration not in progress, code = 47 (Failed with error MIGRATION_CANCEL_ERROR and code 47)

Actual results:
The switch to maintenance fails.

Expected results:
The maintenance should be resumed and succeed despite the MIGRATION_CANCEL_ERROR.

Additional info:
For QE: automation test rhevmtests.compute.sla.scheduler_tests.even_vm_count_distribution.even_vm_count_distribution_test.TestBalancingWithDefaultParameters
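For reference, a minimal sketch of the deactivate request from step 2, assuming the Python ovirt-engine-sdk (ovirtsdk4). The engine URL, credentials and TLS handling are placeholders; the host ID is the one from the log excerpt above. This is only an illustration of the call the automation makes over REST.

# Sketch: host deactivate via ovirtsdk4 (placeholders for URL/credentials)
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder engine URL
    username='admin@internal',
    password='secret',                                  # placeholder
    insecure=True,                                      # lab setup; use ca_file=... normally
)

try:
    host_service = (connection.system_service()
                              .hosts_service()
                              .host_service('7e3df6df-646d-4902-a2db-585bde37ddf8'))
    # Equivalent of POST /ovirt-engine/api/hosts/{ID}/deactivate with <async>false</async>
    host_service.deactivate()
finally:
    connection.close()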
Adding a comment based on Liran's help with analyzing the logs:

The engine tries to switch a host into maintenance and detects it has a migration in progress. It sends the CancelMigrate command, which fails on the VDSM side because the migration has already finished. Unfortunately, that failure also fails the switch-to-maintenance command.

At the start of the commands we can clearly see that the engine detects the migration is done, but it still tries to stop it:

2020-10-07 01:41:01,779+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] START, CancelMigrateVDSCommand(HostName = host_mixed_1, CancelMigrationVDSParameters:{hostId='211e8e36-8a09-4e15-a066-46bb4a72aba8', vmId='ac7c6844-4579-428c-8f95-f26e5b42082a'}), log id: 5a39ecaa
2020-10-07 01:41:01,781+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-29) [] VM 'ac7c6844-4579-428c-8f95-f26e5b42082a'(golden_env_mixed_virtio_3_0) was unexpectedly detected as 'PoweringUp' on VDS '7e3df6df-646d-4902-a2db-585bde37ddf8'(host_mixed_3) (expected on '211e8e36-8a09-4e15-a066-46bb4a72aba8')
2020-10-07 01:41:01,811+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand] (ForkJoinPool-1-worker-29) [] START, MigrateStatusVDSCommand(HostName = host_mixed_3, MigrateStatusVDSCommandParameters:{hostId='7e3df6df-646d-4902-a2db-585bde37ddf8', vmId='ac7c6844-4579-428c-8f95-f26e5b42082a'}), log id: 4bbd8d6a
2020-10-07 01:41:01,823+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand] (ForkJoinPool-1-worker-29) [] FINISH, MigrateStatusVDSCommand, return: , log id: 4bbd8d6a
2020-10-07 01:41:01,934+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-29) [] EVENT_ID: VM_MIGRATION_DONE(63), Migration completed (VM: golden_env_mixed_virtio_3_0, Source: host_mixed_1, Destination: host_mixed_3, Duration: 7 seconds, Total: 7 seconds, Actual downtime: (N/A))
...
2020-10-07 01:41:03,990+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] Failed in 'CancelMigrateVDS' method
2020-10-07 01:41:04,025+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_1 command CancelMigrateVDS failed: Migration not in progress
2020-10-07 01:41:04,038+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand' return value 'StatusOnlyReturn [status=Status [code=47, message=Migration not in progress]]'
2020-10-07 01:41:04,038+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] HostName = host_mixed_1
2020-10-07 01:41:04,041+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] Command 'CancelMigrateVDSCommand(HostName = host_mixed_1, CancelMigrationVDSParameters:{hostId='211e8e36-8a09-4e15-a066-46bb4a72aba8', vmId='ac7c6844-4579-428c-8f95-f26e5b42082a'})' execution failed: VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Migration not in progress, code = 47
2020-10-07 01:41:04,041+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] FINISH, CancelMigrateVDSCommand, return: , log id: 5a39ecaa
2020-10-07 01:41:04,041+03 ERROR [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (default task-107) [hosts_syncAction_b095665c-3233-4b3a] Command 'org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Migration not in progress, code = 47 (Failed with error MIGRATION_CANCEL_ERROR and code 47)
2020-10-07 01:41:04,077+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-107) [] Operation Failed: [Migration not in progress]
2020-10-07 01:41:04,090+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-29) [] VM 'ac7c6844-4579-428c-8f95-f26e5b42082a' was reported as Down on VDS '211e8e36-8a09-4e15-a066-46bb4a72aba8'(host_mixed_1)
2020-10-07 01:41:04,091+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-29) [] START, DestroyVDSCommand(HostName = host_mixed_1, DestroyVmVDSCommandParameters:{hostId='211e8e36-8a09-4e15-a066-46bb4a72aba8', vmId='ac7c6844-4579-428c-8f95-f26e5b42082a', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: d9f43eb
2020-10-07 01:41:04,149+03 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-34) [hosts_syncAction_b095665c-3233-4b3a] Command 'MaintenanceNumberOfVdss' id: '3a8a604a-840a-4832-a3ed-8847ac431b49' child commands '[]' executions were completed, status 'FAILED'

I think this timing is really rare, but it can happen. We should resume the maintenance somehow when the reason the migration cancel fails is "Migration not in progress, code = 47".
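To illustrate the behavior suggested above, here is a hypothetical sketch in Python, purely for illustration (the real engine code is Java, and none of these names are its API): a CancelMigrate failure with code 47 means the migration already finished, so the host has no outgoing migration left and the maintenance flow could simply continue; any other failure would remain fatal.

# Hypothetical sketch of tolerating "Migration not in progress" during maintenance
MIGRATION_NOT_IN_PROGRESS = 47  # VDSM status code seen in the logs above


class MigrationCancelError(Exception):
    """Raised only for cancel failures that should abort the maintenance flow."""


def handle_cancel_migrate_result(code, message):
    """Classify the result of a CancelMigrate call issued during host maintenance."""
    if code == 0:
        return 'cancelled'          # migration aborted, the VM stays on the source
    if code == MIGRATION_NOT_IN_PROGRESS:
        return 'already-finished'   # benign race: nothing left to cancel
    raise MigrationCancelError(
        'CancelMigrateVDS failed: {} (code={})'.format(message, code))

With a classification like this, the 'already-finished' case would let MaintenanceNumberOfVdss complete instead of ending with status 'FAILED'.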
Some information from the VDSM logs:

Source host (host_mixed_1):

2020-10-07 01:40:54,265+0300 INFO (jsonrpc/5) [api.virt] START migrate(params={'abortOnError': 'true', 'autoConverge': 'false', 'dst': 'lynx21.lab.eng.tlv2.redhat.com:54321', 'method': 'online', 'vmId': 'ac7c6844-4579-428c-8f95-f26e5b42082a', 'src': 'lynx12.lab.eng.tlv2.redhat.com', 'dstqemu': '10.46.16.36', 'outgoingLimit': 2, 'enableGuestEvents': False, 'maxBandwidth': 62, 'tunneled': 'false', 'encrypted': False, 'compressed': 'false', 'incomingLimit': 2}) from=::ffff:10.46.16.248,34282, flow_id=57523372, vmId=ac7c6844-4579-428c-8f95-f26e5b42082a (api:48)
2020-10-07 01:40:54,266+0300 INFO (jsonrpc/5) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') using a computed convergence schedule for a legacy migration: {'init': [{'params': ['101'], 'name': 'setDowntime'}], 'stalling': [{'action': {'params': ['104'], 'name': 'setDowntime'}, 'limit': 1}, {'action': {'params': ['120'], 'name': 'setDowntime'}, 'limit': 2}, {'action': {'params': ['189'], 'name': 'setDowntime'}, 'limit': 3}, {'action': {'params': ['500'], 'name': 'setDowntime'}, 'limit': 4}, {'action': {'params': ['500'], 'name': 'setDowntime'}, 'limit': 42}, {'action': {'params': [], 'name': 'abort'}, 'limit': -1}]} (migration:159)
2020-10-07 01:40:54,267+0300 INFO (jsonrpc/5) [api.virt] FINISH migrate return={'status': {'code': 0, 'message': 'Migration in progress'}, 'progress': 0} from=::ffff:10.46.16.248,34282, flow_id=57523372, vmId=ac7c6844-4579-428c-8f95-f26e5b42082a (api:54)
...
2020-10-07 01:41:01,872+0300 INFO (jsonrpc/2) [api.virt] START migrateCancel() from=::ffff:10.46.16.248,34282, flow_id=hosts_syncAction_b095665c-3233-4b3a, vmId=ac7c6844-4579-428c-8f95-f26e5b42082a (api:48)
...
2020-10-07 01:41:04,071+0300 WARN (jsonrpc/2) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') Failed to cancel migration: Requested operation is not valid: domain is not running (vm:2026)
2020-10-07 01:41:04,072+0300 INFO (jsonrpc/2) [api.virt] FINISH migrateCancel return={'status': {'code': 47, 'message': 'Migration not in progress'}} from=::ffff:10.46.16.248,34282, flow_id=hosts_syncAction_b095665c-3233-4b3a, vmId=ac7c6844-4579-428c-8f95-f26e5b42082a (api:54)
2020-10-07 01:41:04,072+0300 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call VM.migrateCancel failed (error 47) in 2.20 seconds (__init__:312)
2020-10-07 01:41:04,076+0300 INFO (migsrc/ac7c6844) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') migration took 9 seconds to complete (migration:532)
2020-10-07 01:41:04,077+0300 INFO (migsrc/ac7c6844) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') Changed state to Down: Migration succeeded (code=4) (vm:1644)
2020-10-07 01:41:04,170+0300 INFO (migsrc/ac7c6844) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') Stopping connection (guestagent:444)
2020-10-07 01:41:04,177+0300 INFO (jsonrpc/4) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:10.46.16.248,34282, vmId=ac7c6844-4579-428c-8f95-f26e5b42082a (api:48)
2020-10-07 01:41:04,177+0300 INFO (jsonrpc/4) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') Release VM resources (vm:4639)
2020-10-07 01:41:04,177+0300 WARN (jsonrpc/4) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') trying to set state to Powering down when already Down (vm:626)
2020-10-07 01:41:04,177+0300 INFO (jsonrpc/4) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') Stopping connection (guestagent:444)
2020-10-07 01:41:04,177+0300 INFO (jsonrpc/4) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') _destroyVmGraceful attempt #0 (vm:4676)
2020-10-07 01:41:04,180+0300 WARN (jsonrpc/4) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') VM 'ac7c6844-4579-428c-8f95-f26e5b42082a' couldn't be destroyed in libvirt: Requested operation is not valid: domain is not running (vm:4697)

Destination host (host_mixed_3):

2020-10-07 01:40:54,405+0300 INFO (jsonrpc/6) [api.virt] START migrationCreate ..
2020-10-07 01:40:54,740+0300 INFO (jsonrpc/6) [api.virt] FINISH migrationCreate return={'status': {'code': 0, 'message': 'Done'}, 'migrationPort': 0, 'params': {'vmId': 'ac7c6844-4579-428c-8f95-f26e5b42082a', 'status': 'Migration Destination', ...
2020-10-07 01:41:00,799+0300 INFO (libvirt/events) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') CPU running: onResume (vm:5606)
2020-10-07 01:41:00,821+0300 INFO (vm/ac7c6844) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') Starting connection (guestagent:249)
2020-10-07 01:41:00,823+0300 INFO (vm/ac7c6844) [virt.vm] (vmId='ac7c6844-4579-428c-8f95-f26e5b42082a') CPU running: domain initializ
...
2020-10-07 01:41:01,663+0300 INFO (qgapoller/0) [vds] Failed to get guest info for vm=ac7c6844-4579-428c-8f95-f26e5b42082a, error: Guest agent is not responding: QEMU guest agent is not connected (qemuguestagent:566)
2020-10-07 01:41:01,737+0300 DEBUG (check/loop) [storage.check] START check '/rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__2__volume01/d3070a12-fb28-4cf1-9f6c-d7859f409d1b/dom_md/metadata' (delay=0.00) (check:289)
2020-10-07 01:41:01,796+0300 DEBUG (check/loop) [storage.asyncevent] Process <subprocess.Popen object at 0x7f15801d7860> terminated (count=1) (asyncevent:660)
2020-10-07 01:41:01,796+0300 DEBUG (check/loop) [storage.check] FINISH check '/rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__2__volume01/d3070a12-fb28-4cf1-9f6c-d7859f409d1b/dom_md/metadata' (rc=0, elapsed=0.06) (check:359)
2020-10-07 01:41:01,899+0300 INFO (jsonrpc/7) [api.virt] START getMigrationStatus() from=::ffff:10.46.16.248,39744, vmId=ac7c6844-4579-428c-8f95-f26e5b42082a (api:48)
2020-10-07 01:41:01,899+0300 INFO (jsonrpc/7) [api.virt] FINISH getMigrationStatus return={'status': {'code': 0, 'message': 'Done'}, 'migrationStats': {'status': {'code': 0, 'message': 'Migration in progress'}, 'progress': 0}} from=::ffff:10.46.16.248,39744, vmId=ac7c6844-4579-428c-8f95-f26e5b42082a (api:54)
2020-10-07 01:41:01,899+0300 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call VM.getMigrationStatus succeeded in 0.00 seconds (__init__:312)
...
2020-10-07 01:41:02,099+0300 INFO (jsonrpc/5) [api.host] START dumpxmls(vmList=['ac7c6844-4579-428c-8f95-f26e5b42082a']) from=::ffff:10.46.16.248,39744 (api:48)
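The WARN line on the source host ("Requested operation is not valid: domain is not running") is libvirt refusing to abort a migration job for a domain that has already moved away, which VDSM then reports as code 47. The sketch below is only an illustration of that mapping using the libvirt Python bindings, assuming a cancel path built on abortJob(); it is not VDSM's actual implementation.

# Illustrative sketch of mapping a libvirt "domain is not running" error to code 47
import libvirt

MIGRATION_NOT_IN_PROGRESS = {'code': 47, 'message': 'Migration not in progress'}


def cancel_migration(dom):
    """Try to abort an outgoing migration of the libvirt domain `dom`."""
    try:
        dom.abortJob()
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_OPERATION_INVALID:
            # e.g. "Requested operation is not valid: domain is not running"
            return {'status': MIGRATION_NOT_IN_PROGRESS}
        raise
    return {'status': {'code': 0, 'message': 'Done'}}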
Verified on ovirt-engine-setup-4.5.0.4-0.1.el8ev.noarch by running the automation test rhevmtests.compute.sla.scheduler_tests.even_vm_count_distribution.even_vm_count_distribution_test.TestBalancingWithDefaultParameters in a loop.
This bugzilla is included in the oVirt 4.5.0 release, published on April 20th, 2022. Since the problem described in this bug report should be resolved in the oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.