When having the asynchronous snapshot process the engine will keep polling the memory dump command for the command's default(~50 hours). The engine should have a specific polling time for the memory dump, when reaching it, it should abort the memory dump and cancel the snapshot. This is an incremental bug for bug 1749284.
Verified with: rhvm-4.4.0-0.33.master.el8ev.noarch vdsm-4.40.13-1.el8ev.x86_64 Scenarios and steps: Prepare: 1) Set snapshot timeout by `engine-config -s LiveSnapshotTimeoutInMinutes=5` 2) Create a VM with 64GiB memory. 3) Run VM 4) load memory 1. Create a snapshot with memory Results: 1) Snapshot creation was aborted when 5mins timeout reached. 2) The partial snapshot was cleaned. 3) VM returned back to up status. Info in /var/log/vdsm/vdsm.log: 2020-04-25 07:05:07,152+0000 INFO (jsonrpc/2) [api.virt] START snapshot(snapDrives=[{'imageID': '9f53ef0e-3fbd-4ced-9756-ea9bf3219b72', 'baseVolumeID': '40f13854-a81d-425c-8642-8d03449ad73d', 'volumeID': 'd1b97d50-4107-41b6-81d4-8fd48d5d3bf4', 'domainID': '5801ec80-1fde-4839-84c7-aef9ec811295'}], snapMemory='5801ec80-1fde-4839-84c7-aef9ec811295,89447e2a-bdeb-43e9-be23-1b947a1f1eb0,161d270a-030a-43ab-9c78-ef83d16b45a4,1066c56a-bd24-496e-88ae-80f3a8466ccc,8646087e-4370-4392-a634-120352639d4b,ba1699a0-b72d-40e8-a070-7d35e9ae7c25', frozen=False, jobUUID='4a5bb336-4dea-4147-ac3a-05c9add34c9f', timeout=5) from=::ffff:10.73.197.64,37632, flow_id=823d3071-38ce-4adb-bd13-1ac42e18637c, vmId=212aee9e-5e32-409f-9d61-41449a6f6a31 (api:48) 2020-04-25 07:05:07,155+0000 INFO (snap_abort/4a5bb336) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607) 2020-04-25 07:05:08,609+0000 INFO (virt/4a5bb336) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Taking a live snapshot (drives=sda,memory=True) (snapshot:451) 2020-04-25 07:05:09,201+0000 INFO (snap_abort/4a5bb336) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 2.040000000037253, out of 300 (snapshot:613) 2020-04-25 07:10:09,485+0000 WARN (snap_abort/4a5bb336) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Timeout passed, aborting snapshot... (snapshot:622) 2020-04-25 07:11:07,167+0000 INFO (virt/4a5bb336) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Snapshot timeout reached, operation aborted (snapshot:461) Info on engine: Apr 25, 2020, 11:04:43 PM Snapshot 'test_normal' creation for VM 'snapshot_timeout_test' was initiated by admin@internal-authz. Apr 25, 2020, 11:05:05 PM VM snapshot_timeout_test has been paused. Apr 25, 2020, 11:11:08 PM Failed to complete snapshot 'test_normal' creation for VM 'snapshot_timeout_test'. 2. Create a snapshot with memory, stop vdsmd on host after memory dump start, restart vdsmd before snapshot timeout reached. Results: 1) Snapshot creation was aborted after 5mins from the snapshot job starting time, not from vdsmd restart time. 2) The partial snapshot was cleaned. 3) VM returned back to up status. Info in /var/log/vdsm/vdsm.log: 2020-04-25 03:55:32,814+0000 INFO (jsonrpc/2) [api.virt] START snapshot(snapDrives=[{'imageID': '9f53ef0e-3fbd-4ced-9756-ea9bf3219b72', 'baseVolumeID': '40f13854-a81d-425c-8642-8d03449ad73d', 'volumeID': '7e43067c-3d81-4eda-97fb-1acba8ea1bb8', 'domainID': '5801ec80-1fde-4839-84c7-aef9ec811295'}], snapMemory='5801ec80-1fde-4839-84c7-aef9ec811295,89447e2a-bdeb-43e9-be23-1b947a1f1eb0,f5038b45-6bfd-4bdb-abc9-0a3415feffd7,9ba92a4a-3c1e-4da4-ac3a-01f62276085b,8f5a759a-4521-412b-9730-4ba620e32fdb,7049c118-94a1-4d78-be1d-48c6c3293c10', frozen=False, jobUUID='8d4fd852-8451-417f-b6df-6af7e6a621a2', timeout=5) from=::ffff:10.73.197.64,46672, flow_id=3502422e-70c7-413c-8a23-e3b6e73814e8, vmId=212aee9e-5e32-409f-9d61-41449a6f6a31 (api:48) 2020-04-25 03:55:32,818+0000 INFO (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607) 2020-04-25 03:55:34,248+0000 INFO (virt/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Taking a live snapshot (drives=sda,memory=True) (snapshot:451) 2020-04-25 03:55:34,871+0000 INFO (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 2.0499999998137355, out of 300 (snapshot:613) 2020-04-25 03:56:58,366+0000 INFO (MainThread) [vds] Received signal 15, shutting down (vdsmd:73) 2020-04-25 03:59:27,952+0000 INFO (MainThread) [vds] (PID: 191361) I am the actual vdsm 4.40.13.1 dell-per740-28.lab.eng.pek2.redhat.com (4.18.0-193.el8.x86_64) (vdsmd:155) 2020-04-25 03:59:32,273+0000 INFO (virt/8d4fd852) [root] Running job '8d4fd852-8451-417f-b6df-6af7e6a621a2'... (jobs:185) 2020-04-25 03:59:32,274+0000 INFO (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607) 2020-04-25 03:59:32,751+0000 INFO (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 239.92999999970198, out of 300 (snapshot:613) 2020-04-25 04:00:02,787+0000 INFO (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 269.96999999973923, out of 300 (snapshot:613) 2020-04-25 04:00:32,823+0000 WARN (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Timeout passed, aborting snapshot... (snapshot:622) 2020-04-25 04:01:32,901+0000 INFO (virt/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Snapshot timeout reached, operation aborted (snapshot:81) Info on engine: Apr 25, 2020, 7:54:59 PM Snapshot 'test_recovery' creation for VM 'snapshot_timeout_test' was initiated by admin@internal-authz. Apr 25, 2020, 7:55:34 PM VM snapshot_timeout_test has been paused. Apr 25, 2020, 7:57:55 PM VM snapshot_timeout_test was set to the Unknown status. Apr 25, 2020, 7:59:34 PM VM snapshot_timeout_test status was restored to Paused. Apr 25, 2020, 7:59:34 PM VM snapshot_timeout_test has been paused. Apr 25, 2020, 8:01:34 PM Failed to complete snapshot 'test_recovery' creation for VM 'snapshot_timeout_test'. 3. Create a snapshot with memory, stop vdsmd on host after memory dump start, restart vdsmd after snapshot timeout reached, but before the snapshot coulde be finished. Results: 1) Snapshot creation was aborted right after vdsmd restarted. 2) The partial snapshot was cleaned. 3) VM returned back to up status. Info in /var/log/vdsm/vdsm.log: 2020-04-25 06:00:43,533+0000 INFO (jsonrpc/4) [api.virt] START snapshot(snapDrives=[{'imageID': '9f53ef0e-3fbd-4ced-9756-ea9bf3219b72', 'baseVolumeID': '40f13854-a81d-425c-8642-8d03449ad73d', 'volumeID': '29ce65b7-72a9-4118-97ed-52ec0e8fef39', 'domainID': '5801ec80-1fde-4839-84c7-aef9ec811295'}], snapMemory='5801ec80-1fde-4839-84c7-aef9ec811295,89447e2a-bdeb-43e9-be23-1b947a1f1eb0,eaa1cdeb-6f32-47b6-8271-4b71b9a48cc5,0022740c-aead-4762-b643-9dd0efcd1e9d,f2b4139c-e150-46dd-9dc6-b7cef9a433ab,6b677dd2-e004-4669-b542-ea5517f8594a', frozen=False, jobUUID='903daede-fea9-4581-9cdf-9ae426305393', timeout=5) from=::ffff:10.73.197.64,34342, flow_id=feed4506-021e-4c38-ae9a-8ae235b4f865, vmId=212aee9e-5e32-409f-9d61-41449a6f6a31 (api:48) 2020-04-25 06:00:43,538+0000 INFO (snap_abort/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607) 2020-04-25 06:00:45,057+0000 INFO (virt/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Taking a live snapshot (drives=sda,memory=True) (snapshot:451) 2020-04-25 06:00:45,607+0000 INFO (snap_abort/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 2.0700000002980232, out of 300 (snapshot:613) 2020-04-25 06:02:38,866+0000 INFO (MainThread) [vds] Received signal 15, shutting down (vdsmd:73) 2020-04-25 06:07:18,122+0000 INFO (MainThread) [vds] (PID: 195789) I am the actual vdsm 4.40.13.1 dell-per740-28.lab.eng.pek2.redhat.com (4.18.0-193.el8.x86_64) (vdsmd:155) 2020-04-25 06:07:21,266+0000 INFO (snap_abort/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607) 2020-04-25 06:07:21,740+0000 WARN (snap_abort/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Timeout passed, aborting snapshot... (snapshot:622) 2020-04-25 06:08:21,809+0000 INFO (virt/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Snapshot timeout reached, operation aborted (snapshot:81) Info on engine: Apr 25, 2020, 10:00:11 PM Snapshot 'test_recovery' creation for VM 'snapshot_timeout_test' was initiated by admin@internal-authz. Apr 25, 2020, 10:00:45 PM VM snapshot_timeout_test has been paused. Apr 25, 2020, 10:03:28 PM VM snapshot_timeout_test was set to the Unknown status. Apr 25, 2020, 10:07:30 PM VM snapshot_timeout_test status was restored to Paused. Apr 25, 2020, 10:07:30 PM VM snapshot_timeout_test has been paused. Apr 25, 2020, 10:08:15 PM VM snapshot_timeout_test is not responding. Apr 25, 2020, 10:08:25 PM Failed to complete snapshot 'test_recovery' creation for VM 'snapshot_timeout_test'. 4. Create a snapshot with memory, stop vdsmd on host after memory dump start, restart vdsmd after the time snapshot could be finished. Results: 1) Snapshot was created. 2) VM returned back to up status. Info in /var/log/vdsm/vdsm.log: 2020-04-24 12:22:20,074+0000 INFO (jsonrpc/2) [api.virt] START snapshot(snapDrives=[{'imageID': '9f53ef0e-3fbd-4ced-9756-ea9bf3219b72', 'baseVolumeID': '40f13854-a81d-425c-8642-8d03449ad73d', 'volumeID': '51989b2c-d51e-4f82-b1f0-fad9c3112cda', 'domainID': '5801ec80-1fde-4839-84c7-aef9ec811295'}], snapMemory='5801ec80-1fde-4839-84c7-aef9ec811295,89447e2a-bdeb-43e9-be23-1b947a1f1eb0,3edd0714-697f-4fd0-b61e-817f6e37fcfb,e573d093-e877-4fc2-b558-ed456b6ec7c5,f17853b7-4e33-42c5-809f-c1188a7efc9b,25f3e089-6957-427d-9446-345dfd662296', frozen=False, jobUUID='84c3780c-00b1-4719-a2f4-2d063b973113', timeout=5) from=::ffff:10.73.197.64,44382, flow_id=f959da15-9a6c-4c89-8d44-7684f6b1fd72, vmId=212aee9e-5e32-409f-9d61-41449a6f6a31 (api:48) 2020-04-24 12:22:20,078+0000 INFO (snap_abort/84c3780c) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607) 2020-04-24 12:22:21,511+0000 INFO (virt/84c3780c) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Taking a live snapshot (drives=sda,memory=True) (snapshot:451) 2020-04-24 12:22:22,125+0000 INFO (snap_abort/84c3780c) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 2.0499999998137355, out of 300 (snapshot:613) 2020-04-24 12:25:54,367+0000 INFO (MainThread) [vds] Received signal 15, shutting down (vdsmd:73) 2020-04-24 12:41:26,061+0000 INFO (MainThread) [vds] (PID: 160755) I am the actual vdsm 4.40.13.1 dell-per740-28.lab.eng.pek2.redhat.com (4.18.0-193.el8.x86_64) (vdsmd:155) 2020-04-24 12:41:26,857+0000 INFO (snap_abort/84c3780c) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607) 2020-04-24 12:41:27,375+0000 INFO (snap_abort/84c3780c) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Snapshot job already completed (snapshot:642) Info on engine: Apr 25, 2020, 4:21:50 AM Snapshot 'test_recovery' creation for VM 'snapshot_timeout_test' was initiated by admin@internal-authz. Apr 25, 2020, 4:22:29 AM VM snapshot_timeout_test has been paused. Apr 25, 2020, 4:26:47 AM VM snapshot_timeout_test was set to the Unknown status. Apr 25, 2020, 4:41:29 AM VM snapshot_timeout_test status was restored to Up. Apr 25, 2020, 4:41:36 AM Snapshot 'test_recovery' creation for VM 'snapshot_timeout_test' has been completed. 5. Check if the vm is normal after snapshot was aborted. Results: 1) VM is up after snapshot creation was aborted, and can be shut down, restarted. 2) Set snapshot timeout to a bigger value, like 30mins, creating snapshot with memory could be completed.
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.