Bug 1780943 - [RFE] Add timeout and abort command to async live snapshot
Summary: [RFE] Add timeout and abort command to async live snapshot
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.4.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ovirt-4.4.0
: ---
Assignee: Liran Rotenberg
QA Contact: Qin Yuan
URL:
Whiteboard:
Depends On: 1749284
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-12-08 16:26 UTC by Liran Rotenberg
Modified: 2020-07-29 23:40 UTC (History)
5 users (show)

Fixed In Version: rhv-4.4.0-29
Doc Type: Enhancement
Doc Text:
Previously, a live snapshot of a virtual machine could take an infinite amount of time, locking the virtual machine. With this release, you can set a limit on the amount of time an asynchronous live snapshot can take using the command `engine-config -s LiveSnapshotTimeoutInMinutes=<time>` where `<time>` is a value in minutes. After the set time passes, the snapshot aborts, releasing the lock and enabling you to use the virtual machine. The default value of `<time>` is `30`.
Clone Of:
Environment:
Last Closed: 2020-05-20 20:02:12 UTC
oVirt Team: Virt
rbarry: ovirt-4.4?
pm-rhel: planning_ack?
pm-rhel: devel_ack+
pm-rhel: testing_ack+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 107241 0 master MERGED virt: async snapshot timeout abort 2021-02-19 12:59:05 UTC
oVirt gerrit 107252 0 master MERGED core: async snapshot timeout abort 2021-02-19 12:59:05 UTC

Description Liran Rotenberg 2019-12-08 16:26:17 UTC
When having the asynchronous snapshot process the engine will keep polling the memory dump command for the command's default(~50 hours).

The engine should have a specific polling time for the memory dump, when reaching it, it should abort the memory dump and cancel the snapshot.

This is an incremental bug for bug 1749284.

Comment 2 Qin Yuan 2020-04-26 03:01:10 UTC
Verified with:
rhvm-4.4.0-0.33.master.el8ev.noarch
vdsm-4.40.13-1.el8ev.x86_64

Scenarios and steps:

Prepare:
1) Set snapshot timeout by `engine-config -s LiveSnapshotTimeoutInMinutes=5`
2) Create a VM with 64GiB memory.
3) Run VM
4) load memory

1. Create a snapshot with memory

Results:
1) Snapshot creation was aborted when 5mins timeout reached.
2) The partial snapshot was cleaned.
3) VM returned back to up status.

Info in /var/log/vdsm/vdsm.log:
2020-04-25 07:05:07,152+0000 INFO  (jsonrpc/2) [api.virt] START snapshot(snapDrives=[{'imageID': '9f53ef0e-3fbd-4ced-9756-ea9bf3219b72', 'baseVolumeID': '40f13854-a81d-425c-8642-8d03449ad73d', 'volumeID': 'd1b97d50-4107-41b6-81d4-8fd48d5d3bf4', 'domainID': '5801ec80-1fde-4839-84c7-aef9ec811295'}], snapMemory='5801ec80-1fde-4839-84c7-aef9ec811295,89447e2a-bdeb-43e9-be23-1b947a1f1eb0,161d270a-030a-43ab-9c78-ef83d16b45a4,1066c56a-bd24-496e-88ae-80f3a8466ccc,8646087e-4370-4392-a634-120352639d4b,ba1699a0-b72d-40e8-a070-7d35e9ae7c25', frozen=False, jobUUID='4a5bb336-4dea-4147-ac3a-05c9add34c9f', timeout=5) from=::ffff:10.73.197.64,37632, flow_id=823d3071-38ce-4adb-bd13-1ac42e18637c, vmId=212aee9e-5e32-409f-9d61-41449a6f6a31 (api:48)
2020-04-25 07:05:07,155+0000 INFO  (snap_abort/4a5bb336) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607)
2020-04-25 07:05:08,609+0000 INFO  (virt/4a5bb336) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Taking a live snapshot (drives=sda,memory=True) (snapshot:451)
2020-04-25 07:05:09,201+0000 INFO  (snap_abort/4a5bb336) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 2.040000000037253, out of 300 (snapshot:613)
2020-04-25 07:10:09,485+0000 WARN  (snap_abort/4a5bb336) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Timeout passed, aborting snapshot... (snapshot:622)
2020-04-25 07:11:07,167+0000 INFO  (virt/4a5bb336) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Snapshot timeout reached, operation aborted (snapshot:461)

Info on engine:
Apr 25, 2020, 11:04:43 PM Snapshot 'test_normal' creation for VM 'snapshot_timeout_test' was initiated by admin@internal-authz.
Apr 25, 2020, 11:05:05 PM VM snapshot_timeout_test has been paused.
Apr 25, 2020, 11:11:08 PM Failed to complete snapshot 'test_normal' creation for VM 'snapshot_timeout_test'.


2. Create a snapshot with memory, stop vdsmd on host after memory dump start, restart vdsmd before snapshot timeout reached.

Results:
1) Snapshot creation was aborted after 5mins from the snapshot job starting time, not from vdsmd restart time.
2) The partial snapshot was cleaned.
3) VM returned back to up status.

Info in /var/log/vdsm/vdsm.log:
2020-04-25 03:55:32,814+0000 INFO  (jsonrpc/2) [api.virt] START snapshot(snapDrives=[{'imageID': '9f53ef0e-3fbd-4ced-9756-ea9bf3219b72', 'baseVolumeID': '40f13854-a81d-425c-8642-8d03449ad73d', 'volumeID': '7e43067c-3d81-4eda-97fb-1acba8ea1bb8', 'domainID': '5801ec80-1fde-4839-84c7-aef9ec811295'}], snapMemory='5801ec80-1fde-4839-84c7-aef9ec811295,89447e2a-bdeb-43e9-be23-1b947a1f1eb0,f5038b45-6bfd-4bdb-abc9-0a3415feffd7,9ba92a4a-3c1e-4da4-ac3a-01f62276085b,8f5a759a-4521-412b-9730-4ba620e32fdb,7049c118-94a1-4d78-be1d-48c6c3293c10', frozen=False, jobUUID='8d4fd852-8451-417f-b6df-6af7e6a621a2', timeout=5) from=::ffff:10.73.197.64,46672, flow_id=3502422e-70c7-413c-8a23-e3b6e73814e8, vmId=212aee9e-5e32-409f-9d61-41449a6f6a31 (api:48)
2020-04-25 03:55:32,818+0000 INFO  (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607)
2020-04-25 03:55:34,248+0000 INFO  (virt/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Taking a live snapshot (drives=sda,memory=True) (snapshot:451)
2020-04-25 03:55:34,871+0000 INFO  (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 2.0499999998137355, out of 300 (snapshot:613)
2020-04-25 03:56:58,366+0000 INFO  (MainThread) [vds] Received signal 15, shutting down (vdsmd:73)
2020-04-25 03:59:27,952+0000 INFO  (MainThread) [vds] (PID: 191361) I am the actual vdsm 4.40.13.1 dell-per740-28.lab.eng.pek2.redhat.com (4.18.0-193.el8.x86_64) (vdsmd:155)
2020-04-25 03:59:32,273+0000 INFO  (virt/8d4fd852) [root] Running job '8d4fd852-8451-417f-b6df-6af7e6a621a2'... (jobs:185)
2020-04-25 03:59:32,274+0000 INFO  (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607)
2020-04-25 03:59:32,751+0000 INFO  (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 239.92999999970198, out of 300 (snapshot:613)
2020-04-25 04:00:02,787+0000 INFO  (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 269.96999999973923, out of 300 (snapshot:613)
2020-04-25 04:00:32,823+0000 WARN  (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Timeout passed, aborting snapshot... (snapshot:622)
2020-04-25 04:01:32,901+0000 INFO  (virt/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Snapshot timeout reached, operation aborted (snapshot:81)

Info on engine:
Apr 25, 2020, 7:54:59 PM Snapshot 'test_recovery' creation for VM 'snapshot_timeout_test' was initiated by admin@internal-authz.
Apr 25, 2020, 7:55:34 PM VM snapshot_timeout_test has been paused.
Apr 25, 2020, 7:57:55 PM VM snapshot_timeout_test was set to the Unknown status.
Apr 25, 2020, 7:59:34 PM VM snapshot_timeout_test status was restored to Paused.
Apr 25, 2020, 7:59:34 PM VM snapshot_timeout_test has been paused.
Apr 25, 2020, 8:01:34 PM Failed to complete snapshot 'test_recovery' creation for VM 'snapshot_timeout_test'.


3. Create a snapshot with memory, stop vdsmd on host after memory dump start, restart vdsmd after snapshot timeout reached, but before the snapshot coulde be finished.

Results:
1) Snapshot creation was aborted right after vdsmd restarted.
2) The partial snapshot was cleaned.
3) VM returned back to up status. 

Info in /var/log/vdsm/vdsm.log:
2020-04-25 06:00:43,533+0000 INFO  (jsonrpc/4) [api.virt] START snapshot(snapDrives=[{'imageID': '9f53ef0e-3fbd-4ced-9756-ea9bf3219b72', 'baseVolumeID': '40f13854-a81d-425c-8642-8d03449ad73d', 'volumeID': '29ce65b7-72a9-4118-97ed-52ec0e8fef39', 'domainID': '5801ec80-1fde-4839-84c7-aef9ec811295'}], snapMemory='5801ec80-1fde-4839-84c7-aef9ec811295,89447e2a-bdeb-43e9-be23-1b947a1f1eb0,eaa1cdeb-6f32-47b6-8271-4b71b9a48cc5,0022740c-aead-4762-b643-9dd0efcd1e9d,f2b4139c-e150-46dd-9dc6-b7cef9a433ab,6b677dd2-e004-4669-b542-ea5517f8594a', frozen=False, jobUUID='903daede-fea9-4581-9cdf-9ae426305393', timeout=5) from=::ffff:10.73.197.64,34342, flow_id=feed4506-021e-4c38-ae9a-8ae235b4f865, vmId=212aee9e-5e32-409f-9d61-41449a6f6a31 (api:48)
2020-04-25 06:00:43,538+0000 INFO  (snap_abort/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607)
2020-04-25 06:00:45,057+0000 INFO  (virt/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Taking a live snapshot (drives=sda,memory=True) (snapshot:451)
2020-04-25 06:00:45,607+0000 INFO  (snap_abort/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 2.0700000002980232, out of 300 (snapshot:613)
2020-04-25 06:02:38,866+0000 INFO  (MainThread) [vds] Received signal 15, shutting down (vdsmd:73)
2020-04-25 06:07:18,122+0000 INFO  (MainThread) [vds] (PID: 195789) I am the actual vdsm 4.40.13.1 dell-per740-28.lab.eng.pek2.redhat.com (4.18.0-193.el8.x86_64) (vdsmd:155)
2020-04-25 06:07:21,266+0000 INFO  (snap_abort/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607)
2020-04-25 06:07:21,740+0000 WARN  (snap_abort/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Timeout passed, aborting snapshot... (snapshot:622)
2020-04-25 06:08:21,809+0000 INFO  (virt/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Snapshot timeout reached, operation aborted (snapshot:81)

Info on engine:
Apr 25, 2020, 10:00:11 PM Snapshot 'test_recovery' creation for VM 'snapshot_timeout_test' was initiated by admin@internal-authz.
Apr 25, 2020, 10:00:45 PM VM snapshot_timeout_test has been paused.
Apr 25, 2020, 10:03:28 PM VM snapshot_timeout_test was set to the Unknown status.
Apr 25, 2020, 10:07:30 PM VM snapshot_timeout_test status was restored to Paused.
Apr 25, 2020, 10:07:30 PM VM snapshot_timeout_test has been paused.
Apr 25, 2020, 10:08:15 PM VM snapshot_timeout_test is not responding.
Apr 25, 2020, 10:08:25 PM Failed to complete snapshot 'test_recovery' creation for VM 'snapshot_timeout_test'.


4. Create a snapshot with memory, stop vdsmd on host after memory dump start, restart vdsmd after the time snapshot could be finished.

Results:
1) Snapshot was created.
2) VM returned back to up status.

Info in /var/log/vdsm/vdsm.log:
2020-04-24 12:22:20,074+0000 INFO  (jsonrpc/2) [api.virt] START snapshot(snapDrives=[{'imageID': '9f53ef0e-3fbd-4ced-9756-ea9bf3219b72', 'baseVolumeID': '40f13854-a81d-425c-8642-8d03449ad73d', 'volumeID': '51989b2c-d51e-4f82-b1f0-fad9c3112cda', 'domainID': '5801ec80-1fde-4839-84c7-aef9ec811295'}], snapMemory='5801ec80-1fde-4839-84c7-aef9ec811295,89447e2a-bdeb-43e9-be23-1b947a1f1eb0,3edd0714-697f-4fd0-b61e-817f6e37fcfb,e573d093-e877-4fc2-b558-ed456b6ec7c5,f17853b7-4e33-42c5-809f-c1188a7efc9b,25f3e089-6957-427d-9446-345dfd662296', frozen=False, jobUUID='84c3780c-00b1-4719-a2f4-2d063b973113', timeout=5) from=::ffff:10.73.197.64,44382, flow_id=f959da15-9a6c-4c89-8d44-7684f6b1fd72, vmId=212aee9e-5e32-409f-9d61-41449a6f6a31 (api:48)
2020-04-24 12:22:20,078+0000 INFO  (snap_abort/84c3780c) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607)
2020-04-24 12:22:21,511+0000 INFO  (virt/84c3780c) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Taking a live snapshot (drives=sda,memory=True) (snapshot:451)
2020-04-24 12:22:22,125+0000 INFO  (snap_abort/84c3780c) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 2.0499999998137355, out of 300 (snapshot:613)
2020-04-24 12:25:54,367+0000 INFO  (MainThread) [vds] Received signal 15, shutting down (vdsmd:73)
2020-04-24 12:41:26,061+0000 INFO  (MainThread) [vds] (PID: 160755) I am the actual vdsm 4.40.13.1 dell-per740-28.lab.eng.pek2.redhat.com (4.18.0-193.el8.x86_64) (vdsmd:155)
2020-04-24 12:41:26,857+0000 INFO  (snap_abort/84c3780c) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607)
2020-04-24 12:41:27,375+0000 INFO  (snap_abort/84c3780c) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Snapshot job already completed (snapshot:642)

Info on engine:
Apr 25, 2020, 4:21:50 AM Snapshot 'test_recovery' creation for VM 'snapshot_timeout_test' was initiated by admin@internal-authz.
Apr 25, 2020, 4:22:29 AM VM snapshot_timeout_test has been paused.
Apr 25, 2020, 4:26:47 AM VM snapshot_timeout_test was set to the Unknown status.
Apr 25, 2020, 4:41:29 AM VM snapshot_timeout_test status was restored to Up.
Apr 25, 2020, 4:41:36 AM Snapshot 'test_recovery' creation for VM 'snapshot_timeout_test' has been completed.


5. Check if the vm is normal after snapshot was aborted.

Results:
1) VM is up after snapshot creation was aborted, and can be shut down, restarted.
2) Set snapshot timeout to a bigger value, like 30mins, creating snapshot with memory could be completed.

Comment 3 Sandro Bonazzola 2020-05-20 20:02:12 UTC
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be
resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.