Bug 1780943

Summary: [RFE] Add timeout and abort command to async live snapshot
Product: [oVirt] ovirt-engine Reporter: Liran Rotenberg <lrotenbe>
Component: Backend.CoreAssignee: Liran Rotenberg <lrotenbe>
Status: CLOSED CURRENTRELEASE QA Contact: Qin Yuan <qiyuan>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.4.0CC: bugs, qiyuan, rbarry, rdlugyhe, sgoodman
Target Milestone: ovirt-4.4.0Keywords: FutureFeature, RFE
Target Release: ---Flags: rbarry: ovirt-4.4?
pm-rhel: planning_ack?
pm-rhel: devel_ack+
pm-rhel: testing_ack+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: rhv-4.4.0-29 Doc Type: Enhancement
Doc Text:
Previously, a live snapshot of a virtual machine could take an infinite amount of time, locking the virtual machine. With this release, you can set a limit on the amount of time an asynchronous live snapshot can take using the command `engine-config -s LiveSnapshotTimeoutInMinutes=<time>` where `<time>` is a value in minutes. After the set time passes, the snapshot aborts, releasing the lock and enabling you to use the virtual machine. The default value of `<time>` is `30`.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-05-20 20:02:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1749284    
Bug Blocks:    

Description Liran Rotenberg 2019-12-08 16:26:17 UTC
When having the asynchronous snapshot process the engine will keep polling the memory dump command for the command's default(~50 hours).

The engine should have a specific polling time for the memory dump, when reaching it, it should abort the memory dump and cancel the snapshot.

This is an incremental bug for bug 1749284.

Comment 2 Qin Yuan 2020-04-26 03:01:10 UTC
Verified with:
rhvm-4.4.0-0.33.master.el8ev.noarch
vdsm-4.40.13-1.el8ev.x86_64

Scenarios and steps:

Prepare:
1) Set snapshot timeout by `engine-config -s LiveSnapshotTimeoutInMinutes=5`
2) Create a VM with 64GiB memory.
3) Run VM
4) load memory

1. Create a snapshot with memory

Results:
1) Snapshot creation was aborted when 5mins timeout reached.
2) The partial snapshot was cleaned.
3) VM returned back to up status.

Info in /var/log/vdsm/vdsm.log:
2020-04-25 07:05:07,152+0000 INFO  (jsonrpc/2) [api.virt] START snapshot(snapDrives=[{'imageID': '9f53ef0e-3fbd-4ced-9756-ea9bf3219b72', 'baseVolumeID': '40f13854-a81d-425c-8642-8d03449ad73d', 'volumeID': 'd1b97d50-4107-41b6-81d4-8fd48d5d3bf4', 'domainID': '5801ec80-1fde-4839-84c7-aef9ec811295'}], snapMemory='5801ec80-1fde-4839-84c7-aef9ec811295,89447e2a-bdeb-43e9-be23-1b947a1f1eb0,161d270a-030a-43ab-9c78-ef83d16b45a4,1066c56a-bd24-496e-88ae-80f3a8466ccc,8646087e-4370-4392-a634-120352639d4b,ba1699a0-b72d-40e8-a070-7d35e9ae7c25', frozen=False, jobUUID='4a5bb336-4dea-4147-ac3a-05c9add34c9f', timeout=5) from=::ffff:10.73.197.64,37632, flow_id=823d3071-38ce-4adb-bd13-1ac42e18637c, vmId=212aee9e-5e32-409f-9d61-41449a6f6a31 (api:48)
2020-04-25 07:05:07,155+0000 INFO  (snap_abort/4a5bb336) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607)
2020-04-25 07:05:08,609+0000 INFO  (virt/4a5bb336) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Taking a live snapshot (drives=sda,memory=True) (snapshot:451)
2020-04-25 07:05:09,201+0000 INFO  (snap_abort/4a5bb336) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 2.040000000037253, out of 300 (snapshot:613)
2020-04-25 07:10:09,485+0000 WARN  (snap_abort/4a5bb336) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Timeout passed, aborting snapshot... (snapshot:622)
2020-04-25 07:11:07,167+0000 INFO  (virt/4a5bb336) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Snapshot timeout reached, operation aborted (snapshot:461)

Info on engine:
Apr 25, 2020, 11:04:43 PM Snapshot 'test_normal' creation for VM 'snapshot_timeout_test' was initiated by admin@internal-authz.
Apr 25, 2020, 11:05:05 PM VM snapshot_timeout_test has been paused.
Apr 25, 2020, 11:11:08 PM Failed to complete snapshot 'test_normal' creation for VM 'snapshot_timeout_test'.


2. Create a snapshot with memory, stop vdsmd on host after memory dump start, restart vdsmd before snapshot timeout reached.

Results:
1) Snapshot creation was aborted after 5mins from the snapshot job starting time, not from vdsmd restart time.
2) The partial snapshot was cleaned.
3) VM returned back to up status.

Info in /var/log/vdsm/vdsm.log:
2020-04-25 03:55:32,814+0000 INFO  (jsonrpc/2) [api.virt] START snapshot(snapDrives=[{'imageID': '9f53ef0e-3fbd-4ced-9756-ea9bf3219b72', 'baseVolumeID': '40f13854-a81d-425c-8642-8d03449ad73d', 'volumeID': '7e43067c-3d81-4eda-97fb-1acba8ea1bb8', 'domainID': '5801ec80-1fde-4839-84c7-aef9ec811295'}], snapMemory='5801ec80-1fde-4839-84c7-aef9ec811295,89447e2a-bdeb-43e9-be23-1b947a1f1eb0,f5038b45-6bfd-4bdb-abc9-0a3415feffd7,9ba92a4a-3c1e-4da4-ac3a-01f62276085b,8f5a759a-4521-412b-9730-4ba620e32fdb,7049c118-94a1-4d78-be1d-48c6c3293c10', frozen=False, jobUUID='8d4fd852-8451-417f-b6df-6af7e6a621a2', timeout=5) from=::ffff:10.73.197.64,46672, flow_id=3502422e-70c7-413c-8a23-e3b6e73814e8, vmId=212aee9e-5e32-409f-9d61-41449a6f6a31 (api:48)
2020-04-25 03:55:32,818+0000 INFO  (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607)
2020-04-25 03:55:34,248+0000 INFO  (virt/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Taking a live snapshot (drives=sda,memory=True) (snapshot:451)
2020-04-25 03:55:34,871+0000 INFO  (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 2.0499999998137355, out of 300 (snapshot:613)
2020-04-25 03:56:58,366+0000 INFO  (MainThread) [vds] Received signal 15, shutting down (vdsmd:73)
2020-04-25 03:59:27,952+0000 INFO  (MainThread) [vds] (PID: 191361) I am the actual vdsm 4.40.13.1 dell-per740-28.lab.eng.pek2.redhat.com (4.18.0-193.el8.x86_64) (vdsmd:155)
2020-04-25 03:59:32,273+0000 INFO  (virt/8d4fd852) [root] Running job '8d4fd852-8451-417f-b6df-6af7e6a621a2'... (jobs:185)
2020-04-25 03:59:32,274+0000 INFO  (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607)
2020-04-25 03:59:32,751+0000 INFO  (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 239.92999999970198, out of 300 (snapshot:613)
2020-04-25 04:00:02,787+0000 INFO  (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 269.96999999973923, out of 300 (snapshot:613)
2020-04-25 04:00:32,823+0000 WARN  (snap_abort/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Timeout passed, aborting snapshot... (snapshot:622)
2020-04-25 04:01:32,901+0000 INFO  (virt/8d4fd852) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Snapshot timeout reached, operation aborted (snapshot:81)

Info on engine:
Apr 25, 2020, 7:54:59 PM Snapshot 'test_recovery' creation for VM 'snapshot_timeout_test' was initiated by admin@internal-authz.
Apr 25, 2020, 7:55:34 PM VM snapshot_timeout_test has been paused.
Apr 25, 2020, 7:57:55 PM VM snapshot_timeout_test was set to the Unknown status.
Apr 25, 2020, 7:59:34 PM VM snapshot_timeout_test status was restored to Paused.
Apr 25, 2020, 7:59:34 PM VM snapshot_timeout_test has been paused.
Apr 25, 2020, 8:01:34 PM Failed to complete snapshot 'test_recovery' creation for VM 'snapshot_timeout_test'.


3. Create a snapshot with memory, stop vdsmd on host after memory dump start, restart vdsmd after snapshot timeout reached, but before the snapshot coulde be finished.

Results:
1) Snapshot creation was aborted right after vdsmd restarted.
2) The partial snapshot was cleaned.
3) VM returned back to up status. 

Info in /var/log/vdsm/vdsm.log:
2020-04-25 06:00:43,533+0000 INFO  (jsonrpc/4) [api.virt] START snapshot(snapDrives=[{'imageID': '9f53ef0e-3fbd-4ced-9756-ea9bf3219b72', 'baseVolumeID': '40f13854-a81d-425c-8642-8d03449ad73d', 'volumeID': '29ce65b7-72a9-4118-97ed-52ec0e8fef39', 'domainID': '5801ec80-1fde-4839-84c7-aef9ec811295'}], snapMemory='5801ec80-1fde-4839-84c7-aef9ec811295,89447e2a-bdeb-43e9-be23-1b947a1f1eb0,eaa1cdeb-6f32-47b6-8271-4b71b9a48cc5,0022740c-aead-4762-b643-9dd0efcd1e9d,f2b4139c-e150-46dd-9dc6-b7cef9a433ab,6b677dd2-e004-4669-b542-ea5517f8594a', frozen=False, jobUUID='903daede-fea9-4581-9cdf-9ae426305393', timeout=5) from=::ffff:10.73.197.64,34342, flow_id=feed4506-021e-4c38-ae9a-8ae235b4f865, vmId=212aee9e-5e32-409f-9d61-41449a6f6a31 (api:48)
2020-04-25 06:00:43,538+0000 INFO  (snap_abort/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607)
2020-04-25 06:00:45,057+0000 INFO  (virt/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Taking a live snapshot (drives=sda,memory=True) (snapshot:451)
2020-04-25 06:00:45,607+0000 INFO  (snap_abort/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 2.0700000002980232, out of 300 (snapshot:613)
2020-04-25 06:02:38,866+0000 INFO  (MainThread) [vds] Received signal 15, shutting down (vdsmd:73)
2020-04-25 06:07:18,122+0000 INFO  (MainThread) [vds] (PID: 195789) I am the actual vdsm 4.40.13.1 dell-per740-28.lab.eng.pek2.redhat.com (4.18.0-193.el8.x86_64) (vdsmd:155)
2020-04-25 06:07:21,266+0000 INFO  (snap_abort/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607)
2020-04-25 06:07:21,740+0000 WARN  (snap_abort/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Timeout passed, aborting snapshot... (snapshot:622)
2020-04-25 06:08:21,809+0000 INFO  (virt/903daede) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Snapshot timeout reached, operation aborted (snapshot:81)

Info on engine:
Apr 25, 2020, 10:00:11 PM Snapshot 'test_recovery' creation for VM 'snapshot_timeout_test' was initiated by admin@internal-authz.
Apr 25, 2020, 10:00:45 PM VM snapshot_timeout_test has been paused.
Apr 25, 2020, 10:03:28 PM VM snapshot_timeout_test was set to the Unknown status.
Apr 25, 2020, 10:07:30 PM VM snapshot_timeout_test status was restored to Paused.
Apr 25, 2020, 10:07:30 PM VM snapshot_timeout_test has been paused.
Apr 25, 2020, 10:08:15 PM VM snapshot_timeout_test is not responding.
Apr 25, 2020, 10:08:25 PM Failed to complete snapshot 'test_recovery' creation for VM 'snapshot_timeout_test'.


4. Create a snapshot with memory, stop vdsmd on host after memory dump start, restart vdsmd after the time snapshot could be finished.

Results:
1) Snapshot was created.
2) VM returned back to up status.

Info in /var/log/vdsm/vdsm.log:
2020-04-24 12:22:20,074+0000 INFO  (jsonrpc/2) [api.virt] START snapshot(snapDrives=[{'imageID': '9f53ef0e-3fbd-4ced-9756-ea9bf3219b72', 'baseVolumeID': '40f13854-a81d-425c-8642-8d03449ad73d', 'volumeID': '51989b2c-d51e-4f82-b1f0-fad9c3112cda', 'domainID': '5801ec80-1fde-4839-84c7-aef9ec811295'}], snapMemory='5801ec80-1fde-4839-84c7-aef9ec811295,89447e2a-bdeb-43e9-be23-1b947a1f1eb0,3edd0714-697f-4fd0-b61e-817f6e37fcfb,e573d093-e877-4fc2-b558-ed456b6ec7c5,f17853b7-4e33-42c5-809f-c1188a7efc9b,25f3e089-6957-427d-9446-345dfd662296', frozen=False, jobUUID='84c3780c-00b1-4719-a2f4-2d063b973113', timeout=5) from=::ffff:10.73.197.64,44382, flow_id=f959da15-9a6c-4c89-8d44-7684f6b1fd72, vmId=212aee9e-5e32-409f-9d61-41449a6f6a31 (api:48)
2020-04-24 12:22:20,078+0000 INFO  (snap_abort/84c3780c) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607)
2020-04-24 12:22:21,511+0000 INFO  (virt/84c3780c) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Taking a live snapshot (drives=sda,memory=True) (snapshot:451)
2020-04-24 12:22:22,125+0000 INFO  (snap_abort/84c3780c) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Time passed: 2.0499999998137355, out of 300 (snapshot:613)
2020-04-24 12:25:54,367+0000 INFO  (MainThread) [vds] Received signal 15, shutting down (vdsmd:73)
2020-04-24 12:41:26,061+0000 INFO  (MainThread) [vds] (PID: 160755) I am the actual vdsm 4.40.13.1 dell-per740-28.lab.eng.pek2.redhat.com (4.18.0-193.el8.x86_64) (vdsmd:155)
2020-04-24 12:41:26,857+0000 INFO  (snap_abort/84c3780c) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Starting snapshot abort job, with check interval 30 (snapshot:607)
2020-04-24 12:41:27,375+0000 INFO  (snap_abort/84c3780c) [virt.vm] (vmId='212aee9e-5e32-409f-9d61-41449a6f6a31') Snapshot job already completed (snapshot:642)

Info on engine:
Apr 25, 2020, 4:21:50 AM Snapshot 'test_recovery' creation for VM 'snapshot_timeout_test' was initiated by admin@internal-authz.
Apr 25, 2020, 4:22:29 AM VM snapshot_timeout_test has been paused.
Apr 25, 2020, 4:26:47 AM VM snapshot_timeout_test was set to the Unknown status.
Apr 25, 2020, 4:41:29 AM VM snapshot_timeout_test status was restored to Up.
Apr 25, 2020, 4:41:36 AM Snapshot 'test_recovery' creation for VM 'snapshot_timeout_test' has been completed.


5. Check if the vm is normal after snapshot was aborted.

Results:
1) VM is up after snapshot creation was aborted, and can be shut down, restarted.
2) Set snapshot timeout to a bigger value, like 30mins, creating snapshot with memory could be completed.

Comment 3 Sandro Bonazzola 2020-05-20 20:02:12 UTC
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be
resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.