Bug 1749284

Summary:	Change the Snapshot operation to be asynchronous
Product:	Red Hat Enterprise Virtualization Manager	Reporter:	Michal Skrivanek <michal.skrivanek>
Component:	ovirt-engine	Assignee:	Liran Rotenberg <lrotenbe>
Status:	CLOSED ERRATA	QA Contact:	Beni Pelled <bpelled>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.3.7	CC:	bugs, dfodor, pelauter, rdlugyhe, rhodain, sgoodman
Target Milestone:	ovirt-4.4.0	Flags:	lsvaty: testing_plan_complete-
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	rhv-4.4.0-29	Doc Type:	Enhancement
Doc Text:	Before this update, the live snapshot operation was synchronized, such that if VDSM required more than 180 seconds to create a snapshot, the operation failed, preventing snapshots of some virtual machines, such as those with large memory loads, or slow storage. With this update, the live snapshot operation is asynchronous, so the operation runs until it ends successfully, regardless of how long it takes.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-08-04 13:20:31 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Virt	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1565552
Bug Blocks:	1687345, 1780943

Description Michal Skrivanek 2019-09-05 10:04:30 UTC

This bug was initially created as a copy of Bug #1687345

Change snapshot to be an async operation instead of current sync call between the engine and vdsm.


Description of problem:

Currently, the engine assumes that the "snapshot" command issued to the vdsm will complete in 180 seconds. However,  for memory snapshots, it can take more time depending upon the RAM of the VM since an copying operation happens during this phase in addition to the swapping of the images. So the time it can take may be much more than 180 seconds depending on the VMs RAM size and the storage connectivity speed.

I was able to reproduce this in my test environment.

The snapshot command was issued to the VM.

===
2019-03-11 14:02:25,927+0530 INFO  (jsonrpc/5) [api.virt] START snapshot(snapDrives=[{u'baseVolumeID': u'f71cf012-e79f-4654-ac8e-cbf30bbd2ffd', u'domainID': u'f16946db-79e2-4bd8-8acd-855e7e4ab0f2', u'volumeID': u'72e83f93-4e9f-4b8b-bc87-82504facbfcd', u'imageID': u'fb61b343-4cdb-4f57-8d0b-32058e463de5'}], snapMemory=u'f16946db-79e2-4bd8-8acd-855e7e4ab0f2,5c3f0ee4-335d-11e9-ad0e-001a4a17016f,ca39f7c5-f9e7-4e20-bb2a-865fcf7c952a,782183fc-25c1-4ae3-b573-91c487ac3ab1,a52d3ce2-ab6e-47df-b249-3672fe55a188,e3ff1738-192d-4351-9b99-2b8ab9f0ea74', frozen=False) from=::ffff:10.74.130.110,37200, flow_id=56193ec3-f1c4-4e10-b844-674f2dbc31ff, vmId=b1ade260-811a-42ea-8cca-f5a5dd67a1a5 (api:46)
====

Worker blocked messages.

===
2019-03-11 14:03:26,000+0530 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 running <Task <JsonRpcTask {'params': {u'frozen': False, u'vmID': u'b1ade260-811a-42ea-8cca-f5a5dd67a1a5', u'snapDrives': [{u'baseVolumeID': u'f71cf012-e79f-4654-ac8e-cbf30bbd2ffd', u'domainID': u'f16946db-79e2-4bd8-8acd-855e7e4ab0f2', u'volumeID': u'72e83f93-4e9f-4b8b-bc87-82504facbfcd', u'imageID': u'fb61b343-4cdb-4f57-8d0b-32058e463de5'}], u'snapMemory': u'f16946db-79e2-4bd8-8acd-855e7e4ab0f2,5c3f0ee4-335d-11e9-ad0e-001a4a17016f,ca39f7c5-f9e7-4e20-bb2a-865fcf7c952a,782183fc-25c1-4ae3-b573-91c487ac3ab1,a52d3ce2-ab6e-47df-b249-3672fe55a188,e3ff1738-192d-4351-9b99-2b8ab9f0ea74'}, 'jsonrpc': '2.0', 'method': u'VM.snapshot', 'id': u'3e47b6d5-7fc2-4d33-a0ad-65433e9d9dae'} at 0x7fea7968bcd0> timeout=60, duration=60 at 0x7fea7968bad0> task#=197 at 0x7fea54077e10>, traceback:

File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4617, in snapshot
  self._dom.snapshotCreateXML(snapxml, snapFlags)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
  ret = attr(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
  ret = f(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
  return func(inst, *args, **kwargs)
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 2617, in snapshotCreateXML
  ret = libvirtmod.virDomainSnapshotCreateXML(self._o, xmlDesc, flags) (executor:363)
====

Engine timeout after 3 minutes.

===
2019-03-11 04:32:19,790-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-2615) [56193ec3-f1c4-4e10-b844-674f2dbc31ff] EVENT_ID: USER_CREATE_SNAPSHOT(45), Snapshot 'snap1' creation for VM 'test_vm' was initiated by admin@internal-authz.


2019-03-11 04:35:27,063-04 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-98) [56193ec3-f1c4-4e10-b844-674f2dbc31ff] EVENT_ID: USER_CREATE_LIVE_SNAPSHOT_FINISHED_FAILURE(170), Failed to create live snapshot 'snap1' for VM 'test_vm'. VM restart is recommended. Note that using the created snapshot might cause data inconsistency.
===

It even deleted the newly created leaf LV as part of revert workflow although the snapshot operation is still going on...

===
2019-03-11 04:35:29,305-04 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-33) [56193ec3-f1c4-4e10-b844-674f2dbc31ff] START, DestroyImageVDSCommand( DestroyImageVDSCommandParameters:{storagePoolId='5c3f0ee4-335d-11e9-ad0e-001a4a17016f', ignoreFailoverLimit='false', storageDomainId='f16946db-79e2-4bd8-8acd-855e7e4ab0f2', imageGroupId='fb61b343-4cdb-4f57-8d0b-32058e463de5', imageId='00000000-0000-0000-0000-000000000000', imageList='[72e83f93-4e9f-4b8b-bc87-82504facbfcd]', postZero='false', force='false'}), log id: 75a08f03
===

The snapshot got completed and shows error about the unavailability of the volume.
   
===
2019-03-11 14:27:56,740+0530 INFO  (jsonrpc/5) [virt.vm] (vmId='b1ade260-811a-42ea-8cca-f5a5dd67a1a5') Completed live snapshot (vm:4618)

2019-03-11 14:27:56,834+0530 ERROR (jsonrpc/5) [virt.vm] (vmId='b1ade260-811a-42ea-8cca-f5a5dd67a1a5') Unable to update drive 'sda' volume size: Unable to get volume size for domain f16946db-79e2-4bd8-8acd-855e7e4ab0f2 volume 72e83f93-4e9f-4b8b-bc87-82504facbfcd (vm:4657)

2019-03-11 14:27:56,835+0530 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call VM.snapshot succeeded in 1530.91 seconds (__init__:573)
===

Subsequent snapshot operation will fail with an error "volume doesn't exist".


Version-Release number of selected component (if applicable):

vdsm-4.20.43-1.el7ev.x86_64

How reproducible:

100%

Steps to Reproduce:

1. Created a VM with 32 GB memory.
2. Increased memory utilization with the VM using command below. 

stress --vm-bytes $(awk '/MemFree/{printf "%d\n", $2 * 0.9;}' < /proc/meminfo)k --vm-keep -m 1

3. Took the VM snapshot with memory.

Actual results:

Snapshot with memory volumes can timeout if the memory dump takes more than 180 seconds.

Expected results:

I think the snapshot operation with memory volumes should be monitored separately instead of default vdsTimeout.

Additional info:

Comment 3 Beni Pelled 2020-04-22 08:08:10 UTC

Verified with:
- RHV 4.4.0-0.33.master.el8ev
- libvirt-6.0.0-17.module+el8.2.0+6257+0d066c28.x86_64
- vdsm-4.40.13-1.el8ev.x86_64

Verification steps:
1. Create and start a VM with 32G RAM
2. Take a snapshot with memory
3. Brake the SD connection and reconnect after a few minutes

Result:
- The snapshot process took ~5 minutes and it finished successfully (see logs below)

> engine log (start & end process):
2020-04-22 10:12:52,594+03 INFO  [org.ovirt.engine.core.bll.snapshots.CreateLiveSnapshotForVmCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-93) [24e1d9ba-e8b1-4ac2-816e-9dbc6faa10e4] Running command: Create LiveSnapshotForVmCommand internal: true. Entities affected :  ID: c719d7f2-5b93-44e4-811b-79b1ce745d55 Type: VMAction group MANIPULATE_VM_SNAPSHOTS with role type USER
2020-04-22 10:17:00,510+03 INFO  [org.ovirt.engine.core.bll.snapshots.CreateLiveSnapshotForVmCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [24e1d9ba-e8b1-4ac2-816e-9dbc6faa10e4] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateLiveSnapshotForVmCommand' successfully.

> VDSM log (start & end process):
2020-04-22 10:12:52,736+0300 INFO  (jsonrpc/2) [api.virt] START snapshot(snapDrives=[{'imageID': '2eb86f10-73d3-44f7-aae1-e8cc7724c800', 'baseVolumeID': '76576ffd-e672-4ff0-ac20-85a3b70d0105', 'volumeID': 'e7e92dd2-02d6-4b2e-9bcb-90f38387cd89', 'domainID': '34870b03-c3d0-4968-9df4-db4d5104a5de'}], snapMemory='34870b03-c3d0-4968-9df4-db4d5104a5de,bc8d7bea-d7fe-48a4-9e17-1392ef5e8a7e,cae0dae7-7c50-4ecd-a2ca-91fcede15113,1a04f7c5-8c15-46c6-9cae-ede712ec738f,21283d32-d858-4625-a3fa-9fbc792b1637,e8efd959-ab3b-487e-b09c-c76986cd0bfb', frozen=False, jobUUID='b3fa3ca6-9a7e-4753-966c-8bf8d5ac452d', timeout=30) from=::ffff:10.35.161.9,50314, flow_id=24e1d9ba-e8b1-4ac2-816e-9dbc6faa10e4, vmId=c719d7f2-5b93-44e4-811b-79b1ce745d55 (api:48)
2020-04-22 10:12:54,058+0300 INFO  (virt/b3fa3ca6) [virt.vm] (vmId='c719d7f2-5b93-44e4-811b-79b1ce745d55') Taking a live snapshot (drives=vda,memory=True) (snapshot:451)
2020-04-22 10:16:45,482+0300 INFO  (virt/b3fa3ca6) [virt.vm] (vmId='c719d7f2-5b93-44e4-811b-79b1ce745d55') Completed live snapshot (snapshot:476)

Comment 7 errata-xmlrpc 2020-08-04 13:20:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247