Description of problem:

Currently, the engine assumes that the "snapshot" command issued to vdsm will complete within 180 seconds. For memory snapshots, however, the operation can take much longer: in addition to the swapping of the images, a copy of the VM's memory is written out during this phase, so the total time can be well over 180 seconds depending on the VM's RAM size and the storage connectivity speed.

I was able to reproduce this in my test environment.

The snapshot command was issued to the VM:
===
2019-03-11 14:02:25,927+0530 INFO (jsonrpc/5) [api.virt] START snapshot(snapDrives=[{u'baseVolumeID': u'f71cf012-e79f-4654-ac8e-cbf30bbd2ffd', u'domainID': u'f16946db-79e2-4bd8-8acd-855e7e4ab0f2', u'volumeID': u'72e83f93-4e9f-4b8b-bc87-82504facbfcd', u'imageID': u'fb61b343-4cdb-4f57-8d0b-32058e463de5'}], snapMemory=u'f16946db-79e2-4bd8-8acd-855e7e4ab0f2,5c3f0ee4-335d-11e9-ad0e-001a4a17016f,ca39f7c5-f9e7-4e20-bb2a-865fcf7c952a,782183fc-25c1-4ae3-b573-91c487ac3ab1,a52d3ce2-ab6e-47df-b249-3672fe55a188,e3ff1738-192d-4351-9b99-2b8ab9f0ea74', frozen=False) from=::ffff:10.74.130.110,37200, flow_id=56193ec3-f1c4-4e10-b844-674f2dbc31ff, vmId=b1ade260-811a-42ea-8cca-f5a5dd67a1a5 (api:46)
===

Worker blocked messages:
===
2019-03-11 14:03:26,000+0530 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 running <Task <JsonRpcTask {'params': {u'frozen': False, u'vmID': u'b1ade260-811a-42ea-8cca-f5a5dd67a1a5', u'snapDrives': [{u'baseVolumeID': u'f71cf012-e79f-4654-ac8e-cbf30bbd2ffd', u'domainID': u'f16946db-79e2-4bd8-8acd-855e7e4ab0f2', u'volumeID': u'72e83f93-4e9f-4b8b-bc87-82504facbfcd', u'imageID': u'fb61b343-4cdb-4f57-8d0b-32058e463de5'}], u'snapMemory': u'f16946db-79e2-4bd8-8acd-855e7e4ab0f2,5c3f0ee4-335d-11e9-ad0e-001a4a17016f,ca39f7c5-f9e7-4e20-bb2a-865fcf7c952a,782183fc-25c1-4ae3-b573-91c487ac3ab1,a52d3ce2-ab6e-47df-b249-3672fe55a188,e3ff1738-192d-4351-9b99-2b8ab9f0ea74'}, 'jsonrpc': '2.0', 'method': u'VM.snapshot', 'id': u'3e47b6d5-7fc2-4d33-a0ad-65433e9d9dae'} at 0x7fea7968bcd0> timeout=60, duration=60 at 0x7fea7968bad0> task#=197 at 0x7fea54077e10>, traceback:
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4617, in snapshot
  self._dom.snapshotCreateXML(snapxml, snapFlags)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
  ret = attr(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
  ret = f(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
  return func(inst, *args, **kwargs)
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 2617, in snapshotCreateXML
  ret = libvirtmod.virDomainSnapshotCreateXML(self._o, xmlDesc, flags) (executor:363)
===

Engine timeout after 3 minutes:
===
2019-03-11 04:32:19,790-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-2615) [56193ec3-f1c4-4e10-b844-674f2dbc31ff] EVENT_ID: USER_CREATE_SNAPSHOT(45), Snapshot 'snap1' creation for VM 'test_vm' was initiated by admin@internal-authz.
2019-03-11 04:35:27,063-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-98) [56193ec3-f1c4-4e10-b844-674f2dbc31ff] EVENT_ID: USER_CREATE_LIVE_SNAPSHOT_FINISHED_FAILURE(170), Failed to create live snapshot 'snap1' for VM 'test_vm'. VM restart is recommended. Note that using the created snapshot might cause data inconsistency.
===
The engine even deleted the newly created leaf LV as part of its revert workflow, although the snapshot operation was still running:
===
2019-03-11 04:35:29,305-04 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-33) [56193ec3-f1c4-4e10-b844-674f2dbc31ff] START, DestroyImageVDSCommand( DestroyImageVDSCommandParameters:{storagePoolId='5c3f0ee4-335d-11e9-ad0e-001a4a17016f', ignoreFailoverLimit='false', storageDomainId='f16946db-79e2-4bd8-8acd-855e7e4ab0f2', imageGroupId='fb61b343-4cdb-4f57-8d0b-32058e463de5', imageId='00000000-0000-0000-0000-000000000000', imageList='[72e83f93-4e9f-4b8b-bc87-82504facbfcd]', postZero='false', force='false'}), log id: 75a08f03
===

The snapshot eventually completed, and vdsm then logged an error about the unavailability of the volume:
===
2019-03-11 14:27:56,740+0530 INFO (jsonrpc/5) [virt.vm] (vmId='b1ade260-811a-42ea-8cca-f5a5dd67a1a5') Completed live snapshot (vm:4618)
2019-03-11 14:27:56,834+0530 ERROR (jsonrpc/5) [virt.vm] (vmId='b1ade260-811a-42ea-8cca-f5a5dd67a1a5') Unable to update drive 'sda' volume size: Unable to get volume size for domain f16946db-79e2-4bd8-8acd-855e7e4ab0f2 volume 72e83f93-4e9f-4b8b-bc87-82504facbfcd (vm:4657)
2019-03-11 14:27:56,835+0530 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call VM.snapshot succeeded in 1530.91 seconds (__init__:573)
===

Any subsequent snapshot operation then fails with a "volume doesn't exist" error.

Version-Release number of selected component (if applicable):
vdsm-4.20.43-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with 32 GB of memory.
2. Increase memory utilization inside the VM with the command below.
   stress --vm-bytes $(awk '/MemFree/{printf "%d\n", $2 * 0.9;}' < /proc/meminfo)k --vm-keep -m 1
3. Take a snapshot of the VM with memory.

Actual results:
A snapshot with memory volumes times out if the memory dump takes more than 180 seconds.

Expected results:
I think the snapshot operation with memory volumes should be monitored separately instead of relying on the default vdsTimeout.

Additional info:
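To make the blocking behavior from the traceback above concrete, here is a minimal sketch of the underlying libvirt call only; it is not the vdsm code path, and the connection URI, domain name, and file paths are illustrative assumptions. virDomainSnapshotCreateXML is a synchronous call, so a snapshot that includes a memory image returns only after the whole RAM dump has been written:

import time
import libvirt  # libvirt-python

# Hypothetical snapshot definition: external disk snapshot plus an external
# memory image. The file paths are placeholders, not RHV image paths.
SNAP_XML = """
<domainsnapshot>
  <name>snap1</name>
  <memory snapshot='external' file='/var/tmp/snap1.mem'/>
  <disks>
    <disk name='sda' snapshot='external'>
      <source file='/var/tmp/snap1-sda.qcow2'/>
    </disk>
  </disks>
</domainsnapshot>
"""

conn = libvirt.open('qemu:///system')      # assumes a local qemu/KVM host
dom = conn.lookupByName('test_vm')         # assumes a running domain named 'test_vm'

start = time.monotonic()
# Blocks until the disks are switched AND the full memory image is written out.
# With 32 GB of RAM and slow storage this can easily exceed the engine's
# 180-second vdsTimeout, while the jsonrpc worker stays blocked the whole time.
dom.snapshotCreateXML(SNAP_XML, 0)
print('snapshotCreateXML returned after %.0f seconds' % (time.monotonic() - start))

conn.close()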
I think memory volumes are less commonly used, since as far as I understand most users do not need the current running state of the VM; a disk-only snapshot would be sufficient for most of them. In addition, the VM gets paused during the memory snapshot, and the memory dump requires extra space on the storage. Nevertheless, the "Save Memory" option is checked by default when taking snapshots through the GUI. Do you think the "Save Memory" option should be disabled by default? Please let me know if I need to open a new BZ for this.
See https://bugzilla.redhat.com/show_bug.cgi?id=1749284. This will be resolved as part of moving snapshot operations to async in 4.4 GA.
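For reference, a hedged sketch of what the asynchronous flow looks like from a client's point of view, using the oVirt Python SDK (ovirtsdk4): the snapshot request returns immediately and the caller polls the snapshot status, rather than a single RPC being held open for the whole memory dump. The engine URL, credentials, and VM name are placeholders, and this is not the engine's internal implementation of the fix:

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=test_vm')[0]
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()

# Request a snapshot that includes the memory state; the call returns while the
# snapshot is still being created on the host.
snapshot = snapshots_service.add(
    types.Snapshot(description='snap1', persist_memorystate=True),
)

# Poll until the snapshot leaves the LOCKED state; no fixed 180-second RPC
# timeout applies to this loop.
snapshot_service = snapshots_service.snapshot_service(snapshot.id)
while True:
    snapshot = snapshot_service.get()
    if snapshot.snapshot_status == types.SnapshotStatus.OK:
        break
    time.sleep(10)

connection.close()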
Looks like a dupe of the async snapshot bug. Think your patch will resolve?
(In reply to Ryan Barry from comment #35) > Looks like a dupe of the async snapshot bug. Think your patch will resolve? My 4.4 sync snapshot should resolve it.
Verified according to the steps from the description on:
ovirt-engine-4.4.0-0.32.master.el8ev.noarch
vdsm-4.40.13-1.el8ev.x86_64

The snapshot with memory took around 5 minutes; no timeouts or errors were seen in the engine log.

Start snapshot:
2020-04-19 15:30:03,231+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-5173) [b2971597-a2d7-47bd-9eaa-4ddafebf0105] EVENT_ID: USER_CREATE_SNAPSHOT(45), Snapshot 'test' creation for VM 'test' was initiated by admin@internal-authz.

End snapshot:
2020-04-19 15:35:11,479+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-100) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_SUCCESS(68), Snapshot 'test' creation for VM 'test' has been completed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3247