Bug 1687345 - Snapshot with memory volumes can fail if the memory dump takes more than 180 seconds
Summary: Snapshot with memory volumes can fail if the memory dump takes more than 180 ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.8
Hardware: All
OS: Linux
urgent
urgent
Target Milestone: ovirt-4.4.0
: ---
Assignee: Liran Rotenberg
QA Contact: Ilan Zuckerman
URL:
Whiteboard:
Depends On: 1746730 1749284
Blocks: 1547768
TreeView+ depends on / blocked
 
Reported: 2019-03-11 10:44 UTC by nijin ashok
Modified: 2023-10-06 18:10 UTC (History)
14 users (show)

Fixed In Version: rhv-4.4.0-29
Doc Type: Enhancement
Doc Text:
Previously, RHV Manager created live virtual machine snapshots synchronously. If creating the snapshot exceeded the timeout period (default 180 seconds), the operation failed. These failures tended to happen with virtual machines that had large memory loads or clusters that had slow storage speeds. With this enhancement, the live snapshot operation is asynchronous and runs until it is complete, regardless of how long it takes.
Clone Of:
Environment:
Last Closed: 2020-08-04 13:16:55 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3997851 0 Troubleshoot None Snapshot command timeout in engine while taking memory snapshot 2019-03-20 05:43:22 UTC
Red Hat Product Errata RHSA-2020:3247 0 None None None 2020-08-04 13:19:06 UTC

Description nijin ashok 2019-03-11 10:44:24 UTC
Description of problem:

Currently, the engine assumes that the "snapshot" command issued to the vdsm will complete in 180 seconds. However,  for memory snapshots, it can take more time depending upon the RAM of the VM since an copying operation happens during this phase in addition to the swapping of the images. So the time it can take may be much more than 180 seconds depending on the VMs RAM size and the storage connectivity speed.

I was able to reproduce this in my test environment.

The snapshot command was issued to the VM.

===
2019-03-11 14:02:25,927+0530 INFO  (jsonrpc/5) [api.virt] START snapshot(snapDrives=[{u'baseVolumeID': u'f71cf012-e79f-4654-ac8e-cbf30bbd2ffd', u'domainID': u'f16946db-79e2-4bd8-8acd-855e7e4ab0f2', u'volumeID': u'72e83f93-4e9f-4b8b-bc87-82504facbfcd', u'imageID': u'fb61b343-4cdb-4f57-8d0b-32058e463de5'}], snapMemory=u'f16946db-79e2-4bd8-8acd-855e7e4ab0f2,5c3f0ee4-335d-11e9-ad0e-001a4a17016f,ca39f7c5-f9e7-4e20-bb2a-865fcf7c952a,782183fc-25c1-4ae3-b573-91c487ac3ab1,a52d3ce2-ab6e-47df-b249-3672fe55a188,e3ff1738-192d-4351-9b99-2b8ab9f0ea74', frozen=False) from=::ffff:10.74.130.110,37200, flow_id=56193ec3-f1c4-4e10-b844-674f2dbc31ff, vmId=b1ade260-811a-42ea-8cca-f5a5dd67a1a5 (api:46)
====

Worker blocked messages.

===
2019-03-11 14:03:26,000+0530 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 running <Task <JsonRpcTask {'params': {u'frozen': False, u'vmID': u'b1ade260-811a-42ea-8cca-f5a5dd67a1a5', u'snapDrives': [{u'baseVolumeID': u'f71cf012-e79f-4654-ac8e-cbf30bbd2ffd', u'domainID': u'f16946db-79e2-4bd8-8acd-855e7e4ab0f2', u'volumeID': u'72e83f93-4e9f-4b8b-bc87-82504facbfcd', u'imageID': u'fb61b343-4cdb-4f57-8d0b-32058e463de5'}], u'snapMemory': u'f16946db-79e2-4bd8-8acd-855e7e4ab0f2,5c3f0ee4-335d-11e9-ad0e-001a4a17016f,ca39f7c5-f9e7-4e20-bb2a-865fcf7c952a,782183fc-25c1-4ae3-b573-91c487ac3ab1,a52d3ce2-ab6e-47df-b249-3672fe55a188,e3ff1738-192d-4351-9b99-2b8ab9f0ea74'}, 'jsonrpc': '2.0', 'method': u'VM.snapshot', 'id': u'3e47b6d5-7fc2-4d33-a0ad-65433e9d9dae'} at 0x7fea7968bcd0> timeout=60, duration=60 at 0x7fea7968bad0> task#=197 at 0x7fea54077e10>, traceback:

File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4617, in snapshot
  self._dom.snapshotCreateXML(snapxml, snapFlags)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
  ret = attr(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
  ret = f(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
  return func(inst, *args, **kwargs)
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 2617, in snapshotCreateXML
  ret = libvirtmod.virDomainSnapshotCreateXML(self._o, xmlDesc, flags) (executor:363)
====

Engine timeout after 3 minutes.

===
2019-03-11 04:32:19,790-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-2615) [56193ec3-f1c4-4e10-b844-674f2dbc31ff] EVENT_ID: USER_CREATE_SNAPSHOT(45), Snapshot 'snap1' creation for VM 'test_vm' was initiated by admin@internal-authz.


2019-03-11 04:35:27,063-04 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-98) [56193ec3-f1c4-4e10-b844-674f2dbc31ff] EVENT_ID: USER_CREATE_LIVE_SNAPSHOT_FINISHED_FAILURE(170), Failed to create live snapshot 'snap1' for VM 'test_vm'. VM restart is recommended. Note that using the created snapshot might cause data inconsistency.
===

It even deleted the newly created leaf LV as part of revert workflow although the snapshot operation is still going on...

===
2019-03-11 04:35:29,305-04 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-33) [56193ec3-f1c4-4e10-b844-674f2dbc31ff] START, DestroyImageVDSCommand( DestroyImageVDSCommandParameters:{storagePoolId='5c3f0ee4-335d-11e9-ad0e-001a4a17016f', ignoreFailoverLimit='false', storageDomainId='f16946db-79e2-4bd8-8acd-855e7e4ab0f2', imageGroupId='fb61b343-4cdb-4f57-8d0b-32058e463de5', imageId='00000000-0000-0000-0000-000000000000', imageList='[72e83f93-4e9f-4b8b-bc87-82504facbfcd]', postZero='false', force='false'}), log id: 75a08f03
===

The snapshot got completed and shows error about the unavailability of the volume.
   
===
2019-03-11 14:27:56,740+0530 INFO  (jsonrpc/5) [virt.vm] (vmId='b1ade260-811a-42ea-8cca-f5a5dd67a1a5') Completed live snapshot (vm:4618)

2019-03-11 14:27:56,834+0530 ERROR (jsonrpc/5) [virt.vm] (vmId='b1ade260-811a-42ea-8cca-f5a5dd67a1a5') Unable to update drive 'sda' volume size: Unable to get volume size for domain f16946db-79e2-4bd8-8acd-855e7e4ab0f2 volume 72e83f93-4e9f-4b8b-bc87-82504facbfcd (vm:4657)

2019-03-11 14:27:56,835+0530 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call VM.snapshot succeeded in 1530.91 seconds (__init__:573)
===

Subsequent snapshot operation will fail with an error "volume doesn't exist".


Version-Release number of selected component (if applicable):

vdsm-4.20.43-1.el7ev.x86_64

How reproducible:

100%

Steps to Reproduce:

1. Created a VM with 32 GB memory.
2. Increased memory utilization with the VM using command below. 

stress --vm-bytes $(awk '/MemFree/{printf "%d\n", $2 * 0.9;}' < /proc/meminfo)k --vm-keep -m 1

3. Took the VM snapshot with memory.

Actual results:

Snapshot with memory volumes can timeout if the memory dump takes more than 180 seconds.

Expected results:

I think the snapshot operation with memory volumes should be monitored separately instead of default vdsTimeout.

Additional info:

Comment 1 nijin ashok 2019-03-12 08:50:22 UTC
I think the memory volumes are less used by the users since as per my understanding most of the user doesn't require the current running state of the VM. The disk only snapshot would be sufficient for most of the users. Also, the VM will also get paused during the memory snapshot and it requires more space in the storage. However, by default, the "Save Memory" option is checked while taking the snapshots using GUI.

Do you think "Save Memory" option should be disabled by default? Please let me know if I need to open a new BZ for this.

Comment 21 Daniel Gur 2019-08-28 13:12:40 UTC
sync2jira

Comment 22 Daniel Gur 2019-08-28 13:16:53 UTC
sync2jira

Comment 34 Ryan Barry 2020-02-28 14:13:18 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=1749284

This will be resolved as part of moving snapshot operations to async in 4.4 GA

Comment 35 Ryan Barry 2020-03-09 22:13:04 UTC
Looks like a dupe of the async snapshot bug. Think your patch will resolve?

Comment 36 Liran Rotenberg 2020-03-10 08:40:43 UTC
(In reply to Ryan Barry from comment #35)
> Looks like a dupe of the async snapshot bug. Think your patch will resolve?

My 4.4 sync snapshot should resolve it.

Comment 38 Ilan Zuckerman 2020-04-19 12:48:57 UTC
Verified according steps from description on:

ovirt-engine-4.4.0-0.32.master.el8ev.noarch
vdsm-4.40.13-1.el8ev.x86_64

Snapshot with memory took around 5 minutes. no timeouts or errors seen in the engine log.

Start snapshot:

2020-04-19 15:30:03,231+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-
Thread-5173) [b2971597-a2d7-47bd-9eaa-4ddafebf0105] EVENT_ID: USER_CREATE_SNAPSHOT(45), Snapshot 'test' creation for VM 'test' was initi
ated by admin@internal-authz.



End snapshot:

2020-04-19 15:35:11,479+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-100) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_SUCCESS(68), Snapshot 'test' creation for VM 'test' has been completed.

Comment 45 errata-xmlrpc 2020-08-04 13:16:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247


Note You need to log in before you can comment on or make changes to this bug.