Bug 1348405 - RHEV: limit number of images in an image chain (snapshots)
Summary: RHEV: limit number of images in an image chain (snapshots)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.6
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ovirt-4.1.0-alpha
Target Release: ---
Assignee: Maor
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-06-21 06:32 UTC by Germano Veit Michel
Modified: 2019-11-14 08:27 UTC
14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
This update limits the number of snapshots per disk in the Manager. Because of the maximum path length in QEMU, snapshot operations start failing at around the 98th image in a chain. This update introduces a new configuration value, MaxImagesInChain, with a default limit of 90. The limit covers the entire image chain: the active volume plus the snapshot volumes. For example, if a disk has 89 snapshots, the next snapshot creation will be blocked by a validation. Other operations that use snapshots, such as live disk migration or running a stateless virtual machine, are subject to the same limitation to avoid failures.
Clone Of:
Environment:
Last Closed: 2017-04-25 00:42:04 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2387061 0 None None None 2016-06-21 06:53:10 UTC
Red Hat Knowledge Base (Solution) 3123531 0 None None None 2017-07-24 05:39:09 UTC
Red Hat Product Errata RHEA-2017:0997 0 normal SHIPPED_LIVE Red Hat Virtualization Manager (ovirt-engine) 4.1 GA 2017-04-18 20:11:26 UTC
oVirt gerrit 62641 0 'None' MERGED core: Adding maximum limitation for image chain. 2021-02-11 17:47:02 UTC
oVirt gerrit 62642 0 'None' MERGED core: Validate maximum number of volumes for VM snapshot. 2021-02-11 17:47:02 UTC
oVirt gerrit 62643 0 'None' MERGED core: Validate maximum number of volumes for stateless VM snapshot. 2021-02-11 17:47:02 UTC
oVirt gerrit 62644 0 'None' MERGED core: Validate maximum number of volumes for live migrate. 2021-02-11 17:47:02 UTC
oVirt gerrit 62946 0 'None' MERGED core: Avoid using static for config value. 2021-02-11 17:47:02 UTC

Description Germano Veit Michel 2016-06-21 06:32:10 UTC
Description of problem:

Possibly due to the maximum path length, snapshot operations start to fail around the 98th image in the chain. Currently this is an expected failure, as the path to the underlying image grows as qemu walks the chain.

When that point is reached, snapshots fail and the VM won't migrate or restart.

It would be good to simply limit the number of snapshots in the Engine (or do something more sophisticated and evaluate the number of images in a chain, as some snapshots may not contain all disks).

VDSM:

d0d32c54-8441-4943-866e-f96d7f7e4dab::DEBUG::2016-06-21 15:57:57,871::utils::671::root::(execCmd) /usr/bin/taskset --cpu-list 0-3 /usr/bin/qemu-img create -f qcow2 -o compat=0.10 -b ../929ad925-0067-4140-974f-021ac619c335/6e876512-b9ac-4576-b70b-d9835a6fbe1a -F qcow2 /rhev/data-center/00000001-0001-0001-0001-000000000019/09f005c0-06c0-4798-a246-0c0414d8fcbd/images/929ad925-0067-4140-974f-021ac619c335/4d14c6be-aec0-40e4-851d-c580417af0c1 (cwd /rhev/data-center/00000001-0001-0001-0001-000000000019/09f005c0-06c0-4798-a246-0c0414d8fcbd/images/929ad925-0067-4140-974f-021ac619c335)

.......

d0d32c54-8441-4943-866e-f96d7f7e4dab::DEBUG::2016-06-21 15:57:58,256::utils::689::root::(execCmd) FAILED: <err> = 'qemu-img: /rhev/data-center/00000001-0001-0001-0001-000000000019/09f005c0-06c0-4798-a246-0c0414d8fcbd/images/929ad925-0067-4140-974f-021ac619c335/4d14c6be-aec0-40e4-851d-c580417af0c1: Could not open backing file: [message repeated for every image in the chain] Could not refresh total sector count: Invalid argument\n'; <rc> = 1

Engine:

2016-06-21 01:58:07,791 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-35) [16a2d386] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM RHEV-H10 command failed: Error creating a new volume

2016-06-21 01:58:07,791 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-35) [16a2d386] SPMAsyncTask::PollTask: Polling task 'd0d32c54-8441-4943-866e-f96d7f7e4dab' (Parent Command 'CreateAllSnapshotsFromVm', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'cleanSuccess'.

2016-06-21 01:58:07,799 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-35) [16a2d386] BaseAsyncTask::logEndTaskFailure: Task 'd0d32c54-8441-4943-866e-f96d7f7e4dab' (Parent Command 'CreateAllSnapshotsFromVm', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Error creating a new volume, code = 205',
-- Exception: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Error creating a new volume, code = 205'

2016-06-21 01:58:08,024 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-2) [] Correlation ID: 6caf62c9, Job ID: a868b8af-f05d-4bc9-9172-425deff6dafc, Call Stack: null, Custom Event ID: -1, Message: Failed to complete snapshot 'snapshot98' creation for VM 'SnapshotLimitVM'.

Then, of course, the VM won't start again if shut down:

Thread-44920::ERROR::2016-06-21 16:08:01,011::vm::759::virt.vm::(_startUnderlyingVm) vmId=`021e2161-eb6e-4788-92c2-932bb4f39bc1`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 703, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 1941, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 1313, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3611, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: File name too long
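
For illustration only, the way relative backing-file references inflate the resolved path can be modeled in pure Python. This is a sketch, not qemu internals: it assumes each image refers to its parent as `../<image_group>/<volume>` (as in the qemu-img command above) and that the `..` components are never normalized away. The paths and UUIDs are made up.

```python
import os
import uuid

def backing_path(depth):
    """Build the kind of path qemu ends up with if every image in the
    chain refers to its parent as '../<image_group>/<volume>' and the
    '..' components are never collapsed."""
    base = ("/rhev/data-center/00000001-0001-0001-0001-000000000019"
            "/09f005c0-06c0-4798-a246-0c0414d8fcbd/images")
    path = "%s/%s/%s" % (base, uuid.uuid4(), uuid.uuid4())
    for _ in range(depth):
        # Resolve the parent's relative reference against the child's
        # directory, keeping the '..' component in place.
        path = os.path.join(os.path.dirname(path),
                            "../%s/%s" % (uuid.uuid4(), uuid.uuid4()))
    return path

# Each extra image drops the 37-byte final component and appends a
# 77-byte '/../<uuid>/<uuid>' suffix, a net +40 bytes per image; under
# this model the path crosses Linux's PATH_MAX (4096 bytes) just past
# depth 98, consistent with failures around the 98th image.
for depth in (1, 50, 98, 99):
    print(depth, len(backing_path(depth)))
```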


Version-Release number of selected component (if applicable):
vdsm-4.17.28-0.el7ev.noarch
qemu-img-rhev-2.3.0-31.el7_2.13.x86_64
rhevm-3.6.6.2-0.1.el6.noarch

How reproducible:
100%

Steps to Reproduce:

    # Reproducer using the oVirt Python SDK (v3); assumes 'v' is the
    # target VM object, obtained e.g. via:
    #   from ovirtsdk.api import API
    #   api = API(url='https://<engine>/api', username='...', password='...')
    #   v = api.vms.get(name='SnapshotLimitVM')
    import time
    from ovirtsdk.xml import params

    for s in range(0, 150):
        snap = "snapshot" + str(s)
        print "Creating snapshot %s" % (snap)
        try:
            v.snapshots.add(params.Snapshot(description=snap, vm=v))
            time.sleep(30)
        except Exception as e:
            print 'Failed:\n%s' % str(e)

Actual results:
Snapshot fails, chain is broken, VM will not start again if shutdown.

Expected results:
Either the snapshot succeeds, or the engine tells the user that no more snapshots can be created. No failures, no broken chains, no VMs that will not start.

Additional info:

I am afraid BZ 1333627 will change this limit.
Also see BZ 1082630.

In summary, what is requested here is to test the maximum number of images a chain may contain and enforce that limit in the engine, so that failures are not allowed. As this number may change as the product is developed, keep the limit up to date until there is no limit at all (though it may still be a good idea to keep some limit, like the 350-400 LVs a storage domain should contain, as per our scale recommendations).

Comment 1 Germano Veit Michel 2016-06-21 07:10:14 UTC
One correction: the chain is not broken, it's fine.

Comment 2 Allon Mureinik 2016-06-22 08:29:50 UTC
This makes sense to me, I'm really surprised customers actually hit this limit.

However, we need to think how to treat pre-existing snapshots that pass this limit.

Comment 3 Germano Veit Michel 2016-06-24 00:00:32 UTC
Well, some customers use the API or external tools to "back up" VMs via the snapshot functions. It's quite easy to reach 100. I've seen many VMs with 40, 50, 60, 70 images, mostly hitting other bugs, but it's probably just a matter of time until more customers reach higher numbers.

Anyway, the behaviour of pre-existing snapshot chains that pass this limit is quite clear.

There is a number: LIMIT. At that point everything works: start VM, delete snapshot, create snapshot. All good.

So a chain is at LIMIT, and then a new snapshot is (successfully) created: the chain reaches LIMIT+1. At that point nothing works, including creating new snapshots, so as far as I know there is no way to reach LIMIT+2; the pre-existing "big chains" are therefore quite predictable. The only operation that works is deleting a snapshot, which brings the chain back to LIMIT, and apparently all is good again (waiting for the customer to confirm as well; on our side it looks ok).
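
The LIMIT/LIMIT+1 behaviour described above can be sketched as a tiny model. This is an illustration only: the operation names and the LIMIT value of 97 are assumptions (failures were observed around the 98th image), not engine code.

```python
LIMIT = 97  # hypothetical maximum working chain length (~98th image fails)

def allowed_ops(chain_len):
    """Operations that still succeed at a given chain length, per the
    behaviour described above: everything works up to LIMIT, and at
    LIMIT+1 only deleting a snapshot still works."""
    if chain_len <= LIMIT:
        return {"start_vm", "create_snapshot", "delete_snapshot"}
    if chain_len == LIMIT + 1:
        # Deleting brings the chain back to LIMIT, where all is well again.
        return {"delete_snapshot"}
    return set()  # unreachable: nothing can grow the chain past LIMIT+1

print(allowed_ops(LIMIT))
print(allowed_ops(LIMIT + 1))
```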

I would say we are likely to get customer cases for VMs that pass this limit. The knowledge base article is already on the way (waiting for the customer to confirm all is good after the delete). Perhaps the engine could just throw a warning saying the chain is too big, and we can copy the message into our KCS article so customers can find it easily.

Comment 11 Kevin Alon Goldblatt 2017-01-30 12:17:12 UTC
Verified with the following code:
-------------------------------------------
ovirt-engine-4.1.1-0.0.master.20170126161333.git83fd7e0.el7.centos.noarch
vdsm-4.19.3-1.gitdfa4d67.el7.centos.x86_64

Verified with the following scenario:
------------------------------------------
1. Create a VM with disk
2. Set the maximum number of images in a chain (MaxImagesInChain) to 7
3. Create 7 snapshots - the following error is displayed:

Error while executing action:

vm4:

    Cannot create Snapshot. The following disk(s) have exceeded the number of volumes in an image chain:

    vm4_Disk1 (7/7)

    Please remove some of the disk(s)'s snapshots and try again.

Moving to VERIFIED!
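
For illustration, the validation that produces the message above can be sketched in Python. The function and parameter names here are hypothetical; the real check is implemented in ovirt-engine's Java code (see the gerrit patches linked above).

```python
def validate_snapshot_creation(disks, max_images_in_chain=7):
    """Return the validation error message for disks whose chain
    (active volume + snapshot volumes) has already reached the limit,
    or None if the snapshot may be created.

    'disks' is a list of (disk_name, current_volume_count) tuples."""
    offending = ["%s (%d/%d)" % (name, volumes, max_images_in_chain)
                 for name, volumes in disks
                 if volumes >= max_images_in_chain]
    if not offending:
        return None
    return ("Cannot create Snapshot. The following disk(s) have exceeded "
            "the number of volumes in an image chain:\n  "
            + "\n  ".join(offending)
            + "\nPlease remove some of the disk(s)'s snapshots and try again.")

print(validate_snapshot_creation([("vm4_Disk1", 7)]))
```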

Comment 12 Allon Mureinik 2017-04-13 03:42:16 UTC
Byron - CDA stands for "Can Do Action", named after a method in the code that was renamed in RHV 4.0.
Can we please change the usage of "CDA" in the doctext to "validation" or something similar?
Thanks!

