Bug 1348405

Summary: RHEV: limit number of images in an image chain (snapshots)
Product: Red Hat Enterprise Virtualization Manager Reporter: Germano Veit Michel <gveitmic>
Component: ovirt-engine    Assignee: Maor <mlipchuk>
Status: CLOSED ERRATA QA Contact: Kevin Alon Goldblatt <kgoldbla>
Severity: high Docs Contact:
Priority: medium    
Version: 3.6.6    CC: acanan, amureini, bgraveno, lsurette, mgoldboi, mkalinin, mlipchuk, ratamir, rbalakri, Rhev-m-bugs, srevivo, tnisan, ykaul, ylavi
Target Milestone: ovirt-4.1.0-alpha   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
This update limits the number of snapshots per disk in the Manager. Because of the maximum path length in QEMU, snapshot operations start failing at around the 98th image in a chain. The update introduces a new configuration value, MaxImagesInChain, with a default limit of 90. The limit covers the entire image chain, that is, the active volume plus the snapshot images. For example, if a disk has 89 snapshots, the next snapshot creation is blocked by a validation. Other related operations that use snapshots, such as live disk migration or running a stateless virtual machine, are subject to the same limitation to avoid failures.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-04-25 00:42:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Germano Veit Michel 2016-06-21 06:32:10 UTC
Description of problem:

Possibly due to the maximum path length, snapshot operations start to fail around the 98th image in the chain. Currently this is an expected failure, as the actual link to the underlying image grows as the chain is walked.

When that point is reached, snapshots fail and the VM won't migrate or restart.

It would be good to simply limit the number of snapshots in the Engine (or do something more sophisticated and evaluate the number of images in each chain, as some snapshots may not contain all disks).
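As an illustration only, a minimal sketch of how the number of images in a chain can be counted on a host, assuming the installed qemu-img supports --backing-chain with JSON output (the qemu-img-rhev build listed below should); the volume path is a placeholder:

    import json
    import subprocess

    def chain_length(top_volume_path):
        # Ask qemu-img for the whole backing chain in JSON form;
        # the output is an array with one entry per image in the chain.
        out = subprocess.check_output(
            ["qemu-img", "info", "--backing-chain", "--output=json",
             top_volume_path])
        return len(json.loads(out))

    # Placeholder path; on a host the active volume lives under
    # /rhev/data-center/<sp_id>/<sd_id>/images/<img_id>/<vol_id>
    print(chain_length("/path/to/active/volume"))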

VDSM:

d0d32c54-8441-4943-866e-f96d7f7e4dab::DEBUG::2016-06-21 15:57:57,871::utils::671::root::(execCmd) /usr/bin/taskset --cpu-list 0-3 /usr/bin/qemu-img create -f qcow2 -o compat=0.10 -b ../929ad925-0067-4140-974f-021ac619c335/6e876512-b9ac-4576-b70b-d9835a6fbe1a -F qcow2 /rhev/data-center/00000001-0001-0001-0001-000000000019/09f005c0-06c0-4798-a246-0c0414d8fcbd/images/929ad925-0067-4140-974f-021ac619c335/4d14c6be-aec0-40e4-851d-c580417af0c1 (cwd /rhev/data-center/00000001-0001-0001-0001-000000000019/09f005c0-06c0-4798-a246-0c0414d8fcbd/images/929ad925-0067-4140-974f-021ac619c335)

.......

d0d32c54-8441-4943-866e-f96d7f7e4dab::DEBUG::2016-06-21 15:57:58,256::utils::689::root::(execCmd) FAILED: <err> = 'qemu-img: /rhev/data-center/00000001-0001-0001-0001-000000000019/09f005c0-06c0-4798-a246-0c0414d8fcbd/images/929ad925-0067-4140-974f-021ac619c335/4d14c6be-aec0-40e4-851d-c580417af0c1: Could not open backing file: Could not open backing file: ["Could not open backing file:" repeated for each backing image in the chain] Could not refresh total sector count: Invalid argument\n'; <rc> = 1

Engine:

2016-06-21 01:58:07,791 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-35) [16a2d386] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM RHEV-H10 command failed: Error creating a new volume

2016-06-21 01:58:07,791 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-35) [16a2d386] SPMAsyncTask::PollTask: Polling task 'd0d32c54-8441-4943-866e-f96d7f7e4dab' (Parent Command 'CreateAllSnapshotsFromVm', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'cleanSuccess'.

2016-06-21 01:58:07,799 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-35) [16a2d386] BaseAsyncTask::logEndTaskFailure: Task 'd0d32c54-8441-4943-866e-f96d7f7e4dab' (Parent Command 'CreateAllSnapshotsFromVm', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Error creating a new volume, code = 205',
-- Exception: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Error creating a new volume, code = 205'

2016-06-21 01:58:08,024 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-2) [] Correlation ID: 6caf62c9, Job ID: a868b8af-f05d-4bc9-9172-425deff6dafc, Call Stack: null, Custom Event ID: -1, Message: Failed to complete snapshot 'snapshot98' creation for VM 'SnapshotLimitVM'.

Then, of course, the VM won't start again if it is shut down:

Thread-44920::ERROR::2016-06-21 16:08:01,011::vm::759::virt.vm::(_startUnderlyingVm) vmId=`021e2161-eb6e-4788-92c2-932bb4f39bc1`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 703, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 1941, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 1313, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3611, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: File name too long


Version-Release number of selected component (if applicable):
vdsm-4.17.28-0.el7ev.noarch
qemu-img-rhev-2.3.0-31.el7_2.13.x86_64
rhevm-3.6.6.2-0.1.el6.noarch

How reproducible:
100%

Steps to Reproduce:

    import time
    from ovirtsdk.api import API    # oVirt Python SDK v3, matching this environment
    from ovirtsdk.xml import params

    # Connection details are placeholders; the VM name matches the logs above.
    api = API(url='https://engine.example.com/api', username='admin@internal', password='password', insecure=True)
    v = api.vms.get(name='SnapshotLimitVM')

    for s in range(0, 150):
        snap = "snapshot" + str(s)
        print "Creating snapshot %s" % (snap)
        try:
            v.snapshots.add(params.Snapshot(description=snap, vm=v))
            time.sleep(30)
        except Exception as e:
            print 'Failed:\n%s' % str(e)

Actual results:
Snapshot creation fails, the chain is broken, and the VM will not start again if shut down.

Expected results:
The snapshot succeeds, or the engine tells the user that no more snapshots can be created. No failures, no broken chains, no VMs that fail to start.

Additional info:

I am afraid this BZ will change this limit: BZ 1333627
Also see this: BZ 1082630

In summary, what is requested here is to determine the maximum number of images a chain may contain and enforce that limit in the engine, so that these failures cannot happen. As this number may change as the product is developed, keep it up to date until there is no longer a limit (though it may still be a good idea to have some limit, like the 350-400 LVs a SD should contain, as per our scale recommendations).
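Purely as an illustration of that kind of test, a throwaway sketch that probes how deep a qcow2 chain can grow before qemu-img refuses to create another overlay, assuming a scratch directory with a pre-created base.qcow2 (file names are hypothetical, and the exact failure point depends on path lengths, so this only approximates the environment above):

    import subprocess

    # Mirrors the VDSM command from the log: each overlay uses the
    # previous image as its backing file.
    backing = "base.qcow2"
    for i in range(1, 200):
        overlay = "overlay%03d.qcow2" % i
        rc = subprocess.call(
            ["qemu-img", "create", "-f", "qcow2", "-o", "compat=0.10",
             "-b", backing, "-F", "qcow2", overlay])
        if rc != 0:
            print("qemu-img gave up at image %d in the chain" % i)
            break
        backing = overlay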

Comment 1 Germano Veit Michel 2016-06-21 07:10:14 UTC
One correction: the chain is not broken, it's fine.

Comment 2 Allon Mureinik 2016-06-22 08:29:50 UTC
This makes sense to me; I'm really surprised customers actually hit this limit.

However, we need to think about how to treat pre-existing snapshot chains that exceed this limit.

Comment 3 Germano Veit Michel 2016-06-24 00:00:32 UTC
Well, some customers use the API or an external tool to "back up" VMs using the snapshot functions. It's quite easy to reach 100. I've seen many VMs with 40, 50, 60 or 70 images, mostly hitting other bugs, but it's probably just a matter of time until more customers reach higher numbers.

Anyway, the situation for pre-existing chains that exceed this limit is quite clear.

There is a number, call it LIMIT. At LIMIT everything works: starting the VM, deleting a snapshot, creating a snapshot. All good.

So a chain at LIMIT creates a new snapshot (successfully) and reaches LIMIT+1. At that number nothing works, including creating new snapshots, so as far as I know there is no way to reach LIMIT+2; the pre-existing "big chains" are therefore quite predictable. The only thing that still works is deleting a snapshot, which brings the chain back to LIMIT, and apparently all is good again (waiting for the customer to confirm as well; on our side it looks OK).

I would say we are likely to get customer cases for VMs that exceed this limit. The knowledge base article is already on the way (waiting for the customer to confirm all is good after the delete). Perhaps the engine could just throw a warning saying the chain is too big, and we can copy the message into our KCS article so customers can find it easily.

Comment 11 Kevin Alon Goldblatt 2017-01-30 12:17:12 UTC
Verified with the following code:
-------------------------------------------
ovirt-engine-4.1.1-0.0.master.20170126161333.git83fd7e0.el7.centos.noarch
vdsm-4.19.3-1.gitdfa4d67.el7.centos.x86_64

Verified with the following scenario:
------------------------------------------
1. Create a VM with a disk
2. Set the maximum number of images in a chain to 7
3. Create 7 snapshots - the following error is displayed:

Error while executing action:

vm4:

    Cannot create Snapshot. The following disk(s) have exceeded the number of volumes in an image chain:

    vm4_Disk1 (7/7)

    Please remove some of the disk(s)'s snapshots and try again.
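
As a rough illustration of the arithmetic behind this message (the names below are illustrative, not the engine's actual code): the chain count includes the active volume plus one image per snapshot, so a disk whose chain is already at the configured limit is refused another snapshot.

    MAX_IMAGES_IN_CHAIN = 7  # the value used in this test; the shipped default is 90

    def snapshot_allowed(images_in_chain, limit=MAX_IMAGES_IN_CHAIN):
        # A new snapshot adds one image to the chain, so it is allowed
        # only while there is still room for one more image.
        return images_in_chain + 1 <= limit

    # vm4_Disk1 reports 7/7, i.e. the chain is already at the limit,
    # so the request is rejected. With the default of 90, a disk with
    # 89 snapshots (90 images including the active volume) is likewise
    # blocked from creating the next snapshot.
    print(snapshot_allowed(7))   # False
    print(snapshot_allowed(6))   # True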

Moving to VERIFIED!

Comment 12 Allon Mureinik 2017-04-13 03:42:16 UTC
Byron - CDA stands for "Can Do Action", named after a method in the code that was renamed in RHV 4.0.
Can we please change the usage of "CDA" in the doctext to "validation" or something similar?
Thanks!