Description of problem:

Possibly due to the maximum path length, snapshot operations start to fail around the 98th image in the chain. Currently this is an expected failure, as the actual link to the underlying image grows as the chain is walked. When that point is reached, snapshots fail and the VM won't migrate or restart. It would be good just to limit the number of snapshots in the Engine (or do something more sophisticated and evaluate the number of images in a chain, as some snapshots may not contain all disks).

VDSM:

d0d32c54-8441-4943-866e-f96d7f7e4dab::DEBUG::2016-06-21 15:57:57,871::utils::671::root::(execCmd) /usr/bin/taskset --cpu-list 0-3 /usr/bin/qemu-img create -f qcow2 -o compat=0.10 -b ../929ad925-0067-4140-974f-021ac619c335/6e876512-b9ac-4576-b70b-d9835a6fbe1a -F qcow2 /rhev/data-center/00000001-0001-0001-0001-000000000019/09f005c0-06c0-4798-a246-0c0414d8fcbd/images/929ad925-0067-4140-974f-021ac619c335/4d14c6be-aec0-40e4-851d-c580417af0c1 (cwd /rhev/data-center/00000001-0001-0001-0001-000000000019/09f005c0-06c0-4798-a246-0c0414d8fcbd/images/929ad925-0067-4140-974f-021ac619c335)
.......
d0d32c54-8441-4943-866e-f96d7f7e4dab::DEBUG::2016-06-21 15:57:58,256::utils::689::root::(execCmd) FAILED: <err> = 'qemu-img: /rhev/data-center/00000001-0001-0001-0001-000000000019/09f005c0-06c0-4798-a246-0c0414d8fcbd/images/929ad925-0067-4140-974f-021ac619c335/4d14c6be-aec0-40e4-851d-c580417af0c1: Could not open backing file: Could not open backing file: ["Could not open backing file:" repeated once per image in the chain] Could not refresh total sector count: Invalid argument \n'; <rc> = 1

Engine:

2016-06-21 01:58:07,791 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-35) [16a2d386] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM RHEV-H10 command failed: Error creating a new volume
2016-06-21 01:58:07,791 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-35) [16a2d386] SPMAsyncTask::PollTask: Polling task 'd0d32c54-8441-4943-866e-f96d7f7e4dab' (Parent Command 'CreateAllSnapshotsFromVm', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'cleanSuccess'.
2016-06-21 01:58:07,799 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-35) [16a2d386] BaseAsyncTask::logEndTaskFailure: Task 'd0d32c54-8441-4943-866e-f96d7f7e4dab' (Parent Command 'CreateAllSnapshotsFromVm', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Error creating a new volume, code = 205',
-- Exception: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Error creating a new volume, code = 205'
2016-06-21 01:58:08,024 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-2) [] Correlation ID: 6caf62c9, Job ID: a868b8af-f05d-4bc9-9172-425deff6dafc, Call Stack: null, Custom Event ID: -1, Message: Failed to complete snapshot 'snapshot98' creation for VM 'SnapshotLimitVM'.
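Purely as an illustration of the point in the description above (the reference to the underlying image grows as the chain is walked), here is a toy Python calculation. It is not a model of qemu-img or libvirt internals, whose own limits are what actually trip around the 98th image; the UUIDs and the 4096-byte PATH_MAX comparison are assumptions used only to show the growth.

import os
import uuid

# Toy illustration only: how an unresolved relative backing reference grows per hop.
# Every volume points at its parent as "../<image-group>/<parent-volume>".
image_group = str(uuid.uuid4())
path = "/rhev/data-center/<sp-uuid>/<sd-uuid>/images/%s/%s" % (image_group, uuid.uuid4())

for depth in range(1, 150):
    # each hop appends another "../<image-group>/<parent-volume>" segment (~77 characters)
    path = "%s/../%s/%s" % (os.path.dirname(path), image_group, uuid.uuid4())
    if len(path) > 4096:  # PATH_MAX on typical Linux systems
        print("unresolved reference passes 4096 characters after %d hops" % depth)
        break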
Then, of course, the VM won't start again if it is shut down:

Thread-44920::ERROR::2016-06-21 16:08:01,011::vm::759::virt.vm::(_startUnderlyingVm) vmId=`021e2161-eb6e-4788-92c2-932bb4f39bc1`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 703, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 1941, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 1313, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3611, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: File name too long

Version-Release number of selected component (if applicable):
vdsm-4.17.28-0.el7ev.noarch
qemu-img-rhev-2.3.0-31.el7_2.13.x86_64
rhevm-3.6.6.2-0.1.el6.noarch

How reproducible:
100%

Steps to Reproduce:
(oVirt Python SDK v3; "v" is the VM object obtained beforehand, e.g. v = api.vms.get(name='SnapshotLimitVM'))

import time
from ovirtsdk.xml import params

for s in range(0, 150):
    snap = "snapshot" + str(s)
    print "Creating snapshot %s" % (snap)
    try:
        v.snapshots.add(params.Snapshot(description=snap, vm=v))
        time.sleep(30)
    except Exception as e:
        print 'Failed:\n%s' % str(e)

Actual results:
Snapshot creation fails, the chain is broken, and the VM will not start again if shut down.

Expected results:
The snapshot succeeds, or the engine tells the user that no more snapshots can be created. No failures, and no broken chains or VMs that won't start.

Additional info:
I am afraid this BZ will change this limit: BZ 1333627
Also see this: BZ 1082630

In summary, what is requested here is to test the maximum number of images a chain may contain and limit it to that number in the engine, not allowing failures. As this number may change as the product is developed, keep it up to date until there is no more limit (is it still a good idea to have some limit anyway, like the 350-400 LVs an SD should contain, as per our scale recommendations?).
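To make the request in "Additional info" concrete, below is a minimal sketch of the kind of validation being asked for. It is written in Python only for illustration (the real check would live in the Engine); MAX_IMAGES_IN_CHAIN, the function name, and the disk_chains structure are all hypothetical, and the safe value of the limit would have to be established by testing, as noted above.

# Hypothetical sketch of the requested Engine-side validation -- not actual Engine code.
MAX_IMAGES_IN_CHAIN = 97  # made-up, configurable limit; the real value must come from testing

def validate_create_snapshot(disk_chains):
    # disk_chains: dict mapping disk alias -> number of volumes currently in its image chain
    offenders = ["%s (%d/%d)" % (alias, count, MAX_IMAGES_IN_CHAIN)
                 for alias, count in sorted(disk_chains.items())
                 if count + 1 > MAX_IMAGES_IN_CHAIN]
    if offenders:
        return False, ("Cannot create Snapshot. The following disk(s) have exceeded the number "
                       "of volumes in an image chain: " + ", ".join(offenders))
    return True, ""

# Example: a disk already at the limit is rejected before any VDSM task is started.
print(validate_create_snapshot({"SnapshotLimitVM_Disk1": 97})[1])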
One correction: the chain is not broken, it's fine.
This makes sense to me; I'm really surprised customers actually hit this limit. However, we need to think about how to treat pre-existing snapshots that already exceed it.
Well, some customers use the API or an external tool to "back up" VMs by using the snapshot functions. It's quite easy to get to 100. I've seen many VMs with 40, 50, 60, 70 images, mostly hitting other bugs, but it's probably just a matter of time until more customers reach higher numbers.

Anyway, the situation for pre-existing snapshots that pass this limit is quite clear. There is a number: LIMIT. At LIMIT, everything still works: start VM, delete snapshot, create snapshot. All good. Then the chain at LIMIT gets one more (successful) snapshot and reaches LIMIT+1. At that number nothing works, including creating new snapshots, so as far as I know there is no way to reach LIMIT+2; the pre-existing "big chains" are therefore quite predictable. The only thing that still works is deleting a snapshot, which brings the chain back to LIMIT, and apparently all is good again (waiting for the customer to confirm as well; on our side it looks OK).

I would say we are likely to get customer cases for VMs that pass this limit. The knowledge base article is already on the way (waiting for the customer to confirm all is good after the delete). Perhaps the engine could just throw a warning saying the chain is too big, and we copy the message to our KCS so customers can find it easily.
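Until the engine does enforce a limit (or at least warn), a backup script driving the v3 Python SDK, like the reproducer in the description, could protect itself with a pre-check along these lines. This is only a sketch: SAFE_SNAPSHOT_LIMIT is a made-up threshold, and counting snapshots is a simplification of counting the volumes in each disk's chain.

from ovirtsdk.xml import params

SAFE_SNAPSHOT_LIMIT = 90  # hypothetical threshold, kept below the point where operations start to fail

def create_snapshot_guarded(api, vm_name, description):
    # Refuse to grow the chain past the assumed safe limit instead of letting qemu-img fail later.
    vm = api.vms.get(name=vm_name)
    existing = vm.snapshots.list()
    if len(existing) >= SAFE_SNAPSHOT_LIMIT:
        raise RuntimeError("Refusing snapshot '%s': VM '%s' already has %d snapshots"
                           % (description, vm_name, len(existing)))
    return vm.snapshots.add(params.Snapshot(description=description, vm=vm))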
Verified with the following code:
-------------------------------------------
ovirt-engine-4.1.1-0.0.master.20170126161333.git83fd7e0.el7.centos.noarch
vdsm-4.19.3-1.gitdfa4d67.el7.centos.x86_64

Verified with the following scenario:
------------------------------------------
1. Create a VM with a disk
2. Set the number of images to 7
3. Create 7 snapshots - the following error is displayed:

Error while executing action: vm4: Cannot create Snapshot. The following disk(s) have exceeded the number of volumes in an image chain: vm4_Disk1 (7/7) Please remove some of the disk(s)'s snapshots and try again.

Moving to VERIFIED!
Byron - CDA stands for "Can Do Action", named after a method in the code that was renamed in RHV 4.0. Can we please change the usage of "CDA" in the doctext to "validation" or something similar? Thanks!