Description of problem:

I have an HC setup with all my VMs running on Gluster storage. When I delete all the VMs from the UI, the VDSM command fails with the error "VDSM command failed: Could not remove all image's volumes", and a traceback is seen in the vdsm logs:

jsonrpc.Executor/6::DEBUG::2016-06-03 18:50:15,177::fileSD::410::Storage.StorageDomain::(_deleteVolumeFile) Removing file: /rhev/data-center/mnt/glusterSD/10.70.34.35:_data/e543b4a3-3a65-419d-b9cc-810c3f580fad/images/_remove_me_0769365e-2aed-4a6e-909e-79587d82ab9a/53fe90be-4ce7-4b25-a2da-e4b8137eac87
jsonrpc.Executor/6::DEBUG::2016-06-03 18:50:16,389::fileSD::410::Storage.StorageDomain::(_deleteVolumeFile) Removing file: /rhev/data-center/mnt/glusterSD/10.70.34.35:_data/e543b4a3-3a65-419d-b9cc-810c3f580fad/images/_remove_me_0769365e-2aed-4a6e-909e-79587d82ab9a/53fe90be-4ce7-4b25-a2da-e4b8137eac87.meta
jsonrpc.Executor/6::WARNING::2016-06-03 18:50:16,396::fileSD::415::Storage.StorageDomain::(_deleteVolumeFile) File u'/rhev/data-center/mnt/glusterSD/10.70.34.35:_data/e543b4a3-3a65-419d-b9cc-810c3f580fad/images/_remove_me_0769365e-2aed-4a6e-909e-79587d82ab9a/53fe90be-4ce7-4b25-a2da-e4b8137eac87.meta' does not exist: [Errno 2] No such file or directory
jsonrpc.Executor/6::DEBUG::2016-06-03 18:50:16,397::fileSD::410::Storage.StorageDomain::(_deleteVolumeFile) Removing file: /rhev/data-center/mnt/glusterSD/10.70.34.35:_data/e543b4a3-3a65-419d-b9cc-810c3f580fad/images/_remove_me_0769365e-2aed-4a6e-909e-79587d82ab9a/53fe90be-4ce7-4b25-a2da-e4b8137eac87.lease
jsonrpc.Executor/6::WARNING::2016-06-03 18:50:16,399::fileSD::415::Storage.StorageDomain::(_deleteVolumeFile) File u'/rhev/data-center/mnt/glusterSD/10.70.34.35:_data/e543b4a3-3a65-419d-b9cc-810c3f580fad/images/_remove_me_0769365e-2aed-4a6e-909e-79587d82ab9a/53fe90be-4ce7-4b25-a2da-e4b8137eac87.lease' does not exist: [Errno 2] No such file or directory
jsonrpc.Executor/6::DEBUG::2016-06-03 18:50:16,400::fileSD::402::Storage.StorageDomain::(deleteImage) Removing directory: /rhev/data-center/mnt/glusterSD/10.70.34.35:_data/e543b4a3-3a65-419d-b9cc-810c3f580fad/images/_remove_me_0769365e-2aed-4a6e-909e-79587d82ab9a
jsonrpc.Executor/6::ERROR::2016-06-03 18:50:16,402::fileSD::406::Storage.StorageDomain::(deleteImage) removed image dir: /rhev/data-center/mnt/glusterSD/10.70.34.35:_data/e543b4a3-3a65-419d-b9cc-810c3f580fad/images/_remove_me_0769365e-2aed-4a6e-909e-79587d82ab9a can't be removed
jsonrpc.Executor/6::ERROR::2016-06-03 18:50:16,402::task::866::Storage.TaskManager.Task::(_setError) Task=`08e421a9-9cb0-45b1-ad1f-fb65ec9f4b9f`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1544, in deleteImage
    pool.deleteImage(dom, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/securable.py", line 77, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1982, in deleteImage
    domain.deleteImage(domain.sdUUID, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/fileSD.py", line 407, in deleteImage
    raise se.ImageDeleteError("%s %s" % (imgUUID, str(e)))
ImageDeleteError: Could not remove all image's volumes: (u'0769365e-2aed-4a6e-909e-79587d82ab9a [Errno 2] No such file or directory',)

I do not see any functional impact. Note the asymmetry in the log: missing volume files (.meta, .lease) only produce WARNING lines, while the missing image directory escalates to an error; a sketch of this pattern follows the sosreports link below.
Version-Release number of selected component (if applicable):
vdsm-4.17.29-0.1.el7ev.noarch

How reproducible:
Hit it once

Steps to Reproduce:
1. Install an HC setup.
2. Create 9 VMs from a template.
3. Delete all the created VMs.

Actual results:
All the VMs get deleted, but an event message "Could not remove all image's volumes" is raised and a traceback appears in the vdsm logs.

Expected results:
VDSM should not report any errors, failures, or tracebacks.

Additional info:
Sosreports from all the nodes and the engine can be found at the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1342550/
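To illustrate the warning/error asymmetry visible in the log excerpt above, here is a minimal Python sketch of that pattern. The function names and the RuntimeError stand-in are illustrative, not the actual vdsm source:

import errno
import logging
import os

log = logging.getLogger("Storage.StorageDomain")

def delete_volume_file(path):
    # Missing volume files are tolerated: log a warning and continue.
    # This matches the WARNING lines for the .meta and .lease files above.
    try:
        os.remove(path)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise
        log.warning("File %r does not exist: %s", path, e)

def delete_image(image_dir, volume_files):
    for name in volume_files:
        delete_volume_file(os.path.join(image_dir, name))
    # A missing image directory, by contrast, is escalated to an error,
    # which is what produces the ImageDeleteError in the traceback above.
    try:
        os.rmdir(image_dir)
    except OSError as e:
        raise RuntimeError("Could not remove all image's volumes: %s" % e)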
Kasturi, is this issue consistently reproducible?
(In reply to Sahina Bose from comment #2)
> Kasturi, is this issue consistently reproducible?

Able to reproduce this consistently. I am seeing this issue with 4.0 too.
Kasturi, your logs are not from a current 4.1 or 4.2 version. Can you reproduce this with 4.1 or 4.2? Also, we must have the complete vdsm logs from the time this image was created until it was deleted.
Nir, I tried it with "Red Hat Virtualization Manager Version: 4.1.2.2-0.1.el7" and I still see the issue mentioned above. The following is what I see in Events: "VDSM <host> command HSMGetAllTasksStatusesVDS failed: Could not remove all image's volumes".

I see the following traceback in the vdsm log file:

2017-05-30 18:47:04,252+0530 ERROR (tasks/7) [storage.TaskManager.Task] (Task='413e1a8a-4efe-4614-a6e0-c8bdfb91c01f') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 333, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 2001, in purgeImage
    domain.purgeImage(sdUUID, imgUUID, volsByImg, discard)
  File "/usr/share/vdsm/storage/sd.py", line 673, in purgeImage
    self._manifest.purgeImage(sdUUID, imgUUID, volsImgs, discard)
  File "/usr/share/vdsm/storage/fileSD.py", line 250, in purgeImage
    raise se.ImageDeleteError("%s %s" % (imgUUID, str(e)))
ImageDeleteError: Could not remove all image's volumes: (u'53c73285-2af9-44f3-9317-7c6aedcc450c [Errno 2] No such file or directory',)

I ran sosreport on all the nodes and the Hosted Engine VM and copied the reports to the location below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1342550/

Hope this helps!
Thanks Kasturi! Can you extract the relevant vdsm logs showing the timeframe from creating the images until they are deleted (and show this traceback) and attach them to the bug?
Hi Nir,

From engine.log, I see that VM creation was initiated at Jun 21, 2017 4:25:45 PM: "VM applinuxvm1 creation was initiated by admin@internal-authz".

The vdsm log where the image is created and started begins here, in log file vdsm.log.71.xz (2017-06-21):

2017-06-21 16:25:45,480+0530 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:533)

The vdsm log from when the image was removed starts here:

2017-06-27 11:22:18,305+0530 INFO (jsonrpc/0) [dispatcher] Run and protect: deleteImage(sdUUID=u'11df6265-6d83-44be-a72b-1fd033074588', spUUID=u'00000001-0001-0001-0001-000000000311', imgUUID=u'0f1adbe9-9402-43a9-bc84-265d04b08cb6', postZero=u'false', force=u'false', discard=False) (logUtils:51)
2017-06-27 11:22:18,313+0530 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:49418 (protocoldetector:72)
2017-06-27 11:22:18,314+0530 WARN (tasks/1) [storage.StorageDomainManifest] File u'/rhev/data-center/mnt/glusterSD/10.70.36.78:_data/11df6265-6d83-44be-a72b-1fd033074588/images/_remove_me_96670856-6508-48d2-ac1e-ba390dbe0a13/5886ed56-a989-48b6-8a3a-33ac1cb9b71d.lease' does not exist: [Errno 2] No such file or directory (fileSD:258)
2017-06-27 11:22:18,318+0530 ERROR (tasks/1) [storage.StorageDomainManifest] removed image dir: /rhev/data-center/mnt/glusterSD/10.70.36.78:_data/11df6265-6d83-44be-a72b-1fd033074588/images/_remove_me_96670856-6508-48d2-ac1e-ba390dbe0a13 can't be removed (fileSD:249)
2017-06-27 11:22:18,319+0530 ERROR (tasks/1) [storage.TaskManager.Task] (Task='942c1306-6f7c-4205-89e3-5db04aa8dd1e') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 333, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 2001, in purgeImage
    domain.purgeImage(sdUUID, imgUUID, volsByImg, discard)
  File "/usr/share/vdsm/storage/sd.py", line 673, in purgeImage
    self._manifest.purgeImage(sdUUID, imgUUID, volsImgs, discard)
  File "/usr/share/vdsm/storage/fileSD.py", line 250, in purgeImage
    raise se.ImageDeleteError("%s %s" % (imgUUID, str(e)))
ImageDeleteError: Could not remove all image's volumes: (u'96670856-6508-48d2-ac1e-ba390dbe0a13 [Errno 2] No such file or directory',)

The image was successfully removed at 11:22:32 on 2017-06-27:

2017-06-27 11:22:32,993+0530 INFO (jsonrpc/6) [storage.VolumeManifest] Info request: sdUUID=b8be7bfb-a360-482f-b5e1-0a8215c663b4 imgUUID=44fb4952-b9b6-4c4f-83b7-57e5d0a884b9 volUUID = ead0d239-c88c-4eab-8cf5-ed2163ad5606 (volume:238)

All the vdsm logs in between these days have been copied to the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1342550/

I tried to reproduce it today to get a minimized log, but I was not able to, so I am attaching the logs from the time of VM creation until the VMs were deleted.
OK, I think I understand what is happening in this bug.

Deleting disks involves calling two verbs, deleteImage and purgeImage. The deleteImage verb renames the image directory by prepending sd.REMOVED_IMAGE_PREFIX to the directory name. Next, purgeImage tries to remove the files individually from that renamed directory.

We also have an imageGarbageCollector function which removes directories that have been deleted but not purged. This function runs when the FileStorageDomain object is first created. These objects are created as needed and then stored in the sdCache for reuse until that cache is invalidated. This means the garbage collector gets called periodically but unpredictably.

This usually works fine, but in the logs there was a connectStorageServer call between the deleteImage and purgeImage calls. The connectStorageServer verb invalidates the sdCache when called. Therefore, the following happened:

1. deleteImage -> rename directory
2. connectStorageServer -> invalidate sdCache
3. purgeImage ->
   a. Needs to produce the FileStorageDomain object
   b. Not in the sdCache, so a new object is created
   c. imageGarbageCollector is called during FileStorageDomain.__init__()
   d. The renamed directory is cleaned up by the garbage collector
   e. purgeImage tries to delete the directory and finds it no longer present

The way to fix this is to either run the garbage collector at a different time (or not at all), or ignore ENOENT in the purgeImage function, as sketched below.
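A minimal sketch of the second option, an ENOENT-tolerant purge; the function and helper names here are illustrative, not the actual vdsm code:

import errno
import logging
import os

log = logging.getLogger("storage.purge")

def _remove_ignoring_enoent(path, remove_fn):
    # If the garbage collector already removed this path, the desired end
    # state (path gone) has been reached, so ENOENT is not an error here.
    try:
        remove_fn(path)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise
        log.warning("%s already removed, ignoring", path)

def purge_image(image_dir, volume_files):
    # Remove each volume file, then the already-renamed image directory.
    # A concurrent cleanup may have deleted any of these first.
    for name in volume_files:
        _remove_ignoring_enoent(os.path.join(image_dir, name), os.remove)
    _remove_ignoring_enoent(image_dir, os.rmdir)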
*** Bug 1431517 has been marked as a duplicate of this bug. ***
Verified, and it works fine with build vdsm-4.19.35-1.el7ev.x86_64. Created 29 VMs and deleted them. In the Events tab I do not see any failures related to being unable to remove VM images, and there are no tracebacks like those reported in the bug description.
Hi Adam,

While deleting the VMs I do see a traceback, as below. Any idea what this is?

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 207, in __call__
    self._func()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 510, in __call__
    self._send_metrics()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 526, in _send_metrics
    vm_sample.interval)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 47, in produce
    balloon(vm, stats, last_sample)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 169, in balloon
    balloon_cur = sample['balloon.current']
KeyError: 'balloon.current'

Thanks,
Kasturi
This looks like a race between the virt periodic sampling code and VM delete. Not related to this bug at all but should probably be filed as a new virt bug.
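For what it's worth, the traceback points at a plain dict lookup. A generic defensive sketch (not the actual vdsm fix) would be to skip the stat when the key is missing from a sample taken mid-teardown:

def balloon_stats(sample):
    # Periodic sampling can race with VM teardown, so the stats sample may
    # lack the 'balloon.current' key. Using .get() and skipping the stat
    # avoids the KeyError seen in the traceback above.
    balloon_cur = sample.get('balloon.current')
    if balloon_cur is None:
        return {}
    return {'balloon.current': balloon_cur}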