Bug 1342550 - while deleting vms created from a template, vdsm command fails with error VDSM command failed: Could not remove all image's volumes
Summary: while deleting vms created from a template, vdsm command fails with error VDSM command failed: Could not remove all image's volumes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.17.29
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ovirt-4.1.7
Target Release: 4.19.35
Assignee: Adam Litke
QA Contact: RamaKasturi
URL:
Whiteboard:
Duplicates: 1431517
Depends On:
Blocks: Gluster-HC-3 1438506
 
Reported: 2016-06-03 13:50 UTC by RamaKasturi
Modified: 2017-11-13 12:26 UTC
CC: 8 users

Fixed In Version:
Clone Of:
Clones: 1438506
Environment:
Last Closed: 2017-11-13 12:26:25 UTC
oVirt Team: Gluster
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: planning_ack+
rule-engine: devel_ack+
rule-engine: testing_ack+


Links
System ID Status Summary Last Updated
oVirt gerrit 81939 MERGED fileSD: Gracefully handle purgeImage delete race 2020-11-02 12:47:20 UTC
oVirt gerrit 82629 MERGED fileSD: Gracefully handle purgeImage delete race 2020-11-02 12:47:20 UTC

Description RamaKasturi 2016-06-03 13:50:42 UTC
Description of problem:
I have an HC setup with all my VMs running on Gluster storage. When I delete all the VMs from the UI, the VDSM command fails with the error "VDSM command failed: Could not remove all image's volumes" and a Traceback appears in the vdsm logs.

jsonrpc.Executor/6::DEBUG::2016-06-03 18:50:15,177::fileSD::410::Storage.StorageDomain::(_deleteVolumeFile) Removing file: /rhev/data-center/mnt/glusterSD/10.70.34.35:_data/e543b4a3-3a65-419d-b9cc-810c3f580fad/images/_remove_me_0769365e-2aed-4a6e-909e-79587d82ab9a/53fe90be-4ce7-4b25-a2da-e4b8137eac87
jsonrpc.Executor/6::DEBUG::2016-06-03 18:50:16,389::fileSD::410::Storage.StorageDomain::(_deleteVolumeFile) Removing file: /rhev/data-center/mnt/glusterSD/10.70.34.35:_data/e543b4a3-3a65-419d-b9cc-810c3f580fad/images/_remove_me_0769365e-2aed-4a6e-909e-79587d82ab9a/53fe90be-4ce7-4b25-a2da-e4b8137eac87.meta
jsonrpc.Executor/6::WARNING::2016-06-03 18:50:16,396::fileSD::415::Storage.StorageDomain::(_deleteVolumeFile) File u'/rhev/data-center/mnt/glusterSD/10.70.34.35:_data/e543b4a3-3a65-419d-b9cc-810c3f580fad/images/_remove_me_0769365e-2aed-4a6e-909e-79587d82ab9a/53fe90be-4ce7-4b25-a2da-e4b8137eac87.meta' does not exist: [Errno 2] No such file or directory
jsonrpc.Executor/6::DEBUG::2016-06-03 18:50:16,397::fileSD::410::Storage.StorageDomain::(_deleteVolumeFile) Removing file: /rhev/data-center/mnt/glusterSD/10.70.34.35:_data/e543b4a3-3a65-419d-b9cc-810c3f580fad/images/_remove_me_0769365e-2aed-4a6e-909e-79587d82ab9a/53fe90be-4ce7-4b25-a2da-e4b8137eac87.lease
jsonrpc.Executor/6::WARNING::2016-06-03 18:50:16,399::fileSD::415::Storage.StorageDomain::(_deleteVolumeFile) File u'/rhev/data-center/mnt/glusterSD/10.70.34.35:_data/e543b4a3-3a65-419d-b9cc-810c3f580fad/images/_remove_me_0769365e-2aed-4a6e-909e-79587d82ab9a/53fe90be-4ce7-4b25-a2da-e4b8137eac87.lease' does not exist: [Errno 2] No such file or directory
jsonrpc.Executor/6::DEBUG::2016-06-03 18:50:16,400::fileSD::402::Storage.StorageDomain::(deleteImage) Removing directory: /rhev/data-center/mnt/glusterSD/10.70.34.35:_data/e543b4a3-3a65-419d-b9cc-810c3f580fad/images/_remove_me_0769365e-2aed-4a6e-909e-79587d82ab9a
jsonrpc.Executor/6::ERROR::2016-06-03 18:50:16,402::fileSD::406::Storage.StorageDomain::(deleteImage) removed image dir: /rhev/data-center/mnt/glusterSD/10.70.34.35:_data/e543b4a3-3a65-419d-b9cc-810c3f580fad/images/_remove_me_0769365e-2aed-4a6e-909e-79587d82ab9a can't be removed
jsonrpc.Executor/6::ERROR::2016-06-03 18:50:16,402::task::866::Storage.TaskManager.Task::(_setError) Task=`08e421a9-9cb0-45b1-ad1f-fb65ec9f4b9f`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1544, in deleteImage
    pool.deleteImage(dom, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/securable.py", line 77, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1982, in deleteImage
    domain.deleteImage(domain.sdUUID, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/fileSD.py", line 407, in deleteImage
    raise se.ImageDeleteError("%s %s" % (imgUUID, str(e)))
ImageDeleteError: Could not remove all image's volumes: (u'0769365e-2aed-4a6e-909e-79587d82ab9a [Errno 2] No such file or directory',)


No functional impact is observed.

Version-Release number of selected component (if applicable):
vdsm-4.17.29-0.1.el7ev.noarch

How reproducible:
Hit it once

Steps to Reproduce:
1. Install an HC setup.
2. Create 9 VMs from a template.
3. Delete all the VMs created.

Actual results:
All the VMs get deleted, but an event message "Could not remove all image's volumes" is logged and a Traceback appears in the vdsm logs.

Expected results:
VDSM should not report any errors/failures/Tracebacks.

Additional info:

Comment 1 RamaKasturi 2016-06-03 14:07:27 UTC
sosreports from all nodes and engine can be found in the link below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1342550/

Comment 2 Sahina Bose 2016-06-13 11:22:47 UTC
Kasturi, is this issue consistently reproducible?

Comment 3 RamaKasturi 2016-08-30 09:54:26 UTC
(In reply to Sahina Bose from comment #2)
> Kasturi, is this issue consistently reproducible?

I am able to reproduce this consistently. I am seeing this issue with 4.0 too.

Comment 5 Nir Soffer 2017-05-29 23:24:47 UTC
Kasturi, your logs are not from a current 4.1 or 4.2 version. Can you reproduce this
with 4.1 or 4.2?

Also, we must have complete vdsm logs from the time this image was created until
it was deleted.

Comment 6 RamaKasturi 2017-05-30 13:54:43 UTC
Nir,

  I tried it with "Red Hat Virtualization Manager Version: 4.1.2.2-0.1.el7" and I still see the issue mentioned above. The following is what I see in my events: "VDSM <host> command HSMGetAllTasksStatusesVDS failed: Could not remove all image's volumes".

I see the following Traceback in the vdsm log file:

2017-05-30 18:47:04,252+0530 ERROR (tasks/7) [storage.TaskManager.Task] (Task='413e1a8a-4efe-4614-a6e0-c8bdfb91c01f') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 333, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 2001, in purgeImage
    domain.purgeImage(sdUUID, imgUUID, volsByImg, discard)
  File "/usr/share/vdsm/storage/sd.py", line 673, in purgeImage
    self._manifest.purgeImage(sdUUID, imgUUID, volsImgs, discard)
  File "/usr/share/vdsm/storage/fileSD.py", line 250, in purgeImage
    raise se.ImageDeleteError("%s %s" % (imgUUID, str(e)))
ImageDeleteError: Could not remove all image's volumes: (u'53c73285-2af9-44f3-9317-7c6aedcc450c [Errno 2] No such file or directory',)


I ran sosreport on all the nodes and the Hosted Engine VM and copied the reports to the location below.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1342550/

Hope this helps !!!

Comment 7 Nir Soffer 2017-05-30 15:03:52 UTC
Thanks Kasturi!

Can you extract the relevant vdsm logs showing the timeframe from creating the
images until they are deleted (and show this traceback) and attach them to the bug?

Comment 8 RamaKasturi 2017-06-27 07:36:46 UTC
Hi Nir,

   From engine.log, I see that VM creation was initiated at Jun 21, 2017 4:25:45 PM: "VM applinuxvm1 creation was initiated by admin@internal-authz".

    The vdsm log for 2017-06-21 (where the image is created and started) begins here in log file vdsm.log.71.xz:

2017-06-21 16:25:45,480+0530 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:533)

The vdsm log from when the image was removed starts here:

2017-06-27 11:22:18,305+0530 INFO  (jsonrpc/0) [dispatcher] Run and protect: deleteImage(sdUUID=u'11df6265-6d83-44be-a72b-1fd033074588', spUUID=u'00000001-0001-0001-0001-000000000311', imgUUID=u'0f1adbe9-9402-43a9-bc84-265d04b08cb6', postZero=u'false', force=u'false', discard=False) (logUtils:51)
2017-06-27 11:22:18,313+0530 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:49418 (protocoldetector:72)
2017-06-27 11:22:18,314+0530 WARN  (tasks/1) [storage.StorageDomainManifest] File u'/rhev/data-center/mnt/glusterSD/10.70.36.78:_data/11df6265-6d83-44be-a72b-1fd033074588/images/_remove_me_96670856-6508-48d2-ac1e-ba390dbe0a13/5886ed56-a989-48b6-8a3a-33ac1cb9b71d.lease' does not exist: [Errno 2] No such file or directory (fileSD:258)
2017-06-27 11:22:18,318+0530 ERROR (tasks/1) [storage.StorageDomainManifest] removed image dir: /rhev/data-center/mnt/glusterSD/10.70.36.78:_data/11df6265-6d83-44be-a72b-1fd033074588/images/_remove_me_96670856-6508-48d2-ac1e-ba390dbe0a13 can't be removed (fileSD:249)
2017-06-27 11:22:18,319+0530 ERROR (tasks/1) [storage.TaskManager.Task] (Task='942c1306-6f7c-4205-89e3-5db04aa8dd1e') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 333, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 2001, in purgeImage
    domain.purgeImage(sdUUID, imgUUID, volsByImg, discard)
  File "/usr/share/vdsm/storage/sd.py", line 673, in purgeImage
    self._manifest.purgeImage(sdUUID, imgUUID, volsImgs, discard)
  File "/usr/share/vdsm/storage/fileSD.py", line 250, in purgeImage
    raise se.ImageDeleteError("%s %s" % (imgUUID, str(e)))
ImageDeleteError: Could not remove all image's volumes: (u'96670856-6508-48d2-ac1e-ba390dbe0a13 [Errno 2] No such file or directory',)

and the image was successfully removed at 11:22:32 on 2017-06-27:
============================================================
2017-06-27 11:22:32,993+0530 INFO  (jsonrpc/6) [storage.VolumeManifest] Info request: sdUUID=b8be7bfb-a360-482f-b5e1-0a8215c663b4 imgUUID=44fb4952-b9b6-4c4f-83b7-57e5d0a884b9 volUUID = ead0d239-c88c-4eab-8cf5-ed2163ad5606  (volume:238)

All the vdsm logs between these dates are copied to the link below:
=======================================================================

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1342550/

I tried to reproduce it today to get a minimized log, but I was not able to, so I am attaching the logs from the time of VM creation until the VM was deleted.

Comment 9 Adam Litke 2017-09-18 17:51:39 UTC
OK, I think I understand what is happening in this bug.

Deleting disks involves calling two verbs, deleteImage and purgeImage.  The deleteImage verb renames the image directory by prepending sd.REMOVED_IMAGE_PREFIX to the directory name.  Next, purgeImage tries to remove the files individually from that renamed directory.  
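
For orientation, the two-step flow looks roughly like the sketch below (a simplified illustration with made-up helper names, not the actual vdsm code; REMOVED_IMAGE_PREFIX corresponds to the real sd.REMOVED_IMAGE_PREFIX constant):

    import os

    REMOVED_IMAGE_PREFIX = "_remove_me_"

    def delete_image(images_dir, img_uuid):
        # Step 1 (deleteImage verb): rename the image directory so it is
        # marked for removal but still present on disk.
        src = os.path.join(images_dir, img_uuid)
        dst = os.path.join(images_dir, REMOVED_IMAGE_PREFIX + img_uuid)
        os.rename(src, dst)
        return dst

    def purge_image(removed_dir):
        # Step 2 (purgeImage verb): remove each volume file individually,
        # then remove the now-empty directory.
        for name in os.listdir(removed_dir):
            os.remove(os.path.join(removed_dir, name))
        os.rmdir(removed_dir)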

We also have an imageGarbageCollector function which removes directories which have been deleted but not purged.  This function runs when the FileStorageDomain object is first created.  These objects are created as needed and then stored in the sdCache for reuse until that cache is invalidated.  This means that the garbage collector gets called periodically but unpredictably.
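
A hypothetical sketch of that caching pattern (illustrative names only, not the real vdsm API):

    _sd_cache = {}

    class FileStorageDomain(object):
        def __init__(self, sd_uuid):
            self.sd_uuid = sd_uuid
            # In vdsm, construction runs imageGarbageCollector(), which
            # removes leftover _remove_me_* image directories.
            self.image_garbage_collector()

        def image_garbage_collector(self):
            pass  # placeholder for the real cleanup

    def produce(sd_uuid):
        # Return the cached domain; a cache miss creates a new object and
        # therefore reruns the garbage collector as a side effect.
        domain = _sd_cache.get(sd_uuid)
        if domain is None:
            domain = FileStorageDomain(sd_uuid)
            _sd_cache[sd_uuid] = domain
        return domain

    def invalidate_cache():
        # connectStorageServer does the equivalent of this, so the next
        # produce() call recreates the domain object.
        _sd_cache.clear()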

This usually works fine but in the logs there was a connectStorageServer call between the deleteImage and purgeImage calls.  The connectStorageServer verb invalidates the sdCache when called.  Therefore the following happened:

1. deleteImage -> rename directory
2. connectStorageServer -> invalidate sdCache
3. purgeImage ->
   1. Needs to produce the FileStorageDomain object
   2. Not in sdCache so create new object
   3. imageGarbageCollector called during FileStorageDomain __init__()
   4. renamed directory is cleaned up by garbage collector
   5. purgeImage tries to delete the directory and finds it not present anymore


The way to fix this is to either run the garbage collector at a different time (or not at all), or ignore -ENOENT in the purgeImage function.
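
A minimal sketch of the second option, assuming the fix simply tolerates paths that were already removed (an illustration, not the merged gerrit patch):

    import errno
    import os

    def remove_ignoring_enoent(path, remove_fn=os.remove):
        # If the garbage collector already deleted the path, treat the
        # removal as successful rather than failing the whole purge.
        try:
            remove_fn(path)
        except OSError as e:
            if e.errno != errno.ENOENT:
                raise

purgeImage would call something like this for each volume file (and with os.rmdir for the directory) instead of raising ImageDeleteError on ENOENT.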

Comment 10 Denis Chaplygin 2017-09-19 12:47:07 UTC
*** Bug 1431517 has been marked as a duplicate of this bug. ***

Comment 11 RamaKasturi 2017-10-31 13:04:57 UTC
Verified and works fine with build vdsm-4.19.35-1.el7ev.x86_64.

I created 29 VMs and deleted them. In the events tab I do not see any failures related to being unable to remove VM images, nor the Tracebacks reported in the bug description.

Comment 12 RamaKasturi 2017-10-31 13:06:09 UTC
Hi Adam,

    While deleting the VMs I do see a Traceback, shown below. Any idea what this is?

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 207, in __call__
    self._func()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 510, in __call__
    self._send_metrics()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 526, in _send_metrics
    vm_sample.interval)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 47, in produce
    balloon(vm, stats, last_sample)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 169, in balloon
    balloon_cur = sample['balloon.current']
KeyError: 'balloon.current'

Thanks
kasturi

Comment 13 Adam Litke 2017-10-31 13:50:17 UTC
This looks like a race between the virt periodic sampling code and VM delete.  Not related to this bug at all but should probably be filed as a new virt bug.
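
One common way to make sampling code tolerate a VM disappearing mid-sample is a defensive lookup (a sketch, not the actual vdsm fix):

    def balloon_current(sample, default=0):
        # The stats dict may lack 'balloon.current' if the VM went away
        # between samples; fall back to a default instead of raising KeyError.
        return sample.get('balloon.current', default)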

