Bug 1270220 - SPM is not tolerant for very slow NFS file deletes
Summary: SPM is not tolerant for very slow NFS file deletes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.0.0-alpha
Target Release: 4.0.0
Assignee: Liron Aravot
QA Contact: Elad
URL:
Whiteboard:
Duplicates: 1266579 1367721 (view as bug list)
Depends On: 1273122 1286963
Blocks:
 
Reported: 2015-10-09 10:09 UTC by Tim Speetjens
Modified: 2019-09-12 09:06 UTC (History)
24 users

Fixed In Version: ovirt 4.0.0 alpha1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-23 20:18:15 UTC
oVirt Team: Storage
Target Upstream Version:
tspeetje: needinfo-
tspeetje: needinfo-


Attachments (Terms of Use)
logs-28.4 (1.82 MB, application/x-gzip)
2016-04-28 09:57 UTC, Elad
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2016:1671 normal SHIPPED_LIVE VDSM 4.0 GA bug fix and enhancement update 2016-09-02 21:32:03 UTC
oVirt gerrit 48477 master MERGED storage: perform image deletion in task 2015-12-28 16:54:49 UTC
Red Hat Bugzilla 1331335 None CLOSED [engine-backend] An attempt to import an image back to the data domain while the original one has "_remove_me" in its ID... 2019-10-11 12:51:32 UTC

Internal Links: 1331335

Description Tim Speetjens 2015-10-09 10:09:53 UTC
Description of problem:
When under load, NFS filers may take a very long time to delete large files. When this happens, the SPM status is set to problematic, and 'Data Center is being initialized' messages are seen.

Version-Release number of selected component (if applicable):
vdsm-4.16.20-1.el6ev.x86_64


How reproducible:
Difficult; requires one or more file deletes to take longer than the timeout (default 1 minute).

Steps to Reproduce:
If the delete takes very long:
1. Delete a large image (part of a template or a VM)

Actual results:
ioprocess reports a timeout in vdsm.log, the SPM state is set to problematic, and 'Data Center is being initialized' messages are logged (log_type_name SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC_SEARCHING_NEW_SPM)

Sometimes the SPM continues to operate on the same host.

If this happens in a task managed by the task manager, the data center does not reach the initialized state for a longer time.

Expected results:
The SPM should be more tolerant of slow deletes and continue its operations without issues.

Additional info:
When this happens, manual cleanup is required: e.g. locked disks, disks in an illegal state.

It looks like rhevm-shell and the ovirtsdk behave differently when deleting a VM or a template.

Comment 1 Tim Speetjens 2015-10-09 10:22:05 UTC
The following is seen in vdsm.log

14:39:18,242::task::866::Storage.TaskManager.Task::(_setError) Task=`<id>`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1545, in deleteImage
    pool.deleteImage(dom, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/securable.py", line 77, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1869, in deleteImage
    domain.deleteImage(domain.sdUUID, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/fileSD.py", line 384, in deleteImage
    self.oop.os.remove(volPath)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 245, in remove
    self._iop.unlink(path)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 455, in unlink
    return self._sendCommand("unlink", {"path": path}, self.timeout)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

This is then propagated to the manager after the task is aborted:
Thread-858705::ERROR::2015-10-07 14:39:18,304::dispatcher::79::Storage.Dispatcher::(wrapper) Connection timed out
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 71, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 103, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1179, in prepare
    raise self.error
Timeout: Connection timed out
Thread-858705::DEBUG::2015-10-07 14:39:18,307::stompReactor::163::yajsonrpc.StompServer::(send) Sending response

Also seen with glob:
Thread-847559::ERROR::2015-10-07 10:31:00,791::task::866::Storage.TaskManager.Task::(_setError) Task=`id`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1515, in deleteImage
    allVols = dom.getAllVolumes()
  File "/usr/share/vdsm/storage/fileSD.py", line 425, in getAllVolumes
    volMetaPaths = self.oop.glob.glob(volMetaPattern)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 120, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 500, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

which is then propagated as well, after the task is aborted:
Thread-847559::ERROR::2015-10-07 10:31:00,800::dispatcher::79::Storage.Dispatcher::(wrapper) Connection timed out
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 71, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 103, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1179, in prepare
    raise self.error
Timeout: Connection timed out
Thread-847559::DEBUG::2015-10-07 10:31:00,801::stompReactor::163::yajsonrpc.StompServer::(send) Sending response

Comment 3 Allon Mureinik 2015-10-11 09:17:26 UTC
Nir, can you take a look and see if there's anything intelligent we can do here?

Comment 4 Nir Soffer 2015-10-11 13:20:42 UTC
Tim, can you check if using the rfh oop implementation works better for this case?

ioprocess should be better in every way, but it is newer and less tested than rfh.

To switch implementation edit /etc/vdsm.conf and set:

[irs]
oop_impl = rfh

Restart vdsm to apply the change.

Comment 5 Nir Soffer 2015-10-11 13:40:09 UTC
(In reply to Tim Speetjens from comment #0)
> Description of problem:
> When under load, 

Can you be more specific about "under load"? How can we reproduce this load?

> NFS filers may require a very large time 

What is a "very large time"? How long does it take to delete files in this case?

> to delete large
> files. When this happens, the SPM status is set to problematic, and messages
> are seen saying 'Data Center is being initialized'

> Expected results:
> The SPM should be more tolerant on slow deletes, and continue its operations
> without issues.

We can increase the timeouts when calling ioprocess, but this will lead to tasks that are blocked for a long time, and we have limits on the number of concurrent tasks (10) and queued tasks (500). Increasing the timeouts would probably require increasing the other limits as well.

> Additional info:
> When this happens, manual cleanup is required: ex. Locked disks, locked
> disks,  disks in illegal state

This looks like an engine issue. The engine should be prepared to handle failed deletes, retrying the operation or allowing the user to retry. Can you open a separate engine bug for this?

Comment 6 Nir Soffer 2015-10-11 14:09:56 UTC
If you want to try larger timeouts:

[irs]
process_pool_timeout = 60
process_pool_max_slots_per_domain = 10
process_pool_max_queued_slots_per_domain = 10
thread_pool_size = 10
max_tasks = 500

Start with increasing process_pool_timeout (e.g. to 120), and let's see if this change is good enough.

You may need to increase thread_pool_size, which may in turn require an increase of process_pool_max_slots_per_domain.

Comment 7 Nir Soffer 2015-10-11 15:39:16 UTC
Checking the relevant code reveals that deleteImage has been a synchronous operation since before 2.3 (a comment mentions that this should be fixed in 2.3).

When using "wipe after delete" (only relevant to block storage), we 
perform the validation in the request thread, and the actual delete in
the task manager.

On file storage, and on block storage when not using "wipe after delete", the entire operation is performed in the request thread, and the scheduled "task" is only a fake task for error reporting.

This means that tuning task manager configuration (e.g. thread_pool_size)
will have no effect on this issue.

Also, since there is no limit on the number of request threads, the operation may fail when there are no available oop handlers. This problem is worse with jsonrpc, since it may create an unlimited number of request threads, while with xmlrpc there is one thread per engine connection.

Changing this design is too risky for 3.5, and it is too late for 3.6.

The new SDM code should avoid this issue by running delete operations
in a thread pool, and by the new garbage collection mechanism.
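As an illustration of the thread-pool idea, here is a minimal sketch (the helper names are invented for illustration; this is not actual vdsm or SDM code): the request thread only schedules the removal and returns, while a small bounded pool absorbs slow unlinks.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Bounded pool: at most 4 deletes block on storage at a time. The
# request thread only schedules the work and returns immediately,
# so a slow NFS unlink cannot stall the caller.
_delete_pool = ThreadPoolExecutor(max_workers=4)

def delete_image_async(paths):
    """Schedule removal of image files without blocking the caller."""
    def _remove():
        for path in paths:
            try:
                os.remove(path)  # may block for minutes on slow NFS
            except OSError:
                pass             # a garbage collector would retry later
    return _delete_pool.submit(_remove)

# Usage: the caller gets a future back at once, even if unlink is slow.
tmp = tempfile.mkdtemp()
files = [os.path.join(tmp, n) for n in ("vol", "vol.meta", "vol.lease")]
for f in files:
    open(f, "w").close()
future = delete_image_async(files)
future.result(timeout=60)  # wait here only to make the example deterministic
```

The key property is that the pool size, not the number of incoming requests, bounds how many threads can be stuck waiting on storage.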

I think we should target this to the new SDM feature.

For 3.5, we can consider tuning the timeouts, although I don't think we can ship extreme timeouts that work with a very slow NFS server. Handling such a case must be done with site-specific configuration until the issue on the server side is resolved by the vendor.

Adam, what do you think?

Comment 8 Tim Speetjens 2015-10-12 07:07:40 UTC
(In reply to Nir Soffer from comment #6)
> If you want to try larger timeouts:

> Start with increasing process_pool_timeout (e.g. 120), and lets see if this
> change is good enough.

From our investigations, the longest time the unlink system call took was about 90 seconds (later this grew even larger). Based on this, we had already set process_pool_timeout to 180.

This wasn't sufficient: unlink system calls that took only about 60 seconds would still cause process_pool_timeout to be reached for the "unlink" command in ioprocess.

Are there any other bottlenecks in this mechanism?

> Also, since there is no limit on the number of request threads, the 
> operation may fail since there are no available oop handlers. This
> problem is worse with jsonrpc since it may create unlimited number
> of request threads, while with xmlrpc, there is a thread per engine
> connection.

Also, tests were performed with, in addition to the 180-second timeout, a process_pool_max_slots_per_domain of 20. These values still did not prevent the issue; however, they may have had a positive impact.

Can we measure this? For example: how long do the oop requests take, and what is the state of the ioprocess thread pool?

Comment 9 Nir Soffer 2015-10-12 07:20:43 UTC
(In reply to Tim Speetjens from comment #8)
> (In reply to Nir Soffer from comment #6)
> > If you want to try larger timeouts:
> 
> > Start with increasing process_pool_timeout (e.g. 120), and lets see if this
> > change is good enough.
> 
> From our investigations, the largest time it took for the unlink system
> call, was about 90 seconds. (Later this even grew larger) Based on this, we
> already set the process_pool_timeout to 180.
> 
> This wasn't sufficient, as the unlink system calls that took about 60
> seconds, still would make the proces_pool_timeout to be reached with the
> "unlink" command in ioprocess.

You set process_pool_timeout to 180, but an unlink that took 60 seconds caused a timeout?

> Do we have any other bottlenecks, in this mechanism?

Maybe there is some hardcoded timeout in ioprocess.

Piotr, can you check this?

> > Also, since there is no limit on the number of request threads, the 
> > operation may fail since there are no available oop handlers. This
> > problem is worse with jsonrpc since it may create unlimited number
> > of request threads, while with xmlrpc, there is a thread per engine
> > connection.
> 
> Also, tests were performed with, in addition to the 180 seconds timeout, a
> process_pool_max_slots_per_domain of 20. These values, still do not prevent
> the issue, however, it may have had a positive impact.
> 
> Can we measure this? For example: how long do the oop requests take, what's
> the state of the ioprocess thread pool?

Piotr, can you answer this?

Comment 10 Piotr Kliczewski 2015-10-12 08:50:57 UTC
I am not aware of any hardcoded timeouts.

Comment 14 Yaniv Kaul 2015-10-18 06:34:54 UTC
(In reply to Nir Soffer from comment #7)
> Checking the relevant code reveal that deleteImage is a synchronous 
> operation since pre 2.3 (a comment mention that this should be fixed in
> 2.3).
> 
> When using "wipe after delete" (only relevant to block storage), we 
> perform the validation in the request thread, and the actual delete in
> the task manager.

wipe after delete is not really an operation on NFS, AFAIR.
And delete itself should be a very quick operation.

Comment 20 Nir Soffer 2015-10-19 15:53:03 UTC
Tim, how many concurrent deleteImage tasks can you see in the vdsm log, and how many tasks were running in vdsm when you had the timeouts?

If the storage is very slow, all ioprocess threads may be busy waiting for the storage, and new io requests can time out waiting in the ioprocess queue.

If this is the case, increasing process_pool_max_slots_per_domain can help. I would try increasing this value to 20 or 30.

Comment 21 Nir Soffer 2015-10-19 16:39:08 UTC
Tim, we will track the ioprocess issues in bug 1273122.

Please check this bug and add more info if needed.

Comment 26 Elad 2015-10-21 14:38:10 UTC
We've simulated the customer case using a 3.5 setup with vdsm-4.16.20-1.el6ev.x86_64 installed on the host.
I used an NFS storage server with Nir's code that simulates a slow file system.


I created a VM from a template as a clone (1 image with 3 files/volumes under the image dir).

Modified the following:

engine:
VdsTimeout=900


vdsm.conf:
[irs]
process_pool_max_slots_per_domain = 20
process_pool_timeout = 100

Removed the VM with its disk.

deleteImage starts:

Thread-267::INFO::2015-10-21 16:56:53,608::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='f1260a63-d301-465d-a41e-a1986c6f702a', spUUID='00000002-0002-0002-0002-000000000152', imgUUID='5636bc19-c1a8-4218-a5c3-5e5fa0da9b9a', postZero='false', force='false')


Thread-267::DEBUG::2015-10-21 16:56:53,609::resourceManager::198::Storage.ResourceManager.Request::(__init__) ResName=`Storage.5636bc19-c1a8-4218-a5c3-5e5fa0da9b9a_f1260a63-d301-465d-a41e-a1986c6f702a`ReqID=`40d9f7d7-06c5-46ef-a4ed-5c33e75bcab2`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '1513' at 'deleteImage'

Dir rename:

Thread-267::DEBUG::2015-10-21 16:56:53,651::fileSD::374::Storage.StorageDomain::(deleteImage) Renaming dir /rhev/data-center/mnt/10.35.161.79:_root_test_slowfs_1/f1260a63-d301-465d-a41e-a1986c6f702a/images/5636bc19-c1a8-4218-a5c3-5e5fa0da9b9a to /rhev/data-center/mnt/10.35.161.79:_root_test_slowfs_1/f1260a63-d301-465d-a41e-a1986c6f702a/images/_remor3eg5J

volume removal starts:

Thread-267::DEBUG::2015-10-21 16:56:53,654::fileSD::383::Storage.StorageDomain::(deleteImage) Removing file: /rhev/data-center/mnt/10.35.161.79:_root_test_slowfs_1/f1260a63-d301-465d-a41e-a1986c6f702a/images/_remor3eg5J/28211c7b-108b-444a-a843-d571f73d3e67

meta file removal starts:

Thread-267::DEBUG::2015-10-21 16:58:23,786::fileSD::386::Storage.StorageDomain::(deleteImage) Removing file: /rhev/data-center/mnt/10.35.161.79:_root_test_slowfs_1/f1260a63-d301-465d-a41e-a1986c6f702a/images/_remor3eg5J/28211c7b-108b-444a-a843-d571f73d3e67.meta

lease file removal starts:

Thread-267::DEBUG::2015-10-21 16:59:53,940::fileSD::389::Storage.StorageDomain::(deleteImage) Removing file: /rhev/data-center/mnt/10.35.161.79:_root_test_slowfs_1/f1260a63-d301-465d-a41e-a1986c6f702a/images/_remor3eg5J/28211c7b-108b-444a-a843-d571f73d3e67.lease


Dir removal:

Thread-267::DEBUG::2015-10-21 17:01:24,027::fileSD::394::Storage.StorageDomain::(deleteImage) Removing directory: /rhev/data-center/mnt/10.35.161.79:_root_test_slowfs_1/f1260a63-d301-465d-a41e-a1986c6f702a/images/_remor3eg5J

deleteImage ends successfully:

Thread-267::INFO::2015-10-21 17:01:24,166::logUtils::47::dispatcher::(wrapper) Run and protect: deleteImage, Return response: None

Comment 27 Elad 2015-10-21 14:41:20 UTC
*Tested while the delay in the storage server for each remove operation was set to 90 seconds.

Comment 28 Elad 2015-10-22 12:21:08 UTC
Tested with the following conf:

Slowfs conf:
unlink = 45

engine conf:
VdsTimeout = 360

vdsm.conf:
[irs]
process_pool_max_slots_per_domain = 20
process_pool_timeout = 60

Created 50 VMs, each with one 1G sparse disk, on the slowfs domain.
Then removed the disks using the Python SDK (one by one by the system).

All VMs were removed successfully; no timeouts were seen in the vdsm log.

Example delete flow:

Thread-3097::INFO::2015-10-22 13:59:53,885::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='f1260a63-d301-465d-a41e-a1986c6f702a', spUUID='00000002-0002-0002-0002-000000000152', imgUUID='0
a787408-e5c3-43ec-a1ba-3c5bc9e6abb2', postZero='false', force='false')

Thread-3097::DEBUG::2015-10-22 13:59:53,910::fileSD::374::Storage.StorageDomain::(deleteImage) Renaming dir /rhev/data-center/mnt/10.35.161.79:_root_test_slowfs_1/f1260a63-d301-465d-a41e-a1986c6f702a/images/0a7874
08-e5c3-43ec-a1ba-3c5bc9e6abb2 to /rhev/data-center/mnt/10.35.161.79:_root_test_slowfs_1/f1260a63-d301-465d-a41e-a1986c6f702a/images/_remoyEOMTo

Thread-3097::DEBUG::2015-10-22 13:59:53,912::fileSD::383::Storage.StorageDomain::(deleteImage) Removing file: /rhev/data-center/mnt/10.35.161.79:_root_test_slowfs_1/f1260a63-d301-465d-a41e-a1986c6f702a/images/_rem
oyEOMTo/72b5ecbe-75d6-4f03-8961-cacc40d35fd9

meta file removal:

Thread-3097::DEBUG::2015-10-22 14:00:38,962::fileSD::386::Storage.StorageDomain::(deleteImage) Removing file: /rhev/data-center/mnt/10.35.161.79:_root_test_slowfs_1/f1260a63-d301-465d-a41e-a1986c6f702a/images/_remoyEOMTo/72b5ecbe-75d6-4f03-8961-cacc40d35fd9.meta
Thread-3097::DEBUG::2015-10-22 14:01:24,030::fileSD::389::Storage.StorageDomain::(deleteImage) Removing file: /rhev/data-center/mnt/10.35.161.79:_root_test_slowfs_1/f1260a63-d301-465d-a41e-a1986c6f702a/images/_remoyEOMTo/72b5ecbe-75d6-4f03-8961-cacc40d35fd9.lease

Thread-3097::DEBUG::2015-10-22 14:02:09,084::fileSD::394::Storage.StorageDomain::(deleteImage) Removing directory: /rhev/data-center/mnt/10.35.161.79:_root_test_slowfs_1/f1260a63-d301-465d-a41e-a1986c6f702a/images/_remoyEOMTo

Thread-3097::INFO::2015-10-22 14:02:09,163::logUtils::47::dispatcher::(wrapper) Run and protect: deleteImage, Return response: None

Comment 29 Nir Soffer 2015-10-25 08:07:25 UTC
Tim,

To get logs from ioprocess, you must also enable DEBUG logging for the
IOProcess logger in /etc/vdsm/logger.conf:

#/etc/vdsm/logger.conf

[logger_IOProcess]
level=DEBUG
handlers=logfile
qualname=IOProcess
propagate=0

Restart vdsm to apply the changes.

We are waiting now for new logs from customer, showing vdsm timeouts with
ioprocess timing info.

Thanks,
Nir

Comment 30 Yaniv Kaul 2015-10-25 08:23:24 UTC
Elad - please note that the customer's VMs are large, may or may not be sparse, and there are many of them. We believe that deleting large disks is significantly different from deleting 1G ones. Please try with much larger disks.

Comment 31 Elad 2015-10-25 10:53:55 UTC
Yaniv, the test we performed simulated slow file deletion (each disk deletion took around 3 minutes), which is supposed to be similar to deleting large disks over non-delayed storage.
Also, I need some more information from GSS.

Tim, can you please provide the exact setup specification of the customer? I would like to know the following:
1) Were the VMs created as thin copies or cloned from the template? (i.e., are the VM disks independent/cloned or thin/snapshots?)
2) The number of VMs and disks.

In terms of disks size, the allocation policy of the disks is not relevant since we're talking about file based storage. Nir, can you confirm?

Comment 32 Nir Soffer 2015-10-25 11:34:31 UTC
(In reply to Elad from comment #31)
> In terms of disks size, the allocation policy of the disks is not relevant
> since we're talking about file based storage. Nir, can you confirm?

Yes

Comment 33 Yaniv Kaul 2015-10-25 11:49:09 UTC
(In reply to Nir Soffer from comment #32)
> (In reply to Elad from comment #31)
> > In terms of disks size, the allocation policy of the disks is not relevant
> > since we're talking about file based storage. Nir, can you confirm?
> 
> Yes

I'm not sure if raw-sparse and qcow2 will behave the same on NFS. Files are files, but their allocation underneath and the way the FS treats them may be different.

Comment 34 Nir Soffer 2015-10-25 12:00:08 UTC
(In reply to Yaniv Kaul from comment #33)
> (In reply to Nir Soffer from comment #32)
> > (In reply to Elad from comment #31)
> > > In terms of disks size, the allocation policy of the disks is not relevant
> > > since we're talking about file based storage. Nir, can you confirm?
> > 
> > Yes
> 
> I'm not sure if raw-sparse and qcow2 will behave the same on NFS. Files are
> files, but they allocation underneath and the way the FS treats them may be
> different.

Yes, but we tested deletion of empty files, so the format does not matter.

To be more specific about how we tested: we used an NFS server exporting the slowfs filesystem [1]. We configured unlink to take at least 45 seconds, regardless of the size of the file unlinked (slowfs sleeps 45 seconds before unlinking the file, blocking the caller).

Since each deleted disk has 3 volumes (image/volume, image/volume.meta, image/volume.lease), each delete took at least 135 seconds, as we can see in the example flow in comment 28.

We can check the effect of deleting big files with different allocations on a real filesystem, but I don't think this is the issue here. The users are creating and deleting a VM for each test, so it does not make sense to copy big images and delete them for each VM.

In the logs added in comment 23 and comment 24, we can see that the user is deleting 232 VMs over 2 hours.

I suggest waiting for more info from Tim about the file size and format of the failed deletes.

[1] https://github.com/nirs/slowfs

Comment 35 Tal Nisan 2015-10-26 12:11:35 UTC
Currently there is no immediate solution in sight, and since the 3.5.6 build is tomorrow, retargeting to 3.5.7.

Comment 37 Nir Soffer 2015-11-03 16:08:47 UTC
Currently, deleteImage on file-based storage renames the image to _remove_me_XXXXXX (XXXXXX is a random name), then removes all the files in the image directory, and finally removes the empty image directory.

While vdsm is busy removing the image files, the engine holds the SPM lock and waits until the operation is finished. No other SPM operation can happen during this time.

This design is a weak point in the system, and we plan to improve it in 3.6.
Hopefully we will be able to backport a fix to 3.5, or at least provide a hotfix for users that have issues with the current design.

Our plan is to modify deleteImage on file-based storage so it works more like block storage:

1. rename the image directory to _remove_me_XXXXXX
2. schedule removal of the volumes in the task manager
3. return success

The engine will remove the image from the database at this point, and will be able to send the next SPM command immediately.

The actual removal of the image files will happen later in a task manager
thread.

We assume that most delete operations are fast, so in the worst case some threads in the task manager will be blocked for a couple of minutes on slow deletes, without affecting other tasks.

In the worst case, all tasks will be blocked on slow deletes until they time out. In this case, stale _remove_me_XXXXXX directories can be removed manually. We have the same issue on block storage if the wipe after delete fails. This issue will be solved in 4.0, which will use a different scheme where the engine controls the process, and no stale files will be left on storage.
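The three-step plan can be sketched in minimal form (a hedged sketch with a hypothetical helper, not the actual vdsm patch): rename synchronously, defer the slow removals to a background thread standing in for the task manager, and return at once.

```python
import os
import uuid
import threading
import tempfile

def delete_image(image_dir):
    """Sketch of the proposed flow: rename fast, delete slowly off-thread."""
    # Step 1: rename the image dir to _remove_me_XXXXXX (fast, and atomic
    # on POSIX), which immediately frees the original image id for reuse.
    removed = os.path.join(os.path.dirname(image_dir),
                           "_remove_me_" + uuid.uuid4().hex[:8])
    os.rename(image_dir, removed)

    # Step 2: schedule the slow part (unlinking each volume file) in the
    # background, standing in for the task manager.
    def _task():
        for name in os.listdir(removed):
            os.remove(os.path.join(removed, name))
        os.rmdir(removed)

    worker = threading.Thread(target=_task)
    worker.start()

    # Step 3: return success right away; the SPM lock can be released.
    return worker

# Usage: simulate an image directory with three volume files.
base = tempfile.mkdtemp()
img = os.path.join(base, "5636bc19-c1a8-4218-a5c3-5e5fa0da9b9a")
os.mkdir(img)
for suffix in ("", ".meta", ".lease"):
    open(os.path.join(img, "vol" + suffix), "w").close()
worker = delete_image(img)
worker.join()
```

Because the rename completes before the call returns, the original image id disappears from the domain immediately, even if the unlinks take minutes.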

Tim, we would like to get your input before we continue with this plan.

Comment 38 Nir Soffer 2015-11-03 16:28:38 UTC
Sahina, is this plan going to work with a gluster storage domain? We assume
that rename is atomic (a POSIX requirement) and fast.

Is rename on gluster storage fast and atomic?

Comment 39 Tim Speetjens 2015-11-03 16:38:21 UTC
I would prefer to have both the rename and the delete in the async task, managed by the task manager, but I'm OK with the proposed solution.

Comment 40 Nir Soffer 2015-11-03 17:21:26 UTC
(In reply to Tim Speetjens from comment #39)
> I would prefer to have both rename and delete in the async task, managed by
> task manager, but I'm ok with the proposed solution.

Tim, this may be possible in the next version, but for 3.6/3.5, we want
minimal changes.

Comment 41 Tim Speetjens 2015-11-05 11:24:37 UTC
*** Bug 1266579 has been marked as a duplicate of this bug. ***

Comment 42 Sahina Bose 2015-11-05 15:04:47 UTC
(In reply to Nir Soffer from comment #38)
> Sahina, is this plan going to work with gluster storage domain? We assume
> that rename is atomic (posix requirment) and fast.
> 
> Is rename on gluster storage fast and atomic?

Nir, as per Raghavendra (rgowdappa) from the gluster team:

1. Rename of files (non-directories) is atomic. We hold a lock that contends with other renames on the same file. However, this lock doesn't prevent other directory operations, like readdir or creating/deleting links to the same file, while the rename is in progress.
2. Directory renames are non-atomic and there is no crash-consistency.
3. There is no rollback/crash-consistency in rename: rename is a compound operation in the gluster DHT layer. Though we acquire a lock, if a brick/client crashes, rollback is not performed, so future operations can "see" the intermediate state of the rename. (However, for a replicate-only volume, renaming directories is atomic.)

Comment 43 Allon Mureinik 2015-11-12 12:36:18 UTC
AFAIK, the urgent customer issue was solved onsite and was caused by a misconfiguration of an NFS filer.

The proposed patch touches a very delicate area of the code, and since there appears to be no real urgency, I'm delaying it to 4.0.
If this becomes urgent again, we can always reschedule.

Comment 45 Allon Mureinik 2016-01-13 13:52:40 UTC
Liron, the patch is merged for ovirt's master (the would-be 4.0).
Is there anything else we're waiting for, or can this be moved to MODIFIED?

Comment 46 Liron Aravot 2016-01-13 13:57:49 UTC
Moved to MODIFIED.

Comment 47 Elad 2016-04-26 12:20:00 UTC
Hi Nir, Liron,

Using slowfs [1], I simulated slow file deletion (unlink = 30).
I removed 10 images, each with several volumes (snapshots, ~5 each).
Several deleteImage tasks were opened and executed simultaneously by the SPM:


[root@purple-vds1 images]# vdsClient -s 0 getAllTasks
0699484c-dd65-46a7-8a47-b47d180682ec :
         verb = deleteImage
         code = 0
         state = running
         tag = spm
         result = 
         message = running job 1 of 1
         id = 0699484c-dd65-46a7-8a47-b47d180682ec
248170d2-ee01-4b35-bf2c-914fe5730994 :
         verb = deleteImage
         code = 0
         state = running
         tag = spm
         result = 
         message = running job 1 of 1
         id = 248170d2-ee01-4b35-bf2c-914fe5730994
d352c6ca-a1ea-4a60-a3df-84c6f6e34b90 :
         verb = deleteImage
         code = 100
         state = recovered
         tag = spm
         result = 
         message = Connection timed out
         id = d352c6ca-a1ea-4a60-a3df-84c6f6e34b90
cf56ab7d-772c-4994-b463-a2068501885b :
         verb = deleteImage
         code = 0
         state = running
         tag = spm
         result = 
         message = running job 1 of 1
         id = cf56ab7d-772c-4994-b463-a2068501885b
3f1233fc-8db5-4c94-ad75-8edfed545f08 :
         verb = deleteImage
         code = 0
         state = finished
         tag = spm
         result = 
         message = 1 jobs completed successfully
         id = 3f1233fc-8db5-4c94-ad75-8edfed545f08


Thread-2133::INFO::2016-04-26 15:10:43,243::logUtils::52::dispatcher::(wrapper) Run and protect: getAllTasks, Return response: {'tasks': {'cf56ab7d-772c-4994-b463-a2068501885b': {'code': 0, 'tag': u'spm', 'state':
 'running', 'verb': 'deleteImage', 'result': '', 'message': 'running job 1 of 1', 'id': 'cf56ab7d-772c-4994-b463-a2068501885b'}, 'd352c6ca-a1ea-4a60-a3df-84c6f6e34b90': {'code': 100, 'tag': u'spm', 'state': 'recov
ered', 'verb': 'deleteImage', 'result': '', 'message': u'Connection timed out', 'id': 'd352c6ca-a1ea-4a60-a3df-84c6f6e34b90'}, '248170d2-ee01-4b35-bf2c-914fe5730994': {'code': 0, 'tag': u'spm', 'state': 'running',
 'verb': 'deleteImage', 'result': '', 'message': 'running job 1 of 1', 'id': '248170d2-ee01-4b35-bf2c-914fe5730994'}, '0699484c-dd65-46a7-8a47-b47d180682ec': {'code': 0, 'tag': u'spm', 'state': 'running', 'verb': 
'deleteImage', 'result': '', 'message': 'running job 1 of 1', 'id': '0699484c-dd65-46a7-8a47-b47d180682ec'}, '3f1233fc-8db5-4c94-ad75-8edfed545f08': {'code': 0, 'tag': u'spm', 'state': 'finished', 'verb': 'deleteI
mage', 'result': '', 'message': '1 jobs completed successfully', 'id': '3f1233fc-8db5-4c94-ad75-8edfed545f08'}}}



[1] https://github.com/nirs/slowfs/blob/master/README.md



Is this the desired behaviour? Can I verify?

Comment 48 Liron Aravot 2016-04-26 15:51:55 UTC
Elad,
The patch in this bug reduces the execution time of the deleteImage() verb without harming its functionality.
To verify it properly, I'd suggest making the deletion of files very slow (either by manipulating the vdsm code or by using slowfs) and attempting to reuse the disk id (for example, by importing it again) before the deletion has ended.

That way you can verify that no regression was introduced, and that as soon as the synchronous part of the deletion ends, the disk is considered deleted and its id is free to be used.

thanks,
Liron.

Comment 49 Elad 2016-04-28 09:57:53 UTC
Created attachment 1151791 [details]
logs-28.4

Liron, I tried your suggestion in comment #48:
Created an NFS domain based on slowfs, created a VM with a disk residing on the slowfs domain, exported the VM to an export domain, changed the deletion delay on the server to 10 sec (unlink = 10), removed the VM with its attached disk, and immediately tried to import the VM with the disk (the same image ID).

This is what I got:

2016-04-28 12:44:01,823 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (default task-14) [338bff11] ERROR, GetImagesListVDSCommand( GetImagesListVDSCommandParameters:{runAsync='true', storagePoolId='7de10d80-b113-4f60-8f7f-e70f6476432b', ignoreFailoverLimit='false', sdUUID='fb97cea4-5bf2-48fa-9ceb-b8b2e109acd4'}), exception: For input string: "_remove_me_a9187951", log id: 6238e26f
2016-04-28 12:44:01,823 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (default task-14) [338bff11] Exception: java.lang.NumberFormatException: For input string: "_remove_me_a9187951"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) [rt.jar:1.8.0_71]
        at java.lang.Long.parseLong(Long.java:589) [rt.jar:1.8.0_71]
        at java.lang.Long.valueOf(Long.java:776) [rt.jar:1.8.0_71]
        at java.lang.Long.decode(Long.java:928) [rt.jar:1.8.0_71]
        at java.util.UUID.fromString(UUID.java:198) [rt.jar:1.8.0_71]
        at org.ovirt.engine.core.compat.Guid.<init>(Guid.java:73) [compat.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.GetImagesListVDSCommand.executeIrsBrokerCommand(GetImagesListVDSCommand.java:23) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand.executeVDSCommand(IrsBrokerCommand.java:159) [vdsbroker.jar:]


============================

Webadmin:


Operation Canceled

Error while executing action: 

slow:
General command validation failure.

============================

This might be a separate bug in engine.
Full logs attached.

Comment 50 Liron Aravot 2016-04-28 10:03:34 UTC
Elad, that's a separate bug (3.6.7). Let's open it.

thanks.

Comment 51 Elad 2016-04-28 10:27:20 UTC
Thanks Liron

Comment 52 Elad 2016-04-28 10:45:47 UTC
As you suggested offline, I created a template and copied its disk to the slowfs domain. I then deleted the disk while a 10-second deletion delay was configured and tried to copy the disk to the slowfs domain again. The second image copy was initiated successfully while the old one was still being deleted, and it finished successfully.

The same image exists twice on the domain (_remove_me_e83cb973-cc7b-4f03-94ce-8c4f21a9b683 and e83cb973-cc7b-4f03-94ce-8c4f21a9b683):

Every 2.0s: ls -l    Thu Apr 28 13:38:17 2016

total 20
drwxr-xr-x. 2 root root 4096 Apr 28 13:01 9a813945-0af6-4228-bee3-8621ec4e351d
drwxr-xr-x. 2 root root 4096 Apr 28 13:01 b2df65ac-bfe4-4a56-a142-21f21ccb0023
drwxr-xr-x. 2 root root 4096 Apr 28 13:37 e83cb973-cc7b-4f03-94ce-8c4f21a9b683
drwxr-xr-x. 2 root root 4096 Apr 28 13:29 ef31097e-c317-4d58-b539-fcf82493e2d7
drwxr-xr-x. 2 root root 4096 Apr 28 13:38 _remove_me_e83cb973-cc7b-4f03-94ce-8c4f21a9b683


After the deletion of _remove_me_e83cb973-cc7b-4f03-94ce-8c4f21a9b683 finished, only the new image exists (e83cb973-cc7b-4f03-94ce-8c4f21a9b683):

Every 2.0s: ls -l    Thu Apr 28 13:41:52 2016

total 16
drwxr-xr-x. 2 root root 4096 Apr 28 13:01 9a813945-0af6-4228-bee3-8621ec4e351d
drwxr-xr-x. 2 root root 4096 Apr 28 13:01 b2df65ac-bfe4-4a56-a142-21f21ccb0023
drwxr-xr-x. 2 root root 4096 Apr 28 13:38 e83cb973-cc7b-4f03-94ce-8c4f21a9b683
drwxr-xr-x. 2 root root 4096 Apr 28 13:29 ef31097e-c317-4d58-b539-fcf82493e2d7



2016-04-28 13:37:59,313 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CopyImageVDSCommand] (org.ovirt.thread.pool-8-thread-7) [16d26c3f] START, CopyImageVDSCommand( CopyImageVDSCommandParameters:{runAsync='true', storagePoolId='7de10d80-b113-4f60-8f7f-e70f6476432b', ignoreFailoverLimit='false', storageDomainId='3843ad2f-447d-4911-bbc4-20a0ca0fb96f', imageGroupId='e83cb973-cc7b-4f03-94ce-8c4f21a9b683', imageId='716605d9-8cd6-4b57-8bd7-d7f0908714d6', dstImageGroupId='e83cb973-cc7b-4f03-94ce-8c4f21a9b683', vmId='00000000-0000-0000-0000-000000000000', dstImageId='716605d9-8cd6-4b57-8bd7-d7f0908714d6', imageDescription='', dstStorageDomainId='fb97cea4-5bf2-48fa-9ceb-b8b2e109acd4', copyVolumeType='SharedVol', volumeFormat='RAW', preallocate='Sparse', postZero='false', force='false'}), log id: 6ad0d97e



jsonrpc.Executor/3::INFO::2016-04-28 13:37:59,324::logUtils::49::dispatcher::(wrapper) Run and protect: copyImage(sdUUID=u'3843ad2f-447d-4911-bbc4-20a0ca0fb96f', spUUID=u'7de10d80-b113-4f60-8f7f-e70f6476432b', vmUUID='', srcImgUUID=u'e83cb973-cc7b-4f03-94ce-8c4f21a9b683', srcVolUUID=u'716605d9-8cd6-4b57-8bd7-d7f0908714d6', dstImgUUID=u'e83cb973-cc7b-4f03-94ce-8c4f21a9b683', dstVolUUID=u'716605d9-8cd6-4b57-8bd7-d7f0908714d6', description=u'', dstSdUUID=u'fb97cea4-5bf2-48fa-9ceb-b8b2e109acd4', volType=6, volFormat=5, preallocate=2, postZero=u'false', force=u'false')



2016-04-28 13:38:30,043 INFO  [org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupCommand] (org.ovirt.thread.pool-8-thread-16) [16d26c3f] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupCommand' successfully.



Used:
ovirt-engine-4.0.0-0.0.master.20160406161747.gita4ecba2.el7.centos.noarch
vdsm-4.17.999-724.gitb8cb30a.el7.centos.noarch

Comment 53 Tim Speetjens 2016-05-09 12:03:06 UTC
I don't see what the needinfo on me is for; clearing it.

Comment 55 errata-xmlrpc 2016-08-23 20:18:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1671.html

Comment 56 Daniel Erez 2016-08-28 06:01:18 UTC
*** Bug 1367721 has been marked as a duplicate of this bug. ***
