Bug 298931

Summary: on-disk unlinked inode meta-data leak
Product: Red Hat Enterprise Linux 5
Version: 5.0
Component: gfs-kmod
Reporter: Josef Bacik <jbacik>
Assignee: Abhijith Das <adas>
QA Contact: Cluster QE <mspqa-list>
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Hardware: All
OS: Linux
Doc Type: Bug Fix
CC: djansa, edamato, johnson.eric, juanino, kanderso, rpeterso, tao, teigland
Last Closed: 2008-09-17 17:06:17 UTC
Bug Depends On: 218576

Attachments:
- Have scand reclaim metadata from X% of rgrps at a time
- New and improved
- Simple and improved - tunable is now absolute number

Comment 3 RHEL Program Management 2008-06-04 22:48:41 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Abhijith Das 2008-08-25 17:15:11 UTC
Created attachment 314932 [details]
Have scand reclaim metadata from X% of rgrps at a time

This patch does not follow the same idea Wendy suggested in an earlier comment. Instead, it tries to free metadata from a subset of the total rgrps, limited by a tunable.

This might not be as efficient as keeping an extra list of rgrps that were deleted from; I haven't run any performance numbers.

The tunable rgrp_free_perc is a percentage of the total number of rgrps (sdp->sd_rgcount). With the default value of 5% (chosen arbitrarily), unused metadata will be freed from at most 5% of the total rgrps in each pass. Let me know which would be preferable: a) a percentage of the total rgrps (as is), or b) an absolute value for the maximum number of rgrps to free metadata from.
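To illustrate, here is a minimal user-space sketch of how the per-cycle cap could be derived from such a percentage tunable. Only sd_rgcount and rgrp_free_perc come from the description above; the stub struct and helper names are hypothetical, not the actual gfs-kmod code.

#include <stdio.h>

/* Hypothetical stand-in for the superblock fields mentioned above;
 * not the real gfs-kmod structures. */
struct gfs_sbd_stub {
        unsigned int sd_rgcount;      /* total number of rgrps in the fs */
        unsigned int rgrp_free_perc;  /* tunable: % of rgrps to reclaim from per cycle */
};

/* Cap on how many rgrps one reclaim pass may touch. */
static unsigned int reclaim_rgrp_limit(const struct gfs_sbd_stub *sdp)
{
        unsigned int limit = sdp->sd_rgcount * sdp->rgrp_free_perc / 100;
        return limit ? limit : 1;  /* always reclaim from at least one rgrp */
}

int main(void)
{
        struct gfs_sbd_stub sdp = { .sd_rgcount = 1200, .rgrp_free_perc = 5 };
        printf("rgrps per reclaim cycle: %u\n", reclaim_rgrp_limit(&sdp));
        return 0;
}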

Bob/Steve, please let me know your thoughts on this. If you think this won't work or will be inefficient, I'll fall back to the list idea.

Comment 5 Abhijith Das 2008-09-10 21:52:14 UTC
Created attachment 316363 [details]
New and improved

Added a flag RD_FL_META2FREE to struct rgrpd that is set when there is freeable metadata in the rgrp and cleared when we reclaim from it.
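A rough user-space sketch of the flag's intended life cycle follows; apart from the RD_FL_META2FREE name, the struct, fields, and functions are hypothetical simplifications, not the kernel code.

#include <stdbool.h>
#include <stdio.h>

#define RD_FL_META2FREE 0x01  /* the new flag described above (value hypothetical) */

/* Hypothetical stand-in for struct gfs_rgrpd. */
struct rgrpd_stub {
        unsigned int rd_flags;
        unsigned int freeable_meta_blocks;
};

/* An unlink leaves reclaimable metadata behind: remember it and set the flag. */
static void note_freeable_meta(struct rgrpd_stub *rgd, unsigned int blocks)
{
        rgd->freeable_meta_blocks += blocks;
        rgd->rd_flags |= RD_FL_META2FREE;
}

/* Reclaim pass: only touch flagged rgrps, then clear the flag. */
static bool reclaim_if_flagged(struct rgrpd_stub *rgd)
{
        if (!(rgd->rd_flags & RD_FL_META2FREE))
                return false;
        rgd->freeable_meta_blocks = 0;   /* pretend the metadata was freed */
        rgd->rd_flags &= ~RD_FL_META2FREE;
        return true;
}

int main(void)
{
        struct rgrpd_stub rgd = { 0, 0 };
        note_freeable_meta(&rgd, 8);
        printf("reclaimed: %s\n", reclaim_if_flagged(&rgd) ? "yes" : "no");
        return 0;
}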

Comment 6 Abhijith Das 2008-09-12 21:19:24 UTC
Created attachment 316624 [details]
Simple and improved - tunable is now absolute number

The percentage-of-total-rgrps tunable is now an absolute value: it specifies the maximum number of rgrps to free metadata from in each gfs_inoded cycle.
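Continuing the hypothetical stubs from the sketch in comment 5, one reclaim cycle limited to an absolute number of rgrps might look roughly like this (an illustration, not the actual patch):

/* One gfs_inoded-style cycle: visit flagged rgrps, but stop after
 * max_rgrps of them, per the new absolute tunable. */
static unsigned int reclaim_cycle(struct rgrpd_stub *rgrps, unsigned int count,
                                  unsigned int max_rgrps)
{
        unsigned int done = 0;
        for (unsigned int i = 0; i < count && done < max_rgrps; i++)
                if (reclaim_if_flagged(&rgrps[i]))
                        done++;
        return done;  /* number of rgrps actually reclaimed from */
}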

Comment 7 Abhijith Das 2008-09-12 21:48:06 UTC
Pushed the above patch to RHEL5, STABLE2, and master.

Comment 8 Abhijith Das 2008-09-16 14:23:17 UTC
The patch above locks the entire filesystem each time it needs to free metadata from an rgrp. I ran a simple test case to see what the impact is on performance.

The test is run and timed simultaneously on all cluster nodes, with each node operating in a separate directory. It creates 10 subdirectories, fills each subdirectory with 10000 1MB files, and removes each subdirectory as soon as its 10000th file has been written.
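The exact test script was not attached; a hypothetical reconstruction of the per-node workload (directory and file names are made up for illustration) would be roughly:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define NDIRS  10
#define NFILES 10000
#define FSIZE  (1024 * 1024)   /* 1MB per file */

int main(void)
{
        static char buf[FSIZE];  /* zero-filled data written to each file */
        char path[64];

        for (int d = 0; d < NDIRS; d++) {
                snprintf(path, sizeof(path), "dir%02d", d);
                if (mkdir(path, 0755) && errno != EEXIST) { perror(path); exit(1); }

                for (int f = 0; f < NFILES; f++) {
                        snprintf(path, sizeof(path), "dir%02d/file%05d", d, f);
                        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
                        if (fd < 0 || write(fd, buf, FSIZE) != FSIZE) { perror(path); exit(1); }
                        close(fd);
                }

                /* Remove the subdirectory as soon as its 10000th file is written,
                 * leaving unlinked metadata behind for reclaim to find. */
                for (int f = 0; f < NFILES; f++) {
                        snprintf(path, sizeof(path), "dir%02d/file%05d", d, f);
                        unlink(path);
                }
                snprintf(path, sizeof(path), "dir%02d", d);
                rmdir(path);
        }
        return 0;
}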

Let me know your thoughts about this, i.e., whether or not I should back out the patch.

GFS without the reclaim patch
-----------------------------

real    113m4.499s
user    1m22.279s
sys     19m54.774s
[root@camel ~]#

real    126m18.504s
user    1m22.707s
sys     19m52.207s
[root@merit ~]#

real    127m0.553s
user    1m22.073s
sys     19m48.148s
[root@winston ~]#

real    110m1.599s
user    0m54.464s
sys     20m33.615s
[root@salem ~]#

GFS with the reclaim patch (tunable default max rgrps to free from = 5)
-----------------------------------------------------------------------

real    123m20.349s
user    1m21.887s
sys     20m1.453s
[root@camel ~]#

real    133m12.838s
user    1m22.173s
sys     19m44.736s
[root@merit ~]#

real    132m35.304s
user    1m22.271s
sys     20m5.701s
[root@winston ~]#

real    111m58.369s
user    0m54.438s
sys     20m45.796s
[root@salem ~]#

Comment 9 Abhijith Das 2008-09-16 20:45:43 UTC
I've got a few more numbers, and unfortunately they don't look pretty for this patch :(. I added a timer around gfs_reclaim_metadata(), which reports the number of rgrps from which metadata was freed during that cycle and how long the call took in seconds.
The test is the same as before, except that, to save time, I copied zero-length files instead of 1MB files. This cuts the test run time to about 10 minutes.
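The instrumentation in the actual patch lives in the kernel; as a rough user-space illustration of what was measured (function and variable names hypothetical), it amounts to:

#include <stdio.h>
#include <sys/time.h>

/* Stand-in for the real reclaim pass; returns how many rgrps it freed from. */
static unsigned int gfs_reclaim_metadata_stub(void)
{
        return 2;
}

int main(void)
{
        struct timeval start, end;

        gettimeofday(&start, NULL);
        unsigned int freed = gfs_reclaim_metadata_stub();
        gettimeofday(&end, NULL);

        double secs = (end.tv_sec - start.tv_sec) +
                      (end.tv_usec - start.tv_usec) / 1e6;
        printf("Freed %u: time: %f s\n", freed, secs);  /* matches the log lines below */
        return 0;
}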

camel:
real    10m37.203s
user    0m57.045s
sys     3m4.040s

Freed 1: time: 0.908551 s
Freed 2: time: 2.255598 s
Freed 2: time: 7.011848 s
Freed 2: time: 4.146901 s
Freed 2: time: 4.406757 s
Freed 2: time: 1.308091 s
Freed 2: time: 2.713996 s
Freed 2: time: 0.632563 s
Freed 2: time: 1.445605 s
Freed 2: time: 1.444344 s

merit:
real    9m5.089s
user    0m57.568s
sys     3m5.484s

Freed 1: time: 0.420229 s
Freed 2: time: 3.806471 s
Freed 2: time: 4.327761 s
Freed 2: time: 1.462549 s
Freed 2: time: 5.276235 s
Freed 2: time: 2.105195 s
Freed 2: time: 0.948678 s
Freed 2: time: 0.816133 s
Freed 1: time: 1.596697 s
Freed 2: time: 3.547658 s
 
winston:
real    9m13.812s
user    0m58.030s
sys     3m8.511s

Freed 1: time: 27.362559 s
Freed 1: time: 1.007744 s
Freed 2: time: 1.057754 s
Freed 2: time: 3.942767 s
Freed 2: time: 1.332333 s
Freed 2: time: 5.283363 s
Freed 2: time: 3.798999 s
Freed 1: time: 15.359068 s
Freed 1: time: 5.588834 s
Freed 2: time: 0.879379 s
Freed 2: time: 4.959445 s
Freed 2: time: 5.476087 s

salem:
real    9m4.354s
user    0m38.753s
sys     2m29.936s

Freed 1: time: 0.606883 s
Freed 2: time: 3.385412 s
Freed 2: time: 4.650354 s
Freed 3: time: 1.552037 s
Freed 2: time: 4.872653 s
Freed 3: time: 5.114137 s
Freed 1: time: 0.392321 s
Freed 2: time: 1.153888 s
Freed 2: time: 8.842174 s
Freed 2: time: 1.612880 s
Freed 1: time: 8.140755 s

Comment 10 Abhijith Das 2008-09-17 17:06:17 UTC
Automatic reclaim of unlinked metadata presents greater cost than benefit. The
reclaim operation locks the filesystem while it attempts to free the unused
metadata, which hurts filesystem performance and certainly doesn't scale well.

The sensible way to perform reclaim is to execute "gfs_tool reclaim
<mountpoint>" as a routine GFS administration activity when the filesystem is
relatively idle.

Comment 11 Jerry Uanino 2008-09-17 23:05:20 UTC
Is this fixed in GFS2?

Comment 12 Abhijith Das 2008-09-18 12:53:21 UTC
GFS2 handles the reclaiming of unlinked inode metadata differently and, in my opinion, doesn't have this problem. Steve Whitehouse can probably add more.