Bug 505548

Summary: 1921270 - gfs2 filesystem won't free up space when files are deleted
Product: Red Hat Enterprise Linux 5
Reporter: Issue Tracker <tao>
Component: kernel
Assignee: Ben Marzinski <bmarzins>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Docs Contact:
Priority: medium
Version: 5.3
CC: adas, adrew, cmarcant, cward, czhang, dejohnso, dmair, dursone, dzickus, jcapel, jongomersall, liko, rpeterso, rwheeler, swhiteho, syeghiay, tao, ymansuri
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-09-02 08:40:01 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 514700    
Attachments:
Description              Flags
First cut at a patch.    none
cleaned up patch         none

Description Issue Tracker 2009-06-12 11:51:57 UTC
Escalated to Bugzilla from IssueTracker

Comment 1 Issue Tracker 2009-06-12 11:51:59 UTC
Event posted on 05-27-2009 04:44pm EDT by bboley

We have a gfs2 filesystem, and when files are deleted in it, the space used by the file isn't freed, so the filesystem just gets fuller and fuller as time goes on. The space usage reported by df and "gfs2_tool df" is the same. I've used lsof to see if processes are holding open deleted files, but they aren't.

root@pocrac1> du -sm /proj/archive
30192 /proj/archive
root@pocrac1> df -m /proj/archive
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/mapper/gfsvg-gfslv
255653 176650 79003 70% /proj/archive
root@pocrac1> gfs2_tool df
/proj/archive:
SB lock proto = "lock_dlm"
SB lock table = "poccluster:firstgfs"
SB ondisk format = 1801
SB multihost format = 1900
Block size = 4096
Journals = 2
Resource Groups = 999
Mounted lock proto = "lock_dlm"
Mounted lock table = "poccluster:firstgfs"
Mounted host data = "jid=0:id=196609:first=0"
Journal number = 0
Lock module flags = 0
Local flocks = FALSE
Local caching = FALSE

Type Total Used Free use%
------------------------------------------------------------------------
data 65447072 45351691 20095381 69%
inodes 20095812 431 20095381 0%
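
For reference, a check along the lines of the lsof test mentioned above (the exact command used is not recorded, so this is only a sketch; the mount point is taken from the output above) could be:

# List open files on the filesystem whose link count is zero, i.e. files
# that have been deleted but are still held open by a process. An empty
# result means no process is pinning the deleted space.
lsof +aL1 /proj/archive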
This event sent from IssueTracker by dejohnso  [Support Engineering Group]
 issue 301103

Comment 2 Issue Tracker 2009-06-12 11:52:01 UTC
Event posted on 06-11-2009 03:15pm EDT by dejohnso

Talked to Bob in development.  Can you give me the exact steps to reproduce
this and then I will BZ it.

This is what Bob says about removing files in gfs2.  NOTE: A reclaim is never needed.

Here is what's supposed to happen: If one node has a file open and it
gets deleted, its blocks should get a status of "unlinked metadata". 
The unlinked metadata should be automatically reused by the gfs2 kernel
code, unless the file is open on a node, etc.

So the unlinked blocks should get reused automatically; no need for a reclaim like there was on gfs.

<Deb> bob: but if they are unlinked, shouldn't df show that free space (if the file is not open on another node)?
<bob> It depends.  It may take the kernel code a while to clean it up and
reuse it.
<Deb> so df should be ignored?  What method can be used then to tell if
the file system is full?
<bob> They should just be able to use the system df command

<bob> When the file is closed, the metadata should be freed, after the
journaling happens.  If it's not, that's a bug.



Internal Status set to 'Waiting on Support'

This event sent from IssueTracker by dejohnso  [Support Engineering Group]
 issue 301103

Comment 3 Issue Tracker 2009-06-12 11:52:03 UTC
Event posted on 06-11-2009 05:36pm EDT by cmarcant

What I did to reproduce:

- create a 2 node 64 bit RHEL 5.3 cluster
- create a 4.5G logical volume on top of a clustered VG
- create a GFS2 filesystem on this logical volume with two journals
- mount this GFS2 filesystem on both nodes and cd into the mount point
- on node 1 run "gfs2_tool df" and also "df" and note the (expected)
low usage
- on node 1 run "dd if=/dev/zero of=bigfile bs=1024 count=30000000" to
create a 3G file
- on node 2 run "ll" inside this mount point (I actually did it once
while the file was being created and then once when it was finished)
- on node 1 and/or node 2 run "gfs2_tool df" and "df" again and note
70% usage
- on node 1 run "rm bigfile"

From this point on, "gfs2_tool df" and regular "df" (run from either
node) continue to show 70% usage, even though "ll" on either node shows
the file is no longer present.
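
The steps above condense into a rough shell sketch (not the exact commands run); node roles are marked in comments, and the device path and mount point are placeholders. The original test used a 4.5G LV, so a ~3G file is written here with bs=1M count=3000:

#!/bin/bash
# Rough reproduction sketch; run the marked commands on the indicated node.
DEV=/dev/clustervg/gfs2lv   # placeholder clustered LV
MNT=/mnt/gfs2               # placeholder mount point

# node 1 and node 2: mount the shared GFS2 filesystem
mount -t gfs2 "$DEV" "$MNT"

# node 1: note the (expected) low starting usage
df -m "$MNT"; gfs2_tool df "$MNT"

# node 1: create a ~3G file
dd if=/dev/zero of="$MNT/bigfile" bs=1M count=3000

# node 2: list the directory so this node caches the inode
ls -l "$MNT"

# node 1: delete the file, then re-check usage from either node
rm "$MNT/bigfile"
df -m "$MNT"; gfs2_tool df "$MNT"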

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by dejohnso  [Support Engineering Group]
 issue 301103

Comment 4 Debbie Johnson 2009-06-12 11:54:48 UTC
As you can see from one of the comments in the BZ, Bob is aware this is being BZed.
I could not find the gfs2 component so I used filesystem.  If this is not correct, please correct it and let me know what I should be using.  Thanks,

Debbie

Comment 5 RHEL Program Management 2009-06-12 12:09:45 UTC
This request was evaluated by Red Hat Product Management for
inclusion, but this component is not scheduled to be updated in
the current Red Hat Enterprise Linux release. If you would like
this request to be reviewed for the next minor release, ask your
support representative to set the next rhel-x.y flag to "?".

Comment 6 Steve Whitehouse 2009-06-12 14:08:15 UTC
We are collecting a number of similar bugs. It might be that we are not freeing things up, but it might also be an issue with statfs. We are already looking into this and will update as soon as we know.

Comment 7 Issue Tracker 2009-06-12 14:54:56 UTC
Event posted on 06-12-2009 10:54am EDT by cmarcant

By the way, one other interesting piece of information I was able to
collect. I got a system into this situation as previously described.  I
then wrote a bash script to monitor the "gfs2_tool df" output once a
minute and report back if/when the value ever changed (I can attach the
script if it's of interest to anyone).  The value stayed at 70% usage for
roughly 9 hours, and then all the previously unreclaimed space appeared
to be freed spontaneously in the course of a single minute: the final run
(after 9 hours) reported 7% usage.

Not sure if this is useful information or not, but it *does* appear that
this will clear itself up eventually.  Mind you, this was on a completely
empty gfs2 filesystem with no load whatsoever, so it's also possible that
this behavior changes while the filesystem is under real use.  I'm
currently re-running my test to see if I can get the same results again.
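
A minimal monitor along these lines (mount point and one-minute interval are assumptions, not the reporter's actual script) would do the same job:

#!/bin/bash
# Poll "gfs2_tool df" once a minute and print a timestamped line whenever
# the reported data use% changes.
MNT=/mnt/gfs2   # placeholder mount point
last=""
while true; do
    cur=$(gfs2_tool df "$MNT" | awk '/^ *data/ {print $5}')
    if [ "$cur" != "$last" ]; then
        echo "$(date): data use% is now $cur"
        last="$cur"
    fi
    sleep 60
done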


This event sent from IssueTracker by cmarcant 
 issue 301103

Comment 10 Chris Marcantonio 2009-06-30 21:03:33 UTC
One other interesting piece of info probably worth passing along in this BZ...  I originally started looking into this and tried to see if it could be attributed to the statfs_fast stuff in gfs2.  It didn't immediately seem to fit what I would expect to see, in that even the node you perform the delete from doesn't see the space that should be freed locally (so it didn't seem to be an issue where the node's local cache simply wasn't being written back to the cluster).  Nonetheless, I had my customers try turning on statfs_slow to see how things behaved.  We then hit the behavior described here:

https://bugzilla.redhat.com/show_bug.cgi?id=505171

This seems to be pretty easily reproducible, since I've had 2-3 customers run into the same thing, and we were able to reproduce here locally too.

So, I'm not sure that the above is particularly relevant to this bug, except that turning on statfs_slow to try a different angle introduces its own problems and isn't really a usable direction at the moment.
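
For reference, the statfs_slow tunable mentioned above is toggled per mount point with gfs2_tool settune; a sketch (mount point taken from the earlier output) is:

# Switch from the fast, locally cached statfs accounting to the slow,
# cluster-wide accounting, and back again.
gfs2_tool settune /proj/archive statfs_slow 1
gfs2_tool settune /proj/archive statfs_slow 0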

Comment 12 Ben Marzinski 2009-07-08 23:48:45 UTC
It appears that what's happening is this:

In gfs2_delete_inode(), gfs2 tries to drop the unlinked inode's iopen lock and reacquire it EXCLUSIVE with the LM_FLAG_TRY_1CB flag, in order to deallocate the file. This fails. After that it waits for hours for the file to be deallocated.

Comment 15 Jon Gomersall 2009-07-13 10:58:23 UTC
We have a 6 server cluster and have had the same issue with a 1TB GFS2 filesystem becoming full.

After a complete rm of all the files, df showed that the filesystem was still 100% full...

The inodes were also approx 82% full. Each of the servers then had the mount point unmounted; one of the servers took approx 7 minutes to release the mount. After this the inode usage on the filesystem had changed to approx 2% used...

Running a gfs2_fsck came up with the following messages. 

Ondisk and fsck bitmaps differ at block 137 (0x89) 
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)

Ondisk and fsck bitmaps differ at block 139 (0x89) 
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)

The number of differing blocks kept counting up during the fsck.

This ran for approximately 80 hours. We had to stop it because the project needed to use the filesystem...

A recreate of the filesystem was needed to allow work to proceed.
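
For reference, a check like the one described has to run with the filesystem unmounted on every node; the invocation is roughly as follows (the device path is a placeholder):

# fsck.gfs2 (a.k.a. gfs2_fsck) must only be run with the filesystem
# unmounted cluster-wide; -y answers yes to all repair prompts.
fsck.gfs2 -y /dev/clustervg/gfs2lv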

Comment 16 Ben Marzinski 2009-07-14 07:19:34 UTC
The issue is that as long as one node still has an inode object around for the file, it can't drop its iopen holder.  Until the iopen lock is dropped by the last node, the space will not be freed.

Unfortunately, the inode gets freed based on memory pressure, not diskspace pressure, so by creating and deleting large files, you can have an empty filesystem that is completely full.

Right now, I'm trying to find some way for GFS2 to communicate to the vfs layer on another node when a file is deleted on one node, so that the other node can flush the inode from its cache. The other possible solution would be for the iopen holder to only exist while the file is open, instead of while the inode exists in cache. However, this is a much bigger locking change.

Comment 17 Steve Whitehouse 2009-07-14 12:29:09 UTC
Changing the iopen lock seems likely to be prone to all kinds of issues. We already use the callback mechanism on iopen locks to indicate when the link count has hit zero, so we probably just need to ensure that we also invalidate that entry in the dcache if there are no users left, since I suspect that it's the dcache that's keeping the entry from going away. In other words, we need to expand the current flag setting on a callback to something specific to iopen locks.

I guess we might want to add a gl_ops operation for callbacks in that case. Need to check that we can grab dcache/icache locks from that context without any issues, otherwise we'd have to do it from run_queue.

You should be able to use the tracing in upstream to track the demote requests if that is useful to you.

Does that sound reasonable?

Comment 18 Ben Marzinski 2009-07-16 21:24:14 UTC
I did some work along this route, and it should work... but there's a catch. If we free the last inode reference in glock_work_func(), then we end up calling gfs2_delete_inode() from within glock_work_func().  This needs to acquire two exclusive locks, which themselves will require getting callbacks and running glock_work_func().  This doesn't work. Even if we could block in glock_work_func() and allow other glock_workqueue processes to handle these callbacks, what happens if the other glock_workqueue process is also waiting in gfs2_delete_inode()? No matter how many glock_workqueue processes we have, they could all be stuck in gfs2_delete_inode, and so none of them could handle the callbacks to acquire the locks. We also must finish freeing up the inode by the time we return from gfs2_delete_inode(), so we can't easily push out the work of deleting the data until later.

It seems like the most reasonable solution is to not free up the dcache in the
workqueue, but instead shunt it off to a different thread that just does this.
I'm not very thrilled with that solution, so if anyone has another way that this could work, I'd love to hear it.  But I'm starting work on this idea now.

Comment 19 Steve Whitehouse 2009-07-17 15:31:53 UTC
Yes, I think a different thread will be required... maybe we can use an existing abstraction, though? Perhaps another use for slow-work? We need to ensure that there will be no interactions with the recovery code, though. Alternatively, quotad might be usable, as we already use it for dealing with pending truncates for similar reasons.

Comment 20 Ben Marzinski 2009-07-17 22:26:06 UTC
Created attachment 354232 [details]
First cut at a patch.

This patch is pretty ugly, but so far it seems to work correctly.  I'm going to keep testing and cleaning it up.

Comment 21 Steve Whitehouse 2009-07-18 22:01:34 UTC
The patch doesn't look too bad... could you move the trigger for the delete workqueue out from run_queue and put it in the callback code perhaps? Maybe add a callback entry to the glops structure so that we can do type specific call backs?

That way we'd move that out of the common code and into iopen only code.

Comment 22 Yunus 2009-07-20 07:44:04 UTC
When will the patch be ready for this issue? I have a production gfs2 cluster with the same issue, running RHEL5 Update 3, kernel 2.6.18-128.1.14.el5.

Comment 26 RHEL Program Management 2009-07-21 12:43:43 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 28 Ben Marzinski 2009-07-22 22:04:07 UTC
Created attachment 354793 [details]
cleaned up patch

This patch is like the previous one, but it removes the debug printouts, moves some of the logic around, and fixes a bug where unmounting a filesystem while there was still work on the gfs2_delete_workqueue caused a withdraw.

Comment 29 Ben Marzinski 2009-07-22 22:57:19 UTC
POSTed.

Comment 34 Don Zickus 2009-07-28 20:13:31 UTC
in kernel-2.6.18-160.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
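
A typical way to install such a test kernel, assuming the x86_64 build (the exact package file name will differ), is:

# Install the test kernel alongside the existing one (use -i, not -U, so
# the current working kernel is kept), then reboot into it.
rpm -ivh kernel-2.6.18-160.el5.x86_64.rpm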

Comment 36 Caspar Zhang 2009-08-03 07:10:49 UTC
verified that the patch to this bug is included in kernel-2.6.18-160.el5 with
patch #24367

Comment 40 errata-xmlrpc 2009-09-02 08:40:01 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

Comment 44 David 2010-12-10 23:12:07 UTC
Has this bug been reintroduced? I am seeing this exact issue on 2.6.18-194.26.1.el5 kernel

Comment 45 Adam Drew 2011-01-05 15:31:12 UTC
(In reply to comment #44)
> Has this bug been reintroduced? I am seeing this exact issue on
> 2.6.18-194.26.1.el5 kernel

David, a very similar but new issue has recently been found:

https://bugzilla.redhat.com/show_bug.cgi?id=666080