Bug 505548
Summary: 1921270 - gfs2 filesystem won't free up space when files are deleted

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel |
| Version | 5.3 |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | Issue Tracker <tao> |
| Assignee | Ben Marzinski <bmarzins> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | adas, adrew, cmarcant, cward, czhang, dejohnso, dmair, dursone, dzickus, jcapel, jongomersall, liko, rpeterso, rwheeler, swhiteho, syeghiay, tao, ymansuri |
| Target Milestone | rc |
| Hardware | All |
| OS | Linux |
| Doc Type | Bug Fix |
| Last Closed | 2009-09-02 08:40:01 UTC |
| Bug Blocks | 514700 |
Description (posted by Issue Tracker, 2009-06-12 11:51:57 UTC)
Event posted on 05-27-2009 04:44 PM EDT by bboley:

We have a gfs2 filesystem, and when files are deleted in it, the space used by the files isn't freed, so the filesystem just gets fuller and fuller as time goes on. The space usage reported by df and "gfs2_tool df" is the same. I've used lsof to check whether processes are holding open deleted files, but they aren't.

    root@pocrac1> du -sm /proj/archive
    30192   /proj/archive
    root@pocrac1> df -m /proj/archive
    Filesystem               1M-blocks    Used  Available  Use%  Mounted on
    /dev/mapper/gfsvg-gfslv     255653  176650      79003   70%  /proj/archive
    root@pocrac1> gfs2_tool df
    /proj/archive:
      SB lock proto = "lock_dlm"
      SB lock table = "poccluster:firstgfs"
      SB ondisk format = 1801
      SB multihost format = 1900
      Block size = 4096
      Journals = 2
      Resource Groups = 999
      Mounted lock proto = "lock_dlm"
      Mounted lock table = "poccluster:firstgfs"
      Mounted host data = "jid=0:id=196609:first=0"
      Journal number = 0
      Lock module flags = 0
      Local flocks = FALSE
      Local caching = FALSE

      Type      Total      Used      Free      use%
      ------------------------------------------------
      data      65447072   45351691  20095381  69%
      inodes    20095812   431       20095381  0%

This event sent from IssueTracker by dejohnso [Support Engineering Group], issue 301103.

Event posted on 06-11-2009 03:15 PM EDT by dejohnso:

Talked to Bob in development. Can you give me the exact steps to reproduce this and then I will BZ it. This is what Bob says about the removing of files in gfs2:

NOTE: A reclaim is never needed. Here is what's supposed to happen: if one node has a file open and it gets deleted, its blocks should get a status of "unlinked metadata". The unlinked metadata should be automatically reused by the gfs2 kernel code, unless the file is open on a node, etc. So the unlinked blocks should get reused automatically; there is no need for a reclaim like there was on gfs.

<Deb> bob: but if they are unlinked should not the df show that free space (if the file is not open on another node)
<bob> It depends.
It may take the kernel code a while to clean it up and reuse it.
<Deb> so df should be ignored? What method can be used then to tell if the file system is full?
<bob> They should just be able to use the system df command
<bob> When the file is closed, the metadata should be freed, after the journaling happens. If it's not, that's a bug.

Internal Status set to 'Waiting on Support'.

This event sent from IssueTracker by dejohnso [Support Engineering Group], issue 301103.

Event posted on 06-11-2009 05:36 PM EDT by cmarcant:

What I did to reproduce:

- create a 2 node 64-bit RHEL 5.3 cluster
- create a 4.5G logical volume on top of a clustered VG
- create a GFS2 filesystem on this logical volume with two journals
- mount this GFS2 filesystem on both nodes and cd into the mount point
- on node 1 run "gfs2_tool df" and also "df" and note the (expected) low usage
- on node 1 run "dd if=/dev/zero of=bigfile bs=1024 count=3000000" to create a 3G file
- on node 2 run "ll" inside this mount point (I actually did it once while the file was being created and then once when it was finished)
- on node 1 and/or node 2 run "gfs2_tool df" and "df" again and note 70% usage
- on node 1 run "rm bigfile"

From this point on, "gfs2_tool df" and regular "df" (run from either node) continue to show 70% usage, even though "ll" on either node shows the file is no longer present.

Internal Status set to 'Waiting on SEG'.

This event sent from IssueTracker by dejohnso [Support Engineering Group], issue 301103.

As you can see by one of the comments in the BZ, Bob is aware this is being BZed. I could not find the gfs2 component so I used filesystem. If this is not correct, please correct it and let me know what I should be using. Thanks, Debbie

This request was evaluated by Red Hat Product Management for inclusion, but this component is not scheduled to be updated in the current Red Hat Enterprise Linux release.
If you would like this request to be reviewed for the next minor release, ask your support representative to set the next rhel-x.y flag to "?".

We are collecting a number of similar bugs. It might be that we are not freeing things up, but it might also be an issue with statfs. We are already looking into this and will update as soon as we know.

Event posted on 06-12-2009 10:54 AM EDT by cmarcant:

By the way, one other interesting piece of information I was able to collect. I basically got a system into this situation as previously described. I then wrote a bash script to monitor "gfs2_tool df" output once a minute and report back if/when the value ever changed (I can attach the script if it's of interest to anyone). The value stayed at 70% usage for basically 9 hours, and then seemed to spontaneously free all the previously unreclaimed space in the course of a minute, as the final run (after 9 hours) reported 7% usage. Not sure if this is useful information or not, but it *does* appear that this will clear itself up eventually. Mind you, this was on a completely empty gfs2 filesystem with no load whatsoever, so it's also possible that this behavior might change while the filesystem is under use. I'm currently re-running my test to see if I can get the same results again.

This event sent from IssueTracker by cmarcant, issue 301103.

One other interesting piece of info probably worth passing along in this BZ: I originally started looking into this and tried to see if it could be attributed to the statfs_fast stuff in gfs2. It didn't immediately seem to fit what I would expect to see, in that even the node you perform the delete from doesn't see the space that should be freed locally (so it didn't seem to be an issue where the node's local cache just wasn't being written back to the cluster). Nonetheless, I had my customers try turning on statfs_slow to see how things behaved.
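The once-a-minute monitoring script mentioned above was not attached to the bug; a minimal sketch of that kind of poll-and-report loop (the command, poll count, interval, and message format here are all assumptions, not the original script) might look like:

```shell
#!/bin/sh
# Hypothetical reconstruction of the monitoring loop described above; the
# original script was not attached to the bug. Polls a command and prints
# a line whenever its output changes between readings.

# watch_usage N CMD [ARGS...]: run CMD N times, reporting each change
# relative to the previous reading. Poll interval defaults to 60 seconds.
watch_usage() {
    n=$1; shift
    prev=$("$@")
    i=0
    while [ "$i" -lt "$n" ]; do
        sleep "${INTERVAL:-60}"
        cur=$("$@")
        if [ "$cur" != "$prev" ]; then
            echo "usage changed: $prev -> $cur"
            prev=$cur
        fi
        i=$((i + 1))
    done
}

# A real invocation over 9 hours might be something like:
#   watch_usage 540 sh -c 'df -m /proj/archive | tail -n 1'
```

The nine-hour run in the comment above corresponds to roughly 540 one-minute polls; the spontaneous cleanup would show up as a single "usage changed" line near the end.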
We then hit the behavior described here: https://bugzilla.redhat.com/show_bug.cgi?id=505171

This seems to be pretty easily reproducible, since I've had 2-3 customers run into the same thing, and we were able to reproduce here locally too. So I'm not sure that the above is particularly relevant to this bug, except that turning on statfs_slow to try a different angle introduces its own problems and isn't really a usable direction at the moment.

It appears that what's happening is this: in gfs2_delete_inode(), gfs2 tries to drop the unlinked inode's iopen lock and reacquire it EXCLUSIVE with a flag of LM_FLAG_TRY_1CB, in order to deallocate the file. It fails. After this it waits for hours for the file to be deallocated.

We have a 6 server cluster and have had the same issue with a 1TB GFS2 filesystem becoming full. After a complete rm of all the files, df showed the filesystem as still 100% full, and the inodes were approx 82% used. Each of the servers had the mount point unmounted; one of the servers took approx 7 minutes to release the mount. After this, inode usage on the filesystem had changed to approx 2%. Running a gfs2_fsck came up with the following messages:

    Ondisk and fsck bitmaps differ at block 137 (0x89)
    Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
    Metadata type is 0 (free)
    Ondisk and fsck bitmaps differ at block 139 (0x8b)
    Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
    Metadata type is 0 (free)

fsck kept counting up blocks like this and ran for approximately 80 hours. We had to stop it because the project needed to use the filesystem; a recreate of the filesystem was needed to allow work to proceed.

The issue is that as long as one node still has an inode object around for the file, it can't drop its iopen holder. Until the iopen lock is dropped by the last node, the space will not be freed.
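The single-node analogue of this behavior, where a file's blocks stay allocated as long as any reference to the inode remains, can be demonstrated on any POSIX filesystem. (The GFS2-specific part of the bug is that the reference is a remote node's cached inode holding the iopen glock, which nothing on the deleting node can see or drop.) A minimal sketch:

```shell
#!/bin/sh
# Demonstrates (on any POSIX filesystem, not just GFS2) that unlinking a
# file does not free its blocks while a reference to the inode remains.
# Here the lingering reference is an open file descriptor; in this bug it
# is another node's cached inode and its iopen glock holder.
set -e
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1024 count=1024 2>/dev/null

exec 3< "$f"          # hold a reference to the inode
rm "$f"               # directory entry is gone; ls no longer shows the file
size=$(wc -c <&3)     # ...but the data is still fully allocated and readable
echo "unlinked file still holds $size bytes"
exec 3<&-             # only when the last reference drops are the blocks freed
```

Locally, the kernel frees the blocks as soon as fd 3 closes; in the clustered case the equivalent "close" only happens when the remote node evicts the inode from its cache, which is driven by memory pressure and can take hours.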
Unfortunately, the inode gets freed based on memory pressure, not disk-space pressure, so by creating and deleting large files you can have an empty filesystem that is completely full.

Right now I'm trying to find some way for GFS2 on one node to communicate to the vfs layer on another node when a file is deleted, so that the other node can flush the inode from its cache. The other possible solution would be for the iopen holder to only exist while the file is open, instead of while the inode exists in cache. However, this is a much bigger locking change, and changing the iopen lock seems likely to be prone to all kinds of issues.

We already use the callback mechanism on iopen locks to indicate when the link count has hit zero, so we probably just need to ensure that we also invalidate that entry in the dcache if there are no users left, since I suspect it's the dcache that's keeping the entry from going away. In other words, we need to expand the current flag setting on a callback to something specific to iopen locks. I guess we might want to add a gl_ops operation for callbacks in that case. Need to check that we can grab dcache/icache locks from that context without any issues; otherwise we'd have to do it from run_queue. You should be able to use the tracing in upstream to track the demote requests if that is useful to you. Does that sound reasonable?

I did some work along this route, and it should work... but there's a catch. If we free the last inode reference in glock_work_func(), then we end up calling gfs2_delete_inode() from within glock_work_func(). This needs to acquire two exclusive locks, which themselves require getting callbacks and running glock_work_func(). This doesn't work. Even if we could block in glock_work_func() and allow other glock_workqueue processes to handle these callbacks, what happens if the other glock_workqueue process is also waiting in gfs2_delete_inode()?
No matter how many glock_workqueue processes we have, they could all be stuck in gfs2_delete_inode(), and then none of them could handle the callbacks needed to acquire the locks. We also must finish freeing up the inode by the time we return from gfs2_delete_inode(), so we can't easily push the work of deleting the data out until later.

It seems like the most reasonable solution is to not free up the dcache entry in the workqueue, but instead shunt that off to a different thread that just does this. I'm not very thrilled with that solution, so if anyone has another way that this could work, I'd love to hear it. But I'm starting work on this idea now.

Yes, I think a different thread will be required... maybe we can use an existing abstraction, though? Perhaps another use for slow-work? Need to ensure that there will be no interactions with the recovery code, though. Alternatively, quotad might be usable, as we already use it for dealing with pending truncates for similar reasons.

Created attachment 354232 [details]
First cut at a patch.
This patch is pretty ugly, but so far it seems to work correctly. I'm going to keep testing and cleaning it up.
The patch doesn't look too bad... could you move the trigger for the delete workqueue out from run_queue and put it in the callback code, perhaps? Maybe add a callback entry to the glops structure so that we can do type-specific callbacks? That way we'd move that out of the common code and into iopen-only code.

When will the patch be ready for this issue? I have a prod gfs2 cluster with the same issue, which runs on RHEL5 Update 3, kernel 2.6.18-128.1.14.el5.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Created attachment 354793 [details]
cleaned up patch
This patch is like the previous one, but it removes the debug printouts, moves some of the logic around, and fixes a bug where unmounting a filesystem while there was still work on the gfs2_delete_workqueue caused a withdraw.
POSTed.

In kernel-2.6.18-160.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.

Verified that the patch for this bug is included in kernel-2.6.18-160.el5 with patch #24367.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html

Has this bug been reintroduced? I am seeing this exact issue on the 2.6.18-194.26.1.el5 kernel.

(In reply to comment #44)
> Has this bug been reintroduced? I am seeing this exact issue on
> 2.6.18-194.26.1.el5 kernel

David, a very similar but new issue has recently been found: https://bugzilla.redhat.com/show_bug.cgi?id=666080
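Since the last comments ask whether the bug resurfaced on a later kernel, a quick first check is whether a given kernel release at least postdates the build this fix shipped in (kernel-2.6.18-160.el5). This is only a heuristic; it cannot rule out the separate regression tracked in bug 666080. A sketch, assuming GNU sort's -V version comparison is available:

```shell
#!/bin/sh
# Heuristic check: does a kernel release string sort at or after the build
# that carried this fix (kernel-2.6.18-160.el5)? Relies on GNU "sort -V".

# has_fix RUNNING FIXED: succeed if RUNNING >= FIXED by version sort.
has_fix() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

fixed="2.6.18-160.el5"
for running in 2.6.18-128.1.14.el5 2.6.18-160.el5 2.6.18-194.26.1.el5; do
    if has_fix "$running" "$fixed"; then
        echo "$running: includes the fix"
    else
        echo "$running: predates the fix"
    fi
done
```

On a live system this would be invoked as `has_fix "$(uname -r)" "$fixed"`; per the comments above, 2.6.18-128.1.14.el5 predates the fix while 2.6.18-194.26.1.el5 includes it.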