Bug 666080
| Field | Value |
|---|---|
| Summary | GFS2: Blocks not marked free on delete |
| Product | Red Hat Enterprise Linux 5 |
| Reporter | Adam Drew <adrew> |
| Component | kernel |
| Assignee | Ben Marzinski <bmarzins> |
| Status | CLOSED ERRATA |
| QA Contact | Cluster QE <mspqa-list> |
| Severity | urgent |
| Priority | high |
| Version | 5.7 |
| CC | adas, ahecox, ajb2, andresp, anton, bmarzins, brsmith, casmith, cmaiolin, cww, dhoward, jwest, liko, qcai, rfreire, rpeterso, rprice, rwheeler, sbradley, ssaha, swhiteho |
| Target Milestone | rc |
| Keywords | ZStream |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Cloned As | 669877 (view as bug list) |
| Bug Blocks | 669877, 675909 |
| Last Closed | 2011-07-21 10:04:32 UTC |

**Doc Text:**

> Deleting a file on a GFS2 file system caused the inode, which the deleted file previously occupied, to not be freed. Specifically, this only occurred when a file was deleted on a particular node while other nodes in the cluster were caching that same inode. The mechanism for ensuring that inodes are correctly deallocated when the final close occurs was dependent on a previously corrected bug (BZ#504188). In order to ensure that iopen glocks are not cached beyond the lifetime of the inode, and thus prevent deallocation by another node in the cluster, this update marks the iopen glock as not to be cached during the inode disposal process.
**Description** (Adam Drew, 2010-12-28 20:26:44 UTC)
Please don't use gfs2_tool df since it is obsolete. Also, I'm wondering whether fast statfs was being used here? If so, that might explain the apparent lack of free blocks. Otherwise, the most likely cause is that the original inode is being pinned in core by the dcache on the creating node. So we need to figure out whether that is happening, and also why, since the unlink is supposed to result in that dcache entry being flushed, which was fixed a long time ago.

---

> Also, I'm wondering whether fast statfs was being used here?

No mount options in use. This can be reproduced (it seems) on any RHEL 5 cluster running 2.6.18-194.26.1.el5 or higher. Carlos, multiple customers, and I have all been able to reproduce it. Of note, the issue is not happening on RHEL 6: I tested on 2.6.32-71.7.1.el6 and saw no issue on delete.

---

Hi, I just sent a patch to the cluster-devel list which addresses the same issue. The problem is indeed DLM related; I'm not sure if it's a regression of BZ 505548, but the symptoms are the same: space not freed when files are deleted. The patch I've sent is:

https://www.redhat.com/archives/cluster-devel/2011-January/msg00008.html

Created attachment 471750 [details]
Patch fixing send_bast_queue() dlm function
Are there any updates on this one yet?

---

Carlos was definitely correct that this issue started when the dlm stopped sending callbacks to the node issuing the glock request, as part of the fix for Bug 504188. However, that patch is correct; GFS2 is doing something incorrectly. I'm looking into what's happening right now.

---

Right now, it looks like with the dlm fix in place, gfs2_delete_inode() is getting called, but gfs2_file_dealloc() is not. I'm still trying to figure out why the space isn't returned immediately, but the good news is that this doesn't actually cause any real damage to the filesystem. The file does get deleted, but the space is still allocated. This shouldn't hurt anything. The next time gfs2 tries to use that inode's resource group, it will find the unused but still allocated inode and delete it, freeing up all the space. This doesn't require any special recovery actions; the check happens whenever gfs2 tries to allocate space.

gfs2 tries to free up space as soon as things are deleted on any node, but this isn't always possible in a clustered environment, at least not without a performance hit. In these cases, gfs2 deletes the file and reclaims the space later. However, I still don't see why gfs2 shouldn't be able to return the space right away in this case.

Please verify that after you delete the file, and don't see the space freed up, you are still able to create another file of the same size as the one you just deleted.
Yup, still able to use the space even though it appears to not be free:

```
[root@node1 gfs2]# mount -l -t gfs2
/dev/mapper/mpath0p2 on /mnt/gfs2 type gfs2 (rw,hostdata=jid=0:id=327681:first=1) [adrew-rhel5:space_test]
[root@node1 gfs2]# pwd
/mnt/gfs2
[root@node1 gfs2]# df -h /dev/mapper/mpath0p2
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/mpath0p2  965M  259M  707M  27% /mnt/gfs2
[root@node1 gfs2]# dd if=/dev/zero of=test.out bs=1024k count=900
dd: writing `test.out': No space left on device
705+0 records in
704+0 records out
738734080 bytes (739 MB) copied, 146.124 seconds, 5.1 MB/s
[root@node1 gfs2]# df -h /dev/mapper/mpath0p2
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/mpath0p2  965M  965M  216K 100% /mnt/gfs2
[root@node1 gfs2]# ssh node2
root@node2's password:
Last login: Wed Jan 12 18:26:18 2011 from node1.adrew.net
[root@node2 ~]# cd /mnt/gfs2
[root@node2 gfs2]# mount -l -t gfs2
/dev/mapper/mpath0p2 on /mnt/gfs2 type gfs2 (rw,hostdata=jid=1:id=327681:first=0) [adrew-rhel5:space_test]
[root@node2 gfs2]# df -h /dev/mapper/mpath0p2
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/mpath0p2  965M  965M  216K 100% /mnt/gfs2
[root@node2 gfs2]# rm -rf test.out
[root@node2 gfs2]# df -h /dev/mapper/mpath0p2
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/mpath0p2  965M  965M  216K 100% /mnt/gfs2
[root@node2 gfs2]# dd if=/dev/zero of=test.out bs=1024k count=900
dd: writing `test.out': No space left on device
705+0 records in
704+0 records out
738734080 bytes (739 MB) copied, 109.049 seconds, 6.8 MB/s
```

The thing I always found strangest is that this *doesn't* happen if all operations are done on a single node.
If I create the file and delete it on the same node, then the space does get "freed" up:

```
[root@node1 gfs2]# mount -l -t gfs2
/dev/mapper/mpath0p2 on /mnt/gfs2 type gfs2 (rw,hostdata=jid=0:id=327681:first=1) [adrew-rhel5:space_test]
[root@node1 gfs2]# pwd
/mnt/gfs2
[root@node1 gfs2]# df -h | grep mpath
/dev/mapper/mpath0p2  965M  259M  707M  27% /mnt/gfs2
[root@node1 gfs2]# dd if=/dev/zero of=test.out bs=1024k count=900
dd: writing `test.out': No space left on device
705+0 records in
704+0 records out
738734080 bytes (739 MB) copied, 143.501 seconds, 5.1 MB/s
[root@node1 gfs2]# df -h | grep mpath
/dev/mapper/mpath0p2  965M  965M  216K 100% /mnt/gfs2
[root@node1 gfs2]# rm -rf test.out
[root@node1 gfs2]# df -h | grep mpath
/dev/mapper/mpath0p2  965M  259M  707M  27% /mnt/gfs2
```

---

One suggestion is this: after the removal of the file, look to see how much free space there is on both nodes. Assuming that both nodes were caching the inode, only one of them should be deallocating the blocks. The node doing the final unlink should use trylocks in order to pass the baton to any node still caching the inode. If it did this in the case above, then the result should be that the other node will then (assuming no local openers, as in this case) attempt to also deallocate the inode. Assuming that we have fast statfs and that the other node did the deallocation, the freed space would not show up on the unlinking node right away, but it would show up on the other node as soon as the deallocation was complete.

---

When I remove the file, I can see that both nodes call gfs2_delete_inode(), and both fail in gfs2_glock_nq() with GLR_TRYFAILED, trying to relock the iopen lock in the exclusive state. Afterwards, the space is not there on either node, since neither one makes it to gfs2_file_dealloc(). When a node later notices the unused inode during gfs2_inplace_reserve_i(), that's when the space is finally deallocated, and it shows up on both nodes.
This bug looks identical with fast statfs on and off. I'm currently trying to figure out why one of those nodes isn't able to complete the delete the first time around.

---

Here's what the problem is. Say you create the file on nodeA and remove it on nodeB. nodeB fails to acquire the iopen glock in the exclusive state, since nodeA still has it cached in the shared state from when it opened the file, and nodes only do a trylock when they try to get the iopen lock on deletes. When this happens, the iopen lock stays cached in the shared state on nodeB as well, so when nodeA tries to grab the glock in the exclusive state, it fails too. Before Dave's fix, nodeB was sending a callback to itself when it tried to acquire the glock in the exclusive state. This caused it to drop the glock from its cache, which let nodeA acquire it.

To fix this, when a node fails to delete a file completely, it now drops the glock from its cache by calling handle_callback() and then scheduling work on the glock. This lets the other node acquire the iopen glock in the exclusive state and finish the delete immediately. I have this working, but the fix is littered with debugging code. I'll clean it up and post it in the morning.

Created attachment 473625 [details]
Fix to allow space to be freed immediately on delete
This is a simpler idea than what I described before. We simply don't cache the shared iopen glock when we dequeue it. Since we need to acquire the lock in the exclusive state anyway, dropping the shared lock doesn't hurt anything. If we fail to grab the iopen glock exclusively, then we won't have anything cached, and the other node should be able to acquire the lock and finish up the delete.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents:

> The mechanism for ensuring that inodes are deallocated when the final close occurs was relying on a bug which was previously corrected in BZ#504188. In order to ensure that iopen locks are not cached beyond the lifetime of the inode, and thus prevent deallocation by another node in the cluster, this change marks the iopen glock as not to be cached during the inode disposal process.
>
> The consequences of the process not completing are not that great. There is already a separate process in place which deals with deallocating allocated but unlinked inodes. This is similar in intent to the ext3 orphan list.
>
> The symptoms of this bug are that space does not appear to be freed when inodes are unlinked. However, the space is available for reuse, and an attempt to reuse the space will trigger the process mentioned above, which will deallocate the inode and make the space available for future allocations.
>
> This bug only affects inodes that are cached by more than one node and which are then unlinked.

---

Fix included in kernel-2.6.18-241.el5. You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 and detailed testing feedback is always welcomed.

---

Technical note updated twice. The condensed note initially said the problem occurred when a file was "deleted on a different inode than the inode that created it"; a later revision corrected this to the wording below. Final contents:

> Deleting a file on a GFS2 file system caused the inode, which the deleted file previously occupied, to not be freed. Specifically, this only occurred when a file was deleted on a particular node while other nodes in the cluster were caching that same inode. The mechanism for ensuring that inodes are correctly deallocated when the final close occurs was dependent on a previously corrected bug (BZ#504188). In order to ensure that iopen glocks are not cached beyond the lifetime of the inode, and thus prevent deallocation by another node in the cluster, this update marks the iopen glock as not to be cached during the inode disposal process.

---

Verified new test case using kernel-2.6.18-238.el5 (RHEL 5.6).

Verified fixed in kernel-2.6.18-256.el5.

---

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html