Description of problem:
Blocks are not being marked free when a delete happens on a GFS2 filesystem. Specifically, this happens when the file is deleted on a different node than the node that created it. If we create a file on node A and delete it on node A, the blocks are freed up. If we create a file on node A and delete it on node B, the blocks are not freed up. The result is that space is not freed up on delete when the filesystem is being accessed by multiple nodes concurrently.

This may be a regression of BZ 505548, but that bug's issue surfaced regardless of where the file was deleted; this issue is more specific. Running gfs2_fsck fixes the corruption with "Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)" messages for all affected blocks.

Version-Release number of selected component (if applicable):
2.6.18-194.26.1.el5

How reproducible:
Easily. Can be reproduced 100% of the time with simple tests.

Steps to Reproduce:
1. Set up a 2+ node cluster
2. Create a GFS2 filesystem and mount it on all nodes
3. Create a file on the GFS2 filesystem and then delete the file from another node
4. Run gfs2_tool df or regular df and observe that the space is not freed up
5. Run fsck and observe the corruption being fixed

Actual results:

[root@node1 test]# dd if=/dev/zero of=test.img bs=1024 count=262144
262144+0 records in
262144+0 records out
268435456 bytes (268 MB) copied, 51.1185 seconds, 5.3 MB/s

[root@node1 test]# gfs2_tool df
/mnt/test:
  SB lock proto = "lock_dlm"
  SB lock table = "adrew-rhel5:gfs2-delete"
  SB ondisk format = 1801
  SB multihost format = 1900
  Block size = 4096
  Journals = 2
  Resource Groups = 40
  Mounted lock proto = "lock_dlm"
  Mounted lock table = "adrew-rhel5:gfs2-delete"
  Mounted host data = "jid=0:id=196609:first=1"
  Journal number = 0
  Lock module flags = 0
  Local flocks = FALSE
  Local caching = FALSE

  Type           Total Blocks   Used Blocks    Free Blocks    use%
  ------------------------------------------------------------------------
  data           2612352        131861         2480491        5%
  inodes         2480508        17             2480491        0%

[root@node1 test]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                       9014656   6383424   2165928  75% /
/dev/vda1               101086     25845     70022  27% /boot
tmpfs                   255292         0    255292   0% /dev/shm
/dev/mapper/mpath0p2  10449408    527444   9921964   6% /mnt/test

[root@node2 test]# rm test.img
rm: remove regular file `test.img'? y

[root@node1 test]# gfs2_tool df
/mnt/test:
  SB lock proto = "lock_dlm"
  SB lock table = "adrew-rhel5:gfs2-delete"
  SB ondisk format = 1801
  SB multihost format = 1900
  Block size = 4096
  Journals = 2
  Resource Groups = 40
  Mounted lock proto = "lock_dlm"
  Mounted lock table = "adrew-rhel5:gfs2-delete"
  Mounted host data = "jid=0:id=196609:first=1"
  Journal number = 0
  Lock module flags = 0
  Local flocks = FALSE
  Local caching = FALSE

  Type           Total Blocks   Used Blocks    Free Blocks    use%
  ------------------------------------------------------------------------
  data           2612352        131861         2480491        5%
  inodes         2480508        17             2480491        0%

[root@node1 test]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                       9014656   6383424   2165928  75% /
/dev/vda1               101086     25845     70022  27% /boot
tmpfs                   255292         0    255292   0% /dev/shm
/dev/mapper/mpath0p2  10449408    527444   9921964   6% /mnt/test

Expected results:
Deleting files results in the associated blocks being marked free, regardless of which node the delete occurs on.
Please don't use gfs2_tool df, since it is obsolete. Also, I'm wondering whether fast statfs was being used here? If so, that might explain the apparent lack of free blocks. Otherwise the most likely cause is that the original inode is being pinned in core by the dcache on the creating node. We need to figure out whether that is happening, and also why, since the remote delete is supposed to result in that dcache entry being flushed; that was fixed a long time ago.
"Also, I'm wondering whether fast statfs was being used here?" No mount options in use. This can be reproduced (it seems) on any RHEL 5 cluster running 2.6.18-194.26.1.el5 or higher. Carlos, multiple customers, and I have all been able to reproduce it. To note, the issue is not happening on RHEL 6. I tested on 2.6.32-71.7.1.el6 and saw no issue on delete.
Hi, I just sent a patch to the cluster-devel list which addresses the same issue, but the problem is actually DLM related. I'm not sure whether it's a regression of BZ 505548, since this is a DLM issue, but the symptoms are the same: space is not freed when files are deleted. The patch I've sent is: https://www.redhat.com/archives/cluster-devel/2011-January/msg00008.html
Created attachment 471750 [details]
Patch fixing send_bast_queue() dlm function
Are there any updates on this one yet?
Carlos was definitely correct that this issue started when the dlm stopped sending callbacks to the node issuing the glock request, as part of the fix for Bug 504188. However, that patch is correct; GFS2 is doing something incorrectly. I'm looking into what's happening right now.
Right now, it looks like with the dlm fix in place, gfs2_delete_inode() is getting called, but gfs2_file_dealloc() is not.
I'm still trying to figure out why the space isn't returned immediately, but the good news is that this doesn't actually cause any real damage to the filesystem. The file does get deleted, but the space remains allocated; this shouldn't hurt anything. The next time gfs2 tries to use that inode's resource group, it will find the unused but still allocated inode and delete it, freeing up all the space. This doesn't require any special recovery actions; the check happens whenever gfs2 tries to allocate space.

gfs2 tries to free up space as soon as things are deleted on any node, but this isn't always possible in a clustered environment, at least not without a performance hit. In those cases, gfs2 deletes the file and reclaims the space later. However, I still don't see why gfs2 shouldn't be able to return the space right away in this case.

Please verify that after you delete the file and don't see the space freed up, you are still able to create another file of the same size as the one you just deleted.
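To make the deferred-reclaim behaviour described above more concrete, here is a minimal userspace C model. This is not the fs/gfs2 kernel code; the structures and function names (rgrp, reclaim_unlinked, alloc_block) are invented purely for illustration of the idea that the allocator, while scanning a resource group, can notice a block that is still marked allocated but whose inode has been unlinked, free it, and then satisfy the allocation.

/* Simplified model of "reclaim unlinked-but-allocated inodes at allocation
 * time".  NOT the GFS2 kernel code; all names here are hypothetical. */
#include <stdio.h>
#include <stdbool.h>

#define RG_BLOCKS 8

struct rgrp_block {
    bool allocated;   /* bit set in the resource group bitmap           */
    bool unlinked;    /* inode has no remaining directory entries       */
};

struct rgrp {
    struct rgrp_block blocks[RG_BLOCKS];
};

/* Walk the resource group; any block that is still allocated but whose
 * inode is unlinked can be deallocated lazily, just before new space is
 * handed out. */
static void reclaim_unlinked(struct rgrp *rg)
{
    for (int i = 0; i < RG_BLOCKS; i++) {
        if (rg->blocks[i].allocated && rg->blocks[i].unlinked) {
            printf("reclaiming block %d (unlinked but still allocated)\n", i);
            rg->blocks[i].allocated = false;
            rg->blocks[i].unlinked = false;
        }
    }
}

/* Allocation path: reclaim first, then pick the first free block. */
static int alloc_block(struct rgrp *rg)
{
    reclaim_unlinked(rg);
    for (int i = 0; i < RG_BLOCKS; i++) {
        if (!rg->blocks[i].allocated) {
            rg->blocks[i].allocated = true;
            return i;
        }
    }
    return -1; /* no space */
}

int main(void)
{
    struct rgrp rg = { 0 };

    /* Every block is allocated; block 3 was unlinked on another node but
     * the deallocation did not happen at unlink time. */
    for (int i = 0; i < RG_BLOCKS; i++)
        rg.blocks[i].allocated = true;
    rg.blocks[3].unlinked = true;

    /* df-style accounting would report the group as full, yet a new
     * allocation still succeeds because the scan frees the unlinked
     * inode first. */
    printf("allocated block %d\n", alloc_block(&rg));
    return 0;
}

Run as a normal C program, this prints that block 3 is reclaimed and then reallocated, which mirrors the observation above that the "missing" space is still usable even though df does not show it as free.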
Yup, still able to use the space even though it appears to not be free:

[root@node1 gfs2]# mount -l -t gfs2
/dev/mapper/mpath0p2 on /mnt/gfs2 type gfs2 (rw,hostdata=jid=0:id=327681:first=1) [adrew-rhel5:space_test]
[root@node1 gfs2]# pwd
/mnt/gfs2
[root@node1 gfs2]# df -h /dev/mapper/mpath0p2
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/mpath0p2  965M  259M  707M  27% /mnt/gfs2
[root@node1 gfs2]# dd if=/dev/zero of=test.out bs=1024k count=900
dd: writing `test.out': No space left on device
705+0 records in
704+0 records out
738734080 bytes (739 MB) copied, 146.124 seconds, 5.1 MB/s
[root@node1 gfs2]# df -h /dev/mapper/mpath0p2
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/mpath0p2  965M  965M  216K 100% /mnt/gfs2
[root@node1 gfs2]# ssh node2
root@node2's password:
Last login: Wed Jan 12 18:26:18 2011 from node1.adrew.net
[root@node2 ~]# cd /mnt/gfs2
[root@node2 gfs2]# mount -l -t gfs2
/dev/mapper/mpath0p2 on /mnt/gfs2 type gfs2 (rw,hostdata=jid=1:id=327681:first=0) [adrew-rhel5:space_test]
[root@node2 gfs2]# df -h /dev/mapper/mpath0p2
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/mpath0p2  965M  965M  216K 100% /mnt/gfs2
[root@node2 gfs2]# rm -rf test.out
[root@node2 gfs2]# df -h /dev/mapper/mpath0p2
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/mpath0p2  965M  965M  216K 100% /mnt/gfs2
[root@node2 gfs2]# dd if=/dev/zero of=test.out bs=1024k count=900
dd: writing `test.out': No space left on device
705+0 records in
704+0 records out
738734080 bytes (739 MB) copied, 109.049 seconds, 6.8 MB/s

The thing I always found strangest is that this *doesn't* happen if all operations are done on a single node. If I create the file and delete it on the same node then the space does get "freed" up:

[root@node1 gfs2]# mount -l -t gfs2
/dev/mapper/mpath0p2 on /mnt/gfs2 type gfs2 (rw,hostdata=jid=0:id=327681:first=1) [adrew-rhel5:space_test]
[root@node1 gfs2]# pwd
/mnt/gfs2
[root@node1 gfs2]# df -h | grep mpath
/dev/mapper/mpath0p2  965M  259M  707M  27% /mnt/gfs2
[root@node1 gfs2]# dd if=/dev/zero of=test.out bs=1024k count=900
dd: writing `test.out': No space left on device
705+0 records in
704+0 records out
738734080 bytes (739 MB) copied, 143.501 seconds, 5.1 MB/s
[root@node1 gfs2]# df -h | grep mpath
/dev/mapper/mpath0p2  965M  965M  216K 100% /mnt/gfs2
[root@node1 gfs2]# rm -rf test.out
[root@node1 gfs2]# df -h | grep mpath
/dev/mapper/mpath0p2  965M  259M  707M  27% /mnt/gfs2
One suggestion is this: after the removal of the file, look to see how much free space there is on both nodes. Assuming that both nodes were caching the inode, only one of them should be deallocating the blocks. The node doing the final unlink should use try locks in order to pass the baton to any node still caching the inode. If it did this in the case above, then the other node (assuming no local openers, as in this case) should attempt to deallocate the inode itself. Assuming that we have fast statfs and that the other node did the deallocation, the freed space would not show up on the unlinking node right away, but it would show up on the other node as soon as the deallocation was complete.
When I remove the file, I can see that both nodes call gfs2_delete_inode(), and both fail in gfs2_glock_nq() with GLR_TRYFAILED, trying to relock the iopen lock in the exclusive state. Afterwards, the space is not freed on either node, since neither one makes it to gfs2_file_dealloc(). When a node later notices the unused inode during gfs2_inplace_reserve_i(), that's when the space is finally deallocated, and it shows up on both nodes. The bug looks identical with fast statfs on and off. I'm currently trying to figure out why one of those nodes isn't able to complete the delete the first time around.
Here's what the problem is: say you create the file on nodeA and remove it on nodeB. nodeB fails to acquire the iopen glock in the exclusive state, since nodeA still has it cached in the shared state from when it opened the file, and nodes only do a trylock when they try to get the iopen lock on deletes. When this happens, the iopen lock stays cached in the shared state on nodeB as well, so when nodeA tries to grab the glock in the exclusive state, it fails too.

Before Dave's fix, nodeB was sending a callback to itself when it tried to acquire the glock in the exclusive state. This caused it to drop the glock from its cache, which let nodeA acquire it. To fix this, when a node fails to delete a file completely, it now drops the glock from its cache by calling handle_callback() and then scheduling work on the glock. This lets the other node acquire the iopen glock in exclusive and finish the delete immediately.

I have this working, but the fix is littered with debugging code. I'll clean it up and post it in the morning.
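For illustration only, the interaction described above can be sketched as a tiny self-contained C model. This is not the real glock/DLM code; the types, names, and locking rules below are simplified assumptions. Two nodes each cache the iopen lock in the shared state, the delete path only uses a try lock for the exclusive request, so both requests fail and neither node ever reaches the deallocation step:

/* Toy model of the iopen glock interaction described above.  NOT the
 * GFS2/DLM code; the types and functions are hypothetical. */
#include <stdio.h>
#include <stdbool.h>

enum lock_state { LS_UNLOCKED, LS_SHARED, LS_EXCLUSIVE };

struct node {
    const char *name;
    enum lock_state cached;   /* lock state this node keeps cached locally */
};

/* An exclusive try lock succeeds only if the other node caches nothing. */
static bool trylock_exclusive(struct node *self, struct node *other)
{
    if (other->cached != LS_UNLOCKED) {
        printf("%s: EX try lock failed (%s still caches the iopen lock)\n",
               self->name, other->name);
        return false;
    }
    self->cached = LS_EXCLUSIVE;
    printf("%s: EX lock granted, deallocating the inode\n", self->name);
    return true;
}

static void delete_path(struct node *self, struct node *other)
{
    /* Without the fix, the failed try lock leaves the shared state cached,
     * so the other node's later attempt fails in exactly the same way. */
    if (!trylock_exclusive(self, other))
        printf("%s: giving up, space stays allocated for now\n", self->name);
}

int main(void)
{
    /* nodeA created/opened the file, nodeB unlinks it; both cache SH. */
    struct node a = { "nodeA", LS_SHARED };
    struct node b = { "nodeB", LS_SHARED };

    delete_path(&b, &a);   /* unlink on nodeB: try lock fails            */
    delete_path(&a, &b);   /* nodeA notices the unlink: also fails       */
    return 0;
}

Running it shows both exclusive try locks failing, which matches the observation earlier in the thread that neither node makes it to gfs2_file_dealloc() the first time around.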
Created attachment 473625 [details]
Fix to allow space to be freed immediately on delete

This is a simpler idea than what I described before. We simply don't cache the shared iopen glock when we dequeue it. Since we need to acquire the lock in exclusive anyway, dropping the shared lock doesn't hurt anything. If we fail to grab the iopen glock exclusively, then we won't have anything cached, and the other node should be able to acquire the lock to finish up the delete.
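Continuing the same toy model (again hypothetical code, not the attached patch), the effect of not caching the shared iopen glock on dequeue can be shown by clearing the unlinking node's cached state: its own exclusive try lock still fails, but the other node's try lock now succeeds and the deallocation happens right away.

/* Same toy model as above, with the behaviour of the fix: the shared iopen
 * lock is not kept cached once it is dequeued.  Hypothetical code only. */
#include <stdio.h>
#include <stdbool.h>

enum lock_state { LS_UNLOCKED, LS_SHARED, LS_EXCLUSIVE };

struct node {
    const char *name;
    enum lock_state cached;
};

static bool trylock_exclusive(struct node *self, struct node *other)
{
    if (other->cached != LS_UNLOCKED) {
        printf("%s: EX try lock failed\n", self->name);
        return false;
    }
    self->cached = LS_EXCLUSIVE;
    printf("%s: EX lock granted, inode deallocated, space freed\n", self->name);
    return true;
}

int main(void)
{
    struct node a = { "nodeA", LS_SHARED };   /* still caches the open file */
    struct node b = { "nodeB", LS_SHARED };   /* does the unlink            */

    /* With the fix, nodeB does not keep the shared iopen lock cached once
     * it dequeues it in the delete path ... */
    b.cached = LS_UNLOCKED;
    trylock_exclusive(&b, &a);   /* still fails: nodeA caches the lock      */

    /* ... so when nodeA's delete path runs, nothing blocks its try lock
     * and the space is returned immediately instead of waiting for a
     * later resource group scan. */
    trylock_exclusive(&a, &b);
    return 0;
}

In this sketch the second try lock succeeds, which is the "other node finishes up the delete" behaviour the attachment description is aiming for.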
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
The mechanism for ensuring that inodes are deallocated when the final close occurs was relying on a bug which was previously corrected in bz #504188. In order to ensure that iopen locks are not cached beyond the lifetime of the inode, and thus prevent deallocation by another node in the cluster, this change marks the iopen glock as not to be cached during the inode disposal process.

The consequences of the process not completing are not that great. There is already a separate process in place which deals with deallocating allocated, but unlinked, inodes. This is similar in intent to the ext3 orphan list.

The symptoms of this bug are that space does not appear to be freed when inodes are unlinked. However, the space is available for reuse, and an attempt to reuse the space will trigger the process mentioned above, which will deallocate the inode and make the space available for future allocations.

This bug only affects inodes that are cached by more than one node and which are then unlinked.
in kernel-2.6.18-241.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,7 +1 @@
-The mechanism for ensuring that inodes are deallocated when the final close occurs was relying on a bug which was previously corrected in bz #504188. In order to ensure that iopen locks are not cached beyond the lifetime of the inode, and thus prevent deallocation by another node in the cluster, this change marks the iopen glock as not to be cached during the inode disposal process.
+Deleting a file on a GFS2 file system caused the inode, which the deleted file previously occupied, to not be freed. Specifically, this only occurred when a file was deleted on a different node than the node that created it. The mechanism for ensuring that inodes are correctly deallocated when the final close occurs was dependent on a previously corrected bug (BZ#504188). In order to ensure that iopen glocks are not cached beyond the lifetime of the inode, and thus prevent deallocation by another node in the cluster, this update marks the iopen glock as not to be cached during the inode disposal process.
-
-The consequences of the process not completing are not that great. There is already a separate process in place which deals with deallocating allocated, but unlinked, inodes. This is similar in intent to the ext3 orphan list.
-
-The symptoms of this bug are that space does not appear to be freed when inodes are unlinked. However, the space is available for reuse, and an attempt to reuse the space will trigger the process mentioned above, which will deallocate the inode and make the space available for future allocations.
-
-This bug only affects inodes that are cached by more than one node and which are then unlinked.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Deleting a file on a GFS2 file system caused the inode, which the deleted file previously occupied, to not be freed. Specifically, this only occurred when a file was deleted on a different node than the node that created it. The mechanism for ensuring that inodes are correctly deallocated when the final close occurs was dependent on a previously corrected bug (BZ#504188). In order to ensure that iopen glocks are not cached beyond the lifetime of the inode, and thus prevent deallocation by another node in the cluster, this update marks the iopen glock as not to be cached during the inode disposal process.
+Deleting a file on a GFS2 file system caused the inode, which the deleted file previously occupied, to not be freed. Specifically, this only occurred when a file was deleted on a particular node while other nodes in the cluster were caching that same inode. The mechanism for ensuring that inodes are correctly deallocated when the final close occurs was dependent on a previously corrected bug (BZ#504188). In order to ensure that iopen glocks are not cached beyond the lifetime of the inode, and thus prevent deallocation by another node in the cluster, this update marks the iopen glock as not to be cached during the inode disposal process.
Verified new test case using kernel-2.6.18-238.el5 (RHEL 5.6).

Verified fixed in kernel-2.6.18-256.el5.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html