Bug 666080
| Field | Value |
|---|---|
| Summary | GFS2: Blocks not marked free on delete |
| Product | Red Hat Enterprise Linux 5 |
| Reporter | Adam Drew <adrew> |
| Component | kernel |
| Assignee | Ben Marzinski <bmarzins> |
| Status | CLOSED ERRATA |
| QA Contact | Cluster QE <mspqa-list> |
| Severity | urgent |
| Priority | high |
| Version | 5.7 |
| CC | adas, ahecox, ajb2, andresp, anton, bmarzins, brsmith, casmith, cmaiolin, cww, dhoward, jwest, liko, qcai, rfreire, rpeterso, rprice, rwheeler, sbradley, ssaha, swhiteho |
| Target Milestone | rc |
| Keywords | ZStream |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Cloned As | 669877 (view as bug list) |
| Bug Blocks | 669877, 675909 |
| Last Closed | 2011-07-21 10:04:32 UTC |

**Doc Text:**

> Deleting a file on a GFS2 file system caused the inode, which the deleted file previously occupied, to not be freed. Specifically, this only occurred when a file was deleted on a particular node while other nodes in the cluster were caching that same inode. The mechanism for ensuring that inodes are correctly deallocated when the final close occurs was dependent on a previously corrected bug (BZ#504188). In order to ensure that iopen glocks are not cached beyond the lifetime of the inode, and thus prevent deallocation by another node in the cluster, this update marks the iopen glock as not to be cached during the inode disposal process.
**Description** (Adam Drew, 2010-12-28 20:26:44 UTC)
Please don't use gfs2_tool df since it is obsolete. Also, I'm wondering whether fast statfs was being used here? If so, that might explain the apparent lack of free blocks. Otherwise, the most likely cause is that the original inode is being pinned in core by the dcache on the creating node. So we need to figure out whether that is happening, and also why, since the unlink is supposed to result in that dcache entry being flushed, which was fixed a long time ago.

---

> Also, I'm wondering whether fast statfs was being used here?

No mount options in use. This can be reproduced (it seems) on any RHEL 5 cluster running 2.6.18-194.26.1.el5 or higher. Carlos, multiple customers, and I have all been able to reproduce it. Of note, the issue is not happening on RHEL 6: I tested on 2.6.32-71.7.1.el6 and saw no issue on delete.

---

Hi, I just sent a patch to the cluster-devel list which addresses the same issue. The problem is indeed DLM related; I'm not sure if it's a regression of BZ 505548, but the symptoms are the same: space not freed when files are deleted. The patch I've sent is:

https://www.redhat.com/archives/cluster-devel/2011-January/msg00008.html

Created attachment 471750 [details]
Patch fixing send_bast_queue() dlm function
Are there any updates on this one yet?

---

Carlos was definitely correct that this issue started when the dlm stopped sending callbacks to the node issuing the glock request, as part of the fix for Bug 504188. However, that patch is correct; GFS2 is doing something incorrectly. I'm looking into what's happening right now.

---

Right now, it looks like with the dlm fix in place, gfs2_delete_inode() is getting called, but gfs2_file_dealloc() is not. I'm still trying to figure out why the space isn't returned immediately, but the good news is that this doesn't actually cause any real damage to the filesystem. The file does get deleted, but the space is still allocated. This shouldn't hurt anything. The next time gfs2 tries to use that inode's resource group, it will find the unused but still allocated inode and delete it, freeing up all the space. This doesn't require any special recovery actions; the check happens whenever gfs2 tries to allocate space.

gfs2 tries to free up space as soon as things are deleted on any node, but this isn't always possible in a clustered environment, at least not without a performance hit. In these cases, gfs2 deletes the file and reclaims the space later. However, I still don't see why gfs2 shouldn't be able to return the space right away in this case.

Please verify that after you delete the file, and don't see the space freed up, you are still able to create another file of the same size as the one you just deleted.
Yup, still able to use the space even though it appears to not be free:

```
[root@node1 gfs2]# mount -l -t gfs2
/dev/mapper/mpath0p2 on /mnt/gfs2 type gfs2 (rw,hostdata=jid=0:id=327681:first=1) [adrew-rhel5:space_test]
[root@node1 gfs2]# pwd
/mnt/gfs2
[root@node1 gfs2]# df -h /dev/mapper/mpath0p2
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/mpath0p2  965M  259M  707M  27% /mnt/gfs2
[root@node1 gfs2]# dd if=/dev/zero of=test.out bs=1024k count=900
dd: writing `test.out': No space left on device
705+0 records in
704+0 records out
738734080 bytes (739 MB) copied, 146.124 seconds, 5.1 MB/s
[root@node1 gfs2]# df -h /dev/mapper/mpath0p2
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/mpath0p2  965M  965M  216K 100% /mnt/gfs2
[root@node1 gfs2]# ssh node2
root@node2's password:
Last login: Wed Jan 12 18:26:18 2011 from node1.adrew.net
[root@node2 ~]# cd /mnt/gfs2
[root@node2 gfs2]# mount -l -t gfs2
/dev/mapper/mpath0p2 on /mnt/gfs2 type gfs2 (rw,hostdata=jid=1:id=327681:first=0) [adrew-rhel5:space_test]
[root@node2 gfs2]# df -h /dev/mapper/mpath0p2
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/mpath0p2  965M  965M  216K 100% /mnt/gfs2
[root@node2 gfs2]# rm -rf test.out
[root@node2 gfs2]# df -h /dev/mapper/mpath0p2
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/mpath0p2  965M  965M  216K 100% /mnt/gfs2
[root@node2 gfs2]# dd if=/dev/zero of=test.out bs=1024k count=900
dd: writing `test.out': No space left on device
705+0 records in
704+0 records out
738734080 bytes (739 MB) copied, 109.049 seconds, 6.8 MB/s
```

The thing I always found strangest is that this *doesn't* happen if all operations are done on a single node.
If I create the file and delete it on the same node, then the space does get "freed" up:

```
[root@node1 gfs2]# mount -l -t gfs2
/dev/mapper/mpath0p2 on /mnt/gfs2 type gfs2 (rw,hostdata=jid=0:id=327681:first=1) [adrew-rhel5:space_test]
[root@node1 gfs2]# pwd
/mnt/gfs2
[root@node1 gfs2]# df -h | grep mpath
/dev/mapper/mpath0p2  965M  259M  707M  27% /mnt/gfs2
[root@node1 gfs2]# dd if=/dev/zero of=test.out bs=1024k count=900
dd: writing `test.out': No space left on device
705+0 records in
704+0 records out
738734080 bytes (739 MB) copied, 143.501 seconds, 5.1 MB/s
[root@node1 gfs2]# df -h | grep mpath
/dev/mapper/mpath0p2  965M  965M  216K 100% /mnt/gfs2
[root@node1 gfs2]# rm -rf test.out
[root@node1 gfs2]# df -h | grep mpath
/dev/mapper/mpath0p2  965M  259M  707M  27% /mnt/gfs2
```

---

One suggestion is this: after the removal of the file, look to see how much free space there is on both nodes. Assuming that both nodes were caching the inode, only one of them should be deallocating the blocks. The node doing the final unlink should use trylocks in order to pass the baton to any node still caching the inode. If it did this in the case above, then the result should be that the other node will then (assuming no local openers, as in this case) attempt to also deallocate the inode. Assuming that we have fast statfs and that the other node did the deallocation, the freed space would not show up on the unlinking node right away, but it would show up on the other node as soon as the deallocation was complete.

---

When I remove the file, I can see that both nodes call gfs2_delete_inode(), and both fail in gfs2_glock_nq() with GLR_TRYFAILED, trying to relock the iopen lock in the exclusive state. Afterwards, the space is not there on either node, since neither one makes it to gfs2_file_dealloc(). When a node later notices the unused inode during gfs2_inplace_reserve_i(), that's when the space is finally deallocated, and it shows up on both nodes.
This bug looks identical with fast statfs on and off. I'm currently trying to figure out why one of those nodes isn't able to complete the delete the first time around.

---

Here's what the problem is. Say you create the file on nodeA and remove it on nodeB. nodeB fails to acquire the iopen glock in the exclusive state, since nodeA still has it cached in the shared state from when it opened the file, and nodes only do a trylock when they try to get the iopen lock on deletes. When this happens, the iopen lock stays cached in the shared state on nodeB as well, so when nodeA tries to grab the glock in the exclusive state, it fails too. Before Dave's fix, nodeB was sending a callback to itself when it tried to acquire the glock in the exclusive state. This caused it to drop the glock from its cache, which let nodeA acquire it.

To fix this, when a node fails to delete a file completely, it now drops the glock from its cache by calling handle_callback() and then scheduling work on the glock. This lets the other node acquire the iopen glock in the exclusive state and finish the delete immediately. I have this working, but the fix is littered with debugging code. I'll clean it up and post it in the morning.

Created attachment 473625 [details]
Fix to allow space to be freed immediately on delete
This is a simpler idea than what I described before. We simply don't cache the shared iopen glock when we dequeue it. Since we need to acquire the lock in the exclusive state anyway, dropping the shared lock doesn't hurt anything. If we fail to grab the iopen glock exclusively, then we won't have anything cached, and the other node should be able to acquire the lock and finish up the delete.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents:

> The mechanism for ensuring that inodes are deallocated when the final close occurs was relying on a bug which was previously corrected in BZ#504188. In order to ensure that iopen locks are not cached beyond the lifetime of the inode, and thus prevent deallocation by another node in the cluster, this change marks the iopen glock as not to be cached during the inode disposal process.
>
> The consequences of the process not completing are not that great. There is already a separate process in place which deals with deallocating allocated but unlinked inodes. This is similar in intent to the ext3 orphan list.
>
> The symptoms of this bug are that space does not appear to be freed when inodes are unlinked. However, the space is available for reuse, and an attempt to reuse the space will trigger the process mentioned above, which will deallocate the inode and make the space available for future allocations.
>
> This bug only affects inodes that are cached by more than one node and which are then unlinked.

---

Fix included in kernel-2.6.18-241.el5. You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 and detailed testing feedback is always welcomed.

---

Technical note updated twice. The condensed note initially said the problem occurred when a file was "deleted on a different inode than the inode that created it"; a later revision corrected this to the wording below. Final contents:

> Deleting a file on a GFS2 file system caused the inode, which the deleted file previously occupied, to not be freed. Specifically, this only occurred when a file was deleted on a particular node while other nodes in the cluster were caching that same inode. The mechanism for ensuring that inodes are correctly deallocated when the final close occurs was dependent on a previously corrected bug (BZ#504188). In order to ensure that iopen glocks are not cached beyond the lifetime of the inode, and thus prevent deallocation by another node in the cluster, this update marks the iopen glock as not to be cached during the inode disposal process.

---

Verified new test case using kernel-2.6.18-238.el5 (RHEL 5.6).

Verified fixed in kernel-2.6.18-256.el5.

---

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html