Bug 505171 - gfs2: filesystem consistency error with statfs_slow = 1
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Ben Marzinski
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks: 514700
Reported: 2009-06-10 17:13 EDT by John Ruemker
Modified: 2009-09-02 04:54 EDT

Doc Type: Bug Fix
Last Closed: 2009-09-02 04:54:27 EDT

Attachments
Change to statfs_slow code to fix withdraws (1.06 KB, patch)
2009-07-08 15:00 EDT, Ben Marzinski
Description John Ruemker 2009-06-10 17:13:56 EDT
Description of problem: After creating and removing a large file on a gfs2 filesystem with statfs_slow enabled, the filesystem withdraws.  This happened on a fresh filesystem, for both me and a customer, where nothing had been written other than this one file.  I could not reproduce the issue when mounting on just one node, nor when using a smaller (700 MB) file.  I was able to reproduce the issue on the first try (using the procedure below), but unfortunately I can't seem to reproduce it anymore.

Version-Release number of selected component (if applicable):  kernel-2.6.18-128.1.10.el5, gfs2-utils-0.1.53-1.el5_3.3


How reproducible: Sometimes


Steps to Reproduce:
# mkfs.gfs2 -p lock_dlm -t jrummy5:gfs2-a -j 2 /dev/clust/gfs2-a
# mount /dev/clust/gfs2-a /mnt/gfs2-a  <-- on both nodes
# gfs2_tool settune /mnt/gfs2-a/ statfs_slow 1  <-- on both nodes
# cd /mnt/gfs2-a/
# dd if=/dev/zero of=./bigfile bs=1M count=3800
# df -k | grep lv1
# rm bigfile
# df -k | grep lv1
  
Actual results: I/O errors received and withdraw seen in logs:

Jun 10 16:09:18 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: fatal: filesystem consistency error
Jun 10 16:09:18 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0:   RG = 65388
Jun 10 16:09:18 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0:   function = gfs2_rgrp_verify, file = fs/gfs2/rgrp.c, line = 274
Jun 10 16:09:18 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: about to withdraw this file system
Jun 10 16:09:18 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: telling LM to withdraw
Jun 10 16:09:19 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: withdrawn
Jun 10 16:09:19 jrummy5-1 kernel: 
Jun 10 16:09:19 jrummy5-1 kernel: Call Trace:
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff88508526>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff800255df>] find_or_create_page+0x1e/0x75
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff80063097>] thread_return+0x62/0xfe
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff8851b6cf>] :gfs2:gfs2_consist_rgrpd_i+0x34/0x39
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff88517977>] :gfs2:gfs2_rgrp_verify+0x176/0x230
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff88519742>] :gfs2:gfs2_statfs_slow+0xc6/0x19b
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff88517ff9>] :gfs2:gfs2_rindex_hold+0x32/0x153
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff885139b0>] :gfs2:gfs2_statfs+0x46/0xa0
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff800db2c3>] vfs_statfs+0x63/0x7f
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff800db422>] vfs_statfs_native+0x13/0x34
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff800db4ea>] sys_statfs+0x3f/0x79
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff80066bcd>] do_page_fault+0x4fe/0x830
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff8000de64>] do_mmap_pgoff+0x66c/0x7d7
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff800b46ab>] audit_syscall_entry+0x16e/0x1a1
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jun 10 16:09:19 jrummy5-1 kernel: 
Jun 10 16:09:19 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: used data mismatch:  65326 != 65326

Expected results: File is removed, space is reclaimed.  

Additional info: I'll keep trying to reproduce this and see whether there are repeatable steps that trigger it.
Comment 1 Steve Whitehouse 2009-06-11 10:45:47 EDT
Ben, does this look familiar from your recent work on statfs?
Comment 2 Robert Peterson 2009-06-11 11:18:55 EDT
Sounds more like file system corruption to me, detected
when the slow statfs traverses the rgrp linked list.
Comment 7 Ben Marzinski 2009-07-01 14:55:44 EDT
I have finally recreated this myself on a 3-node x86 cluster. However, I needed to do multiple dd's, removes, and df's on different nodes to hit it.
Comment 8 Ben Marzinski 2009-07-08 15:00:10 EDT
Created attachment 350981
Change to statfs_slow code to fix withdraws

This change in the statfs_slow code fixes the issue for me.

Since both linked and unlinked inodes are counted by rgd->rd_dinodes, it makes no sense to count the unlinked inodes with the used data blocks (the first check that I changed). It makes sense to count them with the linked inodes (the second check), and it makes no sense to care whether there are more unlinked inodes than linked ones.
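
For readers without the attachment, here is a rough userspace sketch of the reasoning above. It is not the attached patch and not the kernel code: the struct, the enum, and the function names verify_old and verify_new are hypothetical stand-ins, and the "old" variant is only reconstructed from the description in this comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical block states for illustration; not the kernel's definitions. */
enum blk_state { BLK_FREE = 0, BLK_USED = 1, BLK_UNLINKED = 2, BLK_DINODE = 3 };

/* Hypothetical stand-in for the per-resource-group counters. */
struct rg_counters {
    uint32_t data;    /* total blocks managed by the resource group */
    uint32_t free;    /* blocks recorded as free */
    uint32_t dinodes; /* inodes recorded, linked and unlinked together */
};

/* Checks as described before the change (an assumption based on the comment):
 * unlinked inodes are lumped in with used data blocks. */
static bool verify_old(const struct rg_counters *rg, const uint32_t count[4])
{
    uint32_t used = rg->data - rg->free - rg->dinodes;

    if (count[BLK_FREE] != rg->free)
        return false;
    if (count[BLK_USED] + count[BLK_UNLINKED] != used)
        return false; /* "used data mismatch" */
    if (count[BLK_DINODE] != rg->dinodes)
        return false;
    if (count[BLK_UNLINKED] > count[BLK_DINODE])
        return false; /* more unlinked than linked inodes */
    return true;
}

/* Checks as described after the change: unlinked inodes are counted with
 * the linked ones, and the unlinked-vs-linked comparison is dropped. */
static bool verify_new(const struct rg_counters *rg, const uint32_t count[4])
{
    uint32_t used = rg->data - rg->free - rg->dinodes;

    if (count[BLK_FREE] != rg->free)
        return false;
    if (count[BLK_USED] != used)
        return false;
    if (count[BLK_UNLINKED] + count[BLK_DINODE] != rg->dinodes)
        return false;
    return true;
}
```

With made-up numbers (100 blocks, 90 free, 3 dinodes recorded, of which 2 are unlinked on disk), the old-style checks would flag a mismatch on a consistent resource group, while the new-style checks accept it.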
Comment 11 Linda Wang 2009-07-10 15:58:17 EDT
patch posted on 7/10/09: 3:38 PM; set dev ack.
Comment 13 Don Zickus 2009-07-14 16:57:23 EDT
in kernel-2.6.18-158.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 16 errata-xmlrpc 2009-09-02 04:54:27 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
