Bug 505171

Summary: gfs2: filesystem consistency error with statfs_slow = 1
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: low
Target Milestone: rc
Target Release: ---
Reporter: John Ruemker <jruemker>
Assignee: Ben Marzinski <bmarzins>
QA Contact: Red Hat Kernel QE team <kernel-qe>
Docs Contact:
CC: adas, bturner, dmair, dzickus, lwang, mgahagan, rpeterso, swhiteho, syeghiay
Doc Type: Bug Fix
Last Closed: 2009-09-02 08:54:27 UTC
Bug Blocks: 514700    
Attachments:
Change to statfs_slow code to fix withdraws (flags: none)

Description John Ruemker 2009-06-10 21:13:56 UTC
Description of problem: After creating and removing a large file on a gfs2 filesystem with statfs_slow enabled, the filesystem withdraws. This happened on a fresh filesystem, for both me and a customer, where nothing had been written other than this one file. I could not reproduce the issue when mounting on just one node, nor when using a smaller (700 MB) file. I reproduced the issue on the first try (using the procedure below), but unfortunately I have not been able to reproduce it since.

Version-Release number of selected component (if applicable):  kernel-2.6.18-128.1.10.el5, gfs2-utils-0.1.53-1.el5_3.3


How reproducible: Sometimes


Steps to Reproduce:
# mkfs.gfs2 -p lock_dlm -t jrummy5:gfs2-a -j 2 /dev/clust/gfs2-a
# mount /dev/clust/gfs2-a /mnt/gfs2-a  <-- on both nodes
# gfs2_tool settune /mnt/gfs2-a/ statfs_slow 1  <-- on both nodes
# cd /mnt/gfs2-a/
# dd if=/dev/zero of=./bigfile bs=1M count=3800
# df -k | grep lv1
# rm bigfile
# df -k | grep lv1
  
Actual results: I/O errors received and withdraw seen in logs:

Jun 10 16:09:18 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: fatal: filesystem consistency error
Jun 10 16:09:18 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0:   RG = 65388
Jun 10 16:09:18 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0:   function = gfs2_rgrp_verify, file = fs/gfs2/rgrp.c, line = 274
Jun 10 16:09:18 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: about to withdraw this file system
Jun 10 16:09:18 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: telling LM to withdraw
Jun 10 16:09:19 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: withdrawn
Jun 10 16:09:19 jrummy5-1 kernel: 
Jun 10 16:09:19 jrummy5-1 kernel: Call Trace:
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff88508526>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff800255df>] find_or_create_page+0x1e/0x75
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff80063097>] thread_return+0x62/0xfe
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff8851b6cf>] :gfs2:gfs2_consist_rgrpd_i+0x34/0x39
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff88517977>] :gfs2:gfs2_rgrp_verify+0x176/0x230
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff88519742>] :gfs2:gfs2_statfs_slow+0xc6/0x19b
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff88517ff9>] :gfs2:gfs2_rindex_hold+0x32/0x153
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff885139b0>] :gfs2:gfs2_statfs+0x46/0xa0
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff800db2c3>] vfs_statfs+0x63/0x7f
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff800db422>] vfs_statfs_native+0x13/0x34
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff800db4ea>] sys_statfs+0x3f/0x79
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff80066bcd>] do_page_fault+0x4fe/0x830
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff8000de64>] do_mmap_pgoff+0x66c/0x7d7
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff800b46ab>] audit_syscall_entry+0x16e/0x1a1
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Jun 10 16:09:19 jrummy5-1 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jun 10 16:09:19 jrummy5-1 kernel: 
Jun 10 16:09:19 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: used data mismatch:  65326 != 65326

Expected results: File is removed, space is reclaimed.  

Additional info: I'll keep trying to reproduce this and see whether there are repeatable steps that trigger it.

Comment 1 Steve Whitehouse 2009-06-11 14:45:47 UTC
Ben, does this look familiar from your recent work on statfs?

Comment 2 Robert Peterson 2009-06-11 15:18:55 UTC
Sounds more like file system corruption to me, which is detected
when the slow statfs traverses the rgrp linked list.

Comment 7 Ben Marzinski 2009-07-01 18:55:44 UTC
I have finally recreated this myself on a 3-node x86 cluster. However, I needed to run multiple dd, rm, and df commands on different nodes to hit it.

Comment 8 Ben Marzinski 2009-07-08 19:00:10 UTC
Created attachment 350981 [details]
Change to statfs_slow code to fix withdraws

This change in the statfs_slow code fixes the issue for me.

Since rgd->rd_dinodes counts both linked and unlinked inodes, it makes no sense to count unlinked inodes with the used data blocks (the first check that I changed). It makes sense to count them with the linked inodes (the second check), and it makes no sense to care whether there are more unlinked inodes than linked ones.

Comment 11 Linda Wang 2009-07-10 19:58:17 UTC
patch posted on 7/10/09: 3:38 PM; set dev ack.

Comment 13 Don Zickus 2009-07-14 20:57:23 UTC
in kernel-2.6.18-158.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 16 errata-xmlrpc 2009-09-02 08:54:27 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html