Bug 253590 - GFS2: metadata corruption under heavy I/O
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: All
OS: Linux
Priority: high
Severity: high
Assigned To: Don Zickus
QA Contact: GFS Bugs
Reported: 2007-08-20 14:39 EDT by Wendy Cheng
Modified: 2007-11-30 17:07 EST
Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Last Closed: 2007-11-07 15:00:32 EST

Attachments
test patch (506 bytes, patch) - 2007-08-21 09:27 EDT, Wendy Cheng
RHEL5 patch (468 bytes, patch) - 2007-08-22 11:02 EDT, Wendy Cheng
Revised patch (2.87 KB, patch) - 2007-08-23 13:40 EDT, Wendy Cheng
Final patch - can be applied to both RHEL5 and git tree (2.89 KB, patch) - 2007-08-24 09:02 EDT, Wendy Cheng

Description Wendy Cheng 2007-08-20 14:39:00 EDT
Description of problem:
This is previously bugzilla 251053, where a large number of no-op truncate
calls (setattr() with "requested size" equal to "actual file size") would
corrupt inode metadata (only the di_blocks field) during SPECsfs benchmark
runs. In 251053 we avoided the corruption by creating a new routine that
handles the "equal size" special case without actually generating a series
of do-nothing IOs.
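
For context, below is a minimal userspace sketch of the 251053 workaround idea (all names hypothetical; this is not the actual GFS2 routine): a size-setattr whose requested size matches the current size is short-circuited before any block-mapping I/O is generated.

/* Sketch of the bz 251053 workaround idea (hypothetical names, not the
 * actual GFS2 code): treat a truncate to the current size as a pure
 * attribute touch and skip the block-mapping I/O entirely. */
#include <stdint.h>
#include <stdio.h>

struct inode_model {
    uint64_t size;         /* current file size */
    uint64_t di_blocks;    /* allocated-block count in the dinode */
};

/* Would normally walk the metadata tree and strip or extend blocks,
 * generating journaled metadata I/O along the way. */
static int do_real_truncate(struct inode_model *ip, uint64_t newsize)
{
    ip->size = newsize;
    return 0;
}

static int setattr_size_model(struct inode_model *ip, uint64_t newsize)
{
    if (newsize == ip->size)
        return 0;          /* no-op truncate: touch timestamps only */
    return do_real_truncate(ip, newsize);
}

int main(void)
{
    struct inode_model ip = { .size = 4096, .di_blocks = 1 };

    setattr_size_model(&ip, 4096);   /* SPECsfs-style no-op setattr */
    printf("size=%llu di_blocks=%llu\n",
           (unsigned long long)ip.size, (unsigned long long)ip.di_blocks);
    return 0;
}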

This bugzilla was opened to track down the root cause of the corruption.

It is a rare event - it only happens on the performance group's machine
setup, which is equipped with 4 network interfaces, 4 NFS client machines,
56 filesystems, and 128 NFSD threads, without the DLM lock protocol
(lock_nolock), under SPECsfs benchmarks running continuously for a while.
The corruption normally appears around 15,000 ops per second - note that
this is the NFS op count (each NFS operation can itself carry a large
amount of file data).

Version-Release number of selected component (if applicable):
RHEL 5.1 41.el5 build

How reproducible:
On average, once per 18 hours of testing (1 in 3 benchmark runs; each
run lasts about 6 hours).

Steps to Reproduce:
1. Run SPECsfs with performance group's standard script
2. Corruption normally happens when ops per second exceed 15,000.
  
Actual results:
A kernel assertion is triggered - the in-memory inode's di_blocks count
is 0 and the on-disk di_blocks count is -1, but fsck (and/or examining
the disk by hand) expects the di_blocks count to be 1. All other parts
of the disk seem to be ok (at least fsck can't find anything).
The stack dump at the time of the assertion is:

GFS2: fsid=notaclu:sdr.0: fatal: filesystem consistency error
GFS2: fsid=notaclu:sdr.0:   inode = 11679 47716
GFS2: fsid=notaclu:sdr.0:   function = do_strip, file = fs/gfs2/bmap.c, line = 764
GFS2: fsid=notaclu:sdr.0: about to withdraw this file system
GFS2: fsid=notaclu:sdr.0: telling LM to withdraw
GFS2: fsid=notaclu:sdr.0: withdrawn

Call Trace:
[<ffffffff88560f42>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
[<ffffffff88564ca9>] :gfs2:gfs2_pin+0xca/0x10d
[<ffffffff88563b2d>] :gfs2:buf_lo_add+0x23/0xf2
[<ffffffff88570ec5>] :gfs2:gfs2_statfs_change+0xc3/0xd2
[<ffffffff8856ebaf>] :gfs2:rgblk_free+0x139/0x159
[<ffffffff8857265b>] :gfs2:gfs2_consist_inode_i+0x43/0x48
[<ffffffff88555921>] :gfs2:do_strip+0x271/0x350
[<ffffffff88554840>] :gfs2:recursive_scan+0xf4/0x177
[<ffffffff8855e379>] :gfs2:gfs2_glmutex_lock+0xaa/0xb3
[<ffffffff8855495c>] :gfs2:trunc_dealloc+0x99/0xc1
[<ffffffff885556b0>] :gfs2:do_strip+0x0/0x350
[<ffffffff88570000>] :gfs2:gfs2_statfs_slow+0x85/0x19b
[<ffffffff88555fb8>] :gfs2:gfs2_truncatei+0x5b8/0x5f6
[<ffffffff8856970f>] :gfs2:gfs2_setattr+0xa6/0x2ce
[<ffffffff8855e41c>] :gfs2:gfs2_glock_dq+0x9a/0xaa
[<ffffffff88569699>] :gfs2:gfs2_setattr+0x30/0x2ce
[<ffffffff80062536>] __down_write_nested+0x12/0x92
[<ffffffff8002c63d>] notify_change+0x14b/0x2e1
[<ffffffff88511438>] :nfsd:nfsd_setattr+0x349/0x4ce
[<ffffffff88512b08>] :nfsd:nfsd_create_v3+0x34b/0x4ea
[<ffffffff88518755>] :nfsd:nfsd3_proc_create+0x12f/0x140
[<ffffffff8850d1c4>] :nfsd:nfsd_dispatch+0xd7/0x198
[<ffffffff8843f518>] :sunrpc:svc_process+0x44d/0x70b
[<ffffffff800625cf>] __down_read+0x12/0x92
[<ffffffff8850d54d>] :nfsd:nfsd+0x0/0x2db
[<ffffffff8850d6fb>] :nfsd:nfsd+0x1ae/0x2db
[<ffffffff8005bfb1>] child_rip+0xa/0x11
[<ffffffff8850d54d>] :nfsd:nfsd+0x0/0x2db
[<ffffffff8850d54d>] :nfsd:nfsd+0x0/0x2db
[<ffffffff8005bfa7>] child_rip+0x0/0x11

Expected results:
No corruption. 

Additional info:
Comment 1 Wendy Cheng 2007-08-20 14:52:24 EDT
When corruption occurs, only the di_blocks count is off. All other metadata
fields seem to be ok.
Comment 2 Wendy Cheng 2007-08-20 14:58:39 EDT
Very possibly an issue with journal log flush - it looks like GFS2 is not
able to keep its meta buffer(s) in sync during heavy journal IOs.
Comment 3 Wendy Cheng 2007-08-21 09:27:13 EDT
Created attachment 161967
test patch

Combed through the code yesterday .. this is the only possibility I can
think of at the moment.
Comment 4 Wendy Cheng 2007-08-21 10:44:25 EDT
The thinking behind the patch in comment #3 .. assuming I can trust the
rest of gfs2's journal code.

The on-disk meta block is initialized by init_dinode(). After the buffer is
updated, the buffer head is released. Then VM memory reclaim comes along and
yanks this buffer out. GFS2, still in the middle of the nfsd create code
path, reaches gfs2_setattr, which invokes gfs2_meta_indirect_buffer. Since
the buffer is no longer "new", it ends up calling ll_rw_block to read
garbage back in from disk. The fix is to add this special buffer into the
journal code's per-inode icache and let the journal code manage it.
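
To make the suspected sequence concrete, below is a minimal userspace model (all names hypothetical; this is not GFS2 code) of the race just described: the freshly initialized buffer gets reclaimed before the journal writes it back, so the next metadata lookup re-reads stale contents from disk.

/* Userspace model of the suspected race (hypothetical names, not actual
 * GFS2 code). A buffer initialized in memory is reclaimed before it is
 * ever written back; the next reader then sees stale on-disk data. */
#include <stdio.h>

struct buffer_model { int data; int cached; };

static int disk_block = -1;                /* stale on-disk di_blocks */
static struct buffer_model buf = { 0, 0 };

static void init_dinode_model(void)        /* init_dinode() analogue */
{
    buf.data = 1;                          /* correct di_blocks value */
    buf.cached = 1;                        /* buffer lives in the cache */
}

static void vm_reclaim_model(void)         /* memory pressure drops the buffer */
{
    buf.cached = 0;                        /* journal never flushed it */
}

static int meta_read_model(void)           /* metadata lookup analogue */
{
    if (!buf.cached)
        buf.data = disk_block;             /* ll_rw_block(): garbage from disk */
    return buf.data;
}

int main(void)
{
    init_dinode_model();
    vm_reclaim_model();                    /* the race window */
    printf("di_blocks seen after reclaim: %d (expected 1)\n",
           meta_read_model());
    /* The fix keeps the buffer referenced in the inode's icache until the
     * journal has written it, so meta_read_model() would return 1. */
    return 0;
}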

Doing another round of test now. 
Comment 5 Wendy Cheng 2007-08-21 15:34:21 EDT
So far so good - first 6-hours test run just completed... keep going ...
Comment 6 Wendy Cheng 2007-08-21 22:09:04 EDT
NFS clients closed their connections around the end of the 2nd run. Checked
the log - it looked like someone manually rebooted the client machines. The
server (GFS2) still works ok. Adding Barry to this bugzilla to inform him
about the work in progress. Would like to be able to run 3 times (with his
script - 45 individual runs in total) before declaring this fixed.

Also starting to assess the impact of this bug, particularly if "write"
follows "create" (we caught this issue with "setattr" following "write"),
to see whether it is the cause of some other symptoms we have been seeing.
Comment 7 Wendy Cheng 2007-08-21 22:11:29 EDT
grr... correction to comment #6: that should read "setattr" following
"create", not "setattr" following "write".
Comment 8 Wendy Cheng 2007-08-22 00:29:26 EDT
Steve,

The inode meta buffer needs to stay in memory to prevent a disk read until
the journal flush is done. I was checking the code to make sure I can use
i_cache for this purpose. The gfs2_meta_cache_flush() at the end of
gfs2_writepage() concerns me. I think it could blow my plan away (though
the benchmark still runs well so far). Is there any reason GFS2 needs to
release the meta buffer after writepage?
Comment 9 Wendy Cheng 2007-08-22 00:44:40 EDT
Steve, nm .. just saw that writepage is used to sync the inode (which ends
up doing a journal flush). So I should be safe here. No worries.
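
A tiny model of that ordering argument (hypothetical names, not the actual GFS2 call chain): because the journal flush happens before the meta-cache flush, the eviction cannot expose stale on-disk data to the next reader.

/* Ordering model (hypothetical names): the journal flush precedes the
 * meta-cache flush in writepage, so dropping the cached buffer is safe. */
#include <assert.h>

static int journal_flushed;    /* 1 once dirty metadata has hit the disk */
static int buffer_cached = 1;  /* the inode's cached meta buffer */

static void journal_flush_model(void) { journal_flushed = 1; }

static void writepage_model(void)
{
    journal_flush_model();     /* syncing the inode flushes the journal... */
    assert(journal_flushed);   /* ...so the on-disk copy is current */
    buffer_cached = 0;         /* gfs2_meta_cache_flush() analogue */
}

int main(void) { writepage_model(); return 0; }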
Comment 10 Steve Whitehouse 2007-08-22 08:29:13 EDT
Yes, I think the patch looks good. Can you send it via cluster-devel for
upstream and then I can put it in?
Comment 11 Wendy Cheng 2007-08-22 08:44:14 EDT
ok, looks good. Will package the patch.
Comment 12 Wendy Cheng 2007-08-22 11:02:17 EDT
Created attachment 162067
RHEL5 patch
Comment 13 Rob Kenna 2007-08-22 15:17:23 EDT
bumped priority
Comment 14 Wendy Cheng 2007-08-22 16:44:14 EDT
Testing was done on the 2.6.18-41.gfs2abhi.002 kernel without the do_touch
patch (for problem recreation). It has been running on bmarson's benchmark
machine and has survived 4 full loops (60 runs, 26 hours total). Looks solid.

Upstream patch posted to cluster-devel... rhkernel-list will follow.

This one-liner could silently corrupt metadata in write too; it is not
isolated to the truncate call. We found it in the truncate call simply
because do_strip has a sanity check. Changing the bugzilla abstract
accordingly.
Comment 15 Wendy Cheng 2007-08-23 10:23:47 EDT
Sorry, I have caused a regression with this patch. I attached the buffer
to the directory inode, instead of the file inode itself. Need to revise
the patch.
Comment 16 Wendy Cheng 2007-08-23 10:58:05 EDT
The obvious symptom of the regression is that the directory databuf can
never be released. Module unload will fail if anyone tries to remove the
gfs2.ko module:

Aug 22 18:15:41 salem kernel: slab error in kmem_cache_destroy(): cache
`gfs2_bufdata': Can't free all objects
Aug 22 18:15:41 salem kernel:
Aug 22 18:15:41 salem kernel: Call Trace:
Aug 22 18:15:41 salem kernel:  [<ffffffff800d27f1>] kmem_cache_destroy+0x7e/0x179
Aug 22 18:15:41 salem kernel:  [<ffffffff8844b876>] :gfs2:exit_gfs2_fs+0x32/0x50
Aug 22 18:15:41 salem kernel:  [<ffffffff800a0cdf>] sys_delete_module+0x196/0x1c5
Aug 22 18:15:41 salem kernel:  [<ffffffff8005b28d>] tracesys+0xd5/0xe0


Comment 17 Wendy Cheng 2007-08-23 13:40:19 EDT
Created attachment 171499
Revised patch

RHEL 5 patch

The bh is now passed back to gfs2_createi. It is added into the icache
before gfs2_meta_inode_buffer is invoked. The previous patch left the bh
hanging around with a non-zero ref count, so the buffer never got released.

With this change, the databuf associated with this inode is completely
managed by the journaling code. Hopefully the journaling code will hold up.
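
As a rough illustration of the ref-count problem this revision fixes (a userspace model with hypothetical names, not the actual patch): the first attempt took a reference on the bh that nothing ever dropped, so the slab cache could never be emptied at module unload; the revision hands the bh back to the caller, which registers it with the journal, and the journal drops the reference once the flush completes.

/* Userspace model of the buffer-ownership bug (hypothetical names).
 * v1: init takes a reference that nobody ever drops -> leak, and
 *     kmem_cache_destroy()-style teardown fails at module unload.
 * v2: init returns the buffer to its caller, which hands ownership to
 *     the journal; the journal drops the reference after the flush. */
#include <assert.h>

struct bh_model { int refcount; };

static void journal_manage(struct bh_model *bh)
{
    bh->refcount--;            /* journal owns it, drops it after flush */
}

static void init_dinode_v1(struct bh_model *bh)
{
    bh->refcount++;            /* referenced, but never released: leak */
}

static void init_dinode_v2(struct bh_model *bh, struct bh_model **bhp)
{
    bh->refcount++;
    *bhp = bh;                 /* pass the bh back to the caller (createi) */
}

int main(void)
{
    struct bh_model a = { 0 }, b = { 0 }, *bp;

    init_dinode_v1(&a);        /* v1: ref leaks; unload would fail */
    assert(a.refcount == 1);

    init_dinode_v2(&b, &bp);   /* v2: caller registers bh with the journal */
    journal_manage(bp);        /* before any re-read of the dinode buffer */
    assert(b.refcount == 0);   /* buffer can be freed at unload time */
    return 0;
}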
Comment 18 Wendy Cheng 2007-08-23 14:38:38 EDT
Re-doing the testing at this moment ...
Comment 19 Steve Whitehouse 2007-08-23 15:40:24 EDT
The patch looks good, so fingers crossed for the testing.
Comment 20 Wendy Cheng 2007-08-24 09:02:34 EDT
Created attachment 172411 [details]
Final patch - can be applied to both RHEL5 and git tree
Comment 21 Wendy Cheng 2007-08-24 09:05:09 EDT
Posted to cluster-devel. 
Comment 22 Wendy Cheng 2007-08-24 09:28:02 EDT
Posted to rhkernel-list. 
Comment 23 Barry Marson 2007-08-28 17:24:00 EDT
I have run the SPECsfs test suite on our BIGI large server with Abhi's
2.6.18-43.gfs2abhi.003 kernel.  The test has run to completion twice
successfully.  This was with the default scand time.  I will be running
additional tests extending that time (since we found corruption happening
there).

Of note, performance has dropped.  The NFS write op seems to be taking ~20%
more time at all measurement points.  We lost 1000 ops peak rate ... but so
far we seem stable ...

I'll report back as I find more ...

Thanks,

Barry
Comment 24 Wendy Cheng 2007-08-28 17:31:47 EDT
I noticed that too while running on Abhi's kernel - most likely from
bz 248480. However, the 248480 fix is too critical to back out. We'll have
to wait until we reach a stable stage before re-tuning the performance.
Comment 25 Don Zickus 2007-08-28 18:34:51 EDT
in 2.6.18-44.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 28 errata-xmlrpc 2007-11-07 15:00:32 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html
