Red Hat Bugzilla – Bug 253590
GFS2: meta data corruption under heavy IOs
Last modified: 2007-11-30 17:07:47 EST
Description of problem:
This was previously bugzilla 251053, where a large number of no-op truncate
calls (setattr() with "request size" equal to "actual file size") would
corrupt inode meta data (only the di_blocks field) during SPECsfs
benchmark runs. We avoided the corruption in 251053 by creating a new
routine that handles the "equal size" special case without actually
generating a series of do-nothing IOs.
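For illustration only, the "equal size" special case can be pictured with a toy model like the one below. This is a sketch under assumptions, not the routine added in 251053: the names Inode, truncate_io and setattr_size are hypothetical, and the I/O is reduced to a counter.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    size: int
    io_count: int = 0  # toy counter of metadata I/Os issued for this inode


def truncate_io(inode, new_size):
    """Stand-in for the full truncate path: it always issues I/O."""
    inode.io_count += 1
    inode.size = new_size


def setattr_size(inode, new_size):
    """Return True if a real truncate ran, False if the call was a no-op."""
    if new_size == inode.size:
        # Requested size equals actual size: skip the truncate path entirely,
        # so no do-nothing I/O is generated.
        return False
    truncate_io(inode, new_size)
    return True
```

In this model a call such as setattr_size(ino, ino.size) leaves io_count untouched, which is the whole point of the special case: the heavy path is never entered for a request that changes nothing.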
This bugzilla is opened to track down the root cause of the corruption.
It is a rare event - it only happens on the performance group's machine
setup, which is equipped with 4 network interfaces, 4 NFS client machines,
56 filesystems and 128 NFSD threads, without the DLM lock protocol
(lock_nolock), with SPECsfs benchmarks running continuously for a while.
The corruption normally happens around 15000 ops per second - note that
this is the NFS op count (each NFS operation could carry a large amount of
file data itself).
Version-Release number of selected component (if applicable):
RHEL 5.1 41.el5 build
How reproducible:
On average once per 18 hours of running (1 in 3 benchmark runs - each
run lasts about 6 hours).
Steps to Reproduce:
1. Run SPECsfs with performance group's standard script
2. The corruption normally happens when ops per second exceed 15000.
Actual results:
A kernel assertion is triggered - the in-memory inode di_blocks count
is 0 and the on-disk di_blocks count is -1, but fsck (and/or examining
the disk by hand) expects the di_blocks count to be 1. All other parts
of the disk seem to be ok (at least fsck can't find anything).
The stack dump at the time of the assertion is:
GFS2: fsid=notaclu:sdr.0: fatal: filesystem consistency error
GFS2: fsid=notaclu:sdr.0: inode = 11679 47716
GFS2: fsid=notaclu:sdr.0: function = do_strip, file = fs/gfs2/bmap.c, line = 764
GFS2: fsid=notaclu:sdr.0: about to withdraw this file system
GFS2: fsid=notaclu:sdr.0: telling LM to withdraw
GFS2: fsid=notaclu:sdr.0: withdrawn
When corruption occurs, only the di_blocks count is off. All other meta data
fields seem to be ok.
Very possibly an issue with journal log flush - it looks like GFS2 is not able
to keep its meta buffer(s) in sync during heavy journal IOs.
Created attachment 161967
Combing through the code yesterday .. this is the only possibility I can think
of at this moment.
The thinking behind the patch in comment #3 .. assuming I can trust the rest
of gfs2's journal code.
The on-disk meta block is initialized by init_dinode(). After the buffer is
updated, the buffer head is released. VM memory reclaim then comes along and
yanks this buffer out. GFS2, still in the middle of the nfsd create code path,
gets to gfs2_setattr and invokes gfs2_meta_indirect_inode. Since the buffer is
not "new", it ends up calling ll_rw_block to read in garbage from disk. The fix
is to add this special buffer into the journal code's ip icache and let the
journal code manage it.
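The sequence described above can be pictured with a small toy model. This is only a sketch under assumptions, not GFS2 code: Disk, BufferCache and create_then_setattr are hypothetical names, "pinning" stands in for handing the buffer to the journal code, and the block number 7 is arbitrary.

```python
class Disk:
    def __init__(self):
        self.blocks = {}  # block number -> contents actually written to disk

    def read(self, blkno):
        # A block that was never written back still holds old contents.
        return self.blocks.get(blkno, b"garbage")


class BufferCache:
    def __init__(self, disk):
        self.disk = disk
        self.cache = {}      # block number -> in-memory contents
        self.pinned = set()  # blocks kept in memory until journal flush

    def init_block(self, blkno, data, pin):
        # Like initializing a new dinode: the contents exist only in memory.
        self.cache[blkno] = data
        if pin:
            self.pinned.add(blkno)

    def reclaim(self):
        # Memory pressure: every buffer that nobody holds is dropped.
        for blkno in list(self.cache):
            if blkno not in self.pinned:
                del self.cache[blkno]

    def get(self, blkno):
        # A buffer that is not in memory is read back from disk.
        if blkno not in self.cache:
            self.cache[blkno] = self.disk.read(blkno)
        return self.cache[blkno]

    def journal_flush(self):
        # Write the pinned buffers out, then let go of them.
        for blkno in self.pinned:
            self.disk.blocks[blkno] = self.cache[blkno]
        self.pinned.clear()


def create_then_setattr(pin):
    cache = BufferCache(Disk())
    cache.init_block(7, b"new dinode", pin)  # create
    cache.reclaim()                          # VM reclaim in between
    return cache.get(7)                      # setattr looks at the dinode
```

In this model create_then_setattr(False) hands back the stale on-disk contents, while create_then_setattr(True) still sees the freshly initialized dinode, because the pinned buffer survives reclaim until the flush.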
Doing another round of tests now.
So far so good - the first 6-hour test run just completed... keep going ...
NFS clients closed connections around the end of the 2nd run. Checked the
log - it looked like someone manually rebooted the client machines. The server
(GFS2) still works ok. Adding Barry to this bugzilla to inform him about the
work in progress. Would like to be able to run 3 times (with his script -
45 individual runs in total) before declaring this fixed.
Also starting to assess the impact of this bug, in particular if "write"
follows the "create" (we catch this issue with "setattr" follows "write"),
to see whether it is the cause of some other symptoms we have been seeing.
grr... s/we catch this issue with "setattr" follows "write"/
we catch this issue with "setattr" follows "create"/
The inode meta buffer needs to stay in memory, to prevent a disk read, until
the journal flush is done. I was checking the code to make sure I can use
i_cache for this purpose. The gfs2_meta_cache_flush() at the end of
gfs2_writepage() concerns me. I think it could blow my plan away (though
the benchmark still runs fine so far). Is there any reason GFS2 needs to
release the meta buffer after writepage?
Steve, never mind .. I just saw that writepage is used to sync the inode
(which ends up doing a journal flush). So I should be safe here. No worries.
Yes, I think the patch looks good. Can you send it via cluster-devel for
upstream and then I can put it in?
ok, looks good. Will package the patch.
Created attachment 162067
Testing was done on the 2.6.18-41.gfs2abhi.002 kernel without the do_touch
patch (for problem recreation). It has been running on bmarson's benchmark
machine and survived 4 full loops (60 runs, 26 hours total). Looks solid.
Upstream patch posted to cluster-devel... rhkernel-list will follow.
This one-liner could silently corrupt meta data in write too; it is not
isolated to the truncate call. We found it in the truncate call simply because
do_strip has a sanity check. Changing the bugzilla abstract accordingly.
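To make the point about the sanity check concrete, here is a toy sketch. It assumes the check amounts to "the block counter must be non-zero before it is decremented"; the function names are hypothetical and this is not the do_strip code.

```python
class ConsistencyError(Exception):
    """Raised when a block counter fails its sanity check."""


def strip_one_block(di_blocks):
    # Truncate-style path: refuse to decrement a counter that is already zero.
    if di_blocks == 0:
        raise ConsistencyError("di_blocks is 0 before decrement")
    return di_blocks - 1


def add_one_block(di_blocks):
    # Write-style path: no check, so a wrong starting value is carried along.
    return di_blocks + 1
```

If the counter should be 1 but a stale buffer says 0, strip_one_block(0) fails loudly, while add_one_block(0) quietly returns 1 where 2 would be correct. Under that assumption the write path would corrupt the count just as easily, only without anything to flag it.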
Sorry, I have caused a regression with this patch. I attached the buffer
to the directory inode, instead of the file inode itself. Need to revise.
The obvious symptom of the regression is that the directory databuf
cannot be released. It will fail module unload if anyone tries
to unload (rmmod) the gfs2.ko module:
Aug 22 18:15:41 salem kernel: slab error in kmem_cache_destroy(): cache `gfs2_bufdata': Can't free all objects
Aug 22 18:15:41 salem kernel:
Aug 22 18:15:41 salem kernel: Call Trace:
Aug 22 18:15:41 salem kernel: [<ffffffff800d27f1>] kmem_cache_destroy+0x7e/0x179
Aug 22 18:15:41 salem kernel: [<ffffffff8844b876>] :gfs2:exit_gfs2_fs+0x32/0x50
Aug 22 18:15:41 salem kernel: [<ffffffff800a0cdf>] sys_delete_module+0x196/0x1c5
Aug 22 18:15:41 salem kernel: [<ffffffff8005b28d>] tracesys+0xd5/0xe0
Created attachment 171499
RHEL 5 patch
The bh is now passed back to gfs2_createi. It is added to the icache before
gfs2_meta_inode_buffer is invoked. The previous patch left the bh hanging
around with a non-zero ref count, so the buffer never got released.
With this change, the databuf associated with this inode is now completely
managed by the journaling code. Hopefully the journaling code will hold up.
Re-doing the testing at this moment ...
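The ref count problem and the module unload symptom can be pictured with another toy model. Again this is a sketch under assumptions rather than kernel code: BufferHead, ObjectCache, create_leaky and create_balanced are hypothetical names, and the error text only echoes the slab message seen in the log.

```python
class BufferHead:
    """Toy stand-in for a buffer head with a reference count."""

    def __init__(self):
        self.refcount = 0

    def get(self):
        self.refcount += 1

    def put(self):
        if self.refcount == 0:
            raise RuntimeError("put() without a matching get()")
        self.refcount -= 1


class ObjectCache:
    """Toy stand-in for a cache that cannot be destroyed while in use."""

    def __init__(self):
        self.objects = []

    def alloc(self):
        bh = BufferHead()
        self.objects.append(bh)
        return bh

    def release_unreferenced(self):
        # Only objects that nobody references can be freed.
        self.objects = [bh for bh in self.objects if bh.refcount > 0]

    def destroy(self):
        self.release_unreferenced()
        if self.objects:
            raise RuntimeError("Can't free all objects")


def create_leaky(cache):
    bh = cache.alloc()
    bh.get()  # reference taken and never dropped


def create_balanced(cache):
    bh = cache.alloc()
    bh.get()
    try:
        pass  # use the buffer, hand it to its owner, and so on
    finally:
        bh.put()  # every get() has a matching put()
```

After create_balanced() the cache can be destroyed cleanly; after create_leaky() destroy() raises, which mirrors the module unload failure in this simplified picture.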
The patch looks good, so fingers crossed for the testing.
Created attachment 172411
Final patch - can be applied to both RHEL5 and git tree
Posted to cluster-devel.
Posted to rhkernel-list.
I have run the SPECsfs test suite on our BIGI large server with Abhi's
2.6.18-43.gfs2abhi.003 kernel. The test has run to completion successfully
twice. This was with the default scand time. I will be running additional
tests extending that time (since we found corruption happening there).
Of note, performance has dropped. The NFS write op seems to be taking ~20% more
time at all measurement points. We lost 1000 Ops of peak rate ... But so far we
seem stable ...
I'll report back as I find more ...
I noticed that too while running on Abhi's kernel - most likely from
bz 248480. However, the 248480 fix is too critical to back out. We'll have
to wait until we reach a stable stage before re-tuning the performance.
You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.