Bug 713949

Summary: GFS2 filesystem consistency error routine on all cluster nodes
Product: Red Hat Enterprise Linux 6 Reporter: joshua
Component: kernel Assignee: Robert Peterson <rpeterso>
Status: CLOSED DUPLICATE QA Contact: Cluster QE <mspqa-list>
Severity: unspecified Docs Contact:
Priority: low    
Version: 6.1 CC: adas, anprice, bmarzins, rpeterso, rwheeler, swhiteho
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-06-18 18:16:54 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Bug Depends On: 712139    
Bug Blocks:    

Description joshua 2011-06-16 20:50:34 UTC
Description of problem:

I have an HA cluster with an iSCSI-based GFS2 filesystem.  Everything works fine, except that every few days I get a GFS2 failure on a single (but random) node:


GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1: fatal: filesystem consistency error
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1:   inode = 2064 17819491
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1:   function = gfs2_dinode_dealloc, file = fs/gfs2/inode.c, line = 352
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1: about to withdraw this file system
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1: telling LM to unmount
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1: withdrawn


Pid: 7064, comm: delete_workqueu Tainted: G          I---------------- T 2.6.32-131.4.1.el6.x86_64 #1
Call Trace:
 [<ffffffffa04f6fd2>] ? gfs2_lm_withdraw+0x102/0x130 [gfs2]
 [<ffffffffa04cc209>] ? trunc_dealloc+0xa9/0x130 [gfs2]
 [<ffffffffa04f71dd>] ? gfs2_consist_inode_i+0x5d/0x60 [gfs2]
 [<ffffffffa04dc584>] ? gfs2_dinode_dealloc+0x64/0x210 [gfs2]
 [<ffffffffa04f51da>] ? gfs2_delete_inode+0x1ba/0x280 [gfs2]
 [<ffffffffa04f50ad>] ? gfs2_delete_inode+0x8d/0x280 [gfs2]
 [<ffffffffa04f5020>] ? gfs2_delete_inode+0x0/0x280 [gfs2]
 [<ffffffff8118cfbe>] ? generic_delete_inode+0xde/0x1d0
 [<ffffffffa04d9940>] ? delete_work_func+0x0/0x80 [gfs2]
 [<ffffffff8118d115>] ? generic_drop_inode+0x65/0x80
 [<ffffffffa04f3c4e>] ? gfs2_drop_inode+0x2e/0x30 [gfs2]
 [<ffffffff8118bf82>] ? iput+0x62/0x70
 [<ffffffffa04d9994>] ? delete_work_func+0x54/0x80 [gfs2]
 [<ffffffff810887d0>] ? worker_thread+0x170/0x2a0
 [<ffffffff8108e100>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81088660>] ? worker_thread+0x0/0x2a0
 [<ffffffff8108dd96>] ? kthread+0x96/0xa0
 [<ffffffff8100c1ca>] ? child_rip+0xa/0x20
 [<ffffffff8108dd00>] ? kthread+0x0/0xa0
 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
  no_formal_ino = 2064
  no_addr = 17819491
  i_disksize = 106496
  blocks = 0
  i_goal = 14946110
  i_diskflags = 0x00000000
  i_height = 1
  i_depth = 0
  i_entries = 0
  i_eattr = 0
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1: gfs2_delete_inode: -5
gdlm_unlock 5,10fe763 err=-22
INFO: task gfs2_logd:7076 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gfs2_logd     D 0000000000000006     0  7076      2 0x00000000
 ffff8802348c5dd0 0000000000000046 0000000000000000 0000000032632afc
 ffff88019fb37130 0000000000000441 ffff8802348c5d70 0000000103e01059
 ffff8802348c3af8 ffff8802348c5fd8 000000000000f598 ffff8802348c3af8
Call Trace:
 [<ffffffff814db013>] io_schedule+0x73/0xc0
 [<ffffffffa04df16a>] gfs2_log_flush+0x44a/0x6b0 [gfs2]
 [<ffffffff8108e100>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa04df4b1>] gfs2_logd+0xe1/0x150 [gfs2]
 [<ffffffffa04df3d0>] ? gfs2_logd+0x0/0x150 [gfs2]
 [<ffffffff8108dd96>] kthread+0x96/0xa0
 [<ffffffff8100c1ca>] child_rip+0xa/0x20
 [<ffffffff8108dd00>] ? kthread+0x0/0xa0
 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
(The identical hung-task report for gfs2_logd:7076, with the same registers and call trace, repeats six more times.)
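
A minimal sketch, not part of the original report, of how the key fields in the log above could be pulled out so that withdraw events can be compared across nodes. The regular expressions and the `summarize` and `errno_name` helpers are assumptions made for illustration; the sample lines are copied from the log. The two numbers on the `inode =` line match `no_formal_ino` and `no_addr` in the inode dump, and the negative return values are decoded with Python's `errno` table (on Linux, -5 is EIO and -22 is EINVAL).

```python
import errno
import re

# Sample lines copied verbatim from the failing node's log.
SAMPLE = """\
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1: fatal: filesystem consistency error
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1:   inode = 2064 17819491
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1:   function = gfs2_dinode_dealloc, file = fs/gfs2/inode.c, line = 352
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1: gfs2_delete_inode: -5
gdlm_unlock 5,10fe763 err=-22
"""

# Hypothetical patterns, written only against the message shapes seen above.
INODE_RE = re.compile(r"inode = (\d+) (\d+)$")
SITE_RE = re.compile(r"function = (\w+), file = (\S+), line = (\d+)$")
RETVAL_RE = re.compile(r"(\w+): (-\d+)$")
DLM_RE = re.compile(r"^(\w+) .*err=(-\d+)$")


def errno_name(value):
    """Map a kernel-style negative return value to its symbolic errno name."""
    return errno.errorcode.get(abs(value), "unknown")


def summarize(log):
    """Collect the inode, the reporting call site, and decoded error codes."""
    summary = {"errors": {}}
    for line in log.splitlines():
        line = line.rstrip()
        m = INODE_RE.search(line)
        if m:
            summary["no_formal_ino"] = int(m.group(1))
            summary["no_addr"] = int(m.group(2))
            continue
        m = SITE_RE.search(line)
        if m:
            summary["function"] = m.group(1)
            summary["file"] = m.group(2)
            summary["line"] = int(m.group(3))
            continue
        m = RETVAL_RE.search(line) or DLM_RE.search(line)
        if m:
            summary["errors"][m.group(1)] = errno_name(int(m.group(2)))
    return summary


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Run against a saved copy of each node's messages, this would show whether every withdraw is reported from the same function and line, which is the kind of detail that helps match a report to an existing bug.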


The non-failing member of the cluster logs this at the exact time of the failing node's filesystem consistency error:

GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.0: jid=1: Trying to acquire journal lock...
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.0: jid=1: Looking at journal...
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.0: jid=1: Acquiring the transaction lock...
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.0: jid=1: Replaying journal...
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.0: jid=1: Replayed 2189 of 5302 blocks
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.0: jid=1: Found 298 revoke tags
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.0: jid=1: Journal replayed in 6s
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.0: jid=1: Done


Is this a known problem with GFS2 on iSCSI, or something else completely?




Version-Release number of selected component (if applicable):

RHEL 6.1, all updates, kernel-2.6.32-131.4.1.el6.x86_64

Comment 2 Robert Peterson 2011-06-16 21:31:02 UTC
If you're a RHEL customer, please contact Red Hat GSS (Global
Support Services) so we can keep them in the loop.

This consistency error looks like other problems we've recently
solved.  If you are willing to try a test kernel, I recommend:

(1) You run the latest fsck.gfs2 on your file system to
    eliminate possible corruption.  I recommend you redirect
    the output for later examination:
    fsck.gfs2 -y /dev/your/device &> ~/fsck.out
(2) You try an experimental new kernel that has our latest GFS2
    fixes.  I think we've got a good candidate, but I need to
    check a few things.  Then perhaps I can put it on my
    people page for you to try.  Let me know if that's okay
    with you.
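
The redirect in step (1) could also be scripted so the output is kept per run. This is a hedged sketch, assuming Python 3 and that `fsck.gfs2` is on the PATH; `fsck_command` and `run_fsck` are hypothetical helper names, and `/dev/your/device` is the placeholder from the step above, not a real path. Note that fsck.gfs2 is meant to be run only while the file system is unmounted on every node.

```python
import subprocess


def fsck_command(device):
    """Build the argument list for the check described in step (1)."""
    # -y answers "yes" to every repair prompt, as in the command above.
    return ["fsck.gfs2", "-y", device]


def run_fsck(device, log_path):
    """Run the check, writing stdout and stderr to log_path.

    Returns the process exit status so the caller can decide what to do.
    """
    with open(log_path, "wb") as log:
        result = subprocess.run(
            fsck_command(device),
            stdout=log,
            stderr=subprocess.STDOUT,  # same effect as the shell's "&>"
        )
    return result.returncode


# Example call (placeholder device; run only with the file system unmounted):
# status = run_fsck("/dev/your/device", "fsck.out")
```

Keeping the exit status alongside the saved output makes it easier to tell a clean pass from a run that found and repaired damage.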

I'll check on which kernel is best to solve this, and track down
the bugzilla record I'm thinking of.  I could be wrong too, so
no promises.  I'll look into this tomorrow.

Comment 3 joshua 2011-06-16 21:59:10 UTC
Running fsck.gfs2 now...

Sure, I can run a test kernel to see if that clears this up.  This happens 2 to 4 times a week, so I should know in a week or so whether it fixes the issue.

Thank you!

Comment 5 Robert Peterson 2011-06-17 19:12:04 UTC
Hi Joshua,

I put the experimental kernel here if you want to try it:

http://people.redhat.com/rpeterso/Experimental/RHEL6.x/kernel*

I'm going to set the NEEDINFO flag until I hear back whether
this solved the problem.  If it does, I'll close this as a
duplicate.

Comment 6 joshua 2011-06-17 21:10:24 UTC
Ok... running this kernel now on all nodes.  Thanks again.