Description of problem:

I have an HA cluster with an iSCSI-based GFS2 filesystem. Everything works fine, except that every few days I get a GFS2 failure on a single (but random) node:

GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1: fatal: filesystem consistency error
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1: inode = 2064 17819491
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1: function = gfs2_dinode_dealloc, file = fs/gfs2/inode.c, line = 352
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1: about to withdraw this file system
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1: telling LM to unmount
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1: withdrawn
Pid: 7064, comm: delete_workqueu Tainted: G I---------------- T 2.6.32-131.4.1.el6.x86_64 #1
Call Trace:
[<ffffffffa04f6fd2>] ? gfs2_lm_withdraw+0x102/0x130 [gfs2]
[<ffffffffa04cc209>] ? trunc_dealloc+0xa9/0x130 [gfs2]
[<ffffffffa04f71dd>] ? gfs2_consist_inode_i+0x5d/0x60 [gfs2]
[<ffffffffa04dc584>] ? gfs2_dinode_dealloc+0x64/0x210 [gfs2]
[<ffffffffa04f51da>] ? gfs2_delete_inode+0x1ba/0x280 [gfs2]
[<ffffffffa04f50ad>] ? gfs2_delete_inode+0x8d/0x280 [gfs2]
[<ffffffffa04f5020>] ? gfs2_delete_inode+0x0/0x280 [gfs2]
[<ffffffff8118cfbe>] ? generic_delete_inode+0xde/0x1d0
[<ffffffffa04d9940>] ? delete_work_func+0x0/0x80 [gfs2]
[<ffffffff8118d115>] ? generic_drop_inode+0x65/0x80
[<ffffffffa04f3c4e>] ? gfs2_drop_inode+0x2e/0x30 [gfs2]
[<ffffffff8118bf82>] ? iput+0x62/0x70
[<ffffffffa04d9994>] ? delete_work_func+0x54/0x80 [gfs2]
[<ffffffff810887d0>] ? worker_thread+0x170/0x2a0
[<ffffffff8108e100>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81088660>] ? worker_thread+0x0/0x2a0
[<ffffffff8108dd96>] ? kthread+0x96/0xa0
[<ffffffff8100c1ca>] ? child_rip+0xa/0x20
[<ffffffff8108dd00>] ? kthread+0x0/0xa0
[<ffffffff8100c1c0>] ? child_rip+0x0/0x20
no_formal_ino = 2064
no_addr = 17819491
i_disksize = 106496
blocks = 0
i_goal = 14946110
i_diskflags = 0x00000000
i_height = 1
i_depth = 0
i_entries = 0
i_eattr = 0
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.1: gfs2_delete_inode: -5
gdlm_unlock 5,10fe763 err=-22
INFO: task gfs2_logd:7076 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gfs2_logd D 0000000000000006 0 7076 2 0x00000000
ffff8802348c5dd0 0000000000000046 0000000000000000 0000000032632afc
ffff88019fb37130 0000000000000441 ffff8802348c5d70 0000000103e01059
ffff8802348c3af8 ffff8802348c5fd8 000000000000f598 ffff8802348c3af8
Call Trace:
[<ffffffff814db013>] io_schedule+0x73/0xc0
[<ffffffffa04df16a>] gfs2_log_flush+0x44a/0x6b0 [gfs2]
[<ffffffff8108e100>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa04df4b1>] gfs2_logd+0xe1/0x150 [gfs2]
[<ffffffffa04df3d0>] ? gfs2_logd+0x0/0x150 [gfs2]
[<ffffffff8108dd96>] kthread+0x96/0xa0
[<ffffffff8100c1ca>] child_rip+0xa/0x20
[<ffffffff8108dd00>] ? kthread+0x0/0xa0
[<ffffffff8100c1c0>] ? child_rip+0x0/0x20

The identical gfs2_logd hung-task message and call trace then repeat six more times.

The non-failing member of the cluster shows this at the exact time of the failing node's filesystem consistency error:

GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.0: jid=1: Trying to acquire journal lock...
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.0: jid=1: Looking at journal...
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.0: jid=1: Acquiring the transaction lock...
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.0: jid=1: Replaying journal...
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.0: jid=1: Replayed 2189 of 5302 blocks
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.0: jid=1: Found 298 revoke tags
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.0: jid=1: Journal replayed in 6s
GFS2: fsid=CDTG-HA-Cluster:CDTG_GFS2_LV.0: jid=1: Done

Is this a known problem with GFS2 on iSCSI, or something else completely?

Version-Release number of selected component (if applicable):
RHEL 6.1, all updates, kernel-2.6.32-131.4.1.el6.x86_64
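For anyone comparing several of these withdraws across nodes, a small script can pull the key fields (fsid, function, file, line) out of the kernel log. This is only a rough sketch: the function name `summarize_withdraws` and the regular expressions are my own, written against the message layout shown in the log above, and other kernel versions may format these lines differently.

```python
import re
from typing import Dict, Iterable, List

# Assumed layout, taken from the log excerpt in this report:
#   GFS2: fsid=<cluster>:<fs>.<n>: <message>
GFS2_LINE = re.compile(r"^GFS2: fsid=(?P<fsid>\S+): (?P<msg>.*)$")
LOCATION = re.compile(
    r"function = (?P<function>\w+), file = (?P<file>\S+), line = (?P<line>\d+)"
)


def summarize_withdraws(log_lines: Iterable[str]) -> List[Dict[str, object]]:
    """Collect one record per 'fatal:' GFS2 message and the lines that follow it."""
    events: List[Dict[str, object]] = []
    current = None
    for raw in log_lines:
        match = GFS2_LINE.match(raw.strip())
        if not match:
            continue  # not a GFS2 message (call trace, hung-task report, ...)
        fsid, msg = match.group("fsid"), match.group("msg")
        if msg.startswith("fatal:"):
            # A new consistency error starts a new record.
            current = {
                "fsid": fsid,
                "error": msg[len("fatal:"):].strip(),
                "withdrawn": False,
            }
            events.append(current)
        elif current is not None and current["fsid"] == fsid:
            location = LOCATION.search(msg)
            if location:
                current.update(location.groupdict())
            elif msg == "withdrawn":
                current["withdrawn"] = True
                current = None  # this record is complete
    return events
```

Feeding it the lines of /var/log/messages from each node should give one record per withdraw, which makes it easier to see whether the failure always comes from the same function and source line.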
If you're a RHEL customer, please contact Red Hat GSS (Global Support Services) so we can keep them in the loop.

This consistency error looks like other problems we've recently solved. If you are willing to try a test kernel, I recommend:

(1) You run the latest fsck.gfs2 on your file system to eliminate possible corruption. I recommend you redirect the output for later examination:
fsck.gfs2 -y /dev/your/device &> ~/fsck.out

(2) You try an experimental new kernel that has our latest GFS2 fixes. I think we've got a good candidate, but I need to check a few things. Then perhaps I can put it on my people page for you to try. Let me know if that's okay with you.

I'll check on which kernel is best to solve this, and track down the bugzilla record I'm thinking of. I could be wrong too, so no promises. I'll look into this tomorrow.
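One caution on step (1): fsck.gfs2 must only be run while the file system is unmounted on every node in the cluster. As a rough sketch of a local safety check, the wrapper below refuses to run if the device still shows up in /proc/mounts and otherwise runs the same command as above with output saved to a file. The helper names `is_mounted` and `run_fsck` are made up for illustration, and the check covers only the node it runs on, so the other nodes still have to be checked separately.

```python
import os
import subprocess


def is_mounted(device: str, mounts_text: str) -> bool:
    """Return True if `device` is a mount source in /proc/mounts-style text.

    Paths are resolved with realpath so that LVM aliases such as
    /dev/<vg>/<lv> and /dev/mapper/<vg>-<lv> compare equal on a real system.
    """
    target = os.path.realpath(device)
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields and os.path.realpath(fields[0]) == target:
            return True
    return False


def run_fsck(device: str, log_path: str, mounts_path: str = "/proc/mounts") -> int:
    """Run 'fsck.gfs2 -y <device>' with stdout and stderr saved to log_path."""
    with open(mounts_path) as mounts:
        if is_mounted(device, mounts.read()):
            raise RuntimeError(device + " is still mounted on this node")
    # Equivalent to: fsck.gfs2 -y <device> &> <log_path>
    with open(os.path.expanduser(log_path), "wb") as log:
        result = subprocess.run(
            ["fsck.gfs2", "-y", device], stdout=log, stderr=subprocess.STDOUT
        )
    return result.returncode
```

Something like `run_fsck("/dev/your/device", "~/fsck.out")` would then stand in for the one-liner, with the device path replaced by the real one.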
Running fsck.gfs2 now... Sure, I can run a test kernel to see if that clears this up. This happens 2 to 4 times a week, so I should know in a week or so if this fixes this issue. Thank you!
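As a back-of-the-envelope check on how long a quiet period is convincing: if the withdraws are assumed to arrive independently at a steady rate (a Poisson model, which is my assumption and not something measured here), the chance of seeing no failure purely by luck can be estimated as below. The function names are illustrative.

```python
import math


def prob_no_failure(rate_per_week: float, weeks: float) -> float:
    """Probability of zero failures in `weeks` if failures are Poisson
    with the given weekly rate (i.e. the bug is NOT actually fixed)."""
    return math.exp(-rate_per_week * weeks)


def weeks_for_confidence(rate_per_week: float, confidence: float = 0.95) -> float:
    """Failure-free weeks needed before a lucky quiet spell becomes
    less likely than 1 - confidence."""
    return -math.log(1.0 - confidence) / rate_per_week


# Rates taken from the "2 to 4 times a week" estimate above.
for rate in (2.0, 4.0):
    print(rate, prob_no_failure(rate, 1.0), weeks_for_confidence(rate))
```

At the low end of 2 failures a week, one clean week still leaves roughly a 13.5% chance that the problem is merely hiding, so about a week and a half without a withdraw is a safer bar under this model.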
Hi Joshua, I put the experimental kernel here if you want to try it: http://people.redhat.com/rpeterso/Experimental/RHEL6.x/kernel* I'm going to set the NEEDINFO flag until I hear back whether this solved the problem. If it does, I'll close this as a duplicate.
Ok... running this kernel now on all nodes. Thanks again.