Bug 164331

Summary:

fatal: filesystem consistency error during umount of GFS

Product:

[Retired] Red Hat Cluster Suite

Reporter:

Henry Harris <henry.harris>

Component:

gfs

Assignee:

Abhijith Das <adas>

Status:

CLOSED ERRATA

QA Contact:

GFS Bugs <gfs-bugs>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

CC:

axel.thimm, kanderso, nobody+wcheng, nstraz, rkenna

Target Milestone:

---

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

RHBA-2006-0234

Doc Type:

Bug Fix

Last Closed:

2006-03-09 19:45:59 UTC

Bug Depends On:

Bug Blocks:

164915

Attachments:

Description	Flags
This patch seems to make the crash go away; it still needs further confirmation and testing.	none

Description Henry Harris 2005-07-27 00:22:50 UTC

Description of problem: Ran shutdown -r on all three nodes of a three-node 
cluster.  One node rebooted; the other two hung during shutdown, 
showing the following on the console:

Jul 26 15:03:34 igtest01 kernel: GFS: fsid=igtest:snapfrom1.1: fatal: 
filesystem consistency error
Jul 26 15:03:34 igtest01 kernel: GFS: fsid=igtest:snapfrom1.1:   function = 
trans_go_xmote_bh
Jul 26 15:03:34 igtest01 kernel: GFS: fsid=igtest:snapfrom1.1:   file = 
fs/gfs/glops.c, line = 542
Jul 26 15:03:34 igtest01 kernel: GFS: fsid=igtest:snapfrom1.1:   time = 
1122411814
Jul 26 15:03:34 igtest01 kernel: GFS: fsid=igtest:snapfrom1.1: about to 
withdraw from the cluster
Jul 26 15:03:34 igtest01 kernel: GFS: fsid=igtest:snapfrom1.1: waiting for 
outstanding I/O
Jul 26 15:03:34 igtest01 kernel: GFS: fsid=igtest:snapfrom1.1: telling LM to 
withdraw
Jul 26 15:03:35 igtest01 kernel: lock_dlm: withdraw abandoned memory


The above sequence was repeated for multiple filesystems and was also seen 
in /var/log/messages on the one node that rebooted successfully.
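
For triage, the withdraw report above can be reduced to a handful of fields. The following is a minimal Python sketch, assuming the console lines have been unwrapped to one message per line. The helper name parse_withdraw_report and both regular expressions are illustrative and not part of any GFS tooling. It also assumes the "time =" value is a Unix epoch timestamp in seconds, which is consistent with the syslog timestamps shown.

```python
import re
from datetime import datetime, timezone

# Illustrative patterns (not part of GFS): the fsid prefix and the
# "key = value" detail fields printed after a fatal consistency error.
FSID_RE = re.compile(r"GFS: fsid=(\S+?):\s")
FIELD_RE = re.compile(r"\b(function|file|line|time) = ([^\s,]+)")


def parse_withdraw_report(text):
    """Return {fsid: {field: value}} for GFS fatal-error detail lines."""
    reports = {}
    for line in text.splitlines():
        match = FSID_RE.search(line)
        if match is None:
            continue
        fields = reports.setdefault(match.group(1), {})
        for key, value in FIELD_RE.findall(line):
            fields[key] = value
    for fields in reports.values():
        if "line" in fields:
            fields["line"] = int(fields["line"])
        if "time" in fields:
            # Assumption: "time" is seconds since the Unix epoch.
            fields["utc"] = datetime.fromtimestamp(
                int(fields["time"]), tz=timezone.utc
            )
    return reports


# Console lines from the description, unwrapped to one message per line.
SAMPLE = """\
Jul 26 15:03:34 igtest01 kernel: GFS: fsid=igtest:snapfrom1.1: fatal: filesystem consistency error
Jul 26 15:03:34 igtest01 kernel: GFS: fsid=igtest:snapfrom1.1:   function = trans_go_xmote_bh
Jul 26 15:03:34 igtest01 kernel: GFS: fsid=igtest:snapfrom1.1:   file = fs/gfs/glops.c, line = 542
Jul 26 15:03:34 igtest01 kernel: GFS: fsid=igtest:snapfrom1.1:   time = 1122411814
"""

for fsid, fields in parse_withdraw_report(SAMPLE).items():
    print(fsid, fields["function"], fields["file"], fields["line"],
          fields["utc"].isoformat())
```

Grouping by fsid makes it easy to check whether every affected filesystem failed at the same function and source line.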


How reproducible:
Have not tried to reproduce.


Comment 1 Corey Marthaler 2005-09-20 16:03:15 UTC

I appear to have reproduced this on the x86_64 link cluster (on link-01)
while running regression tests. One node (link-08) was shot by link-01 after
missing heartbeats, and I then started cleaning up the cluster in order to start
the tests again. When I attempted to umount the GFS on link-01, it oopsed:

GFS: fsid=LINK_128:vedder.0: fatal: filesystem consistency error
GFS: fsid=LINK_128:vedder.0:   function = trans_go_xmote_bh
GFS: fsid=LINK_128:vedder.0:   file =
/usr/src/build/614138-x86_64/BUILD/gfs-kernel-2.6.9-42/smp/src/gfs/glops.c, line
= 542
GFS: fsid=LINK_128:vedder.0:   time = 1127213653
GFS: fsid=LINK_128:vedder.0: about to withdraw from the cluster
GFS: fsid=LINK_128:vedder.0: waiting for outstanding I/O
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lm:190
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_gulm(U) lock_harness(U)
md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket
pcmcia_core dm_mod ohci_hcd hw_random tg3 floppy ext3 jbd qla2300 qla2xxx
scsi_transport_fc sd_mod scsi_mod
Pid: 14991, comm: gulm_Cb_Handler Tainted: G   M  2.6.9-20.ELsmp
RIP: 0010:[<ffffffffa01ec767>] <ffffffffa01ec767>{:gfs:gfs_lm_withdraw+215}
RSP: 0018:000001002da6fc58  EFLAGS: 00010202
RAX: 0000000000000039 RBX: ffffff00001e48b8 RCX: 0000000100000000
RDX: ffffffff803d7748 RSI: 0000000000000246 RDI: ffffffff803d7740
RBP: ffffff00001ac000 R08: ffffffff803d7748 R09: ffffff00001e48b8
R10: ffffffff8011de14 R11: ffffffff8011de14 R12: 000001003bac67bc
R13: 000001003bac6790 R14: ffffff00001ac000 R15: 0000000000000003
FS:  0000002a95574b00(0000) GS:ffffffff804d2f00(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000004570b6 CR3: 000000003ff38000 CR4: 00000000000006e0
Process gulm_Cb_Handler (pid: 14991, threadinfo 000001002da6e000, task
00000100363a2030)
Stack: 0000003000000030 000001002da6fd68 000001002da6fc78 00000000054c7d70
       0000000000000000 0000000000000000 ffffff00001e48b8 ffffff00001e48b8
       ffffffffa0205520 ffffff00001e48b8
Call Trace:<ffffffffa0204c60>{:gfs:gfs_consist_i+45}
<ffffffffa01e572b>{:gfs:trans_go_xmote_bh+154}
       <ffffffffa01e3793>{:gfs:xmote_bh+897}
<ffffffffa01e5122>{:gfs:gfs_glock_cb+194}
       <ffffffffa01c723a>{:lock_gulm:handler+394}
<ffffffff80132dcd>{default_wake_function+0}
       <ffffffff80132dcd>{default_wake_function+0}
<ffffffff80131bad>{finish_task_switch+55}
       <ffffffff80110ca3>{child_rip+8} <ffffffffa01c70b0>{:lock_gulm:handler+0}
       <ffffffff80110c9b>{child_rip+0}

Code: 0f 0b 71 86 20 a0 ff ff ff ff be 00 8b 85 98 88 03 00 85 c0
RIP <ffffffffa01ec767>{:gfs:gfs_lm_withdraw+215} RSP <000001002da6fc58>
 <0>Sep 20 05:54:13 link-01 sshd(pam_unix)[16671]: session opened for user root
by (uid=0)
Sep 20 05:54:13 link-01 kernel: GFS: fsid=LINK_128:vedder.0: fatal: filesystem
consistency error
Sep 20 05:54:13 link-01 kernel: GFS: fsid=LINK_128:vedder.0:   function =
trans_go_xmote_bh
Sep 20 05:54:13 link-01 kernel: GFS: fsid=LINK_128:vedder.0:   file =
/usr/src/build/614138-x86_64/BUILD/gfs-kernel-2.6.9-42/smp/src/gfs/glops.c, line
= 542
Sep 20 05:54:13 link-01 kernel: GFS: fsid=LINK_128:vedder.0:   time = 1127213653
Sep 20 05:54:13 link-01 kernel: GFS: fsid=LINK_128:vedder.0: about to withdraw
from the cluster
Sep 20 05:54:13 link-01 kernel: GFS: fsid=LINK_128:vedder.0: waiting for
outstanding I/O

Message from syslogd@link-01 at Tue Sep 20 05:54:13 2005 ...
link-01 kernel: invalid operand: 0000 [1] SMP
Kernel panic - not syncing: Oops

Comment 2 Corey Marthaler 2005-10-04 18:56:36 UTC

FYI - hit this oops again today while trying to unmount a gfs filesystem on link-02.

Comment 3 Axel Thimm 2005-10-04 21:20:46 UTC

This looks similar to bug #169693.

Comment 4 Corey Marthaler 2005-12-13 18:36:56 UTC

*** Bug 175539 has been marked as a duplicate of this bug. ***

Comment 5 Ben Marzinski 2006-01-04 19:05:38 UTC

*** Bug 169693 has been marked as a duplicate of this bug. ***

Comment 6 Nate Straz 2006-01-18 14:29:31 UTC

I hit this again today after finishing up testing on 2.6.9-22.0.2.EL.

Jan 17 23:16:31 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: fatal: filesystem consistency error
Jan 17 23:16:31 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4:   function = trans_go_xmote_bh
Jan 17 23:16:31 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4:   file = /usr/src/build/678343-i686/BUILD/gfs-kernel-2.6.9-45/up/src/gfs/glops.c, line = 542
Jan 17 23:16:31 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4:   time = 1137561391
Jan 17 23:16:31 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: about to withdraw from the cluster
Jan 17 23:16:31 tank-05 kernel: GFS: fsid=tank-cluster:gfs0.4: waiting for outstanding I/O
Jan 17 23:16:31 tank-05 kernel: ------------[ cut here ]------------
Jan 17 23:16:31 tank-05 kernel: kernel BUG at /usr/src/build/678343-i686/BUILD/gfs-kernel-2.6.9-45/up/src/gfs/lm.c:190!
Jan 17 23:16:31 tank-05 kernel: invalid operand: 0000 [#1]
Jan 17 23:16:31 tank-05 kernel: Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) parport_pc lp parport autofs4 i2c_dev i2c_core dlm(U) cman(U) md5 ipv6 sunrpc button battery ac uhci_hcd hw_random shpchp e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Jan 17 23:16:31 tank-05 kernel: CPU:    0
Jan 17 23:16:31 tank-05 kernel: EIP:    0060:[<f8ced988>]    Not tainted VLI
Jan 17 23:16:31 tank-05 kernel: EFLAGS: 00010202   (2.6.9-22.0.2.EL)
Jan 17 23:16:31 tank-05 kernel: EIP is at gfs_lm_withdraw+0x50/0xbc [gfs]
Jan 17 23:16:31 tank-05 kernel: eax: 0000003b   ebx: f8c81890   ecx: f8d0e755   edx: f7067e40
Jan 17 23:16:31 tank-05 kernel: esi: f8c6d000   edi: 000004a0   ebp: f8c6d000   esp: f7067e54
Jan 17 23:16:31 tank-05 kernel: ds: 007b   es: 007b   ss: 0068
Jan 17 23:16:31 tank-05 kernel: Process lock_dlm1 (pid: 3639, threadinfo=f7067000 task=f6474cd0)
Jan 17 23:16:31 tank-05 kernel: Stack: f8c6d000 f6ac86e0 f8d0a999 f8c6d000 f8d12aeb f8c81890 f8c81890 f8d0b20c
Jan 17 23:16:31 tank-05 kernel:        f8c81890 f8d0d477 0000021e f8c81890 43cdcf2f f8ce4b12 f8d0d477 0000021e
Jan 17 23:16:31 tank-05 kernel:        01161970 00000008 00000000 00000000 00000320 00000000 00000000 00000000
Jan 17 23:16:31 tank-05 kernel: Call Trace:
Jan 17 23:16:31 tank-05 kernel:  [<f8d0a999>] gfs_consist_i+0x24/0x28 [gfs]
Jan 17 23:16:31 tank-05 kernel:  [<f8ce4b12>] trans_go_xmote_bh+0x86/0xbc [gfs]
Jan 17 23:16:31 tank-05 kernel:  [<f8ce00d3>] xmote_bh+0x660/0x7a1 [gfs]
Jan 17 23:16:31 tank-05 kernel:  [<f8ce252b>] gfs_glock_cb+0xa2/0x12f [gfs]
Jan 17 23:16:31 tank-05 kernel:  [<f8c99ae0>] process_complete+0x3af/0x3b7 [lock_dlm]
Jan 17 23:16:31 tank-05 kernel:  [<f8c99e79>] dlm_async+0x391/0x416 [lock_dlm]
Jan 17 23:16:31 tank-05 kernel:  [<c011cf22>] default_wake_function+0x0/0xc
Jan 17 23:16:31 tank-05 kernel:  [<c030e08f>] schedule+0x43f/0x552
Jan 17 23:16:31 tank-05 kernel:  [<c011cf22>] default_wake_function+0x0/0xc
Jan 17 23:16:31 tank-05 kernel:  [<f8c99ae8>] dlm_async+0x0/0x416 [lock_dlm]
Jan 17 23:16:31 tank-05 kernel:  [<c013972d>] kthread+0x69/0x91
Jan 17 23:16:31 tank-05 kernel:  [<c01396c4>] kthread+0x0/0x91
Jan 17 23:16:31 tank-05 kernel:  [<c01041d9>] kernel_thread_helper+0x5/0xb
Jan 17 23:16:31 tank-05 kernel: Code: ff 74 24 14 e8 a4 3c 43 c7 53 68 23 e7 d0 f8 e8 88 3c 43 c7 53 68 55 e7 d0 f8 e8 7d 3c 43 c7 83 c4 18 83 be 34 02 00 00 00 74 08 <0f> 0b be 00 59 e6 d0 f8 8b 86 70 48 01 00 85 c0 74 1b b8 00 f0
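
To compare call paths across reports, the frames of an i386-style trace like the one above can be pulled out mechanically. This is a rough Python sketch only: the function name call_chain and the frame regular expression are assumptions made for illustration, and the pattern covers only the "[<addr>] func+0xoff/0xlen [module]" layout seen in this comment, not the x86_64 layout in comment 1.

```python
import re

# Illustrative pattern for i386-style oops frames such as
#   [<f8d0a999>] gfs_consist_i+0x24/0x28 [gfs]
# The trailing "[module]" part is optional (absent for core kernel symbols).
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def call_chain(oops_text, module=None):
    """List function names in trace order, optionally for one module only."""
    frames = []
    for func, mod in FRAME_RE.findall(oops_text):
        if module is None or mod == module:
            frames.append(func)
    return frames


# First frames of the trace from this comment, syslog prefix removed.
SAMPLE = """\
Call Trace:
 [<f8d0a999>] gfs_consist_i+0x24/0x28 [gfs]
 [<f8ce4b12>] trans_go_xmote_bh+0x86/0xbc [gfs]
 [<f8ce00d3>] xmote_bh+0x660/0x7a1 [gfs]
 [<f8ce252b>] gfs_glock_cb+0xa2/0x12f [gfs]
 [<f8c99ae0>] process_complete+0x3af/0x3b7 [lock_dlm]
 [<c011cf22>] default_wake_function+0x0/0xc
"""

print(" <- ".join(call_chain(SAMPLE, module="gfs")))
```

Filtering on the gfs module leaves only the frames that lead from the lock callback to the consistency check, which is the part worth comparing between nodes and kernel versions.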

Comment 11 Nate Straz 2006-02-14 15:35:45 UTC

I hit this after running on 2.6.9-31.ELsmp.

Feb 14 01:17:13 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.0: fatal:
filesystem consistency error
Feb 14 01:17:13 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.0:   function =
trans_go_xmote_bh
Feb 14 01:17:13 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.0:   file =
/usr/src/build/700436-i686/BUILD/gfs-kernel-2.6.9-48/smp/src/gfs/glops.c, line = 542
Feb 14 01:17:13 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.0:   time = 1139901433
Feb 14 01:17:13 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.0: about to
withdraw from the cluster
Feb 14 01:17:13 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.0: waiting for
outstanding I/O
Feb 14 01:17:13 morph-04 kernel: ------------[ cut here ]------------
Feb 14 01:17:13 morph-04 kernel: kernel BUG at
/usr/src/build/700436-i686/BUILD/gfs-kernel-2.6.9-48/smp/src/gfs/lm.c:190!
Feb 14 01:17:13 morph-04 kernel: invalid operand: 0000 [#1]
Feb 14 01:17:13 morph-04 kernel: SMP
Feb 14 01:17:13 morph-04 kernel: Modules linked in: lock_dlm(U) parport_pc lp
parport autofs4 i2c_dev i2c_core gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6
sunrpc button battery ac uhci_hcd e7xxx_edac edac_mc hw_random e1000 floppy dm_snapshot dm_zero
dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Feb 14 01:17:13 morph-04 kernel: CPU:    1
Feb 14 01:17:13 morph-04 kernel: EIP:    0060:[<f8cd14a7>]    Not tainted VLI
Feb 14 01:17:13 morph-04 kernel: EFLAGS: 00010202   (2.6.9-31.ELsmp)
Feb 14 01:17:13 morph-04 kernel: EIP is at gfs_lm_withdraw+0x51/0xc0 [gfs]
Feb 14 01:17:13 morph-04 kernel: eax: 0000003c   ebx: f8c87730   ecx: f6863e34 
 edx: f8ced5e5
Feb 14 01:17:13 morph-04 kernel: esi: f8c63000   edi: f6d9eed0   ebp: f7e41f10 
 esp: f6863e48
Feb 14 01:17:13 morph-04 kernel: ds: 007b   es: 007b   ss: 0068
Feb 14 01:17:13 morph-04 kernel: Process lock_dlm1 (pid: 4073,
threadinfo=f6863000 task=f74c5430)
Feb 14 01:17:13 morph-04 kernel: Stack: f8c63000 f7e41e64 f8ce9ef7 f8c63000
f8cf0b25 f8c87730 f8c87730 f8cea738
Feb 14 01:17:13 morph-04 kernel:        f8c87730 f8cec505 0000021e f8c87730
43f183f9 f8cca65c f8cec505 0000021e
Feb 14 01:17:13 morph-04 kernel:        01161970 00000008 00000000 00000000
00000320 00000000 00000000 00000000
Feb 14 01:17:13 morph-04 kernel: Call Trace:
Feb 14 01:17:13 morph-04 kernel:  [<f8ce9ef7>] gfs_consist_i+0x24/0x28 [gfs]
Feb 14 01:17:13 morph-04 kernel:  [<f8cca65c>] trans_go_xmote_bh+0x86/0xbc [gfs]
Feb 14 01:17:13 morph-04 kernel:  [<f8cc7404>] xmote_bh+0x312/0x3ab [gfs]
Feb 14 01:17:13 morph-04 kernel:  [<f8cc8adc>] gfs_glock_cb+0xa3/0x131 [gfs]
Feb 14 01:17:13 morph-04 kernel:  [<f8c9d6dd>] process_complete+0x3b7/0x3bf
[lock_dlm]
Feb 14 01:17:13 morph-04 kernel:  [<f8c9d95b>] dlm_async+0x276/0x2ff [lock_dlm]
Feb 14 01:17:13 morph-04 kernel:  [<c011e6fb>] default_wake_function+0x0/0xc
Feb 14 01:17:13 morph-04 kernel:  [<c011e6fb>] default_wake_function+0x0/0xc
Feb 14 01:17:13 morph-04 kernel:  [<f8c9d6e5>] dlm_async+0x0/0x2ff [lock_dlm]
Feb 14 01:17:13 morph-04 kernel:  [<c0133ead>] kthread+0x73/0x9b
Feb 14 01:17:13 morph-04 kernel:  [<c0133e3a>] kthread+0x0/0x9b
Feb 14 01:17:13 morph-04 kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
Feb 14 01:17:13 morph-04 kernel: Code: ff 74 24 14 e8 a6 11 45 c7 53 68 b3 d5 ce
f8 e8 8a 11 45 c7 53 68 e5 d5 ce f8 e8 7f 11 45 c7 83 c4 18 83 be 34 02 00 00 00
74 08 <0f> 0b be 00 e8 d4 ce f8 8b 86 10 47 02 00 85 c0 74 1c ba 02 00

Comment 12 Abhijith Das 2006-02-28 17:17:55 UTC

Wendy's fix for this bug is in CVS.
When granting the exclusive lock, gfs_glock_cb() expects that all other threads
have relinquished their writes and that the journal has been flushed and shut down.
Otherwise it aborts the call and forces a filesystem consistency error.
The current umount code (gfs_put_super) violates this expectation: it
flushes without shutting down the log before the exclusive lock is requested.
The patch works around the issue by moving the flushes into
gfs_make_fs_ro() itself, after the gfs_glock_nq_init() call.

Comment 15 Red Hat Bugzilla 2006-03-09 19:45:59 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0234.html