Bug 453897 - Consistent kernel panics with most of our 3 GFS nodes all pointing to the same line and file: Kernel panic: GFS: Assertion failed on line 1227 of file rgrp.c
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: GFS-kernel
Version: 3
Hardware: i686
OS: Linux
Priority: low
Severity: urgent
Target Milestone: ---
Assignee: Ben Marzinski
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-07-03 02:42 UTC by Dennis
Modified: 2010-01-12 03:20 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-09-26 16:22:20 UTC
Embargoed:


Attachments (Terms of Use)

Description Dennis 2008-07-03 02:42:40 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.9) Gecko/2008052912 Firefox/3.0

Description of problem:
We're getting consistent kernel panics on most of our GFS nodes, all
pointing to the same line and file:

Kernel panic: GFS: Assertion failed on line 1227 of file rgrp.c

Three nodes are configured as lock_gulm servers and also as GFS clients, with one GNBD storage server. Two of the nodes are IBM x3650s, the third is an IBM x346, and another IBM x346 acts as the GNBD server.

The GNBD server serves only one GFS file system: /home, a 500 GB volume mounted on all 3 nodes that holds the mailboxes for our mail systems.

Currently all nodes are running the same versions of the kernel, GFS, and
the GFS modules, as listed below.

- GFS-modules-smp-6.0.2.27-0
- GFS-6.0.2.27-0

- 2.4.21-37.ELsmp
Jun 28 20:35:49 drgenesis kernel: e5b61bac f8ea8b72 00000246 00001000 e5a44100 f8fc4000 e5a44100 f8ea8d70
Jun 28 20:35:49 drgenesis kernel:        00000246 000001f0 00000000 553a5ba8 f4d1f330 00000005 00000004 ffffffff
Jun 28 20:35:49 drgenesis kernel:        f8ec26d0 f8ec9c8f f8ec9bc4 000004cb 00000016 efcf3e00 00000006 f8fc4000
Jun 28 20:35:49 drgenesis kernel: Call Trace:   [<f8ea8b72>] gfs_asserti [gfs] 0x32 (0xe5b61bb0)
Jun 28 20:35:49 drgenesis kernel: [<f8ea8d70>] gmalloc [gfs] 0x20 (0xe5b61bc8)
Jun 28 20:35:49 drgenesis kernel: [<f8ec26d0>] blkalloc_internal [gfs] 0x130 (0xe5b61bec)
Jun 28 20:35:49 drgenesis kernel: [<f8ec9c8f>] .rodata.str1.1 [gfs] 0x1da3 (0xe5b61bf0)
Jun 28 20:35:49 drgenesis kernel: [<f8ec9bc4>] .rodata.str1.1 [gfs] 0x1cd8 (0xe5b61bf4)
Jun 28 20:35:49 drgenesis kernel: [<f8ec2b8b>] gfs_blkalloc [gfs] 0x7b (0xe5b61c20)
Jun 28 20:35:49 drgenesis kernel: [<f8e9c90c>] get_datablock [gfs] 0xfc (0xe5b61c4c)
Jun 28 20:35:49 drgenesis kernel: [<f8e9cc43>] gfs_block_map [gfs] 0x333 (0xe5b61c70)
Jun 28 20:35:49 drgenesis kernel: [<c0149093>] find_or_create_page [kernel] 0x63 (0xe5b61c9c)
Jun 28 20:35:49 drgenesis kernel: [<f8e8d08c>] gfs_dgetblk [gfs] 0x3c (0xe5b61cec)
Jun 28 20:35:49 drgenesis kernel: [<f8ec17bb>] gfs_rgrp_read [gfs] 0xab (0xe5b61d10)
Jun 28 20:35:49 drgenesis kernel: [<f8e96239>] get_block [gfs] 0xb9 (0xe5b61d28)
Jun 28 20:35:49 drgenesis kernel: [<c016814b>] __block_prepare_write [kernel] 0x1ab (0xe5b61d64)
Jun 28 20:35:49 drgenesis kernel: [<c0168b09>] block_prepare_write [kernel] 0x39 (0xe5b61da8)
Jun 28 20:35:49 drgenesis kernel: [<f8e96180>] get_block [gfs] 0x0 (0xe5b61dbc)
Jun 28 20:35:49 drgenesis kernel: [<f8e968fc>] gfs_prepare_write [gfs] 0x12c (0xe5b61dc8)
Jun 28 20:35:49 drgenesis kernel: [<f8e96180>] get_block [gfs] 0x0 (0xe5b61dd8)
Jun 28 20:35:49 drgenesis kernel: [<c014c053>] do_generic_file_write [kernel] 0x1e3 (0xe5b61df4)
Jun 28 20:35:49 drgenesis kernel: [<f8e90bab>] do_do_write [gfs] 0x2ab (0xe5b61e48)
Jun 28 20:35:49 drgenesis kernel: [<f8e90feb>] do_write [gfs] 0x18b (0xe5b61e94)
Jun 28 20:35:49 drgenesis kernel: [<f8e8ef1e>] gfs_walk_vma [gfs] 0x12e (0xe5b61ed0)
Jun 28 20:35:49 drgenesis kernel: [<f8eab4d7>] gfs_glock_nq_init [gfs] 0x37 (0xe5b61f2c)
Jun 28 20:35:49 drgenesis kernel: [<f8eab513>] gfs_glock_dq_uninit [gfs] 0x13 (0xe5b61f3c)
Jun 28 20:35:49 drgenesis kernel: [<f8e8ede7>] gfs_llseek [gfs] 0xc7 (0xe5b61f48)
Jun 28 20:35:49 drgenesis kernel: [<f8e910c1>] gfs_write [gfs] 0x91 (0xe5b61f6c)
Jun 28 20:35:49 drgenesis kernel: [<f8e90e60>] do_write [gfs] 0x0 (0xe5b61f80)
Jun 28 20:35:49 drgenesis kernel: [<c0164b27>] sys_write [kernel] 0x97 (0xe5b61f94)
Jun 28 20:35:49 drgenesis kernel:
Jun 28 20:35:49 drgenesis kernel: Kernel panic: GFS: Assertion failed on line 1227 of file rgrp.c
Jun 28 20:35:49 drgenesis kernel: GFS: assertion: "x <= length"
Jun 28 20:35:49 drgenesis kernel: GFS: time = 1214656549
Jun 28 20:35:49 drgenesis kernel: GFS: fsid=alpha:home.2: RG = 64975595
Jun 28 20:35:49 drgenesis kernel:
Jun 29 11:00:06 drgenesis syslogd 1.4.1: restart.
Jun 29 11:00:06 drgenesis syslog: syslogd startup succeeded



Version-Release number of selected component (if applicable):
kernel-2.4.21-37.ELsmp,  GFS-modules-smp-6.0.2.27-0 , GFS-6.0.2.27-0 

How reproducible:
Always


Steps to Reproduce:
If one of the 3 nodes fails, we do this manually:
1. Load the gfs modules (gnbd,gfs,pool,lock_gulm)
2. Start the gnbd_import
3. Start the pool,ccsd,lock_gulmd and gfs
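The manual recovery steps above can be sketched as a small script. This is a hypothetical sketch, not the reporter's actual procedure: the `gnbd_import -i` invocation, the `service` names, and the `storage-server` hostname are assumptions based on the report; check your own init scripts before using it. With `DRY_RUN=1` (the default here) it only prints the commands.

```shell
#!/bin/sh
# Hypothetical sketch of rejoining a failed node to the cluster.
# DRY_RUN=1 (default) prints the commands instead of executing them.
DRY_RUN=${DRY_RUN-1}
run() {
    if [ -n "$DRY_RUN" ]; then echo "would run: $*"; else "$@"; fi
}

# 1. Load the kernel modules this cluster uses
for mod in gnbd gfs pool lock_gulm; do
    run modprobe "$mod"
done

# 2. Import the GNBD device(s) exported by the storage server
run gnbd_import -i storage-server   # "storage-server" is a placeholder hostname

# 3. Start the cluster services in dependency order
for svc in pool ccsd lock_gulmd gfs; do
    run service "$svc" start
done
```

Unset `DRY_RUN` (`DRY_RUN= sh rejoin.sh`) to actually execute the commands once the sequence has been reviewed.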

Actual Results:
8 to 12 hours after the node rejoins the cluster, one or two nodes panic with the errors above.

Expected Results:


Additional info:

Comment 1 Ben Marzinski 2008-07-03 20:29:37 UTC
I see that you filed the bug under gnbd-kernel. Has anything happened to lead
you to believe that gnbd is the cause of this problem?

Also, is it possible to upgrade to the most recent kernel and GFS-modules packages?

Comment 2 Ben Marzinski 2008-07-03 20:38:56 UTC
What kind of load are you running on the filesystems?

Comment 3 Dennis 2008-07-04 07:32:36 UTC
(In reply to comment #1)
> I see that you filed the bug under gnbd-kernel, has anything happened to lead
> you to believe that gnbd is the cause of this problem?
> 
> Also, is it possible to upgrade to the most recent kernel and GFS-modules
> packages?

We already upgraded to newer kernel and GFS versions yesterday
(kernel-smp-2.4.21-50.EL.i686.rpm, GFS-modules-smp-6.0.2.30-0.i386.rpm and
GFS-6.0.2.30-0.i386.rpm), but to no avail. We'll try running gfs_fsck on it,
but it might take 8 to 11 hours for 300+ GB.
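For reference, a gfs_fsck run on a shared filesystem like this might look as follows. This is a hedged sketch: the pool device path `/dev/pool/home_pool` is a placeholder (the real name comes from the pool configuration), and the filesystem must be unmounted on every node first. With `DRY_RUN=1` (the default here) it only prints the commands.

```shell
#!/bin/sh
# Hypothetical sketch of an offline gfs_fsck run; device path is a placeholder.
DEV=${DEV-/dev/pool/home_pool}
DRY_RUN=${DRY_RUN-1}
run() {
    if [ -n "$DRY_RUN" ]; then echo "would run: $*"; else "$@"; fi
}

# Unmount on this node (and, manually, on every other node) before checking.
run umount /home

# -y answers yes to all repair prompts; -v prints progress, which helps on a
# multi-hour check of a 300+ GB filesystem.
run gfs_fsck -y -v "$DEV"
```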


Comment 4 Dennis 2008-07-04 07:38:23 UTC
Here's the new error we encountered after the kernel and GFS upgrade on one of
our nodes. This server serves POP.

Jul  4 10:18:11 drexodus kernel: Bad metadata at 64975751, should be 5
Jul  4 10:18:11 drexodus kernel:   mh_magic = 0x01161970
Jul  4 10:18:11 drexodus kernel:   mh_type = 4
Jul  4 10:18:11 drexodus kernel:   mh_generation = 375
Jul  4 10:18:11 drexodus kernel:   mh_format = 400
Jul  4 10:18:11 drexodus kernel:   mh_incarn = 123
Jul  4 10:18:11 drexodus kernel: db6a3b8c f8f1afa2 00000001 c0387e98 00000000 00000246 00000012 00000000
Jul  4 10:18:11 drexodus kernel:        c01298c3 0000000a 00000400 f8f3b831 db6a3bfc cde536b0 00000030 00000000
Jul  4 10:18:11 drexodus kernel:        f8f0052d f8f3c848 f8f3a43a 000004e5 00000013 f8f67000 db6a3bf8 cde53810
Jul  4 10:18:11 drexodus kernel: Call Trace:   [<f8f1afa2>] gfs_asserti [gfs] 0x32 (0xdb6a3b90)
Jul  4 10:18:11 drexodus kernel: [<c01298c3>] printk [kernel] 0x153 (0xdb6a3bac)
Jul  4 10:18:11 drexodus kernel: [<f8f3b831>] .rodata.str1.1 [gfs] 0x14c5 (0xdb6a3bb8)
Jul  4 10:18:11 drexodus kernel: [<f8f0052d>] gfs_get_meta_buffer [gfs] 0x29d (0xdb6a3bcc)
Jul  4 10:18:11 drexodus kernel: [<f8f3c848>] .rodata.str1.4 [gfs] 0x3bc (0xdb6a3bd0)
Jul  4 10:18:11 drexodus kernel: [<f8f3a43a>] .rodata.str1.1 [gfs] 0xce (0xdb6a3bd4)
Jul  4 10:18:11 drexodus kernel: [<f8f0ec3b>] gfs_block_map [gfs] 0x2eb (0xdb6a3c2c)
Jul  4 10:18:11 drexodus kernel: [<c011c610>] flush_tlb_all_ipi [kernel] 0x0 (0xdb6a3c54)
Jul  4 10:18:11 drexodus kernel: [<c01629a8>] map_new_virtual [kernel] 0x1a8 (0xdb6a3c9c)
Jul  4 10:18:11 drexodus kernel: [<f8f08249>] get_block [gfs] 0xb9 (0xdb6a3ce4)
Jul  4 10:18:11 drexodus kernel: [<c0168dd6>] block_read_full_page [kernel] 0x2e6 (0xdb6a3d20)
Jul  4 10:18:11 drexodus kernel: [<c0159ba4>] __alloc_pages [kernel] 0xc4 (0xdb6a3d60)
Jul  4 10:18:11 drexodus kernel: [<f8f086e2>] gfs_readpage [gfs] 0x82 (0xdb6a3d84)
Jul  4 10:18:11 drexodus kernel: [<f8f08190>] get_block [gfs] 0x0 (0xdb6a3d8c)
Jul  4 10:18:11 drexodus kernel: [<c0148cca>] add_to_page_cache_unique [kernel] 0x5a (0xdb6a3d90)
Jul  4 10:18:11 drexodus kernel: [<c0148f21>] page_cache_read [kernel] 0xe1 (0xdb6a3da4)
Jul  4 10:18:11 drexodus kernel: [<c0149947>] generic_file_readahead [kernel] 0xd7 (0xdb6a3dcc)
Jul  4 10:18:11 drexodus kernel: [<c0149f24>] do_generic_file_read [kernel] 0x4d4 (0xdb6a3de8)
Jul  4 10:18:11 drexodus kernel: [<c014a7db>] generic_file_new_read [kernel] 0xbb (0xdb6a3e28)
Jul  4 10:18:11 drexodus kernel: [<c014a620>] file_read_actor [kernel] 0x0 (0xdb6a3e38)
Jul  4 10:18:11 drexodus kernel: [<c014a91f>] generic_file_read [kernel] 0x3f (0xdb6a3e7c)
Jul  4 10:18:11 drexodus kernel: [<f8f01aa4>] do_read [gfs] 0x1a4 (0xdb6a3e9c)
Jul  4 10:18:11 drexodus kernel: [<f8f00f3e>] gfs_walk_vma [gfs] 0x12e (0xdb6a3ed0)
Jul  4 10:18:11 drexodus kernel: [<c0134f2d>] update_process_time_intertick [kernel] 0x7d (0xdb6a3f30)
Jul  4 10:18:11 drexodus kernel: [<f8f00d40>] gfs_llseek [gfs] 0x0 (0xdb6a3f38)
Jul  4 10:18:11 drexodus kernel: [<f8f00d8c>] gfs_llseek [gfs] 0x4c (0xdb6a3f48)
Jul  4 10:18:11 drexodus kernel: [<f8f01b1e>] gfs_read [gfs] 0x6e (0xdb6a3f6c)
Jul  4 10:18:11 drexodus kernel: [<f8f01900>] do_read [gfs] 0x0 (0xdb6a3f80)
Jul  4 10:18:11 drexodus kernel: [<c0165127>] sys_read [kernel] 0x97 (0xdb6a3f94)
Jul  4 10:18:11 drexodus kernel: [<c02af06f>] no_timing [kernel] 0x7 (0xdb6a3fc0)
Jul  4 10:18:11 drexodus kernel:
Jul  4 10:18:11 drexodus kernel:
Jul  4 10:18:11 drexodus kernel: Kernel panic: GFS: Assertion failed on line 1253 of file linux_dio.c
Jul  4 10:18:11 drexodus kernel: GFS: assertion: "metatype_check_magic == GFS_MAGIC && metatype_check_type == ((height) ? (5) : (4))"
Jul  4 10:18:11 drexodus kernel: GFS: time = 1215137891
Jul  4 10:18:11 drexodus kernel: GFS: fsid=alpha:home.2
Jul  4 10:18:11 drexodus kernel:
Jul  4 11:33:10 drexodus syslogd 1.4.1: restart.

Comment 5 Ben Marzinski 2008-07-07 19:17:31 UTC
Have you run gfs_fsck? That definitely looks like it could be filesystem corruption.

Comment 6 Dennis 2008-07-08 08:04:10 UTC
(In reply to comment #5)
> Have you run gfs_fsck? That definitely looks like it could be filesystem
> corruption.

Yes, we did that. We're monitoring its performance; if it doesn't panic for 24
hours, we'll declare this resolved. :-)

Comment 7 Ben Marzinski 2008-09-26 16:22:20 UTC
There's no way to know the cause of this corruption. Since it's not reproducible, there's really nothing that can be done.

