From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.9) Gecko/2008052912 Firefox/3.0

Description of problem:
We're getting consistent kernel panics on most of our GFS nodes, all pointing to the same line and file:

Kernel panic: GFS: Assertion failed on line 1227 of file rgrp.c

Three nodes are configured as lock_gulm servers and also as GFS clients, plus one GNBD storage server. Two of the nodes are IBM x3650, the third node is an IBM x346, and the GNBD server is also an IBM x346. The GNBD server serves only one GFS file system. /home is mounted on all 3 nodes with a capacity of 500 GB and serves as the mailbox store for our mail systems.

Currently all nodes are running the same versions of the kernel, GFS and GFS modules, as listed below.
- GFS-modules-smp-6.0.2.27-0
- GFS-6.0.2.27-0
- 2.4.21-37.ELsmp

Jun 28 20:35:49 drgenesis kernel: e5b61bac f8ea8b72 00000246 00001000 e5a44100 f8fc4000 e5a44100 f8ea8d70
Jun 28 20:35:49 drgenesis kernel: 00000246 000001f0 00000000 553a5ba8 f4d1f330 00000005 00000004 ffffffff
Jun 28 20:35:49 drgenesis kernel: f8ec26d0 f8ec9c8f f8ec9bc4 000004cb 00000016 efcf3e00 00000006 f8fc4000
Jun 28 20:35:49 drgenesis kernel: Call Trace: [<f8ea8b72>] gfs_asserti [gfs] 0x32 (0xe5b61bb0)
Jun 28 20:35:49 drgenesis kernel: [<f8ea8d70>] gmalloc [gfs] 0x20 (0xe5b61bc8)
Jun 28 20:35:49 drgenesis kernel: [<f8ec26d0>] blkalloc_internal [gfs] 0x130 (0xe5b61bec)
Jun 28 20:35:49 drgenesis kernel: [<f8ec9c8f>] .rodata.str1.1 [gfs] 0x1da3 (0xe5b61bf0)
Jun 28 20:35:49 drgenesis kernel: [<f8ec9bc4>] .rodata.str1.1 [gfs] 0x1cd8 (0xe5b61bf4)
Jun 28 20:35:49 drgenesis kernel: [<f8ec2b8b>] gfs_blkalloc [gfs] 0x7b (0xe5b61c20)
Jun 28 20:35:49 drgenesis kernel: [<f8e9c90c>] get_datablock [gfs] 0xfc (0xe5b61c4c)
Jun 28 20:35:49 drgenesis kernel: [<f8e9cc43>] gfs_block_map [gfs] 0x333 (0xe5b61c70)
Jun 28 20:35:49 drgenesis kernel: [<c0149093>] find_or_create_page [kernel] 0x63 (0xe5b61c9c)
Jun 28 20:35:49 drgenesis kernel: [<f8e8d08c>] gfs_dgetblk [gfs] 0x3c (0xe5b61cec)
Jun 28 20:35:49 drgenesis kernel: [<f8ec17bb>] gfs_rgrp_read [gfs] 0xab (0xe5b61d10)
Jun 28 20:35:49 drgenesis kernel: [<f8e96239>] get_block [gfs] 0xb9 (0xe5b61d28)
Jun 28 20:35:49 drgenesis kernel: [<c016814b>] __block_prepare_write [kernel] 0x1ab (0xe5b61d64)
Jun 28 20:35:49 drgenesis kernel: [<c0168b09>] block_prepare_write [kernel] 0x39 (0xe5b61da8)
Jun 28 20:35:49 drgenesis kernel: [<f8e96180>] get_block [gfs] 0x0 (0xe5b61dbc)
Jun 28 20:35:49 drgenesis kernel: [<f8e968fc>] gfs_prepare_write [gfs] 0x12c (0xe5b61dc8)
Jun 28 20:35:49 drgenesis kernel: [<f8e96180>] get_block [gfs] 0x0 (0xe5b61dd8)
Jun 28 20:35:49 drgenesis kernel: [<c014c053>] do_generic_file_write [kernel] 0x1e3 (0xe5b61df4)
Jun 28 20:35:49 drgenesis kernel: [<f8e90bab>] do_do_write [gfs] 0x2ab (0xe5b61e48)
Jun 28 20:35:49 drgenesis kernel: [<f8e90feb>] do_write [gfs] 0x18b (0xe5b61e94)
Jun 28 20:35:49 drgenesis kernel: [<f8e8ef1e>] gfs_walk_vma [gfs] 0x12e (0xe5b61ed0)
Jun 28 20:35:49 drgenesis kernel: [<f8eab4d7>] gfs_glock_nq_init [gfs] 0x37 (0xe5b61f2c)
Jun 28 20:35:49 drgenesis kernel: [<f8eab513>] gfs_glock_dq_uninit [gfs] 0x13 (0xe5b61f3c)
Jun 28 20:35:49 drgenesis kernel: [<f8e8ede7>] gfs_llseek [gfs] 0xc7 (0xe5b61f48)
Jun 28 20:35:49 drgenesis kernel: [<f8e910c1>] gfs_write [gfs] 0x91 (0xe5b61f6c)
Jun 28 20:35:49 drgenesis kernel: [<f8e90e60>] do_write [gfs] 0x0 (0xe5b61f80)
Jun 28 20:35:49 drgenesis kernel: [<c0164b27>] sys_write [kernel] 0x97 (0xe5b61f94)
Jun 28 20:35:49 drgenesis kernel:
Jun 28 20:35:49 drgenesis kernel: Kernel panic: GFS: Assertion failed on line 1227 of file rgrp.c
Jun 28 20:35:49 drgenesis kernel: GFS: assertion: "x <= length"
Jun 28 20:35:49 drgenesis kernel: GFS: time = 1214656549
Jun 28 20:35:49 drgenesis kernel: GFS: fsid=alpha:home.2: RG = 64975595
Jun 28 20:35:49 drgenesis kernel:
Jun 29 11:00:06 drgenesis syslogd 1.4.1: restart.
Jun 29 11:00:06 drgenesis syslog: syslogd startup succeeded

Version-Release number of selected component (if applicable):
kernel-2.4.21-37.ELsmp, GFS-modules-smp-6.0.2.27-0, GFS-6.0.2.27-0

How reproducible:
Always

Steps to Reproduce:
If one of the 3 nodes fails, we do the following manually:
1. Load the GFS modules (gnbd, gfs, pool, lock_gulm)
2. Start gnbd_import
3. Start pool, ccsd, lock_gulmd and gfs

Actual Results:
After 8 or 12 hours of the node rejoining the cluster, one or two nodes panic with errors like the ones above.

Expected Results:

Additional info:
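To confirm that every node really is hitting the same assertion, the panic lines can be tallied per file and line number. The following is only a minimal sketch: the function name count_gfs_panics is hypothetical, and the regular expression assumes the panic message format shown in the log above.

```python
import re
import sys
from collections import Counter

# Assumes the panic line format seen in the log above.
PANIC_RE = re.compile(
    r"Kernel panic: GFS: Assertion failed on line (\d+) of file (\S+)"
)


def count_gfs_panics(log_text):
    """Return a Counter keyed by (file, line) for each GFS assertion panic."""
    hits = Counter()
    for match in PANIC_RE.finditer(log_text):
        line_no, file_name = match.groups()
        hits[(file_name, int(line_no))] += 1
    return hits


if __name__ == "__main__":
    # Usage: pass one syslog file per node on the command line.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            panics = count_gfs_panics(fh.read())
        for (file_name, line_no), count in sorted(panics.items()):
            print(f"{path}: {file_name}:{line_no} seen {count} time(s)")
```

Running it over a copy of /var/log/messages from each node would show at a glance whether the panics cluster on rgrp.c line 1227 or are spread across several assertions.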
I see that you filed the bug under gnbd-kernel. Has anything happened to lead you to believe that gnbd is the cause of this problem? Also, is it possible to upgrade to the most recent kernel and GFS-modules packages?
What kind of load are you running on the filesystems?
(In reply to comment #1)
> I see that you filed the bug under gnbd-kernel, has anything happened to lead
> you to believe that gnbd is the cause of this problem?
>
> Also, is it possible to upgrade to the most recent kernel and GFS-modules packages?

We already upgraded to a newer kernel and GFS yesterday (kernel-smp-2.4.21-50.EL.i686.rpm, GFS-modules-smp-6.0.2.30-0.i386.rpm and GFS-6.0.2.30-0.i386.rpm), but to no avail. We'll try running gfs_fsck on it, but it might take 8 to 11 hours for 300+ GB.
Here's the new error we encountered on one of our nodes after the kernel and GFS upgrade. This server serves POP.

Jul 4 10:18:11 drexodus kernel: Bad metadata at 64975751, should be 5
Jul 4 10:18:11 drexodus kernel: mh_magic = 0x01161970
Jul 4 10:18:11 drexodus kernel: mh_type = 4
Jul 4 10:18:11 drexodus kernel: mh_generation = 375
Jul 4 10:18:11 drexodus kernel: mh_format = 400
Jul 4 10:18:11 drexodus kernel: mh_incarn = 123
Jul 4 10:18:11 drexodus kernel: db6a3b8c f8f1afa2 00000001 c0387e98 00000000 00000246 00000012 00000000
Jul 4 10:18:11 drexodus kernel: c01298c3 0000000a 00000400 f8f3b831 db6a3bfc cde536b0 00000030 00000000
Jul 4 10:18:11 drexodus kernel: f8f0052d f8f3c848 f8f3a43a 000004e5 00000013 f8f67000 db6a3bf8 cde53810
Jul 4 10:18:11 drexodus kernel: Call Trace: [<f8f1afa2>] gfs_asserti [gfs] 0x32 (0xdb6a3b90)
Jul 4 10:18:11 drexodus kernel: [<c01298c3>] printk [kernel] 0x153 (0xdb6a3bac)
Jul 4 10:18:11 drexodus kernel: [<f8f3b831>] .rodata.str1.1 [gfs] 0x14c5 (0xdb6a3bb8)
Jul 4 10:18:11 drexodus kernel: [<f8f0052d>] gfs_get_meta_buffer [gfs] 0x29d (0xdb6a3bcc)
Jul 4 10:18:11 drexodus kernel: [<f8f3c848>] .rodata.str1.4 [gfs] 0x3bc (0xdb6a3bd0)
Jul 4 10:18:11 drexodus kernel: [<f8f3a43a>] .rodata.str1.1 [gfs] 0xce (0xdb6a3bd4)
Jul 4 10:18:11 drexodus kernel: [<f8f0ec3b>] gfs_block_map [gfs] 0x2eb (0xdb6a3c2c)
Jul 4 10:18:11 drexodus kernel: [<c011c610>] flush_tlb_all_ipi [kernel] 0x0 (0xdb6a3c54)
Jul 4 10:18:11 drexodus kernel: [<c01629a8>] map_new_virtual [kernel] 0x1a8 (0xdb6a3c9c)
Jul 4 10:18:11 drexodus kernel: [<f8f08249>] get_block [gfs] 0xb9 (0xdb6a3ce4)
Jul 4 10:18:11 drexodus kernel: [<c0168dd6>] block_read_full_page [kernel] 0x2e6 (0xdb6a3d20)
Jul 4 10:18:11 drexodus kernel: [<c0159ba4>] __alloc_pages [kernel] 0xc4 (0xdb6a3d60)
Jul 4 10:18:11 drexodus kernel: [<f8f086e2>] gfs_readpage [gfs] 0x82 (0xdb6a3d84)
Jul 4 10:18:11 drexodus kernel: [<f8f08190>] get_block [gfs] 0x0 (0xdb6a3d8c)
Jul 4 10:18:11 drexodus kernel: [<c0148cca>] add_to_page_cache_unique [kernel] 0x5a (0xdb6a3d90)
Jul 4 10:18:11 drexodus kernel: [<c0148f21>] page_cache_read [kernel] 0xe1 (0xdb6a3da4)
Jul 4 10:18:11 drexodus kernel: [<c0149947>] generic_file_readahead [kernel] 0xd7 (0xdb6a3dcc)
Jul 4 10:18:11 drexodus kernel: [<c0149f24>] do_generic_file_read [kernel] 0x4d4 (0xdb6a3de8)
Jul 4 10:18:11 drexodus kernel: [<c014a7db>] generic_file_new_read [kernel] 0xbb (0xdb6a3e28)
Jul 4 10:18:11 drexodus kernel: [<c014a620>] file_read_actor [kernel] 0x0 (0xdb6a3e38)
Jul 4 10:18:11 drexodus kernel: [<c014a91f>] generic_file_read [kernel] 0x3f (0xdb6a3e7c)
Jul 4 10:18:11 drexodus kernel: [<f8f01aa4>] do_read [gfs] 0x1a4 (0xdb6a3e9c)
Jul 4 10:18:11 drexodus kernel: [<f8f00f3e>] gfs_walk_vma [gfs] 0x12e (0xdb6a3ed0)
Jul 4 10:18:11 drexodus kernel: [<c0134f2d>] update_process_time_intertick [kernel] 0x7d (0xdb6a3f30)
Jul 4 10:18:11 drexodus kernel: [<f8f00d40>] gfs_llseek [gfs] 0x0 (0xdb6a3f38)
Jul 4 10:18:11 drexodus kernel: [<f8f00d8c>] gfs_llseek [gfs] 0x4c (0xdb6a3f48)
Jul 4 10:18:11 drexodus kernel: [<f8f01b1e>] gfs_read [gfs] 0x6e (0xdb6a3f6c)
Jul 4 10:18:11 drexodus kernel: [<f8f01900>] do_read [gfs] 0x0 (0xdb6a3f80)
Jul 4 10:18:11 drexodus kernel: [<c0165127>] sys_read [kernel] 0x97 (0xdb6a3f94)
Jul 4 10:18:11 drexodus kernel: [<c02af06f>] no_timing [kernel] 0x7 (0xdb6a3fc0)
Jul 4 10:18:11 drexodus kernel:
Jul 4 10:18:11 drexodus kernel:
Jul 4 10:18:11 drexodus kernel: Kernel panic: GFS: Assertion failed on line 1253 of file linux_dio.c
Jul 4 10:18:11 drexodus kernel: GFS: assertion: "metatype_check_magic == GFS_MAGIC && metatype_check_type == ((height) ? (5) : (4))"
Jul 4 10:18:11 drexodus kernel: GFS: time = 1215137891
Jul 4 10:18:11 drexodus kernel: GFS: fsid=alpha:home.2
Jul 4 10:18:11 drexodus kernel:
Jul 4 11:33:10 drexodus syslogd 1.4.1: restart.
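For readers unfamiliar with the assertion text, the check it expresses can be sketched in a few lines. This is an illustration of the logic printed in the panic message, not the GFS source: the magic value and the type numbers 4 and 5 are taken from the log above, while the names METATYPE_DINODE, METATYPE_INDIRECT, expected_metatype and header_ok are assumptions made for this example.

```python
GFS_MAGIC = 0x01161970   # value of mh_magic printed in the log above
METATYPE_DINODE = 4      # assumed name for metadata type 4
METATYPE_INDIRECT = 5    # assumed name for metadata type 5


def expected_metatype(height):
    """Mirror ((height) ? (5) : (4)) from the assertion text."""
    return METATYPE_INDIRECT if height else METATYPE_DINODE


def header_ok(mh_magic, mh_type, height):
    """Return True when the header passes the check stated in the assertion."""
    return mh_magic == GFS_MAGIC and mh_type == expected_metatype(height)


# Values from the log: the magic is valid, but the block carries type 4
# where type 5 was expected, so the check fails for any nonzero height.
print(header_ok(0x01161970, 4, height=1))
```

In other words, the block at 64975751 has a valid GFS magic number but the wrong metadata type for its position in the tree, which is why the message says "should be 5".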
Have you run gfs_fsck? That definitely looks like it could be filesystem corruption.
(In reply to comment #5)
> Have you run gfs_fsck? That definitely looks like it could be filesystem corruption.

Yes, we did that. We're monitoring its performance. If it doesn't panic for 24 hours, we'll declare this resolved. :-)
There's no way to know the cause of this corruption. Since it can't be recreated, there's really nothing that can be done.