From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050512 Red Hat/1.0.4-1.4.1 Firefox/1.0.4

Description of problem:
running regression tests on 5 node cluster. one of the nodes oops'ed with:

Jul 26 14:14:17 tank-01 kernel: Unable to handle kernel paging request at virtual address 00200214
Jul 26 14:14:17 tank-01 kernel: printing eip:
Jul 26 14:14:17 tank-01 kernel: f8cf4d02
Jul 26 14:14:17 tank-01 kernel: *pde = 00004001
Jul 26 14:14:17 tank-01 kernel: Oops: 0000 [#1]
Jul 26 14:14:17 tank-01 kernel: SMP
Jul 26 14:14:17 tank-01 kernel: Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) lpfc qla2300 qla2xxx parport_pc lp parport autofs4 i2c_dev i2c_core dlm(U) cman(U) md5 ipv6 sunrpc button battery ac uhci_hcd hw_random e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod scsi_transport_fc sd_mod scsi_mod
Jul 26 14:14:17 tank-01 kernel: CPU: 2
Jul 26 14:14:17 tank-01 kernel: EIP: 0060:[<f8cf4d02>] Tainted: GF VLI
Jul 26 14:14:17 tank-01 kernel: EFLAGS: 00010206 (2.6.9-11.28.ELsmp)
Jul 26 14:14:17 tank-01 kernel: EIP is at depend_sync_old+0x40/0x59 [gfs]
Jul 26 14:14:17 tank-01 kernel: eax: 04549df9 ebx: f8c7724c ecx: f7fff800 edx: f8c7724c
Jul 26 14:14:17 tank-01 kernel: esi: 0000ea60 edi: f8c77000 ebp: 002001f8 esp: f68acea4
Jul 26 14:14:17 tank-01 kernel: ds: 007b es: 007b ss: 0068
Jul 26 14:14:17 tank-01 kernel: Process gfs_inoded (pid: 4265, threadinfo=f68ac000 task=f6654630)
Jul 26 14:14:17 tank-01 kernel: Stack: c330e600 f8c77000 ed0aa258 ed0aa22c f8c77000 00000000 f8cce3a8 00000001
Jul 26 14:14:17 tank-01 kernel: eb96ec48 c330e600 c5246dac cb3c3590 c5246dac c330e600 c330e600 f8c77000
Jul 26 14:14:17 tank-01 kernel: f8cf6aec 004ed57a 00000000 00000001 00000000 00000000 00000000 c5246dac
Jul 26 14:14:17 tank-01 kernel: Call Trace:
Jul 26 14:14:17 tank-01 kernel: [<f8cce3a8>] gfs_wipe_buffers+0x2a6/0x2ae [gfs]
Jul 26 14:14:17 tank-01 kernel: [<f8cf6aec>] gfs_difree+0x39/0x3f [gfs]
Jul 26 14:14:17 tank-01 kernel: [<f8cdb170>] dinode_dealloc+0x113/0x164 [gfs]
Jul 26 14:14:17 tank-01 kernel: [<f8cdb351>] inode_dealloc+0x190/0x1d6 [gfs]
Jul 26 14:14:17 tank-01 kernel: [<f8cd8093>] gfs_glock_dq+0x111/0x11f [gfs]
Jul 26 14:14:17 tank-01 kernel: [<f8cdb3e8>] inode_dealloc_init+0x51/0x64 [gfs]
Jul 26 14:14:17 tank-01 kernel: [<f8cf96a6>] .text.lock.unlinked+0x1a/0x74 [gfs]
Jul 26 14:14:17 tank-01 kernel: [<f8cf9602>] gfs_unlinked_dealloc+0x2b/0xb5 [gfs]
Jul 26 14:14:17 tank-01 kernel: [<f8ccc209>] gfs_inoded+0x3a/0xbc [gfs]
Jul 26 14:14:17 tank-01 kernel: [<f8ccc1cf>] gfs_inoded+0x0/0xbc [gfs]
Jul 26 14:14:17 tank-01 kernel: [<c01041f1>] kernel_thread_helper+0x5/0xb
Jul 26 14:14:17 tank-01 kernel: Code: 02 00 00 8b a8 20 01 00 00 89 d8 e8 2a 8d 5d c7 8b b7 78 02 00 00 83 ed 08 89 d8 e8 8b 8d 5d c7 a1 20 e9 31 c0 69 f6 e8 03 00 00 <03> 75 1c 39 f0 78 0b 89 ea 89 f8 e8 f3 fe ff ff eb bd 5b 5e 5b
Jul 26 14:14:17 tank-01 kernel: <0>Fatal exception: panic in 5 seconds

Version-Release number of selected component (if applicable):
GFS-kernel-smp-2.6.9-36.2

How reproducible:
Didn't try

Steps to Reproduce:
currently trying to reproduce.

Additional info:
Please let me know if this is reproducible.
*** Bug 166293 has been marked as a duplicate of this bug. ***
O.k., I guess it is reproducible.
O.k. I found a bug in the depend_sync_old code that could definitely cause this error. The only problem is, I'm not totally sure that it *IS* causing this error, and I'm even fuzzier on how it would cause 166293. My best guess is that the stack trace for 166293 is incomplete, and that it is exactly the same bug.

Here's the dilemma. In depend_sync_old, if it takes longer than "depend_secs" (a tunable parameter, set to 60 seconds by default) to sync all the old dependent inodes to disk, bad things happen, and you end up overwriting the resource group descriptor (rgd) structure. If you manage to trash this structure without crashing, then on the next loop this oops is exactly what you would see. This explains why we saw it with gnbd: using gnbd, it takes longer to sync the inodes to disk.

I knocked depend_secs down to 0, and I can hit this bug within minutes, every time. The problem is, I always crash while mucking with the structure. However, I don't think that you must always crash (i.e. when you access what you think should be a pointer, it is actually a pointer in the rgd structure; none of the memory you access is guaranteed to hold an invalid value). I think the reason that I always crash early has something to do with knocking depend_secs down, so that other parts of the rgd don't have time to be set to valid values.

If we could reproduce this bug reliably, we could verify a fix. But I can't see another way for this error to happen, and this bug could definitely cause it.
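For anyone cross-checking the oops in the original report against the depend_secs theory, here is a rough arithmetic sketch. The register values and the faulting address are copied from the log. The decoding of the "Code:" bytes is my own reading, not something confirmed against the source: I am assuming "69 f6 e8 03 00 00" is a multiply of esi by 0x3e8 (1000), and that the faulting bytes "<03> 75 1c" are an add of the word at ebp+0x1c into esi. The helper names are made up for illustration.

```python
# Values copied from the oops output in this report.
FAULT_ADDR = 0x00200214  # "virtual address 00200214"
EBP = 0x002001F8         # "ebp: 002001f8"
ESI = 0x0000EA60         # "esi: 0000ea60"

# Assumption: "<03> 75 1c" decodes as: add esi, [ebp + 0x1c]
DISP = 0x1C
# Assumption: "69 f6 e8 03 00 00" decodes as: imul esi, esi, 0x3e8
SCALE = 0x3E8


def faulting_address(base, disp):
    """Address touched by a 32-bit [base + disp] memory operand."""
    return (base + disp) & 0xFFFFFFFF


def unscale(value, scale):
    """Undo the multiply: returns (quotient, remainder)."""
    return divmod(value, scale)


print(hex(faulting_address(EBP, DISP)))
print(unscale(ESI, SCALE))
```

Under those assumptions, ebp+0x1c lands exactly on the faulting address, and esi works out to 60 * 1000, which lines up with the default depend_secs of 60. So the bad value looks like the pointer in ebp being dereferenced during the timeout comparison, not the timeout value itself. Treat this as a consistency check only.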
Unless someone can recreate this problem with my change in, I'm calling this bug fixed.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-740.html