Description of problem:
A couple of times a month we get a kernel panic:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
We have 2 nodes and this only happens on one of them. The node it happens on runs ypserv, and I notice the kernel panic always mentions ypserv. The other node does not run this service.

Version-Release number of selected component (if applicable):
CentOS 4.4
Linux scylla1 2.6.9-42.0.3.ELsmp #1 SMP Fri Oct 6 06:21:39 CDT 2006 i686 i686 i386 GNU/Linux
clustat version 1.9.54
Connected via: CMAN/SM Plugin v1.1.7.1

How reproducible:
I am not able to reproduce this at will.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Apr 13 21:06:04 scylla1 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
Apr 13 21:06:04 scylla1 kernel: printing eip:
Apr 13 21:06:04 scylla1 kernel: f8c743a6
Apr 13 21:06:04 scylla1 kernel: *pde = 18643001
Apr 13 21:06:04 scylla1 kernel: Oops: 0000 [#1]
Apr 13 21:06:04 scylla1 kernel: SMP
Apr 13 21:06:04 scylla1 kernel: CPU: 0
Apr 13 21:06:04 scylla1 kernel: EIP: 0060:[<f8c743a6>] Not tainted VLI
Apr 13 21:06:04 scylla1 kernel: EFLAGS: 00010203 (2.6.9-42.0.3.ELsmp)
Apr 13 21:06:04 scylla1 kernel: EIP is at gfs_glock_dq+0xaf/0x16e [gfs]
Apr 13 21:06:04 scylla1 kernel: eax: eaf39524 ebx: eaf39518 ecx: f7f464ff edx: 00000000
Apr 13 21:06:04 scylla1 kernel: esi: 00000000 edi: eaf394fc ebp: f68dd61c esp: f67dce98
Apr 13 21:06:04 scylla1 kernel: ds: 007b es: 007b ss: 0068
Apr 13 21:06:04 scylla1 kernel: Process ypserv (pid: 5328, threadinfo=f67dc000 task=f27e3630)
Apr 13 21:06:04 scylla1 kernel: Stack: 00117975 de2e939c f8ca96a0 f8945000 f68dd61c f68dd61c f68dd604 f68dd600
Apr 13 21:06:04 scylla1 kernel:        f8c747aa c2b48e80 f8c8945c f67dceec c2b48e80 00000000 00000007 c2b48e80
Apr 13 21:06:04 scylla1 kernel:        f8c894d0 c2b48e80 f8ca98e0 edb22768 c016e8ac 00000000 00000000 00000000
Apr 13 21:06:04 scylla1 kernel: Call Trace:
Apr 13 21:06:04 scylla1 kernel:  [<f8c747aa>] gfs_glock_dq_uninit+0x8/0x10 [gfs]
Apr 13 21:06:04 scylla1 kernel:  [<f8c8945c>] do_unflock+0x4f/0x61 [gfs]
Apr 13 21:06:04 scylla1 kernel:  [<f8c894d0>] gfs_flock+0x62/0x76 [gfs]
Apr 13 21:06:04 scylla1 kernel:  [<c016e8ac>] locks_remove_flock+0x49/0xe1
Apr 13 21:06:04 scylla1 kernel:  [<c015bbc2>] __fput+0x41/0x100
Apr 13 21:06:04 scylla1 kernel:  [<c015a7f5>] filp_close+0x59/0x5f
Apr 13 21:06:04 scylla1 kernel:  [<c0123b5b>] put_files_struct+0x57/0xc0
Apr 13 21:06:04 scylla1 kernel:  [<c012476f>] do_exit+0x245/0x404
Apr 13 21:06:04 scylla1 kernel:  [<c0124a19>] sys_exit_group+0x0/0xd
Apr 13 21:06:04 scylla1 kernel:  [<c02d47cb>] syscall_call+0x7/0xb
Apr 13 21:06:04 scylla1 kernel: <0>Fatal exception: panic in 5 seconds
And the previous panic. I am only adding it because it lists the modules linked in.

Mar 30 04:35:06 scylla1 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
Mar 30 04:35:06 scylla1 kernel: printing eip:
Mar 30 04:35:06 scylla1 kernel: f8c743a6
Mar 30 04:35:06 scylla1 kernel: *pde = 00004001
Mar 30 04:35:06 scylla1 kernel: Oops: 0000 [#1]
Mar 30 04:35:06 scylla1 kernel: SMP
Mar 30 04:35:06 scylla1 kernel: Modules linked in: nfsd exportfs lockd nfs_acl parport_pc lp parport autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) sunrpc dm_mirror dm_multipath dm_mod button battery ac md5 ipv6 uhci_hcd ehci_hcd hw_random tg3 floppy ext3 jbd cciss sd_mod scsi_mod
Mar 30 04:35:06 scylla1 kernel: CPU: 1
Mar 30 04:35:06 scylla1 kernel: EIP: 0060:[<f8c743a6>] Not tainted VLI
Mar 30 04:35:06 scylla1 kernel: EFLAGS: 00010207 (2.6.9-42.0.3.ELsmp)
Mar 30 04:35:06 scylla1 kernel: EIP is at gfs_glock_dq+0xaf/0x16e [gfs]
Mar 30 04:35:06 scylla1 kernel: eax: ebb81a84 ebx: ebb81a78 ecx: f7f46400 edx: 00000000
Mar 30 04:35:06 scylla1 kernel: esi: 00000000 edi: ebb81a5c ebp: ce9a251c esp: e17f9e98
Mar 30 04:35:06 scylla1 kernel: ds: 007b es: 007b ss: 0068
Mar 30 04:35:06 scylla1 kernel: Process ypserv (pid: 9039, threadinfo=e17f9000 task=f32f7330)
Mar 30 04:35:06 scylla1 kernel: Stack: 0000630a e314889c f8ca96a0 f8945000 ce9a251c ce9a251c ce9a2504 ce9a2500
Mar 30 04:35:06 scylla1 kernel:        f8c747aa ef2ed980 f8c8945c e17f9eec ef2ed980 00000000 00000007 ef2ed980
Mar 30 04:35:06 scylla1 kernel:        f8c894d0 ef2ed980 f8ca98e0 ec21b208 c016e8ac 00000000 00000000 00000000
Mar 30 04:35:06 scylla1 kernel: Call Trace:
Mar 30 04:35:06 scylla1 kernel:  [<f8c747aa>] gfs_glock_dq_uninit+0x8/0x10 [gfs]
Mar 30 04:35:06 scylla1 kernel:  [<f8c8945c>] do_unflock+0x4f/0x61 [gfs]
Mar 30 04:35:06 scylla1 kernel:  [<f8c894d0>] gfs_flock+0x62/0x76 [gfs]
Mar 30 04:35:06 scylla1 kernel:  [<c016e8ac>] locks_remove_flock+0x49/0xe1
Mar 30 04:35:06 scylla1 kernel:  [<c015bbc2>] __fput+0x41/0x100
Mar 30 04:35:06 scylla1 kernel:  [<c015a7f5>] filp_close+0x59/0x5f
Mar 30 04:35:06 scylla1 kernel:  [<c0123b5b>] put_files_struct+0x57/0xc0
Mar 30 04:35:06 scylla1 kernel:  [<c012476f>] do_exit+0x245/0x404
Mar 30 04:35:06 scylla1 kernel:  [<c0124a19>] sys_exit_group+0x0/0xd
Mar 30 04:35:06 scylla1 kernel:  [<c02d47cb>] syscall_call+0x7/0xb
Mar 30 04:35:06 scylla1 kernel: Code: f8 ba 57 85 c9 f8 68 2d 82 c9 f8 8b 44 24 14 e8 e0 1e 02 00 59 5b f6 45 15 08 74 06 f0 0f ba 6f 08 04 f6 45 15 04 74 38 8b 57 28 <8b> 02 0f 18 00 90 8d 47 28 39 c2 74 0b ff 04 24 89 54 24 04 8b
Mar 30 04:35:06 scylla1 kernel: <0>Fatal exception: panic in 5 seconds
Fixing product name. Cluster Suite components were integrated into Enterprise Linux for version 5.0.
The bug said version 5, but the kernel message indicates CentOS 4.4 with a 2.6.9-42 kernel. Therefore, I'm changing the version and setting the components back to cluster-suite and gfs-kernel.
I tried recreating the problem with a variety of programs that take flocks and exit with them held, but could not reproduce it. I also dug through the code and didn't find anything obvious relating to this code path.
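The simplest of those test programs looked roughly like this (a minimal sketch; the test file path on the GFS mount is just a placeholder):

/* Take an flock() on a file on the GFS mount and exit while
 * still holding it.  The kernel then releases the lock through
 * locks_remove_flock() during do_exit(), which is the code path
 * shown in the panic's call trace above. */
#include <sys/file.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int fd = open("/mnt/gfs/flock-test", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (flock(fd, LOCK_EX) < 0) {
        perror("flock");
        return 1;
    }
    /* Exit deliberately without LOCK_UN or close(), leaving the
     * flock held when the process tears down its files. */
    exit(0);
}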
OK. We haven't had this issue again since reporting it, though it did happen a few times before I finally reported it. Do you need any other info about the machine, or anything else?
I don't need anything at the moment, short of a good way to recreate the problem. It appears that the process is exiting while still holding one or more flocks. I've tried many variations on that, but couldn't reproduce the problem. Abhi Das was the last person to work on the flock code, and he offered to investigate it some more, so I'm adding him to the cc list.
At this point I have nothing else to report. It hasn't happened again since this report, and I'm in the process of updating right now. So feel free to close this ticket/bug if you want, and I'll file another one if it happens again.
I haven't seen this problem again either. Please feel free to open up another bug if you see the problem again.