Bug 419391
Summary: | gfs:gfs_glock_dq kernel oops | |
---|---|---|---
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Reiner Rottmann <rrottmann>
Component: | GFS-kernel | Assignee: | Ben Marzinski <bmarzins>
Status: | CLOSED ERRATA | QA Contact: | GFS Bugs <gfs-bugs>
Severity: | urgent | Priority: | urgent
Version: | 4 | CC: | hlawatschek, jwest, merz, rpeterso
Target Milestone: | rc | Keywords: | ZStream
Hardware: | ia64 | OS: | Linux
Fixed In Version: | RHBA-2008-0802 | Doc Type: | Bug Fix
Last Closed: | 2008-07-25 19:27:25 UTC | Bug Blocks: | 441747
Description
Reiner Rottmann
2007-12-11 09:23:59 UTC
Created attachment 283881 [details]
kernel oops message
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Can I get the userland script (and all prerequisite programs) that recreates this problem? I can see it has something to do with flocks, but I want to understand it better and try to recreate it here. Thanks.

FYI:
A similar-looking error happens on a 32bit machine (completely different
cluster), this time caused by smtpd (postfix).
Relevant RPMs installed:
GFS-6.1.14-0
cman-1.0.17-0
dlm-1.0.3-1
ccs-1.0.10-0
Kernel:
2.6.9-55.0.2.ELsmp
Info:
The filesystem is a -p lock_dlm FS, but only mounted on one host.
Mounted with options: rw,noatime,nodiratime
No settune options were in use at the time of the occurrence.
The FS is used for the SMTP-Out Spool-Dirs of Postfix.
Postfix is heavily used on this machine for outgoing mail traffic (100,000 to
200,000 mails per day).
* serverX naming is just to anonymize the host *
================================================
Dec 4 07:38:04 serverX.local Unable to handle kernel NULL pointer
dereference at virtual address 00000000
Dec 4 07:38:04 serverX.local printing eip:
Dec 4 07:38:04 serverX.local f8b82542
Dec 4 07:38:04 serverX.local *pde = 0ff6f001
Dec 4 07:38:04 serverX.local Oops: 0000 [#1]
Dec 4 07:38:04 serverX.local SMP
Dec 4 07:38:04 serverX.local Modules linked in: netconsole netdump sg sunrpc ipt_REJECT ipt_LOG ipt_limit iptable_filter ip_tables ext3 jbd uhci_hcd ehci_hcd e752x_edac edac_mc hw_random floppy scsi_transport_fc ata_piix libata md5 ipv6 lock_dlm(U) dlm(U) gfs(U) lock_harness(U) cman(U) qla2400(U) qla2300(U) qla2xxx(U) qla2xxx_conf(U) cciss sd_mod scsi_mod dm_snapshot dm_mirror dm_mod tg3
Dec 4 07:38:04 serverX.local CPU: 0
Dec 4 07:38:04 serverX.local EIP: 0060:[<f8b82542>] Not tainted VLI
Dec 4 07:38:04 serverX.local EFLAGS: 00010206 (2.6.9-55.0.2.ELsmp)
Dec 4 07:38:04 serverX.local EIP is at gfs_glock_dq+0xaf/0x16e [gfs]
Dec 4 07:38:04 serverX.local eax: c8bb8b30 ebx: c8bb8b24 ecx: f5417400
edx: 00000000
Dec 4 07:38:04 serverX.local esi: 00000000 edi: c8bb8b08 ebp: d210de9c
esp: ee3e2f58
Dec 4 07:38:04 serverX.local ds: 007b es: 007b ss: 0068
Dec 4 07:38:04 serverX.local Process smtpd (pid: 9245, threadinfo=ee3e2000
task=f6b307b0)
Dec 4 07:38:04 serverX.local Stack: 3abe4e03 d1e3099c f8bb8c00 f8ad3000
d210de9c d210de9c d210de84 d210de80
Dec 4 07:38:04 serverX.local f8b82946 f56ba380 f8b979b2 f4fcb6cc
f56ba380 00000000 00000007 00000000
Dec 4 07:38:04 serverX.local f8b97a26 f8b979c4 00000001 f56ba380
c016e439 f4fcb6cc 00000010 09bd1ba8
Dec 4 07:38:04 serverX.local Call Trace:
Dec 4 07:38:04 serverX.local [<f8b82946>] gfs_glock_dq_uninit+0x8/0x10
[gfs]
Dec 4 07:38:04 serverX.local [<f8b979b2>] do_unflock+0x4f/0x61 [gfs]
Dec 4 07:38:04 serverX.local [<f8b97a26>] gfs_flock+0x62/0x76 [gfs]
Dec 4 07:38:04 serverX.local [<f8b979c4>] gfs_flock+0x0/0x76 [gfs]
Dec 4 07:38:04 serverX.local [<c016e439>] sys_flock+0x96/0x119
Dec 4 07:38:04 serverX.local [<c02d6093>] syscall_call+0x7/0xb
Dec 4 07:38:04 serverX.local Code: f8 ba a9 77 ba f8 68 8a 74 ba f8 8b 44 24 14 e8 9c 2f 02 00 59 5b f6 45 15 08 74 06 f0 0f ba 6f 08 04 f6 45 15 04 74 38 8b 57 28 <8b> 02 0f 18 00 90 8d 47 28 39 c2 74 0b ff 04 24 89 54 24 04 8b
Hello,

yet another occurrence of the error, this time caused by Apache (httpd).
Same RPMs and architecture (32bit) as in comment #5, but a different cluster.

* serverY naming is just to anonymize the host *
=================================================
Dec 15 05:51:44 serverY.local kernel: Unable to handle kernel NULL pointer dereference at virtual address 0000000c
Dec 15 05:51:44 serverY.local kernel: printing eip:
Dec 15 05:51:44 serverY.local kernel: f8b62037
Dec 15 05:51:44 serverY.local kernel: *pde = 22566001
Dec 15 05:51:44 serverY.local kernel: Oops: 0000 [#1]
Dec 15 05:51:44 serverY.local kernel: SMP
Dec 15 05:51:44 serverY.local kernel: Modules linked in: sg sunrpc ext3 jbd ohci_hcd cpqphp e100 mii floppy scsi_transport_fc md5 ipv6 lock_dlm(U) dlm(U) gfs(U) lock_harness(U) cman(U) qla2400(U) qla2300(U) qla2xxx(U) qla2xxx_conf(U) cciss sd_mod scsi_mod dm_snapshot dm_mirror dm_mod tg3
Dec 15 05:51:44 serverY.local kernel: CPU: 1
Dec 15 05:51:44 serverY.local kernel: EIP: 0060:[] Not tainted VLI
Dec 15 05:51:44 serverY.local kernel: EFLAGS: 00210213 (2.6.9-55.0.2.ELsmp)
Dec 15 05:51:44 serverY.local kernel: EIP is at add_to_queue+0x2c/0x27b [gfs]
Dec 15 05:51:44 serverY.local kernel: eax: e5989630 ebx: cb775c9c ecx: cb775cc0 edx: f168a274
Dec 15 05:51:44 serverY.local kernel: esi: 00000000 edi: f168a24c ebp: f168a24c esp: e2564eec
Dec 15 05:51:44 serverY.local kernel: ds: 007b es: 007b ss: 0068
Dec 15 05:51:44 serverY.local kernel: Process httpd (pid: 2875, threadinfo=e2564000 task=e5989630)
Dec 15 05:51:44 serverY.local kernel: Stack: f904b000 f168a268 cb775c9c f904b000 f168a24c f8b6234e 00000000 c51141f8
Dec 15 05:51:44 serverY.local kernel: 00000000 00000480 cb775c9c f8b778f2 cb775c9c 00000001 c51141f8 cb775c80
Dec 15 05:51:44 serverY.local kernel: f7d6906c d7ed9e80 f168a24c d7ed9e80 00000042 00000180 c015a5b0 d7ed9e80
Dec 15 05:51:44 serverY.local kernel: Call Trace:
Dec 15 05:51:44 serverY.local kernel: [] gfs_glock_nq+0xc8/0x116 [gfs]
Dec 15 05:51:44 serverY.local kernel: [] do_flock+0x111/0x182 [gfs]
Dec 15 05:51:44 serverY.local kernel: [] filp_open+0x5c/0x70
Dec 15 05:51:44 serverY.local kernel: [] cache_alloc_refill+0x156/0x19d
Dec 15 05:51:44 serverY.local kernel: [] gfs_flock+0x0/0x76 [gfs]
Dec 15 05:51:44 serverY.local kernel: [] sys_flock+0x96/0x119
Dec 15 05:51:44 serverY.local kernel: [] syscall_call+0x7/0xb
Dec 15 05:51:44 serverY.local kernel: [] unix_stream_sendmsg+0x227/0x33a
Dec 15 05:51:44 serverY.local kernel: Code: 57 56 53 89 c3 51 8b 78 08 8b 87 9c 00 00 00 89 04 24 8b 43 0c 85 c0 0f 84 29 02 00 00 8b 77 28 8d 57 28 39 d6 0f 84 f6 00 00 00 <39> 46 0c 0f 85 e6 00 00 00 f6 43 14 08 75 2d f6 46 14 08 74 27
Dec 15 05:51:44 serverY.local kernel: <0>Fatal exception: panic in 5 seconds

We just recently found this problem on RHEL5. It's documented internally as bug #426291, although I don't know if permissions on the bugzilla record will allow the folks at Atix to view that. (I don't have the ability to change the permissions; sorry). I'm reassigning this to the developer who did the RHEL5 fix so we can get it fixed in RHEL4 as well.

Basically, gfs_glock_dq has a list traversal that isn't protected by the necessary spin_lock. While you are traversing the list, you can get moved onto another list, which mucks things up. I've checked in the fix for that. Unfortunately, it doesn't explain the panic in Comment #6, which is in completely different code. I can't see a way for the gfs_glock_dq issue to affect add_to_queue, so I'm betting that they are two separate issues. I'm looking into it.

The panic in add_to_queue happens in almost the exact same way that the one in gfs_glock_dq happens. The big difference is that add_to_queue has taken out the necessary spin_lock. This leads me to believe that some function is modifying the gl_holders list without holding the spin_lock. This should be pretty easy to figure out once I set up a RHEL4 cluster.

I have been totally unable to reproduce this second issue during multiple days of testing. I still can't see a way for my original fix to change this. However, I wasn't able to spot any place in the code where a function was modifying this list without locking. Unless anyone has serious objections, I'm going to mark this bug as modified since there was a bug-fixing code change associated with it. If the other panic can still be reproduced, we can open a new bug for it.

*** Bug 441593 has been marked as a duplicate of this bug. ***

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0802.html
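
For readers who want to see the class of bug described for gfs_glock_dq above (a list traversal that is not covered by the lock protecting the list), here is a minimal userspace sketch. It is not the GFS code and not the actual fix: `struct holder`, `struct glock`, `add_holder`, `move_to_waiters` and `count_flagged` are hypothetical names made up for illustration, and a pthread mutex stands in for the kernel spin_lock. The point is only that the walk in `count_flagged` holds the lock for its whole duration, so a concurrent `move_to_waiters` cannot relink an entry onto another list in the middle of the walk.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not the real GFS structures. */
struct holder {
    struct holder *next;
    int flags;
};

struct glock {
    pthread_mutex_t lock;    /* plays the role of the glock spin_lock */
    struct holder *holders;  /* plays the role of the gl_holders list */
    struct holder *waiters;  /* another queue an entry can be moved to */
};

void glock_init(struct glock *gl)
{
    pthread_mutex_init(&gl->lock, NULL);
    gl->holders = NULL;
    gl->waiters = NULL;
}

/* Push a holder onto the front of the holders list, under the lock. */
void add_holder(struct glock *gl, struct holder *h)
{
    assert(h != NULL);
    pthread_mutex_lock(&gl->lock);
    h->next = gl->holders;
    gl->holders = h;
    pthread_mutex_unlock(&gl->lock);
}

/* Unlink h from holders and push it onto waiters, all under the lock.
 * Returns 1 if h was found on the holders list, 0 otherwise. */
int move_to_waiters(struct glock *gl, struct holder *h)
{
    int found;
    struct holder **pp;

    pthread_mutex_lock(&gl->lock);
    pp = &gl->holders;
    while (*pp != NULL && *pp != h)
        pp = &(*pp)->next;
    found = (*pp == h);
    if (found) {
        *pp = h->next;
        h->next = gl->waiters;
        gl->waiters = h;
    }
    pthread_mutex_unlock(&gl->lock);
    return found;
}

/* Count holders that have any bit of `flag` set. The lock is held for the
 * whole traversal; walking the list without it is the pattern that can
 * follow a `next` pointer into a different list or into NULL. */
int count_flagged(struct glock *gl, int flag)
{
    int n = 0;
    struct holder *h;

    pthread_mutex_lock(&gl->lock);
    for (h = gl->holders; h != NULL; h = h->next)
        if (h->flags & flag)
            n++;
    pthread_mutex_unlock(&gl->lock);
    return n;
}
```

If the lock/unlock pair in `count_flagged` were dropped, a second thread calling `move_to_waiters` on the entry currently being visited would leave the walker following that entry's `next` pointer into the waiters list, which is roughly the failure mode the comment above describes.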