Bug 128318 - Another assertion failure in dlm/lock.c during recovery
Summary: Another assertion failure in dlm/lock.c during recovery
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: dlm
Version: 4
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2004-07-21 18:27 UTC by Corey Marthaler
Modified: 2009-04-16 20:29 UTC (History)
2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-03-31 22:37:22 UTC
Embargoed:



Description Corey Marthaler 2004-07-21 18:27:36 UTC
Description of problem: 
This may be related to bz127008.  
Similar scenario: a healthy cluster running I/O. Two nodes are shot 
(morph-01 and morph-03), which causes morph-04 to assert and then 
panic: 
 
un 2,29a4bdf id 20036 cur 5 0 
un 2,29a4be0 id 10045 cur 5 0 
qc 2,1a 3,3 id 10224 sts -65538 
un 2,29cff84 id 40053 cur 5 0 
un 2,29a52d4 id 400ea cur 5 0 
un 2,29b40ef id 302ad cur 5 0 
 
lock_dlm:  Assertion failed on line 333 of file 
/usr/src/cluster/gfs-kernel/src/dlm/lock.c 
lock_dlm:  assertion:  "!error" 
lock_dlm:  time = 698515 
foobar5: error=-22 num=2,29b40ef 
 
Kernel panic: lock_dlm:  Record message above and reboot. 
 
Jul 21 13:07:20 morph-04 kernel: dlm: foobar2: total nodes 3 
Jul 21 13:07:20 morph-04 kernel: dlm: foobar2: rebuild resource 
directory 
Jul 21 13:07:20 morph-04 kernel: dlm: foobar2: rebuilt 2080 
resources 
Jul 21 13:07:20 morph-04 kernel: dlm: foobar2: purge requests 
Jul 21 13:07:20 morph-04 kernel: dlm: foobar2: purged 0 requests 
Jul 21 13:07:20 morph-04 ccsd[3769]: Error while processing get: No 
data available 
Jul 21 13:07:20 morph-04 ccsd[3769]: Error while processing get: No 
data available 
Jul 21 13:07:21 morph-04 kernel: dlm: foobar2: mark waiting requests 
Jul 21 13:07:21 morph-04 kernel: dlm: foobar2: marked 0 requests 
 
<1>Unable to handle kernel NULL pointer dereference at virtual 
address 00000000 
 printing eip: 
00000000 
*pde = 00000000 
Oops: 0000 [#2] 
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs 
lock_harness ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy 
sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac 
ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod 
CPU:    0 
EIP:    0060:[<00000000>]    Not tainted 
EFLAGS: 00010017   (2.6.7) 
EIP is at 0x0 
eax: 00beacd4   ebx: 00beacd4   ecx: 00000000   edx: 00000003 
esi: 00000000   edi: f7fc36e4   ebp: f479fe54   esp: f479fe34 
ds: 007b   es: 007b   ss: 0068 
Process gfs_glockd (pid: 4041, threadinfo=f479e000 task=f49ae1b0) 
Stack: c0118897 00000000 bffffe24 00000001 00000003 00000000 
00000286 f479fe7c 
       f479fe6c c01188f2 00000000 00000000 f8a7c3a0 c23b1e48 
f479fe7c f8a7c3be 
       00000000 c012209d f479fe7c f479fe7c c0122217 00000001 
c03b4ea8 0000000a 
Call Trace: 
 [<c0118897>] __wake_up_common+0x37/0x70 
 [<c01188f2>] __wake_up+0x22/0x30 
 [<f8a7c3a0>] dlm_wait_timer_fn+0x0/0x20 [dlm] 
 [<f8a7c3be>] dlm_wait_timer_fn+0x1e/0x20 [dlm] 
 [<c012209d>] run_timer_softirq+0xad/0x150 
 [<c0122217>] do_timer+0xc7/0xd0 
 [<c011e809>] __do_softirq+0x79/0x80 
 [<c011e837>] do_softirq+0x27/0x30 
 [<c01077c5>] do_IRQ+0xd5/0x110 
 [<c0105e6c>] common_interrupt+0x18/0x20 
 [<c011b1d0>] panic+0xe0/0x100 
 [<f8b9f624>] do_dlm_unlock+0xf4/0x100 [lock_dlm] 
 [<f8b9f94c>] lm_dlm_unlock+0x1c/0x70 [lock_dlm] 
 [<f8a22fed>] gfs_glock_drop_th+0x5d/0x120 [gfs] 
 [<f8a22697>] rq_demote+0x87/0xa0 [gfs] 
 [<f8a2272f>] run_queue+0x7f/0xa0 [gfs] 
 [<f8a2458b>] gfs_reclaim_glock+0x7b/0x110 [gfs] 
 [<f8a16dd7>] gfs_glockd+0x107/0x120 [gfs] 
 [<c0118850>] default_wake_function+0x0/0x10 
 [<c0105c12>] ret_from_fork+0x6/0x14 
 [<c0118850>] default_wake_function+0x0/0x10 
 [<f8a16cd0>] gfs_glockd+0x0/0x120 [gfs] 
 [<c010429d>] kernel_thread_helper+0x5/0x18 
 
Code:  Bad EIP value. 
 <0>Kernel panic: Fatal exception in interrupt 
In interrupt handler - not syncing

Comment 1 David Teigland 2004-08-19 04:47:56 UTC
To reproduce this I had 4 nodes running make_panic and another
four nodes running a mount/umount loop for several hours.


Comment 2 David Teigland 2004-09-17 07:04:46 UTC
If we ever hit this again, there will be a line in the dumped
dlm debug log specifying the exact EINVAL condition.

Comment 3 Kiersten (Kerri) Anderson 2004-11-04 15:16:08 UTC
Updated with the proper version and component name.

Comment 4 Corey Marthaler 2005-01-11 00:08:42 UTC
This hasn't been seen in over 5 months of recovery testing.

Comment 5 Derek Anderson 2005-02-18 18:55:36 UTC
I think this may be the same bug.  I was running a four-node cluster
with link-08 running 'while :; do placemaker -d 7 -w 3; find; rm -rf
place_root; done'.  link-10 was running a mount/umount loop, and
link-12 was looping on bonnie++.  After a couple of hours link-08 hit
the assertion.  I will put full logfiles in ~danderso/bugs/128318.

Note: This is a mixed-arch cluster.  link-10, link-11, and link-12
are i686 and link-08 is an x86_64 Opteron.

Note: The placemaker tool is in the sistina-test tree if wanted/needed.

lock_dlm:  Assertion failed on line 352 of file
/usr/src/build/522379-x86_64/BUILD/smp/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 4305526027
data1: error=-22 num=2,825c37

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:352
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) parport_pc lp
parport autofs4 dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket
pcmcia_core button battery ac ohci_hcd hw_random tg3 floppy
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx
scsi_transport_fc mptscsih mptbase sd_mod scsi_mod
Pid: 3380, comm: gfs_glockd Tainted: G   M  2.6.9-5.ELsmp
RIP: 0010:[<ffffffffa0268804>]
<ffffffffa0268804>{:lock_dlm:do_dlm_unlock+189}
RSP: 0018:000001001e9f5de8  EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000100000000
RDX: ffffffff803c7508 RSI: 0000000000000246 RDI: ffffffff803c7500
RBP: 000001001cbb6dc0 R08: ffffffff803c7508 R09: 00000000ffffffea
R10: 0000000000000097 R11: 0000000000000097 R12: 000001001c637c9c
R13: ffffff000016a000 R14: ffffffffa0264d20 R15: 000001001c637c70
FS:  0000002a95563b00(0000) GS:ffffffff804bf380(0000)
knlGS:00000000f7ff06c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a95557000 CR3: 000000001ffb2000 CR4: 00000000000006e0
Process gfs_glockd (pid: 3380, threadinfo 000001001e9f4000, task
000001001e507030)
Stack: 0000000000000000 ffffff000016a000 000001001c637c70 ffffffffa0268b7e
       0000000000000001 ffffffffa023051f 0000000000000001 ffffffffa022709c
       000001001fdd4500 000001001c637c70
Call Trace:<ffffffffa0268b7e>{:lock_dlm:lm_dlm_unlock+15}
<ffffffffa023051f>{:gfs:gfs_lm_unlock+41}
       <ffffffffa022709c>{:gfs:gfs_glock_drop_th+290}
<ffffffffa0225845>{:gfs:run_queue+314}
       <ffffffffa0225a9a>{:gfs:unlock_on_glock+37}
<ffffffffa0225b90>{:gfs:gfs_reclaim_glock+234}
       <ffffffffa021a61a>{:gfs:gfs_glockd+61}
<ffffffff8013176a>{default_wake_function+0}
       <ffffffff8013176a>{default_wake_function+0}
<ffffffff80110c23>{child_rip+8}
       <ffffffffa021a5dd>{:gfs:gfs_glockd+0}
<ffffffff80110c1b>{child_rip+0}


Code: 0f 0b 13 cf 26 a0 ff ff ff ff 60 01 48 c7 c7 18 cf 26 a0 31
RIP <ffffffffa0268804>{:lock_dlm:do_dlm_unlock+189} RSP <000001001e9f5de8>
 <0>Kernel panic - not syncing: Oops

Comment 6 Derek Anderson 2005-02-18 19:12:51 UTC
When node link-08 was power cycled and was rejoining the cluster,
this happened:

Starting cups: [  OK  ]
Starting sshd:[  OK  ]
Starting xinetd: [  OK  ]
Starting sendmail: clvmd move flags 0,1,0 ids 0,2,0
clvmd move use event 2
clvmd recover event 2 (first)
clvmd add nodes
clvmd total nodes 2
clvmd rebuild resource directory
clvmd rebuilt 0 resources
clvmd recover event 2 done
clvmd move flags 0,0,1 ids 0,2,2
clvmd process held requests
clvmd processed 0 requests
clvmd recover event 2 finished
clvmd move flags 1,0,0 ids 2,2,2
clvmd move flags 0,1,0 ids 2,3,2
clvmd move use event 3
clvmd recover event 3
clvmd add node 11
clvmd add_to_requestq cmd 3 fr 11
clvmd total nodes 3
clvmd rebuild resource directory
clvmd rebuilt 0 resources
clvmd purge requests
clvmd purged 0 requests
clvmd mark waiting requests
clvmd marked 0 requests
clvmd recover event 3 done
clvmd move flags 0,0,1 ids 2,3,3
clvmd process held requests
clvmd process_requestq cmd 3 fr 11

DLM:  Assertion failed on line 1129 of file
/usr/src/build/522362-x86_64/BUILD/smp/src/lockqueue.c
DLM:  assertion:  "lkb"
DLM:  time = 4295181729
dlm: request
rh_cmd 3
rh_lkid 103d1
remlkid 103b4
flags 0
status 0
rqmode 255
nodeid 11

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lockqueue:1129
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: parport_pc lp parport autofs4 dlm(U) cman(U) md5
ipv6 sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd
hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
qla2300 qla2xxx scsi_transport_fc mptscsih mptbase sd_mod scsi_mod
Pid: 2282, comm: dlm_recoverd Tainted: G   M  2.6.9-5.ELsmp
RIP: 0010:[<ffffffffa01d7e70>]
<ffffffffa01d7e70>{:dlm:process_cluster_request+4355}
RSP: 0018:000001001ef53dd8  EFLAGS: 00010212
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000246
RDX: 0000000000004b32 RSI: 0000000000000246 RDI: ffffffff803c7520
RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 000001003ffd1400
R13: 000001001f3d8cd4 R14: 0000000000000000 R15: 000001003ffd1400
FS:  0000002a95563b00(0000) GS:ffffffff804bf380(0000)
knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000552ad55cd8 CR3: 000000001ffb2000 CR4: 00000000000006e0
Process dlm_recoverd (pid: 2282, threadinfo 000001001ef52000, task
000001001fac9030)
Stack: 0000000000000000 0000000000000000 0000000b00000000 0000000000000002
       000001001f1fa030 0000000000000069 0000010020712bc0 0000000180130205
       000001001fac9030 0000000000003da6
Call Trace:<ffffffffa01d813c>{:dlm:process_requestqueue+189}
<ffffffffa01e1b42>{:dlm:dlm_recoverd+3086}
       <ffffffffa01e0f34>{:dlm:dlm_recoverd+0}
<ffffffff80148300>{keventd_create_kthread+0}
       <ffffffff801482d7>{kthread+200} <ffffffff80110c23>{child_rip+8}
       <ffffffff80148300>{keventd_create_kthread+0}
<ffffffff8014820f>{kthread+0}
       <ffffffff80110c1b>{child_rip+0}

Code: 0f 0b 85 3c 1e a0 ff ff ff ff 69 04 e9 e9 00 00 00 8b 00 a9
RIP <ffffffffa01d7e70>{:dlm:process_cluster_request+4355} RSP
<000001001ef53dd8>
 <0>Kernel panic - not syncing: Oops

Comment 7 Corey Marthaler 2005-02-21 21:11:12 UTC
I've seen this now too; here's everything that I could grab above
the assert:

dlm: dlm_unlock: lkid 50264 lockspace not found
ror -105 1a0019
gfs0 remote_stage error -105 1e0298
gfs1 remote_stage error -105 160189
gfs0 remote_stage error -105 25004d
gfs1 remote_stage error -105 1801e2
gfs5 remote_stage error -105 1900db
gfs3 remote_stage error -105 170275
gfs4 remote_stage error -105 1801ac
gfs4 remote_stage error -105 160310
gfs9 remote_stage error -105 1d01dd
gfs7 remote_stage error -105 1f00cb
gfs9 remote_stage error -105 140104
gfs0 remote_stage error -105 1f018f
gfs5 remote_stage error -105 1202a1
gfs3 remote_stage error -105 1703d5
gfs3 remote_stage error -105 1800c6
gfs5 remote_stage error -105 180295
gfs2 remote_stage error -105 170117
gfs6 remote_stage error -105 130274
gfs4 remote_stage error -105 1603cf
gfs7 remote_stage error -105 1503a0
gfs9 remote_stage error -105 160071
gfs1 remote_stage error -105 1c0216
gfs2 remote_stage error -105 1b00f7
gfs7 remote_stage error -105 140136
gfs2 remote_stage error -105 2403d5
gfs5 remote_stage error -105 1902e1
gfs3 remote_stage error -105 1401ce
gfs4 remote_stage error -105 13025d
3be sts 0 0
9293 ex punlock 0
9293 en plock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 req 7,37 ex 2ec187-2ed3d9 lkf 2000 wait 1
9293 lk 7,37 id 0 -1,5 2000
9293 lk 11,37 id 603be 5,0 4
7131 qc 7,37 -1,5 id 1f0358 sts 0 0
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex plock 0
9293 en punlock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 remove 7,37
9293 un 7,37 1f0358 5 0
7131 qc 7,37 5,5 id 1f0358 sts -65538 0
9293 lk 11,37 id 603be 5,0 4
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex punlock 0
9293 en plock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 req 7,37 ex 2ed3d9-2ed7dd lkf 2000 wait 1
9293 lk 7,37 id 0 -1,5 2000
9293 lk 11,37 id 603be 5,0 4
7131 qc 7,37 -1,5 id 2403a4 sts 0 0
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex plock 0
9293 en punlock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 remove 7,37
9293 un 7,37 2403a4 5 0
7131 qc 7,37 5,5 id 2403a4 sts -65538 0
9293 lk 11,37 id 603be 5,0 4
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex punlock 0
9293 en plock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 req 7,37 ex 2ed7de-2eddd4 lkf 2000 wait 1
9293 lk 7,37 id 0 -1,5 2000
9293 lk 11,37 id 603be 5,0 4
7131 qc 7,37 -1,5 id 1c018a sts 0 0
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex plock 0
9293 en punlock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 remove 7,37
9293 un 7,37 1c018a 5 0
7131 qc 7,37 5,5 id 1c018a sts -65538 0
9293 lk 11,37 id 603be 5,0 4
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex punlock 0
9293 en plock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 req 7,37 ex 2eddd5-2edffc lkf 2000 wait 1
9293 lk 7,37 id 0 -1,5 2000
9293 lk 11,37 id 603be 5,0 4
7131 qc 7,37 -1,5 id 250113 sts 0 0
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex plock 0
9293 en punlock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 remove 7,37
9293 un 7,37 250113 5 0
7131 qc 7,37 5,5 id 250113 sts -65538 0
9293 lk 11,37 id 603be 5,0 4
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex punlock 0
9293 en plock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 req 7,37 ex e12a9-26b17f lkf 2000 wait 1
9293 lk 7,37 id 0 -1,5 2000
9293 lk 11,37 id 603be 5,0 4
7131 qc 7,37 -1,5 id 200044 sts 0 0
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex plock 0
9334 en punlock 7,2d
9334 lk 11,2d id 201d9 0,5 4
7131 qc 11,2d 0,5 id 201d9 sts 0 0
9334 remove 7,2d
9334 un 7,2d 160147 5 0
7131 qc 7,2d 5,5 id 160147 sts -65538 0
9334 lk 11,2d id 201d9 5,0 4
7131 qc 11,2d 5,0 id 201d9 sts 0 0
9334 ex punlock 0
9334 en plock 7,2d
9334 lk 11,2d id 201d9 0,5 4
7131 qc 11,2d 0,5 id 201d9 sts 0 0
9334 req 7,2d ex 0-64da lkf 2000 wait 1
9334 lk 7,2d id 0 -1,5 2000
9334 lk 11,2d id 201d9 5,0 4
7131 qc 7,2d -1,5 id 14038e sts 0 0
7131 qc 11,2d 5,0 id 201d9 sts 0 0
9334 ex plock 0
9293 en punlock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 remove 7,37
9293 un 7,37 200044 5 0
7131 qc 7,37 5,5 id 200044 sts -65538 0
9293 lk 11,37 id 603be 5,0 4
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex punlock 0
9293 en plock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 req 7,37 ex 2ecc8b-2ed161 lkf 2000 wait 1
9293 lk 7,37 id 0 -1,5 2000
9293 lk 11,37 id 603be 5,0 4
7131 qc 7,37 -1,5 id 1e0271 sts 0 0
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex plock 0
9293 en punlock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 remove 7,37
9293 un 7,37 1e0271 5 0
7131 qc 7,37 5,5 id 1e0271 sts -65538 0
9293 lk 11,37 id 603be 5,0 4
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex punlock 0
9334 en punlock 7,2d
9334 lk 11,2d id 201d9 0,5 4
9336 en punlock 7,1ffe9
9336 lk 11,1ffe9 id 2006e 0,5 4
8494 un 2,20242 e012f 5 0
8506 un 8,0 100263 5 8
9339 lk 2,20282 id 0 -1,5 0
8429 lk 8,0 id 0 -1,5 8
9315 en punlock 7,10081
9315 lk 11,10081 id 10045 0,5 4
8186 un 2,4ff55 c03b5 5 0
8109 un 2,204a7 12028a 5 0
7964 un 2,400aa a03e0 5 0
8263 un 2,20144 c03c7 5 0
7887 un 2,100a9 200e0 5 0
8340 un 2,20281 50264 5 0

lock_dlm:  Assertion failed on line 352 of file
/usr/src/build/522381-i686/BUILD/gfs-kernel-2.6.9-23/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 1744300
gfs6: error=-22 num=2,20281

------------[ cut here ]------------
kernel BUG at
/usr/src/build/522381-i686/BUILD/gfs-kernel-2.6.9-23/src/dlm/lock.c:352!
invalid operand: 0000 [#13]
Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_dlm(U) dlm(U)
cman(U) lock_harness(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc
button battery ac uhci_hcd hw_random e1000 floppy dm_snapshot dm_zero
dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod
scsi_mod
CPU:    0
EIP:    0060:[<f8962ce9>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-5.EL)
EIP is at do_dlm_unlock+0xa2/0xb7 [lock_dlm]
eax: 00000001   ebx: ffffffea   ecx: f896854a   edx: c4599f44
esi: f5e4b680   edi: f5e4b680   ebp: f8bc9000   esp: c4599f40
ds: 007b   es: 007b   ss: 0068
Process gfs_glockd (pid: 8340, threadinfo=c4599000 task=c3ffd320)
Stack: f896854a f8bc9000 00000001 f8962ff4 f8b964c0 c67cbd7c f8bc9000
f8bc6640
       f8b89447 c67cbd7c f8bc6640 c4599fb4 f8b87ef9 c67cbd7c 00000001
f8b880b6
       c67cbd7c c67cbd7c f8b8836f c67cbe20 f8b8b9b4 c4599000 c4599fc0
f8b7bbf2
Call Trace:
 [<f8962ff4>] lm_dlm_unlock+0xe/0x16 [lock_dlm]
 [<f8b964c0>] gfs_lm_unlock+0x2b/0x40 [gfs]
 [<f8b89447>] gfs_glock_drop_th+0x17a/0x1b0 [gfs]
 [<f8b87ef9>] rq_demote+0x15c/0x1da [gfs]
 [<f8b880b6>] run_queue+0x5a/0xc1 [gfs]
 [<f8b8836f>] unlock_on_glock+0x6e/0xc8 [gfs]
 [<f8b8b9b4>] gfs_reclaim_glock+0x257/0x2ae [gfs]
 [<f8b7bbf2>] gfs_glockd+0x38/0xde [gfs]
 [<c011b9ea>] default_wake_function+0x0/0xc
 [<c0301b1a>] ret_from_fork+0x6/0x14
 [<c011b9ea>] default_wake_function+0x0/0xc
 [<f8b7bbba>] gfs_glockd+0x0/0xde [gfs]
 [<c01041d9>] kernel_thread_helper+0x5/0xb
Code: e8 72 d3 7b c7 ff 76 08 8b 06 ff 76 04 ff 76 0c 53 ff 70 18 68
6a 86 96 f8 e8 59 d3 7b c7 83 c4 2c 68 4a 85 96 f8 e8 4c d3 7b c7 <0f>
0b 60 01 dc 83 96 f8 68 4c 85 96 f8 e8 98 c7 7b c7 5b 5e c3


Comment 8 David Teigland 2005-02-22 02:48:27 UTC
Comments 5 and 7 pertain to a cman bug where cman shuts down while
gfs/dlm are running.  There have been various bugs dealing with this.

Comment 6 looks new and interesting, but has nothing to do with the
other information here.

Comment 9 Corey Marthaler 2005-02-22 17:46:16 UTC
I'm adding Dave's email to this bug for future reference.  Should
this be closed, or what state should this bug have, given that the
original issue is fixed but people will continue to see the assert
message?

Comment 10 Corey Marthaler 2005-02-22 17:47:11 UTC
We need to clear something up that might be confusing folks.  
Whenever you see the following (there are two forms, one for dlm_lock 
and another for dlm_unlock): 
 
lock_dlm:  Assertion failed on line 352 of file 
cluster/gfs-kernel/src/dlm/lock.c 
lock_dlm:  assertion:  "!error" 
lock_dlm:  time = 38903631 
a: error=-22 num=2,70650 
 
realize that you don't really know anything useful about what went 
wrong yet.  This assert/panic is not the real problem, just a 
signal that something else went wrong earlier in the dlm. 
 
I know it's simpler to panic right when something goes wrong, but our 
approach with the dlm has been different.  We tend not to panic in 
the dlm but instead print an error message and return the error to 
the caller.  That means that nearly anything that goes wrong in the 
dlm will end up returning an error to lock_dlm, which does this 
assert [1]. 
 
So, when you get a panic like the one above, you'll need to scroll 
back a bit to identify the real problem.  Capture the dlm debug log 
dump (it's pretty short and comes before the lock_dlm debug dump); 
prior to the debug dumps, there are often errors/warnings that were 
printed to the console. 
 
Thanks, this should help us resolve these bz's a lot quicker. 

Comment 11 David Teigland 2005-03-01 02:39:00 UTC
This is a duplicate of bug 139738, so it should also be fixed.

Comment 12 Corey Marthaler 2005-03-31 22:37:22 UTC
Closing this, as the original problem is fixed and all issues in the
last comments are tracked in bug 139738.

