Bug 218525

Summary: dlm assertion/panic while umounting in lkb->lkb_astaddr != DLM_FAKE_USER_AST
Product: Red Hat Enterprise Linux 5 Reporter: Corey Marthaler <cmarthal>
Component: kernel Assignee: David Teigland <teigland>
Status: CLOSED CURRENTRELEASE QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.0 CC: ccaulfie
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RC Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-02-08 01:23:47 UTC Type: ---
Attachments:
Description Flags
patch 1 of 1 none

Description Corey Marthaler 2006-12-05 20:42:09 UTC
Description of problem:
Had a perfectly healthy cluster up and running with 3 gfs filesystems. I then
attempted to stop clvmd on all nodes (which failed, as it should, to deactivate
the mounted gfs volumes). After that I attempted to umount one of the gfs
filesystems on all nodes, and that (or the previous clvmd stop) appears to have
caused this assertion/panic on taft-04:
 
dlm: clvmd: recover 3
dlm: clvmd: remove member 2
dlm: clvmd: total members 3 error 0
dlm: clvmd: dlm_recover_directory
dlm: clvmd: dlm_recover_directory 0 entries
dlm: clvmd: pre recover waiter lkid 1017b type 3 flags 1

DLM:  Assertion failed on line 41 of file fs/dlm/ast.c
DLM:  assertion:  "lkb->lkb_astaddr != DLM_FAKE_USER_AST"
DLM:  time = 4295671189
lkb: nodeid 2 id 1017b remid 10277 exflags 0 flags 0
     status 0 rqmode -1 grmode -1 wait_type 0 ast_type 0

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/dlm/ast.c:41
invalid opcode: 0000 [1] SMP
last sysfs file: /kernel/dlm/clvmd/control
CPU 3
Modules linked in: lock_nolock gfs(U) autofs4 hidp rfcomm l2cap bluetooth
lock_dlm gfs2 dlm d
Pid: 3424, comm: dlm_recoverd Not tainted 2.6.18-1.2767.el5 #1
RIP: 0010:[<ffffffff884210c7>]  [<ffffffff884210c7>] :dlm:dlm_add_ast+0x62/0xd2
RSP: 0018:ffff81020c6a1e10  EFLAGS: 00010286
RAX: 0000000000000004 RBX: ffff81020c708d40 RCX: ffffffff80355e28
RDX: ffffffff80355e28 RSI: 0000000000000000 RDI: ffffffff80355e20
RBP: 0000000000000001 R08: ffffffff80355e28 R09: 0000000000000046
R10: 0000000000000000 R11: 0000000000000280 R12: 00000000fffefffe
R13: ffff810219e72ac8 R14: ffff8102177dd810 R15: ffffffff8009c378
FS:  0000000000000000(0000) GS:ffff8101fff59640(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaadfb4000 CR3: 0000000214322000 CR4: 00000000000006e0
Process dlm_recoverd (pid: 3424, threadinfo ffff81020c6a0000, task ffff810211bc60c0)
Stack:  ffff8102177dd800 ffff81020c708d40 ffff8102177dd800 ffffffff88424596
 ffff810219e72cbc ffff81020c708d40 ffff810219e72800 ffff810219e72878
 00000001000abd3f ffffffff8842674f 0000000000000000 ffff810219e72cbc
Call Trace:
 [<ffffffff88424596>] :dlm:_receive_unlock_reply+0x57/0x8d
 [<ffffffff8842674f>] :dlm:dlm_recover_waiters_pre+0x194/0x255
 [<ffffffff8842af10>] :dlm:dlm_recoverd+0x18e/0x3bd
 [<ffffffff8003231b>] kthread+0xf6/0x12a
 [<ffffffff8005c2e5>] child_rip+0xa/0x11
DWARF2 unwinder stuck at child_rip+0xa/0x11
Leftover inexact backtrace:
 [<ffffffff8009c378>] keventd_create_kthread+0x0/0x61
 [<ffffffff80032225>] kthread+0x0/0x12a
 [<ffffffff8005c2db>] child_rip+0x0/0x11


Code: 0f 0b 68 ee eb 42 88 c2 29 00 48 c7 c7 57 ec 42 88 31 c0 e8
RIP  [<ffffffff884210c7>] :dlm:dlm_add_ast+0x62/0xd2
 RSP <ffff81020c6a1e10>
 <0>Kernel panic - not syncing: Fatal exception


Version-Release number of selected component (if applicable):
[root@taft-04 ~]# uname -ar
Linux taft-04 2.6.18-1.2767.el5 #1 SMP Wed Nov 29 17:38:40 EST 2006 x86_64
x86_64 x86_64 GNU/Linux
[root@taft-04 ~]# rpm -qa | grep gfs
gfs2-utils-0.1.18-1.el5
kmod-gfs-0.1.13-2.2.6.18_1.2767.el5
gfs-utils-0.1.9-2.el5
[root@taft-04 ~]# rpm -qa | grep cman
cman-2.0.44-1.el5


How reproducible:
Only once so far.

Comment 1 David Teigland 2006-12-05 20:55:34 UTC
Had the same thing in bug 203435; probably a similar issue.


Comment 2 David Teigland 2006-12-11 16:45:48 UTC
Have a patch to fix this; need to give it a quick smoke test before
sending it out.  Flags were not being set in the stub (faked) message
reply, so when the reply was processed, the lower flags of the lkb were
wiped out.  That includes the USER flag, so the user lkb was half
converted into a kernel lkb.


Comment 3 David Teigland 2006-12-11 22:58:52 UTC
Created attachment 143348 [details]
patch 1 of 1

When the dlm fakes an unlock/cancel reply from a failed node using a stub
message struct, it wasn't setting the flags in the stub message.  So, in
the process of receiving the fake message, the lkb flags would be updated
from the zero flags in the message and thereby cleared.  The problem
observed in tests was the loss of the USER flag, which caused the dlm to
think a user lock was a kernel lock and subsequently fail an assertion
checking the validity of the ast/callback field.
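
The flag loss described above can be illustrated with a small standalone
model.  This is only a sketch, not the patch or the dlm source: every name
below (model_lkb, model_msg, MODEL_IFL_USER, the model_* functions) is
hypothetical, and the flag value, the 16-bit low/high split, and the idea of
copying the lkb flags into the stub are assumptions made for illustration.
The behaviour it models would be consistent with the log in the description,
where the waiter is reported with "flags 1" before the fake reply and the lkb
dump at the assertion shows "flags 0".

```c
#include <stdint.h>

/* Hypothetical values: the real dlm constants may differ. */
#define MODEL_IFL_USER  0x00000001u  /* assumed "user lock" flag, in the low half */
#define MODEL_LOW_MASK  0x0000FFFFu  /* assumed: low flags are taken from the message */
#define MODEL_HIGH_MASK 0xFFFF0000u  /* assumed: high flags stay local to the lkb */

struct model_lkb { uint32_t flags; };  /* stand-in for a lock block */
struct model_msg { uint32_t flags; };  /* stand-in for a dlm message */

/* Model of processing a reply: the low flags of the lkb are overwritten
 * with the low flags carried in the message. */
void model_receive_flags_reply(struct model_lkb *lkb, const struct model_msg *ms)
{
    lkb->flags = (lkb->flags & MODEL_HIGH_MASK) | (ms->flags & MODEL_LOW_MASK);
}

/* Buggy stub reply: the flags field is never set, so it stays zero. */
struct model_msg model_stub_reply_buggy(const struct model_lkb *lkb)
{
    struct model_msg ms = { 0 };
    (void)lkb;  /* the lkb is ignored, which is the bug being modelled */
    return ms;
}

/* One possible fix (an assumption): carry the lkb's own flags in the stub
 * so that processing the fake reply leaves them unchanged. */
struct model_msg model_stub_reply_fixed(const struct model_lkb *lkb)
{
    struct model_msg ms = { lkb->flags };
    return ms;
}
```

In this model, a lock that starts with MODEL_IFL_USER set loses the flag after
model_receive_flags_reply() is fed the buggy stub, and keeps it when fed the
stub that carries the flags.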

Comment 4 Don Zickus 2006-12-18 02:58:50 UTC
184896

Comment 5 Don Zickus 2006-12-18 17:46:05 UTC
ignore previous useless comment
in 2.6.18-1.2910.el5

Comment 6 RHEL Program Management 2007-02-08 01:23:48 UTC
A package has been built which should help the problem described in 
this bug report. This report is therefore being closed with a resolution 
of CURRENTRELEASE. You may reopen this bug report if the solution does 
not work for you.