Description of problem:
Had a perfectly healthy cluster up and running with 3 gfs filesystems. I then
attempted to stop clvmd on all nodes (which failed, as it should, since it
could not deactivate the mounted gfs volumes). After that I attempted to
umount one of the gfs filesystems on all nodes, and that (or the previous
clvmd stop) appeared to have caused this assertion/panic on taft-04:

dlm: clvmd: recover 3
dlm: clvmd: remove member 2
dlm: clvmd: total members 3 error 0
dlm: clvmd: dlm_recover_directory
dlm: clvmd: dlm_recover_directory 0 entries
dlm: clvmd: pre recover waiter lkid 1017b type 3 flags 1

DLM:  Assertion failed on line 41 of file fs/dlm/ast.c
DLM:  assertion:  "lkb->lkb_astaddr != DLM_FAKE_USER_AST"
DLM:  time = 4295671189
lkb: nodeid 2 id 1017b remid 10277 exflags 0 flags 0 status 0 rqmode -1
     grmode -1 wait_type 0 ast_type 0

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/dlm/ast.c:41
invalid opcode: 0000 [1] SMP
last sysfs file: /kernel/dlm/clvmd/control
CPU 3
Modules linked in: lock_nolock gfs(U) autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm d
Pid: 3424, comm: dlm_recoverd Not tainted 2.6.18-1.2767.el5 #1
RIP: 0010:[<ffffffff884210c7>]  [<ffffffff884210c7>] :dlm:dlm_add_ast+0x62/0xd2
RSP: 0018:ffff81020c6a1e10  EFLAGS: 00010286
RAX: 0000000000000004 RBX: ffff81020c708d40 RCX: ffffffff80355e28
RDX: ffffffff80355e28 RSI: 0000000000000000 RDI: ffffffff80355e20
RBP: 0000000000000001 R08: ffffffff80355e28 R09: 0000000000000046
R10: 0000000000000000 R11: 0000000000000280 R12: 00000000fffefffe
R13: ffff810219e72ac8 R14: ffff8102177dd810 R15: ffffffff8009c378
FS:  0000000000000000(0000) GS:ffff8101fff59640(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaadfb4000 CR3: 0000000214322000 CR4: 00000000000006e0
Process dlm_recoverd (pid: 3424, threadinfo ffff81020c6a0000, task ffff810211bc60c0)
Stack:  ffff8102177dd800 ffff81020c708d40 ffff8102177dd800 ffffffff88424596
 ffff810219e72cbc ffff81020c708d40 ffff810219e72800 ffff810219e72878
 00000001000abd3f ffffffff8842674f 0000000000000000 ffff810219e72cbc
Call Trace:
 [<ffffffff88424596>] :dlm:_receive_unlock_reply+0x57/0x8d
 [<ffffffff8842674f>] :dlm:dlm_recover_waiters_pre+0x194/0x255
 [<ffffffff8842af10>] :dlm:dlm_recoverd+0x18e/0x3bd
 [<ffffffff8003231b>] kthread+0xf6/0x12a
 [<ffffffff8005c2e5>] child_rip+0xa/0x11
DWARF2 unwinder stuck at child_rip+0xa/0x11
Leftover inexact backtrace:
 [<ffffffff8009c378>] keventd_create_kthread+0x0/0x61
 [<ffffffff80032225>] kthread+0x0/0x12a
 [<ffffffff8005c2db>] child_rip+0x0/0x11

Code: 0f 0b 68 ee eb 42 88 c2 29 00 48 c7 c7 57 ec 42 88 31 c0 e8
RIP  [<ffffffff884210c7>] :dlm:dlm_add_ast+0x62/0xd2
 RSP <ffff81020c6a1e10>
<0>Kernel panic - not syncing: Fatal exception

Version-Release number of selected component (if applicable):
[root@taft-04 ~]# uname -ar
Linux taft-04 2.6.18-1.2767.el5 #1 SMP Wed Nov 29 17:38:40 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
[root@taft-04 ~]# rpm -qa | grep gfs
gfs2-utils-0.1.18-1.el5
kmod-gfs-0.1.13-2.2.6.18_1.2767.el5
gfs-utils-0.1.9-2.el5
[root@taft-04 ~]# rpm -qa | grep cman
cman-2.0.44-1.el5

How reproducible:
Only once so far
Had the same thing in bug 203435; probably a similar issue.
I have a patch to fix this; it needs a quick smoke test before I send it out. Flags were not being set in the stub (faked) message reply, so when the reply was processed, the lower flags were wiped out, including the USER flag, and the user lkb was half converted into a kernel lkb.
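To make the mechanism concrete, here is a small standalone sketch (not the actual fs/dlm source; the struct and field names are made up for illustration) of how reply processing merges the message's lower flag bits into the lkb, so a stub reply whose flags were never filled in clears the USER bit:

    /* Toy illustration only -- not the RHEL5 fs/dlm code.  Shows how a
     * stub reply carrying zero flags wipes the lower flag bits of the
     * lkb, including the bit that marks it as a userspace lock. */
    #include <stdio.h>
    #include <stdint.h>

    #define FAKE_IFL_USER 0x00000001u   /* assumed bit, illustration only */

    struct fake_lkb     { uint32_t lkb_flags; };
    struct fake_message { uint32_t m_flags;   };

    /* Keep the upper (node-local) bits, take the lower bits from the
     * message -- the kind of merge described in the comment above. */
    static void fake_receive_flags_reply(struct fake_lkb *lkb,
                                         const struct fake_message *ms)
    {
            lkb->lkb_flags = (lkb->lkb_flags & 0xFFFF0000u) |
                             (ms->m_flags    & 0x0000FFFFu);
    }

    int main(void)
    {
            struct fake_lkb lkb = { .lkb_flags = FAKE_IFL_USER };
            struct fake_message stub = { 0 };   /* flags never set: the bug */

            fake_receive_flags_reply(&lkb, &stub);
            printf("USER flag after stub reply: %s\n",
                   (lkb.lkb_flags & FAKE_IFL_USER) ? "still set" : "lost");
            return 0;
    }

With the USER bit lost, the lkb looks like a kernel lock, which is exactly the condition the assertion in fs/dlm/ast.c:41 trips on.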
Created attachment 143348 [details]
patch 1 of 1

When the dlm fakes an unlock/cancel reply from a failed node using a stub message struct, it wasn't setting the flags in the stub message. So, while processing the fake message, the lkb flags would be updated from the zero flags in the message and thereby cleared. The problem observed in tests was the loss of the USER flag, which caused the dlm to think a user lock was a kernel lock and subsequently fail an assertion checking the validity of the ast/callback field.
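The attached patch is the authoritative fix; as a rough sketch of the idea (continuing the toy types from the example above, so the names here are assumptions, not the patch itself), the recovery path just needs to copy the lkb's flags into the stub message before the fake reply is processed:

    /* Sketch only -- the real change is in the attached patch.  Before
     * recovery "receives" its own stub unlock/cancel reply, carry the
     * lkb's flags in the stub so the USER bit survives the merge. */
    static void fake_stub_unlock_reply(struct fake_lkb *lkb,
                                       struct fake_message *stub_ms)
    {
            stub_ms->m_flags = lkb->lkb_flags;        /* the missing assignment */
            fake_receive_flags_reply(lkb, stub_ms);   /* lower bits preserved   */
    }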
184896
Ignore the previous useless comment. The fix is in 2.6.18-1.2910.el5.
A package has been built which should help the problem described in this bug report. This report is therefore being closed with a resolution of CURRENTRELEASE. You may reopen this bug report if the solution does not work for you.