From Bugzilla Helper:
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.1; Linux)

Description of problem:
I had a healthy 6 node cluster (morph-01 - morph-06) running I/O to one GFS filesystem. I then shot morph-04. This caused bugs 126526 and 126604 on morph-02 and morph-05 and caused morph-06 to trip this assert.

SM: send_nodeid_message error -107 to 5
SM: send_nodeid_message error -107 to 2
SM: 00000000 sm_stop: SG still joined
SM: 01000002 sm_stop: SG still joined
SM: 02000004 sm_stop: SG still joined
61 3819 w 1 ex plock 3819 error 0 en punlock 3817 7,1a007aa4 remove 7,1a007aa4 3817 ex punlock 3817 error 0 en plock 3817 7,1a007aa4 req 7,1a007aa4 ex 17a8c5c-2bd706b 3817 w 1 ex plock 3817 error 0 en punlock 3819 7,1a007aa6 remove 7,1a007aa6 3819 ex punlock 3819 error 0 en plock 3819 7,1a007aa6 req 7,1a007aa6 ex 2513761-2dcbff4 3819 w 1 ex plock 3819 error 0 en punlock 3819 7,1a007aa6 remove 7,1a007aa6 3819 ex punlock 3819 error 0 en plock 3819 7,1a007aa6 req 7,1a007aa6 ex 2dcbff4-2ee613d 3819 w 1 ex plock 3819 error 0 en punlock 3819 7,1a007aa6 remove 7,1a007aa6 3819 ex punlock 3819 error 0 en plock 3819 7,1a007aa6 req 7,1a007aa6 ex 30d38df-30d3ff9 3819 w 1 ex plock 3819 error 0 en punlock 3819 7,1a007aa6 remove 7,1a007aa6 3819 ex punlock 3819 error 0 en plock 3819 7,1a007aa6 req 7,1a007aa6 ex 0-2c6b988 3819 w 1 ex plock 3819 error 0 en punlock 3819 7,1a007aa6 en punlock 3817 7,1a007aa4 start c 5 type 1 e 8 cb_need_recovery jid 3 recovery_done jid 3 msg 309 recovery_done 3,6 f 1b recovery_done start_done 8

lock_dlm: Assertion failed on line 363 of file /usr/src/cluster/gfs-kernel/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 515259
corey0: num=2,19 err=-22 cur=3 req=0 lkf=4

Kernel panic: lock_dlm: Record message above and reboot.

How reproducible:
Didn't try
There has been a ton of testing and fixes since this was reported, and we've not seen it again. We should retry to be sure, but it's probably gone.
This is a duplicate of 127839, which I've just reproduced.

*** This bug has been marked as a duplicate of 127839 ***
Updating version to the right level in the defects. Sorry for the storm.
Created attachment 114177
log dump from cypher-01
I've just seen something that looks like this bug. Check out the attachment for details.
Running with 10 filesystems, I was getting this bug reliably after one or two rounds of revolver. After knocking the number down to 5, it seems to have gone away.
It appears that cman has shut down on this node, as is evident from all the ENOTCONN and ENOBUFS errors the threads start getting in the dlm. When cman shuts down, it tells the dlm to shut down, which means all the dlm locks go away. So when lock_dlm tries to convert one of its locks, the lock isn't there, an error is returned, and lock_dlm panics. It's not always clear when cman shuts down, but you can use kdb to look for the normal cman threads: see if they exist and, if they do, check what they're doing. You can also look for cman log messages on the different nodes.
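For anyone checking their own logs against the explanation above, here is a rough Python sketch (not part of any cluster tool) that looks for ENOTCONN/ENOBUFS-style errors ahead of the first lock_dlm assertion. The function names `summarize` and `cman_probably_down` are made up for illustration, the errno numbers are the usual Linux values hardcoded by hand, and the "errors before the assert" rule is only a heuristic based on the log excerpts in this bug.

```python
import re

# Linux errno values seen in this report, hardcoded so the sketch does not
# depend on the platform it runs on (assumption: the logs come from Linux).
ERRNO_NAMES = {22: "EINVAL", 105: "ENOBUFS", 107: "ENOTCONN"}

# Matches "error -107" (SM messages) and "err=-22" (lock_dlm assert line).
# "error 0" in the plock trace is deliberately not matched.
ERR_RE = re.compile(r"\berr(?:or)?[ =]-(\d+)")


def summarize(lines):
    """Return (errors_before_assert, assert_index).

    errors_before_assert lists the errno names found on lines that precede
    the first lock_dlm assertion; assert_index is that line's position in
    the input, or None if no assertion line was found.
    """
    seen = []
    for i, line in enumerate(lines):
        if "lock_dlm: Assertion failed" in line:
            return seen, i
        for num in ERR_RE.findall(line):
            seen.append(ERRNO_NAMES.get(int(num), "errno " + num))
    return seen, None


def cman_probably_down(lines):
    """Heuristic only: connection-type errors show up before the assert."""
    seen, idx = summarize(lines)
    return idx is not None and any(e in ("ENOTCONN", "ENOBUFS") for e in seen)


if __name__ == "__main__":
    # Lines taken from the original report in this bug.
    sample = [
        "SM: send_nodeid_message error -107 to 5",
        "SM: send_nodeid_message error -107 to 2",
        "SM: 00000000 sm_stop: SG still joined",
        "lock_dlm: Assertion failed on line 363 of file "
        "/usr/src/cluster/gfs-kernel/src/dlm/lock.c",
        'lock_dlm: assertion: "!error"',
        "corey0: num=2,19 err=-22 cur=3 req=0 lkf=4",
    ]
    print(summarize(sample))
    print(cman_probably_down(sample))
```

If the excerpt starts at the assertion itself, with no earlier messages, the heuristic has nothing to go on and returns False; in that case the cman log messages on the other nodes are the place to look.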
I am currently experiencing the same problem. I have a 6 node GFS cluster that exports NFS, and one of the nodes died. This is what I found in /var/log/messages:

Jul 28 09:34:35 jabbah kernel: lock_dlm: Assertion failed on line 428 of file /usr/src/build/762247-x86_64/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
Jul 28 09:34:35 jabbah kernel: lock_dlm: assertion: "!error"
Jul 28 09:34:35 jabbah kernel: lock_dlm: time = 4574546183
Jul 28 09:34:35 jabbah kernel: gfs_mail: num=2,1f26f220 err=-22 cur=3 req=5 lkf=44
Jul 28 09:34:35 jabbah kernel:
Jul 28 09:34:35 jabbah kernel: ----------- [cut here ] --------- [please bite here ] ---------
Jul 28 09:34:35 jabbah kernel: Kernel BUG at lock:428
Jul 28 09:34:35 jabbah kernel: invalid operand: 0000 [1] SMP

I have read through the postings and I cannot figure out what I should do to solve this. How can I prevent this from happening again?