Bug 127008

Summary: assertion failure in dlm/lock.c "!error"
Product: [Retired] Red Hat Cluster Suite
Component: gfs
Version: 4
Hardware: i686
OS: Linux
Severity: medium
Priority: medium
Status: CLOSED NOTABUG
Reporter: Corey Marthaler <cmarthal>
Assignee: David Teigland <teigland>
QA Contact: Cluster QE <mspqa-list>
CC: ccaulfie, kagiso, kanderso
Doc Type: Bug Fix
Last Closed: 2005-11-15 16:48:58 UTC

Attachments:
log dump from cypher-01

Description Corey Marthaler 2004-06-30 16:19:42 UTC
 
Description of problem: 
I had a healthy 6-node cluster (morph-01 - morph-06) running I/O to
one GFS filesystem. I then shot morph-04. This caused bugs 126526
and 126604 on morph-02 and morph-05, and caused morph-06 to trip this
assert.
 
SM: send_nodeid_message error -107 to 5 
SM: send_nodeid_message error -107 to 2 
SM: 00000000 sm_stop: SG still joined 
SM: 01000002 sm_stop: SG still joined 
SM: 02000004 sm_stop: SG still joined 
61 3819 w 1 
ex plock 3819 error 0 
en punlock 3817 7,1a007aa4 
remove 7,1a007aa4 3817 
ex punlock 3817 error 0 
en plock 3817 7,1a007aa4 
req 7,1a007aa4 ex 17a8c5c-2bd706b 3817 w 1 
ex plock 3817 error 0 
en punlock 3819 7,1a007aa6 
remove 7,1a007aa6 3819 
ex punlock 3819 error 0 
en plock 3819 7,1a007aa6 
req 7,1a007aa6 ex 2513761-2dcbff4 3819 w 1 
ex plock 3819 error 0 
en punlock 3819 7,1a007aa6 
remove 7,1a007aa6 3819 
ex punlock 3819 error 0 
en plock 3819 7,1a007aa6 
req 7,1a007aa6 ex 2dcbff4-2ee613d 3819 w 1 
ex plock 3819 error 0 
en punlock 3819 7,1a007aa6 
remove 7,1a007aa6 3819 
ex punlock 3819 error 0 
en plock 3819 7,1a007aa6 
req 7,1a007aa6 ex 30d38df-30d3ff9 3819 w 1 
ex plock 3819 error 0 
en punlock 3819 7,1a007aa6 
remove 7,1a007aa6 3819 
ex punlock 3819 error 0 
en plock 3819 7,1a007aa6 
req 7,1a007aa6 ex 0-2c6b988 3819 w 1 
ex plock 3819 error 0 
en punlock 3819 7,1a007aa6 
en punlock 3817 7,1a007aa4 
start c 5 type 1 e 8 
cb_need_recovery jid 3 
recovery_done jid 3 msg 309 
recovery_done 3,6 f 1b 
recovery_done start_done 8 
 
 lock_dlm:  Assertion failed on line 363 of file 
/usr/src/cluster/gfs-kernel/src/dlm/lock.c 
lock_dlm:  assertion:  "!error" 
lock_dlm:  time = 515259 
corey0: num=2,19 err=-22 cur=3 req=0 lkf=4 
 
Kernel panic: lock_dlm:  Record message above and reboot. 
 
 
 
How reproducible: 
Didn't try
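
For reference, the assertion block in the log above comes from a macro in
lock_dlm (gfs-kernel's dlm/lock.c) that panics whenever a dlm call returns a
nonzero error. Below is a minimal, self-contained sketch of that pattern; the
struct, macro, and field names are illustrative stand-ins, not the actual
kernel source.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the lock state held by lock_dlm; the fields
 * mirror the logged line "num=2,19 err=-22 cur=3 req=0 lkf=4". */
struct sample_lock {
	unsigned int type;      /* lock type ("num=2,...") */
	unsigned long number;   /* lock number ("...,19", hex) */
	int cur;                /* currently granted mode */
	int req;                /* requested mode */
	unsigned int lkf;       /* dlm lock flags */
};

/* Illustrative assertion macro: on failure, print the diagnostic block
 * seen above and halt (the real module panics the kernel instead). */
#define SAMPLE_ASSERT(x, lp, error)                                          \
do {                                                                         \
	if (!(x)) {                                                          \
		printf("lock_dlm:  Assertion failed on line %d of file %s\n" \
		       "lock_dlm:  assertion:  \"%s\"\n",                    \
		       __LINE__, __FILE__, #x);                              \
		printf("num=%u,%lx err=%d cur=%d req=%d lkf=%x\n",           \
		       (lp)->type, (lp)->number, error,                      \
		       (lp)->cur, (lp)->req, (lp)->lkf);                     \
		abort();                                                     \
	}                                                                    \
} while (0)

int main(void)
{
	struct sample_lock lk = { 2, 0x19, 3, 0, 0x4 };
	int error = -22;        /* -EINVAL, as reported above */

	/* In the module, 'error' would be the return value of a dlm call. */
	SAMPLE_ASSERT(!error, &lk, error);
	return 0;
}

With the values from the trace (num=2,19 err=-22 cur=3 req=0 lkf=4), the
"!error" check fails immediately, which is what produces the panic message.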

Comment 1 David Teigland 2004-08-19 03:57:21 UTC
There has been a ton of testing and fixes since this was reported and we've
not seen it again.  We should retry to be sure, but it's probably gone.

Comment 2 David Teigland 2004-08-19 04:41:04 UTC
This is a duplicate of bug 127839, which I've just reproduced.

*** This bug has been marked as a duplicate of 127839 ***

Comment 3 Kiersten (Kerri) Anderson 2004-11-16 19:06:23 UTC
Updating the version to the right level in the defects.  Sorry for the storm.

Comment 4 Ben Marzinski 2005-05-09 19:11:53 UTC
Created attachment 114177 [details]
log dump from cypher-01

Comment 5 Ben Marzinski 2005-05-09 19:13:06 UTC
I've just seen something that looks like this bug. Check out the attachment for
details.

Comment 6 Ben Marzinski 2005-05-09 20:48:32 UTC
Running with 10 filesystems, I was getting this bug reliably after one or two
rounds of revolver. After knocking the number of filesystems down to 5, it
seems to have gone away.

Comment 7 David Teigland 2005-05-10 02:32:15 UTC
It appears that cman has shut down on this node, evident from all the ENOTCONN
and ENOBUFS errors the threads start getting in the dlm.  When cman shuts down,
it tells the dlm to shut down, which means all the dlm locks go away; so when
lock_dlm tries to convert one of its locks, the lock isn't there, an error
is returned, and lock_dlm panics.
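
To make that chain concrete, here is a rough sketch of the conversion path
being described, under the assumption that the lockspace has already been torn
down; request_conversion() and lockspace_alive are hypothetical names, not the
gfs-kernel API.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static bool lockspace_alive = true;   /* cleared once cman tells the dlm to shut down */

/* Hypothetical stand-in for submitting a dlm conversion request. */
static int request_conversion(unsigned int lkid, int requested_mode)
{
	if (!lockspace_alive)
		return -EINVAL;       /* the lock no longer exists */
	/* ... otherwise queue the conversion and return 0 ... */
	return 0;
}

static void convert_lock(unsigned int lkid, int requested_mode)
{
	int error = request_conversion(lkid, requested_mode);

	/* lock_dlm treats any error here as fatal ("!error"): */
	if (error) {
		fprintf(stderr, "conversion failed: %d\n", error);
		abort();              /* the module panics the kernel instead */
	}
}

int main(void)
{
	convert_lock(1, 0);           /* succeeds while the lockspace is up */
	lockspace_alive = false;      /* cman shutdown takes the lockspace away */
	convert_lock(1, 0);           /* now fails and trips the assertion */
	return 0;
}

The point is that lock_dlm has no recovery path for a failed conversion; any
nonzero return from the dlm at that point is treated as fatal.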

It's not always clear when cman shuts down, but you can use kdb to look for
the normal cman threads -- see if they exist and, if they do, check what
they're doing.  You can also look for cman log messages on the different nodes.


Comment 8 Kagiso Modise 2006-07-28 11:59:02 UTC
I am currently experiencing the same problem. I have a 6-node GFS cluster that
exports NFS; one of the nodes died, and this is what I found in
/var/log/messages.


Jul 28 09:34:35 jabbah kernel: lock_dlm:  Assertion failed on line 428 of file
/usr/src/build/762247-x86_64/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
Jul 28 09:34:35 jabbah kernel: lock_dlm:  assertion:  "!error"
Jul 28 09:34:35 jabbah kernel: lock_dlm:  time = 4574546183
Jul 28 09:34:35 jabbah kernel: gfs_mail: num=2,1f26f220 err=-22 cur=3 req=5 lkf=44
Jul 28 09:34:35 jabbah kernel:
Jul 28 09:34:35 jabbah kernel: ----------- [cut here ] --------- [please bite
here ] ---------
Jul 28 09:34:35 jabbah kernel: Kernel BUG at lock:428
Jul 28 09:34:35 jabbah kernel: invalid operand: 0000 [1] SMP

I have read through the posting and I cannot figure out what I should do to
solve this.

How can I prevent this from happening again?