127008 – assertion failure in dlm/lock.c "!error"

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	gfs
Sub Component:
Version:	4
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	David Teigland
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-06-30 16:19 UTC by Corey Marthaler
Modified:	2010-01-12 02:53 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-11-15 16:48:58 UTC
Embargoed:

Attachments
log dump from cypher-01 (50.93 KB, text/plain) 2005-05-09 19:11 UTC, Ben Marzinski

Description Corey Marthaler 2004-06-30 16:19:42 UTC

From Bugzilla Helper: 
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.1; Linux) 
 
Description of problem: 
I had a healthy 6-node cluster (morph-01 - morph-06) running I/O to
one GFS filesystem. I then shot morph-04. This caused bugs 126526
and 126604 on morph-02 and morph-05 and caused morph-06 to trip this
assert.
 
SM: send_nodeid_message error -107 to 5 
SM: send_nodeid_message error -107 to 2 
SM: 00000000 sm_stop: SG still joined 
SM: 01000002 sm_stop: SG still joined 
SM: 02000004 sm_stop: SG still joined 
61 3819 w 1 
ex plock 3819 error 0 
en punlock 3817 7,1a007aa4 
remove 7,1a007aa4 3817 
ex punlock 3817 error 0 
en plock 3817 7,1a007aa4 
req 7,1a007aa4 ex 17a8c5c-2bd706b 3817 w 1 
ex plock 3817 error 0 
en punlock 3819 7,1a007aa6 
remove 7,1a007aa6 3819 
ex punlock 3819 error 0 
en plock 3819 7,1a007aa6 
req 7,1a007aa6 ex 2513761-2dcbff4 3819 w 1 
ex plock 3819 error 0 
en punlock 3819 7,1a007aa6 
remove 7,1a007aa6 3819 
ex punlock 3819 error 0 
en plock 3819 7,1a007aa6 
req 7,1a007aa6 ex 2dcbff4-2ee613d 3819 w 1 
ex plock 3819 error 0 
en punlock 3819 7,1a007aa6 
remove 7,1a007aa6 3819 
ex punlock 3819 error 0 
en plock 3819 7,1a007aa6 
req 7,1a007aa6 ex 30d38df-30d3ff9 3819 w 1 
ex plock 3819 error 0 
en punlock 3819 7,1a007aa6 
remove 7,1a007aa6 3819 
ex punlock 3819 error 0 
en plock 3819 7,1a007aa6 
req 7,1a007aa6 ex 0-2c6b988 3819 w 1 
ex plock 3819 error 0 
en punlock 3819 7,1a007aa6 
en punlock 3817 7,1a007aa4 
start c 5 type 1 e 8 
cb_need_recovery jid 3 
recovery_done jid 3 msg 309 
recovery_done 3,6 f 1b 
recovery_done start_done 8 
 
 lock_dlm:  Assertion failed on line 363 of file /usr/src/cluster/gfs-kernel/src/dlm/lock.c
lock_dlm:  assertion:  "!error" 
lock_dlm:  time = 515259 
corey0: num=2,19 err=-22 cur=3 req=0 lkf=4 
 
Kernel panic: lock_dlm:  Record message above and reboot. 
 
 
 
How reproducible: 
Didn't try

Comment 1 David Teigland 2004-08-19 03:57:21 UTC

a ton of testing and fixes since this was reported and we've not seen
it again.  we should retry to be sure, but it's probably gone.

Comment 2 David Teigland 2004-08-19 04:41:04 UTC

this is a duplicate of 127839 which I've just reproduced

*** This bug has been marked as a duplicate of 127839 ***

Comment 3 Kiersten (Kerri) Anderson 2004-11-16 19:06:23 UTC

Updating version to the right level in the defects.  Sorry for the storm.

Comment 4 Ben Marzinski 2005-05-09 19:11:53 UTC

Created attachment 114177
log dump from cypher-01

Comment 5 Ben Marzinski 2005-05-09 19:13:06 UTC

I've just seen something that looks like this bug. Check out the attachment for
details.

Comment 6 Ben Marzinski 2005-05-09 20:48:32 UTC

Running with 10 filesystems, I was getting this bug reliably after one or two
rounds of revolver. After knocking the number down to 5, it seems to have gone away.

Comment 7 David Teigland 2005-05-10 02:32:15 UTC

It appears that cman has shut down on this node, as is evident from all the
ENOTCONN and ENOBUFS errors the threads start getting in the dlm.  When cman
shuts down, it tells the dlm to shut down, which means all the dlm locks go
away.  When lock_dlm then tries to convert one of its locks, the lock isn't
there, an error is returned and lock_dlm panics.
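
The sequence described above can be sketched as a small toy model. This is only an illustration of the reasoning, not the dlm or lock_dlm code: `ToyLockspace`, `toy_lock_dlm_convert` and all method names are hypothetical, and the -EINVAL return value is assumed from the `err=-22` shown in the assertion output of this report.

```python
import errno


class ToyLockspace:
    """Hypothetical stand-in for a dlm lockspace; not the real dlm API."""

    def __init__(self):
        self.locks = {}  # lock name -> current mode (opaque numbers here)

    def acquire(self, name, mode):
        self.locks[name] = mode
        return 0

    def shutdown(self):
        # Models cman telling the dlm to shut down: all locks go away.
        self.locks.clear()

    def convert(self, name, new_mode):
        # A conversion needs an existing lock; if it is gone, return an error.
        # -EINVAL is an assumption based on "err=-22" in the report.
        if name not in self.locks:
            return -errno.EINVAL
        self.locks[name] = new_mode
        return 0


def toy_lock_dlm_convert(lockspace, name, new_mode):
    """Models a caller that treats any conversion error as fatal."""
    error = lockspace.convert(name, new_mode)
    assert not error, f"num={name} err={error}"
    return error


if __name__ == "__main__":
    ls = ToyLockspace()
    ls.acquire("2,19", 3)
    toy_lock_dlm_convert(ls, "2,19", 0)      # lock exists, conversion succeeds
    ls.shutdown()                            # cluster manager went away
    try:
        toy_lock_dlm_convert(ls, "2,19", 0)  # lock is gone, assertion trips
    except AssertionError as exc:
        print("assertion failed:", exc)
```

In the toy model the assertion is only a symptom: the conversion fails because the locks were already discarded by the shutdown, which matches the explanation that the real problem is cman going down on the node.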

It's not always clear when cman shuts down, but you can use kdb to look for
the normal cman threads -- see if they exist and, if they do, check what they're
doing.  You can also look for cman log messages on the different nodes.
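
When going through the logs from the different nodes, it can help to pull out the error codes mechanically. The following is a rough sketch, assuming the numbers in the `SM: send_nodeid_message error` and `err=` fields are negated Linux errno values (which is consistent with -107 being ENOTCONN). The function names and the small errno table are made up for this example and the patterns only cover the line formats quoted in this report.

```python
import re

# Linux errno values relevant to this report. They are fixed here rather than
# taken from the errno module so the result does not depend on the host OS.
LINUX_ERRNO = {22: "EINVAL", 105: "ENOBUFS", 107: "ENOTCONN"}

# Matches e.g. "SM: send_nodeid_message error -107 to 5"
SM_SEND_RE = re.compile(r"SM: send_nodeid_message error (-?\d+) to (\d+)")
# Matches e.g. "corey0: num=2,19 err=-22 cur=3 req=0 lkf=4"
ASSERT_RE = re.compile(r"(\S+): num=(\S+) err=(-?\d+)")


def errno_name(code):
    """Map a (possibly negated) error code to a symbolic name if known."""
    return LINUX_ERRNO.get(abs(code), f"errno {abs(code)}")


def summarize(lines):
    """Return (kind, errno name, detail) tuples for the lines of interest."""
    findings = []
    for line in lines:
        m = SM_SEND_RE.search(line)
        if m:
            # detail is the node id the message could not be sent to
            findings.append(("sm_send", errno_name(int(m.group(1))), int(m.group(2))))
            continue
        m = ASSERT_RE.search(line)
        if m:
            # detail is the name printed before "num=" (the filesystem here)
            findings.append(("lock_dlm_assert", errno_name(int(m.group(3))), m.group(1)))
    return findings


if __name__ == "__main__":
    sample = [
        "SM: send_nodeid_message error -107 to 5",
        "corey0: num=2,19 err=-22 cur=3 req=0 lkf=4",
    ]
    for finding in summarize(sample):
        print(finding)
```

ENOTCONN send errors appearing shortly before the lock_dlm assertion on the same node would fit the pattern described in this comment.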

Comment 8 Kagiso Modise 2006-07-28 11:59:02 UTC

I am currently experiencing the same problem. I have a 6-node GFS cluster that
exports NFS. One of the nodes died, and this is what I found in
/var/log/messages.


Jul 28 09:34:35 jabbah kernel: lock_dlm:  Assertion failed on line 428 of file /usr/src/build/762247-x86_64/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
Jul 28 09:34:35 jabbah kernel: lock_dlm:  assertion:  "!error"
Jul 28 09:34:35 jabbah kernel: lock_dlm:  time = 4574546183
Jul 28 09:34:35 jabbah kernel: gfs_mail: num=2,1f26f220 err=-22 cur=3 req=5 lkf=44
Jul 28 09:34:35 jabbah kernel:
Jul 28 09:34:35 jabbah kernel: ----------- [cut here ] --------- [please bite here ] ---------
Jul 28 09:34:35 jabbah kernel: Kernel BUG at lock:428
Jul 28 09:34:35 jabbah kernel: invalid operand: 0000 [1] SMP

I have read through the postings and I cannot figure out what I should do to
solve this.

How can I prevent this from happening again?
