Bug 129151

Summary: DLM assertion while running doio/iogen: "rsb->res_nodeid == -1 || rsb->res_nodeid == 0"
Product: [Retired] Red Hat Cluster Suite
Component: gfs
Version: 4
Hardware: i686
OS: Linux
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Reporter: Corey Marthaler <cmarthal>
Assignee: David Teigland <teigland>
QA Contact: Cluster QE <mspqa-list>
CC: ccaulfie
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2006-02-21 19:04:56 UTC

Description Corey Marthaler 2004-08-04 14:43:51 UTC
Description of problem:
I started I/O on a six-node cluster (morph-01 through morph-06) last night,
and within minutes an assertion in DLM was tripped.

/dev/mapper/gfs-lvol0 on /mnt/gfs0 type gfs (rw)
/dev/mapper/gfs-lvol1 on /mnt/gfs1 type gfs (rw)
/dev/mapper/gfs-lvol2 on /mnt/gfs2 type gfs (rw)
/dev/mapper/gfs-lvol3 on /mnt/gfs3 type gfs (rw)

I/O cmdlines:
./iogen -f buffered -m sequential -s read,write,readv,writev -t 1b -T 4b 4b:/mnt/gfs0/rwbufsmall | ./doio -n 2 -avk &

./iogen -f buffered -m sequential -s read,write,readv,writev -t 1000b -T 4000b 4000b:/mnt/gfs1/rwbufsmall | ./doio -n 2 -avk &


messages:
grant queue
000100a7 gr 1 rq -1 flg 0 sts 2 node 2 remid 1032f lq 0,1
name
"5EJAi4TSHeMwUzSD5Ht0Qv5A06bd2z3GY5e6jm6vlbxL33eH7g9gAM8PudrZj5Uf"
flags 0 nodeid 2 ref 1
grant queue
00010178 gr 1 rq -1 flg 0 sts 2 node 2 remid 201cc lq 0,1
    11         688554d"
gfs0 un 10309 ref 2 flg 0 nodeid 1/1 "       7         688554d"
gfs0 send un 10309 to 1
gfs0 un 10309 to 1 rsb nodeid 1
gfs0 rq 5 103c1 "       2         688554d"
gfs0 send lu 103c1 to 2
gfs0 un 1018e ref 1 flg 4 nodeid 0/-1 "      11         688554d
gfs0 lu rep 103c1 fr 2 1
gfs0 send rq 103c1 to 1
gfs0 rq 5 10087 "      11         688554d"
gfs0 send rq 10087 to 2
gfs0 rq rep 10087 fr 2 einval
gfs0 send rq 10087 to 2
gfs0 rq rep 10087 fr 2 einval
gfs0 un 2019d ref 1 flg 0 nodeid 1/-1 "       7         688554d
gfs0 send un 2019d to 1
gfs0 un 10087 ref 1 flg 4 nodeid 0/-1 "      11         688554d
gfs0 rq 5 10170 "      11         688554d"
gfs0 rq 5 103b0 "       7         688554d"
gfs0 send lu 103b0 to 2
gfs0 un 10170 ref 1 flg 4 nodeid 0/-1 "      11         688554d
gfs0 rq 5 103bd "      11         688554d"
gfs0 rq 5 10288 "       7         688554d"
gfs0 send lu 10288 to 2
gfs0 un 103bd ref 1 flg 4 nodeid 0/-1 "      11         688554d
gfs0 lu rep 103b0 fr 2 1
gfs0 send rq 103b0 to 1

DLM:  Assertion failed on line 328 of file /usr/src/cluster/dlm-kernel/src/lockqueue.c
DLM:  assertion:  "rsb->res_nodeid == -1 || rsb->res_nodeid == 0"
DLM:  time = 834177
dlm: lkb
id 10288
remid 0
flags 0
status 0
rqmode 5
grmode -1
nodeid 4294967295
lqstate 0
lqflags 0
dlm: rsb
name "       7         688554d"
nodeid 1
ref 2
dlm: reply
rh_cmd 5
rh_lkid 10288
lockstate 0
nodeid 1
status 0
lkid f5e0be88

------------[ cut here ]------------
kernel BUG at /usr/src/cluster/dlm-kernel/src/lockqueue.c:328!
invalid operand: 0000 [#1]
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs
lock_harness ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg
microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3
jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<f8a76b65>]    Not tainted
EFLAGS: 00010286   (2.6.7)
EIP is at process_lockqueue_reply+0x565/0x690 [dlm]
eax: 00000001   ebx: 00000001   ecx: 00000000   edx: f6a6be0c
esi: f6880810   edi: f6a6bef4   ebp: f6883618   esp: f6a6be08
ds: 007b   es: 007b   ss: 0068
Process dlm_recvd (pid: 3716, threadinfo=f6a6a000 task=f6a945b0)
Stack: f8a847f0 00000148 f8a8593c f8a85afc 000cba81 f7d65538 00000002
00000000
       f6880810 f7d65538 f6a6bef4 00000000 f8a77c38 f6a6a000 00000001
00000000
       0000003c f6a7eee0 0000003c f6a7edcc 00000000 00000000 f68837d4
00000002
Call Trace:
 [<f8a77c38>] process_cluster_request+0x6e8/0xd30 [dlm]
 [<c02b0e98>] inet_recvmsg+0x48/0x70
 [<c026bd0c>] sock_recvmsg+0xbc/0xc0
 [<f8a7b8c3>] midcomms_process_incoming_buffer+0x173/0x250 [dlm]
 [<c0105e6c>] common_interrupt+0x18/0x20
 [<c026bd0c>] sock_recvmsg+0xbc/0xc0
 [<f8a795e1>] receive_from_sock+0x141/0x300 [dlm]
 [<c0117e67>] recalc_task_prio+0x97/0x190
 [<f8a7a48b>] process_sockets+0x7b/0xa0 [dlm]
 [<f8a7a6fe>] dlm_recvd+0x9e/0xf0 [dlm]
 [<f8a7a660>] dlm_recvd+0x0/0xf0 [dlm]
 [<c010429d>] kernel_thread_helper+0x5/0x18

Code: 0f 0b 48 01 3c 59 a8 f8 e9 95 fb ff ff e8 f9 f1 ff ff e8 c4
 dlm: gfs0: un 103b0 to 1 rsb nodeid 1


How reproducible:
Didn't try

Comment 1 Corey Marthaler 2004-08-04 14:52:14 UTC

*** This bug has been marked as a duplicate of 128679 ***

Comment 2 Kiersten (Kerri) Anderson 2004-11-16 19:03:50 UTC
Updating version to the right level in the defects.  Sorry for the storm.

Comment 3 Red Hat Bugzilla 2006-02-21 19:04:56 UTC
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.