Bug 172944
| Field | Value |
|---|---|
| Summary | unlocking lock that doesn't exist |
| Product | [Retired] Red Hat Cluster Suite |
| Component | dlm |
| Version | 4 |
| Hardware | i686 |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | medium |
| Reporter | Alexander Laamanen <alexander.laamanen> |
| Assignee | David Teigland <teigland> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | ccaulfie, cluster-maint, jbacik, lenny |
| Doc Type | Bug Fix |
| Last Closed | 2006-09-22 19:10:23 UTC |
Description (Alexander Laamanen, 2005-11-11 15:23:51 UTC)
Could you possibly get the log messages going further back? One thing I'm looking for is cman messages. All these errors about "einval" and "no id" are often a sign that the dlm has been shut down because cman has lost connection to the cluster.

---

Here is some additional log. The next line in the log would be the one posted above.

```
Nov 3 12:30:09 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 4a2e0139 state 0
Nov 3 16:04:10 p1-is04 sshd(pam_unix)[1256]: session opened for user root by root(uid=0)
Nov 3 17:16:26 p1-is04 sshd(pam_unix)[20821]: session opened for user root by root(uid=0)
Nov 4 08:48:57 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 748e033d state 0
Nov 4 17:02:59 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 92c80101 state 0
Nov 6 04:02:03 p1-is04 syslogd 1.4.1: restart.
Nov 6 13:52:18 p1-is04 sshd(pam_unix)[21878]: session opened for user root by root(uid=0)
Nov 6 14:05:02 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 14a00022 state 0
Nov 6 15:26:43 p1-is04 sshd(pam_unix)[2553]: session opened for user root by root(uid=0)
Nov 7 13:55:25 p1-is04 sshd(pam_unix)[28155]: session opened for user root by root(uid=0)
Nov 8 06:52:15 p1-is04 sshd(pam_unix)[31921]: session opened for user root by root(uid=0)
Nov 9 06:28:17 p1-is04 sshd(pam_unix)[23366]: session opened for user root by root(uid=0)
Nov 9 13:59:44 p1-is04 sshd(pam_unix)[22949]: session opened for user root by root(uid=0)
Nov 10 14:15:09 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 94080216 state 0
Nov 10 14:15:09 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 915d0266 state 0
Nov 10 16:19:33 p1-is04 sshd(pam_unix)[13302]: session opened for user root by (uid=0)
Nov 10 16:21:52 p1-is04 sshd(pam_unix)[14034]: session opened for user root by root(uid=0)
Nov 10 16:41:27 p1-is04 sshd(pam_unix)[19844]: session opened for user root by root(uid=0)
Nov 10 17:49:21 p1-is04 sshd(pam_unix)[6304]: session opened for user root by root(uid=0)
Nov 10 17:50:31 p1-is04 sshd(pam_unix)[6646]: session opened for user root by root(uid=0)
Nov 11 08:11:53 p1-is04 sshd(pam_unix)[30479]: session opened for user root by root(uid=0)
```

---

I got the same panic on a different node:

```
Nov 24 09:10:03 p1-is03 kernel: dlm: mail: process_lockqueue_reply id bb8402cb state 0
Nov 24 10:50:02 p1-is03 kernel: dlm: mail: process_lockqueue_reply id bf7503fa state 0
Nov 24 17:43:23 p1-is03 kernel: 0,1,0 ids 0,2,0
Nov 24 17:43:23 p1-is03 kernel: clvmd move use event 2
Nov 24 17:43:23 p1-is03 kernel: clvmd recover event 2 (first)
Nov 24 17:43:23 p1-is03 kernel: clvmd add nodes
Nov 24 17:43:23 p1-is03 kernel: clvmd total nodes 2
Nov 24 17:43:23 p1-is03 kernel: clvmd rebuild resource directory
Nov 24 17:43:23 p1-is03 kernel: clvmd rebuilt 1 resources
Nov 24 17:43:23 p1-is03 kernel: clvmd recover event 2 done
Nov 24 17:43:23 p1-is03 kernel: clvmd move flags 0,0,1 ids 0,2,2
Nov 24 17:43:23 p1-is03 kernel: clvmd process held requests
Nov 24 17:43:23 p1-is03 kernel: clvmd processed 0 requests
Nov 24 17:43:23 p1-is03 kernel: clvmd recover event 2 finished
Nov 24 17:43:23 p1-is03 kernel: mail move flags 0,1,0 ids 0,3,0
Nov 24 17:43:23 p1-is03 kernel: mail move use event 3
Nov 24 17:43:23 p1-is03 kernel: mail recover event 3 (first)
Nov 24 17:43:23 p1-is03 kernel: mail add nodes
Nov 24 17:43:23 p1-is03 kernel: mail total nodes 2
Nov 24 17:43:23 p1-is03 kernel: mail rebuild resource directory
Nov 24 17:43:23 p1-is03 kernel: mail rebuilt 89450 resources
Nov 24 17:43:23 p1-is03 kernel: mail recover event 3 done
Nov 24 17:43:23 p1-is03 kernel: mail move flags 0,0,1 ids 0,3,3
Nov 24 17:43:23 p1-is03 kernel: mail process held requests
Nov 24 17:43:23 p1-is03 kernel: mail processed 0 requests
Nov 24 17:43:23 p1-is03 kernel: mail recover event 3 finished
Nov 24 17:43:23 p1-is03 kernel: mail grant lock on lockqueue 3
Nov 24 17:43:23 p1-is03 kernel: mail process_lockqueue_reply id 3f190257 state 0
Nov 24 17:43:23 p1-is03 kernel: mail (5357) req reply einval 7c59008a fr 1 r 1 5
Nov 24 17:43:23 p1-is03 kernel: mail send einval to 1
Nov 24 17:43:23 p1-is03 kernel: mail grant lock on lockqueue 2
Nov 24 17:43:23 p1-is03 kernel: mail process_lockqueue_reply id bb8402cb state 0
Nov 24 17:43:23 p1-is03 kernel: mail grant lock on lockqueue 3
Nov 24 17:43:23 p1-is03 kernel: mail process_lockqueue_reply id bf7503fa state 0
Nov 24 17:43:23 p1-is03 kernel: mail send einval to 1
Nov 24 17:43:23 p1-is03 kernel: mail send einval to 1
Nov 24 17:43:23 p1-is03 kernel: mail unlock 590077 no id
Nov 24 17:43:23 p1-is03 kernel: 2885 pr_start last_stop 0 last_start 4 last_finish 0
Nov 24 17:43:23 p1-is03 kernel: 2885 pr_start count 2 type 2 event 4 flags 250
Nov 24 17:43:23 p1-is03 kernel: 2885 claim_jid 1
Nov 24 17:43:23 p1-is03 kernel: 2885 pr_start 4 done 1
Nov 24 17:43:23 p1-is03 kernel: 2885 pr_finish flags 5a
Nov 24 17:43:23 p1-is03 kernel: 2876 recovery_done jid 1 msg 309 a
Nov 24 17:43:23 p1-is03 kernel: 2876 recovery_done nodeid 2 flg 18
Nov 24 17:43:23 p1-is03 kernel:
Nov 24 17:43:23 p1-is03 kernel: lock_dlm: Assertion failed on line 357 of file /home/klaamane/NGMPLAT/packages/gfs/target/BUILD/gfs-kernel-2.6.9-42/smp/src/dlm/lock.c
Nov 24 17:43:23 p1-is03 kernel: lock_dlm: assertion: "!error"
Nov 24 17:43:23 p1-is03 kernel: lock_dlm: time = 120394931
Nov 24 17:43:23 p1-is03 kernel: mail: error=-22 num=5,e3d63a1 lkf=801 flags=84
Nov 24 17:43:23 p1-is03 kernel:
Nov 24 17:43:24 p1-is03 kernel: ------------[ cut here ]------------
Nov 24 17:43:24 p1-is03 kernel: kernel BUG at /home/klaamane/NGMPLAT/packages/gfs/target/BUILD/gfs-kernel-2.6.9-42/smp/src/dlm/lock.c:357!
Nov 24 17:43:24 p1-is03 kernel: invalid operand: 0000 [#1]
Nov 24 17:43:24 p1-is03 kernel: SMP
Nov 24 17:43:24 p1-is03 kernel: Modules linked in: ip_vs i2o_config autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) qla2300 sg md5 ipv6 ipmi_si ipmi_devintf ipmi_msghandler iptable_filter ipt_REDIRECT iptable_nat ip_conntrack ip_tables dm_mirror button battery ac uhci_hcd ehci_hcd shpchp e1000 bonding(U) floppy ext3 jbd dm_mod i2o_block i2o_core qla2xxx scsi_transport_fc sd_mod scsi_mod
Nov 24 17:43:24 p1-is03 kernel: CPU: 3
Nov 24 17:43:24 p1-is03 kernel: EIP: 0060:[<f9c3e5f3>] Not tainted VLI
Nov 24 17:43:24 p1-is03 kernel: EFLAGS: 00010216 (2.6.9-22.0.1.ELsmp)
Nov 24 17:43:24 p1-is03 kernel: EIP is at do_dlm_unlock+0x8b/0xa0 [lock_dlm]
Nov 24 17:43:24 p1-is03 kernel: eax: 00000001 ebx: dc552180 ecx: f608af04 edx: f9c43195
Nov 24 17:43:24 p1-is03 kernel: esi: ffffffea edi: dc552180 ebp: f9d02000 esp: f608af00
Nov 24 17:43:24 p1-is03 kernel: ds: 007b es: 007b ss: 0068
Nov 24 17:43:24 p1-is03 kernel: Process lock_dlm1 (pid: 2884, threadinfo=f608a000 task=f6460230)
Nov 24 17:43:24 p1-is03 kernel: Stack: f9c43195 f9d02000 00000003 f9c3e893 f9d6f53e 00000001 f0388c7c f0388c60
Nov 24 17:43:24 p1-is03 kernel:        f9d65886 f9d9a5c0 cdd57440 f0388c60 f9d9a5c0 dc552180 f9d64d7f f0388c60
Nov 24 17:43:24 p1-is03 kernel:        00000001 00000001 f9d64e36 f0388c7c f0388c60 f9d669e3 dc552180 c385a800
Nov 24 17:43:24 p1-is03 kernel: Call Trace:
Nov 24 17:43:24 p1-is03 kernel:  [<f9c3e893>] lm_dlm_unlock+0x14/0x1c [lock_dlm]
Nov 24 17:43:24 p1-is03 kernel:  [<f9d6f53e>] gfs_lm_unlock+0x2c/0x42 [gfs]
Nov 24 17:43:24 p1-is03 kernel:  [<f9d65886>] gfs_glock_drop_th+0xf3/0x12d [gfs]
Nov 24 17:43:24 p1-is03 kernel:  [<f9d64d7f>] rq_demote+0x7f/0x98 [gfs]
Nov 24 17:43:24 p1-is03 kernel:  [<f9d64e36>] run_queue+0x5a/0xc1 [gfs]
Nov 24 17:43:24 p1-is03 kernel:  [<f9d669e3>] blocking_cb+0x39/0x7a [gfs]
Nov 24 17:43:24 p1-is03 kernel:  [<f9c42323>] process_blocking+0x90/0x93 [lock_dlm]
Nov 24 17:43:24 p1-is03 kernel:  [<f9c42970>] dlm_async+0x28b/0x2ff [lock_dlm]
Nov 24 17:43:24 p1-is03 kernel:  [<c011e447>] default_wake_function+0x0/0xc
Nov 24 17:43:24 p1-is03 kernel:  [<c011e447>] default_wake_function+0x0/0xc
Nov 24 17:43:24 p1-is03 kernel:  [<f9c426e5>] dlm_async+0x0/0x2ff [lock_dlm]
Nov 24 17:43:24 p1-is03 kernel:  [<c01338f1>] kthread+0x73/0x9b
Nov 24 17:43:24 p1-is03 kernel:  [<c013387e>] kthread+0x0/0x9b
Nov 24 17:43:24 p1-is03 kernel:  [<c01041f1>] kernel_thread_helper+0x5/0xb
Nov 24 17:43:24 p1-is03 kernel: Code: 73 34 8b 03 ff 73 2c ff 73 08 ff 73 04 ff 73 0c 56 ff 70 18 68 8d 32 c4 f9 e8 44 3c 4e c6 83 c4 34 68 95 31 c4 f9 e8 37 3c 4e c6 <0f> 0b 65 01 c2 30 c4 f9 68 97 31 c4 f9 e8 f2 33 4e c6 5b 5e c3
Nov 24 17:43:24 p1-is03 kernel: <0>Fatal exception: panic in 5 seconds
```

---

Based on the "unlock 590077 no id" message I suspect that the dlm is being shut down by cman, although I'd expect to see some cman messages to that effect. There are also some new printk's in the dlm that would help confirm that.
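To make the suspected mechanism concrete: an unlock request for a lock ID the lockspace no longer knows (the "unlock 590077 no id" message) returns -EINVAL (-22, matching the error=-22 in the assertion output), and lock_dlm's unconditional "!error" assertion on the unlock return value turns that into a kernel BUG. The following is a minimal user-space toy model of that pattern, not the actual lock_dlm code; all names (toy_unlock, lock_table) are invented for illustration:

```c
/* Toy model of the lock_dlm failure mode: unlocking a lock ID that the
 * lockspace no longer tracks returns -EINVAL, and an unconditional
 * assertion on the unlock result turns that into a fatal error.
 * This is an illustrative sketch, NOT the kernel code from lock.c. */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 8

static unsigned int lock_table[TABLE_SIZE]; /* nonzero entries = valid lock IDs */

/* Release a lock by ID; return 0 on success, -EINVAL if the ID is unknown. */
static int toy_unlock(unsigned int lkid)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (lock_table[i] == lkid) {
            lock_table[i] = 0;
            return 0;
        }
    }
    return -EINVAL; /* the "unlock ... no id" case */
}

int main(void)
{
    lock_table[0] = 0x590077;

    /* Simulate recovery/shutdown tearing down the lockspace behind the
     * caller's back, so the caller's lock ID goes stale. */
    memset(lock_table, 0, sizeof(lock_table));

    int error = toy_unlock(0x590077);
    fprintf(stderr, "unlock returned %d\n", error); /* -22 on Linux */

    /* The moral equivalent of lock_dlm's assertion "!error": any unlock
     * failure, including a stale ID after shutdown, is treated as fatal. */
    assert(!error);
    return 0;
}
```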
---

(In reply to comment #4)
> to see some cman messages to that effect. There are also
> some new printk's in the dlm that would help confirm that.

What version of gfs contains these printk changes? cluster/dlm-kernel/src/ has not changed in 7 months.

---

STABLE and RHEL4 versions of dlm-kernel/src/lockspace.c.

---

We have seen this bug as well by disconnecting the FC switch, as follows:

Description: Disabling one of the FC switches appears to cause lock_dlm assertion errors.

Steps to reproduce:
1. Create or use existing exports.
2. Mount the exports to Linux clients.
3. Run read/write/compare I/O to the exports.
4. While the I/O is in progress, disable all ports on, or power off, one of the FC switches.
5. Check I/O.
6. Check syslog for errors.
7. Re-enable or power on the FC switch.

Expected results:
1. I/O should continue using alternate paths.
2. I/O path errors should be retried and redirected through a different path.
3. I/O should complete without any errors.

Observed:

1. After a minute or so, node 1 asserted as shown below:

```
Node 1:
lock_dlm: Assertion failed on line 357 of file fs/gfs_locking/lock_dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 4307924940
JinVol: error=-22 num=2,1a lkf=10000 flags=84
cwalk_reset_bcm_chip: enter
cwalk_reset_bcm_chip: i=0 ptr=000001007f281380
cwalk_reset_bcm_chip: i=1 ptr=00000100e6848380
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:357
invalid operand: 0000 [1] SMP
Entering kdb (current=0x00000100e31307f0, pid 16132) on processor 0
Oops: <NULL>
due to oops @ 0xffffffffa0239cad
r15 = 0x00000100db871cb0   r14 = 0xffffffffa02357e0
r13 = 0xffffff0000cad000   r12 = 0x00000100db871cdc
rbp = 0x00000000ffffffea   rbx = 0x000001007794a480
r11 = 0x0000000d80391c00   r10 = 0x0000000000000000
r9  = 0x000001007794a480   r8  = 0x000000000000000d
rax = 0x0000000000000001   rcx = 0x0000000000000246
rdx = 0x00000000002a2876   rsi = 0x0000000000000246
rdi = 0xffffffff80375de0   orig_rax = 0xffffffffffffffff
rip = 0xffffffffa0239cad   cs = 0x0000000000000010
eflags = 0x0000000000010216   rsp = 0x00000100db181db0
ss = 0x00000100db180000   &regs = 0x00000100db181d18
[0]kdb> bt
Stack traceback for pid 16132
0x00000100e31307f0 16132 1 1 0 R 0x00000100e3130bf0 *gfs_glockd
RSP           RIP                 Function (args)
0x100db181db0 0xffffffffa0239cad  [lock_dlm]do_dlm_unlock+0xa5 (0x1, 0xffffffffa02088c9, 0x1, 0xffffffffa01ff36a, 0x100db871cb0)
0x100db181df8 0xffffffffa023a02b  [lock_dlm]lm_dlm_unlock+0x15
0x100db181f58 0xffffffff8010fbaf  child_rip+0x8
```

2. After another couple of minutes, node 2 asserted the same way:

```
Node 2:
lock_dlm: Assertion failed on line 357 of file fs/gfs_locking/lock_dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 4307886580
crosswalk: error=-22 num=2,28 lkf=44 flags=84
cwalk_reset_bcm_chip: enter
cwalk_reset_bcm_chip: i=0 ptr=000001007f23b380
cwalk_reset_bcm_chip: i=1 ptr=00000100e6803380
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:357
invalid operand: 0000 [1] SMP
Entering kdb (current=0x000001007619f030, pid 12214) on processor 0
Oops: <NULL>
due to oops @ 0xffffffffa0239cad
r15 = 0x00000100ddc178c8   r14 = 0xffffffffa02357e0
r13 = 0xffffff0000c73000   r12 = 0x00000100ddc178f4
rbp = 0x00000000ffffffea   rbx = 0x00000100e32b3e80
r11 = 0x0000000d80391c00   r10 = 0x0000000000000000
r9  = 0x00000100e32b3e80   r8  = 0x000000000000000d
rax = 0x0000000000000001   rcx = 0x0000000000000246
rdx = 0x0000000000716525   rsi = 0x0000000000000246
rdi = 0xffffffff80375de0   orig_rax = 0xffffffffffffffff
rip = 0xffffffffa0239cad   cs = 0x0000000000000010
eflags = 0x0000000000010216   rsp = 0x00000100758a1db0
ss = 0x00000100758a0000   &regs = 0x00000100758a1d18
[0]kdb> bt
Stack traceback for pid 12214
0x000001007619f030 12214 1 1 0 R 0x000001007619f430 *gfs_glockd
RSP           RIP                 Function (args)
0x100758a1db0 0xffffffffa0239cad  [lock_dlm]do_dlm_unlock+0xa5 (0x2, 0xffffffffa02088c9, 0x1, 0xffffffffa01ff36a, 0x100ddc178c8)
0x100758a1df8 0xffffffffa023a02b  [lock_dlm]lm_dlm_unlock+0x15
0x100758a1f58 0xffffffff8010fbaf  child_rip+0x8
```

---

Any other messages before the assertion failure? If not, the RHEL4U3 update provides some additional debugging output that may be useful here.

---

Reproduced this problem using Update 3. Performed the "disable an active port from an HBA to the FC switch" test cases, with the following results:

* No SCSI 20000 errors were seen.
* There were SCSI 10000 errors.
* A bunch of retried paths were seen, until one of the volumes was corrupted:

```
qla2x00: cwalk_release: releasing module, cnt=1
GFS: Trying to join cluster "lock_dlm", "sqaone:TestVol_01"
GFS: fsid=sqaone:TestVol_01.1: Joined cluster. Now mounting FS...
GFS: fsid=sqaone:TestVol_01.1: jid=1: Trying to acquire journal lock...
GFS: fsid=sqaone:TestVol_01.1: jid=1: Looking at journal...
GFS: fsid=sqaone:TestVol_01.1: jid=1: Done
GFS: Trying to join cluster "lock_dlm", "sqaone:TestVol_02"
GFS: fsid=sqaone:TestVol_02.1: Joined cluster. Now mounting FS...
GFS: fsid=sqaone:TestVol_02.1: jid=1: Trying to acquire journal lock...
GFS: fsid=sqaone:TestVol_02.1: jid=1: Looking at journal...
GFS: fsid=sqaone:TestVol_02.1: jid=1: Done
GFS: fsid=sqaone:TestVol_02.1: fatal: filesystem consistency error
GFS: fsid=sqaone:TestVol_02.1:   inode = 36/36
GFS: fsid=sqaone:TestVol_02.1:   function = gfs_copyin_dinode
GFS: fsid=sqaone:TestVol_02.1:   file = fs/gfs/inode.c, line = 266
GFS: fsid=sqaone:TestVol_02.1:   time = 1142296033
GFS: fsid=sqaone:TestVol_02.1: about to withdraw from the cluster
GFS: fsid=sqaone:TestVol_02.1: waiting for outstanding I/O
GFS: fsid=sqaone:TestVol_02.1: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=sqaone:TestVol_02.1: withdrawn
  mh_magic = 0x01161970
  mh_type = 4
  mh_generation = 49412
  mh_format = 400
  mh_incarn = 0
  no_formal_ino = 52
  no_addr = 52
  di_mode = 03545
  di_uid = 0
  di_gid = 0
  di_nlink = 1
  di_size = 800
  di_blocks = 1
  di_atime = 1142291438
  di_mtime = 1142293185
  di_ctime = 1142293185
  di_major = 0
  di_minor = 0
  di_rgrp = 17
  di_goal_rgrp = 17
  di_goal_dblk = 30
  di_goal_mblk = 30
  di_flags = 0x00000000
  di_payload_format = 0
  di_type = 1
  di_height = 0
  di_incarn = 0
  di_pad = 0
  di_depth = 0
  di_entries = 0
  no_formal_ino = 0
  no_addr = 0
  di_eattr = 0
  di_reserved =
  00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ OK ]
Red Hat Enterprise Linux AS release 4 (Nahant Update 3)
Kernel 2.6.9-34.ELsmp on an x86_64
```

---

Yes, the fs appears to be corrupted (although it has nothing to do with the subject of this bugzilla). If the fs hasn't been destroyed, I'd like to poke around on the disk to gather more clues about what/where/how things got corrupted. I can't come up with any explanation, actually, apart from a bad buffer in memory, for the inconsistency that's printed out above. Does any new node mounting the fs hit this inconsistency right away?

---

Similar thing reported here: https://www.redhat.com/archives/linux-cluster/2006-April/msg00004.html

---

Created attachment 133109: node1 /var/log/messages (node1 log, tar/gz)
Created attachment 133110: node 2 /var/log/messages (node2 logs, tar/gz)

Created attachment 133111: node 3 /var/log/messages (node 3 logs, tar/gz)

---
I attached all the /var/log/messages files for a 3-node cluster running the 2.6.9-34 version of the kernel on all nodes. Note, the pr_start messages preceded node fencings on the following dates/times when looking at the logs (i.e. these are the fence events we care about for this bug):

- node1 fenced node3, June 18, 4:46
- node1 fenced node2, June 29, 8:22
- node2 fenced node1, July 8, 07:57

There are other fence events in there, but with these ones the node that was fenced displayed the pr_start* strings in its messages file before being fenced/rebooted. We suspect we were in kdb on these nodes, and in fact we think, from what we've seen in house, that gfs and dlm functions are in the stack and that dlm is the bottom/last function called. Still verifying; will append a stack if/when we reproduce.

---

Comments 13-16 are not related to the topic of this bz. The pr_start messages are routine output when a node is mounting, or processing another mount/unmount or recovery.

---

We have found that we had an invalid bond mode configuration. We were supposed to be using mode 3 and were using mode 0. This was causing some packet loss over the bond interface(s). Could packet loss have led down this path? (See the configuration sketch at the end of this report.)

---

We have changed the bond to mode 3 and we had this happen again. Looks like it isn't just a bond mode problem.

---

I'm relatively confident this is a duplicate of bug 199673.

---

Do people see this any longer after the fix for bug 199673?

---

I haven't seen this happen after we applied the patch from bug 199673. I'm pretty confident that this issue has been solved.

*** This bug has been marked as a duplicate of 199673 ***
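For context on the bond modes discussed above: in the Linux bonding driver, mode 0 is balance-rr (round-robin across slaves, which can reorder packets and, under path failures, lose them) and mode 3 is broadcast (transmit on all slaves). On RHEL 4 the mode is typically set in /etc/modprobe.conf; a minimal sketch, assuming a bond0 interface and a 100 ms MII monitor interval (both illustrative, not the reporter's actual configuration):

```
# /etc/modprobe.conf (RHEL 4 era) -- illustrative sketch only; the interface
# name and miimon value are assumptions, not taken from this bug report.
alias bond0 bonding
# mode=3 (broadcast) is what the reporter intended to run; mode=0
# (balance-rr) was the misconfiguration that coincided with packet loss
# on the cluster interconnect.
options bond0 mode=3 miimon=100
```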