Bug 172944
| Field | Value |
|---|---|
| Summary | unlocking lock that doesn't exist |
| Product | [Retired] Red Hat Cluster Suite |
| Component | dlm |
| Version | 4 |
| Hardware | i686 |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | medium |
| Reporter | Alexander Laamanen <alexander.laamanen> |
| Assignee | David Teigland <teigland> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | ccaulfie, cluster-maint, jbacik, lenny |
| Doc Type | Bug Fix |
| Last Closed | 2006-09-22 19:10:23 UTC |
Description (Alexander Laamanen, 2005-11-11 15:23:51 UTC)
Could you possibly get the log messages going further back? One thing I'm looking for is cman messages. All these errors about "einval" and "no id" are often a sign that the dlm has been shut down because cman has lost connection to the cluster.

---

Here is some additional log. The next line in the log would be the one posted above.

```
Nov 3 12:30:09 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 4a2e0139 state 0
Nov 3 16:04:10 p1-is04 sshd(pam_unix)[1256]: session opened for user root by root(uid=0)
Nov 3 17:16:26 p1-is04 sshd(pam_unix)[20821]: session opened for user root by root(uid=0)
Nov 4 08:48:57 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 748e033d state 0
Nov 4 17:02:59 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 92c80101 state 0
Nov 6 04:02:03 p1-is04 syslogd 1.4.1: restart.
Nov 6 13:52:18 p1-is04 sshd(pam_unix)[21878]: session opened for user root by root(uid=0)
Nov 6 14:05:02 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 14a00022 state 0
Nov 6 15:26:43 p1-is04 sshd(pam_unix)[2553]: session opened for user root by root(uid=0)
Nov 7 13:55:25 p1-is04 sshd(pam_unix)[28155]: session opened for user root by root(uid=0)
Nov 8 06:52:15 p1-is04 sshd(pam_unix)[31921]: session opened for user root by root(uid=0)
Nov 9 06:28:17 p1-is04 sshd(pam_unix)[23366]: session opened for user root by root(uid=0)
Nov 9 13:59:44 p1-is04 sshd(pam_unix)[22949]: session opened for user root by root(uid=0)
Nov 10 14:15:09 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 94080216 state 0
Nov 10 14:15:09 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 915d0266 state 0
Nov 10 16:19:33 p1-is04 sshd(pam_unix)[13302]: session opened for user root by (uid=0)
Nov 10 16:21:52 p1-is04 sshd(pam_unix)[14034]: session opened for user root by root(uid=0)
Nov 10 16:41:27 p1-is04 sshd(pam_unix)[19844]: session opened for user root by root(uid=0)
Nov 10 17:49:21 p1-is04 sshd(pam_unix)[6304]: session opened for user root by root(uid=0)
Nov 10 17:50:31 p1-is04 sshd(pam_unix)[6646]: session opened for user root by root(uid=0)
Nov 11 08:11:53 p1-is04 sshd(pam_unix)[30479]: session opened for user root by root(uid=0)
```

---

I got the same panic on a different node:

```
Nov 24 09:10:03 p1-is03 kernel: dlm: mail: process_lockqueue_reply id bb8402cb state 0
Nov 24 10:50:02 p1-is03 kernel: dlm: mail: process_lockqueue_reply id bf7503fa state 0
Nov 24 17:43:23 p1-is03 kernel: 0,1,0 ids 0,2,0
Nov 24 17:43:23 p1-is03 kernel: clvmd move use event 2
Nov 24 17:43:23 p1-is03 kernel: clvmd recover event 2 (first)
Nov 24 17:43:23 p1-is03 kernel: clvmd add nodes
Nov 24 17:43:23 p1-is03 kernel: clvmd total nodes 2
Nov 24 17:43:23 p1-is03 kernel: clvmd rebuild resource directory
Nov 24 17:43:23 p1-is03 kernel: clvmd rebuilt 1 resources
Nov 24 17:43:23 p1-is03 kernel: clvmd recover event 2 done
Nov 24 17:43:23 p1-is03 kernel: clvmd move flags 0,0,1 ids 0,2,2
Nov 24 17:43:23 p1-is03 kernel: clvmd process held requests
Nov 24 17:43:23 p1-is03 kernel: clvmd processed 0 requests
Nov 24 17:43:23 p1-is03 kernel: clvmd recover event 2 finished
Nov 24 17:43:23 p1-is03 kernel: mail move flags 0,1,0 ids 0,3,0
Nov 24 17:43:23 p1-is03 kernel: mail move use event 3
Nov 24 17:43:23 p1-is03 kernel: mail recover event 3 (first)
Nov 24 17:43:23 p1-is03 kernel: mail add nodes
Nov 24 17:43:23 p1-is03 kernel: mail total nodes 2
Nov 24 17:43:23 p1-is03 kernel: mail rebuild resource directory
Nov 24 17:43:23 p1-is03 kernel: mail rebuilt 89450 resources
Nov 24 17:43:23 p1-is03 kernel: mail recover event 3 done
Nov 24 17:43:23 p1-is03 kernel: mail move flags 0,0,1 ids 0,3,3
Nov 24 17:43:23 p1-is03 kernel: mail process held requests
Nov 24 17:43:23 p1-is03 kernel: mail processed 0 requests
Nov 24 17:43:23 p1-is03 kernel: mail recover event 3 finished
Nov 24 17:43:23 p1-is03 kernel: mail grant lock on lockqueue 3
Nov 24 17:43:23 p1-is03 kernel: mail process_lockqueue_reply id 3f190257 state 0
Nov 24 17:43:23 p1-is03 kernel: mail (5357) req reply einval 7c59008a fr 1 r 1 5
Nov 24 17:43:23 p1-is03 kernel: mail send einval to 1
Nov 24 17:43:23 p1-is03 kernel: mail grant lock on lockqueue 2
Nov 24 17:43:23 p1-is03 kernel: mail process_lockqueue_reply id bb8402cb state 0
Nov 24 17:43:23 p1-is03 kernel: mail grant lock on lockqueue 3
Nov 24 17:43:23 p1-is03 kernel: mail process_lockqueue_reply id bf7503fa state 0
Nov 24 17:43:23 p1-is03 kernel: mail send einval to 1
Nov 24 17:43:23 p1-is03 kernel: mail send einval to 1
Nov 24 17:43:23 p1-is03 kernel: mail unlock 590077 no id
Nov 24 17:43:23 p1-is03 kernel: 2885 pr_start last_stop 0 last_start 4 last_finish 0
Nov 24 17:43:23 p1-is03 kernel: 2885 pr_start count 2 type 2 event 4 flags 250
Nov 24 17:43:23 p1-is03 kernel: 2885 claim_jid 1
Nov 24 17:43:23 p1-is03 kernel: 2885 pr_start 4 done 1
Nov 24 17:43:23 p1-is03 kernel: 2885 pr_finish flags 5a
Nov 24 17:43:23 p1-is03 kernel: 2876 recovery_done jid 1 msg 309 a
Nov 24 17:43:23 p1-is03 kernel: 2876 recovery_done nodeid 2 flg 18
Nov 24 17:43:23 p1-is03 kernel:
Nov 24 17:43:23 p1-is03 kernel: lock_dlm: Assertion failed on line 357 of file /home/klaamane/NGMPLAT/packages/gfs/target/BUILD/gfs-kernel-2.6.9-42/smp/src/dlm/lock.c
Nov 24 17:43:23 p1-is03 kernel: lock_dlm: assertion: "!error"
Nov 24 17:43:23 p1-is03 kernel: lock_dlm: time = 120394931
Nov 24 17:43:23 p1-is03 kernel: mail: error=-22 num=5,e3d63a1 lkf=801 flags=84
Nov 24 17:43:23 p1-is03 kernel:
Nov 24 17:43:24 p1-is03 kernel: ------------[ cut here ]------------
Nov 24 17:43:24 p1-is03 kernel: kernel BUG at /home/klaamane/NGMPLAT/packages/gfs/target/BUILD/gfs-kernel-2.6.9-42/smp/src/dlm/lock.c:357!
Nov 24 17:43:24 p1-is03 kernel: invalid operand: 0000 [#1]
Nov 24 17:43:24 p1-is03 kernel: SMP
Nov 24 17:43:24 p1-is03 kernel: Modules linked in: ip_vs i2o_config autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) qla2300 sg md5 ipv6 ipmi_si ipmi_devintf ipmi_msghandler iptable_filter ipt_REDIRECT iptable_nat ip_conntrack ip_tables dm_mirror button battery ac uhci_hcd ehci_hcd shpchp e1000 bonding(U) floppy ext3 jbd dm_mod i2o_block i2o_core qla2xxx scsi_transport_fc sd_mod scsi_mod
Nov 24 17:43:24 p1-is03 kernel: CPU: 3
Nov 24 17:43:24 p1-is03 kernel: EIP: 0060:[<f9c3e5f3>] Not tainted VLI
Nov 24 17:43:24 p1-is03 kernel: EFLAGS: 00010216 (2.6.9-22.0.1.ELsmp)
Nov 24 17:43:24 p1-is03 kernel: EIP is at do_dlm_unlock+0x8b/0xa0 [lock_dlm]
Nov 24 17:43:24 p1-is03 kernel: eax: 00000001 ebx: dc552180 ecx: f608af04 edx: f9c43195
Nov 24 17:43:24 p1-is03 kernel: esi: ffffffea edi: dc552180 ebp: f9d02000 esp: f608af00
Nov 24 17:43:24 p1-is03 kernel: ds: 007b es: 007b ss: 0068
Nov 24 17:43:24 p1-is03 kernel: Process lock_dlm1 (pid: 2884, threadinfo=f608a000 task=f6460230)
Nov 24 17:43:24 p1-is03 kernel: Stack: f9c43195 f9d02000 00000003 f9c3e893 f9d6f53e 00000001 f0388c7c f0388c60
Nov 24 17:43:24 p1-is03 kernel:        f9d65886 f9d9a5c0 cdd57440 f0388c60 f9d9a5c0 dc552180 f9d64d7f f0388c60
Nov 24 17:43:24 p1-is03 kernel:        00000001 00000001 f9d64e36 f0388c7c f0388c60 f9d669e3 dc552180 c385a800
Nov 24 17:43:24 p1-is03 kernel: Call Trace:
Nov 24 17:43:24 p1-is03 kernel:  [<f9c3e893>] lm_dlm_unlock+0x14/0x1c [lock_dlm]
Nov 24 17:43:24 p1-is03 kernel:  [<f9d6f53e>] gfs_lm_unlock+0x2c/0x42 [gfs]
Nov 24 17:43:24 p1-is03 kernel:  [<f9d65886>] gfs_glock_drop_th+0xf3/0x12d [gfs]
Nov 24 17:43:24 p1-is03 kernel:  [<f9d64d7f>] rq_demote+0x7f/0x98 [gfs]
Nov 24 17:43:24 p1-is03 kernel:  [<f9d64e36>] run_queue+0x5a/0xc1 [gfs]
Nov 24 17:43:24 p1-is03 kernel:  [<f9d669e3>] blocking_cb+0x39/0x7a [gfs]
Nov 24 17:43:24 p1-is03 kernel:  [<f9c42323>] process_blocking+0x90/0x93 [lock_dlm]
Nov 24 17:43:24 p1-is03 kernel:  [<f9c42970>] dlm_async+0x28b/0x2ff [lock_dlm]
Nov 24 17:43:24 p1-is03 kernel:  [<c011e447>] default_wake_function+0x0/0xc
Nov 24 17:43:24 p1-is03 kernel:  [<c011e447>] default_wake_function+0x0/0xc
Nov 24 17:43:24 p1-is03 kernel:  [<f9c426e5>] dlm_async+0x0/0x2ff [lock_dlm]
Nov 24 17:43:24 p1-is03 kernel:  [<c01338f1>] kthread+0x73/0x9b
Nov 24 17:43:24 p1-is03 kernel:  [<c013387e>] kthread+0x0/0x9b
Nov 24 17:43:24 p1-is03 kernel:  [<c01041f1>] kernel_thread_helper+0x5/0xb
Nov 24 17:43:24 p1-is03 kernel: Code: 73 34 8b 03 ff 73 2c ff 73 08 ff 73 04 ff 73 0c 56 ff 70 18 68 8d 32 c4 f9 e8 44 3c 4e c6 83 c4 34 68 95 31 c4 f9 e8 37 3c 4e c6 <0f> 0b 65 01 c2 30 c4 f9 68 97 31 c4 f9 e8 f2 33 4e c6 5b 5e c3
Nov 24 17:43:24 p1-is03 kernel: <0>Fatal exception: panic in 5 seconds
```

---

Based on the "unlock 590077 no id" message I suspect that the dlm is being shut down by cman, although I'd expect to see some cman messages to that effect. There are also some new printk's in the dlm that would help confirm that.
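To make the suspected mechanism concrete: an unlock request for a lock ID the lockspace no longer knows (the "unlock 590077 no id" message) returns -EINVAL (-22, matching the error=-22 in the assertion output), and lock_dlm's unconditional "!error" assertion on the unlock return value turns that into a kernel BUG. The following is a minimal user-space toy model of that pattern, not the actual lock_dlm code; all names (toy_unlock, lock_table) are invented for illustration:

```c
/* Toy model of the lock_dlm failure mode: unlocking a lock ID that the
 * lockspace no longer tracks returns -EINVAL, and an unconditional
 * assertion on the unlock result turns that into a fatal error.
 * This is an illustrative sketch, NOT the kernel code from lock.c. */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 8

static unsigned int lock_table[TABLE_SIZE]; /* nonzero entries = valid lock IDs */

/* Release a lock by ID; return 0 on success, -EINVAL if the ID is unknown. */
static int toy_unlock(unsigned int lkid)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (lock_table[i] == lkid) {
            lock_table[i] = 0;
            return 0;
        }
    }
    return -EINVAL; /* the "unlock ... no id" case */
}

int main(void)
{
    lock_table[0] = 0x590077;

    /* Simulate recovery/shutdown tearing down the lockspace behind the
     * caller's back, so the caller's lock ID goes stale. */
    memset(lock_table, 0, sizeof(lock_table));

    int error = toy_unlock(0x590077);
    fprintf(stderr, "unlock returned %d\n", error); /* -22 on Linux */

    /* The moral equivalent of lock_dlm's assertion "!error": any unlock
     * failure, including a stale ID after shutdown, is treated as fatal. */
    assert(!error);
    return 0;
}
```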
---

(In reply to comment #4)
> to see some cman messages to that effect. There are also
> some new printk's in the dlm that would help confirm that.

What version of gfs contains these printk changes? cluster/dlm-kernel/src/ has not changed in 7 months.

---

STABLE and RHEL4 versions of dlm-kernel/src/lockspace.c.

---

We have seen this bug as well by disconnecting the FC switch, as follows:

Description: Disabling one of the FC switches appears to cause lock_dlm assertion errors.

Steps to reproduce:
1. Create or use existing exports.
2. Mount the exports to Linux clients.
3. Run read/write/compare I/O to the exports.
4. While the I/O is in progress, disable all ports on, or power off, one of the FC switches.
5. Check I/O.
6. Check syslog for errors.
7. Re-enable or power on the FC switch.

Expected results:
1. I/O should continue using alternate paths.
2. I/O path errors should be retried and redirected through a different path.
3. I/O should complete without any errors.

Observed:

1. After a minute or so, node 1 asserted as shown below:

```
Node 1:
lock_dlm: Assertion failed on line 357 of file fs/gfs_locking/lock_dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 4307924940
JinVol: error=-22 num=2,1a lkf=10000 flags=84
cwalk_reset_bcm_chip: enter
cwalk_reset_bcm_chip: i=0 ptr=000001007f281380
cwalk_reset_bcm_chip: i=1 ptr=00000100e6848380
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:357
invalid operand: 0000 [1] SMP
Entering kdb (current=0x00000100e31307f0, pid 16132) on processor 0
Oops: <NULL>
due to oops @ 0xffffffffa0239cad
r15 = 0x00000100db871cb0   r14 = 0xffffffffa02357e0
r13 = 0xffffff0000cad000   r12 = 0x00000100db871cdc
rbp = 0x00000000ffffffea   rbx = 0x000001007794a480
r11 = 0x0000000d80391c00   r10 = 0x0000000000000000
r9  = 0x000001007794a480   r8  = 0x000000000000000d
rax = 0x0000000000000001   rcx = 0x0000000000000246
rdx = 0x00000000002a2876   rsi = 0x0000000000000246
rdi = 0xffffffff80375de0   orig_rax = 0xffffffffffffffff
rip = 0xffffffffa0239cad   cs = 0x0000000000000010
eflags = 0x0000000000010216   rsp = 0x00000100db181db0
ss = 0x00000100db180000   &regs = 0x00000100db181d18
[0]kdb> bt
Stack traceback for pid 16132
0x00000100e31307f0 16132 1 1 0 R 0x00000100e3130bf0 *gfs_glockd
RSP           RIP                 Function (args)
0x100db181db0 0xffffffffa0239cad  [lock_dlm]do_dlm_unlock+0xa5 (0x1, 0xffffffffa02088c9, 0x1, 0xffffffffa01ff36a, 0x100db871cb0)
0x100db181df8 0xffffffffa023a02b  [lock_dlm]lm_dlm_unlock+0x15
0x100db181f58 0xffffffff8010fbaf  child_rip+0x8
```

2. After another couple of minutes, node 2 asserted the same way:

```
Node 2:
lock_dlm: Assertion failed on line 357 of file fs/gfs_locking/lock_dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 4307886580
crosswalk: error=-22 num=2,28 lkf=44 flags=84
cwalk_reset_bcm_chip: enter
cwalk_reset_bcm_chip: i=0 ptr=000001007f23b380
cwalk_reset_bcm_chip: i=1 ptr=00000100e6803380
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:357
invalid operand: 0000 [1] SMP
Entering kdb (current=0x000001007619f030, pid 12214) on processor 0
Oops: <NULL>
due to oops @ 0xffffffffa0239cad
r15 = 0x00000100ddc178c8   r14 = 0xffffffffa02357e0
r13 = 0xffffff0000c73000   r12 = 0x00000100ddc178f4
rbp = 0x00000000ffffffea   rbx = 0x00000100e32b3e80
r11 = 0x0000000d80391c00   r10 = 0x0000000000000000
r9  = 0x00000100e32b3e80   r8  = 0x000000000000000d
rax = 0x0000000000000001   rcx = 0x0000000000000246
rdx = 0x0000000000716525   rsi = 0x0000000000000246
rdi = 0xffffffff80375de0   orig_rax = 0xffffffffffffffff
rip = 0xffffffffa0239cad   cs = 0x0000000000000010
eflags = 0x0000000000010216   rsp = 0x00000100758a1db0
ss = 0x00000100758a0000   &regs = 0x00000100758a1d18
[0]kdb> bt
Stack traceback for pid 12214
0x000001007619f030 12214 1 1 0 R 0x000001007619f430 *gfs_glockd
RSP           RIP                 Function (args)
0x100758a1db0 0xffffffffa0239cad  [lock_dlm]do_dlm_unlock+0xa5 (0x2, 0xffffffffa02088c9, 0x1, 0xffffffffa01ff36a, 0x100ddc178c8)
0x100758a1df8 0xffffffffa023a02b  [lock_dlm]lm_dlm_unlock+0x15
0x100758a1f58 0xffffffff8010fbaf  child_rip+0x8
```

---

Any other messages before the assertion failure? If not, the RHEL4U3 update provides some additional debugging output that may be useful here.

---

Reproduced this problem using Update 3. Performed the "disable an active port from an HBA to the FC switch" test cases, with the following results:

* No SCSI 20000 errors were seen.
* There were SCSI 10000 errors.
* A bunch of retried paths were seen, until one of the volumes was corrupted:

```
qla2x00: cwalk_release: releasing module, cnt=1
GFS: Trying to join cluster "lock_dlm", "sqaone:TestVol_01"
GFS: fsid=sqaone:TestVol_01.1: Joined cluster. Now mounting FS...
GFS: fsid=sqaone:TestVol_01.1: jid=1: Trying to acquire journal lock...
GFS: fsid=sqaone:TestVol_01.1: jid=1: Looking at journal...
GFS: fsid=sqaone:TestVol_01.1: jid=1: Done
GFS: Trying to join cluster "lock_dlm", "sqaone:TestVol_02"
GFS: fsid=sqaone:TestVol_02.1: Joined cluster. Now mounting FS...
GFS: fsid=sqaone:TestVol_02.1: jid=1: Trying to acquire journal lock...
GFS: fsid=sqaone:TestVol_02.1: jid=1: Looking at journal...
GFS: fsid=sqaone:TestVol_02.1: jid=1: Done
GFS: fsid=sqaone:TestVol_02.1: fatal: filesystem consistency error
GFS: fsid=sqaone:TestVol_02.1:   inode = 36/36
GFS: fsid=sqaone:TestVol_02.1:   function = gfs_copyin_dinode
GFS: fsid=sqaone:TestVol_02.1:   file = fs/gfs/inode.c, line = 266
GFS: fsid=sqaone:TestVol_02.1:   time = 1142296033
GFS: fsid=sqaone:TestVol_02.1: about to withdraw from the cluster
GFS: fsid=sqaone:TestVol_02.1: waiting for outstanding I/O
GFS: fsid=sqaone:TestVol_02.1: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=sqaone:TestVol_02.1: withdrawn
  mh_magic = 0x01161970
  mh_type = 4
  mh_generation = 49412
  mh_format = 400
  mh_incarn = 0
  no_formal_ino = 52
  no_addr = 52
  di_mode = 03545
  di_uid = 0
  di_gid = 0
  di_nlink = 1
  di_size = 800
  di_blocks = 1
  di_atime = 1142291438
  di_mtime = 1142293185
  di_ctime = 1142293185
  di_major = 0
  di_minor = 0
  di_rgrp = 17
  di_goal_rgrp = 17
  di_goal_dblk = 30
  di_goal_mblk = 30
  di_flags = 0x00000000
  di_payload_format = 0
  di_type = 1
  di_height = 0
  di_incarn = 0
  di_pad = 0
  di_depth = 0
  di_entries = 0
  no_formal_ino = 0
  no_addr = 0
  di_eattr = 0
  di_reserved =
  00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ OK ]
Red Hat Enterprise Linux AS release 4 (Nahant Update 3)
Kernel 2.6.9-34.ELsmp on an x86_64
```

---

Yes, the fs appears to be corrupted (although it has nothing to do with the subject of this bugzilla). If the fs hasn't been destroyed, I'd like to poke around on the disk to gather more clues about what/where/how things got corrupted. I can't come up with any explanation, actually, apart from a bad buffer in memory, for the inconsistency that's printed out above. Does any new node mounting the fs hit this inconsistency right away?

---

Similar thing reported here: https://www.redhat.com/archives/linux-cluster/2006-April/msg00004.html

---

Created attachment 133109: node1 /var/log/messages (node1 log, tar/gz)
Created attachment 133110: node 2 /var/log/messages (node2 logs, tar/gz)

Created attachment 133111: node 3 /var/log/messages (node 3 logs, tar/gz)

---
I attached all the /var/log/messages files for a 3-node cluster running the 2.6.9-34 version of the kernel on all nodes. Note, the pr_start messages preceded node fencings on the following dates/times when looking at the logs (i.e. these are the fence events we care about for this bug):

- node1 fenced node3, June 18, 4:46
- node1 fenced node2, June 29, 8:22
- node2 fenced node1, July 8, 07:57

There are other fence events in there, but with these ones the node that was fenced displayed the pr_start* strings in its messages file before being fenced/rebooted. We suspect we were in kdb on these nodes, and in fact we think, from what we've seen in house, that gfs and dlm functions are in the stack and that dlm is the bottom/last function called. Still verifying; will append a stack if/when we reproduce.

---

Comments 13-16 are not related to the topic of this bz. The pr_start messages are routine output when a node is mounting, or processing another mount/unmount or recovery.

---

We have found that we had an invalid bond mode configuration. We were supposed to be using mode 3 and were using mode 0. This was causing some packet loss over the bond interface(s). Could packet loss have led down this path? (See the configuration sketch at the end of this report.)

---

We have changed the bond to mode 3 and we had this happen again. Looks like it isn't just a bond mode problem.

---

I'm relatively confident this is a duplicate of bug 199673.

---

Do people see this any longer after the fix for bug 199673?

---

I haven't seen this happen after we applied the patch from bug 199673. I'm pretty confident that this issue has been solved.

*** This bug has been marked as a duplicate of 199673 ***
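For context on the bond modes discussed above: in the Linux bonding driver, mode 0 is balance-rr (round-robin across slaves, which can reorder packets and, under path failures, lose them) and mode 3 is broadcast (transmit on all slaves). On RHEL 4 the mode is typically set in /etc/modprobe.conf; a minimal sketch, assuming a bond0 interface and a 100 ms MII monitor interval (both illustrative, not the reporter's actual configuration):

```
# /etc/modprobe.conf (RHEL 4 era) -- illustrative sketch only; the interface
# name and miimon value are assumptions, not taken from this bug report.
alias bond0 bonding
# mode=3 (broadcast) is what the reporter intended to run; mode=0
# (balance-rr) was the misconfiguration that coincided with packet loss
# on the cluster interconnect.
options bond0 mode=3 miimon=100
```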