Description of problem:

I got the following kernel panic under normal operation (the I/O load was not that high):

Nov 11 12:30:27 p1-is04 kernel: 67dc0091 fr 2 r 2 5
Nov 11 12:30:27 p1-is04 kernel: mail send einval to 2
Nov 11 12:30:27 p1-is04 kernel: mail send einval to 2
Nov 11 12:30:27 p1-is04 kernel: mail (2903) req reply einval 6a18025f fr 2 r 2 5
Nov 11 12:30:27 p1-is04 kernel: mail (2903) req reply einval 689401e1 fr 2 r 2 5
Nov 11 12:30:27 p1-is04 kernel: mail (2903) req reply einval 6bd90132 fr 2 r 2 5
Nov 11 12:30:27 p1-is04 kernel: mail send einval to 2
Nov 11 12:30:27 p1-is04 last message repeated 3 times
Nov 11 12:30:27 p1-is04 kernel: mail grant lock on lockqueue 2
Nov 11 12:30:27 p1-is04 kernel: mail process_lockqueue_reply id 94080216 state 0
Nov 11 12:30:27 p1-is04 kernel: mail grant lock on lockqueue 2
Nov 11 12:30:27 p1-is04 kernel: mail process_lockqueue_reply id 915d0266 state 0
Nov 11 12:30:27 p1-is04 kernel: mail send einval to 2
Nov 11 12:30:27 p1-is04 last message repeated 8 times
Nov 11 12:30:27 p1-is04 kernel: mail (2903) req reply einval e12b03bc fr 2 r 2 5
Nov 11 12:30:27 p1-is04 kernel: mail (2903) req reply einval e14003ce fr 2 r 2 5
Nov 11 12:30:27 p1-is04 kernel: mail (2903) req reply einval e3ca005d fr 2 r 2 5
Nov 11 12:30:27 p1-is04 kernel: mail send einval to 2
Nov 11 12:30:27 p1-is04 last message repeated 3 times
Nov 11 12:30:27 p1-is04 kernel: mail unlock fe6500ed no id
Nov 11 12:30:27 p1-is04 kernel: 2897 pr_start last_stop 0 last_start 4 last_finish 0
Nov 11 12:30:27 p1-is04 kernel: 2897 pr_start count 1 type 2 event 4 flags 250
Nov 11 12:30:27 p1-is04 kernel: 2897 claim_jid 0
Nov 11 12:30:27 p1-is04 kernel: 2897 pr_start 4 done 1
Nov 11 12:30:27 p1-is04 kernel: 2897 pr_finish flags 5b
Nov 11 12:30:27 p1-is04 kernel: 2888 recovery_done jid 0 msg 309 b
Nov 11 12:30:28 p1-is04 kernel: 2888 recovery_done nodeid 1 flg 18
Nov 11 12:30:28 p1-is04 kernel: 2888 recovery_done jid 1 msg 309 b
Nov 11 12:30:28 p1-is04 kernel: 2888 recovery_done jid 2 msg 309 b
Nov 11 12:30:28 p1-is04 kernel: 2888 recovery_done jid 3 msg 309 b
Nov 11 12:30:28 p1-is04 kernel: 2888 recovery_done jid 4 msg 309 b
Nov 11 12:30:28 p1-is04 kernel: 2888 recovery_done jid 5 msg 309 b
Nov 11 12:30:28 p1-is04 kernel: 2888 recovery_done jid 6 msg 309 b
Nov 11 12:30:28 p1-is04 kernel: 2888 recovery_done jid 7 msg 309 b
Nov 11 12:30:28 p1-is04 kernel: 2888 others_may_mount b
Nov 11 12:30:28 p1-is04 kernel: 2896 pr_start last_stop 4 last_start 8 last_finish 4
Nov 11 12:30:28 p1-is04 kernel: 2896 pr_start count 2 type 2 event 8 flags a1b
Nov 11 12:30:28 p1-is04 kernel: 2896 pr_start 8 done 1
Nov 11 12:30:28 p1-is04 kernel: 2896 pr_finish flags 81b
Nov 11 12:30:28 p1-is04 kernel:
Nov 11 12:30:28 p1-is04 kernel: lock_dlm: Assertion failed on line 357 of file /home/klaamane/NGMPLAT/packages/gfs/target/BUILD/gfs-kernel-2.6.9-42/smp/src/dlm/lock.c
Nov 11 12:30:28 p1-is04 kernel: lock_dlm: assertion: "!error"
Nov 11 12:30:28 p1-is04 kernel: lock_dlm: time = 1219504465
Nov 11 12:30:28 p1-is04 kernel: mail: error=-22 num=2,1cc3b21c lkf=1 flags=84
Nov 11 12:30:28 p1-is04 kernel:
Nov 11 12:30:28 p1-is04 kernel: ------------[ cut here ]------------
Nov 11 12:30:28 p1-is04 kernel: kernel BUG at /home/klaamane/NGMPLAT/packages/gfs/target/BUILD/gfs-kernel-2.6.9-42/smp/src/dlm/lock.c:357!
Nov 11 12:30:28 p1-is04 kernel: invalid operand: 0000 [#1]
Nov 11 12:30:28 p1-is04 kernel: SMP
Nov 11 12:30:28 p1-is04 kernel: Modules linked in: ip_vs i2o_config autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) qla2300 sg md5 ipv6 ipmi_si ipmi_devintf ipmi_msghandler iptable_filter ipt_REDIRECT iptable_nat ip_conntrack ip_tables dm_mirror button battery ac uhci_hcd ehci_hcd shpchp e1000 bonding(U) floppy ext3 jbd dm_mod i2o_block i2o_core qla2xxx scsi_transport_fc sd_mod scsi_mod
Nov 11 12:30:28 p1-is04 kernel: CPU: 0
Nov 11 12:30:28 p1-is04 kernel: EIP: 0060:[<f9c3e5f3>] Not tainted VLI
Nov 11 12:30:28 p1-is04 kernel: EFLAGS: 00010216 (2.6.9-22.ELsmp)
Nov 11 12:30:28 p1-is04 kernel: EIP is at do_dlm_unlock+0x8b/0xa0 [lock_dlm]
Nov 11 12:30:28 p1-is04 kernel: eax: 00000001 ebx: d991a680 ecx: c3b6ef2c edx: f9c43195
Nov 11 12:30:28 p1-is04 kernel: esi: ffffffea edi: d991a680 ebp: f9d02000 esp: c3b6ef28
Nov 11 12:30:28 p1-is04 kernel: ds: 007b es: 007b ss: 0068
Nov 11 12:30:28 p1-is04 kernel: Process gfs_glockd (pid: 2899, threadinfo=c3b6e000 task=f63231b0)
Nov 11 12:30:28 p1-is04 kernel: Stack: f9c43195 f9d02000 00000001 f9c3e893 f9d6f53e 00000001 c401a5c4 c401a5a8
Nov 11 12:30:28 p1-is04 kernel:        f9d65886 f9d9a500 d5fbebc0 c401a5a8 f9d9a500 c401a5a8 f9d64d7f c401a5a8
Nov 11 12:30:28 p1-is04 kernel:        00000001 c401a63c f9d64e36 c401a5a8 c401a5c4 f9d64f59 00000001 c401a5c4
Nov 11 12:30:28 p1-is04 kernel: Call Trace:
Nov 11 12:30:28 p1-is04 kernel:  [<f9c3e893>] lm_dlm_unlock+0x14/0x1c [lock_dlm]
Nov 11 12:30:28 p1-is04 kernel:  [<f9d6f53e>] gfs_lm_unlock+0x2c/0x42 [gfs]
Nov 11 12:30:28 p1-is04 kernel:  [<f9d65886>] gfs_glock_drop_th+0xf3/0x12d [gfs]
Nov 11 12:30:28 p1-is04 kernel:  [<f9d64d7f>] rq_demote+0x7f/0x98 [gfs]
Nov 11 12:30:28 p1-is04 kernel:  [<f9d64e36>] run_queue+0x5a/0xc1 [gfs]
Nov 11 12:30:28 p1-is04 kernel:  [<f9d64f59>] unlock_on_glock+0x1f/0x28 [gfs]
Nov 11 12:30:28 p1-is04 kernel:  [<f9d66eb8>] gfs_reclaim_glock+0xc3/0x13c [gfs]
Nov 11 12:30:28 p1-is04 kernel:  [<f9d59e01>] gfs_glockd+0x39/0xde [gfs]
Nov 11 12:30:28 p1-is04 kernel:  [<c011e52f>] default_wake_function+0x0/0xc
Nov 11 12:30:28 p1-is04 kernel:  [<c02d0fee>] ret_from_fork+0x6/0x14
Nov 11 12:30:28 p1-is04 kernel:  [<c011e52f>] default_wake_function+0x0/0xc
Nov 11 12:30:28 p1-is04 kernel:  [<f9d59dc8>] gfs_glockd+0x0/0xde [gfs]
Nov 11 12:30:28 p1-is04 kernel:  [<c01041f1>] kernel_thread_helper+0x5/0xb
Nov 11 12:30:28 p1-is04 kernel: Code: 73 34 8b 03 ff 73 2c ff 73 08 ff 73 04 ff 73 0c 56 ff 70 18 68 8d 32 c4 f9 e8 2c 3d 4e c6 83 c4 34 68 95 31 c4 f9 e8 1f 3d 4e c6 <0f> 0b 65 01 c2 30 c4 f9 68 97 31 c4 f9 e8 da 34 4e c6 5b 5e c3
Nov 11 12:30:28 p1-is04 kernel: <0>Fatal exception: panic in 5 seconds

Version-Release number of selected component (if applicable):
cman-kernel-smp-2.6.9-39.5
dlm-kernel-smp-2.6.9-37.7
kernel-smp-2.6.9-22.EL
GFS-kernel-smp-2.6.9-42.1

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Could you possibly get the log messages going further back? One thing I'm looking for is any cman messages. All these errors about "einval" and "no id" are often a sign that the dlm has been shut down because cman has lost connection to the cluster.
Here is some additional log. The next line in the log would be the one posted above.

Nov 3 12:30:09 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 4a2e0139 state 0
Nov 3 16:04:10 p1-is04 sshd(pam_unix)[1256]: session opened for user root by root(uid=0)
Nov 3 17:16:26 p1-is04 sshd(pam_unix)[20821]: session opened for user root by root(uid=0)
Nov 4 08:48:57 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 748e033d state 0
Nov 4 17:02:59 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 92c80101 state 0
Nov 6 04:02:03 p1-is04 syslogd 1.4.1: restart.
Nov 6 13:52:18 p1-is04 sshd(pam_unix)[21878]: session opened for user root by root(uid=0)
Nov 6 14:05:02 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 14a00022 state 0
Nov 6 15:26:43 p1-is04 sshd(pam_unix)[2553]: session opened for user root by root(uid=0)
Nov 7 13:55:25 p1-is04 sshd(pam_unix)[28155]: session opened for user root by root(uid=0)
Nov 8 06:52:15 p1-is04 sshd(pam_unix)[31921]: session opened for user root by root(uid=0)
Nov 9 06:28:17 p1-is04 sshd(pam_unix)[23366]: session opened for user root by root(uid=0)
Nov 9 13:59:44 p1-is04 sshd(pam_unix)[22949]: session opened for user root by root(uid=0)
Nov 10 14:15:09 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 94080216 state 0
Nov 10 14:15:09 p1-is04 kernel: dlm: mail: process_lockqueue_reply id 915d0266 state 0
Nov 10 16:19:33 p1-is04 sshd(pam_unix)[13302]: session opened for user root by (uid=0)
Nov 10 16:21:52 p1-is04 sshd(pam_unix)[14034]: session opened for user root by root(uid=0)
Nov 10 16:41:27 p1-is04 sshd(pam_unix)[19844]: session opened for user root by root(uid=0)
Nov 10 17:49:21 p1-is04 sshd(pam_unix)[6304]: session opened for user root by root(uid=0)
Nov 10 17:50:31 p1-is04 sshd(pam_unix)[6646]: session opened for user root by root(uid=0)
Nov 11 08:11:53 p1-is04 sshd(pam_unix)[30479]: session opened for user root by root(uid=0)
I got the same panic on a different node:

Nov 24 09:10:03 p1-is03 kernel: dlm: mail: process_lockqueue_reply id bb8402cb state 0
Nov 24 10:50:02 p1-is03 kernel: dlm: mail: process_lockqueue_reply id bf7503fa state 0
Nov 24 17:43:23 p1-is03 kernel: 0,1,0 ids 0,2,0
Nov 24 17:43:23 p1-is03 kernel: clvmd move use event 2
Nov 24 17:43:23 p1-is03 kernel: clvmd recover event 2 (first)
Nov 24 17:43:23 p1-is03 kernel: clvmd add nodes
Nov 24 17:43:23 p1-is03 kernel: clvmd total nodes 2
Nov 24 17:43:23 p1-is03 kernel: clvmd rebuild resource directory
Nov 24 17:43:23 p1-is03 kernel: clvmd rebuilt 1 resources
Nov 24 17:43:23 p1-is03 kernel: clvmd recover event 2 done
Nov 24 17:43:23 p1-is03 kernel: clvmd move flags 0,0,1 ids 0,2,2
Nov 24 17:43:23 p1-is03 kernel: clvmd process held requests
Nov 24 17:43:23 p1-is03 kernel: clvmd processed 0 requests
Nov 24 17:43:23 p1-is03 kernel: clvmd recover event 2 finished
Nov 24 17:43:23 p1-is03 kernel: mail move flags 0,1,0 ids 0,3,0
Nov 24 17:43:23 p1-is03 kernel: mail move use event 3
Nov 24 17:43:23 p1-is03 kernel: mail recover event 3 (first)
Nov 24 17:43:23 p1-is03 kernel: mail add nodes
Nov 24 17:43:23 p1-is03 kernel: mail total nodes 2
Nov 24 17:43:23 p1-is03 kernel: mail rebuild resource directory
Nov 24 17:43:23 p1-is03 kernel: mail rebuilt 89450 resources
Nov 24 17:43:23 p1-is03 kernel: mail recover event 3 done
Nov 24 17:43:23 p1-is03 kernel: mail move flags 0,0,1 ids 0,3,3
Nov 24 17:43:23 p1-is03 kernel: mail process held requests
Nov 24 17:43:23 p1-is03 kernel: mail processed 0 requests
Nov 24 17:43:23 p1-is03 kernel: mail recover event 3 finished
Nov 24 17:43:23 p1-is03 kernel: mail grant lock on lockqueue 3
Nov 24 17:43:23 p1-is03 kernel: mail process_lockqueue_reply id 3f190257 state 0
Nov 24 17:43:23 p1-is03 kernel: mail (5357) req reply einval 7c59008a fr 1 r 1 5
Nov 24 17:43:23 p1-is03 kernel: mail send einval to 1
Nov 24 17:43:23 p1-is03 kernel: mail grant lock on lockqueue 2
Nov 24 17:43:23 p1-is03 kernel: mail process_lockqueue_reply id bb8402cb state 0
Nov 24 17:43:23 p1-is03 kernel: mail grant lock on lockqueue 3
Nov 24 17:43:23 p1-is03 kernel: mail process_lockqueue_reply id bf7503fa state 0
Nov 24 17:43:23 p1-is03 kernel: mail send einval to 1
Nov 24 17:43:23 p1-is03 kernel: mail send einval to 1
Nov 24 17:43:23 p1-is03 kernel: mail unlock 590077 no id
Nov 24 17:43:23 p1-is03 kernel: 2885 pr_start last_stop 0 last_start 4 last_finish 0
Nov 24 17:43:23 p1-is03 kernel: 2885 pr_start count 2 type 2 event 4 flags 250
Nov 24 17:43:23 p1-is03 kernel: 2885 claim_jid 1
Nov 24 17:43:23 p1-is03 kernel: 2885 pr_start 4 done 1
Nov 24 17:43:23 p1-is03 kernel: 2885 pr_finish flags 5a
Nov 24 17:43:23 p1-is03 kernel: 2876 recovery_done jid 1 msg 309 a
Nov 24 17:43:23 p1-is03 kernel: 2876 recovery_done nodeid 2 flg 18
Nov 24 17:43:23 p1-is03 kernel:
Nov 24 17:43:23 p1-is03 kernel: lock_dlm: Assertion failed on line 357 of file /home/klaamane/NGMPLAT/packages/gfs/target/BUILD/gfs-kernel-2.6.9-42/smp/src/dlm/lock.c
Nov 24 17:43:23 p1-is03 kernel: lock_dlm: assertion: "!error"
Nov 24 17:43:23 p1-is03 kernel: lock_dlm: time = 120394931
Nov 24 17:43:23 p1-is03 kernel: mail: error=-22 num=5,e3d63a1 lkf=801 flags=84
Nov 24 17:43:23 p1-is03 kernel:
Nov 24 17:43:24 p1-is03 kernel: ------------[ cut here ]------------
Nov 24 17:43:24 p1-is03 kernel: kernel BUG at /home/klaamane/NGMPLAT/packages/gfs/target/BUILD/gfs-kernel-2.6.9-42/smp/src/dlm/lock.c:357!
Nov 24 17:43:24 p1-is03 kernel: invalid operand: 0000 [#1]
Nov 24 17:43:24 p1-is03 kernel: SMP
Nov 24 17:43:24 p1-is03 kernel: Modules linked in: ip_vs i2o_config autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) qla2300 sg md5 ipv6 ipmi_si ipmi_devintf ipmi_msghandler iptable_filter ipt_REDIRECT iptable_nat ip_conntrack ip_tables dm_mirror button battery ac uhci_hcd ehci_hcd shpchp e1000 bonding(U) floppy ext3 jbd dm_mod i2o_block i2o_core qla2xxx scsi_transport_fc sd_mod scsi_mod
Nov 24 17:43:24 p1-is03 kernel: CPU: 3
Nov 24 17:43:24 p1-is03 kernel: EIP: 0060:[<f9c3e5f3>] Not tainted VLI
Nov 24 17:43:24 p1-is03 kernel: EFLAGS: 00010216 (2.6.9-22.0.1.ELsmp)
Nov 24 17:43:24 p1-is03 kernel: EIP is at do_dlm_unlock+0x8b/0xa0 [lock_dlm]
Nov 24 17:43:24 p1-is03 kernel: eax: 00000001 ebx: dc552180 ecx: f608af04 edx: f9c43195
Nov 24 17:43:24 p1-is03 kernel: esi: ffffffea edi: dc552180 ebp: f9d02000 esp: f608af00
Nov 24 17:43:24 p1-is03 kernel: ds: 007b es: 007b ss: 0068
Nov 24 17:43:24 p1-is03 kernel: Process lock_dlm1 (pid: 2884, threadinfo=f608a000 task=f6460230)
Nov 24 17:43:24 p1-is03 kernel: Stack: f9c43195 f9d02000 00000003 f9c3e893 f9d6f53e 00000001 f0388c7c f0388c60
Nov 24 17:43:24 p1-is03 kernel:        f9d65886 f9d9a5c0 cdd57440 f0388c60 f9d9a5c0 dc552180 f9d64d7f f0388c60
Nov 24 17:43:24 p1-is03 kernel:        00000001 00000001 f9d64e36 f0388c7c f0388c60 f9d669e3 dc552180 c385a800
Nov 24 17:43:24 p1-is03 kernel: Call Trace:
Nov 24 17:43:24 p1-is03 kernel:  [<f9c3e893>] lm_dlm_unlock+0x14/0x1c [lock_dlm]
Nov 24 17:43:24 p1-is03 kernel:  [<f9d6f53e>] gfs_lm_unlock+0x2c/0x42 [gfs]
Nov 24 17:43:24 p1-is03 kernel:  [<f9d65886>] gfs_glock_drop_th+0xf3/0x12d [gfs]
Nov 24 17:43:24 p1-is03 kernel:  [<f9d64d7f>] rq_demote+0x7f/0x98 [gfs]
Nov 24 17:43:24 p1-is03 kernel:  [<f9d64e36>] run_queue+0x5a/0xc1 [gfs]
Nov 24 17:43:24 p1-is03 kernel:  [<f9d669e3>] blocking_cb+0x39/0x7a [gfs]
Nov 24 17:43:24 p1-is03 kernel:  [<f9c42323>] process_blocking+0x90/0x93 [lock_dlm]
Nov 24 17:43:24 p1-is03 kernel:  [<f9c42970>] dlm_async+0x28b/0x2ff [lock_dlm]
Nov 24 17:43:24 p1-is03 kernel:  [<c011e447>] default_wake_function+0x0/0xc
Nov 24 17:43:24 p1-is03 kernel:  [<c011e447>] default_wake_function+0x0/0xc
Nov 24 17:43:24 p1-is03 kernel:  [<f9c426e5>] dlm_async+0x0/0x2ff [lock_dlm]
Nov 24 17:43:24 p1-is03 kernel:  [<c01338f1>] kthread+0x73/0x9b
Nov 24 17:43:24 p1-is03 kernel:  [<c013387e>] kthread+0x0/0x9b
Nov 24 17:43:24 p1-is03 kernel:  [<c01041f1>] kernel_thread_helper+0x5/0xb
Nov 24 17:43:24 p1-is03 kernel: Code: 73 34 8b 03 ff 73 2c ff 73 08 ff 73 04 ff 73 0c 56 ff 70 18 68 8d 32 c4 f9 e8 44 3c 4e c6 83 c4 34 68 95 31 c4 f9 e8 37 3c 4e c6 <0f> 0b 65 01 c2 30 c4 f9 68 97 31 c4 f9 e8 f2 33 4e c6 5b 5e c3
Nov 24 17:43:24 p1-is03 kernel: <0>Fatal exception: panic in 5 seconds
Based on the "unlock 590077 no id" message I suspect that the dlm is being shut down by cman, although I'd expect to see some cman messages to that effect. There are also some new printk's in the dlm that would help confirm that.
(In reply to comment #4)
> to see some cman messages to that effect. There are also
> some new printk's in the dlm that would help confirm that.

What version of gfs contains these printk changes? cluster/dlm-kernel/src/ has not changed in 7 months.
STABLE and RHEL4 versions of dlm-kernel/src/lockspace.c
We have seen this bug as well by disconnecting the FC switch as follows:

###############################################################################
# Description: Disabling one of the FC switches appears to cause lock_dlm
# assertion errors
###############################################################################
# Steps to reproduce:
1. Create or use existing exports
2. Mount the exports to linux clients
3. Run read/write/compare IO to exports
4. While the I/O is in progress, disable all ports or power off one of the FC switches
5. Check I/O
6. Check syslog for errors
7. Reenable or power on FC switch
###############################################################################
# Expected Results:
1. I/O should continue using alternate paths.
2. I/O path errors should be retried and redirected thru a different path.
3. I/O should complete without any errors.
###############################################################################
# Observed:
1. After a minute or so, node 1 asserted as shown below:

Node1:
lock_dlm: Assertion failed on line 357 of file fs/gfs_locking/lock_dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 4307924940
JinVol: error=-22 num=2,1a lkf=10000 flags=84
cwalk_reset_bcm_chip: enter
cwalk_reset_bcm_chip: i=0 ptr=000001007f281380
cwalk_reset_bcm_chip: i=1 ptr=00000100e6848380
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:357
invalid operand: 0000 [1] SMP
Entering kdb (current=0x00000100e31307f0, pid 16132) on processor 0
Oops: <NULL>
due to oops @ 0xffffffffa0239cad
r15 = 0x00000100db871cb0  r14 = 0xffffffffa02357e0
r13 = 0xffffff0000cad000  r12 = 0x00000100db871cdc
rbp = 0x00000000ffffffea  rbx = 0x000001007794a480
r11 = 0x0000000d80391c00  r10 = 0x0000000000000000
r9  = 0x000001007794a480  r8  = 0x000000000000000d
rax = 0x0000000000000001  rcx = 0x0000000000000246
rdx = 0x00000000002a2876  rsi = 0x0000000000000246
rdi = 0xffffffff80375de0  orig_rax = 0xffffffffffffffff
rip = 0xffffffffa0239cad  cs = 0x0000000000000010
eflags = 0x0000000000010216  rsp = 0x00000100db181db0
ss = 0x00000100db180000  &regs = 0x00000100db181d18
[0]kdb>
[0]kdb> bt
Stack traceback for pid 16132
0x00000100e31307f0 16132 1 1 0 R 0x00000100e3130bf0 *gfs_glockd
RSP           RIP                Function (args)
0x100db181db0 0xffffffffa0239cad [lock_dlm]do_dlm_unlock+0xa5 (0x1, 0xffffffffa02088c9, 0x1, 0xffffffffa01ff36a, 0x100db871cb0)
0x100db181df8 0xffffffffa023a02b [lock_dlm]lm_dlm_unlock+0x15
0x100db181f58 0xffffffff8010fbaf child_rip+0x8

2. After another couple of minutes, node 2 asserted the same way:

Node 2:
lock_dlm: Assertion failed on line 357 of file fs/gfs_locking/lock_dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 4307886580
crosswalk: error=-22 num=2,28 lkf=44 flags=84
cwalk_reset_bcm_chip: enter
cwalk_reset_bcm_chip: i=0 ptr=000001007f23b380
cwalk_reset_bcm_chip: i=1 ptr=00000100e6803380
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:357
invalid operand: 0000 [1] SMP
Entering kdb (current=0x000001007619f030, pid 12214) on processor 0
Oops: <NULL>
due to oops @ 0xffffffffa0239cad
r15 = 0x00000100ddc178c8  r14 = 0xffffffffa02357e0
r13 = 0xffffff0000c73000  r12 = 0x00000100ddc178f4
rbp = 0x00000000ffffffea  rbx = 0x00000100e32b3e80
r11 = 0x0000000d80391c00  r10 = 0x0000000000000000
r9  = 0x00000100e32b3e80  r8  = 0x000000000000000d
rax = 0x0000000000000001  rcx = 0x0000000000000246
rdx = 0x0000000000716525  rsi = 0x0000000000000246
rdi = 0xffffffff80375de0  orig_rax = 0xffffffffffffffff
rip = 0xffffffffa0239cad  cs = 0x0000000000000010
eflags = 0x0000000000010216  rsp = 0x00000100758a1db0
ss = 0x00000100758a0000  &regs = 0x00000100758a1d18
[0]kdb> bt
Stack traceback for pid 12214
0x000001007619f030 12214 1 1 0 R 0x000001007619f430 *gfs_glockd
RSP           RIP                Function (args)
0x100758a1db0 0xffffffffa0239cad [lock_dlm]do_dlm_unlock+0xa5 (0x2, 0xffffffffa02088c9, 0x1, 0xffffffffa01ff36a, 0x100ddc178c8)
0x100758a1df8 0xffffffffa023a02b [lock_dlm]lm_dlm_unlock+0x15
0x100758a1f58 0xffffffff8010fbaf child_rip+0x8
[0]kdb>
Any other messages before the assertion failure? If not, the RHEL4U3 update provides some additional debugging output that may be useful here.
Reproduced this problem using Update 3. Performed the "disabling an active port from an HBA to FC switch" test cases with the following results:

* No SCSI 20000 errors were seen
* There were SCSI 10000 errors
* A bunch of retried paths were seen until one of the volumes was corrupted:

qla2x00: cwalk_release: releasing module, cnt=1
GFS: Trying to join cluster "lock_dlm", "sqaone:TestVol_01"
GFS: fsid=sqaone:TestVol_01.1: Joined cluster. Now mounting FS...
GFS: fsid=sqaone:TestVol_01.1: jid=1: Trying to acquire journal lock...
GFS: fsid=sqaone:TestVol_01.1: jid=1: Looking at journal...
GFS: fsid=sqaone:TestVol_01.1: jid=1: Done
GFS: Trying to join cluster "lock_dlm", "sqaone:TestVol_02"
GFS: fsid=sqaone:TestVol_02.1: Joined cluster. Now mounting FS...
GFS: fsid=sqaone:TestVol_02.1: jid=1: Trying to acquire journal lock...
GFS: fsid=sqaone:TestVol_02.1: jid=1: Looking at journal...
GFS: fsid=sqaone:TestVol_02.1: jid=1: Done
GFS: fsid=sqaone:TestVol_02.1: fatal: filesystem consistency error
GFS: fsid=sqaone:TestVol_02.1: inode = 36/36
GFS: fsid=sqaone:TestVol_02.1: function = gfs_copyin_dinode
GFS: fsid=sqaone:TestVol_02.1: file = fs/gfs/inode.c, line = 266
GFS: fsid=sqaone:TestVol_02.1: time = 1142296033
GFS: fsid=sqaone:TestVol_02.1: about to withdraw from the cluster
GFS: fsid=sqaone:TestVol_02.1: waiting for outstanding I/O
GFS: fsid=sqaone:TestVol_02.1: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=sqaone:TestVol_02.1: withdrawn
mh_magic = 0x01161970
mh_type = 4
mh_generation = 49412
mh_format = 400
mh_incarn = 0
no_formal_ino = 52
no_addr = 52
di_mode = 03545
di_uid = 0
di_gid = 0
di_nlink = 1
di_size = 800
di_blocks = 1
di_atime = 1142291438
di_mtime = 1142293185
di_ctime = 1142293185
di_major = 0
di_minor = 0
di_rgrp = 17
di_goal_rgrp = 17
di_goal_dblk = 30
di_goal_mblk = 30
di_flags = 0x00000000
di_payload_format = 0
di_type = 1
di_height = 0
di_incarn = 0
di_pad = 0
di_depth = 0
di_entries = 0
no_formal_ino = 0
no_addr = 0
di_eattr = 0
di_reserved = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ OK ]
Red Hat Enterprise Linux AS release 4 (Nahant Update 3)
Kernel 2.6.9-34.ELsmp on an x86_64
Yes, the fs appears to be corrupted (although it has nothing to do with the subject of this bugzilla). If the fs hasn't been destroyed, I'd like to poke around on the disk to gather more clues about what/where/how things got corrupted. I can't come up with any explanation, actually, apart from a bad buffer in memory, for the inconsistency that's printed out above. Does any new node mounting the fs hit this inconsistency right away?
Similar thing reported here: https://www.redhat.com/archives/linux-cluster/2006-April/msg00004.html
Created attachment 133109 [details] node1 /var/log/messages node1 log tar/gz
Created attachment 133110 [details] node 2 /var/log/messages node2 logs tar/gz
Created attachment 133111 [details] node 3 /var/log/messages node 3 logs tar/gz
I attached all the /var/log/messages files for a 3-node cluster running the 2.6.9-34 version of the kernel on all nodes. Note, the pr_start messages preceded node fencings on the following dates/times (i.e. these are the fence events we care about for this bug):

- node1 fenced node3, June 18, 4:46
- node1 fenced node2, June 29, 8:22
- node2 fenced node1, Jul 8, 07:57

There are other fence events in there, but with these ones the node that was fenced displayed the pr_start* strings in its messages file before being fenced/rebooted. We suspect we were in kdb on these nodes; in fact, from what we've seen in house, we think gfs and dlm functions are in the stack and that dlm is the bottom/last function called. Still verifying, and will append a stack if/when we reproduce.
Comments 13-16 are not related to the topic of this bz. The pr_start messages are routine output when a node is mounting, or processing another mount/unmount or recovery.
We have found that we had an invalid bonding mode configuration. We were supposed to be using mode 3 but were using mode 0. This was causing some packet loss over the bond interface(s). Could packet loss have led down this path?
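(For reference, in case others hit the same misconfiguration: mode 0 is balance-rr, which round-robins frames across the slave interfaces, while mode 3 is broadcast, which transmits on all slaves. On RHEL 4 the mode is typically set via module options in /etc/modprobe.conf; a sketch, where the bond0 alias and miimon value are assumptions about the local setup:)

```
alias bond0 bonding
options bonding mode=3 miimon=100
```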
We have changed the bond to mode 3, and we had this happen again. It looks like it isn't just a bond mode problem.
I'm relatively confident this is a duplicate of bug 199673.
Do people see this any longer after the fix for bug 199673?
I haven't seen this happen after we applied the patch from bug 199673. I'm pretty confident that this issue has been solved.
*** This bug has been marked as a duplicate of 199673 ***