Description of problem:
This happened at the same time as bz 228067, but on link-02. I had 4 (3-legged) cluster mirrors (created out of 2 separate VGs), each with GFS on top and all running filesystem I/O. I then failed /dev/hda, which was set up to be one of the three legs in all 4 mirrors. Just after the leg failed I hit the following panic on link-02. (A rough sketch of how a comparable setup might be built follows the version information below.)

Feb 9 08:34:30 link-02 ccsd[4570]: Unable to connect to cluster infrastructure after 270 seconds.
Feb 9 08:34:30 link-02 kernel: dm-cmirror: Clustered mirror retried requests :: 768 of 453358 (1%)
Feb 9 08:34:38 link-02 kernel: dm-cmirror: Clustered mirror retried requests :: 800 of 453390 (1%)
Feb 9 08:34:46 link-02 kernel: dm-cmirror: Clustered mirror retried requests :: 832 of 453422 (1%)
Feb 9 08:34:54 link-02 kernel: dm-cmirror: Clustered mirror retried requests :: 864 of 453454 (1%)
Feb 9 08:35:00 link-02 ccsd[4570]: Unable to connect to cluster infrastructure after 300 seconds.
Feb 9 08:35:02 link-02 kernel: dm-cmirror: Clustered mirror retried requests :: 896 of 453486 (1%)
Feb 9 08:35:10 link-02 kernel: dm-cmirror: Clustered mirror retried requests :: 928 of 453518 (1%)
Feb 9 08:35:18 link-02 kernel: dm-cmirror: Clustered mirror retried requests :: 960 of 453550 (1%)
Feb 9 08:35:26 link-02 kernel: dm-cmirror: Clustered mirror retried requests :: 992 of 453582 (1%)
Feb 9 08:35:30 link-02 ccsd[4570]: Unable to connect to cluster infrastructure after 330 seconds.
Feb 9 08:35:34 link-02 kernel: dm-cmirror: Clustered mirror retried requests :: 1024 of 453614 (1%)
Feb 9 08:35:42 link-02 kernel: dm-cmirror: Clustered mirror retried requests :: 1056 of 453646 (1%)
Feb 9 08:35:50 link-02 kernel: dm-cmirror: Clustered mirror retried requests :: 1088 of 453678 (1%)

dlm: dlm_unlock: lkid 20196 lockspace not found
fs2 recover event 29 finished
gfs2 move flags 1,0,0 ids 29,29,29
gfs2 move flags 0,1,0 ids 29,31,29
gfs2 move use event 31
gfs2 recover event 31
gfs2 add node 3
gfs2 total nodes 3
gfs2 rebuild resource directory
gfs2 rebuilt 5 resources
gfs2 purge requests
gfs2 purged 0 requests
gfs2 mark waiting requests
gfs2 marked 0 requests
gfs2 recover event 31 done
gfs2 move flags 0,0,1 ids 29,31,31
gfs2 process held requests
gfs2 processed 0 requests
gfs2 resend marked requests
gfs2 resent 0 requests
gfs2 recover event 31 finished
gfs2 move flags 1,0,0 ids 31,31,31
gfs2 move flags 0,1,0 ids 31,33,31
gfs2 move use event 33
gfs2 recover event 33
gfs2 add node 4
gfs2 total nodes 4
gfs2 rebuild resource directory
gfs2 rebuilt 6 resources
gfs2 purge requests
gfs2 purged 0 requests
gfs2 mark waiting requests
gfs2 marked 0 requests
gfs2 recover event 33 done
gfs2 move flags 0,0,1 ids 31,33,33
gfs2 process held requests
gfs2 processed 0 requests
gfs2 resend marked requests
gfs2 resent 0 requests
gfs2 recover event 33 finished
5069 pr_start last_stop 0 last_start 12 last_finish 0
5069 pr_start count 1 type 2 event 12 flags 250
5069 claim_jid 0
5069 pr_start 12 done 1
5069 pr_finish flags 5b
5054 recovery_done jid 0 msg 309 b
5054 recovery_done nodeid 1 flg 18
5069 pr_start last_stop 12 last_start 14 last_finish 12
5069 pr_start count 2 type 2 event 14 flags 21b
5054 recovery_done jid 1 msg 309 21b
5054 recovery_done jid 2 msg 309 21b
5054 recovery_done jid 3 msg 309 21b
5054 others_may_mount 21b
5069 pr_start delay done before omm 21b
5069 pr_start 14 done 0
5054 others_may_mount start_done 14 181b
5069 pr_finish flags 181b
5069 pr_start last_stop 14 last_start 16 last_finish 14
5069 pr_start count 3 type 2 event 16 flags 1a1b
5069 pr_start 16 done 1
5068 pr_finish flags 181b
5069 pr_start last_stop 16 last_start 18 last_finish 16
5069 pr_start count 4 type 2 event 18 flags 1a1b
5069 pr_start 18 done 1
5068 pr_finish flags 181b
5114 pr_start last_stop 0 last_start 20 last_finish 0
5114 pr_start count 1 type 2 event 20 flags 250
5114 claim_jid 0
5114 pr_start 20 done 1
5114 pr_finish flags 5b
5109 recovery_done jid 0 msg 309 b
5109 recovery_done nodeid 1 flg 18
5109 recovery_done jid 1 msg 309 b
5109 recovery_done jid 2 msg 309 b
5109 recovery_done jid 3 msg 309 b
5109 others_may_mount b
5113 pr_start last_stop 20 last_start 22 last_finish 20
5113 pr_start count 2 type 2 event 22 flags a1b
5113 pr_start 22 done 1
5113 pr_finish flags 81b
5114 pr_start last_stop 22 last_start 24 last_finish 22
5114 pr_start count 3 type 2 event 24 flags a1b
5114 pr_start 24 done 1
5114 pr_finish flags 81b
5113 pr_start last_stop 24 last_start 26 last_finish 24
5113 pr_start count 4 type 2 event 26 flags a1b
5113 pr_start 26 done 1
5113 pr_finish flags 81b
5136 pr_start last_stop 0 last_start 28 last_finish 0
5136 pr_start count 1 type 2 event 28 flags 250
5136 claim_jid 0
5136 pr_start 28 done 1
5136 pr_finish flags 5b
5131 recovery_done jid 0 msg 309 b
5131 recovery_done nodeid 1 flg 18
5131 recovery_done jid 1 msg 309 b
5131 recovery_done jid 2 msg 309 b
5131 recovery_done jid 3 msg 309 b
5131 others_may_mount b
5135 pr_start last_stop 28 last_start 30 last_finish 28
5135 pr_start count 2 type 2 event 30 flags a1b
5135 pr_start 30 done 1
5135 pr_finish flags 81b
5136 pr_start last_stop 30 last_start 32 last_finish 30
5136 pr_start count 3 type 2 event 32 flags a1b
5136 pr_start 32 done 1
5136 pr_finish flags 81b
5135 pr_start last_stop 32 last_start 34 last_finish 32
5135 pr_start count 4 type 2 event 34 flags a1b
5135 pr_start 34 done 1
5135 pr_finish flags 81b

lock_dlm: Assertion failed on line 361 of file /builddir/build/BUILD/gfs-kernel-2.6.9-67/smp/src/dlm/lock.c
lock_dlm: assertion: "!error || (plock && error == -EINPROGRESS)"
lock_dlm: time = 4306154010
gfs1: error=-22 num=2,1a lkf=10000 flags=84

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:361
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) dm_cmirror(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd hw_random k8_edac edac_mc tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Pid: 5071, comm: gfs_glockd Not tainted 2.6.9-46.ELsmp
RIP: 0010:[<ffffffffa029183d>] <ffffffffa029183d>{:lock_dlm:do_dlm_unlock+203}
RSP: 0000:0000010032ab9dd8 EFLAGS: 00010212
RAX: 0000000000000001 RBX: 000001003beded80 RCX: 0000000000000246
RDX: 00000000001033e8 RSI: 0000000000000246 RDI: ffffffff803e5600
RBP: 00000000ffffffea R08: 00000000000927c0 R09: 000001003beded80
R10: ffffffff80318d00 R11: 0000ffff80400460 R12: 0000000000000000
R13: ffffff00001a2000 R14: ffffffffa028d260 R15: 0000010032b65520
FS: 0000002a95562b00(0000) GS:ffffffff804ed500(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a95628000 CR3: 0000000000101000 CR4: 00000000000006e0
Process gfs_glockd (pid: 5071, threadinfo 0000010032ab8000, task 00000100391d67f0)
Stack: 0000000000000084 000001003beded80 ffffff00001a2000 0000010032b65520
       0000010032b6554c ffffffffa0291bba 0000000000000001 ffffffffa0256fd4
       0000000000000001 ffffffffa024d7f7
Call Trace:<ffffffffa0291bba>{:lock_dlm:lm_dlm_unlock+21}
<ffffffffa0256fd4>{:gfs:gfs_lm_unlock+41} <ffffffffa024d7f7>{:gfs:gfs_glock_drop_th+290}
<ffffffffa024bf49>{:gfs:run_queue+314} <ffffffffa024c19d>{:gfs:unlock_on_glock+37}
<ffffffffa024c293>{:gfs:gfs_reclaim_glock+234} <ffffffffa024075a>{:gfs:gfs_glockd+61}
<ffffffff80134660>{default_wake_function+0} <ffffffff80134660>{default_wake_function+0}
<ffffffff80110f47>{child_rip+8} <ffffffffa024071d>{:gfs:gfs_glockd+0}
<ffffffff80110f3f>{child_rip+0}

Code: 0f 0b 08 62 29 a0 ff ff ff ff 69 01 48 c7 c7 0d 62 29 a0 31
RIP <ffffffffa029183d>{:lock_dlm:do_dlm_unlock+203} RSP <0000010032ab9dd8>
<0>Kernel panic - not syncing: Oops

Version-Release number of selected component (if applicable):
2.6.9-46.ELsmp
dlm-kernel-smp-2.6.9-46.10
dlm-1.0.1-2
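The description above does not give the exact commands used, so the following is only a rough sketch of how a comparable configuration might be assembled. It assumes cman and clvmd are already running on all four nodes; the VG names (vg0, vg1), LV names, cluster name (link_cluster), and mount point are placeholders, and the PV and mirror-log placement arguments are omitted.

# Four 3-legged cluster mirrors (-m 2 = two extra copies), two per clustered VG,
# with one leg of each placed on /dev/hda so a single disk failure hits all four.
lvcreate -m 2 -L 10G -n mirror1 vg0
lvcreate -m 2 -L 10G -n mirror2 vg0
lvcreate -m 2 -L 10G -n mirror3 vg1
lvcreate -m 2 -L 10G -n mirror4 vg1

# GFS on top of each mirror, mounted on every node, with filesystem I/O running.
gfs_mkfs -p lock_dlm -t link_cluster:gfs1 -j 4 /dev/vg0/mirror1
mount -t gfs /dev/vg0/mirror1 /mnt/gfs1
# (repeat for mirror2..mirror4)

# /dev/hda, a leg of every mirror, was then failed; the report does not say how.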
Looks like the node was probably kicked out of the cluster:

dlm: dlm_unlock: lkid 20196 lockspace not found

We have the "emergency shutdown" messages to make it more obvious when that happens, but I don't see any of those here, nor any cman messages about removing the node from the cluster. Were any logged?
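Not part of the original comment, but as a hedged pointer: whether cman ejected the node can be checked with the standard cman tools and by grepping syslog on the surviving nodes (default log path assumed).

# On a surviving node: does the membership view still include link-02?
cman_tool status
cman_tool nodes

# Any record of the node being removed from the cluster?
grep CMAN /var/log/messages | grep -i link-02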
Feb 9 08:58:46 link-04 kernel: CMAN: removing node link-02 from the cluster : Missed too many heartbeats
Feb 9 08:31:36 link-02 kernel: CMAN: Being told to leave the cluster by node 3
Feb 9 08:31:36 link-02 kernel: CMAN: we are leaving the cluster.
Feb 9 08:31:36 link-02 kernel: WARNING: dlm_emergency_shutdown
Feb 9 08:31:36 link-02 kernel: WARNING: dlm_emergency_shutdown
Feb 9 08:31:36 link-02 kernel: SM: 00000005 sm_stop: SG still joined
Feb 9 08:31:36 link-02 kernel: SM: 01000006 sm_stop: SG still joined
Feb 9 08:31:36 link-02 kernel: SM: 02000009 sm_stop: SG still joined
Feb 9 08:31:36 link-02 kernel: dm-cmirror: No address list available for 1
Feb 9 08:31:36 link-02 kernel:
Feb 9 08:31:36 link-02 kernel: dm-cmirror: Failed to convert IP address to nodeid.
Feb 9 08:31:36 link-02 kernel: dm-cmirror: process_log_request:: failed
Feb 9 08:31:36 link-02 kernel: device-mapper: recovery failed on region 10714
Feb 9 08:31:36 link-02 kernel: dm-cmirror: No address list available for 1
Feb 9 08:31:36 link-02 kernel:
Feb 9 08:31:36 link-02 kernel: dm-cmirror: Failed to convert IP address to nodeid.
Feb 9 08:31:36 link-02 kernel: dm-cmirror: process_log_request:: failed
Feb 9 08:31:36 link-02 kernel: dm-cmirror: No address list available for 1
Feb 9 08:31:36 link-02 kernel:
Feb 9 08:31:36 link-02 kernel: dm-cmirror: Failed to convert IP address to nodeid.
Feb 9 08:31:36 link-02 kernel: dm-cmirror: process_log_request:: failed
Reproduced with a slightly different assert, but in the same scenario: nodes get told to leave the cluster as a result of a cmirror device failing and slowing down the machines.

lock_dlm: Assertion failed on line 432 of file /builddir/build/BUILD/gfs-kernel-2.6.9-67/smp/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 4300566187
gfs3: num=2,18 err=-22 cur=0 req=5 lkf=4

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:432
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) qla2300 qla2xxx scsi_transport_fc dm_cmirror(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd hw_random k8_edac edac_mc tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sd_mod scsi_mod
Pid: 6201, comm: lock_dlm1 Tainted: G M 2.6.9-46.ELsmp
RIP: 0010:[<ffffffffa0064a0c>] <ffffffffa0064a0c>{:lock_dlm:do_dlm_lock+366}
RSP: 0000:00000100300adde8 EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000020000
RDX: 00000000001b2865 RSI: 0000000000000246 RDI: ffffffff803e5600
RBP: 000001003aae3680 R08: 00000000fffffffb R09: 00000000ffffffea
R10: 0000000000000000 R11: 0000000000000000 R12: 0000010037e15e00
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS: 0000002a95562b00(0000) GS:ffffffff804ed500(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000059ba08 CR3: 0000000000101000 CR4: 00000000000006e0
Process lock_dlm1 (pid: 6201, threadinfo 00000100300ac000, task 0000010036a35030)
Stack: 0000000000000005 0000000000000004 3220202020202020 2020202020202020
       3831202020202020 ffffffff80130018 000001003aae3680 0000010037e15e00
       0000000000000000 0000000000000000
Call Trace:<ffffffff80130018>{ia32_setup_arg_pages+422} <ffffffffa0064f93>{:lock_dlm:process_submit+43}
<ffffffffa0068ab8>{:lock_dlm:dlm_async+2020} <ffffffff801346b1>{__wake_up_common+67}
<ffffffff80134660>{default_wake_function+0} <ffffffff8014be24>{keventd_create_kthread+0}
<ffffffffa00682d4>{:lock_dlm:dlm_async+0} <ffffffff8014be24>{keventd_create_kthread+0}
<ffffffff8014bdfb>{kthread+200} <ffffffff80110f47>{child_rip+8}
<ffffffff8014be24>{keventd_create_kthread+0} <ffffffff8014bd33>{kthread+0}
<ffffffff80110f3f>{child_rip+0}

Code: 0f 0b 08 92 06 a0 ff ff ff ff b0 01 48 c7 c7 0d 92 06 a0 31
RIP <ffffffffa0064a0c>{:lock_dlm:do_dlm_lock+366} RSP <00000100300adde8>

The console output that followed was two messages interleaved on the console:
<3>device-mapper: Unable to read from primary mirror during recovery
<0>Kernel panic - not syncing: Oops
I propose restricting mirrors to 2 legs for 4.5. That would defuse this bug. Once we agree on that, I'll open an RFE for 4.6 and make this bug dependent on it.
bug 217895 ?
I found a simpler way to reproduce this issue: do I/O on a cmirror from a node other than the log server, then run 'cman_tool leave force' on the log server node.
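Not from the original comment, just a minimal sketch of that reproduction; it assumes the cmirror is already active with GFS mounted at a made-up path (/mnt/gfs1) and that the current cluster-log server node has been identified separately.

# On a node other than the log server: keep I/O going to the cmirror.
dd if=/dev/zero of=/mnt/gfs1/io_load bs=1M count=4096 &

# On the log server node: force it out of the cluster while that I/O runs.
cman_tool leave force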
Previously, when the log server dropped out of the cluster it simply ignored any incoming requests, forcing the clients to retry indefinitely; because the clients never ran another election, operations stalled. With this fix, the departing server replies that it can no longer handle the requests, and that explicit error is what triggers the clients to properly initiate a new election.
assigned -> needinfo
needinfo -> modified

This should be in MODIFIED.
Fix verified in latest packages.
Fixed in current release (4.7).