Bug 1080152
Summary: | dlm fences no matter what current quorum status is | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Jaroslav Kortus <jkortus> |
Component: | dlm | Assignee: | David Teigland <teigland> |
Status: | CLOSED DUPLICATE | QA Contact: | Cluster QE <mspqa-list> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 7.0 | CC: | cluster-maint |
Target Milestone: | rc | Keywords: | Triaged |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Last Closed: | 2014-03-25 21:53:59 UTC | Type: | Bug |
Description
Jaroslav Kortus
2014-03-24 18:56:43 UTC
I tried this myself on my 8-node west cluster. I did a 'halt -fin' on 6/8 nodes. Here are the logs from one of the survivors, west-07.

Mar 24 14:52:42 west-07 corosync[941]: [TOTEM ] A processor failed, forming new configuration.
Mar 24 14:52:44 west-07 corosync[941]: [TOTEM ] A new membership (10.16.34.107:20100) was formed. Members left: 1 2 3 4 5 6
Mar 24 14:52:44 west-07 crmd[1151]: notice: peer_update_callback: Our peer on the DC is dead
Mar 24 14:52:44 west-07 crmd[1151]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
Mar 24 14:52:44 west-07 corosync[941]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Mar 24 14:52:44 west-07 corosync[941]: [QUORUM] Members[2]: 7 8
Mar 24 14:52:44 west-07 crmd[1151]: notice: pcmk_quorum_notification: Membership 20100: quorum lost (2)
Mar 24 14:52:44 west-07 kernel: [15645.760457] dlm: closing connection to node 1
Mar 24 14:52:44 west-07 crmd[1151]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-03[3] - state is now lost (was member)
Mar 24 14:52:44 west-07 crmd[1151]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-01[1] - state is now lost (was member)
Mar 24 14:52:44 west-07 kernel: [15645.765059] dlm: closing connection to node 2
Mar 24 14:52:44 west-07 crmd[1151]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-04[4] - state is now lost (was member)
Mar 24 14:52:44 west-07 crmd[1151]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-05[5] - state is now lost (was member)
Mar 24 14:52:44 west-07 crmd[1151]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-06[6] - state is now lost (was member)
Mar 24 14:52:44 west-07 crmd[1151]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-02[2] - state is now lost (was member)
Mar 24 14:52:44 west-07 pacemakerd[1076]: notice: pcmk_quorum_notification: Membership 20100: quorum lost (2)
Mar 24 14:52:44 west-07 pacemakerd[1076]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-02[2] - state is now lost (was member)
Mar 24 14:52:44 west-07 pacemakerd[1076]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-05[5] - state is now lost (was member)
Mar 24 14:52:44 west-07 corosync[941]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 24 14:52:44 west-07 pacemakerd[1076]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-01[1] - state is now lost (was member)
Mar 24 14:52:44 west-07 pacemakerd[1076]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-06[6] - state is now lost (was member)
Mar 24 14:52:44 west-07 kernel: [15645.769589] dlm: closing connection to node 3
Mar 24 14:52:44 west-07 pacemakerd[1076]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-03[3] - state is now lost (was member)
Mar 24 14:52:44 west-07 pacemakerd[1076]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-04[4] - state is now lost (was member)
Mar 24 14:52:44 west-07 kernel: dlm: closing connection to node 1
Mar 24 14:52:44 west-07 kernel: dlm: closing connection to node 2
Mar 24 14:52:44 west-07 kernel: dlm: closing connection to node 3
Mar 24 14:52:44 west-07 kernel: dlm: closing connection to node 4
Mar 24 14:52:44 west-07 kernel: [15645.774046] dlm: closing connection to node 4
Mar 24 14:52:44 west-07 kernel: [15645.778590] dlm: closing connection to node 5
Mar 24 14:52:44 west-07 kernel: dlm: closing connection to node 5
Mar 24 14:52:44 west-07 kernel: [15645.783126] dlm: closing connection to node 6
Mar 24 14:52:44 west-07 kernel: dlm: closing connection to node 6
Mar 24 14:52:44 west-07 dlm_controld[2331]: 15645 fence request 1 pid 7858 nodedown time 1395687164 fence_all dlm_stonith
Mar 24 14:52:44 west-07 dlm_stonith: stonith_api_time: Found 1 entries for 1/(null): 0 in progress, 1 completed
Mar 24 14:52:44 west-07 dlm_stonith: stonith_api_time: Node 1/(null) last kicked at: 1395672162
Mar 24 14:52:44 west-07 stonith-ng[1146]: notice: handle_request: Client stonith-api.7858.7e212830 wants to fence (reboot) '1' with device '(any)'
Mar 24 14:52:44 west-07 stonith-ng[1146]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for west-01: 6ec6a0f4-f821-4d5a-b285-c0dc8df969e1 (0)
Mar 24 14:52:44 west-07 crmd[1151]: notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Mar 24 14:52:44 west-07 stonith-ng[1146]: notice: can_fence_host_with_device: west-apc can fence west-01 (aka. '2'): static-list
Mar 24 14:52:44 west-07 crmd[1151]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Mar 24 14:52:44 west-07 attrd[1149]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Mar 24 14:52:44 west-07 attrd[1149]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Mar 24 14:52:45 west-07 stonith-ng[1146]: notice: remote_op_done: Operation reboot of west-01 by west-08 for stonith-api.7858: OK
Mar 24 14:52:45 west-07 stonith-api[7858]: stonith_api_kick: Node 1/(null) kicked: reboot
Mar 24 14:52:45 west-07 crmd[1151]: notice: tengine_stonith_notify: Peer west-01 was terminated (reboot) by west-08 for west-07: OK (ref=6ec6a0f4-f821-4d5a-b285-c0dc8df969e1) by client stonith-api.7858
Mar 24 14:52:45 west-07 stonith-api[7858]: stonith_api_time: Found 2 entries for 1/(null): 0 in progress, 2 completed
Mar 24 14:52:45 west-07 stonith-api[7858]: stonith_api_time: Node 1/(null) last kicked at: 1395687165
Mar 24 14:52:46 west-07 dlm_controld[2331]: 15648 fence result 1 pid 7858 result 0 exit status
Mar 24 14:52:46 west-07 dlm_controld[2331]: 15648 fence status 1 receive 0 from 7 walltime 1395687166 local 15648
Mar 24 14:52:46 west-07 dlm_controld[2331]: 15648 fence request 2 pid 7909 nodedown time 1395687164 fence_all dlm_stonith
Mar 24 14:52:46 west-07 dlm_stonith: stonith_api_time: Found 1 entries for 2/(null): 0 in progress, 1 completed
Mar 24 14:52:46 west-07 dlm_stonith: stonith_api_time: Node 2/(null) last kicked at: 1395672164
Mar 24 14:52:46 west-07 stonith-ng[1146]: notice: handle_request: Client stonith-api.7909.3a510e6d wants to fence (reboot) '2' with device '(any)'
Mar 24 14:52:46 west-07 stonith-ng[1146]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for west-02: ac8d4fd7-9590-4f95-82e3-c12d547b91db (0)
Mar 24 14:52:46 west-07 stonith-ng[1146]: notice: can_fence_host_with_device: west-apc can fence west-02 (aka. '3'): static-list
Mar 24 14:52:48 west-07 stonith-ng[1146]: notice: remote_op_done: Operation reboot of west-02 by west-08 for stonith-api.7909: OK
Mar 24 14:52:48 west-07 stonith-api[7909]: stonith_api_kick: Node 2/(null) kicked: reboot
Mar 24 14:52:48 west-07 crmd[1151]: notice: tengine_stonith_notify: Peer west-02 was terminated (reboot) by west-08 for west-07: OK (ref=ac8d4fd7-9590-4f95-82e3-c12d547b91db) by client stonith-api.7909
Mar 24 14:52:48 west-07 stonith-api[7909]: stonith_api_time: Found 2 entries for 2/(null): 0 in progress, 2 completed
Mar 24 14:52:48 west-07 stonith-api[7909]: stonith_api_time: Node 2/(null) last kicked at: 1395687168
Mar 24 14:52:49 west-07 dlm_controld[2331]: 15651 fence result 2 pid 7909 result 0 exit status
Mar 24 14:52:49 west-07 dlm_controld[2331]: 15651 fence status 2 receive 0 from 7 walltime 1395687169 local 15651
Mar 24 14:52:49 west-07 dlm_controld[2331]: 15651 fence request 3 pid 7957 nodedown time 1395687164 fence_all dlm_stonith
Mar 24 14:52:49 west-07 dlm_stonith: stonith_api_time: Found 0 entries for 3/(null): 0 in progress, 0 completed
Mar 24 14:52:49 west-07 stonith-ng[1146]: notice: handle_request: Client stonith-api.7957.8d7e5018 wants to fence (reboot) '3' with device '(any)'
Mar 24 14:52:49 west-07 stonith-ng[1146]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for west-03: 9444d63f-9a06-449e-9bb1-d9b8b30ddaec (0)
Mar 24 14:52:49 west-07 stonith-ng[1146]: notice: can_fence_host_with_device: west-apc can fence west-03 (aka. '4'): static-list
Mar 24 14:52:50 west-07 stonith-ng[1146]: notice: remote_op_done: Operation reboot of west-03 by west-08 for stonith-api.7957: OK
Mar 24 14:52:50 west-07 stonith-api[7957]: stonith_api_kick: Node 3/(null) kicked: reboot
Mar 24 14:52:50 west-07 crmd[1151]: notice: tengine_stonith_notify: Peer west-03 was terminated (reboot) by west-08 for west-07: OK (ref=9444d63f-9a06-449e-9bb1-d9b8b30ddaec) by client stonith-api.7957
Mar 24 14:52:50 west-07 stonith-api[7957]: stonith_api_time: Found 1 entries for 3/(null): 0 in progress, 1 completed
Mar 24 14:52:50 west-07 stonith-api[7957]: stonith_api_time: Node 3/(null) last kicked at: 1395687170
Mar 24 14:52:51 west-07 dlm_controld[2331]: 15653 fence result 3 pid 7957 result 0 exit status
Mar 24 14:52:51 west-07 dlm_controld[2331]: 15653 fence status 3 receive 0 from 7 walltime 1395687171 local 15653
Mar 24 14:52:51 west-07 dlm_controld[2331]: 15653 fence request 4 pid 7982 nodedown time 1395687164 fence_all dlm_stonith
Mar 24 14:52:51 west-07 dlm_stonith: stonith_api_time: Found 0 entries for 4/(null): 0 in progress, 0 completed
Mar 24 14:52:51 west-07 stonith-ng[1146]: notice: handle_request: Client stonith-api.7982.6513fbdb wants to fence (reboot) '4' with device '(any)'
Mar 24 14:52:51 west-07 stonith-ng[1146]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for west-04: dd8775d1-ae19-4e9f-b70f-73c6ce30e7af (0)
Mar 24 14:52:51 west-07 stonith-ng[1146]: notice: can_fence_host_with_device: west-apc can fence west-04 (aka. '5'): static-list
Mar 24 14:52:51 west-07 stonith-ng[1146]: notice: can_fence_host_with_device: west-apc can fence west-04 (aka. '5'): static-list
Mar 24 14:52:52 west-07 fence_apc_snmp: Parse error: Ignoring unknown option 'nodename=west-04
Mar 24 14:52:53 west-07 stonith-ng[1146]: notice: log_operation: Operation 'reboot' [7983] (call 2 from stonith-api.7982) for host 'west-04' with device 'west-apc' returned: 0 (OK)
Mar 24 14:52:53 west-07 stonith-ng[1146]: notice: remote_op_done: Operation reboot of west-04 by west-07 for stonith-api.7982: OK
Mar 24 14:52:53 west-07 stonith-api[7982]: stonith_api_kick: Node 4/(null) kicked: reboot
Mar 24 14:52:53 west-07 crmd[1151]: notice: tengine_stonith_notify: Peer west-04 was terminated (reboot) by west-07 for west-07: OK (ref=dd8775d1-ae19-4e9f-b70f-73c6ce30e7af) by client stonith-api.7982
Mar 24 14:52:53 west-07 stonith-api[7982]: stonith_api_time: Found 1 entries for 4/(null): 0 in progress, 1 completed
Mar 24 14:52:53 west-07 stonith-api[7982]: stonith_api_time: Node 4/(null) last kicked at: 1395687173
Mar 24 14:52:54 west-07 dlm_controld[2331]: 15656 fence result 4 pid 7982 result 0 exit status
Mar 24 14:52:54 west-07 dlm_controld[2331]: 15656 fence status 4 receive 0 from 7 walltime 1395687174 local 15656
Mar 24 14:52:54 west-07 dlm_controld[2331]: 15656 fence request 5 pid 8035 nodedown time 1395687164 fence_all dlm_stonith
Mar 24 14:52:54 west-07 dlm_stonith: stonith_api_time: Found 1 entries for 5/(null): 0 in progress, 1 completed
Mar 24 14:52:54 west-07 dlm_stonith: stonith_api_time: Node 5/(null) last kicked at: 1395671769
Mar 24 14:52:54 west-07 stonith-ng[1146]: notice: handle_request: Client stonith-api.8035.c921c830 wants to fence (reboot) '5' with device '(any)'
Mar 24 14:52:54 west-07 stonith-ng[1146]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for west-05: 281d52d1-31cc-45f2-a6f1-108555721f93 (0)
Mar 24 14:52:54 west-07 stonith-ng[1146]: notice: can_fence_host_with_device: west-apc can fence west-05 (aka. '6'): static-list
Mar 24 14:52:56 west-07 stonith-ng[1146]: notice: remote_op_done: Operation reboot of west-05 by west-08 for stonith-api.8035: OK
Mar 24 14:52:56 west-07 stonith-api[8035]: stonith_api_kick: Node 5/(null) kicked: reboot
Mar 24 14:52:56 west-07 crmd[1151]: notice: tengine_stonith_notify: Peer west-05 was terminated (reboot) by west-08 for west-07: OK (ref=281d52d1-31cc-45f2-a6f1-108555721f93) by client stonith-api.8035
Mar 24 14:52:56 west-07 stonith-api[8035]: stonith_api_time: Found 2 entries for 5/(null): 0 in progress, 2 completed
Mar 24 14:52:56 west-07 stonith-api[8035]: stonith_api_time: Node 5/(null) last kicked at: 1395687176
Mar 24 14:52:57 west-07 dlm_controld[2331]: 15659 fence result 5 pid 8035 result 0 exit status
Mar 24 14:52:57 west-07 dlm_controld[2331]: 15659 fence status 5 receive 0 from 7 walltime 1395687177 local 15659
Mar 24 14:52:57 west-07 dlm_controld[2331]: 15659 fence request 6 pid 8077 nodedown time 1395687164 fence_all dlm_stonith
Mar 24 14:52:57 west-07 dlm_stonith: stonith_api_time: Found 1 entries for 6/(null): 0 in progress, 1 completed
Mar 24 14:52:57 west-07 dlm_stonith: stonith_api_time: Node 6/(null) last kicked at: 1395672019
Mar 24 14:52:57 west-07 stonith-ng[1146]: notice: handle_request: Client stonith-api.8077.bf409d3a wants to fence (reboot) '6' with device '(any)'
Mar 24 14:52:57 west-07 stonith-ng[1146]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for west-06: 0b3e4da0-518a-45c4-b1b6-418c2a925065 (0)
Mar 24 14:52:57 west-07 stonith-ng[1146]: notice: can_fence_host_with_device: west-apc can fence west-06 (aka. '7'): static-list
Mar 24 14:52:59 west-07 stonith-ng[1146]: notice: remote_op_done: Operation reboot of west-06 by west-08 for stonith-api.8077: OK
Mar 24 14:52:59 west-07 crmd[1151]: notice: tengine_stonith_notify: Peer west-06 was terminated (reboot) by west-08 for west-07: OK (ref=0b3e4da0-518a-45c4-b1b6-418c2a925065) by client stonith-api.8077
Mar 24 14:52:59 west-07 stonith-api[8077]: stonith_api_kick: Node 6/(null) kicked: reboot
Mar 24 14:52:59 west-07 stonith-api[8077]: stonith_api_time: Found 2 entries for 6/(null): 0 in progress, 2 completed
Mar 24 14:52:59 west-07 stonith-api[8077]: stonith_api_time: Node 6/(null) last kicked at: 1395687179
Mar 24 14:53:00 west-07 dlm_controld[2331]: 15661 fence result 6 pid 8077 result 0 exit status
Mar 24 14:53:00 west-07 dlm_controld[2331]: 15661 fence status 6 receive 0 from 7 walltime 1395687180 local 15661
Mar 24 14:53:31 west-07 kernel: [15693.536036] rport-6:0-18: blocked FC remote port time out: removing rport
Mar 24 14:53:31 west-07 kernel: rport-6:0-18: blocked FC remote port time out: removing rport
Mar 24 14:53:34 west-07 kernel: [15696.096033] rport-6:0-19: blocked FC remote port time out: removing rport
Mar 24 14:53:34 west-07 kernel: rport-6:0-19: blocked FC remote port time out: removing rport
Mar 24 14:53:37 west-07 kernel: [15698.656037] rport-6:0-2: blocked FC remote port time out: removing rport
Mar 24 14:53:37 west-07 kernel: rport-6:0-2: blocked FC remote port time out: removing rport
Mar 24 14:53:37 west-07 dlm_controld[2331]: 15698 west2 wait for quorum
Mar 24 14:53:37 west-07 dlm_controld[2331]: 15698 west1 wait for quorum
Mar 24 14:53:37 west-07 dlm_controld[2331]: 15698 west0 wait for quorum
Mar 24 14:53:37 west-07 dlm_controld[2331]: 15698 clvmd wait for quorum
Mar 24 14:53:39 west-07 kernel: [15701.600042] rport-6:0-3: blocked FC remote port time out: removing rport
Mar 24 14:53:39 west-07 kernel: rport-6:0-3: blocked FC remote port time out: removing rport
Mar 24 14:53:42 west-07 kernel: [15704.032042] rport-6:0-16: blocked FC remote port time out: removing rport
Mar 24 14:53:42 west-07 kernel: rport-6:0-16: blocked FC remote port time out: removing rport

This is caused by the "-q 0" option used in the controld resource agent, which was changed in bug 1064519.

*** This bug has been marked as a duplicate of bug 1064519 ***
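
The symptom in the log, dlm_controld issuing fence requests after pacemaker has already reported "quorum lost", can be checked mechanically. The following is a minimal Python sketch, not part of the original report. The function name `fence_requests_without_quorum` and the `SAMPLE` list are hypothetical, the sample lines are copied from the west-07 log, and the "quorum acquired" wording for regaining quorum is an assumption that does not appear in this log.

```python
import re

# Matches dlm_controld lines such as:
#   dlm_controld[2331]: 15645 fence request 1 pid 7858 nodedown ...
FENCE_REQUEST = re.compile(r"dlm_controld\[\d+\]: \d+ fence request (\d+) ")


def fence_requests_without_quorum(lines):
    """Return the node IDs dlm_controld asked to fence while quorum was lost."""
    quorate = True  # assume the log starts while the cluster is quorate
    fenced = []
    for line in lines:
        if "quorum lost" in line:
            quorate = False
        elif "quorum acquired" in line:  # assumed wording, not seen in this log
            quorate = True
        else:
            match = FENCE_REQUEST.search(line)
            if match and not quorate:
                fenced.append(int(match.group(1)))
    return fenced


# A few lines excerpted from the west-07 log.
SAMPLE = [
    "Mar 24 14:52:44 west-07 crmd[1151]: notice: pcmk_quorum_notification: Membership 20100: quorum lost (2)",
    "Mar 24 14:52:44 west-07 dlm_controld[2331]: 15645 fence request 1 pid 7858 nodedown time 1395687164 fence_all dlm_stonith",
    "Mar 24 14:52:46 west-07 dlm_controld[2331]: 15648 fence result 1 pid 7858 result 0 exit status",
    "Mar 24 14:52:46 west-07 dlm_controld[2331]: 15648 fence request 2 pid 7909 nodedown time 1395687164 fence_all dlm_stonith",
    "Mar 24 14:53:37 west-07 dlm_controld[2331]: 15698 clvmd wait for quorum",
]

if __name__ == "__main__":
    print(fence_requests_without_quorum(SAMPLE))
```

Any non-empty result means fencing was requested from an inquorate partition, which is the behaviour this bug describes.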