Bug 1080152 - dlm fences no matter what current quorum status is
Summary: dlm fences no matter what current quorum status is
Keywords:
Status: CLOSED DUPLICATE of bug 1064519
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: dlm
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-03-24 18:56 UTC by Jaroslav Kortus
Modified: 2023-03-08 07:26 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-03-25 21:53:59 UTC
Target Upstream Version:
Embargoed:



Description Jaroslav Kortus 2014-03-24 18:56:43 UTC
Description of problem:
dlm can trigger a fence operation even though it is currently in the minority (non-quorate) partition of the cluster.

Version-Release number of selected component (if applicable):
dlm-4.0.2-3.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. setup a pacemaker cluster with working fencing
2. pcs property set no-quorum-policy=freeze
3. pcs resource create dlm controld op monitor interval=30s on-fail=fence clone interleave=true ordered=true
4. run 'halt -fin' on the majority of nodes (the sketch below shows how to confirm the quorum state at this point)
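
A minimal way to verify the partition state after step 4 (a sketch assuming the stock corosync-quorumtool and dlm_tool CLIs, run on one of the surviving minority nodes):

corosync-quorumtool -s           # expect "Quorate: No" once the majority is down
dlm_tool dump | grep -i fence    # dlm_controld fence requests appear regardless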

Actual results:
dlm takes care of the fencing itself, fencing the majority of the nodes from the minority partition and clearly ignoring pacemaker's non-quorate status.

Expected results:
dlm respects pacemaker's view of quorum and does not fence nodes unless it is in the quorate partition

Additional info:

Comment 2 Nate Straz 2014-03-24 18:59:15 UTC
I tried this myself on my 8-node west cluster.  I did a 'halt -fin' on 6/8 nodes.  Here are the logs from one of the survivors, west-07.

Mar 24 14:52:42 west-07 corosync[941]: [TOTEM ] A processor failed, forming new configuration.
Mar 24 14:52:44 west-07 corosync[941]: [TOTEM ] A new membership (10.16.34.107:20100) was formed. Members left: 1 2 3 4 5 6
Mar 24 14:52:44 west-07 crmd[1151]: notice: peer_update_callback: Our peer on the DC is dead
Mar 24 14:52:44 west-07 crmd[1151]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
Mar 24 14:52:44 west-07 corosync[941]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Mar 24 14:52:44 west-07 corosync[941]: [QUORUM] Members[2]: 7 8
Mar 24 14:52:44 west-07 crmd[1151]: notice: pcmk_quorum_notification: Membership 20100: quorum lost (2)
Mar 24 14:52:44 west-07 kernel: [15645.760457] dlm: closing connection to node 1
Mar 24 14:52:44 west-07 crmd[1151]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-03[3] - state is now lost (was member)
Mar 24 14:52:44 west-07 crmd[1151]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-01[1] - state is now lost (was member)
Mar 24 14:52:44 west-07 kernel: [15645.765059] dlm: closing connection to node 2
Mar 24 14:52:44 west-07 crmd[1151]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-04[4] - state is now lost (was member)
Mar 24 14:52:44 west-07 crmd[1151]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-05[5] - state is now lost (was member)
Mar 24 14:52:44 west-07 crmd[1151]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-06[6] - state is now lost (was member)
Mar 24 14:52:44 west-07 crmd[1151]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-02[2] - state is now lost (was member)
Mar 24 14:52:44 west-07 pacemakerd[1076]: notice: pcmk_quorum_notification: Membership 20100: quorum lost (2)
Mar 24 14:52:44 west-07 pacemakerd[1076]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-02[2] - state is now lost (was member)
Mar 24 14:52:44 west-07 pacemakerd[1076]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-05[5] - state is now lost (was member)
Mar 24 14:52:44 west-07 corosync[941]: [MAIN  ] Completed service synchronization, ready to provide service.
Mar 24 14:52:44 west-07 pacemakerd[1076]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-01[1] - state is now lost (was member)
Mar 24 14:52:44 west-07 pacemakerd[1076]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-06[6] - state is now lost (was member)
Mar 24 14:52:44 west-07 kernel: [15645.769589] dlm: closing connection to node 3
Mar 24 14:52:44 west-07 pacemakerd[1076]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-03[3] - state is now lost (was member)
Mar 24 14:52:44 west-07 pacemakerd[1076]: notice: crm_update_peer_state: pcmk_quorum_notification: Node west-04[4] - state is now lost (was member)
Mar 24 14:52:44 west-07 kernel: dlm: closing connection to node 1
Mar 24 14:52:44 west-07 kernel: dlm: closing connection to node 2
Mar 24 14:52:44 west-07 kernel: dlm: closing connection to node 3
Mar 24 14:52:44 west-07 kernel: dlm: closing connection to node 4
Mar 24 14:52:44 west-07 kernel: [15645.774046] dlm: closing connection to node 4
Mar 24 14:52:44 west-07 kernel: [15645.778590] dlm: closing connection to node 5
Mar 24 14:52:44 west-07 kernel: dlm: closing connection to node 5
Mar 24 14:52:44 west-07 kernel: [15645.783126] dlm: closing connection to node 6
Mar 24 14:52:44 west-07 kernel: dlm: closing connection to node 6
Mar 24 14:52:44 west-07 dlm_controld[2331]: 15645 fence request 1 pid 7858 nodedown time 1395687164 fence_all dlm_stonith
Mar 24 14:52:44 west-07 dlm_stonith: stonith_api_time: Found 1 entries for 1/(null): 0 in progress, 1 completed
Mar 24 14:52:44 west-07 dlm_stonith: stonith_api_time: Node 1/(null) last kicked at: 1395672162
Mar 24 14:52:44 west-07 stonith-ng[1146]: notice: handle_request: Client stonith-api.7858.7e212830 wants to fence (reboot) '1' with device '(any)'
Mar 24 14:52:44 west-07 stonith-ng[1146]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for west-01: 6ec6a0f4-f821-4d5a-b285-c0dc8df969e1 (0)
Mar 24 14:52:44 west-07 crmd[1151]: notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Mar 24 14:52:44 west-07 stonith-ng[1146]: notice: can_fence_host_with_device: west-apc can fence west-01 (aka. '2'): static-list
Mar 24 14:52:44 west-07 crmd[1151]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Mar 24 14:52:44 west-07 attrd[1149]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Mar 24 14:52:44 west-07 attrd[1149]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Mar 24 14:52:45 west-07 stonith-ng[1146]: notice: remote_op_done: Operation reboot of west-01 by west-08 for stonith-api.7858: OK
Mar 24 14:52:45 west-07 stonith-api[7858]: stonith_api_kick: Node 1/(null) kicked: reboot
Mar 24 14:52:45 west-07 crmd[1151]: notice: tengine_stonith_notify: Peer west-01 was terminated (reboot) by west-08 for west-07: OK (ref=6ec6a0f4-f821-4d5a-b285-c0dc8df969e1) by client stonith-api.7858
Mar 24 14:52:45 west-07 stonith-api[7858]: stonith_api_time: Found 2 entries for 1/(null): 0 in progress, 2 completed
Mar 24 14:52:45 west-07 stonith-api[7858]: stonith_api_time: Node 1/(null) last kicked at: 1395687165
Mar 24 14:52:46 west-07 dlm_controld[2331]: 15648 fence result 1 pid 7858 result 0 exit status
Mar 24 14:52:46 west-07 dlm_controld[2331]: 15648 fence status 1 receive 0 from 7 walltime 1395687166 local 15648
Mar 24 14:52:46 west-07 dlm_controld[2331]: 15648 fence request 2 pid 7909 nodedown time 1395687164 fence_all dlm_stonith
Mar 24 14:52:46 west-07 dlm_stonith: stonith_api_time: Found 1 entries for 2/(null): 0 in progress, 1 completed
Mar 24 14:52:46 west-07 dlm_stonith: stonith_api_time: Node 2/(null) last kicked at: 1395672164
Mar 24 14:52:46 west-07 stonith-ng[1146]: notice: handle_request: Client stonith-api.7909.3a510e6d wants to fence (reboot) '2' with device '(any)'
Mar 24 14:52:46 west-07 stonith-ng[1146]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for west-02: ac8d4fd7-9590-4f95-82e3-c12d547b91db (0)
Mar 24 14:52:46 west-07 stonith-ng[1146]: notice: can_fence_host_with_device: west-apc can fence west-02 (aka. '3'): static-list
Mar 24 14:52:48 west-07 stonith-ng[1146]: notice: remote_op_done: Operation reboot of west-02 by west-08 for stonith-api.7909: OK
Mar 24 14:52:48 west-07 stonith-api[7909]: stonith_api_kick: Node 2/(null) kicked: reboot
Mar 24 14:52:48 west-07 crmd[1151]: notice: tengine_stonith_notify: Peer west-02 was terminated (reboot) by west-08 for west-07: OK (ref=ac8d4fd7-9590-4f95-82e3-c12d547b91db) by client stonith-api.7909
Mar 24 14:52:48 west-07 stonith-api[7909]: stonith_api_time: Found 2 entries for 2/(null): 0 in progress, 2 completed
Mar 24 14:52:48 west-07 stonith-api[7909]: stonith_api_time: Node 2/(null) last kicked at: 1395687168
Mar 24 14:52:49 west-07 dlm_controld[2331]: 15651 fence result 2 pid 7909 result 0 exit status
Mar 24 14:52:49 west-07 dlm_controld[2331]: 15651 fence status 2 receive 0 from 7 walltime 1395687169 local 15651
Mar 24 14:52:49 west-07 dlm_controld[2331]: 15651 fence request 3 pid 7957 nodedown time 1395687164 fence_all dlm_stonith
Mar 24 14:52:49 west-07 dlm_stonith: stonith_api_time: Found 0 entries for 3/(null): 0 in progress, 0 completed
Mar 24 14:52:49 west-07 stonith-ng[1146]: notice: handle_request: Client stonith-api.7957.8d7e5018 wants to fence (reboot) '3' with device '(any)'
Mar 24 14:52:49 west-07 stonith-ng[1146]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for west-03: 9444d63f-9a06-449e-9bb1-d9b8b30ddaec (0)
Mar 24 14:52:49 west-07 stonith-ng[1146]: notice: can_fence_host_with_device: west-apc can fence west-03 (aka. '4'): static-list
Mar 24 14:52:50 west-07 stonith-ng[1146]: notice: remote_op_done: Operation reboot of west-03 by west-08 for stonith-api.7957: OK
Mar 24 14:52:50 west-07 stonith-api[7957]: stonith_api_kick: Node 3/(null) kicked: reboot
Mar 24 14:52:50 west-07 crmd[1151]: notice: tengine_stonith_notify: Peer west-03 was terminated (reboot) by west-08 for west-07: OK (ref=9444d63f-9a06-449e-9bb1-d9b8b30ddaec) by client stonith-api.7957
Mar 24 14:52:50 west-07 stonith-api[7957]: stonith_api_time: Found 1 entries for 3/(null): 0 in progress, 1 completed
Mar 24 14:52:50 west-07 stonith-api[7957]: stonith_api_time: Node 3/(null) last kicked at: 1395687170
Mar 24 14:52:51 west-07 dlm_controld[2331]: 15653 fence result 3 pid 7957 result 0 exit status
Mar 24 14:52:51 west-07 dlm_controld[2331]: 15653 fence status 3 receive 0 from 7 walltime 1395687171 local 15653
Mar 24 14:52:51 west-07 dlm_controld[2331]: 15653 fence request 4 pid 7982 nodedown time 1395687164 fence_all dlm_stonith
Mar 24 14:52:51 west-07 dlm_stonith: stonith_api_time: Found 0 entries for 4/(null): 0 in progress, 0 completed
Mar 24 14:52:51 west-07 stonith-ng[1146]: notice: handle_request: Client stonith-api.7982.6513fbdb wants to fence (reboot) '4' with device '(any)'
Mar 24 14:52:51 west-07 stonith-ng[1146]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for west-04: dd8775d1-ae19-4e9f-b70f-73c6ce30e7af (0)
Mar 24 14:52:51 west-07 stonith-ng[1146]: notice: can_fence_host_with_device: west-apc can fence west-04 (aka. '5'): static-list
Mar 24 14:52:51 west-07 stonith-ng[1146]: notice: can_fence_host_with_device: west-apc can fence west-04 (aka. '5'): static-list
Mar 24 14:52:52 west-07 fence_apc_snmp: Parse error: Ignoring unknown option 'nodename=west-04
Mar 24 14:52:53 west-07 stonith-ng[1146]: notice: log_operation: Operation 'reboot' [7983] (call 2 from stonith-api.7982) for host 'west-04' with device 'west-apc' returned: 0 (OK)
Mar 24 14:52:53 west-07 stonith-ng[1146]: notice: remote_op_done: Operation reboot of west-04 by west-07 for stonith-api.7982: OK
Mar 24 14:52:53 west-07 stonith-api[7982]: stonith_api_kick: Node 4/(null) kicked: reboot
Mar 24 14:52:53 west-07 crmd[1151]: notice: tengine_stonith_notify: Peer west-04 was terminated (reboot) by west-07 for west-07: OK (ref=dd8775d1-ae19-4e9f-b70f-73c6ce30e7af) by client stonith-api.7982
Mar 24 14:52:53 west-07 stonith-api[7982]: stonith_api_time: Found 1 entries for 4/(null): 0 in progress, 1 completed
Mar 24 14:52:53 west-07 stonith-api[7982]: stonith_api_time: Node 4/(null) last kicked at: 1395687173
Mar 24 14:52:54 west-07 dlm_controld[2331]: 15656 fence result 4 pid 7982 result 0 exit status
Mar 24 14:52:54 west-07 dlm_controld[2331]: 15656 fence status 4 receive 0 from 7 walltime 1395687174 local 15656
Mar 24 14:52:54 west-07 dlm_controld[2331]: 15656 fence request 5 pid 8035 nodedown time 1395687164 fence_all dlm_stonith
Mar 24 14:52:54 west-07 dlm_stonith: stonith_api_time: Found 1 entries for 5/(null): 0 in progress, 1 completed
Mar 24 14:52:54 west-07 dlm_stonith: stonith_api_time: Node 5/(null) last kicked at: 1395671769
Mar 24 14:52:54 west-07 stonith-ng[1146]: notice: handle_request: Client stonith-api.8035.c921c830 wants to fence (reboot) '5' with device '(any)'
Mar 24 14:52:54 west-07 stonith-ng[1146]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for west-05: 281d52d1-31cc-45f2-a6f1-108555721f93 (0)
Mar 24 14:52:54 west-07 stonith-ng[1146]: notice: can_fence_host_with_device: west-apc can fence west-05 (aka. '6'): static-list
Mar 24 14:52:56 west-07 stonith-ng[1146]: notice: remote_op_done: Operation reboot of west-05 by west-08 for stonith-api.8035: OK
Mar 24 14:52:56 west-07 stonith-api[8035]: stonith_api_kick: Node 5/(null) kicked: reboot
Mar 24 14:52:56 west-07 crmd[1151]: notice: tengine_stonith_notify: Peer west-05 was terminated (reboot) by west-08 for west-07: OK (ref=281d52d1-31cc-45f2-a6f1-108555721f93) by client stonith-api.8035
Mar 24 14:52:56 west-07 stonith-api[8035]: stonith_api_time: Found 2 entries for 5/(null): 0 in progress, 2 completed
Mar 24 14:52:56 west-07 stonith-api[8035]: stonith_api_time: Node 5/(null) last kicked at: 1395687176
Mar 24 14:52:57 west-07 dlm_controld[2331]: 15659 fence result 5 pid 8035 result 0 exit status
Mar 24 14:52:57 west-07 dlm_controld[2331]: 15659 fence status 5 receive 0 from 7 walltime 1395687177 local 15659
Mar 24 14:52:57 west-07 dlm_controld[2331]: 15659 fence request 6 pid 8077 nodedown time 1395687164 fence_all dlm_stonith
Mar 24 14:52:57 west-07 dlm_stonith: stonith_api_time: Found 1 entries for 6/(null): 0 in progress, 1 completed
Mar 24 14:52:57 west-07 dlm_stonith: stonith_api_time: Node 6/(null) last kicked at: 1395672019
Mar 24 14:52:57 west-07 stonith-ng[1146]: notice: handle_request: Client stonith-api.8077.bf409d3a wants to fence (reboot) '6' with device '(any)'
Mar 24 14:52:57 west-07 stonith-ng[1146]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for west-06: 0b3e4da0-518a-45c4-b1b6-418c2a925065 (0)
Mar 24 14:52:57 west-07 stonith-ng[1146]: notice: can_fence_host_with_device: west-apc can fence west-06 (aka. '7'): static-list
Mar 24 14:52:59 west-07 stonith-ng[1146]: notice: remote_op_done: Operation reboot of west-06 by west-08 for stonith-api.8077: OK
Mar 24 14:52:59 west-07 crmd[1151]: notice: tengine_stonith_notify: Peer west-06 was terminated (reboot) by west-08 for west-07: OK (ref=0b3e4da0-518a-45c4-b1b6-418c2a925065) by client stonith-api.8077
Mar 24 14:52:59 west-07 stonith-api[8077]: stonith_api_kick: Node 6/(null) kicked: reboot
Mar 24 14:52:59 west-07 stonith-api[8077]: stonith_api_time: Found 2 entries for 6/(null): 0 in progress, 2 completed
Mar 24 14:52:59 west-07 stonith-api[8077]: stonith_api_time: Node 6/(null) last kicked at: 1395687179
Mar 24 14:53:00 west-07 dlm_controld[2331]: 15661 fence result 6 pid 8077 result 0 exit status
Mar 24 14:53:00 west-07 dlm_controld[2331]: 15661 fence status 6 receive 0 from 7 walltime 1395687180 local 15661
Mar 24 14:53:31 west-07 kernel: [15693.536036]  rport-6:0-18: blocked FC remote port time out: removing rport
Mar 24 14:53:31 west-07 kernel: rport-6:0-18: blocked FC remote port time out: removing rport
Mar 24 14:53:34 west-07 kernel: [15696.096033]  rport-6:0-19: blocked FC remote port time out: removing rport
Mar 24 14:53:34 west-07 kernel: rport-6:0-19: blocked FC remote port time out: removing rport
Mar 24 14:53:37 west-07 kernel: [15698.656037]  rport-6:0-2: blocked FC remote port time out: removing rport
Mar 24 14:53:37 west-07 kernel: rport-6:0-2: blocked FC remote port time out: removing rport
Mar 24 14:53:37 west-07 dlm_controld[2331]: 15698 west2 wait for quorum
Mar 24 14:53:37 west-07 dlm_controld[2331]: 15698 west1 wait for quorum
Mar 24 14:53:37 west-07 dlm_controld[2331]: 15698 west0 wait for quorum
Mar 24 14:53:37 west-07 dlm_controld[2331]: 15698 clvmd wait for quorum
Mar 24 14:53:39 west-07 kernel: [15701.600042]  rport-6:0-3: blocked FC remote port time out: removing rport
Mar 24 14:53:39 west-07 kernel: rport-6:0-3: blocked FC remote port time out: removing rport
Mar 24 14:53:42 west-07 kernel: [15704.032042]  rport-6:0-16: blocked FC remote port time out: removing rport
Mar 24 14:53:42 west-07 kernel: rport-6:0-16: blocked FC remote port time out: removing rport

Comment 3 Nate Straz 2014-03-25 21:53:59 UTC
This is caused by the "-q 0" option used in the controld resource agent, which was changed in bug 1064519.
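
For reference, a minimal sketch of where that option shows up at runtime (assuming dlm_controld's documented -q/--enable_quorum_fencing switch and the controld resource agent's "args" parameter; illustrative only, not the exact change made in 1064519):

ps -o args= -C dlm_controld      # shows which -q value the agent passed to the daemon
# -q 0: fence immediately on nodedown; -q 1: wait for quorum before fencing.
# With a fixed agent, the quorum-respecting mode can also be set explicitly:
pcs resource update dlm args="-q 1"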

*** This bug has been marked as a duplicate of bug 1064519 ***

