Bug 966150 - fence_rhevm (fence-agents-3.1.5-25.el6_4.2.x86_64) not fencing with pacemaker (pacemaker-1.1.8-7.el6.x86_64)
Summary: fence_rhevm (fence-agents-3.1.5-25.el6_4.2.x86_64) not fencing with pacemaker (pacemaker-1.1.8-7.el6.x86_64)
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: fence-agents
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Marek Grac
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-05-22 15:31 UTC by John McCabe
Modified: 2014-12-15 10:20 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-07-08 08:30:45 UTC
Target Upstream Version:
Embargoed:


Attachments: none

Description John McCabe 2013-05-22 15:31:55 UTC
Note - have raised a query on the pacemaker list in parallel to this

Description of problem:

Attempting to use fence_rhevm with pacemaker, but it is not fencing. fence_rhevm is successful from the command line but fails when invoked within pacemaker.

Version-Release number of selected component (if applicable):

fence_rhevm - fence-agents-3.1.5-25.el6_4.2.x86_64
pacemaker - pacemaker-1.1.8-7.el6.x86_64

How reproducible:

Create a basic 2-node cluster with cman/pacemaker/corosync, add a fence_rhevm stonith resource, then kill corosync on the node to be fenced.

Steps to Reproduce:

1. Create a basic cluster using the pacemaker fencing redirect (fence_pcmk) in CMAN, with a cluster.conf such as the following:

<?xml version="1.0"?>
<cluster config_version="1" name="testcluster">
  <logging debug="off"/>
  <clusternodes>
    <clusternode name="defiant" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="defiant"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="enterprise" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="enterprise"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices> 
    <fencedevice name="pcmk" agent="fence_pcmk"/> 
  </fencedevices>
  <cman two_node="1" expected_votes="1">
  </cman>
</cluster>

2. Add a fence_rhevm stonith resource

pcs stonith create st-rhevm fence_rhevm login="{user}@{domain}" passwd="{password}" ssl=1 ipaddr="{hypervisor ip}" verbose=1 debug="/var/log/fence_debug.log"

3. Confirm that the stonith resource has started

pcs status

4. Kill corosync on the node to fence

ps -ef | grep coro
kill -9 <pid>
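
As a shortcut, the lookup and kill above can usually be combined into one command (assuming pkill is available on the node):

pkill -9 corosync

The resulting fence attempt can then be followed from the surviving node with, for example, tail -f /var/log/messages.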

Actual results:

You can see from /var/log/messages that there is a request to fence the node where corosync was killed, but it fails. The debug log output from fence_rhevm does not show a reboot request being logged, only the "vms" command to list the available VMs.

Expected results:

Fencing should occur via pacemaker. I am able to fence successfully from the command line with both stonith_admin and fence_rhevm:

stonith_admin --reboot enterprise --tolerance 5s

fence_rhevm -o reboot -a <hypervisor ip> -l <user>@<domain> -p <password> -n enterprise -z
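
A quick way to sanity-check the agent and its exit status outside pacemaker is a status or list call (a sketch using the same placeholders as above; -o status and -o list are standard fence-agent actions):

fence_rhevm -o status -a <hypervisor ip> -l <user>@<domain> -p <password> -n enterprise -z; echo $?
fence_rhevm -o list -a <hypervisor ip> -l <user>@<domain> -p <password> -z; echo $?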


Additional info:

/var/log/messages
-----------------------------
May 21 22:21:32 defiant corosync[1245]:   [TOTEM ] A processor failed, forming new configuration.
May 21 22:21:34 defiant corosync[1245]:   [QUORUM] Members[1]: 1
May 21 22:21:34 defiant corosync[1245]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 21 22:21:34 defiant kernel: dlm: closing connection to node 2
May 21 22:21:34 defiant corosync[1245]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.25.152) ; members(old:2 left:1)
May 21 22:21:34 defiant corosync[1245]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 21 22:21:34 defiant crmd[1749]:   notice: crm_update_peer_state: cman_event_callback: Node enterprise[2] - state is now lost
May 21 22:21:34 defiant crmd[1749]:  warning: match_down_event: No match for shutdown action on enterprise
May 21 22:21:34 defiant crmd[1749]:   notice: peer_update_callback: Stonith/shutdown of enterprise not matched
May 21 22:21:34 defiant crmd[1749]:   notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
May 21 22:21:34 defiant fenced[1302]: fencing node enterprise
May 21 22:21:34 defiant logger: fence_pcmk[2219]: Requesting Pacemaker fence enterprise (reset)
May 21 22:21:34 defiant stonith_admin[2220]:   notice: crm_log_args: Invoked: stonith_admin --reboot enterprise --tolerance 5s
May 21 22:21:35 defiant attrd[1747]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)
May 21 22:21:35 defiant attrd[1747]:   notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
May 21 22:21:36 defiant pengine[1748]:   notice: unpack_config: On loss of CCM Quorum: Ignore
May 21 22:21:36 defiant pengine[1748]:   notice: process_pe_message: Calculated Transition 64: /var/lib/pacemaker/pengine/pe-input-60.bz2
May 21 22:21:36 defiant crmd[1749]:   notice: run_graph: Transition 64 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-60.bz2): Complete
May 21 22:21:36 defiant crmd[1749]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
May 21 22:21:44 defiant logger: fence_pcmk[2219]: Call to fence enterprise (reset) failed with rc=255
May 21 22:21:45 defiant fenced[1302]: fence enterprise dev 0.0 agent fence_pcmk result: error from agent
May 21 22:21:45 defiant fenced[1302]: fence enterprise failed
May 21 22:21:48 defiant fenced[1302]: fencing node enterprise
May 21 22:21:48 defiant logger: fence_pcmk[2239]: Requesting Pacemaker fence enterprise (reset)
May 21 22:21:48 defiant stonith_admin[2240]:   notice: crm_log_args: Invoked: stonith_admin --reboot enterprise --tolerance 5s
May 21 22:21:58 defiant logger: fence_pcmk[2239]: Call to fence enterprise (reset) failed with rc=255
May 21 22:21:58 defiant fenced[1302]: fence enterprise dev 0.0 agent fence_pcmk result: error from agent
May 21 22:21:58 defiant fenced[1302]: fence enterprise failed
May 21 22:22:01 defiant fenced[1302]: fencing node enterprise


/var/log/cluster/corosync.log
-----------------------------
May 21 22:21:32 corosync [TOTEM ] A processor failed, forming new configuration.
May 21 22:21:34 corosync [QUORUM] Members[1]: 1
May 21 22:21:34 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 21 22:21:34 [1749] defiant       crmd:     info: cman_event_callback:       Membership 296: quorum retained
May 21 22:21:34 [1744] defiant        cib:     info: pcmk_cpg_membership:       Left[5.0] cib.2
May 21 22:21:34 [1744] defiant        cib:     info: crm_update_peer_proc:      pcmk_cpg_membership: Node enterprise[2] - corosync-cpg is now offline
May 21 22:21:34 [1744] defiant        cib:     info: pcmk_cpg_membership:       Member[5.0] cib.1
May 21 22:21:34 [1745] defiant stonith-ng:     info: pcmk_cpg_membership:       Left[5.0] stonith-ng.2
May 21 22:21:34 [1745] defiant stonith-ng:     info: crm_update_peer_proc:      pcmk_cpg_membership: Node enterprise[2] - corosync-cpg is now offline
May 21 22:21:34 corosync [CPG   ] chosen downlist: sender r(0) ip(10.10.25.152) ; members(old:2 left:1)
May 21 22:21:34 corosync [MAIN  ] Completed service synchronization, ready to provide service.
May 21 22:21:34 [1745] defiant stonith-ng:     info: pcmk_cpg_membership:       Member[5.0] stonith-ng.1
May 21 22:21:34 [1749] defiant       crmd:   notice: crm_update_peer_state:     cman_event_callback: Node enterprise[2] - state is now lost
May 21 22:21:34 [1749] defiant       crmd:     info: peer_update_callback:      enterprise is now lost (was member)
May 21 22:21:34 [1744] defiant        cib:     info: cib_process_request:       Operation complete: op cib_modify for section nodes (origin=local/crmd/150, version=0.22.3): OK (rc=0)
May 21 22:21:34 [1749] defiant       crmd:     info: pcmk_cpg_membership:       Left[5.0] crmd.2
May 21 22:21:34 [1749] defiant       crmd:     info: crm_update_peer_proc:      pcmk_cpg_membership: Node enterprise[2] - corosync-cpg is now offline
May 21 22:21:34 [1749] defiant       crmd:     info: peer_update_callback:      Client enterprise/peer now has status [offline] (DC=true)
May 21 22:21:34 [1749] defiant       crmd:  warning: match_down_event:  No match for shutdown action on enterprise
May 21 22:21:34 [1749] defiant       crmd:   notice: peer_update_callback:      Stonith/shutdown of enterprise not matched
May 21 22:21:34 [1749] defiant       crmd:     info: crm_update_peer_expected:  peer_update_callback: Node enterprise[2] - expected state is now down
May 21 22:21:34 [1749] defiant       crmd:     info: abort_transition_graph:    peer_update_callback:211 - Triggered transition abort (complete=1) : Node failure
May 21 22:21:34 [1749] defiant       crmd:     info: pcmk_cpg_membership:       Member[5.0] crmd.1
May 21 22:21:34 [1749] defiant       crmd:   notice: do_state_transition:       State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
May 21 22:21:34 [1749] defiant       crmd:     info: abort_transition_graph:    do_te_invoke:163 - Triggered transition abort (complete=1) : Peer Halt
May 21 22:21:34 [1749] defiant       crmd:     info: join_make_offer:   Making join offers based on membership 296
May 21 22:21:34 [1749] defiant       crmd:     info: do_dc_join_offer_all:      join-7: Waiting on 1 outstanding join acks
May 21 22:21:34 [1749] defiant       crmd:     info: update_dc:         Set DC to defiant (3.0.7)
May 21 22:21:34 [1749] defiant       crmd:     info: do_state_transition:       State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
May 21 22:21:34 [1749] defiant       crmd:     info: do_dc_join_finalize:       join-7: Syncing the CIB from defiant to the rest of the cluster
May 21 22:21:34 [1744] defiant        cib:     info: cib_process_request:       Operation complete: op cib_sync for section 'all' (origin=local/crmd/154, version=0.22.5): OK (rc=0)
May 21 22:21:34 [1744] defiant        cib:     info: cib_process_request:       Operation complete: op cib_modify for section nodes (origin=local/crmd/155, version=0.22.6): OK (rc=0)
May 21 22:21:34 [1749] defiant       crmd:     info: stonith_action_create:     Initiating action metadata for agent fence_rhevm (target=(null))
May 21 22:21:35 [1749] defiant       crmd:     info: do_dc_join_ack:    join-7: Updating node state to member for defiant
May 21 22:21:35 [1749] defiant       crmd:     info: erase_status_tag:  Deleting xpath: //node_state[@uname='defiant']/lrm
May 21 22:21:35 [1744] defiant        cib:     info: cib_process_request:       Operation complete: op cib_delete for section //node_state[@uname='defiant']/lrm (origin=local/crmd/156, version=0.22.7): OK (rc=0)
May 21 22:21:35 [1749] defiant       crmd:     info: do_state_transition:       State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
May 21 22:21:35 [1749] defiant       crmd:     info: abort_transition_graph:    do_te_invoke:156 - Triggered transition abort (complete=1) : Peer Cancelled
May 21 22:21:35 [1747] defiant      attrd:   notice: attrd_local_callback:      Sending full refresh (origin=crmd)
May 21 22:21:35 [1747] defiant      attrd:   notice: attrd_trigger_update:      Sending flush op to all hosts for: probe_complete (true)
May 21 22:21:35 [1744] defiant        cib:     info: cib_process_request:       Operation complete: op cib_modify for section nodes (origin=local/crmd/158, version=0.22.9): OK (rc=0)
May 21 22:21:35 [1744] defiant        cib:     info: cib_process_request:       Operation complete: op cib_modify for section cib (origin=local/crmd/160, version=0.22.11): OK (rc=0)
May 21 22:21:36 [1748] defiant    pengine:     info: unpack_config:     Startup probes: enabled
May 21 22:21:36 [1748] defiant    pengine:   notice: unpack_config:     On loss of CCM Quorum: Ignore
May 21 22:21:36 [1748] defiant    pengine:     info: unpack_config:     Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
May 21 22:21:36 [1748] defiant    pengine:     info: unpack_domains:    Unpacking domains
May 21 22:21:36 [1748] defiant    pengine:     info: determine_online_status_fencing:   Node defiant is active
May 21 22:21:36 [1748] defiant    pengine:     info: determine_online_status:   Node defiant is online
May 21 22:21:36 [1748] defiant    pengine:     info: native_print:      st-rhevm        (stonith:fence_rhevm):  Started defiant
May 21 22:21:36 [1748] defiant    pengine:     info: LogActions:        Leave   st-rhevm        (Started defiant)
May 21 22:21:36 [1748] defiant    pengine:   notice: process_pe_message:        Calculated Transition 64: /var/lib/pacemaker/pengine/pe-input-60.bz2
May 21 22:21:36 [1749] defiant       crmd:     info: do_state_transition:       State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
May 21 22:21:36 [1749] defiant       crmd:     info: do_te_invoke:      Processing graph 64 (ref=pe_calc-dc-1369171296-118) derived from /var/lib/pacemaker/pengine/pe-input-60.bz2
May 21 22:21:36 [1749] defiant       crmd:   notice: run_graph:         Transition 64 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-60.bz2): Complete
May 21 22:21:36 [1749] defiant       crmd:   notice: do_state_transition:       State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]


Live CIB
--------------------------------
<cib epoch="18" num_updates="88" admin_epoch="0" validate-with="pacemaker-1.2" update-origin="defiant" update-client="cibadmin" cib-last-written="Tue May 21 07:55:31 2013" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="defiant">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.8-7.el6-394e906"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="cman"/>
        <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
        <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="true"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="defiant" uname="defiant"/>
      <node id="enterprise" uname="enterprise"/>
    </nodes>
    <resources>
      <primitive class="stonith" id="st-rhevm" type="fence_rhevm">
        <instance_attributes id="st-rhevm-instance_attributes">
          <nvpair id="st-rhevm-instance_attributes-login" name="login" value="<user>@<domain>"/>
          <nvpair id="st-rhevm-instance_attributes-passwd" name="passwd" value="<password>"/>
          <nvpair id="st-rhevm-instance_attributes-debug" name="debug" value="/tmp/debug.log"/>
          <nvpair id="st-rhevm-instance_attributes-ssl" name="ssl" value="1"/>
          <nvpair id="st-rhevm-instance_attributes-verbose" name="verbose" value="1"/>
          <nvpair id="st-rhevm-instance_attributes-ipaddr" name="ipaddr" value="<hypervisor ip>"/>
        </instance_attributes>
      </primitive>
    </resources>
    <constraints/>
  </configuration>
  <status>
    <node_state id="defiant" uname="defiant" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
      <transient_attributes id="defiant">
        <instance_attributes id="status-defiant">
          <nvpair id="status-defiant-probe_complete" name="probe_complete" value="true"/>
        </instance_attributes>
      </transient_attributes>
      <lrm id="defiant">
        <lrm_resources>
          <lrm_resource id="st-rhevm" type="fence_rhevm" class="stonith">
            <lrm_rsc_op id="st-rhevm_last_0" operation_key="st-rhevm_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.7" transition-key="2:1:0:1e7972e8-6f9a-4325-b9c3-3d7e2950d996" transition-magic="0:0;2:1:0:1e7972e8-6f9a-4325-b9c3-3d7e2950d996" call-id="14" rc-code="0" op-status="0" interval="0" last-run="1369119332" last-rc-change="0" exec-time="232" queue-time="0" op-digest="3bc7e1ce413fe37998a289f77f85d159"/>
          </lrm_resource>
        </lrm_resources>
      </lrm>
    </node_state>
    <node_state id="enterprise" uname="enterprise" in_ccm="true" crmd="online" crm-debug-origin="do_update_resource" join="member" expected="member">
      <lrm id="enterprise">
        <lrm_resources>
          <lrm_resource id="st-rhevm" type="fence_rhevm" class="stonith">
            <lrm_rsc_op id="st-rhevm_last_0" operation_key="st-rhevm_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="5:59:7:8170c498-f66b-4974-b3c0-c17eb45ba5cb" transition-magic="0:7;5:59:7:8170c498-f66b-4974-b3c0-c17eb45ba5cb" call-id="5" rc-code="7" op-status="0" interval="0" last-run="1369170800" last-rc-change="0" exec-time="4" queue-time="0" op-digest="3bc7e1ce413fe37998a289f77f85d159"/>
          </lrm_resource>
        </lrm_resources>
      </lrm>
      <transient_attributes id="enterprise">
        <instance_attributes id="status-enterprise">
          <nvpair id="status-enterprise-probe_complete" name="probe_complete" value="true"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>
  </status>
</cib>
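
For reference, a live CIB dump like the one above can typically be obtained on a running node with either of:

cibadmin --query > /tmp/live-cib.xml
pcs cluster cib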

Comment 1 John McCabe 2013-05-22 15:33:28 UTC
Explicitly setting the ipport and shell_timeout params didn't help.

<nvpair id="st-rhevm-instance_attributes-ipport" name="ipport" value="443"/>
<nvpair id="st-rhevm-instance_attributes-shell_timeout" name="shell_timeout" value="10"/>

Comment 3 Marek Grac 2013-05-24 22:05:28 UTC
@John:

Do you have SELinux enabled? If yes, can you try it with SELinux disabled?

Comment 4 John McCabe 2014-12-09 11:42:51 UTC
As an FYI, this turned out to be due to the stonith device in pacemaker lacking the following parameters:

ssl=1
ipport=443
pcmk_host_list="clusternode1 clusternode2 clusternode3"
pcmk_host_map="clusternode1:node1_vm_name clusternode2:node2_vm_name clusternode3:node3_vm_name"
action=reboot

I needed to explicitly set the port and enable SSL, and also define the mapping from cluster node name to VM name as configured in RHEV-M.
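
For reference, combining the original resource definition from the description with the parameters above gives something like the following (a sketch only; the placeholders and host/VM names are the ones used in this report, and the pcmk_host_map separator may need adjusting for your pacemaker version):

pcs stonith create st-rhevm fence_rhevm \
    login="{user}@{domain}" passwd="{password}" ipaddr="{hypervisor ip}" \
    ssl=1 ipport=443 \
    pcmk_host_list="clusternode1 clusternode2 clusternode3" \
    pcmk_host_map="clusternode1:node1_vm_name;clusternode2:node2_vm_name;clusternode3:node3_vm_name" \
    action=reboot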

Comment 5 Marek Grac 2014-12-15 10:20:50 UTC
Thanks for the info.

There should be no need to set ipport when the ssl option is used; if ipport is not set correctly by default in that case, it is a bug.

