Bug 966150

Summary: fence_rhevm (fence-agents-3.1.5-25.el6_4.2.x86_64) not fencing with pacemaker (pacemaker-1.1.8-7.el6.x86_64)

Product: Red Hat Enterprise Linux 6
Component: fence-agents
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: unspecified
Status: CLOSED INSUFFICIENT_DATA
Reporter: John McCabe <john>
Assignee: Marek Grac <mgrac>
QA Contact: Cluster QE <mspqa-list>
CC: acathrow, cluster-maint, john
Target Milestone: rc
Target Release: ---
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-07-08 08:30:45 UTC

Description John McCabe 2013-05-22 15:31:55 UTC
Note - I have raised a query on the pacemaker list in parallel with this report.

Description of problem:

fence_rhevm is not fencing when driven by pacemaker. Running fence_rhevm directly from the command line succeeds, but the same fence fails when it is invoked from within pacemaker.

Version-Release number of selected component (if applicable):

fence_rhevm - fence-agents-3.1.5-25.el6_4.2.x86_64
pacemaker - pacemaker-1.1.8-7.el6.x86_64

How reproducible:

Create a basic 2-node cluster with cman/pacemaker/corosync, add a fence_rhevm stonith resource, then kill corosync on the node to be fenced.

Steps to Reproduce:

1. Create a basic cluster using the pacemaker fence redirect (fence_pcmk) in CMAN, for example with a cluster.conf as follows:

<?xml version="1.0"?>
<cluster config_version="1" name="testcluster">
  <logging debug="off"/>
  <clusternodes>
    <clusternode name="defiant" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="defiant"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="enterprise" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="enterprise"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices> 
    <fencedevice name="pcmk" agent="fence_pcmk"/> 
  </fencedevices>
  <cman two_node="1" expected_votes="1">
  </cman>
</cluster>

2. Add a fence_rhevm stonith resource

pcs stonith create st-rhevm fence_rhevm login="{user}@{domain}" passwd="{password}" ssl=1 ipaddr="{hypervisor ip}" verbose=1 debug="/var/log/fence_debug.log"

3. Confirm that the stonith resource has started

pcs status

4. Kill corosync on the node to fence

ps -ef | grep coro
kill -9 <pid>

Actual results:

/var/log/messages shows a request to fence the node where corosync was killed, but the fence fails. The fence_rhevm debug log does not show a reboot request being logged, only the 'vms' command used to list the available VMs.

Expected results:

Fencing should occur via pacemaker. I am able to fence successfully from the command line with both stonith_admin and fence_rhevm:

stonith_admin --reboot enterprise --tolerance 5s

fence_rhevm -o reboot -a <hypervisor ip> -l <user>@<domain> -p <password> -n enterprise -z


Additional info:

/var/log/messages
-----------------------------
May 21 22:21:32 defiant corosync[1245]:   [TOTEM ] A processor failed, forming new configuration.
May 21 22:21:34 defiant corosync[1245]:   [QUORUM] Members[1]: 1
May 21 22:21:34 defiant corosync[1245]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 21 22:21:34 defiant kernel: dlm: closing connection to node 2
May 21 22:21:34 defiant corosync[1245]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.25.152) ; members(old:2 left:1)
May 21 22:21:34 defiant corosync[1245]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 21 22:21:34 defiant crmd[1749]:   notice: crm_update_peer_state: cman_event_callback: Node enterprise[2] - state is now lost
May 21 22:21:34 defiant crmd[1749]:  warning: match_down_event: No match for shutdown action on enterprise
May 21 22:21:34 defiant crmd[1749]:   notice: peer_update_callback: Stonith/shutdown of enterprise not matched
May 21 22:21:34 defiant crmd[1749]:   notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
May 21 22:21:34 defiant fenced[1302]: fencing node enterprise
May 21 22:21:34 defiant logger: fence_pcmk[2219]: Requesting Pacemaker fence enterprise (reset)
May 21 22:21:34 defiant stonith_admin[2220]:   notice: crm_log_args: Invoked: stonith_admin --reboot enterprise --tolerance 5s
May 21 22:21:35 defiant attrd[1747]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)
May 21 22:21:35 defiant attrd[1747]:   notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
May 21 22:21:36 defiant pengine[1748]:   notice: unpack_config: On loss of CCM Quorum: Ignore
May 21 22:21:36 defiant pengine[1748]:   notice: process_pe_message: Calculated Transition 64: /var/lib/pacemaker/pengine/pe-input-60.bz2
May 21 22:21:36 defiant crmd[1749]:   notice: run_graph: Transition 64 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-60.bz2): Complete
May 21 22:21:36 defiant crmd[1749]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
May 21 22:21:44 defiant logger: fence_pcmk[2219]: Call to fence enterprise (reset) failed with rc=255
May 21 22:21:45 defiant fenced[1302]: fence enterprise dev 0.0 agent fence_pcmk result: error from agent
May 21 22:21:45 defiant fenced[1302]: fence enterprise failed
May 21 22:21:48 defiant fenced[1302]: fencing node enterprise
May 21 22:21:48 defiant logger: fence_pcmk[2239]: Requesting Pacemaker fence enterprise (reset)
May 21 22:21:48 defiant stonith_admin[2240]:   notice: crm_log_args: Invoked: stonith_admin --reboot enterprise --tolerance 5s
May 21 22:21:58 defiant logger: fence_pcmk[2239]: Call to fence enterprise (reset) failed with rc=255
May 21 22:21:58 defiant fenced[1302]: fence enterprise dev 0.0 agent fence_pcmk result: error from agent
May 21 22:21:58 defiant fenced[1302]: fence enterprise failed
May 21 22:22:01 defiant fenced[1302]: fencing node enterprise


/var/log/cluster/corosync.log
-----------------------------
May 21 22:21:32 corosync [TOTEM ] A processor failed, forming new configuration.
May 21 22:21:34 corosync [QUORUM] Members[1]: 1
May 21 22:21:34 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 21 22:21:34 [1749] defiant       crmd:     info: cman_event_callback:       Membership 296: quorum retained
May 21 22:21:34 [1744] defiant        cib:     info: pcmk_cpg_membership:       Left[5.0] cib.2
May 21 22:21:34 [1744] defiant        cib:     info: crm_update_peer_proc:      pcmk_cpg_membership: Node enterprise[2] - corosync-cpg is now offline
May 21 22:21:34 [1744] defiant        cib:     info: pcmk_cpg_membership:       Member[5.0] cib.1
May 21 22:21:34 [1745] defiant stonith-ng:     info: pcmk_cpg_membership:       Left[5.0] stonith-ng.2
May 21 22:21:34 [1745] defiant stonith-ng:     info: crm_update_peer_proc:      pcmk_cpg_membership: Node enterprise[2] - corosync-cpg is now offline
May 21 22:21:34 corosync [CPG   ] chosen downlist: sender r(0) ip(10.10.25.152) ; members(old:2 left:1)
May 21 22:21:34 corosync [MAIN  ] Completed service synchronization, ready to provide service.
May 21 22:21:34 [1745] defiant stonith-ng:     info: pcmk_cpg_membership:       Member[5.0] stonith-ng.1
May 21 22:21:34 [1749] defiant       crmd:   notice: crm_update_peer_state:     cman_event_callback: Node enterprise[2] - state is now lost
May 21 22:21:34 [1749] defiant       crmd:     info: peer_update_callback:      enterprise is now lost (was member)
May 21 22:21:34 [1744] defiant        cib:     info: cib_process_request:       Operation complete: op cib_modify for section nodes (origin=local/crmd/150, version=0.22.3): OK (rc=0)
May 21 22:21:34 [1749] defiant       crmd:     info: pcmk_cpg_membership:       Left[5.0] crmd.2
May 21 22:21:34 [1749] defiant       crmd:     info: crm_update_peer_proc:      pcmk_cpg_membership: Node enterprise[2] - corosync-cpg is now offline
May 21 22:21:34 [1749] defiant       crmd:     info: peer_update_callback:      Client enterprise/peer now has status [offline] (DC=true)
May 21 22:21:34 [1749] defiant       crmd:  warning: match_down_event:  No match for shutdown action on enterprise
May 21 22:21:34 [1749] defiant       crmd:   notice: peer_update_callback:      Stonith/shutdown of enterprise not matched
May 21 22:21:34 [1749] defiant       crmd:     info: crm_update_peer_expected:  peer_update_callback: Node enterprise[2] - expected state is now down
May 21 22:21:34 [1749] defiant       crmd:     info: abort_transition_graph:    peer_update_callback:211 - Triggered transition abort (complete=1) : Node failure
May 21 22:21:34 [1749] defiant       crmd:     info: pcmk_cpg_membership:       Member[5.0] crmd.1
May 21 22:21:34 [1749] defiant       crmd:   notice: do_state_transition:       State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
May 21 22:21:34 [1749] defiant       crmd:     info: abort_transition_graph:    do_te_invoke:163 - Triggered transition abort (complete=1) : Peer Halt
May 21 22:21:34 [1749] defiant       crmd:     info: join_make_offer:   Making join offers based on membership 296
May 21 22:21:34 [1749] defiant       crmd:     info: do_dc_join_offer_all:      join-7: Waiting on 1 outstanding join acks
May 21 22:21:34 [1749] defiant       crmd:     info: update_dc:         Set DC to defiant (3.0.7)
May 21 22:21:34 [1749] defiant       crmd:     info: do_state_transition:       State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
May 21 22:21:34 [1749] defiant       crmd:     info: do_dc_join_finalize:       join-7: Syncing the CIB from defiant to the rest of the cluster
May 21 22:21:34 [1744] defiant        cib:     info: cib_process_request:       Operation complete: op cib_sync for section 'all' (origin=local/crmd/154, version=0.22.5): OK (rc=0)
May 21 22:21:34 [1744] defiant        cib:     info: cib_process_request:       Operation complete: op cib_modify for section nodes (origin=local/crmd/155, version=0.22.6): OK (rc=0)
May 21 22:21:34 [1749] defiant       crmd:     info: stonith_action_create:     Initiating action metadata for agent fence_rhevm (target=(null))
May 21 22:21:35 [1749] defiant       crmd:     info: do_dc_join_ack:    join-7: Updating node state to member for defiant
May 21 22:21:35 [1749] defiant       crmd:     info: erase_status_tag:  Deleting xpath: //node_state[@uname='defiant']/lrm
May 21 22:21:35 [1744] defiant        cib:     info: cib_process_request:       Operation complete: op cib_delete for section //node_state[@uname='defiant']/lrm (origin=local/crmd/156, version=0.22.7): OK (rc=0)
May 21 22:21:35 [1749] defiant       crmd:     info: do_state_transition:       State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
May 21 22:21:35 [1749] defiant       crmd:     info: abort_transition_graph:    do_te_invoke:156 - Triggered transition abort (complete=1) : Peer Cancelled
May 21 22:21:35 [1747] defiant      attrd:   notice: attrd_local_callback:      Sending full refresh (origin=crmd)
May 21 22:21:35 [1747] defiant      attrd:   notice: attrd_trigger_update:      Sending flush op to all hosts for: probe_complete (true)
May 21 22:21:35 [1744] defiant        cib:     info: cib_process_request:       Operation complete: op cib_modify for section nodes (origin=local/crmd/158, version=0.22.9): OK (rc=0)
May 21 22:21:35 [1744] defiant        cib:     info: cib_process_request:       Operation complete: op cib_modify for section cib (origin=local/crmd/160, version=0.22.11): OK (rc=0)
May 21 22:21:36 [1748] defiant    pengine:     info: unpack_config:     Startup probes: enabled
May 21 22:21:36 [1748] defiant    pengine:   notice: unpack_config:     On loss of CCM Quorum: Ignore
May 21 22:21:36 [1748] defiant    pengine:     info: unpack_config:     Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
May 21 22:21:36 [1748] defiant    pengine:     info: unpack_domains:    Unpacking domains
May 21 22:21:36 [1748] defiant    pengine:     info: determine_online_status_fencing:   Node defiant is active
May 21 22:21:36 [1748] defiant    pengine:     info: determine_online_status:   Node defiant is online
May 21 22:21:36 [1748] defiant    pengine:     info: native_print:      st-rhevm        (stonith:fence_rhevm):  Started defiant
May 21 22:21:36 [1748] defiant    pengine:     info: LogActions:        Leave   st-rhevm        (Started defiant)
May 21 22:21:36 [1748] defiant    pengine:   notice: process_pe_message:        Calculated Transition 64: /var/lib/pacemaker/pengine/pe-input-60.bz2
May 21 22:21:36 [1749] defiant       crmd:     info: do_state_transition:       State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
May 21 22:21:36 [1749] defiant       crmd:     info: do_te_invoke:      Processing graph 64 (ref=pe_calc-dc-1369171296-118) derived from /var/lib/pacemaker/pengine/pe-input-60.bz2
May 21 22:21:36 [1749] defiant       crmd:   notice: run_graph:         Transition 64 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-60.bz2): Complete
May 21 22:21:36 [1749] defiant       crmd:   notice: do_state_transition:       State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]


Live CIB
--------------------------------
<cib epoch="18" num_updates="88" admin_epoch="0" validate-with="pacemaker-1.2" update-origin="defiant" update-client="cibadmin" cib-last-written="Tue May 21 07:55:31 2013" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="defiant">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.8-7.el6-394e906"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="cman"/>
        <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
        <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="true"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="defiant" uname="defiant"/>
      <node id="enterprise" uname="enterprise"/>
    </nodes>
    <resources>
      <primitive class="stonith" id="st-rhevm" type="fence_rhevm">
        <instance_attributes id="st-rhevm-instance_attributes">
          <nvpair id="st-rhevm-instance_attributes-login" name="login" value="<user>@<domain>"/>
          <nvpair id="st-rhevm-instance_attributes-passwd" name="passwd" value="<password>"/>
          <nvpair id="st-rhevm-instance_attributes-debug" name="debug" value="/tmp/debug.log"/>
          <nvpair id="st-rhevm-instance_attributes-ssl" name="ssl" value="1"/>
          <nvpair id="st-rhevm-instance_attributes-verbose" name="verbose" value="1"/>
          <nvpair id="st-rhevm-instance_attributes-ipaddr" name="ipaddr" value="<hypervisor ip>"/>
        </instance_attributes>
      </primitive>
    </resources>
    <constraints/>
  </configuration>
  <status>
    <node_state id="defiant" uname="defiant" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
      <transient_attributes id="defiant">
        <instance_attributes id="status-defiant">
          <nvpair id="status-defiant-probe_complete" name="probe_complete" value="true"/>
        </instance_attributes>
      </transient_attributes>
      <lrm id="defiant">
        <lrm_resources>
          <lrm_resource id="st-rhevm" type="fence_rhevm" class="stonith">
            <lrm_rsc_op id="st-rhevm_last_0" operation_key="st-rhevm_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.7" transition-key="2:1:0:1e7972e8-6f9a-4325-b9c3-3d7e2950d996" transition-magic="0:0;2:1:0:1e7972e8-6f9a-4325-b9c3-3d7e2950d996" call-id="14" rc-code="0" op-status="0" interval="0" last-run="1369119332" last-rc-change="0" exec-time="232" queue-time="0" op-digest="3bc7e1ce413fe37998a289f77f85d159"/>
          </lrm_resource>
        </lrm_resources>
      </lrm>
    </node_state>
    <node_state id="enterprise" uname="enterprise" in_ccm="true" crmd="online" crm-debug-origin="do_update_resource" join="member" expected="member">
      <lrm id="enterprise">
        <lrm_resources>
          <lrm_resource id="st-rhevm" type="fence_rhevm" class="stonith">
            <lrm_rsc_op id="st-rhevm_last_0" operation_key="st-rhevm_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="5:59:7:8170c498-f66b-4974-b3c0-c17eb45ba5cb" transition-magic="0:7;5:59:7:8170c498-f66b-4974-b3c0-c17eb45ba5cb" call-id="5" rc-code="7" op-status="0" interval="0" last-run="1369170800" last-rc-change="0" exec-time="4" queue-time="0" op-digest="3bc7e1ce413fe37998a289f77f85d159"/>
          </lrm_resource>
        </lrm_resources>
      </lrm>
      <transient_attributes id="enterprise">
        <instance_attributes id="status-enterprise">
          <nvpair id="status-enterprise-probe_complete" name="probe_complete" value="true"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>
  </status>
</cib>

Comment 1 John McCabe 2013-05-22 15:33:28 UTC
Explicitly setting the ipport and shell_timeout parameters didn't help:

<nvpair id="st-rhevm-instance_attributes-ipport" name="ipport" value="443"/>
<nvpair id="st-rhevm-instance_attributes-shell_timeout" name="shell_timeout" value="10"/>

Comment 3 Marek Grac 2013-05-24 22:05:28 UTC
@John:

Do you have SELinux enabled? If yes, can you try it with SELinux disabled?

Comment 4 John McCabe 2014-12-09 11:42:51 UTC
As an FYI, this turned out to be due to the stonith device in pacemaker lacking the following parameters:

ssl=1
ipport=443
pcmk_host_list="clusternode1 clusternode2 clusternode3"
pcmk_host_map="clusternode1:node1_vm_name clsuternode2:node2_vm_name clusternode3:node3_vm_name"
action=reboot

The port had to be set explicitly, SSL enabled, and the mapping from cluster node name to VM name (as configured in RHEV-M) defined; a sketch of the resulting resource definition follows.
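A minimal sketch of a working resource definition along those lines, combining the original create command with the parameters listed above (placeholders and hypothetical node/VM names as in this comment; not a verified configuration):

pcs stonith create st-rhevm fence_rhevm \
  login="{user}@{domain}" passwd="{password}" ipaddr="{hypervisor ip}" \
  ssl=1 ipport=443 \
  pcmk_host_list="clusternode1 clusternode2 clusternode3" \
  pcmk_host_map="clusternode1:node1_vm_name clusternode2:node2_vm_name clusternode3:node3_vm_name" \
  action=reboot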

Comment 5 Marek Grac 2014-12-15 10:20:50 UTC
Thanks for the info.

There should be no need to set ipport when the ssl option is used. If ipport is not set appropriately by default in that case, it is a bug.
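A hedged way to confirm what the agent actually defaults to is to inspect its metadata output (assuming the standard fence-agents metadata action):

fence_rhevm -o metadata | grep -A 3 ipport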