Bug 1319070

Summary: adding a node on RHEL6 may crash due to hardcoded fencing names in pcs
Product: Red Hat Enterprise Linux 6 Reporter: Radek Steiger <rsteiger>
Component: pcs Assignee: Tomas Jelinek <tojeline>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: medium Docs Contact:
Priority: high    
Version: 6.8 CC: cfeist, cluster-maint, idevat, omular, tojeline
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: pcs-0.9.154-1.el6 Doc Type: Bug Fix
Doc Text:
Cause: A user adds a node to a cluster. Consequence: If the cluster configuration has been updated outside of pcs and the fence devices have been changed, pcs exits with an error and leaves the cluster configuration in an inconsistent state (the node is half added). Fix: pcs reads the configuration and makes sure the required fence device exists, creating it if it does not. Result: The node is added successfully.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-03-21 11:03:21 UTC Type: Bug
Attachments:
Description Flags
proposed fix none

Description Radek Steiger 2016-03-18 15:11:39 UTC
> Description of problem:

Fencing-related names in cluster.conf, such as those in the <method>, <device> and <fencedevice> tags, can cause pcs to crash when adding a node to the cluster if the cluster configuration has been created or modified outside pcs.

The reason is that pcs creates the config file with "pcmk-method" and "pcmk-redirect" as name identifiers and presumes they are always present when adding further nodes. However, if the configuration has been created or changed manually to use custom identifiers, the node add procedure fails.

Example cluster.conf:

<cluster config_version="4" name="STSRHTS30477">
  <cman/>
  <totem token="3000"/>
  <fence_daemon clean_start="0" post_join_delay="20"/>
  <clusternodes>
    <clusternode name="virt-006" nodeid="1" votes="1">
      <fence>
        <method name="mymethod">
          <device name="mypcmk" port="virt-006"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="virt-007" nodeid="2" votes="1">
      <fence>
        <method name="mymethod">
          <device name="mypcmk" port="virt-007"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="virt-008" nodeid="3" votes="1">
      <fence>
        <method name="mymethod">
          <device name="mypcmk" port="virt-008"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="mypcmk"/>
  </fencedevices>
</cluster>

Now when I try to add a node:

[root@virt-007 ententyky]# pcs cluster node add virt-009
Error: unable to add virt-009 on virt-006 - Error connecting to virt-006 - (HTTP error: 400)
Error: unable to add virt-009 on virt-007 - Error connecting to virt-007 - (HTTP error: 400)
Error: unable to add virt-009 on virt-008 - Error connecting to virt-008 - (HTTP error: 400)
Error: Unable to update any nodes

An incomplete clusternode section, missing the device details and the number of votes, is added to cluster.conf:

    ...
    <clusternode name="virt-009" nodeid="4">
      <fence>
        <method name="pcmk-method"/>
      </fence>
    </clusternode>
    ...
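
A half-added node like the one above can be spotted by looking for a <method> element that contains no <device>. The following is only a minimal sketch of such a check, not code taken from pcs; the function name find_incomplete_nodes and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def find_incomplete_nodes(cluster_conf_xml):
    """Return names of clusternodes that have a fence method without any device.

    Hypothetical helper for illustration; pcs does not ship this function.
    """
    root = ET.fromstring(cluster_conf_xml)
    incomplete = []
    for node in root.iter("clusternode"):
        for method in node.iter("method"):
            if method.find("device") is None:
                incomplete.append(node.get("name"))
                break  # one empty method is enough to flag the node
    return incomplete


# Shortened sample modelled on the configuration in this report.
SAMPLE = """
<cluster config_version="5" name="example">
  <clusternodes>
    <clusternode name="virt-008" nodeid="3" votes="1">
      <fence>
        <method name="mymethod">
          <device name="mypcmk" port="virt-008"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="virt-009" nodeid="4">
      <fence>
        <method name="pcmk-method"/>
      </fence>
    </clusternode>
  </clusternodes>
</cluster>
"""

print(find_incomplete_nodes(SAMPLE))  # prints ['virt-009']
```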

This is probably because pcs runs ccs internally to do the job; ccs fails, but pcs silently ignores the error and fails later with HTTP 400. This is what pcs runs in the background:

[root@virt-007 ~]# ccs -i -f /etc/cluster/cluster.conf --addfenceinst "pcmk-redirect" virt-009 "pcmk-method"  "port=virt-009"
Fence device 'pcmk-redirect' not found.
[root@virt-017 ententyky]# echo $?
1
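
The non-zero exit status above is what a caller would need to check before continuing. As a rough sketch of that idea (this is not how pcs invokes ccs; run_checked and CommandError are hypothetical names, and it assumes Python 3.5+ for subprocess.run), a wrapper could stop at the first failing command instead of carrying on. A small Python one-liner stands in for the failing ccs call so the example runs without a cluster.

```python
import subprocess
import sys


class CommandError(Exception):
    """Raised when an external command exits with a non-zero status."""


def run_checked(cmd):
    """Run cmd and return its output; raise CommandError on a non-zero exit."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    if proc.returncode != 0:
        raise CommandError(
            "%s exited with %d: %s"
            % (cmd[0], proc.returncode, proc.stdout.strip())
        )
    return proc.stdout


# Stand-in for the failing ccs call: prints the same message and exits with 1.
failing = [
    sys.executable,
    "-c",
    "import sys; print(\"Fence device 'pcmk-redirect' not found.\"); sys.exit(1)",
]

try:
    run_checked(failing)
except CommandError as exc:
    print("Error:", exc)
```

With a check like this, the failure would surface at the ccs step with its original message rather than later as a generic HTTP 400.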



> Version-Release number of selected component (if applicable):

pcs-0.9.148-5.el6.x86_64



> Additional info:

We could either:
 - read cluster.conf beforehand to figure out what name has been used for the pcmk fencing device and use that one automatically
 - read cluster.conf beforehand and add our own secondary device if not present under the expected name
 - read cluster.conf beforehand and error out properly
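
To make the first two options more concrete, here is a minimal sketch using Python's xml.etree.ElementTree. It is not the attached fix; the helper names find_pcmk_device and ensure_device are hypothetical, and the only values taken from this report are the agent name fence_pcmk and the device name pcmk-redirect.

```python
import xml.etree.ElementTree as ET

# Device name pcs expects, according to the ccs command in this report.
DEFAULT_DEVICE = "pcmk-redirect"


def find_pcmk_device(root):
    """Option 1: return the name of an existing fence_pcmk device, or None."""
    for dev in root.iter("fencedevice"):
        if dev.get("agent") == "fence_pcmk":
            return dev.get("name")
    return None


def ensure_device(root, name=DEFAULT_DEVICE):
    """Option 2: add a fence_pcmk device called name unless that name is taken.

    Returns True if a device was added, False if the name already existed.
    """
    devices = root.find("fencedevices")
    if devices is None:
        devices = ET.SubElement(root, "fencedevices")
    for dev in devices.findall("fencedevice"):
        if dev.get("name") == name:
            return False
    ET.SubElement(devices, "fencedevice", {"agent": "fence_pcmk", "name": name})
    return True


root = ET.fromstring(
    '<cluster><fencedevices>'
    '<fencedevice agent="fence_pcmk" name="mypcmk"/>'
    '</fencedevices></cluster>'
)

print(find_pcmk_device(root))  # prints mypcmk
print(ensure_device(root))     # prints True, pcmk-redirect was added
print(ensure_device(root))     # prints False, it already exists now
```

The third option would amount to raising a clear error when find_pcmk_device returns None or the expected name is missing, before any change is written to cluster.conf.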

Comment 2 Tomas Jelinek 2016-08-26 06:47:47 UTC
Created attachment 1194217 [details]
proposed fix

Comment 3 Ivan Devat 2016-10-19 07:06:19 UTC
Setup:
> modify /etc/cluster/cluster.conf:
> in tag method attribute name: pcmk-method -> mymethod
> in tag device attribute name: pcmk-redirect -> mypcmk
> in tag fencedevice attribute name: pcmk-redirect -> mypcmk
> something like:

<cluster config_version="23" name="devcluster6">
  <fence_daemon/>
  <clusternodes>
    <clusternode name="vm-rhel67-1" nodeid="1">
      <fence>
        <method name="mymethod">
          <device name="mypcmk" port="vm-rhel67-1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="vm-rhel67-2" nodeid="2">
      <fence>
        <method name="mymethod">
          <device name="mypcmk" port="vm-rhel67-2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="vm-rhel67-3" nodeid="3">
      <fence>
        <method name="pcmk-method"/>
      </fence>
    </clusternode>
  </clusternodes>
  <cman broadcast="no" expected_votes="1" transport="udp" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="mypcmk"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>


Before Fix:

[vm-rhel67-1 ~] $ rpm -q pcs
pcs-0.9.148-7.el6_8.1.x86_64

[vm-rhel67-1 ~] $ pcs status | grep Online:
Online: [ vm-rhel67-1 vm-rhel67-2 ]
[vm-rhel67-1 ~] $ pcs status |grep "2 nodes"
2 nodes and 1 resource configured
[vm-rhel67-1 ~] $ pcs cluster localnode add vm-rhel67-3
Fence device 'pcmk-redirect' not found.

Error: error adding fence instance: vm-rhel67-3


After Fix:

[vm-rhel67-1 ~] $ rpm -q pcs
pcs-0.9.154-1.el6.x86_64

[vm-rhel67-1 ~] $ pcs status | grep Online:
Online: [ vm-rhel67-1 vm-rhel67-2 ]
[vm-rhel67-1 ~] $ pcs status |grep "2 nodes"
2 nodes and 1 resource configured

[vm-rhel67-1 ~] $ pcs cluster localnode add vm-rhel67-3
vm-rhel67-3: successfully added!

Comment 7 errata-xmlrpc 2017-03-21 11:03:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0707.html