Bug 1134426

Summary: pcs needs a better parser for corosync.conf
Product: Red Hat Enterprise Linux 7
Component: pcs
Version: 7.0
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Radek Steiger <rsteiger>
Assignee: Tomas Jelinek <tojeline>
QA Contact: cluster-qe <cluster-qe>
CC: cfeist, cluster-maint, jpokorny, tojeline
Target Milestone: rc
Fixed In Version: pcs-0.9.140-1.el7
Doc Type: Bug Fix
Doc Text:
Cause: The user edits the corosync.conf configuration file manually.
Consequence: Pcs misbehaves, as it is not able to read the file properly.
Fix: Implement a full-featured parser for the corosync.conf file.
Result: Pcs is able to read a manually edited corosync.conf file properly.
Type: Bug
Last Closed: 2015-11-19 09:32:49 UTC
Bug Blocks: 1142126
Attachments:
- corosync.conf
- Example patch (see the comment)
- proposed fix 1/3
- proposed fix 2/3
- proposed fix 3/3

Description Radek Steiger 2014-08-27 13:42:53 UTC
Created attachment 931475 [details]
corosync.conf

Description of problem:

(credit goes to Honza Friesse and Tomas Jelinek)

The internal parser that extracts nodes from corosync.conf uses the following code to grep for node names:

    # matches "ring0_addr: " anywhere on any line, regardless of context
    preg = re.compile(r'.*ring0_addr: (.*)')
    for line in lines:
        match = preg.match(line)
        if match:
            nodes.append(match.group(1))

This basically greps for _anything_ containing "ring0_addr:" and can lead to some crazy results. As an example, I created a cluster with "ring0_addr:" in the cluster name:

[root@virt-041 ~]# pcs cluster setup --name "ring0_addr: blabla" virt-041.cluster-qe.lab.eng.brq.redhat.com virt-042.cluster-qe.lab.eng.brq.redhat.com --start --enable
Shutting down pacemaker/corosync services...
Redirecting to /bin/systemctl stop  pacemaker.service
Redirecting to /bin/systemctl stop  corosync.service
Killing any remaining services...
Removing all cluster configuration files...
virt-041.cluster-qe.lab.eng.brq.redhat.com: Succeeded
virt-041.cluster-qe.lab.eng.brq.redhat.com: Starting Cluster...
virt-042.cluster-qe.lab.eng.brq.redhat.com: Succeeded
virt-042.cluster-qe.lab.eng.brq.redhat.com: Starting Cluster...
virt-041.cluster-qe.lab.eng.brq.redhat.com: Cluster Enabled
virt-042.cluster-qe.lab.eng.brq.redhat.com: Cluster Enabled


The getNodesFromCorosyncConf() function then produces the following results:

[root@virt-041 ~]# pcs status nodes corosync
Corosync Nodes:
 Online: virt-041.cluster-qe.lab.eng.brq.redhat.com virt-042.cluster-qe.lab.eng.brq.redhat.com 
 Offline: blabla 

[root@virt-041 ~]# pcs status pcsd
  blabla: Offline
  virt-041.cluster-qe.lab.eng.brq.redhat.com: Online
  virt-042.cluster-qe.lab.eng.brq.redhat.com: Online


The relevant sections of corosync.conf:

[root@virt-041 ~]# grep ring0 /etc/corosync/corosync.conf
cluster_name: ring0_addr: blabla
        ring0_addr: virt-041.cluster-qe.lab.eng.brq.redhat.com
        ring0_addr: virt-042.cluster-qe.lab.eng.brq.redhat.com
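
For illustration, the pattern above "finds" a node name in the cluster_name line just as readily as in a genuine node entry. A minimal standalone sketch (not part of pcs):

    import re

    preg = re.compile(r'.*ring0_addr: (.*)')
    # The cluster_name line matches exactly like a real node line:
    print(preg.match("cluster_name: ring0_addr: blabla").group(1))  # prints: blabla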



Version-Release number of selected component (if applicable):

pcs-0.9.115-32.el7

Comment 6 Jan Pokorný [poki] 2014-11-24 20:37:45 UTC
Created attachment 960932 [details]
Example patch (see the comment)

...but in fact this is just a drop in the ocean; a proper parser for the config format is needed to fix the underlying issue once and for all.
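
Such a parser has to be aware of sections. As a rough, hypothetical sketch (the helper name get_node_addrs is mine, not the actual pcs implementation), one can track the currently open corosync.conf sections and collect ring0_addr values only inside nodelist/node blocks:

    # Minimal sketch of a section-aware corosync.conf reader.
    # Not the actual pcs fix; get_node_addrs is an illustrative name.
    def get_node_addrs(conf_text):
        stack = []   # names of the sections currently open
        nodes = []
        for raw in conf_text.splitlines():
            line = raw.strip()
            if line.endswith("{"):
                # e.g. "nodelist {" opens a new section
                stack.append(line[:-1].strip())
            elif line == "}":
                if stack:
                    stack.pop()
            elif ":" in line and stack == ["nodelist", "node"]:
                name, _, value = line.partition(":")
                if name.strip() == "ring0_addr":
                    nodes.append(value.strip())
        return nodes

With this approach the "cluster_name: ring0_addr: blabla" line is ignored, because it sits in the totem section rather than in a nodelist/node block.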

Comment 7 Jan Pokorný [poki] 2014-11-24 22:50:03 UTC
FWIW, this is also the reason I added a tunable as a possible
workaround/preventive measure in the pcs-clufter interaction:

https://github.com/jnpkrn/clufter/commit/84d0e6b8bab10abd3f06db8b6f13967f5a809366

Comment 8 Tomas Jelinek 2015-02-19 15:44:56 UTC
Created attachment 993680 [details]
proposed fix 1/3

Comment 9 Tomas Jelinek 2015-02-19 15:45:15 UTC
Created attachment 993681 [details]
proposed fix 2/3

Comment 10 Tomas Jelinek 2015-02-19 15:45:35 UTC
Created attachment 993683 [details]
proposed fix 3/3

Comment 14 Tomas Jelinek 2015-06-04 14:21:55 UTC
Before Fix:
[root@rh71-node1 ~]# rpm -q pcs
pcs-0.9.137-13.el7_1.2.x86_64
[root@rh71-node1:~]# pcs cluster setup --name 'ring0_addr: blabla' rh71-node1 rh71-node2 --start --enable
Shutting down pacemaker/corosync services...
Redirecting to /bin/systemctl stop  pacemaker.service
Redirecting to /bin/systemctl stop  corosync.service
Killing any remaining services...
Removing all cluster configuration files...
rh71-node1: Succeeded
rh71-node2: Succeeded
Starting cluster on nodes: rh71-node1, rh71-node2...
rh71-node1: Starting Cluster...
rh71-node2: Starting Cluster...
rh71-node1: Cluster Enabled
rh71-node2: Cluster Enabled
[root@rh71-node1:~]# pcs status nodes corosync
Corosync Nodes:
 Online: rh71-node1 rh71-node2
 Offline: blabla 
[root@rh71-node1:~]# pcs status pcsd
  blabla: Offline
  rh71-node1: Online
  rh71-node2: Online



After Fix:
[root@rh71-node1:~]# rpm -q pcs
pcs-0.9.140-1.el6.x86_64
[root@rh71-node1:~]# pcs cluster setup --name 'ring0_addr: blabla' rh71-node1 rh71-node2 --start --enable
Shutting down pacemaker/corosync services...
Redirecting to /bin/systemctl stop  pacemaker.service
Redirecting to /bin/systemctl stop  corosync.service
Killing any remaining services...
Removing all cluster configuration files...
rh71-node1: Succeeded
rh71-node2: Succeeded
Starting cluster on nodes: rh71-node1, rh71-node2...
rh71-node1: Starting Cluster...
rh71-node2: Starting Cluster...
rh71-node1: Cluster Enabled
rh71-node2: Cluster Enabled
Synchronizing pcsd certificates on nodes rh71-node1, rh71-node2. pcsd needs to be restarted on the nodes in order to reload the certificates.
[root@rh71-node1:~]# pcs status nodes corosync
Corosync Nodes:
 Online: rh71-node1 rh71-node2 
 Offline: 
[root@rh71-node1:~]# pcs status pcsd
  rh71-node1: Online
  rh71-node2: Online

Comment 18 errata-xmlrpc 2015-11-19 09:32:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2290.html