Bug 695795

Summary: Do not ignore 'transport' if 'totem' node exists
Product: Red Hat Enterprise Linux 6
Reporter: Fabio Massimo Di Nitto <fdinitto>
Component: cluster
Assignee: Fabio Massimo Di Nitto <fdinitto>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Docs Contact:
Priority: medium
Version: 6.2
CC: agk, bubble, ccaulfie, cfeist, cluster-maint, djansa, donhoover, fdinitto, jkortus, lhh, rpeterso, sdake, swhiteho, teigland
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: cluster-3.0.12.1-11.el6
Doc Type: Bug Fix
Doc Text:
Cause: cman implements a complex set of checks to configure totem. One of the checks, which copies the configuration data, was not correct.
Consequence: the transport protocol option was not handled correctly.
Fix: change the cman copy code and checks to handle transport correctly.
Result: cman now handles the transport option properly.
Story Points: ---
Clone Of: 689128
Environment:
Last Closed: 2011-12-06 14:51:48 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 689128
Bug Blocks: 695794

Description Fabio Massimo Di Nitto 2011-04-12 17:49:05 UTC
+++ This bug was initially created as a clone of Bug #689128 +++

The attached patch fixes a typo in the code that causes <cman transport="..."/> to be ignored if a <totem/> XML node is present.

--- Additional comment from bubble on 2011-03-19 13:52:39 EDT ---

Created attachment 486397
Fix typo

--- Additional comment from fdinitto on 2011-03-22 05:01:22 EDT ---

Hi Vladislav,

In principle the patch is correct, but it cannot be applied as is and needs some more work. Consider:

<cluster>
 <cman transport="...."/>
 <totem transport="...."/>

In this case the patch should check for the conflict and either report an error saying that only one can be specified, or apply a bigger hammer: declare that the cman config has higher priority than totem and act accordingly.

Basically it needs a failsafe for bad configs.

Thanks
Fabio
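To make the two failsafe options above concrete, here is a minimal Python sketch that resolves the effective transport from a cluster.conf fragment. It is not cman's code (cman is written in C and works on corosync's objdb rather than raw XML); `resolve_transport`, `ConfigError` and the `cman_wins` switch are hypothetical names, and falling back to "udp" when neither element sets transport is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Optional


class ConfigError(Exception):
    """Raised when cluster.conf carries conflicting transport settings."""


def resolve_transport(conf_xml: str, cman_wins: bool = False) -> str:
    """Return the effective transport for a <cluster> fragment (hypothetical)."""
    root = ET.fromstring(conf_xml)
    cman = root.find("cman")
    totem = root.find("totem")
    # transport is an attribute of each element, if it is present at all
    cman_t: Optional[str] = cman.get("transport") if cman is not None else None
    totem_t: Optional[str] = totem.get("transport") if totem is not None else None

    if cman_t is not None and totem_t is not None:
        if not cman_wins:
            # Policy 1: reject the config, only one location may set transport
            raise ConfigError("transport specified in both <cman> and <totem>")
        # Policy 2 (the "bigger hammer"): <cman> has priority over <totem>
        return cman_t
    # At most one is set; "udp" (multicast) is assumed as the default here
    return cman_t or totem_t or "udp"
```

With `cman_wins=False`, a config containing both `<cman transport="udpu"/>` and `<totem transport="udp"/>` is rejected; with `cman_wins=True` it resolves to `"udpu"`. The post-patch behavior tested further down is stricter than either policy: any transport in `<totem>` is an error.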

--- Additional comment from bubble on 2011-03-22 05:09:14 EDT ---

Will <totem transport="...."/> pass validation?

--- Additional comment from fdinitto on 2011-03-22 05:25:13 EDT ---

(In reply to comment #3)
> Will <totem transport="...."/> pass validation?

Even if it doesn't, validation can always be turned off or set to warn only. It's a matter of being resilient to user errors and making sure expectations are met.

If you specify both, which one should win? And so on.

--- Additional comment from bubble on 2011-03-23 09:54:42 EDT ---

Created attachment 487042
2nd version of patch

Hi Fabio,

The attached patch should be close to what you requested.

Best,
Vladislav

--- Additional comment from fdinitto on 2011-03-23 10:58:56 EDT ---

(In reply to comment #5)
> Created attachment 487042
> 2nd version of patch
> 
> Hi Fabio,
> 
> attached should be close to what you've requested.
> 
> Best,
> Vladislav

Hi Vladislav,

At first glance the patch looks OK. I'll need to test it before merging it upstream.

Thanks a lot for your work!

Fabio

--- Additional comment from fdinitto on 2011-03-28 08:12:28 EDT ---

Hi Vladislav,

there is a substantial error in the patch.

transport is a key of totem, not an object underneath totem.

So basically this will never work.

Also, your patch triggers the error only when transport is specified in both cman and totem, but it should always take the error path whenever transport is specified in totem.
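The key-versus-object distinction can be illustrated with XML, treating cluster.conf attributes as objdb keys and child elements as objdb objects. This is an analogy assumed for illustration; the real check runs in C against corosync's objdb. The hypothetical `has_transport_buggy` mirrors the mistake, `has_transport_fixed` looks up the key, and `check_totem` takes the error path whenever `<totem>` sets transport, regardless of `<cman>`. The error text is the one printed by the post-patch `ccs_config_validate` runs in this report.

```python
import xml.etree.ElementTree as ET

# Error text copied from the post-patch ccs_config_validate output in this report
TRANSPORT_IN_TOTEM = ('Transport should not be specified within <totem .../>, '
                      'use <cman transport="..." /> instead')


def has_transport_buggy(totem: ET.Element) -> bool:
    # The mistake: looks for an object (child element) named transport,
    # i.e. <totem><transport/></totem>, which is not how the config expresses it.
    return totem.find("transport") is not None


def has_transport_fixed(totem: ET.Element) -> bool:
    # transport is a key (attribute) of totem: <totem transport="..."/>
    return "transport" in totem.attrib


def check_totem(conf_xml: str) -> None:
    """Raise whenever <totem> sets transport, whether or not <cman> does too."""
    totem = ET.fromstring(conf_xml).find("totem")
    if totem is not None and has_transport_fixed(totem):
        raise ValueError(TRANSPORT_IN_TOTEM)
```

For `<totem transport="udpu"/>` the buggy lookup returns `False`, so the check can never fire, which is why the patch "will never work".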

Comment 1 Fabio Massimo Di Nitto 2011-04-15 07:51:08 UTC
*** Bug 695794 has been marked as a duplicate of this bug. ***

Comment 5 Fabio Massimo Di Nitto 2011-08-08 09:37:29 UTC
http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=da36fb6bc9e7a8908011e41ef78235f6c0160ca6

Unit test results:

Pre-patch:

1) normal startup
  <cman/>
  <totem/>

[root@rhel6-node2 ~]# /etc/init.d/cman start
[all good]

grep -i transport /var/log/cluster/corosync.log
Aug 08 11:01:29 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).

OK

2) specify transport in <totem>

  <cman/>
  <totem transport="udpu"/>

[root@rhel6-node2 ~]# ccs_config_validate
Relax-NG validity error : Extra element totem in interleave
tempfile:5: element totem: Relax-NG validity error : Element cluster failed to validate content
Configuration fails to validate

[root@rhel6-node2 ~]# /etc/init.d/cman start
   Starting cman... Relax-NG validity error : Extra element totem in interleave
tempfile:5: element totem: Relax-NG validity error : Element cluster failed to validate content
Configuration fails to validate
[fails to start after a timeout]

The corosync process is left hanging in the background.

3) specify transport in cman

  <cman transport="udpu"/>
  <totem/>

[root@rhel6-node2 ~]# /etc/init.d/cman start
Starting cluster:
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman... Aug 08 11:06:47 corosync [MAIN  ] Corosync Cluster Engine ('1.3.2'): started and ready to provide service.
Aug 08 11:06:47 corosync [MAIN  ] Corosync built-in features: nss rdma
Aug 08 11:06:47 corosync [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
Aug 08 11:06:47 corosync [MAIN  ] Successfully parsed cman config
Aug 08 11:06:47 corosync [MAIN  ] Successfully configured openais services to load
Aug 08 11:06:47 corosync [MAIN  ] parse error in config: No multicast address specified
Aug 08 11:06:47 corosync [MAIN  ] Corosync Cluster Engine exiting with status 8 at main.c:1665.
corosync died: Could not read cluster configuration Check cluster logs for details
                                                           [FAILED]


4) broadcast/udpb (not supported in RHEL 6, but checked anyway for regressions):

  <cman broadcast="yes"/>

[root@rhel6-node2 ~]# ccs_config_validate
tempfile:4: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xF0 0xB3 0xAB 0x22
        <cman transport="udp" broadcast="-ð³«" nodename="rhel6-node2" cluster_id="25

This hit a memory corruption bug (note the garbage broadcast value).

Post-patch:

1) normal startup

  <cman/>
  <totem/>

[root@rhel6-node2 ~]# /etc/init.d/cman start
Starting cluster:
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
   Starting qdiskd...                                      [  OK  ]
   Waiting for quorum...                                   [  OK  ]
   Starting fenced...                                      [  OK  ]
   Starting dlm_controld...                                [  OK  ]
Starting gfs_controld:                                     [  OK  ]
   Unfencing self...                                       [  OK  ]
   Joining fence domain...                                 [  OK  ]

[root@rhel6-node2 daemon]# grep -i transport /var/log/cluster/corosync.log
Aug 08 11:09:02 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).

2) specify transport in <totem>

  <cman/>
  <totem transport="udpu"/>

[root@rhel6-node2 ~]# ccs_config_validate
Transport should not be specified within <totem .../>, use <cman transport="..." /> instead
Unable to get the configuration

[root@rhel6-node2 ~]# /etc/init.d/cman start
Starting cluster:
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman... Transport should not be specified within <totem .../>, use <cman transport="..." /> instead
Unable to get the configuration
corosync [MAIN  ] Corosync Cluster Engine ('1.3.2'): started and ready to provide service.
corosync [MAIN  ] Corosync built-in features: nss rdma
corosync [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
corosync [MAIN  ] Transport should not be specified within <totem .../>, use <cman transport="..." /> instead
corosync [MAIN  ] Corosync Cluster Engine exiting with status 8 at main.c:1616.
Transport should not be specified within <totem .../>, use <cman transport="..." /> instead
cman_tool: corosync daemon didn't start Check cluster logs for details
                                                           [FAILED]

3) specify transport in both cman and totem:

  <cman transport="udpu"/>
  <totem transport="udpu"/>

[root@rhel6-node2 ~]# ccs_config_validate
Transport should not be specified within <totem .../>, use <cman transport="..." /> instead
Unable to get the configuration

[root@rhel6-node2 ~]# /etc/init.d/cman start
Starting cluster:
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman... Transport should not be specified within <totem .../>, use <cman transport="..." /> instead
Unable to get the configuration
corosync [MAIN  ] Corosync Cluster Engine ('1.3.2'): started and ready to provide service.
corosync [MAIN  ] Corosync built-in features: nss rdma
corosync [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
corosync [MAIN  ] Transport should not be specified within <totem .../>, use <cman transport="..." /> instead
corosync [MAIN  ] Corosync Cluster Engine exiting with status 8 at main.c:1616.
Transport should not be specified within <totem .../>, use <cman transport="..." /> instead
cman_tool: corosync daemon didn't start Check cluster logs for details
                                                           [FAILED]

4) specify transport only in cman:

  <cman transport="udpu"/>
  <totem/>

[root@rhel6-node2 ~]# ccs_config_validate
Configuration validates

[root@rhel6-node2 ~]# /etc/init.d/cman start
[all good]

[root@rhel6-node2 daemon]# grep -i transport /var/log/cluster/corosync.log
Aug 08 11:12:33 corosync [TOTEM ] Initializing transport (UDP/IP Unicast).

The value is applied correctly.

5) quick check for regressions

  <cman transport="udp"/>
  <totem/>

[root@rhel6-node2 daemon]# grep -i transport /var/log/cluster/corosync.log
Aug 08 11:13:29 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).

  <cman transport="udpu"/>
(note: the <totem> config bits are dropped)

[root@rhel6-node2 daemon]# grep -i transport /var/log/cluster/corosync.log
Aug 08 11:14:23 corosync [TOTEM ] Initializing transport (UDP/IP Unicast).

  <cman transport="udpb"/>

[root@rhel6-node2 daemon]# grep -i transport /var/log/cluster/corosync.log
Aug 08 11:19:59 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).

Multicast addresses: 255.255.255.255

  <cman broadcast="yes"/>

[root@rhel6-node2 daemon]# grep -i transport /var/log/cluster/corosync.log
Aug 08 11:21:12 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).

Multicast addresses: 255.255.255.255
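Each regression check above boils down to reading the last "Initializing transport" line from corosync.log. A small Python sketch of that check follows. The mapping table only records what these test runs logged (udpb is still reported as "UDP/IP Multicast", with 255.255.255.255 as the address), and the names `last_transport` and `transport_applied` are hypothetical.

```python
import re
from typing import Optional

# Log strings observed in the test runs above for each configured transport
EXPECTED = {
    "udp": "UDP/IP Multicast",
    "udpu": "UDP/IP Unicast",
    "udpb": "UDP/IP Multicast",  # broadcast is still logged as multicast
}

_TRANSPORT_RE = re.compile(r"Initializing transport \(([^)]+)\)")


def last_transport(log_text: str) -> Optional[str]:
    """Return the transport named by the last 'Initializing transport' line."""
    matches = _TRANSPORT_RE.findall(log_text)
    return matches[-1] if matches else None


def transport_applied(log_text: str, configured: str) -> bool:
    """True if the most recent startup used the transport that was configured."""
    return last_transport(log_text) == EXPECTED[configured]
```

Taking the last match matters because corosync.log accumulates entries across restarts, so lines from earlier runs would otherwise mask the current one.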

Comment 8 Fabio Massimo Di Nitto 2011-08-18 07:49:57 UTC
http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=90182b28490ec8383e2d45dbab8c23a1d1420bdc

Post-patch:

All steps have been verified with cman_tool status and by checking corosync.log for the correct configuration. Traffic flow over altname has been checked with tcpdump and, in a couple of cases (not strictly necessary for this test), by bringing interfaces down and up.

1) normal startup

 no <cman>
 no <totem>
 no <altname>

[root@clusternet-node2 ~]# /etc/init.d/cman start
Starting cluster:
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
[snip]

[root@clusternet-node2 ~]# cman_tool status
Version: 6.2.0
Config Version: 2
Cluster Name: fabbione
Cluster Id: 25573
Cluster Member: Yes
Cluster Generation: 72
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Quorum device votes: 1
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 7
Flags:
Ports Bound: 0 178
Node name: clusternet-node2-eth1
Node ID: 2
Multicast addresses: 239.192.99.73
Node addresses: 192.168.4.2

[root@clusternet-node2 ~]# grep -i transport /var/log/cluster/corosync.log
Aug 18 09:35:52 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).

2) specify transport in <totem>

  <cman/>
  <totem transport="udpu"/>

[root@clusternet-node2 ~]# /etc/init.d/cman start
Starting cluster:
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman... Transport should not be specified within <totem .../>, use <cman transport="..." /> instead
Unable to get the configuration

3) specify transport in both cman and totem:

  <cman transport="udpu"/>
  <totem transport="udpu"/>

[root@clusternet-node2 ~]# /etc/init.d/cman start
Starting cluster:
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman... Transport should not be specified within <totem .../>, use <cman transport="..." /> instead
Unable to get the configuration

4) specify transport only in cman:

  <cman transport="udpu"/>
  <totem/>

[root@clusternet-node2 ~]# /etc/init.d/cman start
Starting cluster:
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
[snip]

5) do not specify totem at all

  <cman transport="udpu"/>

[root@clusternet-node2 ~]# /etc/init.d/cman start
Starting cluster:
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]

[root@clusternet-node2 ~]# grep -i transport /var/log/cluster/corosync.log
Aug 18 09:40:56 corosync [TOTEM ] Initializing transport (UDP/IP Unicast).

6) configure altname (no totem, no cman, no transport):

    <clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
      <altname name="clusternet-node1-eth2"/>

[root@clusternet-node2 ~]# /etc/init.d/cman start
Starting cluster:
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]

[root@clusternet-node2 ~]# ccs_config_validate
Configuration validates

7) add totem transport

  <totem transport="udpu"/>

[root@clusternet-node2 ~]# ccs_config_validate
Transport should not be specified within <totem .../>, use <cman transport="..." /> instead
Unable to get the configuration

[root@clusternet-node2 ~]# /etc/init.d/cman start
   Starting cman... Transport should not be specified within <totem .../>, use <cman transport="..." /> instead
Unable to get the configuration

8) switch to cman transport

  <cman transport="udpu"/>

[root@clusternet-node2 ~]# ccs_config_validate
Configuration validates

[root@clusternet-node2 ~]# /etc/init.d/cman start
Starting cluster:
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]

Comment 9 Don Hoover 2011-08-18 17:25:30 UTC
Fabio, did you actually test using the <TOTEM> section to change a few of the timing settings, such as token, token_retransmits_before_loss_const, join, consensus, etc., and make sure they were properly applied to the configuration when using UDPU?

That was the problem I had in 6.1: I had to take out the <totem> section, so I could no longer modify the defaults of the various values you can put into the totem section to modify quorum behavior.

Comment 10 Fabio Massimo Di Nitto 2011-08-18 18:16:25 UTC
(In reply to comment #9)
> Fabio, did you test actually using the <TOTEM> section to change a few of the
> timing settings such as token, token_retransmits_before_loss_const, join,
> consensus...and make sure they were properly applied to the configuration when
> using UDPU?
> 
> That was the problem I had in 6.1, I had to take out the <totem> section so I
> could no longer modify any of the defaults for the various values you can put
> into the totem section to modify quorum behavior.

This patch addresses exactly the problem you had.

The token values here are arbitrary, just to make them easier to spot in the config/objdb. The code that copies the values into totem is the same for every parameter; I am using token as a representative example.
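The point that one copy loop serves every totem parameter can be sketched as follows. This is a hypothetical Python model, not cman's C loader: the key names mimic the corosync-objctl output in test cases 9 and 10, `build_objdb` is an illustrative name, and it assumes the `<totem>` transport check has already rejected any transport attribute there.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def build_objdb(conf_xml: str) -> Dict[str, str]:
    """Hypothetical model of how cluster.conf values could land in objdb."""
    root = ET.fromstring(conf_xml)
    objdb: Dict[str, str] = {}
    totem = root.find("totem")
    cman = root.find("cman")
    if totem is not None:
        # One loop for every totem key: token, join, consensus, ...
        for key, value in totem.attrib.items():
            objdb[f"cluster.totem.{key}"] = value  # raw cluster.conf value
            objdb[f"totem.{key}"] = value          # copy into the totem object
    if cman is not None and "transport" in cman.attrib:
        # transport is taken from <cman>, never from <totem>
        objdb["totem.transport"] = cman.attrib["transport"]
    return objdb
```

Because the loop does not special-case any key, verifying one parameter such as token exercises the same path as token_retransmits_before_loss_const, join or consensus.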

Adding extra unit test cases per customer request:

9) specify totem token (no cman)

  <totem token="6000"/>


[root@rhel6-node2 ~]# ccs_config_validate 
Configuration validates

[root@rhel6-node2 ~]# /etc/init.d/cman start
Starting cluster: 
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
[snip]

[root@rhel6-node2 cluster]# corosync-objctl |grep totem |grep token
cluster.totem.token=6000
totem.token=6000

10) specify token and transport

  <cman transport="udpu"/>
  <totem token="16000"/>

[root@rhel6-node2 ~]# ccs_config_validate 
Configuration validates

[root@rhel6-node2 ~]# /etc/init.d/cman start
Starting cluster: 
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
[snip]

[root@rhel6-node2 cluster]# corosync-objctl |grep totem |grep token
cluster.totem.token=16000
totem.token=16000

[root@rhel6-node2 cluster]# corosync-objctl |grep totem |grep transport
totem.transport=udpu
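The corosync-objctl checks in cases 9 and 10 amount to confirming that every cluster.totem.<key> was copied unchanged to totem.<key>. A minimal sketch, assuming corosync-objctl prints one key=value pair per line as shown above; `parse_objctl` and `totem_copied` are hypothetical helpers.

```python
from typing import Dict


def parse_objctl(output: str) -> Dict[str, str]:
    """Parse 'key=value' lines (format assumed from the output above)."""
    pairs: Dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # skip lines that contain no '='
            pairs[key] = value
    return pairs


def totem_copied(objdb: Dict[str, str]) -> bool:
    """True if every cluster.totem.<key> appears as totem.<key> with the same value."""
    prefix = "cluster.totem."
    return all(
        objdb.get("totem." + key[len(prefix):]) == value
        for key, value in objdb.items()
        if key.startswith(prefix)
    )
```

Feeding it the case 10 output gives `totem.transport == "udpu"` and a successful copy check for token.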

Comment 12 Fabio Massimo Di Nitto 2011-10-27 08:34:04 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: cman implements a complex set of checks to configure totem. One of the checks, which copies the configuration data, was not correct.
Consequence: the transport protocol option was not handled correctly.
Fix: change the cman copy code and checks to handle transport correctly.
Result: cman now handles the transport option properly.

Comment 13 errata-xmlrpc 2011-12-06 14:51:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1516.html