Bug 1024962 - ccs --sync --activate is not atomic (and --activate alone does not work)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: ricci
Version: 6.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Chris Feist
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 1081248
 
Reported: 2013-10-30 16:28 UTC by Jaroslav Kortus
Modified: 2014-10-14 07:29 UTC
CC List: 8 users

Fixed In Version: ccs-0.16.2-71.el6
Doc Type: Bug Fix
Doc Text:
no docs needed
Clone Of:
: 1081248 (view as bug list)
Environment:
Last Closed: 2014-10-14 07:29:22 UTC
Target Upstream Version:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2014:1539 0 normal SHIPPED_LIVE ricci bug fix and enhancement update 2014-10-14 01:21:42 UTC

Description Jaroslav Kortus 2013-10-30 16:28:45 UTC
Description of problem:
When doing cluster updates, I issue the commands to one node and then, when the config is ready, I run ccs --sync --activate.

This behaviour is racy when the nodes are updated one by one: the later nodes receive the update with a delay. If I manage to stop the service on the first node before the update propagates to the last node, the service gets started there again.

The solution would be for the --sync action to first be done on all nodes and only then --activate on one of them (the reload should propagate automatically, as it does with cman_tool version -r).

Surprisingly, splitting the two and calling --sync first and --activate later does not yield the expected result either: --activate has absolutely no effect (watch it with:  watch -n1 'grep config_version /etc/cluster/cluster.conf; cman_tool version')

Version-Release number of selected component (if applicable):
ccs-0.16.2-69.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. ccs -h <cluster_node> --sync --activate
2. watch "Unable to load new config in corosync..." messages appearing in syslog on the last node
3.

Actual results:
cluster config being synced in a serialized way (node-by-node)
Error messages:
Oct 30 17:14:33 virt-019 corosync[10015]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Oct 30 17:14:33 virt-019 corosync[10015]:   [CMAN  ] Can't get updated config version 263: New configuration version has to be newer than current running configuration#012.
Oct 30 17:14:33 virt-019 corosync[10015]:   [CMAN  ] Activity suspended on this node
Oct 30 17:14:33 virt-019 corosync[10015]:   [CMAN  ] Error reloading the configuration, will retry every second
[ block repeating ]


Expected results:
1. serialized/parallel --sync (it does not really matter which here)
2. atomic --activate only after --sync has finished

3. --sync and --activate should each have their desired effect and work independently (currently --activate without --sync does not activate a new config)


Additional info:

Comment 2 Jaroslav Kortus 2013-10-30 17:17:56 UTC
cman_tool version -r performs essentially the same operation as ccs --sync --activate, but is not affected by this bug. It is also considerably faster.

Comment 3 Jan Pokorný [poki] 2013-10-30 17:24:03 UTC
re speed: partly a matter of the validation (can be suppressed with -i/--ignore)

Comment 4 Jan Pokorný [poki] 2013-10-30 18:55:27 UTC
re speed 2: also partly a matter of establishing a new SSL/TLS connection
            for each remote procedure call rather than reusing one
            connection per node (partially connected with the previous
            comment)

Comment 5 Chris Feist 2013-10-30 19:20:08 UTC
There are two separate issues here. First, --activate only works together with --sync; I need to improve the documentation and add an error message for when it is used without --sync.

The second issue is a bug in ccs: when syncing and activating, the activation only needs to be done (and succeed) on one node, but ccs attempts it on every node.  Updating this code to activate only once should give speed similar to cman_tool version -r.
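The "succeed on one node" behaviour can be sketched as a small shell loop; activate_node here is a placeholder standing in for the real per-node activation call that ccs makes through ricci (an assumption for illustration, not the actual ccs code):

```shell
#!/bin/sh
# Placeholder for the real per-node activation RPC; echoes so the
# control flow below is observable.
activate_node() {
    echo "activate $1"
    return 0
}

# Attempt activation node by node and stop at the first success,
# instead of activating on every node as the buggy code did.
activate_on_first_success() {
    for node in "$@"; do
        if activate_node "$node"; then
            return 0    # one successful activation propagates cluster-wide
        fi
    done
    return 1            # no node accepted the new config version
}
```

With two nodes, only the first reachable node is asked to activate; the cluster stack handles propagation from there.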

Comment 6 Jan Pokorný [poki] 2013-10-30 19:58:20 UTC
re raciness: it really looks like the messages sent via libcman, such as
             cman_set_version which causes the actual cluster.conf
             propagation (called by modcluster, ricci's helper, as the
             last step when told by ricci as instructed by
             ccs --sync --activate), are not guaranteed to be fully
             synchronous (socket buffers, omitted ACKs in the protocol),
             and consequently ccs is not guaranteed to finish *after* the
             configuration has been fully propagated everywhere;
             this could be ensured by an active wait after the ccs command
             (answer for how is left as an exercise ... but probably
              along the lines of "ssh $NODE corosync-objctl | grep ..."
              or perhaps clustat et al.)
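The active wait hinted at above can be sketched as a small polling helper. The command that reports the version is passed in as arguments, since what to run is left open by the comment; in practice it might be something like `ssh "$node" cman_tool version` (an assumption, not a prescription):

```shell
#!/bin/sh
# Poll a version-reporting command until its output mentions the expected
# config version, or give up after a fixed number of tries. The command
# to run is passed as the remaining arguments, e.g.:
#   wait_for_config_version "config 264" ssh "$node" cman_tool version
wait_for_config_version() {
    expect="$1"; shift
    tries=10
    while [ "$tries" -gt 0 ]; do
        if "$@" 2>/dev/null | grep -q "$expect"; then
            return 0    # node already reports the new version
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1            # propagation still pending after all retries
}
```

Running this against every node after ccs exits would close the window in which the tool has returned but the last node has not yet loaded the new config.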

Additionally, in the case of ccs this is amplified by the fact that
the sequence is (C=set cluster.conf, P=propagate): C1P C2P C3P ... CnP,
unlike the case of "cman_tool version -r": C1 C2 C3 ... Cn P

- simply because as of C1P, the cluster stack on the rest of the nodes
  (or, depending on the exact timing, a subset of them) has already
  switched to actively polling each second [1], and in the meantime ccs
  may have already exited while the "propagate" request has not been read
  yet, and other cluster manipulations executed (this may be Jarda's
  case, but it is hard to say)

-> perhaps a better solution:
   1. for each node in nodes: set cluster.conf without propagation
   2a. for each node in nodes: set the version to that of the cluster.conf being set
   2b. or, for one node (preferably the local one if possible): ditto
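The proposed sequence can be sketched as a small shell driver. sync_node and activate_once are placeholders for the corresponding ricci/modcluster calls (set_cluster.conf and set_cluster_version); wiring them up that way is an assumption for illustration:

```shell
#!/bin/sh
# Placeholder: push cluster.conf to "$1" without bumping the running version.
sync_node() {
    echo "sync $1"
}

# Placeholder: tell one node to propagate the new version cluster-wide.
activate_once() {
    echo "activate"
}

# Two-phase update as proposed: first distribute cluster.conf to every
# node without propagation (C1 C2 ... Cn), then trigger activation exactly
# once (a single P), mirroring what cman_tool version -r achieves.
two_phase_update() {
    for node in "$@"; do
        sync_node "$node" || return 1
    done
    activate_once
}
```

Because activation happens only after every node holds the new file, no node can be asked to load a version older than the one it is running.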

Point 2 requires the "set_cluster_version" method of ricci's modcluster
module to be supported by ccs.  It could also be exposed to the user,
indirectly and most fittingly via a standalone --activate switch (which
currently has no use without --sync and/or modification commands, and this
is not documented clearly, as noted by Jarda), with some warning along the
lines of:

> activating the config when it is not synced across the nodes causes trouble


[1] https://git.fedorahosted.org/cgit/cluster.git/tree/cman/daemon/commands.c?h=RHEL64#n1278

Comment 8 Chris Feist 2014-06-12 21:42:00 UTC
Fixed upstream here:

https://github.com/feist/ccs/commit/8ed1b2be878177bfffd171505e477feb981cdf07

Comment 9 Chris Feist 2014-06-13 01:38:25 UTC
Before Fix:

Two nodes, bid-05, bid-06, config is all synced up.

[root@bid-05 ~]# ccs --sync --activate -h localhost


In bid-06 logs

Jun 12 20:36:25 bid-06 corosync[15731]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Jun 12 20:36:25 bid-06 corosync[15731]:   [CMAN  ] Can't get updated config version 57: New configuration version has to be newer than current running configuration#012.
Jun 12 20:36:25 bid-06 corosync[15731]:   [CMAN  ] Activity suspended on this node
Jun 12 20:36:25 bid-06 corosync[15731]:   [CMAN  ] Error reloading the configuration, will retry every second



After Fix:

[root@bid-05 ~]# ccs --sync --activate -h localhost
[root@bid-05 ~]# rpm -q ccs
ccs-0.16.2-71.el6.x86_64

No errors in bid-06 (or bid-05) logs.

Comment 12 errata-xmlrpc 2014-10-14 07:29:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1539.html

