| Summary: | ccs --sync --activate is not atomic (and --activate alone does not work) | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Jaroslav Kortus <jkortus> | |
| Component: | ricci | Assignee: | Chris Feist <cfeist> | |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | medium | |||
| Version: | 6.6 | CC: | ccaulfie, cluster-maint, fdinitto, jpokorny, rpeterso, rsteiger, slevine, teigland | |
| Target Milestone: | rc | |||
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | ccs-0.16.2-71.el6 | Doc Type: | Bug Fix | |
| Doc Text: | no docs needed | Story Points: | --- | |
| Clone Of: | ||||
| : | 1081248 | Environment: | ||
| Last Closed: | 2014-10-14 07:29:22 UTC | Type: | Bug | |
| Bug Depends On: | ||||
| Bug Blocks: | 1081248 | |||
|
Description
Jaroslav Kortus
2013-10-30 16:28:45 UTC
cman_tool version -r performs the same operation as ccs --sync --activate, but is not affected by this bug. It is also considerably faster.

re speed: partly a matter of the validation (which can be suppressed with -i/--ignore)
re speed 2: also partly a matter of not using one connection per node, but
rather establishing a new SSL/TLS connection for each remote
procedure call (partially connected with the previous comment)

There are two separate issues here. First, --activate only works with --sync; I need to improve the documentation and add an error message if it is used without --sync. Second, there is a bug in ccs: when syncing and activating, the activation only needs to be done (and to succeed) on one node, but ccs attempts it on every node. Updating this code to do it only once should give a speed similar to cman_tool version -r.

re raciness: it really looks like the messages sent via libcman, such as
cman_set_version causing actual cluster.conf propagation
(called by modcluster, ricci's helper, as the last step when
told to by ricci, itself instructed by ccs --sync --activate),
are not guaranteed to be fully synchronous (socket buffers,
omitted ACKs in the protocol), and consequently that ccs is
not guaranteed to finish *after* the configuration was fully
propagated everywhere;
this could be achieved by active wait after the ccs command
(answer for how is left as an exercise ... but probably
along the lines of "ssh $NODE corosync-objctl | grep ..."
or perhaps clustat et al.)
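The active wait suggested above could look roughly like the polling loop below. This is only a sketch under stated assumptions, not something ccs does: `wait_for_config_version` and the `get_version` callable are hypothetical names, and `get_version` stands in for whichever per-node query is eventually chosen (the comment above deliberately leaves the exact command open).

```python
import time
from typing import Callable, Iterable


def wait_for_config_version(
    nodes: Iterable[str],
    expected: int,
    get_version: Callable[[str], int],  # hypothetical: returns a node's running config version
    timeout: float = 30.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until every node reports at least `expected`; False on timeout."""
    pending = set(nodes)
    deadline = clock() + timeout
    while True:
        # Keep only the nodes that still run an older configuration.
        pending = {n for n in pending if get_version(n) < expected}
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would run this right after the ccs command and only proceed with further cluster manipulation once it returns True.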
Additionally, this is, in case of ccs, boosted by the fact that
the sequence is (C=set cluster.conf, P=propagate): C1P C2P C3P ... CnP
unlike the case of "cman_tool version -r": C1 C2 C3 ... Cn P
- simply because as of C1P, the cluster stack on the rest (or, depending
on the exact timing, its subset) of the nodes is already switched to
active polling each second [1], and in this meantime ccs may have
already exited while "propagate" request hasn't been read yet, and other
cluster manipulations executed (may be Jarda's case, but it's hard to say)
-> perhaps better solution:
1. for each node in nodes: set cluster.conf without propagation
2a. for each node in nodes: set version to that of cluster.conf being set
2b. for one node (preferably local one if possible): ditto
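To make the difference in ordering concrete, here is a small sketch contrasting the interleaved sequence (C1P C2P ... CnP) with the proposed two-phase one (C1 C2 ... Cn P, variant 2b). The function names and the `set_conf` / `propagate` callables are hypothetical stand-ins for the corresponding ricci calls; this illustrates the ordering only and is not the actual ccs code.

```python
from typing import Callable, Optional, Sequence


def interleaved_sync_activate(
    nodes: Sequence[str],
    set_conf: Callable[[str], None],
    propagate: Callable[[str], None],
) -> None:
    """Ordering described for ccs: set and propagate on each node in turn."""
    for node in nodes:
        set_conf(node)
        propagate(node)


def two_phase_sync_activate(
    nodes: Sequence[str],
    set_conf: Callable[[str], None],
    propagate: Callable[[str], None],
    local_node: Optional[str] = None,
) -> None:
    """Proposed ordering: set cluster.conf everywhere, then propagate once."""
    if not nodes:
        return
    # Step 1: distribute the new cluster.conf without activating it.
    for node in nodes:
        set_conf(node)
    # Step 2b: trigger the version bump on a single node, preferring the local one.
    trigger = local_node if local_node in nodes else nodes[0]
    propagate(trigger)
```

With the two-phase variant, no node is asked to load a new version before every node already holds the matching cluster.conf.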
Point 2. requires "set_cluster_version" method of ricci's modcluster module
to be supported by ccs. It could also be exposed to user, indirectly and
most fittingly via standalone --activate switch (which currently has no use
without --sync and/or modification commands and this is not documented
clearly as noted by Jarda) and with some warning ala:
> activating the config when not sync'ed across the nodes causes troubles
[1] https://git.fedorahosted.org/cgit/cluster.git/tree/cman/daemon/commands.c?h=RHEL64#n1278
Fixed upstream here: https://github.com/feist/ccs/commit/8ed1b2be878177bfffd171505e477feb981cdf07

Before Fix:
Two nodes, bid-05, bid-06, config is all synced up.
[root@bid-05 ~]# ccs --sync --activate -h localhost
In bid-06 logs:
Jun 12 20:36:25 bid-06 corosync[15731]: [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Jun 12 20:36:25 bid-06 corosync[15731]: [CMAN ] Can't get updated config version 57: New configuration version has to be newer than current running configuration#012.
Jun 12 20:36:25 bid-06 corosync[15731]: [CMAN ] Activity suspended on this node
Jun 12 20:36:25 bid-06 corosync[15731]: [CMAN ] Error reloading the configuration, will retry every second

After Fix:
[root@bid-05 ~]# ccs --sync --activate -h localhost
[root@bid-05 ~]# rpm -q ccs
ccs-0.16.2-71.el6.x86_64
No errors in bid-06 (or bid-05) logs.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHBA-2014-1539.html |