Bug 1078361 - Implement dynamic token timeout and make it default
Summary: Implement dynamic token timeout and make it default
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: corosync
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 1111381 1074673 1082167 1108538
Reported: 2014-03-19 16:31 UTC by Jan Friesse
Modified: 2015-03-05 08:26 UTC (History)

Fixed In Version: corosync-2.3.4-3.el7
Doc Type: Bug Fix
Doc Text:
Cause: The user dynamically adds or removes nodes.
Consequence: The token timeout may be too small.
Fix: A dynamic token timeout is implemented. A new option, token_coefficient, is added. It is used only when the nodelist section is specified and contains at least 3 nodes. If so, the real token timeout is computed as token + (number_of_nodes - 2) * token_coefficient. This allows the cluster to scale without manually changing the token timeout every time a new node is added. The value can be set to 0, which effectively disables the feature. The default is 650 milliseconds.
Result: Corosync handles dynamic adding and removing of nodes.
Clone Of:
Environment:
Last Closed: 2015-03-05 08:26:46 UTC
Target Upstream Version:
Embargoed:


Attachments
config: Handle totem_set_volatile_defaults errors (1.05 KB, patch), 2014-06-12 07:39 UTC, Jan Friesse
Log: Make reload of logging work (2.58 KB, patch), 2014-06-12 07:39 UTC, Jan Friesse
Really clear totemconfig nodes on reload (1.95 KB, patch), 2014-06-12 07:39 UTC, Jan Friesse
totemconfig: Key change process dependencies (19.90 KB, patch), 2014-06-12 07:39 UTC, Jan Friesse
totemconfig: Log errors on key change and reload (4.99 KB, patch), 2014-06-12 07:39 UTC, Jan Friesse
Add token_coefficient option (3.49 KB, patch), 2014-06-12 07:39 UTC, Jan Friesse
config: Allow dynamic change of token_coefficient (1.26 KB, patch), 2014-06-12 07:40 UTC, Jan Friesse
manpage: Fix English (13.76 KB, patch), 2014-10-13 14:41 UTC, Jan Friesse
Store configuration values used by totem to cmap (3.28 KB, patch), 2014-10-13 14:41 UTC, Jan Friesse
man page: Improve description of token timeout (1.42 KB, patch), 2014-10-13 14:42 UTC, Jan Friesse



Links
System: Red Hat Product Errata
ID: RHBA-2015:0365
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: corosync bug fix and enhancement update
Last Updated: 2015-03-05 12:51:37 UTC

Description Jan Friesse 2014-03-19 16:31:05 UTC
Description of problem:
Corosync currently supports only a static token timeout setting. It would be nice to have a "dynamic" token timeout. This will work ONLY with a nodelist and will reflect ONLY changes in the nodelist.

Version-Release number of selected component (if applicable):
2.3.3

How reproducible:
100%

Actual results:
It's impossible to set a dynamic token timeout.


Expected results:
Possibility to set a dynamic token timeout.

Additional info:

Comment 4 Jan Friesse 2014-06-12 07:39:14 UTC
Created attachment 907987 [details]
config: Handle totem_set_volatile_defaults errors

config: Handle totem_set_volatile_defaults errors

When totem_set_volatile_defaults is called from totem_config_validate,
the return code is unchecked.

It's then perfectly possible to set (for example) the join timeout to a very
small value (1); the consensus value is then set to 0, making corosync
unable to create membership.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 5 Jan Friesse 2014-06-12 07:39:34 UTC
Created attachment 907988 [details]
Log: Make reload of logging work

Log: Make reload of logging work

When reload was called multiple times (~20), logging to file stopped
working.

The main problem was that the log file was opened multiple times:
although target_id was shared among subsystem loggers, the file
name was not.

The solution is to ALWAYS set the proper log file name in the subsystem logger
(a copy is stored). This not only fixes the problem but also removes a small
leak.

Also, if the file name didn't change, the function can return sooner.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 6 Jan Friesse 2014-06-12 07:39:39 UTC
Created attachment 907989 [details]
Really clear totemconfig nodes on reload

Really clear totemconfig nodes on reload

When reload was called, nodes were constantly added to the totemconfig
nodelist.

So a simple corosync-cfgtool -R very quickly resulted in filling the whole
array and a segfault.

The solution is to clear member_count.

Clearing is also moved directly into put_nodelist_members_to_config to
make sure it's always processed.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 7 Jan Friesse 2014-06-12 07:39:48 UTC
Created attachment 907990 [details]
totemconfig: Key change process dependencies

totemconfig: Key change process dependencies

When a key with dependencies was changed, the dependent keys were not recomputed.
A good example is the consensus timeout. If the token timeout was changed,
the consensus timeout was not recomputed correctly (neither via a cmap change
of the key nor via a cfg reload).

The solution is an almost complete refactor of the handling of volatile defaults.

totem_volatile_config_read now handles not only storing cmap keys to the
totem_config structure, but also checking for existence, comparing with
the zero value, and properly storing defaults.

totem_set_volatile_defaults is gone. Its function was split into the
totem_volatile_config_read and totem_volatile_config_validate functions.

The reload callback and the key change callback are now mostly the same function,
and both call totem_volatile_config_read.

The patch also fixes a small memory leak. The totem.vsftype key has not been used for
a long time, and the original totem_volatile_config_read wasn't freeing
the allocated memory returned by icmap_get_string. The whole reading of
totem.vsftype is removed.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
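
The dependency problem described in this patch can be illustrated with a minimal Python sketch. This is not corosync code (the real implementation is in C): the class and method names are hypothetical, and the only relation modelled is consensus defaulting to 1.2 * token, which is assumed here because it matches the runtime values reported later in this bug. The point is that derived defaults are rebuilt on every read, so changing token can never leave a stale consensus behind.

```python
class VolatileConfig:
    """Toy model of config keys whose defaults depend on other keys."""

    def __init__(self, token=1000):
        # Only values the user set explicitly are stored.
        self.explicit = {"token": token}

    def set_key(self, name, value):
        """Record an explicit value, as a cmap set or a reload would."""
        self.explicit[name] = value

    def read(self):
        """Rebuild the whole effective config, defaults included."""
        cfg = dict(self.explicit)
        if "consensus" not in cfg:
            # Assumed default: 1.2 * token, done in integer arithmetic.
            cfg["consensus"] = cfg["token"] * 12 // 10
        return cfg


if __name__ == "__main__":
    config = VolatileConfig(token=1000)
    print(config.read())
    config.set_key("token", 1650)
    print(config.read())  # consensus follows the new token value
```

An explicitly set consensus is left alone, so only defaults track their base key.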

Comment 8 Jan Friesse 2014-06-12 07:39:54 UTC
Created attachment 907991 [details]
totemconfig: Log errors on key change and reload

totemconfig: Log errors on key change and reload

When a volatile key was changed (cmap set or reload) and the checks failed,
nothing was logged.

Values are now checked and an error string is logged on problems.

Also, totem_config is dumped to the log (DEBUG level) after every
volatile key change and every reload.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 9 Jan Friesse 2014-06-12 07:39:59 UTC
Created attachment 907992 [details]
Add token_coefficient option

Add token_coefficient option

The token coefficient is used only when a nodelist is specified and contains
at least 3 nodes. If so, the real token timeout is computed as
token + (number_of_nodes - 2) * token_coefficient. This allows the cluster
to scale without manually changing the token timeout every time a new
node is added. This value can be set to 0, which effectively
disables the feature.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
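
As a worked example of the formula above, here is a small Python sketch. It is an illustration only, not corosync code: the function name is hypothetical, the 650 ms default is the one stated in this bug's Doc Text, and the base token of 1000 ms is an assumption taken from the 2-node runtime value reported later in this bug.

```python
def effective_token_timeout(token_ms, node_count, token_coefficient_ms=650):
    """Return the real token timeout in milliseconds.

    node_count is the number of nodes in the nodelist. The coefficient
    applies only when the nodelist contains at least 3 nodes.
    """
    if node_count < 3:
        return token_ms
    return token_ms + (node_count - 2) * token_coefficient_ms


if __name__ == "__main__":
    for nodes in (2, 3, 6, 7):
        print(nodes, effective_token_timeout(1000, nodes))
```

Passing token_coefficient_ms=0 returns the configured token unchanged for any node count, which corresponds to switching the feature off.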

Comment 10 Jan Friesse 2014-06-12 07:40:07 UTC
Created attachment 907993 [details]
config: Allow dynamic change of token_coefficient

config: Allow dynamic change of token_coefficient

A token_coefficient change in cmap didn't trigger a change. So the only way
to change token_coefficient was to edit the config file and reload.

The patch lets the key totem.token_coefficient be processed, so
token_coefficient can be changed dynamically.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 15 Jaroslav Kortus 2014-10-07 14:23:52 UTC
Just a thought while I'm testing this functionality...

While this is now the default, it should probably be mentioned in the man page in the description of the "token" option, so that one does not get the impression that the value configured here is really used as is.

It could also be worth considering whether setting the token manually should disable the automatic logic. If the user sets the token to a specific value, it should be taken as an informed decision and the coefficient should be set to 0 automatically.

What are your opinions on these?

Comment 16 Jaroslav Kortus 2014-10-07 14:26:18 UTC
And one more note. As it's not possible to determine the runtime value, users might still think that the token timeout is configured to the number they see in the config, while in fact a different value is used.

The only way to find out is to run with "debug: on", which is not the default. This might cause some trouble when customers upgrade from 7.0 to 7.1 (their token timeout might change, despite what they may have configured manually).

Comment 17 Jan Friesse 2014-10-07 14:35:17 UTC
Jardo,
as you can see, token_coefficient is documented, and I strongly believe this documentation is enough.

token_coefficient is NOT the default. It's the default ONLY for configurations with a nodelist, so configurations with multicast are not affected.

My opinion on setting token_coefficient to 0 when token is manually set is clear. Take a look at the man page:

real token timeout is then computed as token + (number_of_nodes - 2) * token_coefficient

If we added more auto-magic, it would probably be much more confusing for the user.

Storing the REAL token timeout in use is a good idea and it's worth filing an RFE.

Comment 18 Jan Friesse 2014-10-13 14:41:43 UTC
Created attachment 946428 [details]
manpage: Fix English

manpage: Fix English

While I was looking at the above man page changes I thought I'd review
the rest of it. So here are some more English fixes for the cmap_keys.8
man page.

Signed-off-by: Christine Caulfield <ccaulfie>
Reviewed-by: Jan Friesse <jfriesse>

Comment 19 Jan Friesse 2014-10-13 14:41:55 UTC
Created attachment 946429 [details]
Store configuration values used by totem to cmap

Store configuration values used by totem to cmap

Some totem configuration values (like token, consensus, ...) are either
computed or a default value is used. It's hard to find out which
value is really used.

The solution is to store the values in cmap.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 20 Jan Friesse 2014-10-13 14:42:06 UTC
Created attachment 946430 [details]
man page: Improve description of token timeout

man page: Improve description of token timeout

With the introduction of token_coefficient, the token timeout defined in
the configuration file may no longer reflect the real token timeout, which may
be confusing.

The enhanced description hopefully fixes that.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 21 Jan Friesse 2014-10-13 14:46:41 UTC
The last 3 patches improve the documentation and add runtime.config.* cmap keys, which reflect the current state of the internal configuration used by totem (including the token timeout). This should fix the problems found by QA.

"Unit" test:

3 nodes in nodelist

# corosync-cmapctl  | grep runtime.config
runtime.config.totem.consensus (u32) = 1980
runtime.config.totem.downcheck (u32) = 1000
runtime.config.totem.fail_recv_const (u32) = 2500
runtime.config.totem.heartbeat_failures_allowed (u32) = 0
runtime.config.totem.hold (u32) = 303
runtime.config.totem.join (u32) = 50
runtime.config.totem.max_messages (u32) = 17
runtime.config.totem.max_network_delay (u32) = 50
runtime.config.totem.merge (u32) = 200
runtime.config.totem.miss_count_const (u32) = 5
runtime.config.totem.rrp_autorecovery_check_timeout (u32) = 1000
runtime.config.totem.rrp_problem_count_mcast_threshold (u32) = 100
runtime.config.totem.rrp_problem_count_threshold (u32) = 10
runtime.config.totem.rrp_problem_count_timeout (u32) = 2000
runtime.config.totem.rrp_token_expired_timeout (u32) = 392
runtime.config.totem.send_join (u32) = 0
runtime.config.totem.seqno_unchanged_const (u32) = 30
runtime.config.totem.token (u32) = 1650
runtime.config.totem.token_retransmit (u32) = 392
runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
runtime.config.totem.window_size (u32) = 50

- Reduce number of nodes to 2 (simply comment out 1 node in nodelist)
# corosync-cfgtool -R
Reloading corosync.conf...
Done

# corosync-cmapctl  | grep runtime.config
runtime.config.totem.consensus (u32) = 1200
runtime.config.totem.downcheck (u32) = 1000
runtime.config.totem.fail_recv_const (u32) = 2500
runtime.config.totem.heartbeat_failures_allowed (u32) = 0
runtime.config.totem.hold (u32) = 180
runtime.config.totem.join (u32) = 50
runtime.config.totem.max_messages (u32) = 17
runtime.config.totem.max_network_delay (u32) = 50
runtime.config.totem.merge (u32) = 200
runtime.config.totem.miss_count_const (u32) = 5
runtime.config.totem.rrp_autorecovery_check_timeout (u32) = 1000
runtime.config.totem.rrp_problem_count_mcast_threshold (u32) = 100
runtime.config.totem.rrp_problem_count_threshold (u32) = 10
runtime.config.totem.rrp_problem_count_timeout (u32) = 2000
runtime.config.totem.rrp_token_expired_timeout (u32) = 238
runtime.config.totem.send_join (u32) = 0
runtime.config.totem.seqno_unchanged_const (u32) = 30
runtime.config.totem.token (u32) = 1000
runtime.config.totem.token_retransmit (u32) = 238
runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
runtime.config.totem.window_size (u32) = 50

Note especially the difference in runtime.config.totem.token.
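
To compare two such dumps without reading them line by line, a short Python sketch can parse the text and list the keys that differ. The "key (type) = value" line format is taken from the output above; the function names are hypothetical, and the sketch works on text that was already captured rather than invoking corosync-cmapctl itself.

```python
import re

# Matches lines such as: runtime.config.totem.token (u32) = 1650
LINE_RE = re.compile(r"^(\S+) \((\w+)\) = (.*)$")


def parse_cmap(text):
    """Parse captured cmapctl-style output into a dict of key -> value."""
    values = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip prompts, blank lines and anything else
        key, type_name, raw = match.groups()
        # Integer types (u32, i16, ...) become ints, the rest stay strings.
        is_int = type_name.startswith(("u", "i"))
        values[key] = int(raw) if is_int else raw
    return values


def changed_keys(before, after):
    """Return {key: (old, new)} for keys whose value differs."""
    return {
        key: (before[key], after[key])
        for key in sorted(before)
        if key in after and before[key] != after[key]
    }


if __name__ == "__main__":
    three_nodes = parse_cmap(
        "runtime.config.totem.consensus (u32) = 1980\n"
        "runtime.config.totem.join (u32) = 50\n"
        "runtime.config.totem.token (u32) = 1650\n"
    )
    two_nodes = parse_cmap(
        "runtime.config.totem.consensus (u32) = 1200\n"
        "runtime.config.totem.join (u32) = 50\n"
        "runtime.config.totem.token (u32) = 1000\n"
    )
    for key, (old, new) in changed_keys(three_nodes, two_nodes).items():
        print(key, old, "->", new)
```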

Comment 23 Jaroslav Kortus 2014-10-27 17:07:12 UTC
on corosync-2.3.4-3.el7.x86_64:
7-node cluster:
runtime.config.totem.consensus (u32) = 5100
runtime.config.totem.token (u32) = 4250

1 node removed:
runtime.config.totem.consensus (u32) = 4320
runtime.config.totem.token (u32) = 3600

1 node returned:
runtime.config.totem.consensus (u32) = 5100
runtime.config.totem.token (u32) = 4250

2-node cluster:
runtime.config.totem.consensus (u32) = 1200
runtime.config.totem.token (u32) = 1000

Thank you for adding the runtime values to cmap. Also the man page is now updated and I don't find it confusing any more. Well done :).

Having said that, I think this change deserves a release note, so I added a flag for it as well (as this basically touches every cluster out there that is not a default 2-node cluster).
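
The values reported in this comment can be cross-checked with a short Python sketch. It rests on assumptions not stated in the comment itself: a configured token of 1000 ms (the value seen with 2 nodes), the default token_coefficient of 650 ms, and consensus defaulting to 1.2 * token. The names are hypothetical and the code only does arithmetic; it does not talk to corosync.

```python
# Values reported above for corosync-2.3.4-3.el7:
# node count -> (token, consensus), in milliseconds.
# "1 node removed" from the 7-node cluster is treated as 6 nodes.
REPORTED = {7: (4250, 5100), 6: (3600, 4320), 2: (1000, 1200)}

BASE_TOKEN_MS = 1000        # assumed: token value seen with 2 nodes
TOKEN_COEFFICIENT_MS = 650  # default stated in this bug


def expected_runtime(node_count):
    """Return the (token, consensus) predicted for a given nodelist size."""
    extra_nodes = max(node_count - 2, 0)
    token = BASE_TOKEN_MS + extra_nodes * TOKEN_COEFFICIENT_MS
    consensus = token * 12 // 10  # assumed default: 1.2 * token
    return token, consensus


if __name__ == "__main__":
    for nodes, reported in sorted(REPORTED.items()):
        predicted = expected_runtime(nodes)
        status = "match" if predicted == reported else "MISMATCH"
        print(nodes, predicted, reported, status)
```

Under these assumptions all three reported pairs agree with the formula.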

Comment 25 errata-xmlrpc 2015-03-05 08:26:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0365.html

