Bug 1774143

Summary: [Support RFE] Make it easier to raise corosync totem token
Product: Red Hat Enterprise Linux 8
Reporter: John Ruemker <jruemker>
Component: pcs
Assignee: Ondrej Mular <omular>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Docs Contact: Steven J. Levine <slevine>
Priority: high
Version: 8.3
CC: cfeist, cluster-maint, idevat, mlisik, mmazoure, mpospisi, nhostako, omular, sbradley, slevine, tojeline
Target Milestone: rc
Keywords: FutureFeature, Triaged
Target Release: 8.4
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: pcs-0.10.7-3.el8
Doc Type: Enhancement
Doc Text:
Feature: Allow changing the corosync totem token value. Reason: Users need to raise the corosync totem token value to avoid fencing during temporary system unresponsiveness. Result: A new command, 'pcs cluster config update', was introduced to change the corosync configuration, including the totem token value.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-05-18 15:12:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1774149

Description John Ruemker 2019-11-19 17:09:04 UTC
Most customers accept the default totem token setting in their clusters.  However, it is very common for our customers to experience node fencing as a result of temporary system unresponsiveness, and usually we have to suggest raising totem token before a customer will even consider that as a possible adjustment.  

Also, many admins don't even know what the totem token is, so they don't find the documentation or options on their own when they want to dial back how aggressively their cluster fences nodes.

The Red Hat Support team has long dealt with a high volume of customers asking for root-cause analysis of fence events or node-reboots.  The most common scenario for node fencing is a node becoming temporarily unresponsive longer than corosync's communication timeout, and many customers are left with the impression that RHEL HA is "unstable" when they see node fencing, when in reality a 1s timeout is often just too aggressive for many environments. 

Even though you can set token timeout at setup-time with 'pcs cluster setup totem ...', there is no way yet to do this after the fact.  Also, like I said, many customers don't know what totem is, or what token is, so even with the setup option many admins miss this.

The goal of this request is to make it easier for customers to raise the totem token value using pcs.  I would like to propose a few different ways to do that:

1) Offer a command to change totem token after setup, such as: 

   # pcs cluster update --token=10s


2) Make a command to do the same but with more obvious terminology and simpler for a novice administrator to figure out.  Maybe we could have variations of this under both 'pcs cluster' and 'pcs stonith'.   For example: 

   # pcs cluster communication-timeout (--aggressive|--moderate|--lenient|--timeout=X)

   # pcs stonith communication-timeout (--aggressive|--moderate|--lenient|--timeout=X)

and those options might set something like 1s, 10s, 60s.  


3) In bug #1774132, I proposed we add a 'pcs diagnostics' command, and if we do that, we could offer a variation of this token-adjustment as a subcommand.  For example:

   # pcs diagnostics communication-timeout (--aggressive|--moderate|--lenient|--timeout=X)

   # pcs diagnostics setup  # (which would deploy all the diagnostics, including a --moderate communication timeout).


Obviously these are all redundant.  My goal with these suggestions is to make it more obvious to admins that they can tweak how aggressive the cluster is with fencing, and to put those commands in several places so they'll be more likely to actually discover it. 

My top priority is to at least make it possible to update token through SOME command, so we don't have to guide customers through a config file adjustment whenever there is a fencing incident.  

Once we can do that, then these other suggestions seem like they'd be easy to add as alternative ways to do the same thing.
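
For illustration only, the preset idea from proposal 2 could be sketched as a thin shell wrapper. The function names below are hypothetical, the 1s/10s/60s values are just the ones floated above (converted to milliseconds, the unit corosync uses for token), and the sketch assumes the 'pcs cluster config update totem token=<ms>' form of the command named in the Doc Text. It only prints the command so an admin can review it before running it.

```shell
#!/bin/sh
# Hypothetical mapping from the preset names in proposal 2 to token values
# in milliseconds. These are the values suggested in this report, not
# shipped defaults.
preset_to_token_ms() {
    case "$1" in
        aggressive) echo 1000 ;;
        moderate)   echo 10000 ;;
        lenient)    echo 60000 ;;
        *) echo "unknown preset: $1" >&2; return 1 ;;
    esac
}

# Print (rather than run) the pcs command for a preset.
token_update_command() {
    ms=$(preset_to_token_ms "$1") || return 1
    echo "pcs cluster config update totem token=$ms"
}
```

For example, 'token_update_command moderate' prints 'pcs cluster config update totem token=10000'.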

Comment 1 John Ruemker 2019-11-19 17:12:47 UTC
Related RFE filed in RHEL 7 and linked by many customers asking for ability to change arbitrary corosync values through pcs: https://bugzilla.redhat.com/show_bug.cgi?id=1173346

This request I've described here is more narrow, asking just for totem token.

Comment 10 Miroslav Lisik 2020-12-16 16:33:11 UTC
Proposed fix + tests in attachment 1739694 (bz1667061 comment 12)

Test:
pcs cluster config update totem token=10000

Comment 11 Miroslav Lisik 2020-12-18 17:44:53 UTC
Test:

[root@r8-node-01 ~]# rpm -q pcs
pcs-0.10.7-3.el8.x86_64

[root@r8-node-01 ~]# grep token /etc/corosync/corosync.conf
[root@r8-node-01 ~]# pcs cluster config update totem token=3000
Sending updated corosync.conf to nodes...
r8-node-01: Succeeded
r8-node-02: Succeeded
r8-node-01: Corosync configuration reloaded
[root@r8-node-01 ~]# grep token /etc/corosync/corosync.conf
    token: 3000
[root@r8-node-01 ~]# corosync-cmapctl | grep token | head -1
runtime.config.totem.token (u32) = 3000

[root@r8-node-01 ~]# pcs cluster config update totem token=10000
Sending updated corosync.conf to nodes...
r8-node-01: Succeeded
r8-node-02: Succeeded
r8-node-01: Corosync configuration reloaded
[root@r8-node-01 ~]# grep token /etc/corosync/corosync.conf
    token: 10000
[root@r8-node-01 ~]# corosync-cmapctl | grep token | head -1
runtime.config.totem.token (u32) = 10000
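
The manual check above (grep on corosync.conf, then corosync-cmapctl) could be scripted roughly as follows. This is a minimal sketch: both helper names are hypothetical, and the parsing assumes the exact line formats shown in the output above. Note that the configured and runtime values are only expected to match on a two-node cluster like this one; with three or more nodes corosync adds token_coefficient per extra node, so the runtime value will be larger.

```shell
#!/bin/sh
# Hypothetical helper: print the configured totem token (ms) from a
# corosync.conf-style file given as $1. Prints nothing if token is not set.
get_conf_token() {
    sed -n 's/^[[:space:]]*token:[[:space:]]*\([0-9][0-9]*\)[[:space:]]*$/\1/p' "$1" | head -n 1
}

# Hypothetical helper: read corosync-cmapctl output on stdin and print the
# value from a line such as "runtime.config.totem.token (u32) = 3000".
get_runtime_token() {
    sed -n 's/^runtime\.config\.totem\.token (u32) = \([0-9][0-9]*\)$/\1/p' | head -n 1
}
```

On a cluster node, the two values can then be compared with something like: conf=$(get_conf_token /etc/corosync/corosync.conf); run=$(corosync-cmapctl | get_runtime_token); [ "$conf" = "$run" ] && echo "token in sync: $conf ms".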

Comment 15 Nina Hostakova 2021-01-12 16:25:38 UTC
Updating the totem token value has been tested along with other corosync configuration options within bz1667061.

Marking verified based on bz1667061 comment 19.

Comment 19 errata-xmlrpc 2021-05-18 15:12:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pcs bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:1737