Bug 2002115

Summary: totem: Add cancel_hold_on_retransmit config option [RHEL 8]
Product: Red Hat Enterprise Linux 8 Reporter: Reid Wahl <nwahl>
Component: corosync Assignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: medium Docs Contact:
Priority: medium    
Version: 8.4 CC: ccaulfie, cherrylegler, cluster-maint, cluster-qe, jfriesse, mjuricek, nwahl, phagara, sbradley
Target Milestone: rc Keywords: Triaged
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: corosync-3.1.5-2.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2001969
: 2024652 (view as bug list) Environment:
Last Closed: 2022-05-10 14:04:02 UTC Type: Feature Request
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2024095    
Bug Blocks:    
Attachments:
Description                                          Flags
totem: Add cancel_hold_on_retransmit config option   none
totem: Add cancel_hold_on_retransmit config option   none

Description Reid Wahl 2021-09-08 00:46:52 UTC
+++ This bug was initially created as a clone of Bug #2001969 +++

Description of problem:
A customer has requested the following commit be included from corosync:
  - totem: Add cancel_hold_on_retransmit config option 
    https://github.com/corosync/corosync/pull/653


Version-Release number of selected component (if applicable):
corosync - latest

How reproducible:
N/A

Steps to Reproduce:
N/A

Actual results:
The customer wants to avoid the frequent 'FAILED TO RECEIVE' errors during Corosync retransmits.

Expected results:
Significantly fewer 'FAILED TO RECEIVE' errors during Corosync retransmits.

Additional info:

This request is for RHEL 7.9.z from the customer: 03023616

Related comment:
  - Corosync retransmit messages · Issue #622 · corosync/corosync 
    https://github.com/corosync/corosync/issues/622#issuecomment-903513295


--- Additional comment from Reid Wahl on 2021-09-08 00:44:32 UTC ---

Our partner Cherry at Google requested z-stream releases for RHEL 7.7 and 7.9.

If it's feasible to do so, it might make sense to ship z-streams for RHEL 7.4 and 7.6. AFAIK (this information may no longer be up-to-date), RHEL 7.4 and 7.6 are still covered under EUS for SAP HA.

Comment 1 Reid Wahl 2021-09-08 00:49:09 UTC
Google requested this for RHEL 8 as well, with the understanding that the upstream version does not yet work with knet/corosync-3.x.

Supported releases for SAP include 8.1, 8.2, and 8.4, so we would want to bring the patch into all of those once it's working.

Comment 2 Jan Friesse 2021-09-08 07:03:59 UTC
Setting ITR to 8.6.0 (this is RHEL 8 bz).

It's worth noting that one of the following is needed:
- Either Knet must also be fixed (increase the number of fragment buffers) to make corosync work
- Or a larger (= no z-stream) corosync patch (listening for MTU changes and not using knet fragmentation) is needed

Comment 3 Jan Friesse 2021-10-26 08:38:08 UTC
Setting DTM to 22, but it's not certain how realistic 8.6 is (because of Christmas, complexity of the problem, ...)

Comment 4 Jan Friesse 2021-11-18 15:34:33 UTC
Created attachment 1842600
totem: Add cancel_hold_on_retransmit config option

totem: Add cancel_hold_on_retransmit config option

Previously, the existence of retransmit messages canceled holding
of the token (and never allowed the representative to enter the token
hold state).

This makes the token rotate at maximum speed and keeps the processor
resending messages over and over again, overloading the network
and reducing the chance of successfully delivering the messages.

There were also reports of various antivirus / IPS / IDS products that
slow down delivery of packets of certain sizes (packets bigger than the
token), which makes Corosync retransmit messages over and over again.

The proposed solution is to allow the representative to enter the token
hold state when there are only retransmit messages. This allows the
network to handle the overload and/or gives the antivirus/IPS/IDS enough
time to scan and deliver packets without corosync entering the
"FAILED TO RECEIVE" state and adding more load to the network.

(backported from master cdf72925db5a81e546ca8e8d7d8291ee1fc77be4)

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
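
The decision described in the commit message above can be sketched as a small model. This is an illustration only, not corosync's actual code: the `TokenContext` type, its fields, and `may_hold_token` are hypothetical names, and it assumes that enabling `cancel_hold_on_retransmit` restores the previous behavior (as the option name suggests) while leaving it disabled allows the hold.

```python
from dataclasses import dataclass


@dataclass
class TokenContext:
    """Hypothetical snapshot of what the ring representative sees on token arrival."""
    new_messages: int            # messages queued for their first multicast
    retransmit_requests: int     # messages other nodes still ask to be resent
    cancel_hold_on_retransmit: bool = False  # assumption: holding is allowed unless enabled


def may_hold_token(ctx: TokenContext) -> bool:
    """Return True if the representative may enter the token hold state."""
    if ctx.new_messages > 0:
        # Fresh traffic always keeps the token rotating.
        return False
    if ctx.retransmit_requests > 0 and ctx.cancel_hold_on_retransmit:
        # Previous behavior: pending retransmits alone cancel the hold.
        return False
    # Idle ring, or only retransmits with the option off: holding is allowed.
    return True
```

In this model, a ring with only retransmits pending may hold the token when the option is off, which is what gives the network (or a packet scanner) time to catch up; turning the option on reproduces the earlier always-rotate behavior.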

Comment 5 Jan Friesse 2021-11-18 15:35:41 UTC
This patch alone is not enough to solve the problem. A knet fix (defrag buffers) is also needed, so this bug depends on bug 2024095.

Comment 6 Jan Friesse 2021-11-18 15:40:40 UTC
Created attachment 1842603
totem: Add cancel_hold_on_retransmit config option

totem: Add cancel_hold_on_retransmit config option

Previously, the existence of retransmit messages canceled holding
of the token (and never allowed the representative to enter the token
hold state).

This makes the token rotate at maximum speed and keeps the processor
resending messages over and over again, overloading the network
and reducing the chance of successfully delivering the messages.

There were also reports of various antivirus / IPS / IDS products that
slow down delivery of packets of certain sizes (packets bigger than the
token), which makes Corosync retransmit messages over and over again.

The proposed solution is to allow the representative to enter the token
hold state when there are only retransmit messages. This allows the
network to handle the overload and/or gives the antivirus/IPS/IDS enough
time to scan and deliver packets without corosync entering the
"FAILED TO RECEIVE" state and adding more load to the network.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 7 Jan Friesse 2021-11-18 15:45:27 UTC
For QA: The reproducer is the same as for RHEL 7 bug 2001969 (bug 2001969 comment 11)

Comment 13 errata-xmlrpc 2022-05-10 14:04:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (corosync bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1871