Bug 2024652

Summary: totem: Add cancel_hold_on_retransmit config option [RHEL 9]
Product: Red Hat Enterprise Linux 9
Reporter: Jan Friesse <jfriesse>
Component: corosync
Assignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Docs Contact:
Priority: medium
Version: 9.0
CC: ccaulfie, cherrylegler, cluster-maint, cluster-qe, jfriesse, mjuricek, nwahl, phagara, sbradley
Target Milestone: rc
Keywords: Triaged
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: corosync-3.1.5-3.el9
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2002115
Environment:
Last Closed: 2022-05-17 13:11:03 UTC
Type: Feature Request
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2024090
Bug Blocks:
Attachments:
  totem: Add cancel_hold_on_retransmit config option (Flags: none)
  totem: Add cancel_hold_on_retransmit config option (Flags: none)

Description Jan Friesse 2021-11-18 15:37:10 UTC
+++ This bug was initially created as a clone of Bug #2002115 +++

+++ This bug was initially created as a clone of Bug #2001969 +++

Description of problem:
A customer has requested the following commit be included from corosync:
  - totem: Add cancel_hold_on_retransmit config option 
    https://github.com/corosync/corosync/pull/653


Version-Release number of selected component (if applicable):
corosync - latest

How reproducible:
N/A

Steps to Reproduce:
N/A

Actual results:
The customer wants to avoid the many 'FAILED TO RECEIVE' errors seen during Corosync retransmits.

Expected results:
Significantly fewer 'FAILED TO RECEIVE' errors during Corosync retransmits.

Additional info:

This request is for RHEL 7.9.z from the customer: 03023616

Related comment:
  - Corosync retransmit messages · Issue #622 · corosync/corosync 
    https://github.com/corosync/corosync/issues/622#issuecomment-903513295


--- Additional comment from Reid Wahl on 2021-09-08 00:44:32 UTC ---

Our partner Cherry at Google requested z-stream releases for RHEL 7.7 and 7.9.

If it's feasible to do so, it might make sense to ship z-streams for RHEL 7.4 and 7.6. AFAIK (this information may no longer be up-to-date), RHEL 7.4 and 7.6 are still covered under EUS for SAP HA.

--- Additional comment from Reid Wahl on 2021-09-08 00:49:09 UTC ---

Google requested this for RHEL 8 as well, with the understanding that the upstream version does not yet work with knet/corosync-3.x.

Supported releases for SAP include 8.1, 8.2, and 8.4, so we would want to bring the patch into all of those once it's working.

--- Additional comment from Jan Friesse on 2021-09-08 07:03:59 UTC ---

Setting ITR to 8.6.0 (this is RHEL 8 bz).

It's worth noting that either:
- Knet also needs to be fixed (by increasing the number of fragment buffers) for corosync to work, or
- a larger corosync patch (listening for MTU changes and not using knet fragmentation) is needed, which rules out a z-stream.

--- Additional comment from Jan Friesse on 2021-10-26 08:38:08 UTC ---

Setting DTM to 22, but it's not clear how realistic 8.6 is (because of Christmas, the complexity of the problem, ...)

--- Additional comment from Jan Friesse on 2021-11-18 15:34:33 UTC ---

totem: Add cancel_hold_on_retransmit config option

Previously, the existence of retransmit messages canceled holding
of the token (and never allowed the representative to enter the token
hold state).

This makes the token rotate at maximum speed and keeps the processor
resending messages over and over again, overloading the network
and reducing the chance of successfully delivering the messages.

There were also reports of various Antivirus / IPS / IDS products that
slow down delivery of packets of certain sizes (packets bigger than the
token), which makes Corosync retransmit messages over and over again.

The proposed solution is to allow the representative to enter the token
hold state when there are only retransmit messages. This allows the
network to handle the overload and/or gives the Antivirus/IPS/IDS enough
time to scan and deliver packets without corosync entering the
"FAILED TO RECEIVE" state and adding more load to the network.

(backported from master cdf72925db5a81e546ca8e8d7d8291ee1fc77be4)

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
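
For readers who want the decision described in the commit message spelled out, here is a minimal Python sketch. It is not corosync's implementation (corosync is written in C), and the names `TokenState` and `may_hold_token` are hypothetical. It also assumes that enabling the option restores the old behaviour (any pending retransmit cancels the hold) and that pending new messages always cancel the hold; both assumptions are inferred from the option name and the commit text, not confirmed by it.

```python
from dataclasses import dataclass


@dataclass
class TokenState:
    """Hypothetical snapshot of what the ring representative sees on a token visit."""
    new_messages_pending: bool  # regular (first-time) messages waiting to be sent
    retransmits_pending: bool   # messages that other nodes asked to have re-sent


def may_hold_token(state: TokenState, cancel_hold_on_retransmit: bool) -> bool:
    """Return True if the representative may enter the token hold state."""
    if state.new_messages_pending:
        # Assumption: regular traffic always keeps the token rotating.
        return False
    if state.retransmits_pending and cancel_hold_on_retransmit:
        # Old behaviour: the mere existence of retransmits cancels the hold.
        return False
    # Idle ring, or retransmits only with the option disabled: holding is allowed.
    return True
```

With the option disabled, `may_hold_token(TokenState(False, True), False)` allows the hold, which is the case the patch is meant to enable. With the option enabled, the same state keeps the token rotating as before.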

--- Additional comment from Jan Friesse on 2021-11-18 15:35:41 UTC ---

This patch by itself is not enough to solve the problem. The Knet fix (defrag buffers) is also needed, so this bug depends on bug 2024095.

Comment 1 Jan Friesse 2021-11-18 15:38:58 UTC
Created attachment 1842601 [details]
totem: Add cancel_hold_on_retransmit config option

totem: Add cancel_hold_on_retransmit config option

Previously, the existence of retransmit messages canceled holding
of the token (and never allowed the representative to enter the token
hold state).

This makes the token rotate at maximum speed and keeps the processor
resending messages over and over again, overloading the network
and reducing the chance of successfully delivering the messages.

There were also reports of various Antivirus / IPS / IDS products that
slow down delivery of packets of certain sizes (packets bigger than the
token), which makes Corosync retransmit messages over and over again.

The proposed solution is to allow the representative to enter the token
hold state when there are only retransmit messages. This allows the
network to handle the overload and/or gives the Antivirus/IPS/IDS enough
time to scan and deliver packets without corosync entering the
"FAILED TO RECEIVE" state and adding more load to the network.

(backported from master cdf72925db5a81e546ca8e8d7d8291ee1fc77be4)

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 2 Jan Friesse 2021-11-18 15:40:33 UTC
Created attachment 1842602 [details]
totem: Add cancel_hold_on_retransmit config option

totem: Add cancel_hold_on_retransmit config option

Previously, the existence of retransmit messages canceled holding
of the token (and never allowed the representative to enter the token
hold state).

This makes the token rotate at maximum speed and keeps the processor
resending messages over and over again, overloading the network
and reducing the chance of successfully delivering the messages.

There were also reports of various Antivirus / IPS / IDS products that
slow down delivery of packets of certain sizes (packets bigger than the
token), which makes Corosync retransmit messages over and over again.

The proposed solution is to allow the representative to enter the token
hold state when there are only retransmit messages. This allows the
network to handle the overload and/or gives the Antivirus/IPS/IDS enough
time to scan and deliver packets without corosync entering the
"FAILED TO RECEIVE" state and adding more load to the network.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 3 Jan Friesse 2021-11-18 15:45:34 UTC
For QA: The reproducer is the same as for RHEL 7 bug 2001969 (bug 2001969 comment 11).

Comment 9 errata-xmlrpc 2022-05-17 13:11:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: corosync), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2471