Bug 2002115 - totem: Add cancel_hold_on_retransmit config option [RHEL 8]
Summary: totem: Add cancel_hold_on_retransmit config option [RHEL 8]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: corosync
Version: 8.4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 2024095
Blocks:
Reported: 2021-09-08 00:46 UTC by Reid Wahl
Modified: 2022-05-10 14:30 UTC
CC List: 9 users

Fixed In Version: corosync-3.1.5-2.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2001969
Clones: 2024652
Environment:
Last Closed: 2022-05-10 14:04:02 UTC
Type: Feature Request
Target Upstream Version:
Embargoed:


Attachments
totem: Add cancel_hold_on_retransmit config option (4.69 KB, patch)
2021-11-18 15:34 UTC, Jan Friesse
totem: Add cancel_hold_on_retransmit config option (4.66 KB, patch)
2021-11-18 15:40 UTC, Jan Friesse


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-96474 0 None None None 2021-09-08 00:47:58 UTC
Red Hat Knowledge Base (Solution) 6310541 0 None None None 2021-09-08 13:17:07 UTC
Red Hat Product Errata RHBA-2022:1871 0 None None None 2022-05-10 14:04:06 UTC

Description Reid Wahl 2021-09-08 00:46:52 UTC
+++ This bug was initially created as a clone of Bug #2001969 +++

Description of problem:
A customer has requested the following commit be included from corosync:
  - totem: Add cancel_hold_on_retransmit config option 
    https://github.com/corosync/corosync/pull/653


Version-Release number of selected component (if applicable):
corosync - latest

How reproducible:
N/A

Steps to Reproduce:
N/A

Actual results:
The customer sees too many 'FAILED TO RECEIVE' errors during Corosync retransmits.

Expected results:
Significantly fewer 'FAILED TO RECEIVE' errors during Corosync retransmits.

Additional info:

This request is for RHEL 7.9.z from the customer: 03023616

Related comment:
  - Corosync retransmit messages · Issue #622 · corosync/corosync 
    https://github.com/corosync/corosync/issues/622#issuecomment-903513295


--- Additional comment from Reid Wahl on 2021-09-08 00:44:32 UTC ---

Our partner Cherry at Google requested z-stream releases for RHEL 7.7 and 7.9.

If it's feasible to do so, it might make sense to ship z-streams for RHEL 7.4 and 7.6. AFAIK (this information may no longer be up-to-date), RHEL 7.4 and 7.6 are still covered under EUS for SAP HA.

Comment 1 Reid Wahl 2021-09-08 00:49:09 UTC
Google requested this for RHEL 8 as well, with the understanding that the upstream version does not yet work with knet/corosync-3.x.

Supported releases for SAP include 8.1, 8.2, and 8.4, so we would want to bring the patch into all of those once it's working.

Comment 2 Jan Friesse 2021-09-08 07:03:59 UTC
Setting ITR to 8.6.0 (this is RHEL 8 bz).

It's worth noting that one of the following is needed:
- Either Knet must also be fixed (by increasing the number of fragment buffers) to make corosync work,
- Or a larger corosync patch (= not suitable for a z-stream) is needed, one that listens for MTU changes and does not use knet fragmentation.

Comment 3 Jan Friesse 2021-10-26 08:38:08 UTC
Setting DTM to 22, but it's not certain how realistic 8.6 is (because of Christmas, the complexity of the problem, ...)

Comment 4 Jan Friesse 2021-11-18 15:34:33 UTC
Created attachment 1842600
totem: Add cancel_hold_on_retransmit config option

totem: Add cancel_hold_on_retransmit config option

Previously, the existence of retransmit messages canceled holding
of the token (and never allowed the representative to enter the token
hold state).

This makes the token rotate at maximum speed and keeps the processor
resending messages over and over again, overloading the network
and reducing the chance of successfully delivering the messages.

There were also reports of various antivirus / IPS / IDS products that slow
down delivery of packets of certain sizes (packets bigger than the token),
which makes Corosync retransmit messages over and over again.

The proposed solution is to allow the representative to enter the token
hold state when there are only retransmit messages. This allows the network
to handle the overload and/or gives the antivirus/IPS/IDS enough time to scan
and deliver packets without Corosync entering the "FAILED TO RECEIVE" state
and adding more load to the network.

(backported from master cdf72925db5a81e546ca8e8d7d8291ee1fc77be4)

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
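
The decision described in the commit message above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the corosync implementation: the `TokenState` type, the `may_hold_token` function, and the rule that regular (non-retransmit) traffic always keeps the token moving are hypothetical simplifications. The flag name is taken from the patch title; how the option is spelled in corosync.conf is not stated in this bug.

```python
from dataclasses import dataclass


@dataclass
class TokenState:
    """Hypothetical, simplified view of what the ring representative sees."""
    new_messages_pending: bool  # regular (non-retransmit) messages are queued
    retransmits_pending: bool   # retransmit requests are outstanding


def may_hold_token(state: TokenState, cancel_hold_on_retransmit: bool) -> bool:
    """Return True if the representative may enter the token hold state."""
    if state.new_messages_pending:
        # Assumption: regular traffic always keeps the token rotating.
        return False
    if state.retransmits_pending and cancel_hold_on_retransmit:
        # Previous behavior: any retransmit message cancels the hold.
        return False
    # Idle ring, or only retransmits with the cancel option disabled.
    return True


if __name__ == "__main__":
    only_retransmits = TokenState(new_messages_pending=False, retransmits_pending=True)
    # With the cancel option disabled, the hold is allowed (patched behavior).
    print(may_hold_token(only_retransmits, cancel_hold_on_retransmit=False))
    # With the cancel option enabled, the hold is canceled (previous behavior).
    print(may_hold_token(only_retransmits, cancel_hold_on_retransmit=True))
```

In this model, the only case the option changes is a ring that carries nothing but retransmit messages, which matches the scenario the commit message targets.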

Comment 5 Jan Friesse 2021-11-18 15:35:41 UTC
This patch by itself is not enough to solve the problem. The Knet fix (defrag buffers) is also needed, so this bug depends on bug 2024095.

Comment 6 Jan Friesse 2021-11-18 15:40:40 UTC
Created attachment 1842603
totem: Add cancel_hold_on_retransmit config option

totem: Add cancel_hold_on_retransmit config option

Previously, the existence of retransmit messages canceled holding
of the token (and never allowed the representative to enter the token
hold state).

This makes the token rotate at maximum speed and keeps the processor
resending messages over and over again, overloading the network
and reducing the chance of successfully delivering the messages.

There were also reports of various antivirus / IPS / IDS products that slow
down delivery of packets of certain sizes (packets bigger than the token),
which makes Corosync retransmit messages over and over again.

The proposed solution is to allow the representative to enter the token
hold state when there are only retransmit messages. This allows the network
to handle the overload and/or gives the antivirus/IPS/IDS enough time to scan
and deliver packets without Corosync entering the "FAILED TO RECEIVE" state
and adding more load to the network.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 7 Jan Friesse 2021-11-18 15:45:27 UTC
For QA: The reproducer is the same as for the RHEL 7 bug 2001969 (see bug 2001969 comment 11).

Comment 13 errata-xmlrpc 2022-05-10 14:04:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (corosync bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1871

