Bug 2002115
Summary: | totem: Add cancel_hold_on_retransmit config option [RHEL 8] | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Reid Wahl <nwahl> | ||||||
Component: | corosync | Assignee: | Jan Friesse <jfriesse> | ||||||
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | ||||||
Severity: | medium | Docs Contact: | |||||||
Priority: | medium | ||||||||
Version: | 8.4 | CC: | ccaulfie, cherrylegler, cluster-maint, cluster-qe, jfriesse, mjuricek, nwahl, phagara, sbradley | ||||||
Target Milestone: | rc | Keywords: | Triaged | ||||||
Target Release: | --- | ||||||||
Hardware: | All | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | corosync-3.1.5-2.el8 | Doc Type: | If docs needed, set a value | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | 2001969 | ||||||||
: | 2024652 | Environment: | |||||||
Last Closed: | 2022-05-10 14:04:02 UTC | Type: | Feature Request | ||||||
Bug Depends On: | 2024095 | ||||||||
Bug Blocks: | |||||||||
Attachments: |
|
Description
Reid Wahl
2021-09-08 00:46:52 UTC
Google requested this for RHEL 8 as well, with the understanding that the upstream version does not yet work with knet/corosync-3.x. Supported releases for SAP include 8.1, 8.2, and 8.4, so we would want to bring the patch into all of those once it's working.
Setting ITR to 8.6.0 (this is the RHEL 8 bz). It's worth noting that one of the following is also needed:
- Knet must be fixed as well (increase the number of fragment buffers) to make corosync work
- Or a larger (= no ZStream) corosync patch is needed (listen for MTU changes and do not use knet fragmentation)
Setting DTM to 22, but it's not clear how realistic 8.6 is (because of Christmas, complexity of the problem, ...).
Created attachment 1842600
totem: Add cancel_hold_on_retransmit config option
Previously, the existence of retransmit messages canceled holding
of the token (and never allowed the representative to enter the token hold
state).
This makes the token rotate at maximum speed and keeps the processor
resending messages over and over again, overloading the network
and reducing the chance of successfully delivering the messages.
Also, there were reports of various Antivirus / IPS / IDS products which slow
down delivery of packets of certain sizes (packets bigger than the token),
which makes Corosync retransmit messages over and over again.
The proposed solution is to allow the representative to enter the token hold
state when there are only retransmit messages. This allows the network to
handle the overload and/or gives the Antivirus/IPS/IDS enough time to scan and
deliver packets without corosync entering the "FAILED TO RECEIVE" state and
adding more load to the network.
(backported from master cdf72925db5a81e546ca8e8d7d8291ee1fc77be4)
Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
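The behaviour change described in the commit message can be illustrated with a small decision function. This is only a minimal sketch, not the corosync implementation: the `RingState` type, its fields, and the `may_hold_token` function are hypothetical names invented for illustration, and the rule that pending new messages always prevent holding is an assumption about the idle-ring hold mechanism. The flag name is taken from the patch subject; the key actually accepted in corosync.conf and its default value are not stated in this report.

```python
from dataclasses import dataclass


@dataclass
class RingState:
    """Hypothetical snapshot of what the representative sees on token arrival."""
    new_messages_pending: bool   # regular (first-time) messages waiting to be sent
    retransmits_pending: bool    # messages requested for retransmission


def may_hold_token(state: RingState, cancel_hold_on_retransmit: bool) -> bool:
    """Return True if the representative may enter the token hold state.

    Assumption: new traffic always prevents holding. Retransmits prevent
    holding only when cancel_hold_on_retransmit is enabled, which models
    the behaviour the commit message describes as the previous one.
    """
    if state.new_messages_pending:
        return False
    if state.retransmits_pending and cancel_hold_on_retransmit:
        return False
    return True


# A ring that carries only retransmit messages.
only_retransmits = RingState(new_messages_pending=False, retransmits_pending=True)

# Previous behaviour: retransmits cancel the hold, so the token keeps rotating.
print(may_hold_token(only_retransmits, cancel_hold_on_retransmit=True))   # False

# Behaviour enabled by the patch: the representative may hold the token.
print(may_hold_token(only_retransmits, cancel_hold_on_retransmit=False))  # True
```

In this sketch, disabling the flag lets the representative pause token rotation while only retransmits are outstanding, which corresponds to giving the network (or an Antivirus/IPS/IDS) time to deliver the delayed packets.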
This patch by itself is not enough to solve the problem. The knet fix (defrag buffers) is also needed, so this bug depends on bug 2024095.
Created attachment 1842603
totem: Add cancel_hold_on_retransmit config option
Previously, the existence of retransmit messages canceled holding
of the token (and never allowed the representative to enter the token hold
state).
This makes the token rotate at maximum speed and keeps the processor
resending messages over and over again, overloading the network
and reducing the chance of successfully delivering the messages.
Also, there were reports of various Antivirus / IPS / IDS products which slow
down delivery of packets of certain sizes (packets bigger than the token),
which makes Corosync retransmit messages over and over again.
The proposed solution is to allow the representative to enter the token hold
state when there are only retransmit messages. This allows the network to
handle the overload and/or gives the Antivirus/IPS/IDS enough time to scan and
deliver packets without corosync entering the "FAILED TO RECEIVE" state and
adding more load to the network.
Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
For QA: The reproducer is the same as for RHEL 7 bug 2001969 (bug 2001969 comment 11).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (corosync bug fix and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2022:1871