2024658 – totemsrp: Switch totempg buffers at the right time [RHEL 8]

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 2024658 - totemsrp: Switch totempg buffers at the right time [RHEL 8]

Summary: totemsrp: Switch totempg buffers at the right time [RHEL 8]

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	corosync
Sub Component:
Version:	8.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Jan Friesse
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:	2024095
Blocks:
TreeView+	depends on / blocked

Reported:	2021-11-18 15:50 UTC by Jan Friesse
Modified:	2022-05-10 14:30 UTC (History)
CC List:	3 users (show)
Fixed In Version:	corosync-3.1.5-2.el8
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-05-10 14:04:02 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
totemsrp: Switch totempg buffers at the right time (4.14 KB, patch) 2021-11-18 15:54 UTC, Jan Friesse	no flags	Details \| Diff
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHELPLAN-103182	0	None	None	None	2021-11-18 15:52:49 UTC
Red Hat Product Errata	RHBA-2022:1871	0	None	None	None	2022-05-10 14:04:06 UTC

Description Jan Friesse 2021-11-18 15:50:37 UTC

Description of problem:
Commit 92e0f9c7bb9b4b6a0da8d64bdf3b2e47ae55b1cc added switching of totempg buffers in sync phase. But because buffers got switch too early there was a problem when delivering recovered messages (messages got corrupted and/or lost). Solution is to switch buffers after recovered messages got delivered.

How reproducible:
92.518%

Steps to Reproduce:
Reproducer is described in upstream issue https://github.com/corosync/corosync/issues/660

Actual results:
cpgverify delivers corrupted message

Expected results:
No corrupted message

Additional info:
Fixed knet is needed, otherwise pmxcfs remains broken

Comment 1 Jan Friesse 2021-11-18 15:54:04 UTC

Created attachment 1842606 [details]
totemsrp: Switch totempg buffers at the right time

totemsrp: Switch totempg buffers at the right time

Commit 92e0f9c7bb9b4b6a0da8d64bdf3b2e47ae55b1cc added switching of
totempg buffers in sync phase. But because buffers got switch too early
there was a problem when delivering recovered messages (messages got
corrupted and/or lost). Solution is to switch buffers after recovered
messages got delivered.

I think it is worth to describe complete history with reproducers so it
doesn't get lost.

It all started with 402638929e5045ef520a7339696c687fbed0b31b (more info
about original problem is described in
https://bugzilla.redhat.com/show_bug.cgi?id=820821). This patch
solves problem which is way to be reproduced with following reproducer:
- 2 nodes
- Both nodes running corosync and testcpg
- Pause node 1 (SIGSTOP of corosync)
- On node 1, send some messages by testcpg
  (it's not answering but this doesn't matter). Simply hit ENTER key
  few times is enough)
- Wait till node 2 detects that node 1 left
- Unpause node 1 (SIGCONT of corosync)

and on node 1 newly mcasted cpg messages got sent before sync barrier,
so node 2 logs "Unknown node -> we will not deliver message".

Solution was to add switch of totemsrp new messages buffer.

This patch was not enough so new one
(92e0f9c7bb9b4b6a0da8d64bdf3b2e47ae55b1cc) was created. Reproducer of
problem was similar, just cpgverify was used instead of testcpg.
Occasionally when node 1 was unpaused it hang in sync phase because
there was a partial message in totempg buffers. New sync message had
different frag cont so it was thrown away and never delivered.

After many years problem was found which is solved by this patch
(original issue describe in
https://github.com/corosync/corosync/issues/660).
Reproducer is more complex:
- 2 nodes
- Node 1 is rate-limited (used script on the hypervisor side):
  ```
  iface=tapXXXX
  # ~0.1MB/s in bit/s
  rate=838856
  # 1mb/s
  burst=1048576
  tc qdisc add dev $iface root handle 1: htb default 1
  tc class add dev $iface parent 1: classid 1:1 htb rate ${rate}bps \
    burst ${burst}b
  tc qdisc add dev $iface handle ffff: ingress
  tc filter add dev $iface parent ffff: prio 50 basic police rate \
    ${rate}bps burst ${burst}b mtu 64kb "drop"
  ```
- Node 2 is running corosync and cpgverify
- Node 1 keeps restarting of corosync and running cpgverify in cycle
  - Console 1: while true; do corosync; sleep 20; \
      kill $(pidof corosync); sleep 20; done
  - Console 2: while true; do ./cpgverify;done

And from time to time (reproduced usually in less than 5 minutes)
cpgverify reports corrupted message.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Fabio M. Di Nitto <fdinitto>

Comment 3 Jan Friesse 2021-11-18 15:56:10 UTC

For QA: Reproduced is described in comment 1, knet 1.23 is required

Comment 10 Patrik Hagara 2022-02-28 11:33:43 UTC

env
===

3-node cluster, one of the nodes has the following iptables rules on the corosync network interface (command order matters, make sure the iface is used for cluster traffic only):

> iptables -I INPUT -i ens5 -j DROP
> iptables -I INPUT -m limit --limit 100/s --limit-burst 250 -i ens5 -j ACCEPT
> iptables -I OUTPUT -o ens5 -j DROP
> iptables -I OUTPUT -m limit --limit 100/s --limit-burst 250 -o ens5 -j ACCEPT

on the slowed-down node, run the following commands concurrently:

> while true; do systemctl restart corosync; sleep 20; done
> while true; do ~/corosync/test/cpgverify; done

on one of the other nodes, start cpgverify too (before the fix, this process will terminate with an "incorrect hash" error after a short while, with fixed corosync it should continue running indefinitely):

> ~/corosync/test/cpgverify

The cpgverify program must be compiled from corosync git against the installed libqb-devel & libknet1-devel packages.

During the test, corosync.log on healthy nodes should contain many "Retransmit List" lines. The message list will expand rapidly on each corosync restart performed on the slowed-down node and then very slowly empty out as the firewall lets the packets in.


before (corosync-3.1.5-1.el8)
=============================

> [root@virt-020 ~]# ~/corosync/test/cpgverify
> [...]
> SIZE 81013 HASH: 0xc83955ad
> msg 'cpg_mcast_joined: This is message       354121'
> SIZE 86706 HASH: 0xcaf9d9e0
> msg 'cpg_mcast_joined: This is message       354122'
> SIZE 90762 HASH: 0xf386fa76
> msg 'cpg_mcast_joined: This is message       354123'
> SIZE 1827 HASH: 0xcb3c623a
> msg 'cpg_mcast_joined: This is message       354049'
> SIZE 56059 HASH: 0x2fe4c7d6
> incorrect hash

Result: within a minute or so, the cpgverify test program running on one of the healthy nodes exits with an "incorrect hash" error message.


after (corosync-3.1.5-2.el8)
============================

Result: cpgverify happily runs for approx. 10 minutes (depending on the rate limit), afterwards the retransmit list grows so big that the slowed-down node is fenced by pacemaker. With pacemaker/fencing disabled, cpgverify continues to run without encountering any checksum errors (tested for 1 hour).

Comment 12 errata-xmlrpc 2022-05-10 14:04:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (corosync bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1871

Note You need to log in before you can comment on or make changes to this bug.