Bug 2024657

Summary: totemsrp: Switch totempg buffers at the right time [RHEL 9]
Product: Red Hat Enterprise Linux 9
Reporter: Jan Friesse <jfriesse>
Component: corosync
Assignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 9.0
CC: ccaulfie, cluster-maint, phagara
Target Milestone: rc
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: corosync-3.1.5-3.el9
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-17 13:11:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2024090
Bug Blocks:
Attachments:
  totemsrp: Switch totempg buffers at the right time (flags: none)

Description Jan Friesse 2021-11-18 15:50:35 UTC
Description of problem:
Commit 92e0f9c7bb9b4b6a0da8d64bdf3b2e47ae55b1cc added switching of the totempg buffers in the sync phase. But because the buffers were switched too early, there was a problem when delivering recovered messages (messages got corrupted and/or lost). The solution is to switch the buffers only after the recovered messages have been delivered.

How reproducible:
92.673%

Steps to Reproduce:
The reproducer is described in the upstream issue https://github.com/corosync/corosync/issues/660

Actual results:
cpgverify delivers a corrupted message

Expected results:
No corrupted messages

Additional info:
A fixed knet is needed, otherwise pmxcfs remains broken.

Comment 1 Jan Friesse 2021-11-18 15:54:12 UTC
Created attachment 1842607 [details]
totemsrp: Switch totempg buffers at the right time

totemsrp: Switch totempg buffers at the right time

Commit 92e0f9c7bb9b4b6a0da8d64bdf3b2e47ae55b1cc added switching of the
totempg buffers in the sync phase. But because the buffers were switched
too early, there was a problem when delivering recovered messages
(messages got corrupted and/or lost). The solution is to switch the
buffers only after the recovered messages have been delivered.

I think it is worth describing the complete history with reproducers so
it doesn't get lost.

It all started with 402638929e5045ef520a7339696c687fbed0b31b (more info
about the original problem is described in
https://bugzilla.redhat.com/show_bug.cgi?id=820821). That patch solves
a problem which can be reproduced with the following reproducer:
- 2 nodes
- Both nodes running corosync and testcpg
- Pause node 1 (SIGSTOP of corosync)
- On node 1, send some messages with testcpg
  (it doesn't get answers, but that doesn't matter; simply hitting the
  ENTER key a few times is enough)
- Wait till node 2 detects that node 1 left
- Unpause node 1 (SIGCONT of corosync)

and on node 1, newly mcasted cpg messages got sent before the sync barrier,
so node 2 logs "Unknown node -> we will not deliver message".
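
A minimal sketch of this pause/unpause reproducer, assuming corosync and
testcpg are already running on both nodes (commands and log-checking hints
below are illustrative, not the exact script used at the time):

```
# Run on node 1; testcpg should already be attached in another terminal.

# Pause corosync so that node 2 eventually declares node 1 as having left.
kill -STOP "$(pidof corosync)"

# In the testcpg terminal on node 1, hit ENTER a few times to queue
# messages (testcpg gets no answers while corosync is paused; expected).

# Wait until node 2 detects that node 1 left, e.g. by watching its log:
#   journalctl -u corosync -f     # run on node 2

# Resume corosync on node 1.
kill -CONT "$(pidof corosync)"

# Before the fix, node 2 then logged:
#   "Unknown node -> we will not deliver message"
```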

The solution was to add switching of the totemsrp new-messages buffer.

This patch was not enough, so a new one
(92e0f9c7bb9b4b6a0da8d64bdf3b2e47ae55b1cc) was created. The reproducer
for that problem was similar, just with cpgverify used instead of testcpg.
Occasionally, when node 1 was unpaused, it hung in the sync phase because
there was a partial message in the totempg buffers. The new sync message
had a different frag cont, so it was thrown away and never delivered.

After many years, a problem was found which is solved by this patch
(the original issue is described in
https://github.com/corosync/corosync/issues/660).
The reproducer is more complex:
- 2 nodes
- Node 1 is rate-limited (the following script was used on the hypervisor side):
  ```
  iface=tapXXXX
  # ~0.1MB/s in bit/s
  rate=838856
  # 1mb/s
  burst=1048576
  tc qdisc add dev $iface root handle 1: htb default 1
  tc class add dev $iface parent 1: classid 1:1 htb rate ${rate}bps \
    burst ${burst}b
  tc qdisc add dev $iface handle ffff: ingress
  tc filter add dev $iface parent ffff: prio 50 basic police rate \
    ${rate}bps burst ${burst}b mtu 64kb "drop"
  ```
- Node 2 is running corosync and cpgverify
- Node 1 keeps restarting corosync and running cpgverify in a cycle:
  - Console 1: while true; do corosync; sleep 20; \
      kill $(pidof corosync); sleep 20; done
  - Console 2: while true; do ./cpgverify;done

And from time to time (usually reproduced in less than 5 minutes),
cpgverify reports a corrupted message.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Fabio M. Di Nitto <fdinitto>

Comment 3 Jan Friesse 2021-11-18 15:56:26 UTC
For QA: The reproducer is described in comment 1; knet 1.23 is required.

Comment 9 Patrik Hagara 2022-02-28 11:56:01 UTC
For environment setup details & the before-the-fix reproducer, see: https://bugzilla.redhat.com/show_bug.cgi?id=2024658#c10


after the fix (corosync-3.1.5-3.el9)
====================================

> [root@virt-002 ~]# iptables -L -n -v
> Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
>  pkts bytes target     prot opt in     out     source               destination         
> 83993  120M ACCEPT     all  --  ens5   *       0.0.0.0/0            0.0.0.0/0            limit: avg 100/sec burst 250
>   30M   44G DROP       all  --  ens5   *       0.0.0.0/0            0.0.0.0/0           
> 
> Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
>  pkts bytes target     prot opt in     out     source               destination         
> 
> Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
>  pkts bytes target     prot opt in     out     source               destination         
> 31470   29M ACCEPT     all  --  *      ens5    0.0.0.0/0            0.0.0.0/0            limit: avg 100/sec burst 250
>  108K  103M DROP       all  --  *      ens5    0.0.0.0/0            0.0.0.0/0   
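
For reference, a rule set along the following lines would produce counters
like those above. This is only a sketch inferred from the output shown
(ens5 and the 100/sec limit are taken from it), not the exact script used;
the actual setup is in the bz#2024658 comment linked above.

```
# Rate-limit cluster traffic on the ens5 interface (values taken from the
# counters above; the real setup may differ).
iptables -A INPUT  -i ens5 -m limit --limit 100/sec --limit-burst 250 -j ACCEPT
iptables -A INPUT  -i ens5 -j DROP
iptables -A OUTPUT -o ens5 -m limit --limit 100/sec --limit-burst 250 -j ACCEPT
iptables -A OUTPUT -o ens5 -j DROP
```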

Result:

With pacemaker/fencing disabled, cpgverify happily runs without encountering any checksum errors (tested for 30 min). After the first few minutes, corosync on the slowed-down node does not even manage to catch up on all the messages before being restarted again.

With fencing enabled, the slowed-down node gets fenced once the corosync token gets lost (approx. 10 minutes in, depending on the firewall rate-limit).

Comment 11 errata-xmlrpc 2022-05-17 13:11:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: corosync), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2471