Bug 1765025 - corosync can corrupt messages under heavy load and large messages
Summary: corosync can corrupt messages under heavy load and large messages
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: corosync
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.0
Assignee: Jan Friesse
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1765617 1765619
 
Reported: 2019-10-24 07:52 UTC by Christine Caulfield
Modified: 2020-04-28 15:56 UTC
CC List: 8 users

Fixed In Version: corosync-3.0.2-4.el8
Doc Type: Bug Fix
Doc Text:
Cause: Corosync forms a new membership and tries to send messages during recovery.
Consequence: Messages are not fully sent, and other nodes receive them corrupted.
Fix: Properly set the maximum message size.
Result: Messages are always fully sent, so other nodes receive them correctly.
Clone Of:
Clones: 1765617 1765619 (view as bug list)
Environment:
Last Closed: 2020-04-28 15:56:45 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
totemsrp: Reduce MTU to left room second mcast (1.58 KB, patch)
2019-10-25 07:32 UTC, Jan Friesse
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:1674 0 None None None 2020-04-28 15:56:55 UTC

Description Christine Caulfield 2019-10-24 07:52:02 UTC
Description of problem:
When placed under very heavy load, corosync can corrupt large messages

Version-Release number of selected component (if applicable):
corosync-3.0.2-2.el8.x86_64
but almost certainly applicable to all earlier corosync 3 releases as well

How reproducible:
fairly easily

Steps to Reproduce:
1. set up an 8 node cluster (smaller may work, but 8 seems to reproduce more reliably for me)
2. run cpghum -f repeatedly on one node
3. at the same time run cpghum (no args) on the other nodes
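
A minimal command sketch of the above, assuming cpghum has been built from the corosync test sources (node names are illustrative):

# on the flooding node, keep cpghum running in flood mode in a loop
[root@node1 test]# while true; do ./cpghum -f; done

# on every other node, run cpghum with no arguments and watch for CRC errors
[root@node2 test]# ./cpghum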

Actual results:
After a time you will get CRC errors like the following on all nodes apart from the flooding node (with cpghum -f running on node 1)
cpghum: CRCs don't match. got e1cb8a64, expected c2d73db8 from nodeid 1
cpghum: CRCs don't match. got 742eed85, expected 846184f0 from nodeid 1
cpghum: CRCs don't match. got 7b893d7c, expected b1442cf0 from nodeid 1
cpghum: CRCs don't match. got 66bfa4aa, expected 646e6f2b from nodeid 1
cpghum: CRCs don't match. got 4b0d03fa, expected aedc346 from nodeid 1
cpghum: CRCs don't match. got bc18d717, expected 1072bac1 from nodeid 1
cpghum: CRCs don't match. got b8b3c54, expected 804c7588 from nodeid 1
cpghum: CRCs don't match. got 7666e09c, expected ae4c7eba from nodeid 1
cpghum: CRCs don't match. got d4291d72, expected 1154ae33 from nodeid 1
cpghum: CRCs don't match. got 76123261, expected 9656b1f8 from nodeid 1
cpghum: CRCs don't match. got 869d6eee, expected e2069a5a from nodeid 1
cpghum: CRCs don't match. got eb23309b, expected 3790feeb from nodeid 1
cpghum: CRCs don't match. got 352fc1dd, expected 1773680 from nodeid 1

Expected results:
No CRC errors

Additional info:
A fix has been identified and is upstream

Comment 1 Christine Caulfield 2019-10-24 07:59:43 UTC
I should add that this only happens when using knet as the transport, because the MTU size is set larger than the allowed knet buffer size for some large packets.
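
For reference, this means the problem only affects clusters whose totem transport is knet (the corosync 3 / RHEL 8 default). An illustrative corosync.conf excerpt (cluster name is a placeholder):

totem {
    version: 2
    cluster_name: testcluster
    transport: knet
}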

Comment 3 Patrik Hagara 2019-10-24 08:42:27 UTC
do these messages not get re-sent until successfully received, or are they always corrupt? do nodes get fenced because of this?

in other words, what's the actual impact from admin's point of view?

Comment 4 Christine Caulfield 2019-10-24 09:39:59 UTC
> do these messages not get re-sent until successfully received, or are they always corrupt? do nodes get fenced because of this?
> in other words, what's the actual impact from admin's point of view?

TBH I've never seen it in the wild, only on my test rig, so I don't know for sure what the impact would be. It does depend on the message that gets corrupted, so possible effects could be unexpected behaviour, hangs or fencing. If it's a pacemaker message then it will probably be resent as I believe pacemaker checksums its messages.

Comment 5 Patrik Hagara 2019-10-24 17:14:04 UTC
qa_ack+, reproducer in comment#0

Comment 8 Jan Friesse 2019-10-25 07:32:04 UTC
Created attachment 1629053 [details]
totemsrp: Reduce MTU to left room second mcast

totemsrp: Reduce MTU to left room second mcast

Messages sent during the recovery phase are encapsulated, so such a message
carries the extra size of the mcast structure. This is not a big problem for
UDPU, because most switches are able to fragment and reassemble the packet,
but it is a problem for knet, because totempg uses the maximum packet size
(65536 bytes), and when another header is added during retransmission the
packet becomes too large.

The solution is to reduce the MTU by 2 * sizeof (struct mcast).

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Fabio M. Di Nitto <fdinitto>

Comment 18 Patrik Hagara 2020-01-29 19:15:02 UTC
before
======

> [root@virt-034 ~]# rpm -q corosync
> corosync-3.0.2-3.el8.x86_64

set the following options in corosync.conf in order to make it easier to trigger the bug:
> token: 400
> token_coefficient: 0
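
For context, both are totem-section options, so the excerpt above corresponds to the following in /etc/corosync/corosync.conf (illustrative; the short token timeout just makes the problem easier to trigger, as noted above):

> totem {
>     ...
>     token: 400
>     token_coefficient: 0
> }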

start cluster:
> [root@virt-034 ~]# pcs status
> Cluster name: STSRHTS30715
> Stack: corosync
> Current DC: virt-065 (version 2.0.2-3.el8_1.2-744a30d655) - partition with quorum
> Last updated: Wed Jan 29 17:51:46 2020
> Last change: Wed Jan 29 17:29:48 2020 by root via cibadmin on virt-034
> 
> 13 nodes configured
> 13 resources configured
> 
> Online: [ virt-034 virt-056 virt-057 virt-058 virt-059 virt-060 virt-061 virt-062 virt-065 virt-068 virt-069 virt-070 virt-074 ]
> 
> Full list of resources:
> 
>  fence-virt-034	(stonith:fence_xvm):	Started virt-034
>  fence-virt-056	(stonith:fence_xvm):	Started virt-056
>  fence-virt-057	(stonith:fence_xvm):	Started virt-057
>  fence-virt-058	(stonith:fence_xvm):	Started virt-058
>  fence-virt-059	(stonith:fence_xvm):	Started virt-059
>  fence-virt-060	(stonith:fence_xvm):	Started virt-060
>  fence-virt-061	(stonith:fence_xvm):	Started virt-061
>  fence-virt-062	(stonith:fence_xvm):	Started virt-062
>  fence-virt-065	(stonith:fence_xvm):	Started virt-065
>  fence-virt-068	(stonith:fence_xvm):	Started virt-068
>  fence-virt-069	(stonith:fence_xvm):	Started virt-069
>  fence-virt-070	(stonith:fence_xvm):	Started virt-070
>  fence-virt-074	(stonith:fence_xvm):	Started virt-074
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled

start `cpghum` on all but one node, then `cpghum -f` on the remaining one.

some nodes report CRC mismatches:
> [root@virt-070 test]# ./cpghum 
> cpghum:   119 messages received,  4096 bytes per write. RTT min/avg/max: 4312/51378/468873
> cpghum: counters don't match. got 1021, expected 1920 from node 1
> cpghum: CRCs don't match. got b8366af3, expected b1f6f778 from nodeid 1
> cpghum: counters don't match. got 1921, expected 1022 from node 1
> cpghum: counters don't match. got 1021, expected 2225 from node 1
> cpghum: CRCs don't match. got b8366af3, expected 36e6d7d5 from nodeid 1
> cpghum: counters don't match. got 2226, expected 1022 from node 1
> cpghum: counters don't match. got 1021, expected 2613 from node 1
> cpghum: CRCs don't match. got b8366af3, expected 5528d35e from nodeid 1
> cpghum: counters don't match. got 2614, expected 1022 from node 1
> cpghum: counters don't match. got 1021, expected 3071 from node 1
> cpghum: CRCs don't match. got b8366af3, expected 9d355ebf from nodeid 1
> cpghum: counters don't match. got 3072, expected 1022 from node 1
> cpghum: 47847 messages received,  4096 bytes per write. RTT min/avg/max: 3226/354354/2989521
> cpghum: counters don't match. got 55191, expected 53275 from node 1

result: CRC errors occur


after
=====

> [root@virt-034 test]# rpm -q corosync
> corosync-3.0.3-2.el8.x86_64

set the following options in corosync.conf in order to make it easier to trigger the bug:
> token: 400
> token_coefficient: 0

start cluster:
> [root@virt-034 test]# pcs status
> Cluster name: STSRHTS12010
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-069 (version 2.0.3-4.el8-4b1f869f0f) - partition with quorum
>   * Last updated: Wed Jan 29 20:00:15 2020
>   * Last change:  Wed Jan 29 19:48:03 2020 by root via cibadmin on virt-034
>   * 13 nodes configured
>   * 13 resource instances configured
> 
> Node List:
>   * Online: [ virt-034 virt-056 virt-057 virt-058 virt-059 virt-060 virt-061 virt-062 virt-065 virt-068 virt-069 virt-070 virt-074 ]
> 
> Full List of Resources:
>   * fence-virt-034	(stonith:fence_xvm):	Started virt-034
>   * fence-virt-056	(stonith:fence_xvm):	Started virt-056
>   * fence-virt-057	(stonith:fence_xvm):	Started virt-057
>   * fence-virt-058	(stonith:fence_xvm):	Started virt-061
>   * fence-virt-059	(stonith:fence_xvm):	Started virt-062
>   * fence-virt-060	(stonith:fence_xvm):	Started virt-065
>   * fence-virt-061	(stonith:fence_xvm):	Started virt-069
>   * fence-virt-062	(stonith:fence_xvm):	Started virt-070
>   * fence-virt-065	(stonith:fence_xvm):	Started virt-074
>   * fence-virt-068	(stonith:fence_xvm):	Started virt-058
>   * fence-virt-069	(stonith:fence_xvm):	Started virt-060
>   * fence-virt-070	(stonith:fence_xvm):	Started virt-068
>   * fence-virt-074	(stonith:fence_xvm):	Started virt-059
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled



start `cpghum` on all but one node, then `cpghum -f` on the remaining one:
> [root@virt-034 test]# ./cpghum -f
> 131289 messages received    64 bytes per write  10.000 Seconds runtime 13128.854 TP/s   0.840 MB/s RTT for this size (min/avg/max) 2473/155467/2129302
> 79488 messages received   320 bytes per write  10.010 Seconds runtime  7941.084 TP/s   2.541 MB/s RTT for this size (min/avg/max) 4630/164556/2555632
> cpghum: counters don't match. got 663, expected 660 from node 12
> cpghum: counters don't match. got 660, expected 658 from node 13
> cpghum: counters don't match. got 693, expected 692 from node 2
> cpghum: counters don't match. got 690, expected 689 from node 3
> cpghum: counters don't match. got 688, expected 687 from node 4
> cpghum: counters don't match. got 684, expected 683 from node 5
> cpghum: counters don't match. got 683, expected 682 from node 6
> cpghum: counters don't match. got 682, expected 679 from node 7
> cpghum: counters don't match. got 676, expected 672 from node 9
> cpghum: counters don't match. got 672, expected 669 from node 10
> cpghum: counters don't match. got 667, expected 663 from node 11
> 13931 messages received  1600 bytes per write  10.011 Seconds runtime  1391.575 TP/s   2.227 MB/s RTT for this size (min/avg/max) 2892/734898/6191690
>  8479 messages received  8000 bytes per write  10.020 Seconds runtime   846.194 TP/s   6.770 MB/s RTT for this size (min/avg/max) 4032/32442/195605
> cpghum: counters don't match. got 715, expected 713 from node 2
> cpghum: counters don't match. got 680, expected 678 from node 13
> cpghum: counters don't match. got 712, expected 710 from node 3
> cpghum: counters don't match. got 710, expected 708 from node 4
> cpghum: counters don't match. got 706, expected 704 from node 5
> cpghum: counters don't match. got 705, expected 703 from node 6
> cpghum: counters don't match. got 702, expected 700 from node 7
> cpghum: counters don't match. got 699, expected 697 from node 8
> cpghum: counters don't match. got 695, expected 693 from node 9
> cpghum: counters don't match. got 692, expected 690 from node 10
> cpghum: counters don't match. got 686, expected 684 from node 11
> cpghum: counters don't match. got 683, expected 681 from node 12
>   756 messages received 40000 bytes per write  10.001 Seconds runtime    75.595 TP/s   3.024 MB/s RTT for this size (min/avg/max) 4710/714697/5581197
> cpghum: counters don't match. got 695, expected 691 from node 12
> cpghum: counters don't match. got 692, expected 688 from node 13
> cpghum: counters don't match. got 727, expected 724 from node 2
>    51 messages received 200000 bytes per write  10.758 Seconds runtime     4.741 TP/s   0.948 MB/s RTT for this size (min/avg/max) 11723/1103590/10031384
> cpghum: counters don't match. got 725, expected 719 from node 3
> cpghum: counters don't match. got 723, expected 717 from node 4
> cpghum: counters don't match. got 719, expected 713 from node 5
> cpghum: counters don't match. got 718, expected 711 from node 6
> cpghum: counters don't match. got 714, expected 708 from node 7
> cpghum: counters don't match. got 711, expected 705 from node 8
> cpghum: counters don't match. got 707, expected 702 from node 9
> cpghum: counters don't match. got 705, expected 699 from node 10
> cpghum: counters don't match. got 698, expected 693 from node 11
>   127 messages received 1000000 bytes per write  10.005 Seconds runtime    12.694 TP/s  12.694 MB/s RTT for this size (min/avg/max) 109680/10589399/12703485
> 
> Stats:
>    packets sent:    233447
>    send failures:   0
>    send retries:    2871038
>    length errors:   0
>    packets recvd:   234121
>    sequence errors: 35
>    crc errors:	    0
>    min RTT:         2473
>    max RTT:         12703486
>    avg RTT:         164851

The flooding node prints sequence errors, which is expected due to a known bug in cpghum (see comment#16).

Similar sequence errors appear on the listening nodes, e.g.:
> cpghum: 39948 messages received,  4096 bytes per write. RTT min/avg/max: 925/331992/17183573
> cpghum: counters don't match. got 216434, expected 216039 from node 1
> cpghum: 13271 messages received,  4096 bytes per write. RTT min/avg/max: 925/330413/17183573
> cpghum: counters don't match. got 232838, expected 232808 from node 1
> cpghum:  3445 messages received,  4096 bytes per write. RTT min/avg/max: 925/332751/17183573
> cpghum:   639 messages received,  4096 bytes per write. RTT min/avg/max: 925/336903/17183573
> cpghum: counters don't match. got 233420, expected 233419 from node 1

No CRC or other errors detected on any node.

Marking verified in 3.0.3-2.el8

Comment 20 errata-xmlrpc 2020-04-28 15:56:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1674
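
After applying the advisory, a quick way to confirm that a fixed build is installed (corosync-3.0.2-4.el8 or later, per the "Fixed In Version" field above; hostname is illustrative):

[root@node1 ~]# rpm -q corosync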

