Bug 682771

Summary: RFE: remove 1M message size limit
Product: Red Hat Enterprise Linux 7 Reporter: Florian Haas <florian>
Component: corosync Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: high Docs Contact:
Priority: high    
Version: 7.0 CC: abeekhof, agk, ccaulfie, cfeist, cluster-maint, fdinitto, jfriesse, jkortus, wnix
Target Milestone: rc Keywords: FutureFeature
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: corosync-2.3.4-6.el7 Doc Type: Enhancement
Doc Text:
Feature: The maximum size of a message that could be transferred using the corosync CPG messaging facility was previously limited to 1MB. This limit has now been lifted.
Reason: Pacemaker uses corosync CPG messaging to communicate changes in cluster state. With larger numbers of resources, this information could get quite large and, even with data compression, exceed the maximum size allowed by corosync.
Result: There is now no limit on the size of the data packets sent using CPG messaging in corosync. It is still necessary to configure pacemaker in /etc/sysconfig/pacemaker to allow larger messages to be sent.
Story Points: ---
Clone Of:
: 975903 (view as bug list) Environment:
Last Closed: 2015-11-19 11:41:06 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 975903    
Bug Blocks: 1133060, 1174884, 1205796, 1251103    
Attachments:
Description Flags
cpg: Add support for messages larger than 1Mb none
Really add cpghum none
Don't link with libz when not needed none

Description Florian Haas 2011-03-07 14:50:39 UTC
Description of problem:
IIUC, Corosync currently has a 1M message size limit due to a hardcoded default in TOTEM buffer allocation. This may become a problem as Pacemaker clusters become more complex, with cluster sizes upward of 16 nodes and CIBs exceeding perhaps dozens of resources.

Version-Release number of selected component (if applicable):
1.2.3

Expected results:

Make the message size limit configurable, or (if this is possible) remove the hard limit altogether.

Comment 3 Steven Dake 2011-03-24 17:40:19 UTC
The client->server ipc portion of this RFE could be addressed by using the zero-copy feature to allocate buffers when the requested buffer size is greater than 1MB (and then do a memcpy).  From server to client, an additional message type could be added to indicate the buffer is a freshly mmapped buffer needing special attention by the dispatch code.  The totempg code could then have a memory allocation that takes place if a new message is received that will be larger than 1MB.  All sounds pretty complicated though and prone to breakage.

Do you have customers that have run into this limit?

Regards
-steve

Comment 4 Steven Dake 2011-03-24 17:41:04 UTC
Angus, please comment on how this RFE would be achieved in the libqb corosync 2.0+ case.

Comment 5 Angus Salkeld 2011-03-26 11:23:53 UTC
Are you sending XML text? Is it possible to compress the text
(it should compress well)?

Another option is to automatically fragment the message between
client and server. I'd need to look into it a bit more though.

Comment 6 Andrew Beekhof 2011-04-13 07:42:56 UTC
It is XML that is being sent and we do compress it already.
However the status section can get really big so hitting the limit is still conceivable.

I don't think we necessarily need to remove the limit completely, just allow it to be tuned from corosync.conf (_before_ startup) by those that find it necessary.  

This would have the nice property of also allowing it to be tuned down, thus lowering corosync's memory footprint in situations not needing large messages.

Comment 9 Steven Dake 2011-08-10 19:24:12 UTC
Will propose as a 2.0 feature (rhel7 timeframe).

Comment 12 Jan Friesse 2013-06-19 14:57:41 UTC
IPC is now handled by LibQB. According to https://github.com/asalkeld/libqb/issues/14, that problem still exists. There is also another problem, https://github.com/asalkeld/libqb/issues/71. Once these two issues are resolved, support in corosync should be seamless. Cloning this bug: this bug will be used for corosync, and the clone, Bug 975903, for LibQB.

Comment 16 Christine Caulfield 2015-03-06 08:40:44 UTC
commit 8cc8e513633a1a8b12c416e32fb5362fcf4d65dd
Author: Christine Caulfield <ccaulfie>
Date:   Thu Mar 5 16:45:15 2015 +0000

    cpg: Add support for messages larger than 1Mb

Comment 18 Jan Friesse 2015-04-01 13:16:35 UTC
Created attachment 1009656
cpg: Add support for messages larger than 1Mb

cpg: Add support for messages larger than 1Mb

If a cpg client sends a message larger than 1Mb (actually slightly
less to allow for internal buffers) cpg will now fragment that into
several corosync messages before sending it around the ring.

cpg_mcast_joined() can now return CS_ERR_INTERRUPT which means that the
cpg membership was disrupted during the send operation and the message
needs to be resent.

The new API call cpg_max_atomic_msgsize_get() returns the maximum size
of a message that will not be fragmented internally.

New test program cpghum was written to stress test this functionality,
it checks message integrity and order of receipt.

Signed-off-by: Christine Caulfield <ccaulfie>
Reviewed-by: Jan Friesse <jfriesse>
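
The fragmentation described in the commit message above can be illustrated with a little arithmetic. This is only a sketch: the real threshold is whatever cpg_max_atomic_msgsize_get() returns (slightly under 1Mb according to the commit message), the value 1000000 used here is an assumption for illustration, the fragment_count helper is hypothetical, and any per-fragment header overhead is ignored.

```shell
#!/bin/sh
# Ceiling division: how many corosync messages a CPG payload of
# msg_size bytes would need if each fragment carries at most
# max_atomic bytes. Header overhead is deliberately ignored.
fragment_count() {
    msg_size=$1
    max_atomic=$2
    echo $(( (msg_size + max_atomic - 1) / max_atomic ))
}

# Assumed threshold, NOT the real value; query the library for that.
max_atomic=1000000

fragment_count 500000 "$max_atomic"    # prints 1 (sent as a single message)
fragment_count 2500000 "$max_atomic"   # prints 3 (split into three fragments)
```

A sender that must not be interrupted mid-message could compare its payload size against the real threshold and stay below it, since only fragmented sends span several ring messages.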

Comment 25 Jan Friesse 2015-06-22 08:28:19 UTC
Created attachment 1041624
Really add cpghum

Really add cpghum

Signed-off-by: Jan Friesse <jfriesse>

Comment 27 Jan Friesse 2015-06-22 14:05:07 UTC
Created attachment 1041839
Don't link with libz when not needed

Don't link with libz when not needed

Commit 8cc8e513633a1a8b12c416e32fb5362fcf4d65dd added a check for libz,
resulting in linking with libz for all libraries. This is not the expected
behavior. The patch solves it by defining an automake conditional so that
cpghum is linked only if libz is available and the LIBS variable is not
modified at all.

Signed-off-by: Jan Friesse <jfriesse>

Comment 28 Nate Straz 2015-08-14 16:02:34 UTC
I'm not able to get through the test case David used in bug 1174462 comment 8.  Is there a configuration change that's needed too?


[root@host-026 ~]# for x in `seq 1 40`; do pcs resource create FAKE$x Dummy meta target-role=Stopped fake="`openssl rand -hex 32000`" || break; echo $x done; done
1 done
2 done
3 done
4 done
Error: unable to get cib
Error: unable to get cib
[root@host-026 ~]# tail /var/log/messages -n 30
Aug 14 10:59:31 host-026 crmd[13434]:  notice: Initiating action 16: monitor FAKE2_monitor_0 on host-027
Aug 14 10:59:31 host-026 crmd[13434]:  notice: Initiating action 14: monitor FAKE2_monitor_0 on host-026 (local)
Aug 14 10:59:31 host-026 crmd[13434]:  notice: Initiating action 15: probe_complete probe_complete-host-027 on host-027 - no waiting
Aug 14 10:59:31 host-026 crmd[13434]:  notice: Initiating action 17: probe_complete probe_complete-host-028 on host-028 - no waiting
Aug 14 10:59:31 host-026 crmd[13434]:  notice: Operation FAKE2_monitor_0: not running (node=host-026, call=43, rc=7, cib-update=479, confirmed=true)
Aug 14 10:59:31 host-026 crmd[13434]:  notice: Initiating action 13: probe_complete probe_complete-host-026 on host-026 (local) - no waiting
Aug 14 10:59:31 host-026 crmd[13434]:  notice: Transition 381 (Complete=7, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-52.bz2): Complete
Aug 14 10:59:31 host-026 crmd[13434]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Aug 14 10:59:32 host-026 cibadmin[1527]:  notice: Invoked: /usr/sbin/cibadmin --replace -o configuration -V --xml-pipe
Aug 14 10:59:32 host-026 crmd[13434]:  notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Aug 14 10:59:32 host-026 pengine[13433]:  notice: Calculated Transition 382: /var/lib/pacemaker/pengine/pe-input-53.bz2
Aug 14 10:59:32 host-026 crmd[13434]:  notice: Initiating action 18: monitor FAKE3_monitor_0 on host-028
Aug 14 10:59:32 host-026 crmd[13434]:  notice: Initiating action 16: monitor FAKE3_monitor_0 on host-027
Aug 14 10:59:32 host-026 crmd[13434]:  notice: Initiating action 14: monitor FAKE3_monitor_0 on host-026 (local)
Aug 14 10:59:32 host-026 crmd[13434]:  notice: Initiating action 15: probe_complete probe_complete-host-027 on host-027 - no waiting
Aug 14 10:59:32 host-026 crmd[13434]:  notice: Initiating action 17: probe_complete probe_complete-host-028 on host-028 - no waiting
Aug 14 10:59:32 host-026 crmd[13434]:  notice: Operation FAKE3_monitor_0: not running (node=host-026, call=47, rc=7, cib-update=481, confirmed=true)
Aug 14 10:59:32 host-026 crmd[13434]:  notice: Initiating action 13: probe_complete probe_complete-host-026 on host-026 (local) - no waiting
Aug 14 10:59:32 host-026 crmd[13434]:  notice: Transition 382 (Complete=7, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-53.bz2): Complete
Aug 14 10:59:32 host-026 crmd[13434]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Aug 14 10:59:32 host-026 cibadmin[1547]:  notice: Invoked: /usr/sbin/cibadmin --replace -o configuration -V --xml-pipe
Aug 14 10:59:33 host-026 cib[13429]:   error: Compression of 329080 bytes failed: output data will not fit into the buffer provided (-8)
Aug 14 10:59:33 host-026 cib[13429]:   error: Could not compress the message into less than the configured ipc limit (131072 bytes).Set PCMK_ipc_buffer to a higher value (658160 bytes suggested)
Aug 14 10:59:33 host-026 cib[13429]:  notice: Notification failed: Message too long (-90)
Aug 14 10:59:33 host-026 cib[13429]:   error: Compression of 286029 bytes failed: output data will not fit into the buffer provided (-8)
Aug 14 10:59:33 host-026 cib[13429]:   error: Could not compress the message into less than the configured ipc limit (131072 bytes).Set PCMK_ipc_buffer to a higher value (1316320 bytes suggested)
Aug 14 10:59:33 host-026 cib[13429]:  notice: Message to 0x18fdd00[1551] failed: Message too long (-90)
Aug 14 10:59:33 host-026 cib[13429]: warning: A-Sync reply to cibadmin failed: No message of desired type
Aug 14 11:00:01 host-026 systemd: Started Session 1541 of user root.
Aug 14 11:00:01 host-026 systemd: Starting Session 1541 of user root.

[root@host-026 ~]# rpm -q pacemaker corosync libqb
pacemaker-1.1.13-6.el7.x86_64
corosync-2.3.4-7.el7.x86_64
libqb-0.17.1-2.el7.x86_64
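
The cib error lines in the log above already carry a suggested PCMK_ipc_buffer value. A minimal sketch of pulling the largest suggestion out of such a log follows; the largest_suggestion helper is hypothetical, and the two input lines are trimmed copies of the messages quoted above.

```shell
#!/bin/sh
# Read log lines on stdin and print the largest value found in
# "(N bytes suggested)" hints. Lines without a hint are ignored.
largest_suggestion() {
    sed -n 's/.*(\([0-9][0-9]*\) bytes suggested).*/\1/p' | sort -n | tail -n 1
}

# Trimmed copies of the two cib error messages from the log above.
largest_suggestion <<'EOF'
error: Could not compress the message into less than the configured ipc limit (131072 bytes).Set PCMK_ipc_buffer to a higher value (658160 bytes suggested)
error: Could not compress the message into less than the configured ipc limit (131072 bytes).Set PCMK_ipc_buffer to a higher value (1316320 bytes suggested)
EOF
# prints 1316320
```

On a live node the same function could be fed from /var/log/messages instead of the here-document.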

Comment 29 Nate Straz 2015-08-14 20:14:28 UTC
I found the setting in /etc/sysconfig/pacemaker to adjust the IPC buffer and set it to 2MB.  This allowed me to get through the test cases in bug 1174462 comment 8 and 9.
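
For readers who want to repeat this, the change amounts to one variable assignment. The sketch below applies it to a scratch file rather than the real /etc/sysconfig/pacemaker; the set_ipc_buffer helper and the scratch file contents are made up for illustration, and only the variable name PCMK_ipc_buffer and the 2MB value come from this report. The report does not say whether pacemaker has to be restarted afterwards, though that seems likely.

```shell
#!/bin/sh
# Set PCMK_ipc_buffer=<bytes> in a sysconfig-style file:
# rewrite an existing uncommented assignment, otherwise append one.
set_ipc_buffer() {
    conf=$1
    bytes=$2
    if grep -q '^PCMK_ipc_buffer=' "$conf"; then
        sed "s/^PCMK_ipc_buffer=.*/PCMK_ipc_buffer=$bytes/" "$conf" > "$conf.new" &&
            mv "$conf.new" "$conf"
    else
        echo "PCMK_ipc_buffer=$bytes" >> "$conf"
    fi
}

# Scratch stand-in for /etc/sysconfig/pacemaker (contents are invented).
conf=$(mktemp)
echo '# PCMK_ipc_buffer=131072' > "$conf"

set_ipc_buffer "$conf" $((2 * 1024 * 1024))   # 2MB expressed in bytes
grep '^PCMK_ipc_buffer=' "$conf"              # prints PCMK_ipc_buffer=2097152
```

The commented-out line is left untouched, so the file keeps its original hint alongside the active setting.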

Comment 30 errata-xmlrpc 2015-11-19 11:41:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2354.html