Bug 664969 - corosync-pload crash corosync under load
Summary: corosync-pload crash corosync under load
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Corosync Cluster Engine
Classification: Retired
Component: unknown
Version: 1.3
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Jan Friesse
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-12-22 10:26 UTC by dietmar
Modified: 2011-08-17 07:46 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-08-17 07:46:52 UTC
Embargoed:


Attachments
Patch for cpg sent to ML (3.69 KB, patch)
2011-07-28 14:21 UTC, Jan Friesse
Patch for cfg sent to ML (7.04 KB, patch)
2011-07-28 14:22 UTC, Jan Friesse
Patch for cpg sent to ML (4.21 KB, patch)
2011-07-29 08:20 UTC, Jan Friesse

Description dietmar 2010-12-22 10:26:44 UTC
corosync version 1.3.0:

When I run “corosync-pload” it prints:

# corosync-pload 
Init result 1

The process never exits on its own (I can stop it with Ctrl-C), but it seems to work anyway:

Dec 22 09:32:46 maui corosync[2409]:   [PLOAD ] 1500000 Writes 300 bytes per write   2.495 seconds runtime, 601307.250 TP/S,   172.035 MB/S.
Dec 22 09:32:53 maui corosync[2409]:   [PLOAD ] 1500000 Writes 300 bytes per write   3.062 seconds runtime, 489821.674 TP/S,   140.139 MB/S.
Dec 22 09:33:01 maui corosync[2409]:   [PLOAD ] 1500000 Writes 300 bytes per write   4.372 seconds runtime, 343112.460 TP/S,    98.165 MB/S.
Dec 22 09:33:09 maui corosync[2409]:   [PLOAD ] 1500000 Writes 300 bytes per write   4.369 seconds runtime, 343358.870 TP/S,    98.236 MB/S.
Dec 22 09:33:53 maui corosync[2409]:   [PLOAD ] 1500000 Writes 300 bytes per write   3.475 seconds runtime, 431594.847 TP/S,   123.480 MB/S.

If I now start cpgbench I get:

/corosync-1.3.0/test# ./cpgbench
463802 messages received  1000 bytes per write  10.000 Seconds runtime 46380.121 TP/s  46.380 MB/s.
470350 messages received  2000 bytes per write  10.000 Seconds runtime 47034.864 TP/s  94.070 MB/s.
460633 messages received  3000 bytes per write  10.000 Seconds runtime 46063.231 TP/s 138.190 MB/s.
443571 messages received  4000 bytes per write  10.000 Seconds runtime 44357.016 TP/s 177.428 MB/s.

Everything OK, but if I also start corosync-pload I get a corosync crash:
/corosync-1.3.0/test# ./cpgbench
…
cpg dispatch returned error 2

and the syslog shows:

Dec 22 09:39:45 maui corosync[2409]:   [PLOAD ] 1500000 Writes 300 bytes per write   2.184 seconds runtime, 686771.055 TP/S,   196.487 MB/S.
Dec 22 09:40:03 maui dlm_controld[2479]: cluster is down, exiting
Dec 22 09:40:03 maui fenced[2464]: cluster is down, exiting
Dec 22 09:40:05 maui kernel: dlm: closing connection to node 3

Comment 1 Steven Dake 2011-07-21 20:34:07 UTC
corosync-pload is a developer-only test tool, and I believe we had this discussion on the ML some time ago, so priority is low.

Honza, can you look into removing the segfault that occurs with this test case?

Thanks
-steve

Comment 2 Jan Friesse 2011-07-28 14:21:25 UTC
Created attachment 515732 [details]
Patch for cpg sent to ML

The totem_mcast function can return -1 if corosync is overloaded. Sadly, in
many calls of this function the error code was either not handled at all or
handled by an assert.

This commit changes the behaviour to either return CS_ERR_TRY_AGAIN or pass
the error code to later layers to handle it.
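
The change in error handling described here can be sketched roughly as follows. This is not the attached patch: the multicast call is replaced by a hypothetical function pointer (mcast_fn), the helper name send_or_try_again is invented for illustration, and the numeric values are assumed to match CS_OK and CS_ERR_TRY_AGAIN in corotypes.h.

```c
#include <stddef.h>

/* Assumed to match CS_OK and CS_ERR_TRY_AGAIN in corotypes.h. */
enum { SKETCH_CS_OK = 1, SKETCH_CS_ERR_TRY_AGAIN = 6 };

/* Hypothetical stand-in for a totem multicast call: returns 0 on
 * success and -1 when the daemon is overloaded. */
typedef int (*mcast_fn)(const void *msg, size_t len);

/* Send a message and translate a refused multicast into a retryable
 * error code instead of ignoring it or asserting on it. */
static int send_or_try_again(mcast_fn mcast, const void *msg, size_t len)
{
    if (mcast(msg, len) != 0) {
        return SKETCH_CS_ERR_TRY_AGAIN;
    }
    return SKETCH_CS_OK;
}

/* Two stub transports for demonstration only. */
static int mcast_ok(const void *msg, size_t len)   { (void)msg; (void)len; return 0; }
static int mcast_full(const void *msg, size_t len) { (void)msg; (void)len; return -1; }
```

With the stubs, send_or_try_again(mcast_full, ...) yields the TRY_AGAIN code where an assert would previously have aborted the daemon.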

Comment 3 Jan Friesse 2011-07-28 14:22:42 UTC
Created attachment 515733 [details]
Patch for cfg sent to ML

The totem_mcast function can return -1 if corosync is overloaded. Sadly,
in many calls of this function the error code was either not handled at
all or handled by an assert.

This commit changes the behaviour to either return CS_ERR_TRY_AGAIN or pass
the error code to later layers to handle it.
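
Once a service returns CS_ERR_TRY_AGAIN instead of aborting, the caller is expected to retry. A minimal sketch of such a retry loop is shown below, assuming a generic operation callback (cs_op_fn, a hypothetical type) rather than any specific cfg or cpg call. The numeric values are assumed to match corotypes.h.

```c
#include <stddef.h>

/* Assumed to match CS_OK and CS_ERR_TRY_AGAIN in corotypes.h. */
enum { SKETCH_CS_OK = 1, SKETCH_CS_ERR_TRY_AGAIN = 6 };

/* Hypothetical operation: any call returning a cs_error_t-style code. */
typedef int (*cs_op_fn)(void *ctx);

/* Call op repeatedly while it reports TRY_AGAIN, at most max_attempts
 * times, and return the last result. A real caller would normally
 * sleep or poll between attempts. */
static int retry_try_again(cs_op_fn op, void *ctx, int max_attempts)
{
    int res = SKETCH_CS_ERR_TRY_AGAIN;
    for (int i = 0; i < max_attempts && res == SKETCH_CS_ERR_TRY_AGAIN; i++) {
        res = op(ctx);
    }
    return res;
}

/* Stub: reports TRY_AGAIN until the counter in ctx reaches zero. */
static int busy_then_ok(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return SKETCH_CS_ERR_TRY_AGAIN;
    }
    return SKETCH_CS_OK;
}
```

Bounding the number of attempts keeps a client from spinning forever if the daemon stays overloaded.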

Comment 4 Jan Friesse 2011-07-29 08:20:55 UTC
Created attachment 515838 [details]
Patch for cpg sent to ML

The totem_mcast function can return -1 if corosync is overloaded. Sadly, in
many calls of this function the error code was either not handled at all or
handled by an assert.

This commit changes the behaviour to either return CS_ERR_TRY_AGAIN or pass
the error code to later layers to handle it.

This patch differs from the previous version in that it stores group_name and pid so that they can be restored in message_handler_req_lib_cpg_join.
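
The save-and-restore idea can be illustrated with a small sketch. It is not taken from the attached patch: the struct conn_state, its fields, and the function join_with_rollback are hypothetical, and the multicast is a stub that returns 0 on success and -1 on overload.

```c
#include <string.h>
#include <sys/types.h>

/* Assumed to match CS_OK and CS_ERR_TRY_AGAIN in corotypes.h. */
enum { SKETCH_CS_OK = 1, SKETCH_CS_ERR_TRY_AGAIN = 6 };

/* Hypothetical per-connection state; field names are illustrative. */
struct conn_state {
    char  group_name[128];
    pid_t pid;
};

/* Join handler sketch: update the state, try to multicast, and put the
 * previous values back if the multicast is refused. */
static int join_with_rollback(struct conn_state *st, const char *group,
                              pid_t pid, int (*mcast)(void))
{
    struct conn_state saved = *st;      /* remember old group_name + pid */

    strncpy(st->group_name, group, sizeof(st->group_name) - 1);
    st->group_name[sizeof(st->group_name) - 1] = '\0';
    st->pid = pid;

    if (mcast() != 0) {
        *st = saved;                    /* restore on failure */
        return SKETCH_CS_ERR_TRY_AGAIN;
    }
    return SKETCH_CS_OK;
}

/* Stub transports for demonstration only. */
static int mcast_ok(void)   { return 0; }
static int mcast_full(void) { return -1; }
```

Restoring the saved values means a client that retries the join after CS_ERR_TRY_AGAIN starts from the same state as before the failed attempt.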

Comment 5 Jan Friesse 2011-08-17 07:46:52 UTC
The patch is now included in the flatiron branch, so it will be included in the next release.

