Bug 1001210 - If rdma transport is used in cluster.conf, cman is unable to start after shutdown.
If rdma transport is used in cluster.conf, cman is unable to start after shut...
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync (Show other bugs)
6.4
x86_64 Linux
medium Severity high
: rc
: ---
Assigned To: Jan Friesse
Cluster QE
: OtherQA
Depends On: 1055584
Blocks:
  Show dependency treegraph
 
Reported: 2013-08-26 14:13 EDT by Yevheniy Demchenko
Modified: 2014-10-14 03:11 EDT (History)
9 users (show)

See Also:
Fixed In Version: corosync-1.4.7-1.el6
Doc Type: Bug Fix
Doc Text:
Cause: User use IBA (infiniband) as transport protocol for corosync (in native iba mode). Corosync is stopped and started again. User can also restart iba sm. Consequence: Corosync doesn't start or doesn't handle iba sm restart. Fix: Almost inexpressible huge number of patches. See comment 9. Result: Corosync now start and survive iba sm restart.
Story Points: ---
Clone Of:
: 1102057 (view as bug list)
Environment:
Last Closed: 2014-10-14 03:11:29 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
cluster.conf (1.39 KB, application/xml)
2013-08-26 14:13 EDT, Yevheniy Demchenko
no flags Details
/var/log/messages on node1 (36.32 KB, text/plain)
2013-08-26 14:14 EDT, Yevheniy Demchenko
no flags Details
/var/log/messages on node2 (91.67 KB, text/plain)
2013-08-26 14:14 EDT, Yevheniy Demchenko
no flags Details
opensm.conf (14.41 KB, application/x-troff-man)
2013-08-26 14:15 EDT, Yevheniy Demchenko
no flags Details
IB partitions.conf (34 bytes, application/x-troff-man)
2013-08-26 14:16 EDT, Yevheniy Demchenko
no flags Details
Patch to resolve the issue (2.80 KB, patch)
2013-09-04 19:44 EDT, Yevheniy Demchenko
no flags Details | Diff
01totemiba_recv_buf_size.patch (2.40 KB, patch)
2013-09-14 07:58 EDT, Yevheniy Demchenko
no flags Details | Diff
02totemiba_correct_poll_dispatch_del.patch (930 bytes, patch)
2013-09-14 07:58 EDT, Yevheniy Demchenko
no flags Details | Diff
03totemiba_correct_handler_params.patch (3.65 KB, patch)
2013-09-14 07:58 EDT, Yevheniy Demchenko
no flags Details | Diff
04totemiba_rrp.patch (1.18 KB, patch)
2013-09-14 07:59 EDT, Yevheniy Demchenko
no flags Details | Diff
05totemiba_mtu_check.patch (1.30 KB, patch)
2013-09-14 07:59 EDT, Yevheniy Demchenko
no flags Details | Diff
06totemiba_add_imm_data.patch (1.67 KB, patch)
2013-09-14 08:00 EDT, Yevheniy Demchenko
no flags Details | Diff
05totemiba_mtu_check.patch (1.30 KB, patch)
2013-09-14 08:35 EDT, Yevheniy Demchenko
no flags Details | Diff
05totemiba_mtu_check.patch (1.34 KB, patch)
2013-09-14 08:39 EDT, Yevheniy Demchenko
no flags Details | Diff
03totemiba_correct_handler_params.patch - take 2 (4.46 KB, patch)
2013-09-20 05:10 EDT, Jan Friesse
no flags Details | Diff
04totemiba_rrp.patch - take 2 (1.75 KB, patch)
2013-09-20 05:11 EDT, Jan Friesse
no flags Details | Diff
05totemiba_mtu_check.patch - take 2 (1.88 KB, patch)
2013-09-20 05:14 EDT, Jan Friesse
no flags Details | Diff
Add multicast recovery to IB (3.45 KB, patch)
2014-05-04 14:12 EDT, Yevheniy Demchenko
no flags Details | Diff
totemiba: Add multicast recovery - upstream commit (4.34 KB, patch)
2014-05-14 08:59 EDT, Jan Friesse
no flags Details | Diff
totemiba: Fix incorrect failed log message (1.35 KB, patch)
2014-05-15 10:19 EDT, Jan Friesse
no flags Details | Diff
totemiba: Use size of struct mcast in MTU check (1.52 KB, patch)
2014-05-28 08:21 EDT, Jan Friesse
no flags Details | Diff

  None (edit)
Description Yevheniy Demchenko 2013-08-26 14:13:47 EDT
Created attachment 790643 [details]
cluster.conf

Description of problem:
cman is unable to start after shutdown if rdma transport is used

Version-Release number of selected component (if applicable):
openais-1.1.1-7.el6.x86_64
clusterlib-3.0.12.1-49.el6_4.1.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create 2-node cluster with transport=rdma
2. "#service cman start"  on both nodes
3. "#service cman stop remove" on node2
4. "#service cman start" on node2

Actual results:
Cman does not start, lots of errors in logs on both nodes, cluster components on node1 are not usable until node2 is fenced.

Expected results:
Cman starts cleanly.

Additional info:
Using Mellanox ConnectX2 IB cards with fw 2.9.1000.
Tried different infiniband configurations (even peer-to-peer), different IB mtus and netmtu in cluster.conf with no success. transport=udp works fine through ipoib.
Comment 1 Yevheniy Demchenko 2013-08-26 14:14:33 EDT
Created attachment 790644 [details]
/var/log/messages on node1
Comment 2 Yevheniy Demchenko 2013-08-26 14:14:59 EDT
Created attachment 790645 [details]
/var/log/messages on node2
Comment 3 Yevheniy Demchenko 2013-08-26 14:15:33 EDT
Created attachment 790646 [details]
opensm.conf
Comment 4 Yevheniy Demchenko 2013-08-26 14:16:04 EDT
Created attachment 790647 [details]
IB partitions.conf
Comment 6 Yevheniy Demchenko 2013-09-04 19:44:33 EDT
Created attachment 793906 [details]
Patch to  resolve the issue

It seems that buffers used for sending/receiving rdma data should have enough space for the GRH as well. Also, ibv_wc.byte_len is not set to payload len, but to payload len + sizeof (ibv_grh),so this value should be extracted from msg_len in further proceeding.
Proposed patch resolves the problems. Also, it adds test for user-defined mtu size and sets it to 2048 if it is set to larger size as it is hard-coded in totemiba.c to not have MTUs>2048.
Comment 7 Jan Friesse 2013-09-12 10:34:24 EDT
Yevheniy,
nice work, thanks for patch!

I have only one question. How that patch solves problem you were hitting? I mean, corosync was overwriting memory or what?
Comment 8 Yevheniy Demchenko 2013-09-12 14:53:28 EDT
(In reply to Jan Friesse from comment #7)
> Yevheniy,
> nice work, thanks for patch!
> 
> I have only one question. How that patch solves problem you were hitting? I
> mean, corosync was overwriting memory or what?

I think, in UD mode it just strips non-fitting last 40 bytes (size of grh). It won't overwrite memory as sge.length on receiver side is set to 2048 instead of 2048 + sizeof (grh).
Anyway, the patch i've posted has bug itself, please don't use it yet, i'll update it soon. Also, there is another serious issue in corosync vs rdma, i'll post more information later.
Comment 9 Yevheniy Demchenko 2013-09-14 07:57:11 EDT
Some more info about discovered problems.

1. In UD mode receivnig side of RDMA application should have enough space in buffer to hold data and GRH. Also, sge.length on the receiving size should be set to max_msg_size + sizeof (struct ibv_grh). Current corosync doesn't take grh in the account and does not work if mtu is set to the real mtu of IB port (it works if netmtu is set to < 2048-40).
2. ibv_wc.byte_len is the actual lentgh of the received packet, i.e. msg_len + GRH. GRH length should be substracted in further proceeding. If not, it might cause problems when messages get retransmitted, as their apparent size will constantly grow.
3. Current corosync will not work with rdma and mtus > 2048. Most modern IB HW supports 4096 mtu.
Patch 01totemiba_recv_buf_size.patch resolves these issues.
---

4. With first patch applied, corosync still freezes after several peer node connects/disconnects. The freeze happens in recv_token_cq_recv_event_fn in ibv_get_cq_event call. The problems is in fact, that after each peer node connect, recv_token_accept_destroy is called, which tries to call poll_dispatch_delete _after_ freeing of completion_channel. As completion_channel contains fd, handlers are not disconnected from poller properly. This leads to complete inconsistency in subsequent calls to handlers.
Patch 02totemiba_correct_poll_dispatch_del.patch resolves this issue.
---

5. Cosmetic change - parameters in handlers mcast_cq_send_event_fn, mcast_cq_recv_event_fn etc. are defined in weird way.
Patch 03totemiba_correct_handler_params.patch corrects that.
---

6. Even though cman allows redundant rings with rdma, corosync fails to start if they are defined. Resolved in 
Patch 04totemiba_rrp.patch  - this patch is not mine, pulled from this discussion: http://www.spinics.net/lists/corosync/msg02174.html
---

7. Patch 05totemiba_mtu_check.patch checks if configured mtu is allowed by IB HW. It also informs user if it is lower than maximum allowable. This patch is ugly and has to be improved. totemsrp.c subtracts sizeof (mcast) from net_mtu and this structure is not available in totemiba.c, so we just add 98 to totem_config->net_mtu and check against this value. This will stop working sooner or later.
---

8. It is stated in rdma programming user manual, that in case of MCAST we should set wr.opcode to IBV_WR_SEND_WITH_IMM instead of IBV_WR_SEND and set wr.imm_data to qp_num. It is not clear to me why to do that, but anyway
Patch 06totemiba_add_imm_data.patch does that.


Patches 01totemiba_recv_buf_size.patch and 02totemiba_correct_poll_dispatch_del.patch are crucial as corosync is unable to work without them at all if rdma is configured in cluster.conf.
04totemiba_rrp.patch is crucial for RRP mode, it is not possible to use RRP with rdma without this patch.
03,05 and 06 are not critical but recommended.

Feel free to change/combine these patches, they probably need some verifying.

How to test:
On node2:
#service cman start
On node1:
for k in {1..1000} ; do echo $((k++)) ; service cman start ; sleep 1 ; service cman stop ; done
Comment 10 Yevheniy Demchenko 2013-09-14 07:58:10 EDT
Created attachment 797602 [details]
01totemiba_recv_buf_size.patch
Comment 11 Yevheniy Demchenko 2013-09-14 07:58:35 EDT
Created attachment 797603 [details]
02totemiba_correct_poll_dispatch_del.patch
Comment 12 Yevheniy Demchenko 2013-09-14 07:58:58 EDT
Created attachment 797604 [details]
03totemiba_correct_handler_params.patch
Comment 13 Yevheniy Demchenko 2013-09-14 07:59:23 EDT
Created attachment 797605 [details]
04totemiba_rrp.patch
Comment 14 Yevheniy Demchenko 2013-09-14 07:59:48 EDT
Created attachment 797606 [details]
05totemiba_mtu_check.patch
Comment 15 Yevheniy Demchenko 2013-09-14 08:00:15 EDT
Created attachment 797607 [details]
06totemiba_add_imm_data.patch
Comment 16 Yevheniy Demchenko 2013-09-14 08:35:42 EDT
Created attachment 797619 [details]
05totemiba_mtu_check.patch
Comment 17 Yevheniy Demchenko 2013-09-14 08:39:22 EDT
Created attachment 797620 [details]
05totemiba_mtu_check.patch
Comment 18 Steven Dake 2013-09-16 19:03:42 EDT
First off, great work.  Happy to see someone else could understand the IBA implementation, as painful as it is :)  I think the IMM patch is wrong - the man pages said something about IMM should really never be used as some cards don't support it.  I could be wrong on this point, it was 4-5 years ago I implemented it.

Regards
-steve
Comment 19 Yevheniy Demchenko 2013-09-18 08:04:24 EDT
(In reply to Steven Dake from comment #18)
> First off, great work.  Happy to see someone else could understand the IBA
> implementation, as painful as it is :)  I think the IMM patch is wrong - the
> man pages said something about IMM should really never be used as some cards
> don't support it.  I could be wrong on this point, it was 4-5 years ago I
> implemented it.
> 
> Regards
> -steve

Could you please point me to some info regarding IMM? There is not much documentation about IMM and i could not find any advise against using it in man pages. I'm not insisting on this patch as it seems that those imm data are not used anywhere on the receiving side (maybe somewhere in lower levels?).
Comment 20 Jan Friesse 2013-09-20 05:10:15 EDT
Created attachment 800375 [details]
03totemiba_correct_handler_params.patch - take 2

Patch commited into flatiron with fixed spaces.
Comment 21 Jan Friesse 2013-09-20 05:11:04 EDT
Created attachment 800376 [details]
04totemiba_rrp.patch - take 2

Cherry pick of patch from master
Comment 22 Jan Friesse 2013-09-20 05:14:03 EDT
Created attachment 800377 [details]
05totemiba_mtu_check.patch - take 2

I've removed (temporarily) functionality of interface mtu > configured mtu. This is really nice feature and should be implemented also for IP. Also I've increased estimation of structures to 160 bytes (should give us enough space for use same patch in master).
Comment 23 Jan Friesse 2013-09-20 05:36:29 EDT
Yevheniy,
first, thanks very much for your hard work. As you can notice, I've commited patch 1 without changes to flatiron (1.4.x - RHEL 6) and master/needle (2.3.x) branch. Patch 2 is commited to flatiron without changes and to master with needed changes (coropoll -> qb_loop).

I wouldn't really call patch 3 as cosmetic change. Parameters was in wrong order so good catch there. I've only changed two spaces between parameter to one space and commited to master + flatiron.

Patch 4 was already in master branch so I've cherry picked from master to flatrion.

Patch 5 is definitively good patch and very nice idea. What needs to be done correctly is to determine MTU (mtu handling needs some work anyway) and it would be nice to have same functionality for IP. So I've removed (temporary) case where configured_mtu < maximum allowed mtu. My idea is to replace this with something like mtu: auto (default) which will use ethernet/IBA NIC mtu.
Other case maximum_allowed_mtu > configured_mtu is commited both into flatiron and master.

I didn't yet decided what to do with patch 6.

Thanks again for your great work!

Steve: Can you please provide informations Yevheniy was asking about?
Comment 24 Yevheniy Demchenko 2013-09-24 10:03:28 EDT
Jan,
thanks a lot for reviewing and applying my patches. But honestly, i still don't think, that patch 5 is any good yet (idea probably is, but implementation is not).
Also, increasing of redundant space to 160 bytes from 92 makes things even more confusing to users: one would expect clean corosync start if one sets net_mtu to the MTU of underlying device, but corosync wouldn't start as we expect configured mtu to be a bit lower.
If i understand the code correctly, MTU in corosync is handled in layers, and totem_config->net_mtu changes as layers get initialized. The proper function to deal with mtu is *_net_mtu_adjust(), and it seems that these functions are called always in the same order, from lower layers to upper. It is a nice idea to handle mtu in this way as every layer knows that underlying layers have already substracted size of their appended structures from totem_config->net_mtu. In this way we can be sure, that packet of apparent totem_config->net_mtu would be transported correctly by underlying layers.

In case of iba transport the order is following:
totemiba_net_mtu_adjust() - no changes in net_mtu
totemnet_net_mtu_adjust() - no changes in net_mtu
totemsrp_net_mtu_adjust() net_mtu -= sizeof (struct mcast)

The nicest way to check if configured MTU is reasonable is to do it in totemiba_net_mtu_adjust(), and i'll try to rewrite patch5 in a proper way. I'm not sure though if verbs structure is already available at this point - if not, it would be a bit more complicated to determine the real mtu of IB device.
Comment 25 Yevheniy Demchenko 2013-09-24 10:46:11 EDT
> In case of iba transport the order is following:
> totemiba_net_mtu_adjust() - no changes in net_mtu
> totemnet_net_mtu_adjust() - no changes in net_mtu
> totemsrp_net_mtu_adjust() net_mtu -= sizeof (struct mcast)

Replying to myself: i'm wrong, the order is different and makes no sence to me (yet): totemsrp_net_mtu_adjust is called from totempg_initialize before transport initializing.
Comment 26 Yevheniy Demchenko 2013-10-10 15:28:49 EDT
I think, to close this bug patches 1,2,3 and 4 should be applied leaving mtu alone. There are more inconsistences in MTU handling in another parts of code (i.e., totemnet_net_mtu_adjust() is called serveral times from totemrrp_initialize() causing unnecessary decrease of totem_config->net_mtu, which is probably a bug). Probably, it is appropriate to close this bug with proposed patches 1-4 and to open another one about mtu handling. Thanks.
Comment 27 Jan Friesse 2013-10-11 04:19:46 EDT
Yevheniy,
totemrrp_initialize calls mtu adjustment once per NIC so there should be no problem. But other then that, you are right. MTU handling is really done quite badly.

Next release should include same patches as I've committed into corosync.git, so patch 1, 2, 3, 4 and changed patch 5.

Thanks again for all your work,
  Honza
Comment 28 Yevheniy Demchenko 2013-10-11 13:22:28 EDT
(In reply to Jan Friesse from comment #27)
> Yevheniy,
> totemrrp_initialize calls mtu adjustment once per NIC so there should be no
> problem.

Actually, there is a problem. Size of structures added by transport layers (udp, udpu, etc) should be subtracted only once from totem_config->net_mtu independently of how many interfaces user has configured. The net_mtu variable is global. For instance, if udp transport is configured and user configured 4 NICs for rrp, UDP_HEADER_SIZE will be subtracted 4 times from net_mtu decreasing payload size for no reason, because actual network paket will always have only one udp header.

> But other then that, you are right. MTU handling is really done
> quite badly.

> 
> Next release should include same patches as I've committed into
> corosync.git, so patch 1, 2, 3, 4 and changed patch 5.

If we want to leave patch 5, the added size should be changed back to 92 bytes (sizeof(mcast) ). MTU check in patch 5 happens after totemsrp_net_mtu_adjust() but before other mtu adjustments, so we have to take into account only changes in totemsrp_net_mtu_adjust(). But i would rather drop this patch at all for now.

> 
> Thanks again for all your work,
>   Honza

There are still some issues in iba transport, but they are not blocking:
1. 4096 bytes MTU works well witch OFED 2.0, but has some issues with ofed provided in rhel6.4, so 2048 should remain recommended for iba transport.
2. IBA rrp works quite well in passive mode but not in active - some weird delays occur in cluster initialization, so passive mode should be recommended for rrp with infiniband transport.

Thanks,
Yevheniy
Comment 29 Jan Friesse 2013-10-14 10:02:08 EDT
Yevheniy,

(In reply to Yevheniy Demchenko from comment #28)
> (In reply to Jan Friesse from comment #27)
> > Yevheniy,
> > totemrrp_initialize calls mtu adjustment once per NIC so there should be no
> > problem.
> 
> Actually, there is a problem. Size of structures added by transport layers
> (udp, udpu, etc) should be subtracted only once from totem_config->net_mtu
> independently of how many interfaces user has configured. The net_mtu
> variable is global. For instance, if udp transport is configured and user
> configured 4 NICs for rrp, UDP_HEADER_SIZE will be subtracted 4 times from
> net_mtu decreasing payload size for no reason, because actual network paket
> will always have only one udp header.

Yes, you are right. These looks like a serious bug. MTU handling must be redone completely.
> 
> > But other then that, you are right. MTU handling is really done
> > quite badly.
> 
> > 
> > Next release should include same patches as I've committed into
> > corosync.git, so patch 1, 2, 3, 4 and changed patch 5.
> 
> If we want to leave patch 5, the added size should be changed back to 92
> bytes (sizeof(mcast) ). MTU check in patch 5 happens after
> totemsrp_net_mtu_adjust() but before other mtu adjustments, so we have to
> take into account only changes in totemsrp_net_mtu_adjust(). But i would
> rather drop this patch at all for now.
> 

It's already in upstream corosync.git, but of course can be reverted anytime.

> > 
> > Thanks again for all your work,
> >   Honza
> 
> There are still some issues in iba transport, but they are not blocking:
> 1. 4096 bytes MTU works well witch OFED 2.0, but has some issues with ofed
> provided in rhel6.4, so 2048 should remain recommended for iba transport.
> 2. IBA rrp works quite well in passive mode but not in active - some weird
> delays occur in cluster initialization, so passive mode should be
> recommended for rrp with infiniband transport.

Passive is default also for udpu. Active rrp weird behavior is possible to explain technically, but sadly it's very hard to fix. (in short, passive makes progress and send messages even with failed nic, active doesn't).


> 
> Thanks,
> Yevheniy

Thanks again for your work,
  Honza
Comment 32 Yevheniy Demchenko 2014-05-04 14:12:16 EDT
Created attachment 892304 [details]
Add multicast recovery to IB

Besides the problems above, current IB implementation in corosync is unable to survive SubnetManager handover or restart. If SM is migrated to another node, corosync logs "multicast error" and losses connectivity. There is an attempt to address the issue in 07totemiba_mcast_error.patch.
Comment 33 Jan Friesse 2014-05-14 08:57:19 EDT
(In reply to Yevheniy Demchenko from comment #32)
> Created attachment 892304 [details]
> Add multicast recovery to IB
> 
> Besides the problems above, current IB implementation in corosync is unable
> to survive SubnetManager handover or restart. If SM is migrated to another
> node, corosync logs "multicast error" and losses connectivity. There is an
> attempt to address the issue in 07totemiba_mcast_error.patch.

Yevheniy,
thanks for patch. It looks good. I've changed your patch a bit (to be possible to include into upstream, some chunks were not appliable) and commited as 0ea20a3d545c4bf1a02306dd2152eb497a03b7d0 (flatiron branch) and with a little more modifications as 4d6a18d8a5c0001f2eaeebb79d75f999c671cb74 for master branch.

Thanks,
  Honza
Comment 34 Jan Friesse 2014-05-14 08:59:01 EDT
Created attachment 895478 [details]
totemiba: Add multicast recovery - upstream commit

Totemiba wasn't able to survive SubnetManager handover or
restart. If SM was migrated to another node, corosync logged
"multicast error" and losses connectivity.

Commit should solve this situation.
Comment 35 Yevheniy Demchenko 2014-05-14 09:31:59 EDT
Thanks a lot for reviewing and commiting, but there is an obvious error in commited patch, parameters for printf are in wrong order, errno and MCAST_REJOIN_MSEC are swapped.

	res = rdma_join_multicast (instance->mcast_cma_id, &instance->mcast_addr, instance);
	if (res != 0) {
		log_printf (LOGSYS_LEVEL_DEBUG,
		    "rdma_join_multicast failed, errno=%d, rejoining in %u ms\n",
		    MCAST_REJOIN_MSEC,
		    errno);

Also, we've changed 
log_printf (LOGSYS_LEVEL_DEBUG, "Joined multicast!\n"); 
to 
log_printf (LOGSYS_LEVEL_ERROR, "Joined multicast!\n"); 
on our systems, log messages look more clear after that, i.e.:

May  6 14:09:05 ar02 corosync[1740]:   [TOTEM ] multicast error, trying to rejoin in 100ms
May  6 14:09:08 ar02 corosync[1740]:   [TOTEM ] Joined multicast!
Comment 36 Jan Friesse 2014-05-15 10:19:51 EDT
Created attachment 895956 [details]
totemiba: Fix incorrect failed log message

rdma_join_multicast failed ... message parameters was swapped.

Also information about multicast join is now logged as notice.
Comment 37 Jan Friesse 2014-05-15 11:30:53 EDT
(In reply to Yevheniy Demchenko from comment #35)
> Thanks a lot for reviewing and commiting, but there is an obvious error in
> commited patch, parameters for printf are in wrong order, errno and
> MCAST_REJOIN_MSEC are swapped.
> 
> 	res = rdma_join_multicast (instance->mcast_cma_id, &instance->mcast_addr,
> instance);
> 	if (res != 0) {
> 		log_printf (LOGSYS_LEVEL_DEBUG,
> 		    "rdma_join_multicast failed, errno=%d, rejoining in %u ms\n",
> 		    MCAST_REJOIN_MSEC,
> 		    errno);
> 
> Also, we've changed 
> log_printf (LOGSYS_LEVEL_DEBUG, "Joined multicast!\n"); 
> to 
> log_printf (LOGSYS_LEVEL_ERROR, "Joined multicast!\n"); 
> on our systems, log messages look more clear after that, i.e.:
> 
> May  6 14:09:05 ar02 corosync[1740]:   [TOTEM ] multicast error, trying to
> rejoin in 100ms
> May  6 14:09:08 ar02 corosync[1740]:   [TOTEM ] Joined multicast!

Yevheniy,
thanks for comment. I've fixed that problem + made "Joined multicast" message into NOTICE level (I believe error is just too much and it's not error actually).

I have question. Are you planning to add there some other patches? Because I would like to make final/almost final RHEL 6.6 build with patches included and if it would be possible I would be more then happy if you could test if that build solved problem(s) with IBA. Also if you would like to add some other patch later, we can just open new BZ for 6.7.

Thanks,
  Honza
Comment 39 Yevheniy Demchenko 2014-05-16 09:31:30 EDT
Honzo,
Currently corosync with all proposed patches seems to work fine with IB on our systems, i'd be happy to test corosync-1.4.1-19.el6, could you please provide a link to fetch its SRPM?
Comment 40 Jan Friesse 2014-05-19 04:30:51 EDT
Yevheniy,
current package (srpm and built rpms, which has all patches included) are at
http://honzaf.fedorapeople.org/corosync-1.4.1-19/. 

Thanks again,
  Honza
Comment 41 Yevheniy Demchenko 2014-05-20 13:54:24 EDT
New version was tested and result are promising so far.
Cluster doesn't crash at node connects/disconnects.
RRP works fine in passive mode.
IB stack survives SM handovers.
Unfortunately, MTU check is not correct, if one has IB cards with port mtus set to 2048 and defines netmtu=2048 in cluster.conf, cluster is unable to start with error "requested net_mtu is 2110 and is larger than the active port mtu 2048", which is confusing. If mtu patch is desired, the size of added structures should be set exactly to the sizeof(struct mcast), defined in totemsrp.c, which is 98 bytes.
Comment 42 Jan Friesse 2014-05-26 11:35:54 EDT
Yevheniy,
thanks for testing. As we already discussed, getting of sizeof(struct mcast) is hard. So I will probably revert 3403fe06e8a915632679c48ccccb2662521746f2 (upstream master) and 3403fe06e8a915632679c48ccccb2662521746f2 (upstream flatiron).

Just one question. What will happen if configured mtu > iba mtu? Does it fragment or just throw away packets?
Comment 43 Yevheniy Demchenko 2014-05-27 14:24:12 EDT
AFAIK, if configured mtu > MAX_MTU_SIZE, corosync would just segfault at memcpy(msg,ms,msg_len), as there is MAX_MTU_SIZE space in msg and msg_len might be larger. Even worse thing is, that corosync might work for a quite long time as messages are usually small and then just crash unpredictably while trying to transmit larger message.
IF port_mtu < configured_mtu <= MAX_MTU_SIZE, corosync will just start trying to retransmit messages, which are larger then port_mtu, because IB is unable to transmit a single message larger than active_port_mtu in UD mode. Again, corosync will work fine for some time and then will start to infinitely retransmit larger message once it appears.
I could do some real-life tests for you, but i'm almost sure, that corosync won't work in IB if it hits the message larger than the active port mtu.
Comment 44 Jan Friesse 2014-05-28 08:02:44 EDT
Yevheniy,

> AFAIK, if configured mtu > MAX_MTU_SIZE, corosync would just segfault at
> memcpy(msg,ms,msg_len), as there is MAX_MTU_SIZE space in msg and msg_len

Oh. Ok, this is definitively something to fix.

> might be larger. Even worse thing is, that corosync might work for a quite
> long time as messages are usually small and then just crash unpredictably
> while trying to transmit larger message.

Yes. Also this is very hard to debug, because if node is quiet and no message > netmtu is created, everything seems to work...

> IF port_mtu < configured_mtu <= MAX_MTU_SIZE, corosync will just start
> trying to retransmit messages, which are larger then port_mtu, because IB is
> unable to transmit a single message larger than active_port_mtu in UD mode.
> Again, corosync will work fine for some time and then will start to
> infinitely retransmit larger message once it appears.

This sounds really bad (and this is exactly kind of information I was missing... actually I was hoping for fragmentation (as UDP does)).

> I could do some real-life tests for you, but i'm almost sure, that corosync
> won't work in IB if it hits the message larger than the active port mtu.

I'm not sure that I understand you at this point. Corosync internally fragments (and assembles if needed) packets. So it should be able to send/receive packet of arbitrary length. Problem is, if configured_mtu > port_mtu.

Anyway. To have something into release. I will change constant in "instance->totem_config->net_mtu + 160" to 98 as you proposed. We cannot change structures of sent packets in flatiron anyway, so this will be just MAGIC constant (and hopefully makes iba work almost correcly).

I will open new BZ for 6.7 as a next step and I will try to solve that problem somehow systematically.

Thanks again,
  Honza
Comment 45 Jan Friesse 2014-05-28 08:21:50 EDT
Created attachment 899961 [details]
totemiba: Use size of struct mcast in MTU check
Comment 46 Jan Friesse 2014-05-28 08:26:42 EDT
Yevheniy,
next version bz is https://bugzilla.redhat.com/show_bug.cgi?id=1102057. We can continue discuss and fixing bugs there.
Comment 47 Yevheniy Demchenko 2014-05-28 08:34:29 EDT
(In reply to Jan Friesse from comment #44)

> > I could do some real-life tests for you, but i'm almost sure, that corosync
> > won't work in IB if it hits the message larger than the active port mtu.
> 
> I'm not sure that I understand you at this point. Corosync internally
> fragments (and assembles if needed) packets. So it should be able to
> send/receive packet of arbitrary length. Problem is, if configured_mtu >
> port_mtu.
 
Sorry, i didn't express myself clear, corosync does fragment _messages_ in upper layers so that _pakets_ won't exceed configured net mtu, problems will occur if the paket (not message) size is larger then the active port mtu.

> Anyway. To have something into release. I will change constant in
> "instance->totem_config->net_mtu + 160" to 98 as you proposed. We cannot
> change structures of sent packets in flatiron anyway, so this will be just
> MAGIC constant (and hopefully makes iba work almost correcly).
> 

I'm not a C guru, so i wonder, maybe it makes sense to determine precisely the size of (struct mcast) at configure time and pass it to totemiba.c as -Dconstant? 

> I will open new BZ for 6.7 as a next step and I will try to solve that
> problem somehow systematically.

We have some ideas about better mtu handling, but they require changes in other parts of corosync code.

> 
> Thanks again,
>   Honza
Comment 49 Jaroslav Kortus 2014-07-22 16:50:52 EDT
Yevheniy, is corosync running well with the patches?
Comment 50 Yevheniy Demchenko 2014-07-23 02:50:19 EDT
(In reply to Jaroslav Kortus from comment #49)
> Yevheniy, is corosync running well with the patches?

Jaroslave, currently we are not experiencing any problems with patched corosync and IB.
Comment 53 errata-xmlrpc 2014-10-14 03:11:29 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1508.html

Note You need to log in before you can comment on or make changes to this bug.