Red Hat Bugzilla – Bug 976861
Possible parallel communication problem with 11+ nodes cluster
Last modified: 2017-06-01 04:51:38 EDT
Just noticed that the send_batch_parallel function currently cannot cope
with 11+ target nodes (provided we only send a single message during
the cluster communication round).
The first 10 items of the communication batch are sent OK, but then the
thread limit kicks in and the rest of the items are silently ignored.
If my (and jrummy's) observation is correct, the algorithm needs to be
slightly extended to be robust enough (coping with at least 16 nodes,
as supported, is the very entry-level fix here; a universal one is better).
OK, attachment 765707 [details] (of [bug 978479]) seems to prove [*] that no
end-point of the "multicast" is ever ignored, regardless of the thread limit.
The rest will simply be processed in one of the subsequent rounds until
the queue is empty.
Lowering the priority, but keeping this open until a final statement is made.
[*] During that experiment, the thread limit was hardcoded to 3, while
the communication happened across 6 (later 8) nodes. What can be observed
is that the communication was split into several subsequent
rounds of 3 communication end-points at a time.
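
For illustration only, the round-based behaviour observed above can be
sketched as follows. This is not the actual send_batch_parallel code; the
names send_in_rounds, send_one and max_threads are hypothetical, and the
sketch merely assumes that pending end-points sit in a queue and that at
most max_threads of them are served per round.

```python
import threading
from collections import deque


def send_in_rounds(targets, send_one, max_threads=10):
    """Deliver to every target, using at most max_threads threads per round.

    Hypothetical sketch: send_one(target) stands in for whatever performs
    the real per-node communication.
    """
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")

    queue = deque(targets)
    results = {}
    rounds = []          # records which targets were served in each round
    lock = threading.Lock()

    def worker(target):
        outcome = send_one(target)
        with lock:
            results[target] = outcome

    while queue:
        # Take at most max_threads items; the remainder stays queued
        # and is picked up by a later round instead of being dropped.
        batch = [queue.popleft() for _ in range(min(max_threads, len(queue)))]
        threads = [threading.Thread(target=worker, args=(t,)) for t in batch]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        rounds.append(batch)

    return results, rounds
```

With 8 targets and max_threads=3, as in the experiment, this sketch serves
the end-points in rounds of 3, 3 and 2, and every target ends up with a
result.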