Bug 1572886 - add delay to `pcs cluster start --all` to avoid corosync JOIN flood
Summary: add delay to `pcs cluster start --all` to avoid corosync JOIN flood
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pcs
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: rc
Assignee: Tomas Jelinek
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
Reported: 2018-04-28 12:14 UTC by Josef Zimek
Modified: 2018-10-30 08:07 UTC
CC List: 12 users

Fixed In Version: pcs-0.9.165-3.el7
Doc Type: Bug Fix
Doc Text:
Starting `corosync` on all nodes at the same time may cause a JOIN flood, which may result in some nodes not joining the cluster. With this update, each node starts `corosync` with a small delay to reduce the risk of this happening.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-30 08:06:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
proposed fix (12.72 KB, patch)
2018-06-20 12:43 UTC, Tomas Jelinek
additional fix (1023 bytes, patch)
2018-08-03 13:13 UTC, Tomas Jelinek


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:3066 None None None 2018-10-30 08:07 UTC
Red Hat Bugzilla 1572892 None CLOSED Corosync is prone to flooding network with JOIN messages 2019-04-16 12:57 UTC
Red Hat Bugzilla 1618775 None None None 2019-04-16 12:57 UTC
Red Hat Bugzilla 1622198 None None None 2019-04-16 12:57 UTC
Red Hat Knowledge Base (Solution) 3554381 None None None 2018-08-07 13:11 UTC

Internal Trackers: 1572892 1618775 1622198

Description Josef Zimek 2018-04-28 12:14:41 UTC
Description of problem:

Corosync is prone to flooding the network with JOIN messages, which may occasionally result in corosync failing to join the cluster. The more nodes there are in the cluster, the higher the chance of hitting this problem, especially if corosync starts at the same time on all nodes. Adding a small delay to the sequence of starting corosync on each node when the --all parameter is used will minimize the conditions for this problem to emerge. The delay can vary based on the total number of nodes in the cluster.


For informational purposes, this is how corosync fails to join the cluster under the scenario described above:

Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 66, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 67, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 68, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 69, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 70, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 71, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 72, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 73, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [MAIN  ] Denied connection, corosync is not ready
Mar 20 09:24:51 [localhost] corosync[3455]: [QB    ] Denied connection, is not ready (3455-3459-27)
Mar 20 09:24:51 [localhost] corosync[3455]: [MAIN  ] cs_ipcs_connection_destroyed()



Version-Release number of selected component (if applicable):
pcs-0.9.158-6.el7_4.1.x86_64
corosync-2.4.0-9.el7_4.2.x86_64

How reproducible:
Randomly

Steps to Reproduce:
Due to the random nature of the reproducer there are no clear steps. The issue was reported when a high number of nodes attempted to start the cluster at the same time; `pcs cluster start --all` leads to this situation more often in big clusters. With the adoption of Ansible for massive deployments this issue becomes more obvious.

Actual results:
Command `pcs cluster start --all` starts corosync on all cluster nodes at the same time, which in some cases leads to corosync JOIN messages flooding the network and ends up with nodes not joining the cluster.

Expected results:
Command `pcs cluster start --all` starts corosync on all cluster nodes with a certain delay to avoid flooding the network with JOIN messages.

Additional info:

Comment 9 Tomas Jelinek 2018-05-03 10:23:31 UTC
This is possible to do in pcs. However, we need a few points to be clarified first:

1) What is a sane delay? Should it apply to each node or to a group of nodes? Should it be applied like this:
start node1, wait, start node2, wait, start node3, wait...
or like this
start nodes 1, 2 and 3, wait, start nodes 4, 5 and 6, wait...

2) Currently, pcs starts corosync and then pacemaker on each node independently. Should this be coordinated - first start corosync on all nodes, then start pacemaker on all nodes? Why I am asking: consider a delay of 1 second for each node in a 16-node cluster. The first node starts corosync and pacemaker, the second node does the same 1 second later, and so on. Once 9 nodes have started, quorum is acquired and pacemaker starts running resources. After that the rest of the nodes get started and pacemaker may start moving resources to the new nodes. This is something we definitely should avoid. I see two ways to do it: set the delay small enough, or start corosync and pacemaker in a coordinated manner.

Comment 10 Ken Gaillot 2018-05-03 16:50:54 UTC
It could make sense to start corosync everywhere first, then pacemaker. That would keep pacemaker from fencing slow-joining nodes as soon as it gains quorum (which will be a bigger problem if we're intentionally delaying some starts).

The main drawback would be when some nodes end up not starting (e.g. powered off) or take a long time to start -- the whole cluster would be blocked until they time out.

BTW pacemaker *should* be able to handle simultaneous start-up, so it wouldn't need to be delayed if done in a separate step. You might get an election storm early on, but it should settle before too long.

It would be a bit complicated, but maybe start corosync in groups, with the delay between starting groups, but don't wait for one group's starts to complete before starting the next (wait the delay only). Wait to start pacemaker until all corosync starts have been initiated, but only wait for the local corosync start to complete (i.e. don't wait for all corosync starts to complete).
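As a minimal sketch of this group-based scheme (assuming hypothetical callables `start_corosync` and `start_pacemaker`; the real pcs implementation differs):

```python
import threading
import time

def start_in_groups(nodes, group_size, delay, start_corosync, start_pacemaker):
    """Initiate corosync starts group by group, waiting only the delay
    between groups (not for completion); then start pacemaker per node,
    each node waiting only for its own corosync start to finish."""
    corosync_threads = {}
    for i in range(0, len(nodes), group_size):
        for node in nodes[i:i + group_size]:
            t = threading.Thread(target=start_corosync, args=(node,))
            t.start()
            corosync_threads[node] = t
        time.sleep(delay)  # wait the delay only, not for the group to finish

    # All corosync starts have been initiated; pacemaker on each node
    # waits only for the local corosync start to complete.
    def node_pacemaker(node):
        corosync_threads[node].join()  # local corosync only
        start_pacemaker(node)

    pacemaker_threads = [
        threading.Thread(target=node_pacemaker, args=(n,)) for n in nodes
    ]
    for t in pacemaker_threads:
        t.start()
    for t in pacemaker_threads:
        t.join()
```

This keeps a failed corosync start on one node from blocking pacemaker on the others, since only the local corosync start gates each node's pacemaker.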

Comment 12 Tomas Jelinek 2018-06-15 07:55:34 UTC
Chrissie, can you share your point of view regarding this bz? Thanks!

Comment 14 Tomas Jelinek 2018-06-20 12:43 UTC
Created attachment 1453211 [details]
proposed fix

Thanks for the ideas.

> Wait to start pacemaker until all corosync starts have been initiated, but only wait for the local corosync start to complete (i.e. don't wait for all corosync starts to complete).

This does not work in cases when the local corosync is already running or is not going to be started at all. This may very well happen as 'pcs cluster start --all' uses the same code as 'pcs cluster start node1 node2...' and we want to avoid JOIN flood for both cases anyway.

I went with the following solution:
1. Start corosync on all nodes with 250ms delay between nodes: send a request to node1 for starting corosync, wait 250ms, send a request to node2, wait 250ms...
2. Each request waits for the systemd unit to finish.
3. Once all requests finish, start pacemaker on all nodes in parallel.

This is simple enough and should fix the problem. If it does not, we can twiddle with the delay or put nodes into groups.
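The three steps above can be sketched as follows (a simplified model with a hypothetical `start_service(node, service)` callable that sends a start request and blocks until the systemd unit finishes; the real pcs code differs):

```python
import threading
import time

def start_cluster_all(nodes, start_service, delay=0.25):
    """Start corosync with a 250 ms stagger, then pacemaker in parallel."""
    # 1. + 2. Request corosync on each node 250 ms apart; each request
    # blocks until the systemd unit finishes, so run them in threads.
    corosync_threads = []
    for node in nodes:
        t = threading.Thread(target=start_service, args=(node, "corosync"))
        t.start()
        corosync_threads.append(t)
        time.sleep(delay)
    for t in corosync_threads:
        t.join()

    # 3. Once all corosync requests have finished, start pacemaker
    # on all nodes in parallel.
    pacemaker_threads = [
        threading.Thread(target=start_service, args=(node, "pacemaker"))
        for node in nodes
    ]
    for t in pacemaker_threads:
        t.start()
    for t in pacemaker_threads:
        t.join()
```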


Note the pcs patch has no influence when nodes are started without the help of pcs, such as a parallel reboot of the nodes. That would require a patch in corosync.

Comment 15 Ken Gaillot 2018-06-20 14:38:40 UTC
(In reply to Tomas Jelinek from comment #14)
> Created attachment 1453211 [details]
> proposed fix
> 
> Thanks for the ideas.
> 
> > Wait to start pacemaker until all corosync starts have been initiated, but only wait for the local corosync start to complete (i.e. don't wait for all corosync starts to complete).
> 
> This does not work in cases when the local corosync is already running or is
> not going to be started at all. This may very well happen as 'pcs cluster
> start --all' uses the same code as 'pcs cluster start node1 node2...' and we
> want to avoid JOIN flood for both cases anyway.
> 
> I went with the following solution:
> 1. Start corosync on all nodes with 250ms delay between nodes: send a
> request to node1 for starting corosync, wait 250ms, send a request to node2,
> wait 250ms...
> 2. Each request waits for the systemd unit finish.
> 3. Once all requests finish, start pacemaker on all nodes in parallel.

The problem I see with this is that some nodes may never come up, and this would prevent pacemaker from ever starting anywhere (or at least until some long timeout).

I'm not sure why --all vs only some nodes would need to be handled differently. I would simply wait until step 1 above is complete (whether successful or not) for all nodes that are going to be started, then wait until step 2 is complete locally before starting pacemaker locally.

Comment 16 Christine Caulfield 2018-06-21 07:03:27 UTC
Yes, I didn't see this as a radical change in ordering, just the addition of a delay for corosync startups.

There is no issue when corosync is already running, as it doesn't send a JOIN message anyway.

Comment 17 Tomas Jelinek 2018-06-21 08:14:24 UTC
Sorry, I don't follow.

In comment 10 and comment 13 it has been agreed to start corosync first on all nodes with a delay for each node and then start pacemaker on all nodes simultaneously. The reason for this is to prevent fencing of late starting nodes. Correct?

Now you are telling me corosync and pacemaker should be started on each node in one step with a delay for each node. Correct? If so, this contradicts the previous paragraph, doesn't it?

Or should corosync start be requested with a delay for each node and once all requests have been sent pcs should blindly request pacemaker to start everywhere without waiting for corosync requests to finish? How can we start pacemaker somewhere if we don't know if corosync has been started there?

Are you pointing out this issue cannot actually be fixed in pcs and must be fixed in corosync? This has been known from the start.

Also please stop working with "the local node". What makes you think "the local node" is more significant than the other nodes and thus it should be the one which determines the status / destiny of the cluster? Pcs (the instance which runs the 'pcs cluster start --all' command and communicates with the node instances) doesn't know which node is the local one anyway.

Or maybe I misunderstood what you meant by the local node. Maybe you thought it like this:
1. Send requests to start corosync with a delay for each node.
2. Don't wait for the requests to finish. Once they all have been sent, send requests to start pacemaker.
3. Each node which receives a request to start pacemaker will wait for corosync to start on it and then it will proceed and start pacemaker.

How is this different from the original code, where each node got one request to start a cluster, which resulted in starting corosync first followed by starting pacemaker? Except for the delay (which could be added to the original code) and more complicated code, they seem the same to me. How is this preventing fencing of late starting nodes?

Comment 18 Christine Caulfield 2018-06-21 09:26:22 UTC
Sorry if this isn't clear, I think we're making it sound more complex than it actually is.

All I think we need is exactly what we currently have (as I've seen it in pcs) but with a short delay between starting corosync on each node.

Comment 19 Ken Gaillot 2018-06-21 14:19:01 UTC
(In reply to Tomas Jelinek from comment #17)
> Or maybe I misunderstood what you meant by the local node. Maybe you thought
> it like this:
> 1. Send requests to start corosync with a delay for each node.
> 2. Don't wait for the requests to finish. Once they all have been sent, send
> requests to start pacemaker.
> 3. Each node which receives a request to start pacemaker will wait for
> corosync to start on it and then it will proceed and start pacemaker.

Yes, this :)
 
> How is this different from the original code where each node got one request
> to start a cluster which resulted in starting corosync first followed by
> starting pacemaker? Except the delay (which could be added to the original
> code) and more complicated code they seem to be the same to me. How is this
> preventing fencing of late starting nodes?

The delay prevents the corosync join flood.

The issue with fencing is that delaying the start on some nodes would actually make start-up fencing of those nodes more likely. With the current code, all the nodes (typically) come up within a small enough window that pacemaker doesn't schedule fencing. If we stagger them, then once pacemaker gains quorum, it will have more time to schedule fencing of the remaining nodes (though now that we're talking about just a 250ms delay, it probably isn't a big deal on small clusters, but might be more likely with 5+ nodes).

If we wait until corosync startup is initiated (not completed) everywhere, it doesn't guarantee we won't have startup fencing, but it keeps us from making the situation worse than it is now. We may actually need startup fencing, if some node doesn't come up, so we don't want to wait until corosync completes everywhere (though of course we want to wait until corosync is running on a particular node before starting pacemaker there).

Comment 20 Radek Steiger 2018-06-21 14:42:07 UTC
@Ken: Tomas' current patch does (or at least is supposed to) deal with the corosync flood problem while keeping it safe from pacemaker's startup fencing. Are you actually proposing we should simplify the solution by fixing the join flood but introducing the possibility of startup fencing?

Comment 21 Jan Pokorný [poki] 2018-06-21 15:16:01 UTC
Note that it's my understanding that distributed systems typically
solve the problem of excessive symmetry mutually preventing optimal
progress (solved dead-locking being an imperative, though that doesn't
prevent worst case progress) simply by increasing the chance of
breaking the identical timing of the steps to perform at particular
members.  Assuming the lack of any further knowledge, the easiest
solution is to bet on randomness -- randomized delay shall break this
symmetry to a sufficient extent.
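A sketch of this randomized symmetry-breaking idea (hypothetical names; in a real implementation the tunable would live in corosync.conf, possibly auto-scaled by node count as suggested):

```python
import random
import time

def jittered_start(start_fn, max_jitter_s=2.0):
    """Sleep a random interval before starting, so that nodes booted at
    the same instant do not all send their JOIN messages simultaneously."""
    time.sleep(random.uniform(0.0, max_jitter_s))
    return start_fn()

def scaled_jitter(node_count, base_s=0.25):
    """One possible auto-scaling: more nodes, wider jitter window."""
    return base_s * node_count
```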

Wouldn't this be applicable here?  It would make a ton of sense for
corosync to implement such a mechanism, tunable in corosync.conf,
possibly with auto-scaling per the number of specified nodes and
perhaps per some other dynamic circumstances (number of differently
sourced JOIN requests seen in the very initial scanning phase?)
Indeed, pcs could emulate such a behaviour, but doesn't seem to
be the ultimate level to solve the problem at.

IOW. we are again hitting the limits of how reasoning about the
distributed (and generally parallel) systems is hard, especially
with the "localhost" optics.

Comment 22 Jan Pokorný [poki] 2018-06-21 21:55:21 UTC
Moreover, is this UDPU-specific?  If so, then multicast could be the
remedy here.  Consequently, any kind of dealing with the described
situation shall only be limited to affected scenarios if that's the
case.

Comment 23 Christine Caulfield 2018-06-22 07:18:07 UTC
The problem with doing it inside corosync is that it doesn't know how many other nodes are being started at the same time - pcs does. There's no point in putting a random start delay into corosync if only one node is being started.

I don't deny that there are things that could be done in corosync (and has already been mentioned here) but this is meant to be a 'get stuff working' solution for the system we have.

It's been a while since I tested this but multicast is no better than UDPU; in fact I think it's worse. We need to check knet though.

Comment 25 Jan Pokorný [poki] 2018-06-22 16:15:03 UTC
re [comment 23]:

> The problem with doing it inside corosync is that it doesn't know
> how many other nodes are being started at the same time - pcs does.

That's a bit of a hypocritical statement regarding the use cases:
- parallel SSH (commands sent to a bunch of endpoints at once)
- comparable boot + service startup timings amongst multiple
  machines started at once
- ...
- Ansible triggering corosync/cluster start, again with
  comparable timing of corosync's startup

Actually, if you claim that all connection methods (I am still
uncertain, some testing is perhaps needed) are affected, it would
make sense to invert the behaviours, only preserving the current
one in the form of --fastjoin or a similar switch, for cases where one
is sure no other nodes are coming up at the same time.

Comment 26 Jan Friesse 2018-07-02 11:26:02 UTC
Actually Corosync already has the ability to wait for a "random" time before sending the join message (corosync.conf - send_join), but I don't think it improves the situation much.

As Chrissie said, multicast is not better, because from the receiving node's point of view there is no difference. Multicast is better only on the sending side and the switch side (applies only to "good" and correctly configured switches).

What really solves the problem is Corosync 3 with Knet, because it is able to fragment the join message depending on the real MTU.

Also, even UDPU should be better in Corosync 3, because the join message is largely reduced by not sending 2*IPv6 addresses per member.

Comment 28 Jan Pokorný [poki] 2018-08-03 12:55:28 UTC
re [comment 26]:

> Actually Corosync already has ability to wait for a "random" time
> before sending join message (corosync.conf - send_join), but I don't
> think it improves situation much.

Thanks, missed that.  Perhaps the respective wording (for needle only,
per what you say) should be suggestive that:

- collective cluster start-up of more than a handful of nodes is better
  coordinated by a third party that would prevent massive initial
  communication parallelism

- when the above is not possible,

> For configurations with less than 32 nodes, this parameter is not
> necessary

  statement seems disproved by this very bug, hence it should be
  worded more carefully

Comment 29 Tomas Jelinek 2018-08-03 13:13 UTC
Created attachment 1472983 [details]
additional fix

Comment 30 Jan Friesse 2018-08-06 11:43:15 UTC
(In reply to Jan Pokorný from comment #28)
> re [comment 26]:
> 
> > Actually Corosync already has ability to wait for a "random" time
> > before sending join message (corosync.conf - send_join), but I don't
> > think it improves situation much.
> 
> Thanks, missed that.  Perhaps the respective wording (for needle only,
> per what you say) should be suggestive that:
> 
> - collective cluster start-up of more than a handful of nodes is better
>   coordinated by a third party that would prevent massive initial
>   communication parallelism
> 
> - when the above is not possible,

Or just wait for a while to collect the results of this BZ fix (= pcs change). If it helps, it's solved; if it doesn't, we can keep speculating.

> 
> > For configurations with less than 32 nodes, this parameter is not
> > necessary
> 
>   statement seems disproved by this very bug, hence it should be
>   worded more carefully

We could add something like "usually".

Comment 32 Jan Pokorný [poki] 2018-08-06 14:05:03 UTC
re [comment 30]:

Agreed the more evidence the better prior to doc tweaking,
and that "usually" could be enough to disrupt blind trust
with carefully listening administrators.
Thanks for considering that.

Comment 39 errata-xmlrpc 2018-10-30 08:06:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3066

