Red Hat Bugzilla – Bug 1572886
add delay to `pcs cluster start --all` to avoid corosync JOIN flood
Last modified: 2018-10-30 04:07:26 EDT
Description of problem:
Corosync is prone to flooding the network with JOIN messages, which may occasionally result in corosync not joining the cluster. The more nodes there are in the cluster, the higher the chance of hitting this problem - especially if corosync starts at the same time on all nodes. Adding a small delay to the sequence of starting corosync on each node when the --all parameter is used will minimize the conditions for this problem to emerge. The delay can vary based on the total number of nodes in the cluster.

For informational purposes, this is how corosync fails to join the cluster under the scenario described above:

Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 66, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 67, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 68, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 69, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 70, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 71, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 72, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 73, aru 0
Mar 20 09:24:51 [localhost] corosync[3455]: [TOTEM ] install seq 63 aru 63 high seq received 63
Mar 20 09:24:51 [localhost] corosync[3455]: [MAIN ] Denied connection, corosync is not ready
Mar 20 09:24:51 [localhost] corosync[3455]: [QB ] Denied connection, is not ready (3455-3459-27)
Mar 20 09:24:51 [localhost] corosync[3455]: [MAIN ] cs_ipcs_connection_destroyed()

Version-Release number of selected component (if applicable):
pcs-0.9.158-6.el7_4.1.x86_64
corosync-2.4.0-9.el7_4.2.x86_64

How reproducible:
Randomly

Steps to Reproduce:
Due to the random nature of the reproducer there are no clear steps. The issue was reported when a high number of nodes attempt to start the cluster at the same time - `pcs cluster start --all` leads to this situation more often in big clusters. With the adoption of massive Ansible deployments this issue becomes more obvious.

Actual results:
The command `pcs cluster start --all` starts corosync on all cluster nodes at the same time, which in some cases leads to corosync JOIN messages flooding the network and ends up with nodes not joining the cluster.
Expected results:
The command `pcs cluster start --all` starts corosync on all cluster nodes with a certain delay to avoid flooding the network with JOIN messages.

Additional info:
This is possible to do in pcs. However, we need a few points to be clarified first:

1) What is a sane delay? Should it apply to each node or to a group of nodes? Should it be applied like this: start node1, wait, start node2, wait, start node3, wait... or like this: start nodes 1, 2 and 3, wait, start nodes 4, 5 and 6, wait...

2) Currently, pcs starts corosync and then pacemaker on each node independently. Should this be coordinated - first start corosync on all nodes, then start pacemaker on all nodes? Why I am asking: consider a delay of 1 second for each node in a 16-node cluster. The first node starts corosync and pacemaker, the second node does the same 1 second later and so on. Once 9 nodes have started, quorum is acquired and pacemaker starts running resources. After that the rest of the nodes get started and pacemaker may start moving resources to the new nodes. This is something we definitely should avoid. I see two ways to do it: set the delay small enough, or start corosync and pacemaker in a coordinated manner.
It could make sense to start corosync everywhere first, then pacemaker. That would keep pacemaker from fencing slow-joining nodes as soon as it gains quorum (which will be a bigger problem if we're intentionally delaying some starts). The main drawback would be when some nodes end up not starting (e.g. powered off) or take a long time to start -- the whole cluster will be blocked until they time out.

BTW pacemaker *should* be able to handle simultaneous start-up, so it wouldn't need to be delayed if done in a separate step. You might get an election storm early on, but it should settle before too long.

It would be a bit complicated, but maybe start corosync in groups, with the delay between starting groups, but don't wait for one group's starts to complete before starting the next (wait only for the delay). Wait to start pacemaker until all corosync starts have been initiated, but only wait for the local corosync start to complete (i.e. don't wait for all corosync starts to complete).
Chrissie, can you share your point of view regarding this bz? Thanks!
Created attachment 1453211 [details]
proposed fix

Thanks for the ideas.

> Wait to start pacemaker until all corosync starts have been initiated, but only wait for the local corosync start to complete (i.e. don't wait for all corosync starts to complete).

This does not work in cases when the local corosync is already running or is not going to be started at all. This may very well happen as 'pcs cluster start --all' uses the same code as 'pcs cluster start node1 node2...' and we want to avoid the JOIN flood in both cases anyway.

I went with the following solution:
1. Start corosync on all nodes with a 250ms delay between nodes: send a request to node1 for starting corosync, wait 250ms, send a request to node2, wait 250ms...
2. Each request waits for the systemd unit to finish.
3. Once all requests finish, start pacemaker on all nodes in parallel.

This is simple enough and should fix the problem. If it does not, we can twiddle with the delay or put nodes into groups.

Note the pcs patch has no influence when nodes are started without the help of pcs, such as a parallel reboot of all nodes. That would require a patch in corosync.
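To illustrate the staggered start described above, here is a minimal Python sketch of the idea (not the actual pcs patch; the node list and the request helper are hypothetical placeholders):

import time
from concurrent.futures import ThreadPoolExecutor

CLUSTER_NODES = ["node1", "node2", "node3"]  # hypothetical node list
COROSYNC_START_DELAY = 0.25  # 250 ms between corosync start requests

def start_service_on_node(node, service):
    # Placeholder for the real network call: ask the node to start the
    # given systemd unit and return once the unit has finished starting.
    print("requesting start of %s on %s" % (service, node))

def start_cluster_staggered(nodes):
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # 1. Request corosync start on each node, waiting 250 ms between
        #    requests so JOIN messages are not all sent at the same instant.
        corosync_requests = []
        for node in nodes:
            corosync_requests.append(
                pool.submit(start_service_on_node, node, "corosync"))
            time.sleep(COROSYNC_START_DELAY)
        # 2. Each request waits for the corosync unit to finish; wait here
        #    until all of them have completed.
        for request in corosync_requests:
            request.result()
        # 3. Only then start pacemaker on all nodes in parallel.
        pacemaker_requests = [
            pool.submit(start_service_on_node, node, "pacemaker")
            for node in nodes]
        for request in pacemaker_requests:
            request.result()

start_cluster_staggered(CLUSTER_NODES)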
(In reply to Tomas Jelinek from comment #14)
> Created attachment 1453211 [details]
> proposed fix
>
> Thanks for the ideas.
>
> > Wait to start pacemaker until all corosync starts have been initiated, but only wait for the local corosync start to complete (i.e. don't wait for all corosync starts to complete).
>
> This does not work in cases when the local corosync is already running or is
> not going to be started at all. This may very well happen as 'pcs cluster
> start --all' uses the same code as 'pcs cluster start node1 node2...' and we
> want to avoid the JOIN flood in both cases anyway.
>
> I went with the following solution:
> 1. Start corosync on all nodes with a 250ms delay between nodes: send a
> request to node1 for starting corosync, wait 250ms, send a request to node2,
> wait 250ms...
> 2. Each request waits for the systemd unit to finish.
> 3. Once all requests finish, start pacemaker on all nodes in parallel.

The problem I see with this is that some nodes may never come up, and this would prevent pacemaker from ever starting anywhere (or at least until some long timeout). I'm not sure why --all vs only some nodes would need to be handled differently. I would simply wait until step 1 above is complete (whether successful or not) for all nodes that are going to be started, then wait until step 2 is complete locally before starting pacemaker locally.
Yes, I didn't see this as a radical change in ordering, just the adding of a delay for corosync startups. There is no issue when corosync is already running, as it doesn't send a JOIN message anyway.
Sorry, I don't follow.

In comment 10 and comment 13 it has been agreed to start corosync first on all nodes with a delay for each node and then start pacemaker on all nodes simultaneously. The reason for this is to prevent fencing of late starting nodes. Correct?

Now you are telling me corosync and pacemaker should be started on each node in one step with a delay for each node. Correct? If so, this contradicts the previous paragraph, doesn't it? Or should corosync start be requested with a delay for each node and, once all requests have been sent, pcs should blindly request pacemaker to start everywhere without waiting for the corosync requests to finish? How can we start pacemaker somewhere if we don't know if corosync has been started there? Are you pointing out this issue cannot actually be fixed in pcs and must be fixed in corosync? This has been known from the start.

Also please stop working with "the local node". What makes you think "the local node" is more significant than the other nodes and thus should be the one which determines the status / destiny of the cluster? Pcs (the instance which runs the 'pcs cluster start --all' command and communicates with the node instances) doesn't know which node is the local one anyway.

Or maybe I misunderstood what you meant by the local node. Maybe you thought of it like this:
1. Send requests to start corosync with a delay for each node.
2. Don't wait for the requests to finish. Once they all have been sent, send requests to start pacemaker.
3. Each node which receives a request to start pacemaker will wait for corosync to start on it and then it will proceed and start pacemaker.

How is this different from the original code where each node got one request to start a cluster, which resulted in starting corosync first followed by starting pacemaker? Except for the delay (which could be added to the original code) and more complicated code, they seem to be the same to me. How is this preventing fencing of late starting nodes?
Sorry if this isn't clear, I think we're making it sound more complex than it actually is. All I think we need is exactly what we currently have (as I've seen it in pcs) but with a short delay between starting corosync on each node.
(In reply to Tomas Jelinek from comment #17)
> Or maybe I misunderstood what you meant by the local node. Maybe you thought
> of it like this:
> 1. Send requests to start corosync with a delay for each node.
> 2. Don't wait for the requests to finish. Once they all have been sent, send
> requests to start pacemaker.
> 3. Each node which receives a request to start pacemaker will wait for
> corosync to start on it and then it will proceed and start pacemaker.

Yes, this :)

> How is this different from the original code where each node got one request
> to start a cluster, which resulted in starting corosync first followed by
> starting pacemaker? Except for the delay (which could be added to the original
> code) and more complicated code, they seem to be the same to me. How is this
> preventing fencing of late starting nodes?

The delay prevents the corosync join flood.

The issue with fencing is that delaying the start on some nodes would actually make start-up fencing of those nodes more likely. With the current code, all the nodes (typically) come up within a small enough window that pacemaker doesn't schedule fencing. If we stagger them, then once pacemaker gains quorum, it will have more time to schedule fencing of the remaining nodes (though now that we're talking about just a 250ms delay, it probably isn't a big deal on small clusters, but might be more likely with 5+ nodes).

If we wait until corosync startup is initiated (not completed) everywhere, it doesn't guarantee we won't have startup fencing, but it keeps us from making the situation worse than it is now. We may actually need startup fencing, if some node doesn't come up, so we don't want to wait until corosync completes everywhere (though of course we want to wait until corosync is running on a particular node before starting pacemaker there).
@Ken: Tomas' current patch does (or at least is supposed to) deal with the corosync flood problem while keeping it safe from pacemaker's startup fencing. Are you actually proposing we should simplify the solution by fixing the join flood but introducing the possibility of startup fencing?
Note that it's my understanding that distributed systems typically solve the problem of excessive symmetry mutually preventing optimal progress (solved dead-locking being an imperative, though that doesn't prevent worst case progress) simply by increasing the chance of breaking the identical timing of the steps to be performed at particular members. Assuming the lack of any further knowledge, the easiest solution is to bet on randomness -- a randomized delay should break this symmetry to a sufficient extent. Wouldn't this be applicable here?

It would make a ton of sense for corosync to implement such a mechanism, tunable in corosync.conf, possibly with auto-scaling per the number of specified nodes and perhaps per some other dynamic circumstances (number of differently sourced JOIN requests seen in the very initial scanning phase?). Indeed, pcs could emulate such a behaviour, but it doesn't seem to be the ultimate level to solve the problem at.

IOW, we are again hitting the limits of how hard reasoning about distributed (and generally parallel) systems is, especially with the "localhost" optics.
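A minimal Python sketch of the randomized-delay idea, purely for illustration (the scaling formula, per-node step and cap are assumptions, not anything corosync or pcs actually implements):

import random
import time

def join_jitter_seconds(node_count, per_node_ms=50, cap_ms=2000):
    # Scale the maximum jitter with the number of configured nodes,
    # capped at an upper bound, then pick a uniformly random delay.
    max_ms = min(node_count * per_node_ms, cap_ms)
    return random.uniform(0, max_ms) / 1000.0

# Each node would sleep for its own random delay before initiating its
# JOIN, breaking the symmetry of identical start-up timing.
time.sleep(join_jitter_seconds(node_count=16))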
Moreover, is this UDPU-specific? If so, then multicast could be the remedy here. Consequently, any handling of the described situation should be limited to the affected scenarios if that's the case.
The problem with doing it inside corosync is that it doesn't know how many other nodes are being started at the same time - pcs does. There's no point in putting a random start delay into corosync if only one node is being started. I don't deny that there are things that could be done in corosync (as has already been mentioned here), but this is meant to be a 'get stuff working' solution for the system we have.

It's been a while since I tested this, but multicast is no better than UDPU; in fact I think it's worse. We need to check knet though.
re [comment 23]:

> The problem with doing it inside corosync is that it doesn't know
> how many other nodes are being started at the same time - pcs does.

That's a bit of a hypocritical statement regarding the use cases:

- parallel SSH (commands sent to a bunch of endpoints at once)
- comparable boot + service startup timings amongst multiple machines started at once
- ...
- Ansible triggering corosync/cluster start, again with comparable timing of corosync's startup

Actually, if you claim that all connection methods are affected (I am still uncertain, some testing is perhaps needed), it would make sense to invert the behaviours, only preserving the current one in the form of --fastjoin or a similar switch, for cases where one is sure no other nodes are coming up at the same time.
Actually, Corosync already has the ability to wait for a "random" time before sending the join message (corosync.conf - send_join), but I don't think it improves the situation much.

As Chrissie said, multicast is not better, because from the receiving node's point of view there is no difference. Multicast is better only on the sending side and the switch side (and only for "good" and correctly configured switches).

What really solves the problem is Corosync 3 with Knet, because it is able to fragment the join message depending on the real MTU. Even UDPU should be better in Corosync 3, because the join message is largely reduced by not sending 2*IPv6 addresses per member.
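For reference, send_join is set in the totem section of corosync.conf; the snippet below is only an illustrative sketch (the value and the surrounding options are example assumptions, not a recommendation):

totem {
    version: 2
    cluster_name: mycluster
    transport: udpu
    # Upper bound, in milliseconds, of the random wait before sending
    # a JOIN message; the appropriate value depends on the cluster size.
    send_join: 80
}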
re [comment 26]:

> Actually Corosync already has the ability to wait for a "random" time
> before sending the join message (corosync.conf - send_join), but I don't
> think it improves the situation much.

Thanks, missed that. Perhaps the respective wording (for needle only, per what you say) should be suggestive that:

- collective cluster start-up of more than a handful of nodes is better coordinated by a third party that would prevent massive initial communication parallelism

- when the above is not possible, the

> For configurations with less than 32 nodes, this parameter is not
> necessary

statement seems disproved by this very bug, hence should be worded more carefully
Created attachment 1472983 [details] additional fix
(In reply to Jan Pokorný from comment #28)
> re [comment 26]:
>
> > Actually Corosync already has the ability to wait for a "random" time
> > before sending the join message (corosync.conf - send_join), but I don't
> > think it improves the situation much.
>
> Thanks, missed that. Perhaps the respective wording (for needle only,
> per what you say) should be suggestive that:
>
> - collective cluster start-up of more than a handful of nodes is better
>   coordinated by a third party that would prevent massive initial
>   communication parallelism
>
> - when the above is not possible,

Or just wait for a while to collect the results of this BZ fix (= pcs change). If it helps, it's solved; if it doesn't, we can keep speculating.

> > For configurations with less than 32 nodes, this parameter is not
> > necessary
>
> statement seems disproved by this very bug, hence should be
> worded more carefully

We could add something like "usually".
re [comment 30]: Agreed, the more evidence the better prior to doc tweaking, and "usually" could be enough to disrupt blind trust among carefully listening administrators. Thanks for considering that.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3066