Bug 1771236 - Dynamic management of the individual tickets while not interfering with the remaining state (was: unable to grant a newly created ticket) [RHEL 8]
Summary: Dynamic management of the individual tickets while not interfering with the r...
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: booth
Version: 8.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.0
Assignee: Jan Friesse
QA Contact: cluster-qe
URL:
Whiteboard:
Depends On:
Blocks: 1771748
 
Reported: 2019-11-12 02:29 UTC by Reid Wahl
Modified: 2023-08-10 15:40 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1771748
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4584261 0 None None None 2019-11-13 23:38:38 UTC

Description Reid Wahl 2019-11-12 02:29:38 UTC
Description of problem:

When a ticket is newly created, it is not immediately added to the CIB, and booth cannot grant it.

In the test below, test RPMs that fix Bug 1768172 are installed. The issue is also reproducible without that fix in place. 

Test environment:
~~~
Cluster 1:
fastvm-rhel-8-0-23
fastvm-rhel-8-0-24

Cluster 2:
fastvm-rhel-8-0-33
fastvm-rhel-8-0-34

Arbitrator:
fastvm-rhel-8-0-52
~~~


Defined function to sync booth config:
~~~
booth_sync()
{
    SYNC="pcs booth sync"
    PULL="pcs booth pull"
    LHOST=fastvm-rhel-8-0-23
    $SYNC
    ssh fastvm-rhel-8-0-52 "$PULL $LHOST"
    ssh fastvm-rhel-8-0-33 "$PULL $LHOST && $SYNC"
}
~~~


Demonstration:
~~~
[root@fastvm-rhel-8-0-23 ~]# booth list
[root@fastvm-rhel-8-0-23 ~]# crm_ticket -l
[root@fastvm-rhel-8-0-23 ~]# pcs booth ticket add apacheticket
[root@fastvm-rhel-8-0-23 ~]# booth_sync
Sending booth configuration to cluster nodes...
fastvm-rhel-8-0-24: Booth config saved.
fastvm-rhel-8-0-23: Booth config saved.
Fetching booth config from node 'fastvm-rhel-8-0-23'...
Warning: Booth configuration file '/etc/booth/booth.conf' already exists
Warning: Booth key file '/etc/booth/booth.key' already exists
Booth config saved.
Fetching booth config from node 'fastvm-rhel-8-0-23'...
Warning: Booth configuration file '/etc/booth/booth.conf' already exists
Warning: Booth key file '/etc/booth/booth.key' already exists
Booth config saved.
Sending booth configuration to cluster nodes...
fastvm-rhel-8-0-34: Booth config saved.
fastvm-rhel-8-0-33: Booth config saved.

[root@fastvm-rhel-8-0-23 ~]# pcs constraint ticket add apacheticket apachegroup
[root@fastvm-rhel-8-0-23 ~]# pcs booth ticket grant apacheticket
Error: unable to grant booth ticket 'apacheticket' for site '192.168.22.71', reason: Nov 11 18:00:44 fastvm-rhel-8-0-23 booth: [26687]: error: ticket "apacheticket" does not exist

[root@fastvm-rhel-8-0-23 ~]# pcs cluster cib | grep ticket
      <rsc_ticket ticket="apacheticket" rsc="apachegroup" id="ticket-apacheticket-apachegroup"/>
[root@fastvm-rhel-8-0-23 ~]# crm_ticket -l
apacheticket	revoked          
[root@fastvm-rhel-8-0-23 ~]# pcs cluster stop --all && pcs cluster start --all
fastvm-rhel-8-0-24: Stopping Cluster (pacemaker)...
fastvm-rhel-8-0-23: Stopping Cluster (pacemaker)...
fastvm-rhel-8-0-24: Stopping Cluster (corosync)...
fastvm-rhel-8-0-23: Stopping Cluster (corosync)...
fastvm-rhel-8-0-24: Starting Cluster...
fastvm-rhel-8-0-23: Starting Cluster...

[root@fastvm-rhel-8-0-23 ~]# pcs cluster cib | grep ticket
      <rsc_ticket ticket="apacheticket" rsc="apachegroup" id="ticket-apacheticket-apachegroup"/>
    <tickets>
      <ticket_state id="apacheticket" granted="false" owner="0" expires="1573524113" term="0"/>
    </tickets>

[root@fastvm-rhel-8-0-23 ~]# pcs booth ticket grant apacheticket
~~~


Logs show the following for the successful grant:
~~~
Nov 11 18:09:51 fastvm-rhel-8-0-23 boothd-site[27403]: [info] apacheticket (Init/0/0): granting ticket
Nov 11 18:09:51 fastvm-rhel-8-0-23 boothd-site[27403]: [info] apacheticket (Init/0/0): starting new election (term=0)
Nov 11 18:09:51 fastvm-rhel-8-0-23 booth[30557]: [info] grant request sent, waiting for the result ...
Nov 11 18:09:57 fastvm-rhel-8-0-23 boothd-site[27403]: [info] apacheticket (Cndi/0/0): elections finished
Nov 11 18:09:57 fastvm-rhel-8-0-23 boothd-site[27403]: [info] apacheticket (Lead/0/599999): granted successfully here
Nov 11 18:09:57 fastvm-rhel-8-0-23 crm_ticket[30622]: notice: Invoked: crm_ticket -t apacheticket -g --force -S owner -v1950506022 -S expires -v1573525197 -S term -v0
Nov 11 18:09:57 fastvm-rhel-8-0-23 pacemaker-controld[27164]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Nov 11 18:09:57 fastvm-rhel-8-0-23 booth[30557]: [info] grant succeeded!
~~~


The cluster restart produces the following logs, which may be related to whatever triggered the write of the <ticket_state id="apacheticket"> element to the CIB.
~~~
Nov 11 18:01:53 fastvm-rhel-8-0-23 boothd-site[27403]: [info] BOOTH site 1.0 (build 1.0) daemon is starting
Nov 11 18:01:53 fastvm-rhel-8-0-23 boothd-site[27403]: [error] cannot change working directory to /var/lib/booth/cores
Nov 11 18:01:53 fastvm-rhel-8-0-23 crm_ticket[27410]: notice: Invoked: crm_ticket -g -t any-ticket-name
Nov 11 18:01:53 fastvm-rhel-8-0-23 crm_ticket[27410]: warning: Ticket modification not allowed
Nov 11 18:01:53 fastvm-rhel-8-0-23 boothd-site[27403]: [info] New "crm_ticket" found, using atomic ticket updates.
Nov 11 18:01:53 fastvm-rhel-8-0-23 crm_ticket[27428]: notice: Invoked: crm_ticket -t apacheticket -q
Nov 11 18:01:53 fastvm-rhel-8-0-23 crm_ticket[27428]: warning: Could not query ticket XML: No such device or address
Nov 11 18:01:53 fastvm-rhel-8-0-23 boothd-site[27403]: [error] apacheticket (Init/0/0): crm_ticket xml output empty
Nov 11 18:01:53 fastvm-rhel-8-0-23 boothd-site[27403]: [warning] apacheticket: no site matches; site got reconfigured?
Nov 11 18:01:53 fastvm-rhel-8-0-23 boothd-site[27403]: [error] command "crm_ticket -t 'apacheticket' -q" exit code 105
Nov 11 18:01:53 fastvm-rhel-8-0-23 boothd-site[27403]: [info] apacheticket (Init/0/0): broadcasting state query
Nov 11 18:01:53 fastvm-rhel-8-0-23 boothd-site[27403]: [info] BOOTH site daemon started, node id is 0x74425C26 (1950506022).
~~~

-----

Version-Release number of selected component (if applicable):

booth-site-1.0-5.f2d38ce.git.el8.noarch
booth-core-1.0-5.f2d38ce.git.el8.x86_64
pacemaker-2.0.1-4.el8_0.4.x86_64

-----

How reproducible:

Most or all of the time. I think the grant command has to be run fairly soon after the `pcs booth ticket add` and `pcs constraint ticket add` commands in order to observe the issue.

-----

Steps to Reproduce:

Outlined here: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_high_availability_clusters/assembly_configuring-multisite-cluster-configuring-and-managing-high-availability-clusters#proc-configuring-multisite-with-booth-configuring-multisite-cluster

1. Start with a fairly clean pair of clusters. No tickets in /etc/booth/booth.conf, no <rsc_ticket> constraints, no <tickets> element.
2. Create a group of dummy resources called apachegroup.
3. Create a ticket in the booth configuration (`pcs booth ticket add apacheticket`).
4. Sync/pull the updated booth configuration to other cluster node(s), the arbitrator, and the other cluster.
5. Optional: Add a ticket constraint on each cluster (`pcs constraint ticket add apacheticket apachegroup`).
6. Attempt to grant the ticket (`pcs booth ticket grant apacheticket`).
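
The same steps 3.-6., condensed into the commands used in the demonstration above (assuming the two clusters, the apachegroup resources, and the booth_sync helper from the description already exist):

~~~
pcs booth ticket add apacheticket
booth_sync
pcs constraint ticket add apacheticket apachegroup
pcs booth ticket grant apacheticket   # fails: ticket "apacheticket" does not exist
~~~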

-----

Actual results:

[root@fastvm-rhel-8-0-23 ~]# pcs booth ticket grant apacheticket
Error: unable to grant booth ticket 'apacheticket' for site '192.168.22.71', reason: Nov 11 18:00:44 fastvm-rhel-8-0-23 booth: [26687]: error: ticket "apacheticket" does not exist

-----

Expected results:

Successful ticket grant

-----

Additional info:

My booth-core and booth-site test RPMs with the fix for Bug 1768172 (test_atomicity failure) do not resolve this issue.

Comment 1 Reid Wahl 2019-11-12 21:43:28 UTC
I think the reason this worked when the documentation was being written is that, in the documentation, the booth-site resource is created AFTER the ticket is added. So the booth service starts and broadcasts state, or whatever it needs to do, after the config has been updated to include the ticket.

Comment 2 Reid Wahl 2019-11-14 01:16:03 UTC
Disabling and re-enabling the booth-site resource acts as a workaround but may disturb existing resources that are managed by ticket constraints, if any exist.
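
The workaround spelled out as commands (a sketch only, assuming pcs's default booth-booth-service resource name; adjust to the actual booth-site resource name if it differs, and note again that restarting it may disturb resources bound to existing ticket constraints):

~~~
pcs resource disable booth-booth-service
pcs resource enable booth-booth-service
~~~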

Comment 3 Reid Wahl 2019-11-14 01:35:16 UTC
There doesn't seem to be a straightforward booth command to refresh the configuration based on the config file and/or push tickets to the CIB. All `pcs booth ticket add` seems to do is update booth.conf, which doesn't get read until the booth daemon is restarted.
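
A quick shell check of that state of affairs (a sketch, using only commands and paths that appear elsewhere in this bug):

~~~
# after `pcs booth ticket add apacheticket`:
grep apacheticket /etc/booth/booth.conf   # the ticket is in the static config
booth list | grep apacheticket            # but the running daemon does not list it
crm_ticket -l                             # and it is not in the CIB either
~~~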

Comment 4 Jan Pokorný [poki] 2019-12-03 15:33:47 UTC
Thanks for the detailed report and the first estimation of where the
problem lies, which was spot-on.

Likewise, the first estimation of the solution appears as simple as
"add the ability for booth to respond to a new (abstract) signal indicating
a request to reload the configuration (when syntactically flawless)".

Alas, we are entering the dangerous territory of distributed systems,
and hence it is apparent that it cannot be as straightforward as that.

Many questions emerge, this first pressing one:

> How to deal with the uncertainty that other booth sites/arbitrators
> do have the same knowledge at that very moment?

and second

> How to transition into the "consensus" that all such players will
> atomically transition from dealing with old configuration to
> dealing with new configuration (once the previous question is
> answered)?

and third

> How to deal with the actual old-new configuration delta, meaning
> how to formally specify responses to:
> - brand new ticket gets created (easy?)
> - existing ticket gets removed altogether (not so much?)
> - existing ticket gets its parameters changed
>   (possibly a nightmare-verging complexity given that timeouts
>   already ticking may just get adjusted)
> ?

Hopefully it is now clear that, for the time being, the easiest way may be
to implement a feature akin to "status" that will -- rather than query
the static file-backed configuration at that time -- request the live
configuration as it is being operated on right at that moment if the daemon
is currently running (with a file-based fallback otherwise?).
This is really a design limitation of booth: it is optimized for
static configuration use cases, something that was missed when the
integration with pcs was added.

This piece of information might then be used to implement another
feature (booth-builtin, such as a "config-diff" subcommand, or something
internal to pcs) that would make pcs warn the user along the lines of "you
want to make a live change, which is currently not supported".
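
A rough idea of what such a "config-diff" check could amount to today,
sketched in shell (not an existing booth or pcs feature; the booth.conf
and `booth list` output formats are assumed from the examples in this bug):

~~~
# hypothetical config-diff sketch: compare tickets in the static config
# with those the running daemon currently knows about
diff \
  <(sed -n 's/^\s*ticket\s*=\s*"\?\([^",]*\)"\?.*/\1/p' /etc/booth/booth.conf | sort) \
  <(booth list | sed -n 's/^ticket:\s*\([^,]*\),.*/\1/p' | sort)
~~~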

Will give it more thought, but feedback is welcome.

Comment 5 Reid Wahl 2019-12-03 17:55:21 UTC
(In reply to Jan Pokorný [poki] from comment #4)
> Many questions emerge, this first pressing one:
> 
> > How to deal with the uncertainty that other booth sites/arbitrators
> > do have the same knowledge at that very moment?

Definitely a valid concern.

I suspect we're already prone to this uncertainty at least in terms of configurations, though this would require a manual mistake to be made. If the node running boothd on Cluster 1 gets rebooted/restarted while it has a different configuration compared to Cluster 2, that would cause boothd on Cluster 1 to re-read the config, and in turn I'd expect a conflicting state.

You may be referring more to ongoing state communication; even if so, I wonder to what extent the challenges in ensuring the same knowledge would grow with this change compared to whatever challenges are already present in maintaining identical state.

> and second
> 
> > How to transition into the "consensus" that all such players will
> > atomically transition from dealing with old configuration to
> > dealing with new configuration (once the previous question is
> > answered)?
> 

Hmm :)

> and third
> 
> > How to deal with the actual old-new configuration delta, meaning
> > how to formally specify responses to:
> > - brand new ticket gets created (easy?)
> > - existing ticket gets removed altogether (not so much?)
> > - existing ticket gets its parameters changed
> >   (possibly a nightmare-verging complexity given that timeouts
> >   already ticking may just get adjusted)
> > ?
> 

Would it make sense to create a new booth_config object ("booth_conf_new") and then point booth_conf to the new object if the config read succeeds? Or something similar with just a booth_conf->ticket{,_allocated,_count}_new. Trying to think of an approach that would prune removed tickets and ensure that the change gets applied in one step. The best thing I've come up with is to send a signal to the existing boothd process to have it load the full config or the ticket config into a new object, and then replace the existing one and perhaps broadcast the new one somehow.

> Hopefully it is now clear that, for the time being, the easiest way may be
> to implement a feature akin to "status" that will -- rather than query
> the static file-backed configuration at that time -- request the live
> configuration as it is being operated on right at that moment if the daemon
> is currently running (with a file-based fallback otherwise?).
> This is really a design limitation of booth: it is optimized for
> static configuration use cases, something that was missed when the
> integration with pcs was added.
> 
> This piece of information might then be used to implement another
> feature (booth-builtin, such as a "config-diff" subcommand, or something
> internal to pcs) that would make pcs warn the user along the lines of "you
> want to make a live change, which is currently not supported".
> 
> Will give it more thought, but feedback is welcome.

Comment 6 Jan Pokorný [poki] 2019-12-17 23:25:23 UTC
The following is to explain a possible workaround with some risks
involved.


0. let's have 3 running clusters N{1,...,3} of 2 nodes each (denoted
   Nx_{a,b}); also, suppose there are some pre-existing tickets T{1,...,x}
   that are actively utilized in the ticket constraints in all these
   clusters (otherwise, one wouldn't expose anything at risk regarding
   (multi)cluster resource management, apparently);
   booth is fully set up in a standard way, and the peer network is
   established;
   the intention is to create a new ticket in this established booth
   formation and have it actively maintained with these peers
   (i.e. make the ticket fully usable by the time this procedure
   finishes)


1. edit the static booth configuration at one node, say N1_a,
   then synchronize this static configuration amongst the nodes
   
   with pcs, there's currently a slightly cumbersome procedure akin to:

   * at N1_a:

     - pcs booth ticket add Tnew [options]

   * from one of N2_{a,b} and one of N3_{a,b}, do individually:

     - pcs booth pull N1_a

     - pcs booth sync [--skip-offline]


   * possible risks within a single cluster:

     - starting with a to-be-gauge node that is effectively at an older
       configuration version than some other nodes (within this cluster
       or at another site), hence making these other places eventually
       revert to that older version (for a lack of built-in versioning,
       which just requires us to assume that we can only consider
       adding new tickets and address-based isomorphism as valid
       differences not threatening the synchronization?)

       * testing reference: QA contact can detail whether this gets
         tested at all with said pull/sync commands, or add tests
         to that effect and refer to them from this bug if not
   
     - the node hosting a current cluster representative -- or just
       the booth-site resource instance there -- fails for an
       arbitrary reason once the new configuration has already
       been disseminated (at least to the node that is to come
       up in place of the failed one)

       . generic: booth peers will suddenly be pushed out of the assumed
                  agreement on which tickets to deal with, since the old
                  (just failed) incarnation of the cluster representative
                  had another configuration than is currently being
                  used by the new incarnation of the cluster representative
                  within the booth formation (either this or another node)

         ^ this shall be fine if just a new ticket addition is concerned
           (we will hence assume solely this type of change in the
           ticket maintenance)

       . variant: either it hasn't acted as a leader for any tickets
                  at all _OR_ the cluster representative of booth
                  manages to get restarted before the ticket(s) it is
                  a leader of get(s) expired

         ^ this shall be fine on its own

       . variant: it has acted as a leader for some ticket(s) _AND_
                  some of them will get expired by the time
                  the new cluster representative takes over these
                  tickets silently

          ^ time is ticking, the replacement peer representative
            shall rather come up fast enough so as not to hit
            the window in which a ticket "under leadership" expires,
            otherwise: a possibly undesired move of resources may occur
            in response, cluster nodes that hosted dependent resources
            may even be fenced when configured like that (non-default),
            "pcs resource cleanup" manual intervention may be needed
            otherwise so as to allow dependent resources to ever be
            allowed again on that node, plus uninvestigated "booth
            quorum loss" scenario circumstances per "possible risks
            escalated to the whole booth formation"

   * possible risks escalated to the whole formation:

       - "booth quorum loss" due to accumulated "single cluster failures"
         (see above part)

          . unexplored variants: what if some new re-elections of the
            leaders occur within the "restart" window, can it destabilize
            anything (1 cluster representative out of, say, 3 is dead, since it
            is restarting right at that moment, another ticket this one was not
            a leader of is to be prolonged, can the coordinated confirmation
            detect that this cluster representative became unresponsive and hence
            assume it dead for each and every ticket, and when it actually
            contributes to the overall finding "2/3 sites dead", it can
            likely dismantle the whole booth formation altogether?)


2. pick a random cluster and select the node within it that is the current
   booth site representative

   * from this point on, we will work locally on this node, unless
     stated otherwise

   * "optimizing the problem": we cannot expect any "helpful" asymmetry
     for the absolutely generic case ... indeed, it would be wise to
     start with clusters not currently in possession of any tickets,
     and only have one "sensitive section", that is, updating the configuration
     at the remaining one(s), but we can have the leadership of the tickets
     assigned uniformly, and cross-site resource motion may be undesired,
     so we cannot rely on that


3. figure out all the tickets the representative from step 2. is currently
   a leader of (rough approximation, use with some deliberation):

   * pcs booth status \
       | grep -E "$(ip -o a show temporary | tr -s ' ' | cut -d' ' -f4 \
                      | sed 's|\(.*\)/.*|-eticket:.*\\s\1|')" 
     (or pcs status --full as an initial source of data, since it will
     trigger `crm_ticket --details` down the line)

   * note: this command currently needs to be run right on the true booth
           per-cluster representative, not at an arbitrary node, because of
           the imposed (and, usability-wise, questionable) locality:
           https://github.com/ClusterLabs/pcs/issues/230

     let's suppose we discovered tickets Told1 and Told2

   * optional: we can easily list all the resources depending on those
     very tickets, out of curiosity:

     crm_ticket -c -t ms-ip-ticket

   * important: the initial ticket listing in this step gives us an
     important pointer regarding _when_ we can affort to intervene, it's
     rather a vital piece of information, e.g.:

>    ticket: ms-ip-ticket, leader: 10.37.164.52, expires: 2019-12-17 22:20:18

     - given it is 22:17:02 currently, we have about 3 minutes window
       in which we need to have our cluster-local booth restarted, incl.
       all the surrounding steps starting with 4.
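
A small helper sketch for computing that window from the shell (assuming
GNU date and the `booth list` output format quoted just above; this is
not an existing booth/pcs feature):

~~~
# sketch: seconds remaining until the nearest "expires:" from `booth list`
booth list \
  | sed -n 's/.*expires: \([0-9-]\+ [0-9:]\+\).*/\1/p' \
  | while read -r exp; do
      echo $(( $(date -d "$exp" +%s) - $(date +%s) ))
    done \
  | sort -n | head -n1
~~~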


4. assuming we have enough time till the nearest ticket expiration per 3.,
   we now need to unmanage the resource standing for a booth
   representative of the cluster, assuming pcs's default convention
   of booth-booth-service:

   * pcs resource unmanage booth-booth-service


5. subsequently, we need to restart this booth resource 

   * pcs resource debug-stop booth-booth-service \
       && pcs resource debug-start booth-booth-service

   * this sequence, as well as the immediately surrounding (un)manage
     wrapping, can (and shall) be optimized so as to keep the sensitive
     window minimal (a condensed sketch is shown after step 6.):

     https://github.com/ClusterLabs/pcs/issues/233


6. finally, we can manage that resource again

   * pcs resource manage booth-booth-service
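
For convenience, the same steps 4.-6. condensed into one shell invocation
(just a sketch, assuming pcs's default booth-booth-service resource name
used above):

~~~
# condensed steps 4.-6.; keeps the sensitive window as short as the
# individual pcs invocations allow
pcs resource unmanage booth-booth-service \
  && pcs resource debug-stop booth-booth-service \
  && pcs resource debug-start booth-booth-service \
  && pcs resource manage booth-booth-service
~~~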


7. if all went right, bravo; you may have more clusters/sites yet to
   deal with, though, so pick one and continue from point 3.

8. check that the ticket sits well in the booth formation;
   now you can start binding some resources to its presence

* * *

Wanted to discuss some more hazards, but got exhausted enough at this
point -- the main risk here is incorrect timing regarding any of
the tickets pertaining to the cluster-local leadership at step 3.
(prior to starting 4.).

I will at least mention the justification for why we don't just let the
cluster representative of booth fail over to another node.
The answer is simple: any movement within the cluster increases
the chance of something going overboard, especially if more resources
are involved.  In the basic booth-site setup, there are two such
resources.  The respective IPaddr2 resource may work perfectly on the
currently associated node (we assume progress from a healthy, tested
run-time arrangement), but may be broken elsewhere, for instance, and this
configuration procedure shall rather not be combined with an unexpected
fire drill.  Consider also a non-uniformly configured firewall, or for some
reason sticky stale ARP entries.  Or, who knows, perhaps the current node
is the only one eligible to run either of these resources, since the others
are effectively banned due to prior failures there (until cleanup is
manually performed).

Keeping as much of the run-time environment the same is hence preferable,
I think, and that's what the sketched minimal-movement procedure is meant to
deliver.  Still, if it is the current node we worked at that is to
fail, there's a chance that the normal processes in the cluster will
restore everything at another node just right (and, in the best case, in
time prior to ticket expiration), so there's still some hoped-for
room that way.

Definitely, I want to study and experiment with these question marks:

- what exactly constitutes the point of communication when nothing
  interesting happens (assumption: some portion of time prior to ticket
  experiation, there's some liveness check around election of the new
  leader)  -- presumably the tolerated window of utter silence is not
  till the very end of the earliest ticket expiration(!)

- to what extent are the tickets independent, i.e. whether problems
  detected around one ticket are escalated to apply to all other
  tickets equally (presumably yes)

- whether it is equally possible to not run cluster-by-cluster,
  but instead to interleave steps 4., 5. and 6. across all the
  clusters (at one point, all cluster representatives will be
  unmanaged, then restarted, then remanaged);
  the gut feeling is that it would be more risky, since it simply
  wouldn't be a single wannabe joining an otherwise quorate
  booth partition, but rather a chaos of constituting the
  full partition from the ground up by each of these wannabes
  at that very moment, hence effectively dropping the ticket
  assignment altogether as a side-effect

Will report when I know more -- sorry, no formal documentation
is available, such is the sad state of affairs in our dept.
(nor any knowledge-shaping based testing at our QA sibling dept.)

* * *

Summary of directly referenced pcs "issues":
https://github.com/ClusterLabs/pcs/issues/230
https://github.com/ClusterLabs/pcs/issues/233

Summary of indirectly referenced pacemaker bug/RFE reports:
https://bugs.clusterlabs.org/show_bug.cgi?id=5413

Comment 7 Jan Pokorný [poki] 2019-12-17 23:28:15 UTC
For point 1., I've forgotten to add another reference:

https://github.com/ClusterLabs/pcs/issues/226

Updating the "reference appendix":


Summary of directly referenced pcs "issues":
https://github.com/ClusterLabs/pcs/issues/226
https://github.com/ClusterLabs/pcs/issues/230
https://github.com/ClusterLabs/pcs/issues/233

Summary of indirectly referenced pacemaker bug/RFE reports:
https://bugs.clusterlabs.org/show_bug.cgi?id=5413

Comment 9 Jan Pokorný [poki] 2020-01-09 21:27:19 UTC
Note that steps 4-6 (inclusive) from [comment 6] would preferably
(vitally so, if more reliability is asked for / "non-intervention is holy"
is adhered to) be carried out in the interval between:

- the start of the new expiration window plus several (five? or more)
  seconds (so that the leader had a chance to broadcast a "heartbeat"
  and collect all the respective "ack" replies)

- half of the respective "expire" period (which is 10 minutes
  by default, therefore resolving to 5 minutes after the new
  expiration window has started); see the sketch below
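
A rough shell sketch of this timing check (assuming GNU date, the
default 10-minute "expire", and the nearest "expires:" timestamp taken
from `booth list` as in [comment 6]):

~~~
# sketch: are we inside the suggested intervention interval,
# i.e. (window start + ~5 s, window start + expire/2)?
expire=600                                # default "expire", in seconds
exp=$(date -d "2019-12-17 22:20:18" +%s)  # nearest expiration per `booth list`
now=$(date +%s)
elapsed=$(( now - (exp - expire) ))       # seconds since the window started
if [ "$elapsed" -ge 5 ] && [ "$elapsed" -le $(( expire / 2 )) ]; then
    echo "OK to proceed with steps 4-6"
else
    echo "wait for the next expiration window"
fi
~~~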

Apparently, it is easily doable with a single ticket, but problematic
with more of them (with presumably mutually random time offsets;
you may use one ticket as a gauge for the right timing, but accidentally
get into trouble with another).

Assuming non-intervention (the desire to observe as few movements as
possible, inter- and intra-cluster wise) is holy, that'd still be too risky
once more tickets are already being handled.

* * *

So, after more deliberation, I've come to the conclusion that we can
have some extent of dynamism implemented in booth, and it shall
work quite well despite my original concerns, which I am going to
rectify below:

> (In reply to Jan Pokorný [poki] from comment #4)
>> Many questions emerge, this first pressing one:
>> 
>>> How to deal with the uncertainty that other booth sites/arbitrators
>>> do have the same knowledge at that very moment?
>
> Definitely a valid concern.
>
> I suspect we're already prone to this uncertainty at least in terms of
> configurations, though this would require a manual mistake to be made.
> If the node running boothd on Cluster 1 gets rebooted/restarted while it
> has a different configuration compared to Cluster 2, that would cause
> boothd on Cluster 1 to re-read the config, and in turn I'd expect
> a conflicting state.

True.  What I hadn't realized is that it's the leader role that
dictates the rules ("expire" timeout, etc.) for a given ticket
exclusively, so the system will sort of work regardless of differing
per-ticket configurations as long as there is at least a match on the
ticket names and the addressability of the cluster representatives
(from which point of view the booth formation works well)
-- we'd make this very information immutable, with only the tickets
open to dynamic reconfiguration under some conditions.

> You may be referring more to ongoing state communication; even if so, I
> wonder to what extent the challenges in ensuring the same knowledge
> would grow with this change compared to whatever challenges are already
> present in maintaining identical state.

That challenge applies primarily to the "non-intervention is holy" case (see
above).  Note that with the pre-existing tickets ticking at their
own pace (reload instead of restart), we can trivially handle these
cases of new or unused tickets:

- ticket addition
  - it is being introduced anew, so in the revoked state, so it has no
    immediate impact on resources in the existing configuration
    (if they depend on this new ticket, it's natural they will
    be stopped)

- ticket removal
  - as long as the ticket is not granted (or not waiting for the
    remaining acks for that to happen, perhaps), it is safe to
    remove such a ticket right away (non-existing ~ revoked),
    otherwise monitoring can be installed to "garbage collect"
    the ticket once it has been revoked for a whole expire
    timeout (so "eventually")

- ticket modification
  - a pending modification can be installed that will trigger when
    the currently running per-cluster representative becomes
    (or repeats being) the leader (using a similar scheduled-at-event
    mechanism as with the garbage collection mentioned above)

Note that the effect of such "pending to eventually trigger" would
need to be elaborated more regarding multiple config updates;
that is perhaps the least visible part now.

It would be the responsibility of some outer tool to have distributed
the configuration changes statically across all nodes of all the clusters
involved, but as already indicated, a similar responsibility exists
initially, so this is nothing new.

There would be "booth grant [-s SITE] [-c CONFIG] @reload-solo"
command that would trigger the reload at just one site.  Tnat
is enough for pcs to arrange for the whole procedure, i.e.,
disseminate the config files and trigger "reload-solo" at
each of the active cluster representative and all the arbitrators.
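
Purely as an illustration of that proposed (not yet existing) flow,
reusing the syntax suggested above, the booth_sync helper from the
description and the site address from the test environment:

~~~
# 1. disseminate the updated static configuration across all clusters,
#    e.g. with the booth_sync helper defined in the description
booth_sync
# 2. hypothetical invocation ("@reload-solo" does not exist in booth yet):
#    trigger the proposed reload at one site
booth grant -s 192.168.22.71 -c /etc/booth/booth.conf @reload-solo
~~~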

Note, there is no need for an additional type of message nor new
commands, which is really easy on compatibility (older booth
instances would simply refuse such a grant for a bad ticket
specification).  If desired, it could moreover be wrapped in
"booth reload-solo", but that's not a must.  Ease of compatibility
also means that there would not be random SIGHUP signals sent
that would otherwise make older versions terminate :-)

>> How to transition into the "consensus" that all such players will
>> atomically transition from dealing with old configuration to
>> dealing with new configuration (once the previous question is
>> answered)?

As mentioned, the overall approach can be started rather relaxed
initially; the primary goal is not to affect concurrently ticking
tickets (alongside what is being modified), but there could actually
be a nice and elaborate extension in the future that would be
triggered with overloaded ticket names, for instance:

- booth daemons would implicitly broadcast an artificial
  "@reload-hash-<HASH-OF-THE-CONFIGURATION-FILE>" ticket,
  where HASH-OF-THE-CONFIGURATION-FILE would be updated with
  each expire time of such a magic ticket, so that the instruction
  to grant this ticket (or likewise, wrapped into a "booth reload"
  command) would in this case be propagated further so that
  the ticket would actually become granted as long as the
  <HASH-OF-THE-CONFIGURATION-FILE> is the same everywhere
  (amongst running cluster representatives and arbitrators alike,
  which is weaker than "everywhere" but still good enough)

- booth instances, upon observing this ticket (they themselves
  broadcasted) getting granted, would proceed with the same
  procedure that "@reload" would trigger, and when they were ready,
  they would broadcast an artificial
  "c@reloaded-hash-<HASH-OF-THE-CONFIGURATION-FILE>-reloaded"
  ticket for which "booth reload" would attempt to conclude
  "granted" under some timeout

- when this "granted" spotted, it would revoke both such
  artificial tickets, done

But that can come later; just trying to think of the overall
design, to have some consistency here.

This way, pcs would "just" disseminate the configuration files,
and trigger "reload" exactly once.

> and third
> 
>> How to deal with the actual old-new configuration delta, meaning
>> how to formally specify responses to:
>> - brand new ticket gets created (easy?)
>> - existing ticket gets removed altogether (not so much?)
>> - existing ticket gets its parameters changed
>>   (possibly a nightmare-verging complexity given that timeouts
>>   already ticking may just get adjusted)
>> ?

> Would it make sense to create a new booth_config object
> ("booth_conf_new") [...]

I think I was overthinking it; the relaxed approach is a significant
improvement over the "get the timing right" attempt to restart the
daemon at the best moment possible, which is oh so prone to
missing the vital events.

* * *

I am willing to start prototyping this, beginning with the relaxed
"reload-solo" per the above, but it will need to wait for other priorities,
I am afraid.

