Description of problem:

Assume that a ticket has a before-acquire-handler that checks some external condition to determine whether a site should be allowed to acquire a ticket. The handler succeeds at ticket grant time, so the ticket is granted (say, to site 1). Then later, when it's time to renew the ticket (after renewal-freq), the handler runs again and fails. The ticket is released, and site 2 can try to acquire it.

When site 2 tries to acquire the released ticket, it does NOT run the before-acquire-handler. Thus, it may acquire a ticket when it is not safe to do so, because the check does not execute. Only after the renewal-freq duration passes at site 2 will the before-acquire-handler run.

In the demonstration below, I (like the customer who reported this) use a geostore attribute to mark whether a site is allowed to acquire a ticket.

DEMONSTRATION:

Booth environment:
  Site 1 (192.168.22.71): fastvm-rhel-8-0-23, fastvm-rhel-8-0-24
  Site 2 (192.168.22.81): fastvm-rhel-8-0-33, fastvm-rhel-8-0-34
  Arbitrator (192.168.22.52): fastvm-rhel-8-0-52

# # ticket configuration
[root@fastvm-rhel-8-0-23 ~]# cat /etc/booth/booth.conf
~~~
...
ticket = "apacheticket"
    expire = 120
    renewal-freq = 60
    retries = 4
    timeout = 10
    before-acquire-handler = /tmp/before-acquire-handler.sh
~~~

# # before-acquire-handler at site 1
# # site 2 is the same but uses booth_ip=192.168.22.81
# # Exit 0 if SAFE_TO_ACTIVATE geostore attribute is set to 1, else exit 1
[root@fastvm-rhel-8-0-23 ~]# cat /tmp/before-acquire-handler.sh
~~~
#!/bin/bash
begin_date=$(date)
h_name=$(hostname -s)
ticket_name=apacheticket
booth_ip=192.168.22.71
SAFE_TO_ACTIVATE=$(geostore get -t "$ticket_name" -s "$booth_ip" SAFE_TO_ACTIVATE)
echo "$begin_date - $h_name: Floating Cluster IP=$booth_ip, Ticket Name=$ticket_name, ACTIVATE=$SAFE_TO_ACTIVATE" >> /tmp/before-acquire-handler.log
[[ $SAFE_TO_ACTIVATE -eq 1 ]] && exit 0 || exit 1
~~~

# # Initially, SAFE_TO_ACTIVATE=0 at both sites
[root@fastvm-rhel-8-0-23 ~]# geostore get -t apacheticket -s 192.168.22.71 SAFE_TO_ACTIVATE
0
[root@fastvm-rhel-8-0-23 ~]# geostore get -t apacheticket -s 192.168.22.81 SAFE_TO_ACTIVATE
0

# # Grant fails as it should because SAFE_TO_ACTIVATE != 1
[root@fastvm-rhel-8-0-23 ~]# pcs booth ticket grant apacheticket 192.168.22.71
Error: unable to grant booth ticket 'apacheticket' for site '192.168.22.71', reason:
Jan 10 20:05:57 fastvm-rhel-8-0-23 booth: [4856]: info: grant request sent, waiting for the result ...
Jan 10 20:05:57 fastvm-rhel-8-0-23 booth: [4856]: error: before-acquire-handler for ticket "apacheticket" failed, grant denied

# # Set SAFE_TO_ACTIVATE=1 at site 1
[root@fastvm-rhel-8-0-23 ~]# geostore set -t apacheticket -s 192.168.22.71 SAFE_TO_ACTIVATE 1
Jan 10 20:06:49 fastvm-rhel-8-0-23 geostore: [5278]: info: set succeeded!

# # Grant succeeds as it should
[root@fastvm-rhel-8-0-23 ~]# pcs booth ticket grant apacheticket 192.168.22.71
[root@fastvm-rhel-8-0-23 ~]# booth list
ticket: apacheticket, leader: 192.168.22.71, expires: 2020-01-10 20:09:38

# # Set SAFE_TO_ACTIVATE=0 so that the before-acquire-handler fails upon ticket renewal
[root@fastvm-rhel-8-0-23 ~]# geostore set -t apacheticket -s 192.168.22.71 SAFE_TO_ACTIVATE 0
Jan 10 20:07:56 fastvm-rhel-8-0-23 geostore: [5810]: info: set succeeded!

# # Watch it fail at renewal
[root@fastvm-rhel-8-0-23 ~]# tail -f /var/log/messages
...
Jan 10 20:08:38 fastvm-rhel-8-0-23 boothd-site[31555]: [warning] apacheticket (Lead/19/59806): handler "/tmp/before-acquire-handler.sh" failed: exit code 1
Jan 10 20:08:38 fastvm-rhel-8-0-23 boothd-site[31555]: [warning] apacheticket (Lead/19/59806): we are not allowed to acquire ticket
Jan 10 20:08:38 fastvm-rhel-8-0-23 crm_ticket[6181]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
Jan 10 20:08:38 fastvm-rhel-8-0-23 crm_ticket[6181]: notice: Invoked: crm_ticket -t apacheticket -S owner -v 0
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-controld[11242]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 10 20:08:38 fastvm-rhel-8-0-23 crm_ticket[6200]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
Jan 10 20:08:38 fastvm-rhel-8-0-23 crm_ticket[6200]: notice: Invoked: crm_ticket -t apacheticket -S expires -v 1578715718
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-controld[11242]: notice: Transition 303 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-351.bz2): Complete
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-controld[11242]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-controld[11242]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-schedulerd[11241]: notice: Calculated transition 303, saving inputs in /var/lib/pacemaker/pengine/pe-input-351.bz2
Jan 10 20:08:38 fastvm-rhel-8-0-23 crm_ticket[6210]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
Jan 10 20:08:38 fastvm-rhel-8-0-23 crm_ticket[6210]: notice: Invoked: crm_ticket -t apacheticket -S term -v 19
Jan 10 20:08:38 fastvm-rhel-8-0-23 boothd-site[31555]: [info] setting crm_ticket attributes successful
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-schedulerd[11241]: notice: Calculated transition 304, saving inputs in /var/lib/pacemaker/pengine/pe-input-352.bz2
Jan 10 20:08:38 fastvm-rhel-8-0-23 crm_ticket[6217]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
Jan 10 20:08:38 fastvm-rhel-8-0-23 crm_ticket[6217]: notice: Invoked: crm_ticket -t apacheticket -r --force
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-controld[11242]: notice: Transition 304 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-352.bz2): Complete
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-controld[11242]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-controld[11242]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 10 20:08:38 fastvm-rhel-8-0-23 pacemaker-schedulerd[11241]: notice:  * Stop  dummy1  ( fastvm-rhel-8-0-23 )  due to node availability

# # Site 2 acquires the ticket *without* running before-acquire-handler
Jan 10 20:08:37 fastvm-rhel-8-0-33 boothd-site[20210]: [info] apacheticket (Fllw/19/58738): 192.168.22.71 wants to give the ticket away (ticket release)
Jan 10 20:08:37 fastvm-rhel-8-0-33 boothd-site[20210]: [info] setting crm_ticket attributes successful
Jan 10 20:08:38 fastvm-rhel-8-0-33 boothd-site[20210]: [info] apacheticket (Fllw/20/0): starting new election (term=20)
Jan 10 20:08:38 fastvm-rhel-8-0-33 boothd-site[20210]: [info] apacheticket (Lead/20/119999): granted successfully here
Jan 10 20:08:38 fastvm-rhel-8-0-33 boothd-site[20210]: [info] setting crm_ticket attributes successful
Jan 10 20:08:38 fastvm-rhel-8-0-33 pacemaker-controld[1365]: notice: Result of start operation for dummy1 on fastvm-rhel-8-0-33: 0 (ok)

# # Site 2 fails before-acquire-handler upon ticket renewal
Jan 10 20:09:38 fastvm-rhel-8-0-33 boothd-site[20210]: [warning] apacheticket (Lead/20/59808): handler "/tmp/before-acquire-handler.sh" failed: exit code 1
Jan 10 20:09:38 fastvm-rhel-8-0-33 boothd-site[20210]: [warning] apacheticket (Lead/20/59807): we are not allowed to acquire ticket
Jan 10 20:09:38 fastvm-rhel-8-0-33 boothd-site[20210]: [info] setting crm_ticket attributes successful
Jan 10 20:09:38 fastvm-rhel-8-0-33 pacemaker-controld[1365]: notice: Result of stop operation for dummy1 on fastvm-rhel-8-0-33: 0 (ok)
Jan 10 20:09:39 fastvm-rhel-8-0-33 boothd-site[20210]: [info] setting crm_ticket attributes successful

-----
Version-Release number of selected component (if applicable):
booth-core-1.0-5.f2d38ce.git.el8.x86_64
booth-site-1.0-5.f2d38ce.git.el8.noarch
pacemaker-2.0.2-3.el8_1.2.x86_64

-----
How reproducible:
Always

-----
Steps to Reproduce:
1. Configure a booth ticket with a before-acquire-handler. Ensure that the handler is set up to succeed at site 1 and to fail at site 2.
2. Grant the ticket to site 1.
3. Cause the before-acquire-handler to fail at ticket renewal time at site 1.
4. Observe site 1 release the ticket and site 2 try to acquire it.

-----
Actual results:
Site 2 does not run the before-acquire-handler and successfully acquires the released ticket.

-----
Expected results:
Site 2 runs the before-acquire-handler and is not allowed to acquire the ticket.

-----
Additional info:
I noticed this closed issue is from the customer who also reported it via a Red Hat support case: https://github.com/ClusterLabs/booth/issues/79.
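The core of the report — the handler is consulted on renewal but never on the failover grant itself — can be condensed into a toy shell sketch. The function names are hypothetical stand-ins, not booth source code; they only model the order of events observed in the logs above:

```shell
#!/bin/bash
# Toy model (NOT booth code) of the reported behavior: the
# before-acquire-handler is evaluated only on renewal, not when a
# released ticket is re-acquired on failover.

SAFE_TO_ACTIVATE=0                 # site 2's geostore flag: not safe

before_acquire_handler() {         # stand-in for the real handler script
    [[ $SAFE_TO_ACTIVATE -eq 1 ]]
}

acquire_on_failover() {            # note: no handler check on this path
    echo "granted successfully here"
}

renew() {                          # handler first runs renewal-freq later
    if before_acquire_handler; then
        echo "renewed"
    else
        echo "handler failed: exit code 1, releasing ticket"
    fi
}

acquire_on_failover                # site 2 gets the ticket unchecked
renew                              # only now does the check fail
```

Running it prints a grant followed by a handler failure, mirroring the 60-second window in the logs during which site 2 holds a ticket it should never have acquired.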
Thanks for the report, this behaviour is counter-intuitive, indeed. Let's separate two mutually-amplifying phenomena here:

1. back then, the handlers (executable predicates) were executed synchronously, which appears more intuitive than what we have now
   - https://github.com/ClusterLabs/booth/commit/7c30fb4c8580f9784d75bc4dc27223c63491d3df
   - https://github.com/ClusterLabs/booth/commit/5575f95d032116c93d13eb84d71f80604ad6d90d

2. the scenario at hand (the leader refrains from its role for a particular ticket, here because of a violation of the "shall it be held like that, to begin with" predicate)

* * *

ad 1.:

It is likely well-meant and based on the assumption that the detection of a precondition failure will never be timely (it is subject to polling in renewal-freq periods), so the system must exercise some tolerance to non-swiftness already; hence blocking (possibly forever) till the verdict is known is not all that justifiable when it can be acted upon later on, while the state keeps on unfolding without blocking.

* * *

ad 2.:

I tried to crank up an easy patch that might possibly impose immediate on-failover rechecking via configured executable predicate [1].

Scratch build that will disappear in a while:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=25742150

Would be grateful if you can give it a test drive, Ried.

While there still may (and likely will) be an intermittent flip for
the ticket location, it shall settle to the actually expected state
(ticket without leader, i.e., not granted anywhere) in the order
of lower seconds as long as the script finishes "immediately"
(and it shall in this case).
* * *

[1] this patch:

commit 1183161e359dc5aed4dee2b7ee16cdae4d532d38
Author: Jan Pokorný <jpokorny>
Date:   Mon Jan 13 20:59:04 2020 +0100

    problem minimizing: ticket failover re-evaluated "ASAP"

    Related: https://bugzilla.redhat.com/show_bug.cgi?id=1790009

    Signed-off-by: Jan Pokorný <jpokorny>

diff --git a/src/raft.c b/src/raft.c
index 462fc3b..7ae043f 100644
--- a/src/raft.c
+++ b/src/raft.c
@@ -146,8 +146,14 @@ static void won_elections(struct ticket_config *tk)
 	save_committed_tkt(tk);

+#if 1
+	/* actually, only schedule election if before-acquire-handler fails,
+	   the point of this is to cause wakeup and re-evaluation */
+	schedule_election(tk, OR_LOCAL_FAIL);
+#else
 	ticket_broadcast(tk, OP_HEARTBEAT, OP_ACK, RLT_SUCCESS, 0);
 	tk->ticket_updated = 0;
+#endif
 }
> it shall settle to the actually expected state (ticket without leader,
> i.e., not granted anywhere) in the order of lower seconds as long as
> the script finishes "immediately" (and it shall in this case).

as opposed to after the "renewal-freq" period (here, 60 seconds), as can be observed before the change, from [comment 0]:

> Jan 10 20:08:38 fastvm-rhel-8-0-33 boothd-site[20210]: [info]
> apacheticket (Lead/20/119999): granted successfully here
> [...]
> Jan 10 20:09:38 fastvm-rhel-8-0-33 boothd-site[20210]: [warning]
> apacheticket (Lead/20/59808): handler "/tmp/before-acquire-handler.sh"
> failed: exit code 1
> Scratch build that will disappear in a while:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=25742150

also, conveniently:
http://brew-task-repos.usersys.redhat.com/repos/scratch/jpokorny/booth/1.0/6.poki.f2d38ce.git.el8_1/booth-1.0-6.poki.f2d38ce.git.el8_1-scratch.repo
(In reply to Jan Pokorný [poki] from comment #1)
> ad 2.:
>
> I tried to crank up an easy patch that might possibly impose immediate
> on-failover rechecking via configured executable predicate [1].
>
> Scratch build that will disappear in a while:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=25742150
>
> Would be grateful if you can give it a test drive, Ried.
>
> While there still may (and likely will) be an intermittent flip for
> the ticket location, it shall settle to the actually expected state
> (ticket without leader, i.e., not granted anywhere) in the order
> of lower seconds as long as the script finishes "immediately"
> (and it shall in this case).

The test run went well. As you can see below, site 2 acquired the ticket for approximately one second, after which time the handler failed.

Jan 13 13:18:54 fastvm-rhel-8-0-33 boothd-site[12872]: [info] apacheticket (Fllw/22/58742): 192.168.22.71 wants to give the ticket away (ticket release)
Jan 13 13:18:54 fastvm-rhel-8-0-33 boothd-site[12872]: [info] setting crm_ticket attributes successful
Jan 13 13:18:55 fastvm-rhel-8-0-33 boothd-site[12872]: [info] apacheticket (Fllw/23/0): starting new election (term=23)
Jan 13 13:18:55 fastvm-rhel-8-0-33 boothd-site[12872]: [info] apacheticket (Lead/23/119999): granted successfully here
Jan 13 13:18:56 fastvm-rhel-8-0-33 boothd-site[12872]: [warning] apacheticket (Lead/23/119363): handler "/tmp/before-acquire-handler.sh" failed: exit code 1
Jan 13 13:18:56 fastvm-rhel-8-0-33 boothd-site[12872]: [warning] apacheticket (Lead/23/119362): we are not allowed to acquire ticket
Jan 13 13:18:56 fastvm-rhel-8-0-33 boothd-site[12872]: [info] setting crm_ticket attributes successful

This is certainly an improvement, as it stops site 2 in its tracks almost immediately. I can have the customer test the RPMs for his use case, if you would like.

For his use case, and especially for the general case, I'm not sure this will suffice.
(Perhaps you only mean this patch as a testing/mitigation step anyway.)

One possibility that comes to mind is that a cluster resource may fail a stop operation after probe, triggering fencing, because its underlying physical or software resource is missing (e.g., a physical LUN is not presented, or an app script's symlink is removed until the site is activated). One could reasonably argue that a resource prone to such behavior should be designed better, but we're all flawed humans.

For this and perhaps other situations that we may not be able to anticipate, it seems we would still want the handler to run prior to ticket acquisition.
> For his use case, and especially for the general case, I'm not sure
> this will suffice.

Oh, I totally see the practical weakness of the guarantee, now arising primarily from what I've mentioned as 1. above. The first instinct was to close gap 2., and that was rather successful per your confirmation, thanks.

> (Perhaps you only mean this patch as a testing/mitigation step
> anyway.)

Yep.

I was looking around some more, could you please try configuring "require-after = 5" (or so, rule of thumb perhaps being "expected time to execute the script/s + 5 s") for the ticket at hand, please?

There's some good hope it might actually be the thing that would postpone the acquisition (true, an antithesis to a timely place-ticket-anywhere resolution, but OTOH making it safe, at least in theory, regarding these asynchronous can-move-there predicate evaluations; as we know, HA solutions are based on fitting tweaks to trade-offs exactly like this).

It'd be an utter win if we could make do just with that easy patch and extra discretion on the configuration side.
Oops. "acquire-after", indeed :)
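For anyone following along, with the corrected option name the ticket stanza from comment 0 would then read something like the sketch below. This is illustrative only; the 5-second value is just the rule-of-thumb starting point suggested above, and it should exceed the handler's expected runtime:

```
ticket = "apacheticket"
    expire = 120
    renewal-freq = 60
    retries = 4
    timeout = 10
    acquire-after = 5
    before-acquire-handler = /tmp/before-acquire-handler.sh
```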
(In reply to Jan Pokorný [poki] from comment #5)
> I was looking around some more, could you please try configuring
> "require-after = 5" (or so, rule of thumb perhaps being "expected
> time to execute the script/s + 5 s") for the ticket at hand, please?

No luck with acquire-after = 5 :(

Jan 14 09:41:41 fastvm-rhel-8-0-33 boothd-site[15814]: [info] apacheticket (Fllw/23/58734): 192.168.22.71 wants to give the ticket away (ticket release)
Jan 14 09:41:41 fastvm-rhel-8-0-33 boothd-site[15814]: [info] setting crm_ticket attributes successful
Jan 14 09:41:41 fastvm-rhel-8-0-33 boothd-site[15814]: [info] apacheticket (Fllw/24/0): starting new election (term=24)
Jan 14 09:41:41 fastvm-rhel-8-0-33 boothd-site[15814]: [info] apacheticket (Lead/24/119999): granted successfully here
Jan 14 09:41:42 fastvm-rhel-8-0-33 boothd-site[15814]: [warning] apacheticket (Lead/24/119293): handler "/tmp/before-acquire-handler.sh" failed: exit code 1
Jan 14 09:41:42 fastvm-rhel-8-0-33 boothd-site[15814]: [warning] apacheticket (Lead/24/119293): we are not allowed to acquire ticket
Jan 14 09:41:42 fastvm-rhel-8-0-33 boothd-site[15814]: [info] setting crm_ticket attributes successful
Jan 14 09:41:52 fastvm-rhel-8-0-33 boothd-site[15814]: [warning] apacheticket (Init/24/0): no answers to our VtFr request to step down (try #1), we are alone

Then I bumped it up to 30, with no change. For this run, I also removed `timeout = 10` from my booth.conf in case that was interfering.

Jan 14 09:45:26 fastvm-rhel-8-0-33 boothd-site[17675]: [info] apacheticket (Fllw/24/58722): 192.168.22.71 wants to give the ticket away (ticket release)
Jan 14 09:45:26 fastvm-rhel-8-0-33 boothd-site[17675]: [info] setting crm_ticket attributes successful
Jan 14 09:45:26 fastvm-rhel-8-0-33 boothd-site[17675]: [info] apacheticket (Fllw/25/0): starting new election (term=25)
Jan 14 09:45:26 fastvm-rhel-8-0-33 boothd-site[17675]: [info] apacheticket (Lead/25/119999): granted successfully here
Jan 14 09:45:27 fastvm-rhel-8-0-33 boothd-site[17675]: [warning] apacheticket (Lead/25/118871): handler "/tmp/before-acquire-handler.sh" failed: exit code 1
Jan 14 09:45:27 fastvm-rhel-8-0-33 boothd-site[17675]: [warning] apacheticket (Lead/25/118871): we are not allowed to acquire ticket
Jan 14 09:45:27 fastvm-rhel-8-0-33 boothd-site[17675]: [info] setting crm_ticket attributes successful
Jan 14 09:45:32 fastvm-rhel-8-0-33 boothd-site[17675]: [warning] apacheticket (Init/25/0): no answers to our VtFr request to step down (try #1), we are alone

It was worth a shot, and I'm willing to test other parameters if you identify anything you think would help.

Maybe implementing something like this client option within the source for this auto-acquisition? Or could be totally irrelevant :)

    -C  wait for ticket commit to CIB:
        The client waits for the ticket commit to CIB (only for grant
        requests). If one or more sites are unreachable, this takes the
        ticket expire time (plus, if defined, the acquire-after time).
Ried, no problem, thanks for trying. One more question though, since this is something not exposed in the logs directly: was it just declaratively asserting ticket for a bit, or was that information reflected on the pacemaker side as well? Asking since it wouldn't matter what boothd logged, then. Just the tangible effect is what counts.
(In reply to Jan Pokorný [poki] from comment #8)
> Ried, no problem, thanks for trying.
>
> One more question though, since this is something
> not exposed in the logs directly:
>
> was it just declaratively asserting ticket for a bit,
> or was that information reflected on the pacemaker
> side as well?
>
> Asking since it wouldn't matter what boothd logged, then.
> Just the tangible effect is what counts.

This is the entirety of what Pacemaker logged on fastvm-rhel-8-0-33 at site 2 (the node that acquired the ticket in booth). Time wasn't synced with chrony on site 2, so I ran another test. Here are the logs:

Site 1: /var/log/pacemaker/pacemaker.log:

Jan 14 12:12:45 fastvm-rhel-8-0-23 booth: [31463]: info: apacheticket (Init/25/0): granting ticket
Jan 14 12:12:45 fastvm-rhel-8-0-23 booth: [31463]: info: apacheticket (Init/25/0): starting new election (term=25)
Jan 14 12:12:45 fastvm-rhel-8-0-23 booth: [31463]: info: apacheticket (Lead/25/119999): granted successfully here
Jan 14 12:12:45 fastvm-rhel-8-0-23 crm_ticket [31681] (crm_log_args) notice: Invoked: crm_ticket -t apacheticket -S owner -v 1950506022
...
Jan 14 12:12:46 fastvm-rhel-8-0-23 crm_ticket [31693] (crm_log_args) notice: Invoked: crm_ticket -t apacheticket -S SAFE_TO_ACTIVATE -v 0
...
Jan 14 12:13:46 fastvm-rhel-8-0-23 booth: [31463]: warning: apacheticket (Lead/25/59826): handler "/tmp/before-acquire-handler.sh" failed: exit code 1
Jan 14 12:13:46 fastvm-rhel-8-0-23 booth: [31463]: warning: apacheticket (Lead/25/59826): we are not allowed to acquire ticket
Jan 14 12:13:46 fastvm-rhel-8-0-23 crm_ticket [32190] (crm_log_args) notice: Invoked: crm_ticket -t apacheticket -S owner -v 0
...
Jan 14 12:13:46 fastvm-rhel-8-0-23 booth: [31463]: info: setting crm_ticket attributes successful
...
Jan 14 12:13:46 fastvm-rhel-8-0-23 crm_ticket [32193] (crm_log_args) notice: Invoked: crm_ticket -t apacheticket -r --force
...
Jan 14 12:13:46 fastvm-rhel-8-0-23 crm_ticket [32193] (modify_ticket_state) debug: Update <ticket_state id="apacheticket" SAFE_TO_ACTIVATE="0" owner="0" expires="1579032826" term="25" granted="false" last-granted="1579032765"/>
...
Jan 14 12:13:47 fastvm-rhel-8-0-23 booth: [31463]: info: apacheticket (Fllw/26/0): 192.168.22.81 votes for none, ignoring (duplicate ticket release?)
Jan 14 12:13:52 fastvm-rhel-8-0-23 booth: [31463]: info: apacheticket (Fllw/26/0): 192.168.22.81 votes for none, ignoring (duplicate ticket release?)
...
Jan 14 12:13:56 fastvm-rhel-8-0-23 booth: [31463]: info: apacheticket (Fllw/26/0): starting new election (term=26)
Jan 14 12:13:56 fastvm-rhel-8-0-23 booth: [31463]: info: apacheticket (Lead/26/119999): granted successfully here
Jan 14 12:13:56 fastvm-rhel-8-0-23 booth: [31463]: warning: apacheticket (Lead/26/119567): handler "/tmp/before-acquire-handler.sh" failed: exit code 1
Jan 14 12:13:56 fastvm-rhel-8-0-23 booth: [31463]: warning: apacheticket (Lead/26/119567): we are not allowed to acquire ticket
Jan 14 12:13:56 fastvm-rhel-8-0-23 crm_ticket [32268] (crm_log_args) notice: Invoked: crm_ticket -t apacheticket -S owner -v 0
...
Jan 14 12:13:56 fastvm-rhel-8-0-23 booth: [31463]: info: setting crm_ticket attributes successful
...
Jan 14 12:13:57 fastvm-rhel-8-0-23 booth: [31463]: info: apacheticket (Init/26/0): candidate status lost, ignoring VtFr from 192.168.22.81

Site 2: /var/log/pacemaker/pacemaker.log:

Jan 14 12:12:45 fastvm-rhel-8-0-33 pacemaker-based [12641] (cib_perform_op) info: + /cib/status/tickets/ticket_state[@id='apacheticket']: @owner=1950506022
Jan 14 12:12:45 fastvm-rhel-8-0-33 pacemaker-based [12641] (cib_perform_op) info: + /cib/status/tickets/ticket_state[@id='apacheticket']: @expires=1579032885
Jan 14 12:13:46 fastvm-rhel-8-0-33 pacemaker-based [12641] (cib_perform_op) info: + /cib/status/tickets/ticket_state[@id='apacheticket']: @owner=0
Jan 14 12:13:46 fastvm-rhel-8-0-33 pacemaker-based [12641] (cib_perform_op) info: + /cib/status/tickets/ticket_state[@id='apacheticket']: @expires=1579032826
Jan 14 12:13:47 fastvm-rhel-8-0-33 pacemaker-based [12641] (cib_perform_op) info: + /cib/status/tickets/ticket_state[@id='apacheticket']: @expires=1579032827
Jan 14 12:13:47 fastvm-rhel-8-0-33 pacemaker-based [12641] (cib_perform_op) info: + /cib/status/tickets/ticket_state[@id='apacheticket']: @term=26
Jan 14 12:13:47 fastvm-rhel-8-0-33 pacemaker-based [12641] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=fastvm-rhel-8-0-33/crm_ticket/4, version=0.98.10)
Jan 14 12:13:47 fastvm-rhel-8-0-33 pacemaker-based [12641] (cib_process_request) info: Forwarding cib_modify operation for section status to all (origin=local/crm_ticket/4)
Jan 14 12:13:47 fastvm-rhel-8-0-33 pacemaker-based [12641] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=fastvm-rhel-8-0-33/crm_ticket/4, version=0.98.10)
Jan 14 12:13:52 fastvm-rhel-8-0-33 pacemaker-based [12641] (cib_process_ping) info: Reporting our current digest to fastvm-rhel-8-0-34: 99874a5333cc60295ba99081fd28e4ac for 0.98.10 (0x558dcc339010 0)

/var/log/messages:

Jan 14 12:12:45 fastvm-rhel-8-0-33 boothd-site[9789]: [info] setting crm_ticket attributes successful
Jan 14 12:13:46 fastvm-rhel-8-0-33 boothd-site[9789]: [info] apacheticket (Fllw/25/58738): 192.168.22.71 wants to give the ticket away (ticket release)
Jan 14 12:13:46 fastvm-rhel-8-0-33 boothd-site[9789]: [info] setting crm_ticket attributes successful
Jan 14 12:13:46 fastvm-rhel-8-0-33 boothd-site[9789]: [info] apacheticket (Fllw/26/0): starting new election (term=26)
Jan 14 12:13:46 fastvm-rhel-8-0-33 boothd-site[9789]: [info] apacheticket (Lead/26/119999): granted successfully here
Jan 14 12:13:47 fastvm-rhel-8-0-33 boothd-site[9789]: [warning] apacheticket (Lead/26/119485): handler "/tmp/before-acquire-handler.sh" failed: exit code 1
Jan 14 12:13:47 fastvm-rhel-8-0-33 boothd-site[9789]: [warning] apacheticket (Lead/26/119485): we are not allowed to acquire ticket
Jan 14 12:13:47 fastvm-rhel-8-0-33 boothd-site[9789]: [info] setting crm_ticket attributes successful
Jan 14 12:13:52 fastvm-rhel-8-0-33 boothd-site[9789]: [warning] apacheticket (Init/26/0): no answers to our VtFr request to step down (try #1), we are alone
Jan 14 12:13:56 fastvm-rhel-8-0-33 boothd-site[9789]: [warning] apacheticket (Init/26/0): 192.168.22.71 voted previously for 192.168.22.81 and now wants to vote for 192.168.22.71 (ignored)
Jan 14 12:13:56 fastvm-rhel-8-0-33 boothd-site[9789]: [info] apacheticket (Init/26/0): 192.168.22.71 votes for none, ignoring (duplicate ticket release?)
Jan 14 12:13:57 fastvm-rhel-8-0-33 boothd-site[9789]: [warning] apacheticket (Init/26/0): no answers to our VtFr request to step down (try #2), we are alone
Jan 14 12:14:01 fastvm-rhel-8-0-33 boothd-site[9789]: [info] apacheticket (Init/26/0): 192.168.22.71 votes for none, ignoring (duplicate ticket release?)
Jan 14 12:14:02 fastvm-rhel-8-0-33 boothd-site[9789]: [warning] apacheticket (Init/26/0): no answers to our VtFr request to step down (try #3), we are alone
Jan 14 12:14:06 fastvm-rhel-8-0-33 boothd-site[9789]: [info] apacheticket (Init/26/0): 192.168.22.71 votes for none, ignoring (duplicate ticket release?)
Jan 14 12:14:07 fastvm-rhel-8-0-33 boothd-site[9789]: [warning] apacheticket (Init/26/0): no answers to our VtFr request to step down (try #4), we are alone
Jan 14 12:14:11 fastvm-rhel-8-0-33 boothd-site[9789]: [info] apacheticket (Init/26/0): 192.168.22.71 votes for none, ignoring (duplicate ticket release?)
Jan 14 12:14:12 fastvm-rhel-8-0-33 boothd-site[9789]: [info] apacheticket (Init/26/0): giving up on sending retries
Jan 14 12:14:16 fastvm-rhel-8-0-33 boothd-site[9789]: [info] apacheticket (Init/26/0): 192.168.22.71 votes for none, ignoring (duplicate ticket release?)

Make of that what you will.

As far as resources are concerned, I see no indication either in node 33's logs or in the schedulerd logs of node 34 (the standby'd DC node) that the ticket-constrained resource tried to start or was even allowed to start. All the schedulerd messages continued to say "cannot run anywhere."
On the first sight, it looks positively to me, and might be a gist of the proper solution ... assuming pre-commit delay is configured in a way it will reliably cover[*] for the natural delay till the before-acquire script or scripts finish(es).

[*] "reliably cover" is a sole liability and responsibility of the users, indeed, since only they have an idea how long the execution of their own executables will take
(In reply to Jan Pokorný [poki] from comment #10)
> On the first sight, it looks positively to me, and might be a gist
> of the proper solution ... assuming pre-commit delay is configured
> in a way it will reliably cover[*] for the natural delay till the
> before-acquire script or scripts finish(es).
>
> [*] "reliably cover" is a sole liability and responsibility of the
> users, indeed, since only they have an idea how long the execution
> of their own executables will take

I agree, as long as the script does not require an abnormally long time, etc., on any particular execution. It's not uncommon for users to be surprised by a script taking longer to complete than it otherwise should, when something is going wrong either on that system or elsewhere in the environment (e.g., storage or networking). I'd hate for that to end with a node getting fenced due to a stop failure after being improperly granted a ticket, when it otherwise would remain online.

This is a big improvement already and is likely to work in a great many situations -- perhaps it will even work in *ALL* situations, since Pacemaker does not appear to probe resources prematurely.

I have to admit I don't understand the potential problem with a more synchronous check of a given ticket's before-acquire-handler described in comment #1, provided there is some timeout value after which we give up. In general, I'd prefer that something work correctly with as little user discretion and tuning as possible, lest users miss documentation or introduce a mistake.

poki++
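The "synchronous check with a give-up timeout" idea above could be prototyped at the shell level with coreutils timeout(1). This is a sketch under assumptions: the check_allowed helper and the time budgets are invented here for illustration, not anything booth provides; the point is only that a hung predicate comes out the same as a failed one:

```shell
#!/bin/bash
# Hypothetical helper (NOT a booth feature): run a predicate command
# under a hard time budget; treat both "predicate failed" and
# "predicate timed out" as a denial.
check_allowed() {
    local budget=$1; shift
    timeout "$budget" "$@"    # rc 124 on timeout, else the command's rc
}

check_allowed 5 true  && echo "fast success -> acquire allowed"
check_allowed 5 false || echo "fast failure -> acquire denied"
check_allowed 1 sleep 30 || echo "hung predicate cut off -> acquire denied"
```

A caller would then only need to distinguish rc 0 (allow) from any non-zero rc (deny), which keeps the "give up after a timeout" policy in one place.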
*** Bug 1761298 has been marked as a duplicate of this bug. ***