Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1761298

Summary: Booth ticket manager before-acquire-handler is utterly broken
Product: Red Hat Enterprise Linux 8
Reporter: John <jss>
Component: booth
Assignee: Jan Friesse <jfriesse>
Status: CLOSED DUPLICATE
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 8.0
CC: cfeist, cluster-maint
Target Milestone: rc
Flags: pm-rhel: mirror+
Target Release: 8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-07-22 14:47:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description John 2019-10-14 03:06:53 UTC
Description of problem:

The Booth ticket manager is a problem.

The documentation is appalling. The developer has had an open ticket on GitHub against RHEL 7 since 2015.

Nevertheless, I have managed to get the basics working, until I try to use the "before-acquire-handler" feature. It just *does not work*.

Version-Release number of selected component (if applicable):
1.0.8

How reproducible:
Add something like this to your configuration for a ticket:
    before-acquire-handler = /usr/local/bin/set-san.sh

/usr/local/bin/set-san.sh is just a simple test script at this point. When I run it on the command line it returns successfully. I can run it via sudo as the hacluster user, and yes, it still runs successfully. But according to booth, it fails.

#!/bin/sh
# Simple test handler: append a timestamped line to a log file.
MYDATE=$(date)
MYHOSTNAME=$(hostname -s)
echo "${MYDATE} - Enable SAN for replication from ${MYHOSTNAME}" >> /var/log/booth/set-san.log
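
As it later turns out (see comment 11), the handler was likely failing only because it could not write its log. A more defensive version of the test script (a sketch; the log path is the one from this report) would guard the write and exit 0 explicitly, since booth only looks at the handler's exit status:

```shell
#!/bin/sh
# Defensive variant of the test handler: booth denies the grant on any
# non-zero exit, so don't let a logging failure fail the whole handler.
LOGFILE=/var/log/booth/set-san.log
MYDATE=$(date)
MYHOSTNAME=$(hostname -s)
echo "${MYDATE} - Enable SAN for replication from ${MYHOSTNAME}" >> "${LOGFILE}" 2>/dev/null || true
exit 0
```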



Steps to Reproduce:
1. Do anything with booth.
2. Wonder where the logs are.
3. Get booth working, then try to use the before-acquire-handler. It *cannot* be made to work.

Actual results:
Booth appears to not even run the script I've specified. It just says:

[root@blah ]# pcs booth ticket grant BloopTikkit ${SOME_SITE}
Error: unable to grant booth ticket 'BloopTikkit' for site 'someip', reason: Oct 14 13:59:44 somesite.com booth: [21713]: error: before-acquire-handler for ticket "BloopTikkit" failed, grant denied

The result is the same even if I move the script out of the way, so that it cannot be run. I have tried everything, and nothing works, so I can only conclude booth is not even attempting to run the script.

Expected results:
Booth attempts to run the specified handler.

Additional info:

Additional info? Where might I obtain additional info? Is there a booth logfile I can look at? Can I increase debugging output from booth? Yes? No? How would I know, when it's not documented?

Is this really how Red Hat expects customers to manage geographically distributed clusters? Seriously? I've googled for clues and assistance as to what is going wrong, and there is nothing. Does anyone use this product?

Comment 2 John 2019-10-14 04:20:07 UTC
This is just unbelievable: I've removed the before-acquire-handler from my booth config and synced it to all nodes in my two clusters, and I *STILL* get the same error message about the before-acquire-handler when I try to grant a ticket.



# pcs booth config
authfile = /etc/booth/booth.key
site = someip
site = anotherip
arbitrator = myip
ticket = "BloopTikkit"
    expire        = 300
    acquire-after = 60
    timeout       = 10
    retries       = 10
    renewal-freq  = 120

    # activate with e.g.:  geostore set -t BloopTikkit -s ${CFO_SITE} ACTIVATE 1
    attr-prereq = auto ACTIVATE eq 1


# pcs booth ticket grant BloopTikkit ${CFO_SITE}
Error: unable to grant booth ticket 'BloopTikkit' for site 'meh', reason: Oct 14 15:15:12 woo.com booth: [6235]: error: before-acquire-handler for ticket "BloopTikkit" failed, grant denied


I've stopped and restarted booth on my booth arbitrator node, and still get this message about before-acquire-handler.

So it appears just *attempting* to use this broken option is enough to leave one's booth configuration completely broken and messed up, for good.

Just hopeless.

Comment 3 John 2019-10-14 04:22:16 UTC
I'm probably being a bit rude, but I'm just sick of having my time wasted by dysfunctional and poorly documented software like this.

Comment 4 John 2019-10-14 05:03:13 UTC
I've tried getting rid of the old ticket, removing constraints that reference it, and adding a new ticket.
No good. Even after adding the new ticket and syncing the booth config, when I try to grant the new ticket it says the ticket does not exist, even though I can see it in the config.

So now I'm trying to start again, and I get this:

# pcs booth destroy
Error: booth instance 'booth' is used (running in systemd)
Error: booth instance 'booth' is used (enabled in systemd)

# pcs booth stop
booth@booth stopped

# pcs booth destroy
Error: booth instance 'booth' is used (enabled in systemd)

THIS SHOULD NOT BE HAPPENING

Comment 5 John 2019-10-14 05:49:29 UTC
# pcs booth disable
booth@booth disabled

pcs booth destroy
#success

hooray
Still, this booth thing is a complete, utter mess.

I've now got it working again, after doing remove and destroy on every node and starting with a completely fresh config. So that's a really robust system, isn't it? Make the simple mistake of trying to use a feature which is supposed to work but doesn't, and your whole setup is broken and has to be obliterated and recreated.
Great.

Comment 6 John 2019-10-14 06:04:51 UTC

To summarise:

Most of the pcs stuff seems to work like a breeze. I was able to set up two clusters in no time at all, then connect to them using the pcsd web GUI, which seems to work well enough...
except for the fact that this bug:
   https://bugzilla.redhat.com/show_bug.cgi?id=1207405
has not been properly fixed - the pcsd web GUI is still incredibly slow after node(s) go down, so the problem STILL exists in EL7.7 with all updates to 2019-10-11.

I was able to set up the booth ticket manager and have it control a dummy service correctly.
But as soon as I attempted to use the booth before-acquire-handler, it broke my whole booth configuration.

Oh, there is one other issue.
The stonith VMware SOAP fence agent is unreliable - it runs for a while, and then I come in in the morning and it has failed. So that is not good enough.

Looks to me like Red Hat High Availability should really be renamed "Medium Availability at Best".

Comment 7 Jan Pokorný [poki] 2019-10-22 13:12:53 UTC
John, any feedback may turn out to be very helpful in the end, so no doubt:
thank you for sharing even this mixture of user experience and
feelings.

There are things I can respond to right away, while for others, I need
to get my hands on booth in production-like scenarios or defer to
my colleagues.  A little patience is therefore appreciated.

* * *

Re: easy items


1/ logging

> wonder where the logs are

> Where might I obtain additional info? Is there a booth logfile
> I can look at?

By default, booth daemons log to syslog, which in the case of RHEL 7 most
likely means you should be able to get to the sought messages from these
daemons using a variation of the following command as root:

  journalctl -t booth -t boothd-arbitrator -t boothd-site

Note that you can actually enforce logging to a file, using admittedly
undocumented environment variables as follows:

  * HA_logfile (expects a full path for the regular log file)

  * HA_debugfile (expects a full path for the detailed log file)

  * HA_debug (expects a non-negative number determining the depth
              of the debug verbosity)


For an arbitrator, you can configure this in a straightforward
way; just add an Environment= directive using

  systemctl edit booth-arbitrator.service


Otherwise, when booth is executed as a cluster resource, perhaps your best
bet would be to add assignments for these variables directly in the
/etc/sysconfig/pacemaker file.
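
Putting the above together for an arbitrator, the drop-in created via `systemctl edit` could look like this (a sketch using the environment variables listed above; the log paths and debug level are illustrative):

```shell
# Create a drop-in for the arbitrator unit by hand
# (equivalent to what `systemctl edit booth-arbitrator.service` writes):
mkdir -p /etc/systemd/system/booth-arbitrator.service.d
cat > /etc/systemd/system/booth-arbitrator.service.d/logging.conf <<'EOF'
[Service]
Environment=HA_logfile=/var/log/booth/booth.log
Environment=HA_debugfile=/var/log/booth/booth-debug.log
Environment=HA_debug=1
EOF
# Make systemd pick up the drop-in and restart the daemon:
systemctl daemon-reload
systemctl restart booth-arbitrator.service
```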

* * *

Will follow with other points.

Please let me know if the verbose output sheds some light on your
issues in the interim (or if the above did not help you activate
it, to begin with).

Comment 8 John 2019-10-28 07:00:30 UTC
Hi Jan,

thanks for your reply, I will try to be more patient... I know I can be a bit blunt sometimes, so I am sorry for that.

I've been busy ansible-ising my cluster build & management operations, to make my testing more repeatable etc.

I also shifted to testing on RHEL8, to see if things were any better there, but it looks like the Booth ticket manager is *even worse* on EL8.
I cannot get it to work *at all* on EL8.

I get my booth cluster resources going (floating IP and booth service), everything looks good, but then:
 - I add a ticket, sync it across clusters and to my booth arbitrator, and
 - "crm_ticket --info" shows the ticket as "revoked"
 - But "booth list" does not even display the ticket. This is disappointing, but I recall a comment in the documentation saying booth does not manage a ticket until it is granted, so maybe this is normal.
 - I optimistically attempt to grant the ticket to one of my clusters with booth, and it fails, telling me the ticket does not exist.
 - Booth says the ticket does not exist, but I can see it in my booth.conf and with crm_ticket.

Incredibly frustrating.
I really do not like this Booth thing at all, it seems incredibly bug-ridden.

I will now take a look at some more detailed logging as per your suggestions (thank you!).

I have also obtained a Red Hat HA evaluation license and logged a support case with Red Hat re this Booth issue and several others (as I will need to see these issues resolved before we buy a license to deploy several clusters...)
So I will hopefully receive some assistance through that, and will keep updating this ticket with anything relevant to the Booth issues.

Cheers & regards,
John

Comment 10 Jan Friesse 2020-05-14 14:40:15 UTC
Moving to RHEL 8 as RHEL 7 is going to the maintenance phase and this bug is more of an RFE (and a dup of bug 1790009).

Comment 11 Jan Friesse 2020-07-22 14:47:03 UTC
So I think this BZ was really mostly about some misunderstandings.

1. /usr/local/bin/set-san.sh - I've tried the same script and it works just fine. I believe there was some problem - probably when trying to write to /var/log/booth/set-san.log, so the script failed. Logging would contain more info (at least the exit code).
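
One quick way to confirm that theory is to run the handler the same way booth would, as the cluster user, and inspect the exit status (a sketch; the hacluster user and the script path are the ones from this report):

```shell
# Run the handler as the non-root cluster user and show the exit
# status, which is what booth acts on.
sudo -u hacluster /usr/local/bin/set-san.sh
echo "handler exit code: $?"
# Any non-zero code here (e.g. because /var/log/booth is not writable
# by hacluster) is what booth reports as "grant denied".
```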

2. Reload is not implemented (bug 1771236 is generally about it), which is why booth has to be restarted when the config changes. pcs doesn't restart booth.

3. It's not enough to change/restart only the arbitrator booth. The before-acquire-handler is per-node and executed on the node which is supposed to acquire the ticket.

4. Debugging information was already provided.

There is real problem with before-acquire-handler documented in bug 1790009 which should be handled.

So closing this bug as a dup of bug 1790009 (and, if it would be possible, also bug 1790009)

*** This bug has been marked as a duplicate of bug 1790009 ***

Comment 12 John 2020-07-22 21:34:17 UTC
Jan, there was no "misunderstanding".

Re your points:
1) I know the script set-san.sh works. It is (or was) booth that does not work... not the set-san.sh script.
   Yes, on my first few tries, the script could not write to its log. I fixed that ages ago, but booth still did not implement the before-acquire-handler correctly.

2) My first reaction to this point is... maybe pcs *SHOULD* restart booth... if it would actually fix this bug, which is what people are trying to achieve when they take the time and effort to lodge bug reports like this one. It has been some time since I did any work on the Red Hat HA services, because they were unusable in the state they were in, so I've been waiting for these bugs to be fixed. But if I remember correctly, the booth service on cluster nodes runs under pcs as a pcs service. If pcs needs to restart it in order for things to work correctly, then pcs can and *should* restart it.

3) Are you suggesting this bug is fixed if we restart booth on the cluster nodes? I am pretty sure I would have tested this at some point... but I will test it again.

4) Yes, people have given me some help with where to look in the logs, so that is fine.

I will update the packages on my HA test cluster this weekend, then refresh (destroy and recreate) my cluster.
I will then test this again, making sure I restart booth on each cluster node.
If it is fixed, I will be greatly relieved, but if this bug is still present, I am going to be furious.

I cannot see bug 1771236. Please grant me access to this bug immediately, so I can see if it is relevant.

Thank you for looking at this issue.

John.

Comment 13 John 2020-07-22 21:35:12 UTC
I will also need access to bug 1790009, thank you.

Comment 14 John 2020-07-22 21:39:27 UTC
Sorry, never mind re bug 1790009 - I *do* have access to that bug.

But I do not have access to bug 1771236.
Please grant me access to that one.

Thanks.

Comment 15 John 2020-07-22 21:53:39 UTC
Ah.
Hang on.

Sorry, I have refreshed my memory re these tickets - and, Jan, I think you are correct about this ticket being a bit of a misunderstanding, caused by the set-san.log not being writable.

Yep, I raised this ticket when my script was failing due to an inability to write its log.
After I fixed that, I encountered other problems, which are being addressed in bug 1790009.

So yep it's fine to close this ticket, with my apologies.

I will update & refresh my test cluster this weekend and see how things are progressing.
Is it possible to grant me access to bug 1771236, so I can take a look and see if it is relevant to the problems I've seen?

Thanks again.

Comment 16 Jan Friesse 2020-07-23 07:11:54 UTC
@John,
the biggest problem you see is really booth's inability to reload its config. pcs could restart booth, and that would work most of the time - but not when the election process is in progress. And this is the problem, and the reason why we need a bigger hammer there (bug 1771236). That is also the reason why I don't reassign this BZ to pcs.

So basically, when (if) you change the ticket config, you must ensure that you:
1. sync the config to all nodes (this is what pcs is able to do)
2. restart all booth daemon instances on all nodes (so all sites, and all arbitrators)

If you just change the before-acquire-handler, it should be enough to restart booth on the node where the before-acquire-handler will be executed (but in practice that means restarting booth on all sites).
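
The two steps above could be scripted roughly as follows (a sketch only: the arbitrator hostname, the `booth@booth` unit name seen earlier in this report, and the booth resource name are illustrative assumptions about the deployment):

```shell
# Sketch of the "sync, then restart everywhere" procedure.
ARBITRATORS="arb1.example.com"

# 1. Sync the changed config to all nodes of the local cluster.
pcs booth sync

# 2. Restart every booth daemon instance.
# Arbitrators run booth as a plain systemd service:
for h in ${ARBITRATORS}; do
    ssh "${h}" systemctl restart booth@booth.service
done
# Sites run booth as a Pacemaker resource, so restart it via the
# cluster (resource name is illustrative):
pcs resource restart booth-booth-service
```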

Also, restarting booth on a site is not very well handled, because it is running as a Pacemaker resource.

No matter what, I agree booth really needs work to become friendly to "runtime" changes - because right now it has exactly no ability to change anything at runtime. That said, it sounds easy, but in reality it is usually a pretty painful process. We tried to do the same with corosync, and honestly, the first version which really has the ability to change most things at runtime is the current master (soon becoming 3.1.0).

I've added you to the CC of bug 1771236, which should (if my memory doesn't fool me) give you access to the bug.