Bug 1761298
| Summary: | Booth ticket manager before-acquire-handler is utterly broken | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | John <jss> |
| Component: | booth | Assignee: | Jan Friesse <jfriesse> |
| Status: | CLOSED DUPLICATE | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 8.0 | CC: | cfeist, cluster-maint |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | 8.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-22 14:47:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
This is just unbelievable: I've removed the before-acquire-handler from my booth config and synced it to all nodes in my two clusters, and I *STILL* get the same error message about the before-acquire-handler when I try to grant a ticket.
# pcs booth config
authfile = /etc/booth/booth.key
site = someip
site = anotherip
arbitrator = myip
ticket = "BloopTikkit"
expire = 300
acquire-after = 60
timeout = 10
retries = 10
renewal-freq = 120
#activate with eg # geostore set -t BloopTikkit -s ${CFO_SITE} ACTIVATE 1
attr-prereq = auto ACTIVATE eq 1
# pcs booth ticket grant BloopTikkit ${CFO_SITE}
Error: unable to grant booth ticket 'BloopTikkit' for site 'meh', reason: Oct 14 15:15:12 woo.com booth: [6235]: error: before-acquire-handler for ticket "BloopTikkit" failed, grant denied
I've stopped and restarted booth on my booth arbitrator node, and still get this message about before-acquire-handler.
So it appears just *attempting* to use this broken option is enough to leave one's booth configuration completely broken and messed up, for good.
Just hopeless.
I'm probably being a bit rude, but I'm just sick of having my time wasted by dysfunctional and poorly documented software like this. I've tried getting rid of the old ticket, removing the constraints that reference it, and adding a new ticket. No good. Even after adding the new ticket and syncing the booth config, when I try to grant the new ticket it says the ticket does not exist, even though I can see it in the config. So now I'm trying to start again, and I get this:

# pcs booth destroy
Error: booth instance 'booth' is used (running in systemd)
Error: booth instance 'booth' is used (enabled in systemd)
# pcs booth stop
booth@booth stopped
# pcs booth destroy
Error: booth instance 'booth' is used (enabled in systemd)

THIS SHOULD NOT BE HAPPENING

# pcs booth disable
booth@booth disabled
# pcs booth destroy
(success, hooray)

Still, this booth thing is a complete, utter mess. I've now got it working again, after doing remove and destroy on every node and starting with a completely fresh config. So that's a really robust system, isn't it. Make the simple mistake of trying to use a feature which is supposed to work but doesn't, and your whole setup is broken and has to be obliterated and recreated. Great.

To summarise: most of the pcs stuff seems to work like a breeze. I was able to set up two clusters in no time at all, then connect to them using the pcsd web GUI, which seems to work well enough... except for the fact that this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1207405 has not been properly fixed. The pcsd web GUI is still incredibly slow after node(s) go down, so the problem STILL exists in EL7.7 with all updates to 2019-10-11. I was able to set up the booth ticket manager and have it control a dummy service correctly. But as soon as I attempted to use the booth before-acquire-handler, it broke my whole booth configuration.

Oh, there is one other issue: the stonith VMware SOAP fence agent is unreliable. It runs for a while, and then I come in in the morning and it has failed.
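For what it's worth, the transcript above does show a teardown order that eventually works: pcs refuses to destroy the booth config while the systemd instance is still running or enabled, so the instance has to be stopped and disabled first. A condensed sketch of that sequence (commands exactly as reported in the transcript; run on each node):

```shell
# pcs will not destroy the booth config while booth@booth is active:
pcs booth stop      # stop the running booth@booth systemd instance
pcs booth disable   # clear the "enabled in systemd" state
pcs booth destroy   # now succeeds
```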
So that is not good enough. Looks to me like Red Hat High Availability should really be renamed "Medium Availability at Best".

John, any feedback can turn out to be very helpful in the end, so no doubt: thank you for sharing even this mixture of user experience and feelings. There are things I can respond to right away, while for others I need to get my hands on booth in production-like scenarios, or defer to my colleagues. A little patience is therefore appreciated.

* * *

Re: easy items

1/ logging

> wonder where the logs are
> Where might I obtain additional info? Is there a booth logfile I can look at?

By default, the booth daemons log to syslog, which in the case of RHEL 7 most likely means you should be able to get to the sought messages from these daemons using a variation of the following command as root:

journalctl -t booth -t boothd-arbitrator -t boothd-site

Note that you can actually enforce in-file logging, using admittedly undocumented environment variables, as follows:

* HA_logfile (expects a full path for the casual log file)
* HA_debugfile (expects a full path for the detailed log file)
* HA_debug (expects a non-negative number determining the depth of the debug verbosity)

For an arbitrator, you can configure this in a straightforward way: just add an Environment= directive using

systemctl edit booth-arbitrator.service

Otherwise, when booth is executed as a cluster resource, perhaps your best bet would be to add assignments for these variables directly in the /etc/sysconfig/pacemaker file.

* * *

Will follow up with the other points. Please let me know if the verbose output sheds some light on your issues in the interim (or if the above did not help you activate it, to begin with).

Hi Jan, thanks for your reply. I will try to be more patient... I know I can be a bit blunt sometimes, so I am sorry for that. I've been busy ansible-ising my cluster build & management operations, to make my testing more repeatable etc.
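The arbitrator logging setup described in the comment above can be sketched as a systemd drop-in. This is a hedged, non-interactive equivalent of `systemctl edit booth-arbitrator.service`; the unit name is taken from the comment, and the log paths under /var/log/booth are placeholders, not values the comment specifies:

```shell
# Create a drop-in override for the arbitrator unit (run as root).
# HA_logfile / HA_debugfile / HA_debug are the undocumented variables
# mentioned in the comment above; paths here are illustrative.
mkdir -p /etc/systemd/system/booth-arbitrator.service.d
cat > /etc/systemd/system/booth-arbitrator.service.d/logging.conf <<'EOF'
[Service]
Environment=HA_logfile=/var/log/booth/booth.log
Environment=HA_debugfile=/var/log/booth/booth-debug.log
Environment=HA_debug=1
EOF
systemctl daemon-reload
systemctl restart booth-arbitrator.service
```

After the restart, the two log files should start filling; if not, fall back to `journalctl -t booth -t boothd-arbitrator -t boothd-site` as above.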
I also shifted to testing on RHEL 8, to see if things were any better there, but it looks like the booth ticket manager is *even worse* on EL8. I cannot get it to work *at all* on EL8. I get my booth cluster resources going (floating IP and booth service), and everything looks good, but then:

- I add a ticket and sync it across the clusters and to my booth arbitrator.
- "crm_ticket --info" shows the ticket as "revoked".
- But "booth list" does not even display the ticket. This is disappointing, but I recall a comment in the documentation saying booth does not manage a ticket until it is granted, so maybe this is normal.
- I optimistically attempt to grant the ticket to one of my clusters with booth, and it fails, telling me the ticket does not exist.
- Booth says the ticket does not exist, but I can see it in my booth.conf, and with crm_ticket.

Incredibly frustrating. I really do not like this booth thing at all; it seems incredibly bug-ridden. I will now take a look at some more detailed logging as per your suggestions (thank you!). I have also obtained a Red Hat HA evaluation license and logged a support case with Red Hat regarding this booth issue and several others (as I will need to see these issues resolved before we buy a license to deploy several clusters...). So I will hopefully receive some assistance through that, and will keep updating this ticket with anything relevant to the booth issues.

Cheers & regards, John

Moving to RHEL 8, as RHEL 7 is going to the maintenance phase and this bug is more of an RFE (and a dup of bug 1790009).

So I think this BZ was really mostly about some misunderstandings:

1. /usr/local/bin/set-san.sh - I've tried the same script and it works just fine. I believe there was some problem, probably when trying to write to /var/log/booth/set-san.log, so the script failed. Logging would contain more info (at least the exit code).

2. Reload is not implemented (bug 1771236 is generally about it); that's why booth has to be restarted when the config changes. pcs doesn't restart booth.

3. It's not enough to change/restart only the arbitrator booth. The before-acquire-handler is per-node and is executed on the node which is supposed to acquire the ticket.

4. Debugging information was already provided.

There is a real problem with before-acquire-handler, documented in bug 1790009, which should be handled. So I am closing this bug as a dup of bug 1790009 (and, if it would be possible, also bug 1790009).

*** This bug has been marked as a duplicate of bug 1790009 ***

Jan, there was no "misunderstanding". Re your points:

1) I know the script set-san.sh works. It is (or was) booth that does not work, not the set-san.sh script. Yes, on my first few tries the script could not write to its log. I fixed that ages ago, but booth still did not implement the before-acquire-handler correctly.

2) My first reaction to this point is... maybe pcs *SHOULD* restart booth, if it would actually fix this bug, which is what people are trying to achieve when they take the time and effort to lodge bug reports like this one. It has been some time since I did any work on the Red Hat HA services, because they were unusable in the state they were in, so I've been waiting for these bugs to be fixed. But if I remember correctly, the booth service on cluster nodes runs under pcs as a pcs service. If pcs needs to restart it in order for things to work correctly, then pcs can and *should* restart it.

3) Are you suggesting this bug is fixed if we restart booth on the cluster nodes? I am pretty sure I would have tested this at some point... but I will test it again.

4) Yes, people have given me some help with where to look in the logs, so that is fine.

I will update the packages on my HA test cluster this weekend, then refresh (destroy and recreate) my cluster. I will then test this again, making sure I restart booth on each cluster node. If it is fixed, I will be greatly relieved, but if this bug is still present, I am going to be furious.

I cannot see bug 1771236.
Please grant me access to this bug immediately, so I can see if it is relevant. Thank you for looking at this issue. John.

I will also need access to bug 1790009, thank you.

Sorry, never mind re bug 1790009 - I *do* have access to that bug. But I do not have access to bug 1771236. Please grant me access to that one. Thanks.

Ah. Hang on. Sorry, I have refreshed my memory re these tickets, and, Jan, I think you are correct about this ticket being a bit of a misunderstanding, caused by set-san.log not being writable. Yep, I raised this ticket when my script was failing due to its inability to write its log. After I fixed that, I encountered other problems, which are being addressed in bug 1790009. So yep, it's fine to close this ticket, with my apologies. I will update & refresh my test cluster this weekend and see how things are progressing. Is it possible to grant me access to bug 1771236 so I can take a look at that and see if it is relevant to problems I've seen? Thanks again.

@John, the biggest problem you see is really the inability of booth to reload its config. pcs could restart booth, and that would work most of the time, but not when the election process is in progress. And this is the problem, and the reason why we need a bigger hammer there (bug 1771236). That is also the reason why I am not reassigning this BZ to pcs.

So basically, when (if) you change the ticket config, you must ensure that you:

1. sync the config to all nodes (this is what pcs is somehow able to do)
2. restart all booth daemon instances on all nodes (so all sites, and arbitrators)

If you just change the before-acquire-handler, it should be enough to restart booth on the node where the before-acquire-handler will be executed (but in practice this means restarting booth on all sites). Also, restarting booth on a site is not very well handled, because it is running as a pacemaker resource.
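The two-step procedure above might look like the following in practice. This is a sketch, not a verified recipe: `pcs booth sync` is the config-distribution command the comment alludes to, the arbitrator unit name `booth@booth` is taken from the earlier transcript, and `pcs booth restart` is assumed to restart the booth pacemaker resource on a site.

```shell
# Step 1: after editing booth.conf, sync it to all nodes
# (run once per cluster; the arbitrator may need `pcs booth pull`):
pcs booth sync

# Step 2: restart every booth daemon instance.
# On each arbitrator node (plain systemd instance):
systemctl restart booth@booth.service
# On each site, booth runs as a pacemaker resource:
pcs booth restart
```

Until reload support lands (bug 1771236), skipping any instance in step 2 leaves that node running with the old config.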
No matter what, I can agree booth really needs work to become friendly to "runtime" changes, because right now it has exactly no ability to change anything at runtime. That said, it sounds easy, but in reality it is usually a pretty painful process. We tried to do the same with corosync, and honestly, the first version which really has the ability to change most things at runtime is the current master (soon becoming 3.1.0). I've added you to the CC of bug 1771236, which should (if my memory doesn't fool me) give you access to the bug.
Description of problem:

The Booth ticket manager is a problem. The documentation is appalling. The developer has had an open ticket in GitHub against RHEL 7 since 2015. Nevertheless, I have managed to get the basics working, until I try to use the "before-acquire-handler" feature. It just *does not work*.

Version-Release number of selected component (if applicable): 1.0.8

How reproducible:

Add something like this to your configuration for a ticket:

before-acquire-handler = /usr/local/bin/set-san.sh

/usr/local/bin/set-san.sh is just a simple test script at this point. When I run it on the command line, it returns successfully. I can run it under sudo as the hacluster user, and yes, it still runs successfully. But according to booth, it fails.

#!/bin/sh
MYDATE=`date`
MYHOSTNAME=`hostname -s`
echo "${MYDATE} - Enable SAN for replication from ${MYHOSTNAME}" >> /var/log/booth/set-san.log

Steps to Reproduce:
1. Do anything with booth
2. Wonder where the logs are
3. Get booth working, then try to use the before-acquire-handler. It *cannot* be made to work.

Actual results:

Booth appears to not even run the script I've specified. It just says:

[root@blah ]# pcs booth ticket grant BloopTikkit ${SOME_SITE}
Error: unable to grant booth ticket 'BloopTikkit' for site 'someip', reason: Oct 14 13:59:44 somesite.com booth: [21713]: error: before-acquire-handler for ticket "BloopTikkit" failed, grant denied

The result is the same even if I move the script out of the way, so that it cannot be run. I have tried everything, and nothing works, so I can only conclude that booth is not even attempting to run the script.

Expected results:

Booth attempts to run the specified handler.

Additional info:

Additional info? Where might I obtain additional info? Is there a booth logfile I can look at? Can I increase the debugging output from booth? Yes? No? How would I know, when it's not documented? Is this really how Red Hat expects customers to manage geographically distributed clusters? Seriously?
I've googled for clues and assistance as to what is going wrong, and there is nothing. Does anyone use this product?
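For reference, the handler script quoted in the description above can be hardened so a logging failure is reported instead of surfacing only as a denied grant. The exit-code contract assumed here (booth treating a non-zero handler exit status as "deny the grant") is an inference from the "failed, grant denied" error message, not a documented guarantee. The logic is wrapped in a function so it can be exercised outside booth; the real script would simply end with `log_san_enable`.

```shell
#!/bin/sh
# Hypothetical hardened variant of the reporter's set-san.sh.
# Assumption: booth denies the grant when before-acquire-handler exits
# non-zero, so a silent redirection failure (e.g. unwritable log dir)
# shows up only as an opaque "grant denied".

log_san_enable() {
    # LOGFILE can be overridden for testing; default is the reported path.
    logfile="${LOGFILE:-/var/log/booth/set-san.log}"
    mydate=$(date)
    myhostname=$(hostname -s 2>/dev/null || uname -n)
    # Fail loudly if the log file cannot be written, so the failure
    # reason reaches the handler's stderr instead of vanishing.
    if ! printf '%s - Enable SAN for replication from %s\n' \
            "$mydate" "$myhostname" >> "$logfile" 2>/dev/null; then
        echo "set-san.sh: cannot write to $logfile" >&2
        return 1
    fi
    return 0
}
```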