Description of problem:
The pacemaker package was built without the stonithd flag, which means it cannot support 'legacy' plugins. Although these plugins have been labelled 'legacy', many are still widely relied upon, and this build choice prevents their use.

How reproducible:
Anyone who relies on fencing agents that haven't been migrated to the new fence API will be unable to upgrade to RHEL/CentOS/Fedora versions where pacemaker was packaged without --with stonithd.

Steps to Reproduce:
1. Install pacemaker
2. Install rcd_serial or any other plugin (such as those previously provided by cluster-glue) that suits your HA cluster and hasn't been migrated to the new fence API (there are many).

Actual results:
Cannot use the rcd_serial fence agent, as pacemaker was compiled without legacy plugin support.

Expected results:
Pacemaker recognises the plugins, and they can be listed by running stonith -L.

Additional info:
I have detailed more information about pacemaker and legacy plugins here, although this bug report is only to ensure that the package is built --with stonithd: https://gist.github.com/sammcj/9a8be565b29032bc2a9e
Not going to happen, basically. We made a conscious decision to drop them 5 or so years ago, and that isn't going to change. Your best bet is to file bugs against the fence-agents package for the specific agents you need created.
The existing implementation is at: http://hg.linux-ha.org/glue/file/9da0680bc9c0/lib/plugins/stonith/rcd_serial.c and includes some documentation as to the function and purpose of this agent, as does: http://www.scl.co.uk/rcd_serial/README.rcd_serial
Perhaps Sam can provide some information as to why this agent is preferred over other kinds. This is the first time I've heard of it.
Hi Andrew,

Sorry for my delayed response.

Reasons rcd_serial is a very good agent:
- It has no dependency on power state.
- It has no dependency on network state.
- It has no dependency on node operational state.
- It has no dependency on external hardware.
- It costs less than $5 to build.
- It is incredibly simple and reliable.

Essentially, the most common STONITH agent type in use is probably the kind that controls UPSs / PDUs. While this sounds like a good idea in theory, there are a number of issues with relying on a UPS / PDU:
- Units that have remote power control over individual outlets are very expensive, and if an upgrade is undertaken a rack-wide outage may be required depending on the existing infrastructure.
- Often these units are managed via the network, requiring the network and all that that entails to be functioning as expected. It may also require an additional NIC that may or may not fit into your storage units.
- There are almost always two PDUs / UPSs to manage. Until very recently the PDU STONITH agents only supported sending an action to a single unit. While modern packages now support sending actions to two units, there are a number of situations that are complex to manage and predict, e.g. what if one unit responds and cuts the power and the other doesn't? Who's in charge? Do we fail over? That's a LOT of logic for a STONITH action.
- I've seen several PDUs fail. It's not pretty, and often the management interface is the first thing to go.
I just cobbled together a quick blog post with some diagrams and pictures explaining the use of rcd_serial: https://smcleod.net/rcd-stonith/
@Sam: The ideal solution is to port it to fence-agents (on github); I have no objection to adding it to upstream/fedora/... releases. If you can and are willing to write that fence agent, great. If not, can you give me access to that device so I can test it? imho it should take <1 hour to write it.
(In reply to Marek Grac from comment #5) Coding is not my strength - especially when it comes to C. Unfortunately I can't give you direct access to an environment where you can test the agent due to security constraints - however I would be more than happy to test it for you?
I spoke too soon, I might have something shortly thanks to the help of one of Infoxchange's lovely developers. Stay tuned.
We've just written a new (UNTESTED!) python agent for this - https://github.com/sammcj/fence_rcd_serial/blob/master/fence_rcd_serial.py I've been struggling to find a definition of how the agents must be structured / how they're called etc... does a template / MVP exist?
@Sam: Great, I can do the final polishing for you, that's not a problem.
MVP = fence_dummy -- https://github.com/ClusterLabs/fence-agents/tree/master/fence/agents/dummy
Very simple fence agent: fence_rsa -- https://github.com/ClusterLabs/fence-agents/blob/master/fence/agents/rsa/fence_rsa.py
I believe that I understand what's going on. I have just two small questions:
1) Is it possible to check the status of the computer?
2) Is it possible to monitor whether the 'fencing device' is working/attached/...?
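On the question of how agents are called: as general background, the cluster stack usually passes options to a fence agent as name=value lines on standard input, one per line, with the requested operation in an action key. The sketch below only illustrates that calling convention. parse_agent_options and the serial_port option name are hypothetical and are not taken from fence_dummy or the shared fencing library, which real agents in fence-agents use instead of hand-rolled parsing.

```python
import io


def parse_agent_options(stream):
    """Parse name=value lines, as a fence agent might receive them on stdin.

    Hypothetical helper for illustration only.
    """
    options = {}
    for raw in stream:
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # this sketch simply ignores malformed lines
        options[name.strip()] = value.strip()
    # Assumption: fall back to "reboot" when no action is supplied.
    options.setdefault("action", "reboot")
    return options


# Simulated stdin; "serial_port" is a made-up option name.
example = io.StringIO("# sent by the cluster\naction=monitor\nserial_port=/dev/ttyS0\n")
opts = parse_agent_options(example)
```

In a real agent the stream would be sys.stdin, and the agent would then dispatch on opts["action"].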
Thanks, that would be great! What's your GitHub username? I'll add you as a contributor. The Python code has been tested and operates the device exactly as expected, so it's only the agent wrapper that needs to be fixed up.
1 & 2. The device is so simple that the best way to check its status is really to check that you can open the TTY. I've included that check, but perhaps I'm not passing the result back in the correct way? It's so simple that monitoring anything more than whether the port can be opened would increase complexity and thus the potential for failure.
Let me know if there is anything I can do to help. Be aware that I am in the AEST timezone, however, so it's 7PM at present. Again, thank you so much for your assistance.
@Sam, you have a pull request (from marxsk) on github. You will have to use the latest fencing lib from fence-agents (master branch) because I had to patch it a bit.
Current status:
* monitoring is done via open/close of the serial port
* reboot is the only action that works; the problem is that it also appears to work on my laptop.
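A minimal sketch of what "monitoring via open/close" could look like, assuming a Unix-like system. This is not the code from the pull request; serial_port_available and monitor_exit_code are hypothetical names, and the 0 / non-zero exit convention is an assumption.

```python
import os


def serial_port_available(path):
    """Return True if the device node at `path` can be opened and closed."""
    # O_NONBLOCK avoids blocking on a line that waits for carrier;
    # O_NOCTTY stops the port from becoming our controlling terminal.
    flags = os.O_RDWR | getattr(os, "O_NOCTTY", 0) | getattr(os, "O_NONBLOCK", 0)
    try:
        fd = os.open(path, flags)
    except OSError:
        return False
    os.close(fd)
    return True


def monitor_exit_code(path):
    # Assumption: 0 signals a healthy device, non-zero signals failure.
    return 0 if serial_port_available(path) else 1
```

A check like this only proves that the port exists and can be opened, not that the circuit is attached to it, which is consistent with the agent appearing to work on a laptop.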
(In reply to Marek Grac from comment #11)
Thank you, merged the PR and added you as a contributor. I've only just woken up, so I'll test when I get to work in an hour or so.
Any idea how the reboot action could be measured as successful? I guess you could add one test that pings the other node: if it can be pinged before the reboot action is sent and can't be afterwards, you could call the action successful. Obviously that measurement would only work in cases where the other node was still pingable, but it could be better than nothing. I'm VERY hesitant to change the circuit design, as its beauty is in its simplicity.
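The ping idea above could be sketched roughly as follows. This is only an illustration of the before/after comparison, not part of the agent: verify_reboot, ping_once, and the three result strings are hypothetical, and the ping flags assume the Linux iputils ping.

```python
import subprocess
import time


def ping_once(host):
    """Return True if a single ICMP echo to `host` succeeds (Linux ping flags assumed)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return False  # ping binary missing or not executable
    return result.returncode == 0


def verify_reboot(send_reboot, is_reachable, attempts=5, delay=1.0):
    """Classify a reboot attempt as 'confirmed', 'failed' or 'unverifiable'."""
    if not is_reachable():
        # The node was already silent, so a ping cannot tell us anything.
        send_reboot()
        return "unverifiable"
    send_reboot()
    for _ in range(attempts):
        if not is_reachable():
            return "confirmed"  # node answered before, stopped answering after
        time.sleep(delay)
    return "failed"
```

Usage would look something like verify_reboot(trigger, lambda: ping_once("node2")), where trigger is whatever function pulses the serial line. A node that reboots and comes back faster than the polling interval could be misreported as 'failed', so the result is a hint rather than proof.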
Tested, working as expected. Thank you so much for your help, @Marek. Where do we go from here?
PR to ClusterLabs created: https://github.com/ClusterLabs/fence-agents/pull/10/files
@Sam: great, I have accepted it upstream.
1) Why do you want fence_rcd_serial_check, as it is currently only a copy of the fence agent itself?
2) Can you please extend 'longdesc' a bit, so others understand what's going on?
-- I will put it in as a new subpackage in Fedora when I prepare the next version.
(In reply to Marek Grac from comment #15)
> 1) Why do you want fence_rcd_serial_check as it is currently only copy of
> fence agent itself?
> 2) Can you please extend 'longdesc' a bit, so others understands what's
> going on?
I see what you're saying, my mistake on including that. I'm adding a better longdesc and a circuit diagram for the device, and I'll then submit a PR. ETA today or early tomorrow.
PR created: https://github.com/ClusterLabs/fence-agents/pull/11 Added circuit diagram for bonus points.
Thanks Sam. I will release a new upstream (+fedora) version next week and it will be included.