Bug 1240868 - Need an equivalent fence agent for rcd_serial
Summary: Need an equivalent fence agent for rcd_serial
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: fence-agents   
Version: 22
Hardware: All
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Marek Grac
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
 
Reported: 2015-07-08 00:28 UTC by Sam McLeod
Modified: 2015-08-12 12:28 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-08-12 12:28:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---



Description Sam McLeod 2015-07-08 00:28:49 UTC
Description of problem:

The pacemaker package was built without the stonithd flag which means it cannot support the use of 'legacy' plugins.

While these plugins have been labelled as 'legacy' many are still very widely relied upon and this stops the ability to use them.

How reproducible:

Anyone who relies upon fencing agents that haven't been migrated to the new fence API will not be able to upgrade to RHEL/CentOS/Fedora versions where pacemaker was packaged without --with stonithd.

Steps to Reproduce:

1. Install pacemaker
2. Install rcd_serial, or any other plugin that suits your HA cluster (such as those previously provided by cluster-glue) and hasn't been migrated to the new fence API (there are many).

Actual results:

Cannot use the rcd_serial fence agent, as pacemaker was compiled without legacy plugin support.

Expected results:

Pacemaker recognises plugins and these can be listed by running stonith -L

Additional info:

I have detailed more information about pacemaker and legacy plugins here although this bug report is only to ensure that the package is built --with stonithd: https://gist.github.com/sammcj/9a8be565b29032bc2a9e

Comment 1 Andrew Beekhof 2015-07-15 07:15:27 UTC
Not going to happen basically.  
We made a conscious decision to drop them 5 or so years ago, that isn't going to change.

Your best bet is to file bugs against the fence-agents package for the specific agents you need created.

Comment 2 Andrew Beekhof 2015-07-15 23:15:38 UTC
The existing implementation is at:
   http://hg.linux-ha.org/glue/file/9da0680bc9c0/lib/plugins/stonith/rcd_serial.c

and includes some documentation as to the function and purpose of this agent, as does:
   http://www.scl.co.uk/rcd_serial/README.rcd_serial

Perhaps Sam can provide some information as to why this agent is preferred over other kinds. This is the first time I've heard of it.

Comment 3 Sam McLeod 2015-07-20 23:58:27 UTC
Hi Andrew,

Sorry for my delayed response.

Reasons rcd_serial is a very good agent:

- It has no dependency on power state.
- It has no dependency on network state.
- It has no dependency on node operational state.
- It has no dependency on external hardware.
- It costs less than $5 to build.
- It is incredibly simple and reliable.

Essentially the most common STONITH agent type in use is probably those that control UPS / PDUs.

While this sounds like a good idea in theory there are a number of issues with relying on a UPS / PDU:

- Units that have remote power control over individual outlets are very expensive, and if an upgrade is undertaken a rack-wide outage may be required, depending on the existing infrastructure.

- Often these units are managed via the network, requiring the network and all that that entails to be functioning as expected. It also may require an additional NIC that may or may not fit into your storage units.

- There are almost always two PDUs / UPSs to manage. Until very recently the PDU STONITH agents only supported sending an action to a single unit. While modern packages now support sending actions to two units, there are a number of situations that are complex to manage and predict, e.g. what if one unit responds and cuts the power and the other doesn't? Who's in charge? Do we fail over? That's a LOT of logic for a STONITH action.

- I've seen several PDUs fail, it's not pretty and often the management interface is the first thing to go.

Comment 4 Sam McLeod 2015-07-21 00:44:20 UTC
I just cobbled together a quick blog post with some diagrams and pictures explaining the use of rcd_serial: https://smcleod.net/rcd-stonith/

Comment 5 Marek Grac 2015-07-21 11:20:50 UTC
@Sam:

The ideal solution is to port it to fence-agents (on github); I have no objection to adding it to upstream/fedora/... releases.

If you can and are willing to write that fence agent, great. If not, can you give me access to that device so that I can test it?

IMHO it should take <1 hour to write it.

Comment 6 Sam McLeod 2015-07-22 01:10:42 UTC
(In reply to Marek Grac from comment #5)

Coding is not my strength - especially when it comes to C.

Unfortunately I can't give you direct access to an environment where you can test the agent due to security constraints. However, I would be more than happy to test it for you.

Comment 7 Sam McLeod 2015-07-22 03:17:05 UTC
I spoke too soon, I might have something shortly thanks to the help of one of Infoxchange's lovely developers.

Stay tuned.

Comment 8 Sam McLeod 2015-07-22 05:39:27 UTC
We've just written a new (UNTESTED!) python agent for this - https://github.com/sammcj/fence_rcd_serial/blob/master/fence_rcd_serial.py

I've been struggling to find a definition of how the agents must be structured / how they're called etc... does a template / MVP exist?

Comment 9 Marek Grac 2015-07-22 08:53:33 UTC
@Sam:

Great, I can do a final polishing for you - that's not a problem. 

MVP = fence_dummy -- https://github.com/ClusterLabs/fence-agents/tree/master/fence/agents/dummy

Very simple fence agent: fence_rsa -- https://github.com/ClusterLabs/fence-agents/blob/master/fence/agents/rsa/fence_rsa.py
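
For orientation on how an agent is called, here is a minimal sketch, not taken from fence_dummy or fence_rsa. It assumes the usual fence-agent convention that the cluster passes options on standard input as name=value lines, with blank lines and '#' comments ignored. The helper names, the "port" option used in the test data, and the "reboot" default are illustrative assumptions; the real agents delegate all of this to the shared fencing library.

```python
import sys


def parse_agent_options(stream):
    """Parse name=value lines from `stream` into a dict.

    Blank lines, '#' comment lines, and lines without '=' are skipped.
    """
    options = {}
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a name=value pair; ignored in this sketch
        options[name.strip()] = value.strip()
    return options


def requested_action(options):
    """Return the requested action; 'reboot' as the default is an assumption."""
    return options.get("action", "reboot").lower()
```

An agent built this way would call `parse_agent_options(sys.stdin)` at startup and branch on `requested_action(...)`.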

I believe that I understand what's going on. I have just two small questions:

1) Is it possible to check the status of the computer?
2) Is it possible to monitor whether the 'fencing device' is working/attached/... ?

Comment 10 Sam McLeod 2015-07-22 09:07:09 UTC
Thanks that would be great!

- What's your GitHub username? I'll add you as a contributor.

The Python code has been tested and operates the device exactly as expected so it's only the agent wrapper that needs to be fixed up.

1 & 2. The device is so simple that really the best way to check its status is to check that you can open the TTY. I've included that check, but perhaps I'm not passing the result back in the correct way?

It's really so simple that monitoring anything more than whether the port can be opened would increase complexity and thus the potential for failure.

Let me know if there is anything I can do to help. Be aware that I am in the AEST timezone, however, so it's 7PM at present.

Again, thank you so much for your assistance.

Comment 11 Marek Grac 2015-07-22 14:54:26 UTC
@Sam,

you have a pull request (from marxsk) on github. You will have to use the latest fencing lib from fence-agents (master branch) because I had to patch it a bit.

Current status:
* monitoring is done via open/close of the serial port
* reboot is the only action that works; the problem is that it also appears to work on my laptop.
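
A minimal sketch of the open/close monitoring idea, assuming a POSIX system; the function name is hypothetical and this is not the code from the pull request. It only shows that the port can be opened. It cannot tell whether the device is actually attached, which is the limitation noted above.

```python
import os


def serial_port_opens(path):
    """Return True if `path` can be opened and closed again, else False."""
    try:
        # O_NOCTTY: do not let the tty become our controlling terminal.
        # O_NONBLOCK: do not block waiting for carrier detect on a serial line.
        fd = os.open(path, os.O_RDWR | os.O_NOCTTY | os.O_NONBLOCK)
    except OSError:
        return False
    os.close(fd)
    return True
```

A monitor action could map `True` to success and `False` to failure, for example with `serial_port_opens("/dev/ttyS0")`, where the device path is only an example.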

Comment 12 Sam McLeod 2015-07-22 20:33:44 UTC
(In reply to Marek Grac from comment #11)

Thank you, merged the PR and added you as a contributor. I've only just woken up so I'll test when I get to work in an hour or so.

- Any idea how the reboot action could be measured as successful? I guess you could add a test that pings the other node: if it can be pinged before the reboot action is sent and can't be afterwards, you could call it successful. Obviously that measurement would only work in cases where the other node was still pingable, but it could be better than nothing? I'm VERY hesitant to change the circuit design, as its beauty is in its simplicity.
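
The ping idea could be sketched roughly as follows. All names here are hypothetical, `send_reboot` stands in for whatever triggers the device, and the `ping -c 1 -W <seconds>` flags assume Linux iputils. If the node was already unreachable beforehand, the result is reported as inconclusive rather than guessed.

```python
import subprocess


def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo to `host` gets a reply (Linux ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def classify_reboot(reachable_before, reachable_after):
    """Interpret the two ping results taken around the reboot action."""
    if not reachable_before:
        return "inconclusive"  # node was already unreachable; ping proves nothing
    return "failed" if reachable_after else "succeeded"


def fence_and_verify(host, send_reboot, ping=ping_once):
    """Ping, trigger the reboot, ping again, and classify the outcome."""
    before = ping(host)
    send_reboot()
    after = ping(host)
    return classify_reboot(before, after)
```

The second ping has to happen before the node finishes booting again, otherwise a successful reset would be misread as a failure.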

Comment 13 Sam McLeod 2015-07-26 22:20:29 UTC
Tested, working as expected.
Thank you so much for your help @Marek,
Where do we go from here?

Comment 14 Sam McLeod 2015-07-26 22:31:01 UTC
PR to ClusterLabs created: https://github.com/ClusterLabs/fence-agents/pull/10/files

Comment 15 Marek Grac 2015-07-27 09:54:25 UTC
@Sam:

great, I have accepted it upstream.

1) Why do you want fence_rcd_serial_check, given that it is currently only a copy of the fence agent itself?

2) Can you please extend 'longdesc' a bit, so others understand what's going on?

--
I will add it to Fedora as a new subpackage when I prepare the next version.

Comment 16 Sam McLeod 2015-07-27 23:22:46 UTC
(In reply to Marek Grac from comment #15)

> 1) Why do you want fence_rcd_serial_check as it is currently only copy of
> fence agent itself?

> 2) Can you please extend 'longdesc' a bit, so others understands what's
> going on?


I see what you're saying, my mistake on including that.

I'm adding a better longdesc and a circuit diagram for the device, and I'll then submit a PR. ETA today or early tomorrow.

Comment 17 Sam McLeod 2015-07-28 00:41:25 UTC
PR created: https://github.com/ClusterLabs/fence-agents/pull/11

Added circuit diagram for bonus points.

Comment 18 Marek Grac 2015-08-05 06:30:52 UTC
Thanks Sam. I will release a new upstream (+fedora) version next week and it will be included.

