Description of problem:

The rps10.so library always gives me a segmentation fault on execution, even when I only try to query it. Running "clustonith -v -l" gives the following output (debugging turned on):

-----------------------------------------------------------------------------
[root@eval13 stonithlib]# clustonith -v -l
Determining Switch Type...st_new entered.
st_new returning.
st_new entered.
st_new returning.
st_status entered (WTI_RPS10)
st_status calling RPSConnect (WTI_RPS10)
RPSConnect entered.
Calling dtrtoggle (WTI_RPS10)
dtrtoggle Complete (WTI_RPS10)
RPSConnect_Ready: Waiting for Ready
RPSConnect_Complete: sending status command
Sending *?
Segmentation fault

After adding some more debugging statements to the sources (printf calls only), I can see that the problem occurs when executing the FD_SET calls.

-----------------------------------------------------------------------------
static int RPSSendCommand (struct WTI_RPS10 *ctx, int outlet, char command, int timeout)
{
    char writebuf[10]; /* all commands are 9 chars long! */
    int return_val; /* system call result */
    fd_set wfds, xfds; /* list of FDs for select() */
    struct timeval tv; /* */

    FD_ZERO(&wfds);
    FD_ZERO(&xfds);

    if (outlet == 10) { /* Send to ALL outlets */
        snprintf (writebuf, sizeof(writebuf), "%s*%c\r", WTIpassword, command);
    } else {
        snprintf (writebuf, sizeof(writebuf), "%s%d%c\r", WTIpassword, outlet, command);
    }

    if (gbl_debug) printf ("Sending %s\n", writebuf);

    /* Make sure the serial port won't block on us. use select() */
    FD_SET(ctx->fd, &wfds);
    if (gbl_debug) printf ("FD_SET on wfds done");
    FD_SET(ctx->fd, &xfds);
    if (gbl_debug) printf ("FD_SET on xfds done");

    tv.tv_sec = timeout;
    tv.tv_usec = 0;

    return_val = select(ctx->fd+1, NULL, &wfds,&xfds, &tv);
    if (gbl_debug) printf ("select done");
-----------------------------------------------------------------------------

Commenting out the FD_SET calls gives the following output:

[root@eval13 stonithlib]# clustonith -v -l
Determining Switch Type...st_new entered.
st_new returning.
st_new entered.
st_new returning.
st_status entered (WTI_RPS10)
st_status calling RPSConnect (WTI_RPS10)
RPSConnect entered.
Calling dtrtoggle (WTI_RPS10)
dtrtoggle Complete (WTI_RPS10)
RPSConnect_Ready: Waiting for Ready
RPSConnect_Complete: sending status command
Sending *?
FD_SET on wfds done FD_SET on xfds done select done FAILED
Unable to determine power switch type.
Unable to determine default power switch type.
-----------------------------------------------------------------------------

And I get this in syslog:

Aug 15 11:19:02 eval13 clustonith: Did not find string: 'RPS-10 Ready' fromWTI RPS10 Power Switch.
Aug 15 11:19:12 eval13 clustonith: WTI_RPS10: Timeout writing to /dev/ttyS0
Aug 15 11:19:12 eval13 clustonith[2239]: <err> clu_stonith_check: stonith device with IPaddr eval7 has bad status
Aug 15 11:19:12 eval13 clustonith[2239]: <err> clu_stonith_init: failed clu_stonith_check().
Aug 15 11:19:12 eval13 clustonith[2239]: <err> clu_stonith_type: failed init.
-----------------------------------------------------------------------------

But accessing the rps10 through /dev/ttyS0 via minicom works fine.

ANY IDEA?

Best regards, Alex.

***************************************************
Alexander Landgraf
Senior System Engineer
Advanced UniByte GmbH
Birnenweg 15
72766 Reutlingen
Voice: +49 7121/483-281
Fax: +49 7121/483-289
email: alexander.landgraf
WWW: http://www.advanced-unibyte.de
***************************************************

Version-Release number of selected component (if applicable): clumanager-1.0.19-2

How reproducible: always

Steps to Reproduce:
1. just run clustonith -v -l with WTI serial power switches

Actual results:

Expected results:

Additional info:
I bet that it's the same behavior as: http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=92460 (this one was in cluquorumd though) There was a nasty bug in RPSConnectComplete where the bounds weren't being checked on the file descriptor, causing a segmentation fault. Try the following RPM. Note that it's not Red Hat errata, but it may solve the problem. Let me know if it does or not. Additionally, remember that the RPS-10 doesn't have the notion of a host list the way that network power switches do.
Created attachment 94197 [details] Patch to rps10.c to fix segmentation fault
syslog(LOG_ERR, "%s: device %s is not open!", WTIid, ctx->device); ... and what happened, and what will happen, if "the device is not open"? Alex.
These are unofficial testing RPMs and so are unsupportable by Red Hat Support: http://people.redhat.com/lhh/.testing/clumanager-1.0.23-0.5.i386.rpm http://people.redhat.com/lhh/.testing/clumanager-1.0.23-0.5.src.rpm
Well. And do you think that patching rps10.c in clumanager-1.0.19-xx solves the problem that the cluster service doesn't start up if the RPS-10 switches are configured in cluster.conf? In other words: does that fix also make the cluster services themselves behave well once the rps10.so lib is patched that way? Does it recognize the switches and is it able to power-cycle the nodes? :) Alex.
... and I can't use clumanager 1.0.23-0.5 as long as it is unsupported. Does the 1.0.23-xx rps10 module work in 1.0.19? Alex.
Segfault should be fixed; look in to other funny behavior ;)
The segfault should be fixed, but will it work with the old clumanager? And which other funny behavior is already known?
The patch I sent fixes a segfault which occurs when the STONITH initialization fails. Generally, this is caused by the STONITH module's inability to have exclusive access to the device or the inability to set the proper mode. These can be caused by one or more of the following:

- Pointing at the wrong /dev/ttySx in the cluster configuration
- Kernel has serial console enabled on the same device connected to the RPS-10
- minicom, getty, agetty, or other program has the serial port open for some reason
- Port not connected.

The code is fairly well-tested; this might be a configuration problem. Here's the output I get from 1.0.19-2 with everything configured properly:

[root@wind utils]# clustonith -l -v -v -v -v
Checking cluster state...Not Active
Reading cluster config file
Determining local node id...1
Determining Switch Type...WTI RPS10 Power Switch (#1)
STONITH stonith_new("rps10")...0x8083ce8
wind
STONITH destroy(0x8083ce8)

If I unplug the RPS10 module or have another application open which has access to the port, I get:

[root@wind utils]# clustonith -l -v -v -v
Checking cluster state...Not Active
Reading cluster config file
Determining local node id...1
Determining Switch Type...Segmentation fault

This segmentation fault is in RPSSendCommand, at the first FD_SET on line 327 (of the unmodified rps10.c from 1.0.19), and is consistent with what you are seeing.

FYI, the other funny behavior fixed in 1.0.23 is the fact that clustonith reports some operations for the "wrong" host (note that it reported for "wind"). This isn't really a bug, but it throws some people off.
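To illustrate why an unopened device leads to a crash at FD_SET, here is a minimal sketch of a guarded wait. It is not the attached patch: fd_is_selectable and wait_writable are hypothetical names, and the only assumption taken from this report is that the descriptor is -1 when initialization fails. FD_SET writes into a fixed-size bit array, so a negative descriptor (or one at or above FD_SETSIZE) is out of bounds.

```c
#include <stdio.h>
#include <sys/select.h>

/*
 * Hypothetical helper (not from rps10.c): returns 1 if fd may safely be
 * passed to FD_SET(), 0 otherwise. A negative fd or one >= FD_SETSIZE
 * would index outside the fd_set bit array.
 */
static int fd_is_selectable(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/*
 * Sketch of a guarded "wait until writable". Returns -1 without touching
 * the fd_set when the descriptor is unusable; otherwise returns whatever
 * select() returns (0 on timeout, >0 when ready, -1 on error).
 */
static int wait_writable(int fd, int timeout_sec)
{
    fd_set wfds, xfds;
    struct timeval tv;

    if (!fd_is_selectable(fd)) {
        fprintf(stderr, "device is not open (fd=%d)\n", fd);
        return -1;
    }

    FD_ZERO(&wfds);
    FD_ZERO(&xfds);
    FD_SET(fd, &wfds);
    FD_SET(fd, &xfds);

    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    return select(fd + 1, NULL, &wfds, &xfds, &tv);
}
```

With a check like this the caller gets an error return instead of a segfault, but the underlying reason the descriptor is invalid still has to be found separately.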
Well. But I used minicom to verify that the correct port is used, closed minicom, and only then ran clustonith. I'm sure I had no configuration error. All of these were properly checked :( :

- Pointing at the wrong /dev/ttySx in the cluster configuration -> NO
- Kernel has serial console enabled on the same device connected to the RPS-10 -> NO
- minicom, getty, agetty, or other program has the serial port open for some reason -> NO (lsof)
- Port not connected -> NOT EVEN

Should the segfault occur anyway? To me the error seems to occur in the FD_SET call, not in the send. Is that possible? Alex.
With the above fix, the segmentation fault doesn't occur (ie, the FD_SET never happens, so the segfault doesn't occur). I can not, however, reproduce your specific behavior with everything in place as I know it. As a last-ditch effort, ensure that the DIP switches on your RPS10 are configured as follows: 1: Down 2: Down 3: Up 4: Down The STONITH module _only_ runs at 9600bps. Failing that, can you file an incident with Red Hat Support for me? https://enterprise.redhat.com/issue-tracker/
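For reference, fixing a serial line at 9600 bps with termios looks roughly like the sketch below. Only the 9600 bps rate comes from this thread; the 8N1 framing and the helper name set_9600_8n1 are assumptions for illustration, and this is not the code in rps10.c.

```c
#include <string.h>
#include <termios.h>

/*
 * Hypothetical helper: configure *tio for 9600 bps.
 * Assumption (not confirmed in this thread): 8 data bits, no parity,
 * 1 stop bit. CLOCAL ignores modem control lines such as carrier detect.
 * Returns 0 on success, -1 if the speed could not be set.
 */
static int set_9600_8n1(struct termios *tio)
{
    tio->c_cflag &= ~(tcflag_t)(CSIZE | PARENB | CSTOPB);
    tio->c_cflag |= CS8 | CREAD | CLOCAL;

    if (cfsetispeed(tio, B9600) != 0)
        return -1;
    if (cfsetospeed(tio, B9600) != 0)
        return -1;
    return 0;
}
```

In real use the structure would be filled by tcgetattr() on the open port first and applied with tcsetattr() afterwards; the helper only edits the settings in memory.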
The DIP switches are set up correctly. I'm also communicating at 9600 baud under minicom. What do you mean by "Failing that, can you file an incident with Red Hat Support for me?"? By "Failing that", do you mean that if that is also set up correctly I should fill out the form? (Sorry, I'm German.) Alex.
Yes, please file an incident with RH Support so this can be properly tracked. They're better equipped to isolate the problem and how to reproduce it. (For instance, we both have proper setups, but yours doesn't work and mine does.) http://enterprise.redhat.com/issue-tracker Please include in the incident: - output from 'sysreport' - contents /etc/cluster.conf from both members (they should be the same, but just in case...)
okay ... I got the cluster.confs and the sysreports, but I can't get into http://enterprise.redhat.com/issue-tracker. What do I have to do to get access? Am I even able to? Best regards, Alex.
I really don't think the segfault is what's causing the power switch problems. Here's why:

As you know, the segmentation fault occurs when ctx->fd == -1. Indexing an array at index -1 is just *asking* for trouble. There are only a few ways the file descriptor could be -1:

- The serial port couldn't be opened.
- The serial port attributes could not be set correctly.
- Data pending on the serial port could not be flushed.
- We did not receive expected data from the RPS10.

If any of those happened, a corresponding message would have appeared in the system log:

- "WTI_RPS10: Can't open /dev/ttyS1 : <reason>"
- "WTI_RPS10: Can't set attributes /dev/ttyS1 : <reason>"
- "WTI_RPS10: Can't flush /dev/ttyS1 : <reason>"
- "Did not find string 'RPS-10 Ready' from RPS10..."

So anyway, the first three steps succeeded, based on the Bugzilla data and looking at the code:

RPSConnect entered.
Calling dtrtoggle (WTI_RPS10) <-- port open!
dtrtoggle Complete (WTI_RPS10) <-- DTR toggle success
RPSConnect_Ready: Waiting for Ready
RPSConnect_Complete: sending status command

It should have said "Got Ready", followed by "Got NL", and should never have proceeded into RPSConnect_Complete(). The fact that it gets a segmentation fault in RPSConnect_Complete is orthogonal to the fact that it missed the "RPS-10 Ready" string. Because the DTR toggle succeeded, it looks like you have a valid configuration (looks can be deceiving, though!).

So, what we need to do is devise a plan for determining whether there's some sort of difference in the output of the WTI RPS-10 (I doubt it, but I suppose it is the logical next step). You'll need a little script-able serial dumb-terminal program (minicom won't work for this). You can get it from here:

http://people.redhat.com/lhh/ser-1.0.2-1.src.rpm
http://people.redhat.com/lhh/ser-1.0.2-1rhel2.1.i386.rpm

- Plug the RPS-10 into the wall (for power..).
- Plug the serial port into one of your machines (for this example, I will use "ttyS0" as the serial port), but don't plug any machines into its power port.
- Run: script foo.txt
- Run: ser /dev/ttyS0 9600
- ser will complain that there's no carrier detect - that's normal, the RPS-10 doesn't use DCD.
- Push "Ctrl-A Ctrl-Z". It should say "[HANGUP]". After about 10 seconds, it should say "RPS-10 Ready"
- Issue a few commands to the RPS-10 unit.
  - To issue a command, type the following while holding the "Ctrl" key: bxxbxx
  - Issue the "0T" command (type ^B^X^X^B^X^X0T<ENTER>)
  - Issue the "0?" command (type ^B^X^X^B^X^X0?<ENTER>)
- Press Ctrl-A Ctrl-X to quit ser.
- Type exit. script will now exit, and say: "Script done, file is foo.txt"
- Upload unaltered foo.txt into Bugzilla...
okay ... I'm going to install the package and do the tests ... but only tonight or tomorrow morning. All my machines in the lab are currently used for customer demos and installations. Best regards, Alex. PS: Why wouldn't that work with minicom, though? As I remember, I got the same results using minicom. I was able to switch ports off and on ... and I got an "RPS-10 Ready" after sending a hangup :)
minicom doesn't capture all the output (including control characters); using "ser" run from within "script" does. The idea is that it might be something really simple - like the way carriage-returns and line-feeds are generated/handled by the RPS-10 units you have - I'm just trying to cover all the bases :) Or it could be something even more simple - like the RPS10 driver isn't waiting long enough for the "RPS-10 Ready" message ... we'll cross that bridge when we come to it.
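To make the point about control characters concrete, here is a small sketch of how captured bytes could be rendered so that carriage returns, line feeds, and the Ctrl-B/Ctrl-X command prefix become visible. The helper name show_ctrl is hypothetical and is not part of "ser" or clumanager.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy src[0..n) into dst, making control characters
 * visible. CR becomes "\r", LF becomes "\n", other bytes below 0x20 use
 * caret notation (0x02 -> "^B"), and 0x7f becomes "^?". Output is
 * truncated to fit dstsz and is always NUL-terminated when dstsz > 0.
 * Returns the number of characters written, excluding the NUL.
 */
static size_t show_ctrl(const unsigned char *src, size_t n,
                        char *dst, size_t dstsz)
{
    size_t i, out = 0;

    if (dstsz == 0)
        return 0;

    for (i = 0; i < n; i++) {
        unsigned char c = src[i];
        char tmp[2];
        size_t len = 2;

        if (c == '\r') {
            tmp[0] = '\\'; tmp[1] = 'r';
        } else if (c == '\n') {
            tmp[0] = '\\'; tmp[1] = 'n';
        } else if (c < 0x20) {
            tmp[0] = '^'; tmp[1] = (char)(c + '@');
        } else if (c == 0x7f) {
            tmp[0] = '^'; tmp[1] = '?';
        } else {
            tmp[0] = (char)c; len = 1;
        }

        if (out + len >= dstsz)   /* keep room for the terminating NUL */
            break;
        memcpy(dst + out, tmp, len);
        out += len;
    }
    dst[out] = '\0';
    return out;
}
```

Run over a raw capture, a helper like this would show at a glance whether a unit ends its lines with CR, LF, or both.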
okay ... CR and LF. Makes sense. I'll let you know ... certainly tomorrow! Thanks, Alex.
Created attachment 95093 [details] ser output talking to a WTI RPS10 power switch for debugging purposes
... here's the output. But it doesn't look so bad :(. Best regards, Alex. PS: where are you located? USA? GB?
... after inserting the "if ( ... < 0 )"s into the code I get the following:

console:

[root@eval4 stonith]# clustonith -vS
Determining Switch Type...FAILED
Unable to determine power switch type.
Unable to determine default power switch type.

messages:

Oct 10 13:56:04 eval4 clustonith: Did not find string: 'RPS-10 Ready' fromWTI RPS10 Power Switch.
Oct 10 13:56:04 eval4 clustonith: WTI_RPS10: device /dev/ttyS0 is not open!
Oct 10 13:56:04 eval4 clustonith[3037]: <err> clu_stonith_check: stonith device with IPaddr eval8 has bad status
Oct 10 13:56:04 eval4 clustonith[3037]: <err> clu_stonith_init: failed clu_stonith_check().
Oct 10 13:56:04 eval4 clustonith[3037]: <err> clu_stonith_type: failed init.
Ah ha! Your unit is saying: "PRS-10 Ready", not "RPS-10 Ready". That would definitely break the expect-ish code we use to talk to the power controller... Very strange. I'll have a test-package pretty soon. Hold tight.
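As an illustration of the idea (not the attached patch, whose contents are not shown in this thread), an expect-style check could accept either spelling of the banner. The names ready_banners and is_ready_banner are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Banner spellings seen in this report: the expected one and the variant. */
static const char *const ready_banners[] = {
    "RPS-10 Ready",
    "PRS-10 Ready"
};

/*
 * Hypothetical helper: returns 1 if buf contains any accepted ready
 * banner, 0 otherwise. buf must be a NUL-terminated string.
 */
static int is_ready_banner(const char *buf)
{
    size_t i;

    for (i = 0; i < sizeof ready_banners / sizeof ready_banners[0]; i++) {
        if (strstr(buf, ready_banners[i]) != NULL)
            return 1;
    }
    return 0;
}
```

Keeping the accepted strings in one table means a further variant could be added without touching the matching logic.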
Created attachment 95100 [details] Patch to change expect-text to allow "PRS"

from clumanager-1.0.x/: patch -p0 < rps10-prs.patch
from clumanager-1.0.x/src/stonithlib/: patch -p2 < rps10-prs.patch
1.0.23 with the patch applied: http://people.redhat.com/lhh/clumanager-1.0.24-0.1.i386.rpm http://people.redhat.com/lhh/clumanager-1.0.24-0.1.src.rpm
... well. "PRS". That's something I really overlooked. Very, very strange. Might it be an issue caused by WTI? I will call them and ask how that may have happened. I'm really sorry, but I obviously never read the letters very carefully. I will report the results to you. Best regards and have a good weekend, Alex.
I wouldn't worry about the "PRS" vs "RPS" thing. I think it is more important whether or not the driver works, and whether you can then get your machines back online.
Well ...

[root@eval4 root]# clustonith -vr eval4
Determining Switch Type...WTI RPS10 Power Switch (#0)
Successfully power cycled host eval4.
[root@eval4 root]#

messages:

Oct 13 10:09:55 eval4 clustonith: Host eval4 being rebooted.

... and it really switches off and on :)! Looks much better now :).

But what I really don't understand is the name I have to put after the -r option. eval4 is the current host, but the power switch attached to it switches the other host, in this case eval8. Is there a good explanation of which name I have to use under which condition? Is cluquorumd doing the right thing when running, I mean switching the other node?

And how do I have to use the WTI NPS-230 and APC AP-9212 switches? The NPS-230 is fully redundant. So can I still use two of those boxes? Or do I have to use just a single one (one IP)? And what do I have to configure as IP or name in cluconfig for node4?

Choose one of the following power switches:
  o NONE
  o RPS10
  o BAYTECH
  o APCSERIAL
  o APCMASTER
  o WTI_NPS
  o SW_WATCHDOG
Power switch [RPS10]: WTI_NPS
Enter IP address or hostname used to access the power switch [eval4]: temp0
Looking for host temp0 (may take a few seconds)...
Warning: Host temp0 not responding
Keep your selection? [yes]: yes
Enter the name of the outlet that power cycles member 'eval4' [eval4]:
Enter the password for the power switch [10]:

... well. Do I have to put the IP and port here which switches eval4? And does that work correctly using two power switches? And also with APC switches?

Thanks very much so far :). Maybe you're able to give me some final input about power switches (questions above). Maybe there's also a technical white paper? Best regards, Alex.
First, I need to point out: Cluquorumd + clupowerd work _properly_ when network power switches are in use. Clustonith is difficult to operate properly when network power switches are in use.

This is because in the RHEL-2.1 version of clumanager, power switches are indexed by member #. Because of this, the "clustonith" utility, when it was originally written, was designed with "One Power Switch Per Member" - which is why when network-power switches are in use, it behaves erratically. It's basically backwards logic: Serial power switches are handled as though they're wired to the _opposite_ member; Network power controllers are handled as if wired to the current member.

With the 1.0.24-0.1 version, you have a workaround. There's a command line option, "-o" which specifies "Function using the 'other' member's assigned ID" - which causes it to function.

Here's when to use "clustonith -o -r <other_member>":
(1) When *two* network-based power switches are in use.
(2) When calling "clustonith" on member #1 when one network power switch is in use.

Here's when to use "clustonith -r <other_member>":
(1) Two serial power switches are in use
(2) Calling "clustonith" on member #0 when one network power switch is in use.

(the two "2" notes above may be backwards...)

FYI, this strange behavior is not present in the RHEL3 beta, and is a "won't-fix" for RHEL-2.1 (besides the already mentioned "-o" workaround).
Well. Now we have the problem that the APC AP9212s are no longer available. The successor switch is the AP7920. Were you already aware of that? For our last project we were no longer able to get AP9212s. But the AP7920 has a much better network module. The old one (AP9112) always hung up on network broadcasts (e.g. nmbd broadcasts). The newer one doesn't! Do you have any input for me? When do you plan to support the AP7920s? Best regards, Alex.
The 1.0.24-x.x RPM also has a driver for the AP9225 called "apcplus", but if you use the AP9225, you can't use it in Daisy-Chain mode. We don't have a 7920 for development.
... well. But since the AP9212 are no longer available in the market (you'll see soon) I'm sure that you will need to evaluate the AP7920s soon. Are there already any plans to do that? What about support in AS3.0? Another urgent thing I would need is the support for software RAIDs as block device in cluster services. Will there be any support for that in RH-AS-3.0? Or do you already have appropriate scripts to do that? Best regards, Alex.
We don't currently have a 7920 for development. You can try using the AP9225 "apcplus" driver. It may or may not work. In either case, we can't support it until we get a 7920 for development. I'll ask around and see what I can do.

Software-RAID is not supported for clustering in RHEL 2.1 or RHEL 3, neither are host bus adapters with on-board RAID (ex: the Adaptec AAC series RAID controllers). We recommend fibre-attached RAID arrays for best performance/reliability in HA clusters.
Well. Isn't the AP9225 a serially attached switch? Due to a distance of about 200m between the two nodes I need to use network-based switches.

--

Concerning the software RAID, I just do the following: I write my own /usr/share/cluster/services/svclib_raid1 script, which I hook into the /usr/share/cluster/services/service script to start the RAID before the "device start" function is called during startup and to stop the RAID after the "device stop" function is called during shutdown. The only problem I have is that if you provide a newer clumanager.rpm, my own scripts will be overwritten. Do you have any idea how to solve that problem?

I would suggest that you allow more than a single user script per cluster service. It should be possible to define whether the user script has to be run before a cluster service is started (prestart) or after a cluster service is stopped (poststop). The default is poststart and prestop, right? Best regards, Alex.
The 9225 does have serial ports, but Red Hat Cluster Manager only uses the network-based capabilities of it, provided by an AP9606 Web/SNMP management card. Unfortunately, for your purposes, the AP9225 may not be suitable for your environment - the AP9606 is the same card used in the AP9211 and AP9212 :( You can submit a feature request via a Bugzilla RFE for the APC AP7920 against "Red Hat Enterprise Linux Public Beta" / "taroon-rc" / "clumanager", but I can't make any promises as to when/if it will get done. WRT Software RAID: Integration w/ the cluster software is *not* the problem. Software RAID can not coordinate access to the RAID set across multiple machines. Thus, the risk for data corruption is non-trivial and *always* present when software RAID is used for clustered services. For more information: http://www.redhat.com/docs/manuals/enterprise/RHEL-AS-2.1-Manual/cluster-manager/ch-hardware.html#S2-HARDWARE-SHAREDSTOR
"The 9225 does have serial ports, but Red Hat Cluster Manager only uses the network-based capabilities of it, provided by an AP9606 Web/SNMP management card." Well ... we found out that the 9225+9606 is just the American 110V model's name for the European AP9212. And these are "end of life" at APC. I would be really glad if you could find out which APC switches Red Hat plans to support next, and in which version of clumanager. Best regards, Alex.
PER BUG 108148 ------ Additional Comments From lhh 2003-10-28 10:26 ------- CVS build of clumanager: http://people.redhat.com/lhh/.testing/clumanager-1.2.5-0.1.89.2.8.i386.rpm http://people.redhat.com/lhh/.testing/clumanager-1.2.5-0.1.89.2.8.src.rpm Fixes: #103721, #106465, #107274, #107276, #108148
This is not something that can be tested in-house, due to not having power switches which say "PRS-10 Ready", instead of "RPS-10 Ready". The fact that the submitter seems happy with the fix makes me willing to mark this closed.
Well. The "PRS-10" <-> "RPS-10" fix worked pretty well. Thanks again. So you may close the call. Best regards, Alex.