Created attachment 362560 [details]
src rpm, the cib (a pretty standard drbd/mysql setup), messages logs from both nodes and gdb traces from 3 crashes

Description of problem:

OpenAIS crashes intermittently, leaving a non-responsive process with the following symptoms:

[root@dbsba-scratch-ha2 crm]# crm resource status
Error signing on to the CIB service: connection failed

[root@dbsba-scratch-ha2 crm]# crm_verify -x cib.xml -VVV
crm_verify[25552]: 2009/09/22_19:58:40 info: main: =#=#=#=#= Getting XML =#=#=#=#=
crm_verify[25552]: 2009/09/22_19:58:40 notice: unpack_config: On loss of CCM Quorum: Ignore
crm_verify[25552]: 2009/09/22_19:58:40 info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0

[root@dbsba-scratch-ha2 crm]# openais-cfgtool -s
Printing ring status.
Could not initialize openais configuration API error 6

Version-Release number of selected component (if applicable):

I originally found this issue with the prebuilt set of packages on http://download.opensuse.org/repositories/server:/ha-clustering/CentOS_5/x86_64/ which was openais-0.80.5-15 running on CentOS 5.3 x86_64.

I discussed the issue with sdake on IRC, who suspected missing revisions 1998 or 1831 in the package could be the culprit, so I rebuilt the package using the latest whitetank SVN source (applying the same 2 patches from the above rpm src). I have included the src rpm in the attached file for reference. The issue still reproduces with this code, however.

How reproducible:

I have had trouble reproducing this, since it seems intermittent. I did notice a correlation, however: it most often happens when another openais node connects to the cluster, for example when rebooting or stopping/starting the openais process on another node.

Steps to Reproduce:
1. Start openais on one node
2. Reboot a node that shares a connection
3. Attempt to connect to the crm
4. If successful, go to 2, until you fail to connect

Actual results:
You fail to connect - the aisexec process is still listed in ps -A, but is non-responsive.

Expected results:
The node remains responsive even when other nodes join the cluster.

Additional info:
The attached tarball includes the src rpm, the cib (a pretty standard drbd/mysql setup), messages logs from both nodes and gdb traces from 3 crashes. If you want to correlate with the logs, the crash times were approximately Sep 23 23:03, 23:26 and Sep 23 19:53. The 3rd attempt is probably the "cleanest" reproduction - both nodes had come up from a reboot (with openais startup on boot disabled, so I could start via gdb); the second node was then rebooted 3 times - twice without causing a problem, the last time causing the crash.

Please let me know if there are any specific tests you would like me to do, or patches you would like me to test.
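For convenience, a quick way to spot the hung state described under "Actual results" is to re-run the same symptom commands shown above after each reboot of the peer; the commands below are just those commands collected together, and checking them in a loop is only a suggestion:

    # aisexec is still in the process list...
    ps -C aisexec -o pid,stat,cmd

    # ...but the cluster tooling no longer gets an answer
    crm_mon -1             # reports a connection failure when the node is hung
    openais-cfgtool -s     # reports "Could not initialize openais configuration API"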
Andrew,

Can you please take a look at this issue?

Regards
-steve
For what it's worth, I also tried this on a clean CentOS 5.3 install with the latest stable versions of openais, cluster-glue, resource-agents and pacemaker all installed from source, and can still reproduce this. I separately also updated the src rpms to build the latest versions, but had the same issues there. I'll submit the spec diffs to Andrew privately in case they are of use going forward.
It looks like the issue (likely) occurs on RHEL also:

<misch> Anybody using openais/pacemaker from the opensuse buildserver on RHEL5.3? I have a problem that openais does not start correctly after system reboot.
<misch> openais-cfgtool -s now shows no active rings.
<misch> pacemaker is not started by openais.
Detailed log of a failed automatic start and a successful manual start at:
http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg01745.html

Michael.
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle. Changing version to '12'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Use the ulimit command to set the core file size limit to unlimited:

ulimit -c unlimited

Install the debuginfo package.

Please submit a backtrace of the crash. To do this, run:

gdb /usr/sbin/openais /var/lib/openais/core.XXXX

where XXXX is the PID (ls should show it), then run the bt command. It will show the signal that triggered the problem (ABORT or SEGV) as well as the backtrace.

We have fixed several problems in corosync and whitetank (openais) which could be the culprit of your problems. I'd like to know if this is an unknown or known issue.
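For reference, a one-shot way to capture the same information to a file; this is only a sketch, assuming gdb's batch mode is available and using the core path named above, with XXXX still to be replaced by the actual PID:

    # dump the crashing thread's backtrace plus all threads, non-interactively
    gdb -batch -ex "bt" -ex "thread apply all bt" \
        /usr/sbin/openais /var/lib/openais/core.XXXX > backtrace.txt 2>&1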
Just to be clear - are the traces I originally attached to this ticket insufficient? I am happy to try this again, if that would be useful.
Owen,

My apologies, I had not seen the backtraces. I investigated the traces and they are fixed upstream in corosync but not yet backported.

Try the workaround of using "timestamp: off" in your configuration files. The root of the problem is that getenv/setenv/strftime/localtime are not thread safe, and pacemaker uses setenv (while strftime/localtime use getenv) at about the same time, resulting in segfaults.

We are investigating further issues surrounding the use of the getenv APIs inside corosync while pacemaker continues to use setenv. Not sure if those issues affect openais whitetank tip. Try the timestamp: off workaround for now.

Regards
-steve
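A sketch of how the workaround might look in the logging stanza of openais.conf/corosync.conf; only the timestamp line is the actual workaround, the other options shown are placeholders for whatever is already configured:

    logging {
            # workaround: avoid the non-thread-safe localtime/strftime path
            timestamp: off
            to_syslog: yes
            debug: off
    }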
Great - I will try this and report back. Thanks so much for looking into this!
So I tried this out, this time using CentOS 5.4 and the builds from http://www.clusterlabs.org/rpm/epel-5/x86_64/ (including the updates from yesterday): corosync-1.2.0-1.el5.x86_64.rpm with pacemaker-1.0.7-4.el5.x86_64.rpm.

I added "timestamp: off" to the logging section of corosync.conf and the problem still occasionally reproduced, in that I could still sometimes start up corosync and then fail to reconnect with crm (on the same node). I couldn't reproduce the original issue where restarting the second node would cause the first to crash.

I checked that the corosync I am using includes the fix from https://bugzilla.redhat.com/show_bug.cgi?id=544022 so I wonder if this is something else. Even when I turned off all logging and debug in corosync.conf I could still reproduce it - although it did "feel" less frequent with all logging disabled.

Catching a gdb backtrace was harder, but I did catch one, which is attached. This one occurred when to_syslog was "yes", but everything else (including debug) was off.

I also used a little script to hammer both of the nodes by stop/starting corosync and bouncing the resources (drbd/mysql) back and forth - this seemed quite effective in causing a variety of problems - some expected, some unexpected :)
Created attachment 390419 [details]
GDB and syslog from segfault. Tester shell script.
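The attached tester script itself is not reproduced here; as a rough sketch of the kind of stop/start hammering described in the previous comment (the peer hostname, iteration count, use of ssh, and the init-script names are placeholders, not the contents of the attachment):

    #!/bin/bash
    # crude stress loop: repeatedly restart corosync on the peer node and
    # check whether the local node still answers cluster queries
    PEER=node2          # hypothetical peer hostname
    for i in $(seq 1 50); do
        ssh "$PEER" "service corosync stop; sleep 5; service corosync start"
        sleep 30
        if ! crm_mon -1 >/dev/null 2>&1; then
            echo "iteration $i: local node stopped responding" >&2
            exit 1
        fi
    done
    echo "completed 50 iterations without a hang"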
Owen,

We are aware of more problems related to startup, not yet fixed in corosync, dealing with thread safety and the syslog API. I am working on a workaround to resolve this segfault problem.

Regards
-steve
Great, thanks! Let me know if there is anything you would like me to test. In the meantime, do you happen to know if there are any earlier versions without this issue I could work with to get the rest of my config worked out?
There are no older versions without this problem, or workarounds for this problem, at the moment.
The following patch should provide a work-around: http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/b69340f97426
Can you confirm this is no longer an issue with the latest from F-12?
Yes - I ran my "corosync tennis" script over the weekend (using the clusterlabs build from just before you posted your comment) and could not reproduce a crash at all. Guessing this is the correct closing status. Thanks for looking at this!