Bug 525552 - OpenAIS crashes intermittently, especially when connecting to a new node
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: openais
Version: 12
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Andrew Beekhof
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-09-24 19:26 UTC by Owen Barton
Modified: 2010-06-28 23:30 UTC
CC: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-06-28 23:30:29 UTC


Attachments
src rpm, the cib (a pretty standard drbd/mysql setup), messages logs from both nodes and gdb traces from 3 crashes (1.25 MB, application/x-gzip)
2009-09-24 19:26 UTC, Owen Barton
GDB and syslog from segfault. Tester shell script. (3.38 KB, application/x-gzip)
2010-02-12 01:24 UTC, Owen Barton

Description Owen Barton 2009-09-24 19:26:20 UTC
Created attachment 362560 [details]
src rpm, the cib (a pretty standard drbd/mysql setup), messages logs from both nodes and gdb traces from 3 crashes

Description of problem:

OpenAIS crashes intermittently, leaving a non-responsive process with the following symptoms:

[root@dbsba-scratch-ha2 crm]# crm resource status
Error signing on to the CIB service: connection failed
[root@dbsba-scratch-ha2 crm]# crm_verify -x cib.xml -VVV
crm_verify[25552]: 2009/09/22_19:58:40 info: main: =#=#=#=#= Getting XML =#=#=#=#=
crm_verify[25552]: 2009/09/22_19:58:40 notice: unpack_config: On loss of CCM Quorum: Ignore
crm_verify[25552]: 2009/09/22_19:58:40 info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
[root@dbsba-scratch-ha2 crm]# openais-cfgtool -s
Printing ring status.
Could not initialize openais configuration API error 6

Version-Release number of selected component (if applicable):

I originally found this issue with the prebuilt set of packages on http://download.opensuse.org/repositories/server:/ha-clustering/CentOS_5/x86_64/ which was openais-0.80.5-15 running on CentOS 5.3 x86_64.

I discussed the issue with sdake on IRC, who suspected missing revisions 1998 or 1831 in the package could be the culprit. So I rebuilt the package using the latest whitetank SVN source (applying the same 2 patches from the above rpm src). I have included the rpm src in the attached file for reference.

However, the issue still reproduces with this code.

How reproducible:

I have had trouble reproducing this, since it seems intermittent. I did notice a correlation, however: it most often happens when another openais node connects to the cluster, for example when rebooting or stop/starting the openais process on another node.

Steps to Reproduce:
1. Start openais on one node
2. Reboot a node that shares a connection
3. Attempt to connect to the crm
4. If successful, go to 2; repeat until you fail to connect
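The loop above can be sketched as a small shell script. Everything here is an assumption for illustration (the node name, using a service restart in place of a full reboot, and the dry-run default); Owen's actual tester script is in the later attachment.

```shell
#!/bin/sh
# Sketch of reproduction steps 1-4. NODE and the restart command are
# assumptions; adjust for your cluster. DRY_RUN=1 (the default here)
# only prints the commands instead of running them.
NODE=${NODE:-node2}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

i=1
while [ "$i" -le 3 ]; do
    # Step 2: restart openais on the peer node (a reboot in the original).
    run ssh "$NODE" service openais restart
    # Step 3: try to talk to the CRM; a hang or failure here is the bug.
    if ! run crm resource status; then
        echo "crm connection failed after restart $i"
        break
    fi
    i=$((i + 1))
done
```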

Actual results:
You fail to connect - the aisexec process is still listed in ps -A, but it is non-responsive.

Expected results:
It remains responsive even when other nodes join the cluster.

Additional info:

The attached tarball includes the src rpm, the cib (a pretty standard drbd/mysql setup), messages logs from both nodes and gdb traces from 3 crashes. If you want to correlate with the logs, the crash times were approximately Sep 23 23:03, 23:26 and Sep 23 19:53. The 3rd attempt is probably the "cleanest" reproduction - both nodes had come up from a reboot (with openais startup on boot disabled, so I could start via gdb), the second node was then rebooted 3 times - twice without causing a problem, the last time causing the crash.

Please let me know if there are any specific tests you would like me to do, or patches you would like me to test.

Comment 1 Steven Dake 2009-10-01 20:26:04 UTC
Andrew

Can you please take a look at this issue?

Regards
-steve

Comment 2 Owen Barton 2009-10-01 20:57:31 UTC
For what it's worth I also tried this on a clean CentOS 5.3 install with the
latest stable versions of openais, cluster-glue, resource-agents and pacemaker
all installed from source, and can still reproduce this.

I separately also updated the src rpms to build the latest versions, but had the same issues there. I'll submit the spec diffs to Andrew privately in case they are of use going forward.

Comment 3 Owen Barton 2009-10-01 21:56:39 UTC
It looks like the issue (likely) occurs on RHEL also:
<misch> Anybody using openais/pacemaker from the opensuse builderver on RHEL5.3? I have a problem that openais does not start correctly after system reboot.
<misch> openais-cfgtool -s shows now no active rings.
<misch> pacemaker is not started by openais.

Comment 4 Michael Schwartzkopff 2009-10-02 06:38:04 UTC
Detailed log of a failed automatic start and a successful manual start at:
http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg01745.html

Michael.

Comment 5 Bug Zapper 2009-11-16 12:52:29 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 6 Steven Dake 2009-12-15 21:17:34 UTC
Use the ulimit command to set the core file size limit to unlimited:

ulimit -c unlimited

Install the debuginfo package.

Please submit a backtrace of the crash.  To do this, run

gdb /usr/sbin/openais /var/lib/openais/core.XXXX

where XXXX is the PID (ls should show it), then run the bt command.

It will show the signal that triggered the problem (ABORT or SEGV) as well as the backtrace.
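Put together, the collection steps might look like the sketch below. The daemon path (aisexec in whitetank builds) and the core-file directory are assumptions that may differ on your system; the script just prints a notice if no core exists yet.

```shell
#!/bin/sh
# Allow full-size core dumps (may be restricted in some environments).
ulimit -c unlimited 2>/dev/null || true
# Core files from the daemon are assumed to land in /var/lib/openais.
corefile=$(ls /var/lib/openais/core.* 2>/dev/null | head -n 1)
if [ -n "$corefile" ]; then
    # -batch -ex bt prints the backtrace non-interactively and exits.
    gdb -batch -ex bt /usr/sbin/aisexec "$corefile"
else
    echo "no core file found in /var/lib/openais"
fi
```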

We have fixed several problems in corosync and whitetank (openais) which could be the culprit of your problems.  I'd like to know whether this is a known or unknown issue.

Comment 7 Owen Barton 2010-02-02 21:16:58 UTC
Just to be clear - are the traces I originally attached to this ticket insufficient? I am happy to try this again, if that would be useful.

Comment 8 Steven Dake 2010-02-02 21:33:59 UTC
Owen,

My apologies - I had not seen the backtraces.

I investigated the traces; the crashes are fixed upstream in corosync but not yet backported.  Try the workaround of setting timestamp: off in your configuration files.
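For reference, the workaround goes in the logging section of the config file (/etc/ais/openais.conf for whitetank, /etc/corosync/corosync.conf for corosync). A minimal sketch - keep whatever other directives your existing logging section already has; the to_syslog line here is only illustrative:

```
logging {
        # Avoids the strftime/localtime calls that race with
        # pacemaker's setenv (see the explanation below).
        timestamp: off
        to_syslog: yes
}
```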

The root of the problem is that getenv/setenv/strftime/localtime are not thread safe, and pacemaker uses setenv (while strftime/localtime use getenv) at about the same time, resulting in segfaults.

We are investigating further issues surrounding the use of getenv APIs inside corosync while pacemaker continues to use setenv.  I am not sure whether those issues affect the openais whitetank tip.

Try the timestamp: off workaround for now.

Regards
-steve

Comment 9 Owen Barton 2010-02-03 00:59:41 UTC
Great - I will try this and report back. Thanks so much for looking into this!

Comment 10 Owen Barton 2010-02-12 01:23:34 UTC
So I tried this out, this time using CentOS 5.4, and the builds from http://www.clusterlabs.org/rpm/epel-5/x86_64/ (including the updates from yesterday): corosync-1.2.0-1.el5.x86_64.rpm with pacemaker-1.0.7-4.el5.x86_64.rpm.

I added "timestamp: off" to the logging section of corosync.conf, but the problem still occasionally reproduced: I could sometimes start up corosync and then fail to connect with crm (on the same node). I couldn't reproduce the original issue where restarting the second node would cause the first to crash.

I checked that the corosync I am using includes the fix from https://bugzilla.redhat.com/show_bug.cgi?id=544022 so I wonder if this is something else. Even when I turned off all logging and debug in corosync.conf I could still reproduce - although it did "feel" less frequent with all logging disabled.

Catching a gdb backtrace was harder, but I did catch one which is attached. This one occurred when to_syslog was "yes", but everything else (including debug) was off. I also used a little script to hammer both of the nodes by stop-starting corosync and bouncing the resources (drbd/mysql) back and forth - this seemed quite effective at causing a variety of problems - some expected, some unexpected :)

Comment 11 Owen Barton 2010-02-12 01:24:32 UTC
Created attachment 390419 [details]
GDB and syslog from segfault. Tester shell script.

Comment 12 Steven Dake 2010-02-12 05:26:33 UTC
Owen,

We are aware of more startup problems related to thread safety and the syslog API that are not yet fixed in corosync.  I am working on a workaround to resolve this segfault problem.

Regards
-steve

Comment 13 Owen Barton 2010-02-16 20:37:36 UTC
Great, thanks! Let me know if there is anything you would like me to test.

In the meantime, do you happen to know if there are any earlier versions without this issue I could work with to get the rest of my config worked out?

Comment 14 Steven Dake 2010-02-16 21:02:25 UTC
There are no older versions without this problem, and no workarounds for it at the moment.

Comment 15 Andrew Beekhof 2010-02-24 09:26:34 UTC
The following patch should provide a work-around:
   http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/b69340f97426

Comment 16 Andrew Beekhof 2010-05-20 13:03:46 UTC
Can you confirm this is no longer an issue with the latest from F-12?

Comment 17 Owen Barton 2010-06-28 23:30:29 UTC
Yes - I ran my "corosync tennis" script over the weekend (using the clusterlabs build from just before you posted your comment) and could not reproduce a crash at all. I'm guessing this is the correct closing status.

Thanks for looking at this!

