Bug 166109

Summary: Segfault in clurgmgrd
Product: [Retired] Red Hat Cluster Suite
Component: rgmanager
Version: 4
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: Henry Harris <henry.harris>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: axel.thimm, cluster-maint, kanderso
Fixed In Version: RHBA-2006-0241
Doc Type: Bug Fix
Last Closed: 2006-03-09 19:40:20 UTC
Attachments (all flags: none):
- From /var/log/messages
- Core dump files
- core dump file
- Contains core dump and /var/log/messages files
- Patch which solves memory corruption bug

Description Henry Harris 2005-08-16 23:32:45 UTC
Description of problem: Got the following error:

kernel: clurgmgrd[4527]: segfault at 0000000040000ff0 rip 00000030c8e8e971
rsp 0000000040000ff0 error 6


Version-Release number of selected component (if applicable):
1.9.34-1

How reproducible: Have not reproduced


Steps to Reproduce: Unknown

Comment 1 Lon Hohberger 2005-08-26 16:25:46 UTC
Henry, were you able to reproduce this, or did it just randomly happen?

If you could give me a little more ... that'd be helpful =)


There is a segfault in clurgmgrd that occurs with certain service configs,
which will be fixed in U2... this might be that.


Comment 2 Henry Harris 2005-08-26 17:37:43 UTC
No, we have not reproduced it.  I only saw it the one time.  We did have about
8 or 9 services configured, mostly service IP addresses, so you are probably
correct that this may be the one you've fixed in U2.

Comment 3 Henry Harris 2005-09-13 21:52:07 UTC
Lon, can you please describe the specific issue with the service configs that
caused this problem?  We would like to be able to test that the issue is
fixed with U2.  Thanks.

Comment 5 Henry Harris 2005-09-30 21:34:09 UTC
Just saw this problem repeated with rgmanager version 1.9.38-0.

Comment 6 Lon Hohberger 2005-10-03 13:56:41 UTC
Hmmm, that means it is a different problem.  The one that will be fixed in U2
is already solved -- and that bug actually prevents a clean startup of
rgmanager, so it can't be what you're seeing.



Comment 7 Henry Harris 2005-10-14 20:54:53 UTC
Created attachment 120002 [details]
From /var/log/messages

We have not been able to get a core dump from this error, even when we send a
SIGSEGV.  This is the info from /var/log/messages.  This occurs every 3-4
days or so.

Comment 8 Lon Hohberger 2005-10-17 15:18:35 UTC
Rgmanager currently catches SIGSEGV, but if it does so, it enters the signal
handler and never returns.   The process should still be running.  You can then
attach GDB to it and get backtraces (less useful than core files, but most
people run with core dumps disabled).

I'll make a change to rgmanager so that it will behave normally (e.g. drop core)
so we can debug this.  I have a feeling this is platform-specific.
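
For illustration, a minimal sketch of the behavior described above -- a
SIGSEGV handler that never returns keeps the process alive so a debugger can
attach, while restoring the default disposition makes it drop core.  This is
an assumed reduction, not the actual rgmanager source:

#include <signal.h>
#include <unistd.h>

/* A handler like this never returns; the faulting process just sleeps,
 * so you can still attach gdb to it and pull backtraces. */
static void segv_handler(int sig)
{
    (void)sig;
    for (;;)
        pause();
}

int main(void)
{
    signal(SIGSEGV, segv_handler);

    /* To "behave normally" (drop core) instead, the handler would
     * restore the default disposition and re-raise:
     *     signal(SIGSEGV, SIG_DFL);
     *     raise(SIGSEGV);
     */

    /* ... daemon main loop ... */
    return 0;
}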

Comment 9 Lon Hohberger 2005-10-21 15:28:19 UTC
Packages sent; current tree from CVS: RHEL4U2

Comment 10 Henry Harris 2005-11-14 17:13:55 UTC
Created attachment 121029 [details]
Core dump files

There are two core dump files attached from two different systems that both
segfaulted in clurgmgrd.

Comment 11 Lon Hohberger 2005-11-15 22:09:36 UTC
This is odd: the core files look like they were generated at addresses way
outside of rgmanager.

e.g.:

(gdb) thr a a bt

Thread 4 (process 31890):
#0  0x0000003eaab8e8e5 in ?? ()
#1  0x0000000000000000 in ?? ()

Thread 3 (process 32156):
#0  0x0000003eaabbe496 in ?? ()
#1  0x0000000000000000 in ?? ()

Thread 2 (process 15602):
#0  0x0000003eab6088da in ?? ()
#1  0x0000000000000000 in ?? ()

Thread 1 (process 15601):
#0  0x0000003eaab8e971 in ?? ()
Cannot access memory at address 0x5fe33fe0
(gdb) x/i 0x0000003eaab8e971
0x3eaab8e971:   Cannot access memory at address 0x3eaab8e971
(gdb) p main
$3 = {int (int, char **)} 0x4058c6 <main>




Comment 12 Lon Hohberger 2005-11-15 22:28:02 UTC
Arghhh!!  This bug looks a heck of a lot like this problem:

http://lkml.org/lkml/2005/9/20/207

Comment 13 Lon Hohberger 2005-11-18 20:14:56 UTC
I attached a userland script that works around the TLB problem and can be
used for testing:

https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=121253

This applies the suggested workaround to each CPU on the machine.
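
The script itself lives only in the attachment, but the workaround discussed
in the lkml thread amounts to setting the flush-filter-disable bit (FFDIS,
bit 6) of the AMD K8 HWCR MSR (0xc0010015) on every CPU.  A hypothetical C
equivalent using the /dev/cpu/N/msr interface (requires the msr module;
a sketch of the technique, not the attached script):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_K8_HWCR 0xc0010015
#define HWCR_FFDIS  (1ULL << 6)    /* TLB flush filter disable */

/* Set HWCR.FFDIS on one CPU via its msr device node.  Returns 0 on
 * success, -1 if the CPU's device node cannot be opened or accessed. */
static int disable_flush_filter(int cpu)
{
    char path[64];
    uint64_t hwcr;
    int fd;

    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    /* The msr device reads/writes 8 bytes at an offset equal to the
     * MSR number. */
    if (pread(fd, &hwcr, sizeof(hwcr), MSR_K8_HWCR) != sizeof(hwcr)) {
        close(fd);
        return -1;
    }
    hwcr |= HWCR_FFDIS;
    if (pwrite(fd, &hwcr, sizeof(hwcr), MSR_K8_HWCR) != sizeof(hwcr)) {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

int main(void)
{
    int cpu;

    /* Apply to every CPU; the MSR setting is volatile, so this must be
     * redone after each reboot (hence the desire for a kernel fix). */
    for (cpu = 0; disable_flush_filter(cpu) == 0; cpu++)
        ;
    return 0;
}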



Comment 14 Henry Harris 2005-11-18 20:49:28 UTC
Created attachment 121254 [details]
core dump file

Here's one more core dump if you wish to confirm it's the same problem.

Comment 16 Lon Hohberger 2005-11-21 14:59:42 UTC
Unfortunately, the core files seem to have the same symptoms.

The above script applies a temporary fix for a TLB flush filter problem on
x86_64, which I think is the cause of the segfault.  The script must be run
on each machine after every reboot, so the correct fix is to put it in the
kernel.  There is a bugzilla open for the kernel side as well (this bugzilla
currently depends on it): 173506

In the event that the above does not solve the problem, I can enable an internal
memory allocation replacement which can trap if it detects memory corruption.

If you would like, I can build the mem-checking version as well so you only have
to run one set of tests.
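
As a sketch of what such an internal memory-allocation replacement typically
looks like (an assumption about the general technique, not the actual
rgmanager code): pad every allocation with guard words and verify them on
free, trapping as soon as corruption is detected:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD 0xA5A5A5A5u   /* pattern stored before and after user data */

/* Block layout: [size_t len][GUARD][user data ...][GUARD] */
void *chk_malloc(size_t len)
{
    unsigned char *raw = malloc(sizeof(size_t) + 2 * sizeof(uint32_t) + len);
    uint32_t guard = GUARD;

    if (!raw)
        return NULL;
    memcpy(raw, &len, sizeof(len));
    memcpy(raw + sizeof(size_t), &guard, sizeof(guard));
    memcpy(raw + sizeof(size_t) + sizeof(uint32_t) + len, &guard,
           sizeof(guard));
    return raw + sizeof(size_t) + sizeof(uint32_t);
}

void chk_free(void *ptr)
{
    unsigned char *user = ptr;
    unsigned char *raw;
    size_t len;
    uint32_t head, tail;

    if (!ptr)
        return;
    raw = user - sizeof(uint32_t) - sizeof(size_t);
    memcpy(&len, raw, sizeof(len));
    memcpy(&head, raw + sizeof(size_t), sizeof(head));
    memcpy(&tail, user + len, sizeof(tail));
    /* Trap immediately if either guard was overwritten, instead of
     * letting the corruption surface later as a random segfault. */
    assert(head == GUARD && tail == GUARD);
    free(raw);
}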

Comment 17 Lon Hohberger 2005-11-21 15:22:49 UTC
Out of curiosity, is the machine on which this manifests running a 100% stock
RHEL kernel?  Theoretically, the TLB flush filter problem should not manifest on
RHEL4 kernels.

If the theory is wrong, or a non-stock RHEL kernel is in use, running the
script on each machine should fix the problem.

Comment 19 Henry Harris 2005-11-21 18:40:36 UTC
Lon, these core dumps were from a non-standard 2.6.9-11 kernel.  However, the 
kernel/setup.c file is the same as the standard kernel.  Looking at the link 
in comment 12 above, it looks like this problem was found on a 2.6.11 kernel 
and fixed in a 2.6.14 kernel.  Is that correct?  We looked at the 
kernel/setup.c file in the standard 2.6.9-22 kernel and did not see the fix.
Are you sure that the fix should be in the 2.6.9-22 RHEL4 kernel?  We're a 
little confused about which kernel version should have the fix.

One other question -- if this is causing memory corruption, shouldn't we be 
seeing other kernel panics besides the clurgmgrd segfault?  We've seen a few 
random kernel panics on 2.6.9-22, so I'm thinking these could be the same TLB 
problem. 

Comment 20 Lon Hohberger 2005-11-21 20:07:21 UTC
The RHEL4 kernels supposedly are not a problem because they do not enable
4-level pagetables -- whereas a custom kernel might.  When those are not
enabled, the workaround for the TLB filter problem is not needed, as I
understand it.

If I recall correctly, this is on a customer's machine, and has not even been
reproducible elsewhere.  This, of course, raises the question: "What is
different?"  If the only answer is the kernel, then we need to try with a
RHEL kernel, or at least with the supplied workaround.

Sorry for the confusion about the latter point -- when I said memory
corruption, I was referring to internal checking in clurgmgrd for
dynamically allocated variables, "just in case".  Both this and the above
(using a stock RHEL4 kernel) can be done at the same time.

Comment 21 Lon Hohberger 2005-11-21 20:12:22 UTC
That said - I still think the kernel-based TLB workaround is worth a check anyway.

Comment 22 Axel Thimm 2006-01-12 13:11:56 UTC
I'm not sure whether what I'm reporting is TLB-related; this is a pure
RHEL4U2 kernel with the latest released cluster rpms:

Jan 12 07:27:56 zs03 kernel: clurgmgrd[10860]: segfault at 000000007c060ff0 rip
0000003dea88edf1 rsp 000000007c060ff0 error 6
Jan 12 07:27:59 zs03 kernel: svc: bad direction 65536, dropping request
Jan 12 07:27:59 zs03 kernel: svc: bad direction 65537, dropping request
Jan 12 07:27:59 zs03 kernel: svc: bad direction 65536, dropping request
Jan 12 07:28:00 zs03 last message repeated 3 times
Jan 12 07:28:00 zs03 kernel: svc: bad direction 65537, dropping request
[...]

This is a Tyan 2882 dual-opteron system.


Comment 23 Lon Hohberger 2006-01-12 15:47:40 UTC
I found a strange segfault in magma_sm.so while testing a fix for another
problem, which had to do with lockspace releasing/acquiring races.  That is,
program A could make program B segfault if it released its lockspace at the
right time.

I have a fix for that problem.

Axel -- could you try to reproduce with 'ulimit -c unlimited' so that clurgmgrd
drops a core file? (Requires U2 rgmanager package...)

Comment 24 Henry Harris 2006-01-17 21:44:46 UTC
Created attachment 123335 [details]
Contains core dump and /var/log/messages files

Comment 25 Henry Harris 2006-01-17 22:03:26 UTC
To clarify, the attachments in comment #24 are from a segfault that occurred 
last night with rgmanager 1.9.43; this is the first time we've seen this 
problem in quite a while.  Previously, we verified that the TLB flush filter 
workaround is in place in our kernel.

Comment 26 Lon Hohberger 2006-01-18 02:00:59 UTC
From what I've got here, rgmanager appears to have segfaulted in a worker
thread (as opposed to the main thread).  This narrows down the scope of the
problem greatly.  It probably occurred during a subsequent status run for
your service (it occurred 5 seconds after the previous status run).

Here's the relevant info:

Jan 17 00:24:37 castle02 clurgmgrd: [7671]: <info> Executing
/mnt/crosswalk/email_notifier/run_email_notifier status
Jan 17 00:24:52 castle02 kernel: clurgmgrd[5652]: segfault at 000000004aa11fc0
rip 0000003a9dc6d0d6 rsp 000000004aa11fc8 error 6

If possible, can you send me the configuration which was running at the time
(directly, you don't have to put it in bugzilla)?  The only thing corrupt at
this point seems to be the stack pointers.  The internal structures are at least
partially intact, which is what I'm analyzing (to see if the resource trees got
corrupted somehow...).


Comment 35 Lon Hohberger 2006-01-18 17:56:26 UTC
Ok, here's the *concise* version as to what I've got so far.

There was one worker thread running, and it was handling a status check for the
service named "192.168.253.44" which only had an IP address.  The thread was in
the 'started' state, and its status was set to 'working on RG_STATUS request'.

Thread state:

(gdb) p *(resthread_t *) 0x2a965d5270
$3 = {_list_head = {le_next = 0x2a965d5270, le_prev = 0x2a965d5270}, rt_thread =
1262557536, rt_request = 4, rt_status = 112, 
  rt_name = "192.168.253.44", '\0' <repeats 241 times>, rt_queue = 0x4b411158,
rt_queue_mutex = 0x4b411190, 
  rt_queue_cond = 0x4b411160}

No other pending requests were queued.

The service itself is in the RG_STATE_STARTED state, and neither the
View-Formation mutex nor the key-list-mutex were held.  This means that
clurgmgrd was *not* trying to update the service state at the time of the segfault.

Service state:

(gdb) p *(rg_state_t *)0x2a958036c0
$154 = {rs_name = "192.168.253.44", '\0' <repeats 49 times>, rs_owner = 2,
rs_last_owner = 2, rs_state = 112, rs_restarts = 4, 
  rs_transition = 1137453517, rs_id = 0, rs_pad = 0}

For completeness, our local node ID:

(gdb) p myid 
$167 = 2

So, castle02 was the owner, and it was in a valid state.  Here's where we get to
narrow things down.  When the status check is run for a given check level, the
ra_last gets set to the current time.

(gdb) p ((resource_node_t *)0x644b30)->rn_actions[6]
$126 = {ra_name = 0x627990 "status", ra_timeout = 20, ra_last = 1137482692,
ra_interval = 120, ra_depth = 20}

Translated using ctime:

Tue Jan 17 02:24:52 2006

Offset by 2 hours (mountain time vs. eastern time):

Tue Jan 17 00:24:52 2006
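
For reference, a trivial snippet reproducing the ctime translation above
(the rendering depends on the machine's local timezone):

#include <stdio.h>
#include <time.h>

int main(void)
{
    time_t ra_last = 1137482692;   /* ra_last from the gdb dump above */
    printf("%s", ctime(&ra_last)); /* "Tue Jan 17 02:24:52 2006" in EST */
    return 0;
}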

The time of the kernel segfault message corresponds to the same second as the
level-20 IP address check, suggesting that clurgmgrd crashed after the time()
call but before the status check completed.
after the fork()/waitpid() call.  The thread was still processing the request --
it had not been set back to NONE yet, but we were not trying to restart the
service either, so it is very unlikely that clurgmgrd crashed in svc_stop() or
the like.

Kernel message:

Jan 17 00:24:52 castle02 kernel: clurgmgrd[5652]: segfault at 000000004aa11fc0
rip 0000003a9dc6d0d6 rsp 000000004aa11fc8 error 6

Comment 36 Lon Hohberger 2006-01-18 17:57:26 UTC
Sorry, clarification: it may have *crashed* before or
after the fork()/waitpid() call.

Comment 38 Lon Hohberger 2006-01-18 20:06:52 UTC
I have found a bug in clurgmgrd which could explain this.

(1) There's a reference-after-free which occurs every time a status check is run
(but not when other requests are run).  We call send_response(...) on a request
which has already been freed; send_response will close the fd if it is nonzero.

(2) pthread_atfork() is incorrectly called repeatedly.  This function
allocates and initializes some memory on each call, so over time this chews
up memory unnecessarily.

Given a cluster with several services running, here's what I think is happening:

Thread one                   Thread two
----------------------       ---------------------------
status check complete
free memory
                             status check start
                             pthread_atfork()
send_response
overwrite freed memory
pthread_exit
                             fork() <-- segfault
                             (status check doesn't complete)

The chances of this happening are extremely low, but nonzero, which would
explain why it only happens very rarely.
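
A hypothetical sketch of the corrected ordering (illustrative names, not the
actual patch attached in comment 39): send the response before freeing the
request, and register the fork handlers exactly once rather than on every
status check:

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Illustrative request structure; not the actual rgmanager type. */
struct request {
    int fd;
    /* ... request payload ... */
};

static pthread_once_t atfork_once = PTHREAD_ONCE_INIT;

static void prepare(void) { /* take locks before fork() */ }
static void parent(void)  { /* release locks in the parent */ }
static void child(void)   { /* release locks in the child  */ }

/* Register fork handlers exactly once per process; calling
 * pthread_atfork() on every status check permanently accumulates
 * handler registrations (and the memory backing them). */
static void register_atfork(void)
{
    pthread_atfork(prepare, parent, child);
}

static void send_response(struct request *req)
{
    /* ... write the reply to req->fd ... */
    if (req->fd)
        close(req->fd);
}

static void finish_status_check(struct request *req)
{
    /* Respond first, then free.  The buggy ordering freed the request
     * and then called send_response() on it, scribbling over memory
     * that another thread may already have reused. */
    send_response(req);
    free(req);
}

static void start_status_check(void)
{
    pthread_once(&atfork_once, register_atfork);
    /* ... fork()/waitpid() to run the resource agent's status check ... */
}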

Comment 39 Lon Hohberger 2006-01-18 20:17:27 UTC
Created attachment 123399 [details]
Patch which solves memory corruption bug

I'm currently testing this patch for memory leaks; long-term testing will be
necessary to ensure that it resolves this issue.

Comment 40 Lon Hohberger 2006-01-27 20:51:47 UTC
The patch passed several overnight stress tests.

Comment 43 Red Hat Bugzilla 2006-03-09 19:40:20 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0241.html