Bug 166109
Summary: Segfault in clurgmgrd
Product: [Retired] Red Hat Cluster Suite
Component: rgmanager
Version: 4
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: Henry Harris <henry.harris>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: axel.thimm, cluster-maint, kanderso
Fixed In Version: RHBA-2006-0241
Doc Type: Bug Fix
Last Closed: 2006-03-09 19:40:20 UTC
Description
Henry Harris
2005-08-16 23:32:45 UTC
Henry, were you able to reproduce this, or did it just randomly happen? If you could give me a little more ... that'd be helpful =) There is a segfault in clurgmgrd when you give it certain service configs, which will be fixed in U2... this might be that.

No, we have not reproduced it. I only saw it the one time. We did have about 8 or 9 services configured, mostly service IP addresses, so you are probably correct that this may be the one you've fixed in U2.

Lon, can you please describe the specific issue with the service configs that caused this problem? We would like to be able to test that the issue is fixed with U2. Thanks.

Just saw this problem repeated with rgmanager version 1.9.38-0.

Hmmm, that means it is a different problem. The one that will be fixed in U2 is solved -- and it actually prevents a clean startup of rgmanager.

Created attachment 120002 [details]
From /var/log/messages

We have not been able to get a core dump from this error, even when we send a SIGSEGV. This is the info from /var/log/messages. This occurs every 3-4 days or so.
Rgmanager currently catches SIGSEGV, but if it does so, it enters the signal handler and never returns. The process should still be running. You can then attach GDB to it and get backtraces (less useful than core files, but most people run with core dumps disabled). I'll make a change to rgmanager so that it will behave normally (e.g. drop core) so we can debug this. I have a feeling this is platform-specific.

Packages sent; current tree from CVS: RHEL4U2

Created attachment 121029 [details]
Core dump files

There are two core dump files attached from two different systems that both segfaulted in clurgmgrd.
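The change described above (letting the daemon die normally so it drops core) is not spelled out in this report, but the usual pattern is for a SIGSEGV handler to restore the default disposition and re-raise the signal. A minimal sketch, with hypothetical names and not the actual rgmanager code:

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical handler: note the crash in an async-signal-safe way, then
       let the kernel terminate the process and write a core file by restoring
       the default action and re-raising SIGSEGV. */
    static void segv_handler(int sig)
    {
        static const char msg[] = "clurgmgrd: caught SIGSEGV, dumping core\n";
        write(STDERR_FILENO, msg, sizeof(msg) - 1);

        signal(sig, SIG_DFL);   /* default action: terminate + core dump */
        raise(sig);             /* redelivered when the handler returns */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = segv_handler;
        sigaction(SIGSEGV, &sa, NULL);

        /* ... daemon main loop would run here ... */
        return 0;
    }

A core file is still only written if the core resource limit allows it, which is why 'ulimit -c unlimited' comes up again later in this report.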
This is odd, the core files look like they're generated at addresses way outside of rgmanager. e.g.:

    (gdb) thr a a bt
    Thread 4 (process 31890):
    #0  0x0000003eaab8e8e5 in ?? ()
    #1  0x0000000000000000 in ?? ()
    Thread 3 (process 32156):
    #0  0x0000003eaabbe496 in ?? ()
    #1  0x0000000000000000 in ?? ()
    Thread 2 (process 15602):
    #0  0x0000003eab6088da in ?? ()
    #1  0x0000000000000000 in ?? ()
    Thread 1 (process 15601):
    #0  0x0000003eaab8e971 in ?? ()
    Cannot access memory at address 0x5fe33fe0
    (gdb) x/i 0x0000003eaab8e971
    0x3eaab8e971:  Cannot access memory at address 0x3eaab8e971
    (gdb) p main
    $3 = {int (int, char **)} 0x4058c6 <main>

Arghhh!! This bug looks a heck of a lot like this problem: http://lkml.org/lkml/2005/9/20/207

I attached a userland script workaround for the TLB problem which should be able to be used for testing: https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=121253 -- this applies the suggested workaround to each CPU on the machine.

Created attachment 121254 [details]
core dump file
Here's one more core dump if you wish to confirm it's the same problem.
Unfortunately, the core files seem to have the same symptoms. The above script applies a temporary fix for a TLB flush filter problem on x86_64, which I think is the cause of the segfault. Unfortunately, the script must be run on each machine every time it reboots, so the correct fix is to put it in the kernel. There is a bugzilla open for the kernel side as well (this bugzilla currently depends on it): 173506.

In the event that the above does not solve the problem, I can enable an internal memory allocation replacement which can trap if it detects memory corruption. If you would like, I can build the mem-checking version as well so you only have to run one set of tests.

Out of curiosity, is the machine on which this manifests running a 100% stock RHEL kernel? Theoretically, the TLB flush filter problem should not manifest on RHEL4 kernels. In case the theory is wrong or the kernel is non-stock, running the script on each machine should fix the problem.

Lon, these core dumps were from a non-standard 2.6.9-11 kernel. However, the kernel/setup.c file is the same as the standard kernel's. Looking at the link in comment 12 above, it looks like this problem was found on a 2.6.11 kernel and fixed in a 2.6.14 kernel. Is that correct? We looked at the kernel/setup.c file in the standard 2.6.9-22 kernel and did not see the fix. Are you sure the fix should be in the 2.6.9-22 RHEL4 kernel? We're a little confused about which kernel version should have the fix. One other question -- if this is causing memory corruption, shouldn't we be seeing other kernel panics besides the clurgmgrd segfault? We've seen a few random kernel panics on 2.6.9-22, so I'm thinking these could be the same TLB problem.

The RHEL4 kernels are supposedly not a problem because they do not enable 4-level pagetables, whereas a custom kernel might. When those are not enabled, the workaround for the TLB flush filter problem is not needed, as I understand it. If I recall correctly, this is on a customer's machine and has not been reproducible elsewhere. That, of course, raises the question of what is different. If the only answer is the kernel, then we need to try with a RHEL kernel, or at least with the supplied workaround. Sorry for the confusion about the latter point -- when I said memory corruption, I was referring to internal checking in clurgmgrd for dynamically-allocated variables "just in case". Both this and the above (using a stock RHEL4 kernel) can be done at the same time. That said, I still think the kernel-based TLB workaround is worth a check anyway.

I'm not sure whether what I report is TLB related; this is a pure RHEL4U2 kernel with the latest released cluster rpms:

    Jan 12 07:27:56 zs03 kernel: clurgmgrd[10860]: segfault at 000000007c060ff0 rip 0000003dea88edf1 rsp 000000007c060ff0 error 6
    Jan 12 07:27:59 zs03 kernel: svc: bad direction 65536, dropping request
    Jan 12 07:27:59 zs03 kernel: svc: bad direction 65537, dropping request
    Jan 12 07:27:59 zs03 kernel: svc: bad direction 65536, dropping request
    Jan 12 07:28:00 zs03 last message repeated 3 times
    Jan 12 07:28:00 zs03 kernel: svc: bad direction 65537, dropping request
    [...]

This is a Tyan 2882 dual-opteron system.

I found a strange segfault in magma_sm.so while testing a fix for another problem, which had to do with lockspace releasing/acquiring races. That is, a program A could make program B segfault if it released its lockspace at the right time. I have a fix for that problem.
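The workaround script itself is not reproduced in this report, and the exact mechanism is an assumption: based on the lkml thread referenced above, the Opteron fix amounts to setting the TLB flush filter disable bit (HWCR.FFDIS, bit 6 of MSR 0xC0010015) on every CPU. A userland sketch of that idea, assuming the msr kernel module exposes /dev/cpu/N/msr and a two-CPU box, purely for illustration:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MSR_K8_HWCR 0xC0010015u   /* AMD K8 Hardware Configuration Register (assumed) */
    #define HWCR_FFDIS  (1ULL << 6)   /* TLB flush filter disable bit (assumed) */

    /* Set HWCR.FFDIS on one CPU via its msr device node (requires root). */
    static int disable_flush_filter(int cpu)
    {
        char path[64];
        uint64_t hwcr;
        int fd;

        snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
        fd = open(path, O_RDWR);
        if (fd < 0)
            return -1;

        /* The msr device is addressed by using the MSR number as the file offset. */
        if (pread(fd, &hwcr, sizeof(hwcr), MSR_K8_HWCR) != sizeof(hwcr)) {
            close(fd);
            return -1;
        }
        hwcr |= HWCR_FFDIS;
        if (pwrite(fd, &hwcr, sizeof(hwcr), MSR_K8_HWCR) != sizeof(hwcr)) {
            close(fd);
            return -1;
        }

        close(fd);
        return 0;
    }

    int main(void)
    {
        for (int cpu = 0; cpu < 2; cpu++)   /* dual-Opteron assumed */
            if (disable_flush_filter(cpu) != 0)
                perror("disable_flush_filter");
        return 0;
    }

As noted above, a userland change like this does not persist across reboots, which is why the real fix belongs in the kernel (bug 173506).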
Axel -- could you try to reproduce with 'ulimit -c unlimited' so that clurgmgrd drops a core file? (Requires U2 rgmanager package...)

Created attachment 123335 [details]
Contains core dump and /var/log/messages files
To clarify, the attachments in comment #24 are from a segfault that occurred last night with rgmanager 1.9.43; it is the first time we've seen this problem in quite a while. Previously, we have verified that the TLB flush filter workaround is in place in our kernel.

In analyzing what I've got here, rgmanager appears to have segfaulted in a worker thread (as opposed to the main thread). This narrows down the scope of the problem greatly. It probably occurred during a subsequent status run for your service (it occurred 5 seconds after the previous status run). Here's the relevant info:

    Jan 17 00:24:37 castle02 clurgmgrd: [7671]: <info> Executing /mnt/crosswalk/email_notifier/run_email_notifier status
    Jan 17 00:24:52 castle02 kernel: clurgmgrd[5652]: segfault at 000000004aa11fc0 rip 0000003a9dc6d0d6 rsp 000000004aa11fc8 error 6

If possible, can you send me the configuration which was running at the time (directly; you don't have to put it in bugzilla)? The only thing corrupt at this point seems to be the stack pointers. The internal structures are at least partially intact, which is what I'm analyzing (to see if the resource trees got corrupted somehow...).

Ok, here's the *concise* version as to what I've got so far. There was one worker thread running, and it was handling a status check for the service named "192.168.253.44", which only had an IP address. The thread was in the 'started' state, and its status was set to 'working on RG_STATUS request'. Thread state:

    (gdb) p *(resthread_t *) 0x2a965d5270
    $3 = {_list_head = {le_next = 0x2a965d5270, le_prev = 0x2a965d5270},
      rt_thread = 1262557536, rt_request = 4, rt_status = 112,
      rt_name = "192.168.253.44", '\0' <repeats 241 times>,
      rt_queue = 0x4b411158, rt_queue_mutex = 0x4b411190,
      rt_queue_cond = 0x4b411160}

No other pending requests were queued. The service itself is in the RG_STATE_STARTED state, and neither the View-Formation mutex nor the key-list mutex were held. This means that clurgmgrd was *not* trying to update the service state at the time of the segfault. Service state:

    (gdb) p *(rg_state_t *)0x2a958036c0
    $154 = {rs_name = "192.168.253.44", '\0' <repeats 49 times>,
      rs_owner = 2, rs_last_owner = 2, rs_state = 112, rs_restarts = 4,
      rs_transition = 1137453517, rs_id = 0, rs_pad = 0}

For completeness, our local node ID:

    (gdb) p myid
    $167 = 2

So, castle02 was the owner, and it was in a valid state. Here's where we get to narrow things down. When the status check is run for a given check level, ra_last gets set to the current time:

    (gdb) p ((resource_node_t *)0x644b30)->rn_actions[6]
    $126 = {ra_name = 0x627990 "status", ra_timeout = 20, ra_last = 1137482692,
      ra_interval = 120, ra_depth = 20}

Translated using ctime: Tue Jan 17 02:24:52 2006. Offset by 2 hours (mountain time vs. eastern time): Tue Jan 17 00:24:52 2006.

The time of the kernel segfault message corresponds to the same second as the level-20 IP address check time, suggesting that clurgmgrd crashed after the time() call but before the status check completed. It may have completed before or after the fork()/waitpid() call. The thread was still processing the request -- it had not been set back to NONE yet, but we were not trying to restart the service either, so it is very unlikely that clurgmgrd crashed in svc_stop() or the like. Kernel message:

    Jan 17 00:24:52 castle02 kernel: clurgmgrd[5652]: segfault at 000000004aa11fc0 rip 0000003a9dc6d0d6 rsp 000000004aa11fc8 error 6

Sorry, clarification: it may have *crashed* before or after the fork()/waitpid() call.
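As a cross-check on the time correlation above, the ra_last value can be converted directly; a tiny illustrative snippet (the hard-coded epoch value is the one from the core file):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        time_t ra_last = 1137482692;   /* ra_last from rn_actions[6] in the core */

        /* ctime() prints local time: 02:24:52 on a US/Eastern machine,
           i.e. 00:24:52 mountain time -- the same second as the kernel's
           segfault message. */
        printf("%s", ctime(&ra_last));
        return 0;
    }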
I have found a bug in clurgmgrd which could explain this.

(1) There's a reference-after-free which occurs every time a status check is run (but not when other requests are run). We call send_response(...) on a request which has been freed. send_response will close the fd if nonzero.

(2) pthread_atfork() is incorrectly called. Over time, this chews up memory unnecessarily; the function also initializes some memory.

Given a cluster with several services running, here's what I think is happening:

    thread one                   thread two
    ----------------------       ---------------------------
    status check complete
    free memory
                                 status check start
                                 pthread_atfork()
    send_response
    overwrite freed memory
    pthread_exit
                                 fork()  <-- segfault
                                 (status check doesn't complete)

The chances of this happening are extremely low, but nonzero, which would explain why it only happens very rarely.

Created attachment 123399 [details]
Patch which solves memory corruption bug
I'm currently testing this patch for memory leaks; long term testing will be
necessary to ensure that it resolves this issue.
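The actual fix is in the attached patch; what follows is only a minimal sketch of the reference-after-free described in point (1), using hypothetical names (request_t, handle_status_*) rather than the real rgmanager symbols:

    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical request structure, loosely modeled on the description above. */
    typedef struct {
        int fd;        /* reply descriptor; send_response() closes it if nonzero */
        int result;
    } request_t;

    static void send_response(request_t *req)
    {
        /* ... write req->result back to the requester ... */
        if (req->fd)
            close(req->fd);
    }

    /* Buggy ordering, per point (1): the request is freed first, so
       send_response() reads (and may close an fd stored in) freed memory.
       If another thread has already reused that memory -- for example for
       state set up by pthread_atfork() -- the stale write corrupts it and a
       later fork() can segfault, matching the interleaving sketched above. */
    static void handle_status_buggy(request_t *req)
    {
        free(req);
        send_response(req);    /* use-after-free */
    }

    /* Fixed ordering: respond first, then release the request. */
    static void handle_status_fixed(request_t *req)
    {
        send_response(req);
        free(req);
    }

    int main(void)
    {
        request_t *req = calloc(1, sizeof(*req));
        if (req)
            handle_status_fixed(req);
        (void)handle_status_buggy;    /* shown only for contrast, not called */
        return 0;
    }

The pthread_atfork() half of the fix would presumably amount to registering the fork handlers once at startup rather than on every status check, so repeated calls neither leak memory nor race with fork().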
The patch passed several overnight stress tests.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0241.html