Bug 706278 - openais crash and hang on VM cluster
Summary: openais crash and hang on VM cluster
Keywords:
Status: CLOSED DUPLICATE of bug 694181
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.8
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: rc
: ---
Assignee: Jan Friesse
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-05-19 22:13 UTC by Alan Conway
Modified: 2011-11-23 14:21 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-06-08 15:11:50 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Alan Conway 2011-05-19 22:13:26 UTC
Description of problem:

Conversation with sdake on Wed 19 May 2011

(05:55:56 PM) sdake: what is the test scenario
(05:57:31 PM) Alan Conway: running 4 brokers, on 2 brokers continuous heavy load, on the other two repeatedly stop/start brokers with load. The two that were doing stop/start were failing to restart with try again in cpg_init. The other two are the ones that are spinning.
(05:58:01 PM) sdake: is this a new test case?
(05:58:07 PM) sdake: or something that worked previously
(05:58:52 PM) Alan Conway: I don't know if this exact scenario is new,  but tests like this have been run before. The fact they are on VMs is new, I don't know if that's relevant.
(05:59:30 PM) sdake: yes may be relevant, is vm 1 cpu?
(06:06:10 PM) Alan Conway: cat /proc/cpuinfo shows 3 cpus

Comment 1 Steven Dake 2011-05-19 22:23:19 UTC
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff8f40000
Core was generated by `/usr/sbin/aisexec'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000003a6687c39b in memcpy () from /lib64/libc.so.6
(gdb) where
#0  0x0000003a6687c39b in memcpy () from /lib64/libc.so.6
#1  0x000000000041b84f in __memcpy_ichk (conn=0x651d5c0, msg=0x41c47000, 
    mlen=10272) at /usr/include/bits/string3.h:51
#2  openais_response_send (conn=0x651d5c0, msg=0x41c47000, mlen=10272)
    at ipc.c:926
#3  0x000000000041bac1 in pthread_ipc_consumer (conn=0x651d5c0) at ipc.c:392
#4  0x0000003a6700673d in start_thread () from /lib64/libpthread.so.0
#5  0x0000003a668d44bd in clone () from /lib64/libc.so.6
(gdb) up
#1  0x000000000041b84f in __memcpy_ichk (conn=0x651d5c0, msg=0x41c47000, 
    mlen=10272) at /usr/include/bits/string3.h:51
warning: Source file is more recent than executable.
51	  return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
(gdb) up
#2  openais_response_send (conn=0x651d5c0, msg=0x41c47000, mlen=10272)
    at ipc.c:926
926		memcpy (conn_info->mem->res_buffer, msg, mlen);
(gdb) up
#3  0x000000000041bac1 in pthread_ipc_consumer (conn=0x651d5c0) at ipc.c:392
392				openais_response_send (conn_info, &res_overlay, 
(gdb) print conn_info
$1 = (struct conn_info *) 0x2aaaad721420
(gdb) print *conn_info
$2 = {fd = 0, thread = 0, thread_attr = {__size = '\000' <repeats 55 times>, 
    __align = 0}, service = 0, state = CONN_STATE_THREAD_INACTIVE, 
  notify_flow_control_enabled = 0, refcount = 0, shmkey = 0, semkey = 0, 
  shmid = 0, semid = 0, pending_semops = 0, mutex = {__data = {__lock = 0, 
      __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, 
      __list = {__prev = 0x0, __next = 0x0}}, 
    __size = '\000' <repeats 39 times>, __align = 0}, mem = 0x0, outq_head = {
    next = 0x0, prev = 0x0}, private_data = 0x0, lib_exit_fn = 0, list = {
    next = 0x0, prev = 0x0}, setup_msg = '\000' <repeats 23 times>, 
  setup_bytes_read = 0}

Comment 2 Steven Dake 2011-05-19 22:30:04 UTC
Looks like the state of conn info was set to inactive via ipc_disconnect, after the check to exit the thread.  The thread was freed, yet there was an error to send to the ipc connection, but the conn_info data was gone.

Comment 3 Jan Friesse 2011-05-23 08:16:37 UTC
For me it looks much more like uninitialized conn info (memset), but there is no way how to come to such state.

Actually, I was trying to reproduce that issue (service qpid restart), but without any success.

Alan, what version of openais are you running? Is it possible to get access to that machine?

Comment 4 ppecka 2011-05-23 09:49:54 UTC
(In reply to comment #3)
> For me it looks much more like uninitialized conn info (memset), but there is
> no way how to come to such state.
> 
> Actually, I was trying to reproduce that issue (service qpid restart), but
> without any success.
> 
> Alan, what version of openais are you running? Is it possible to get access to
> that machine?

for more details please see https://bugzilla.redhat.com/show_bug.cgi?id=703590

Comment 5 Jan Friesse 2011-05-23 11:23:42 UTC
Also very interesting is openais_response_send (conn=0x651d5c0, msg=0x41c47000, mlen=10272). We are never sending such big messages as response, messages are few bytes (20-40). This really seems to be ether problem with:
- openais falling somewhere else this place is simply where problem appear
- VM (KVM/Xen problem)
- host machine memory error/...

First option which I would like to exclude is VM. So, is this bug reproducible on non VM environment?

Comment 6 Steven Dake 2011-05-23 13:04:35 UTC
Regarding comment #5, we do send a 10272 byte message at ipc.c:468 

                        res_overlay.header.size =
                                        ais_service[conn_info->service]->lib_service[header->id].response_size;
                        res_overlay.header.id =
                                ais_service[conn_info->service]->lib_service[header->id].response_id;
                        res_overlay.header.error = SA_AIS_ERR_INVALID_PARAM;
                        openais_response_send (conn_info, &res_overlay,
                                res_overlay.header.size);
                } else


because conn_info->service is corrupted, response_size is not a valid value.

Still think likely issue is following set of conditions:
ipc.c:408 executes (checks if ipc connection still alive)
ipc.c:793 executes (someone calls ipc_disconnect, which matches with test case)
ipc.c:423 executes (a now-freed send buffer is passed to a later function)
ipc.c:427 executes (returning reserved_msgs = -1 because freed/memset buffer is passed from ipc.c:423)
ipc.c:462 executes (reserved = -1, the conn info data is broken, resulting in incorrect result size)

Note this can happen given the multithreaded nature of the system.

Comment 7 Jan Friesse 2011-05-23 13:47:37 UTC
(In reply to comment #6)
> Regarding comment #5, we do send a 10272 byte message at ipc.c:468 
> 
>                         res_overlay.header.size =
>                                        
> ais_service[conn_info->service]->lib_service[header->id].response_size;
>                         res_overlay.header.id =
>                                
> ais_service[conn_info->service]->lib_service[header->id].response_id;
>                         res_overlay.header.error = SA_AIS_ERR_INVALID_PARAM;
>                         openais_response_send (conn_info, &res_overlay,
>                                 res_overlay.header.size);
>                 } else
> 
> 
Ya

> because conn_info->service is corrupted, response_size is not a valid value.
> 
> Still think likely issue is following set of conditions:
> ipc.c:408 executes (checks if ipc connection still alive)
> ipc.c:793 executes (someone calls ipc_disconnect, which matches with test case)

This may happen

> ipc.c:423 executes (a now-freed send buffer is passed to a later function)

This cannot happen. Simply because:
ipc_disconnect is doing nothing. It simply sets conn_info->state to CONN_STATE_THREAD_REQUEST_EXIT. This may be followed by conn_info_destroy called by poll_handler_connection if state is CONN_STATE_THREAD_REQUEST_EXIT (what is). 

BUT conn_info_destroy will in such case call pthread_join (conn_info->thread, &retval) -> no free, no shmctl

> ipc.c:427 executes (returning reserved_msgs = -1 because freed/memset buffer is
> passed from ipc.c:423)
> ipc.c:462 executes (reserved = -1, the conn info data is broken, resulting in
> incorrect result size)
> 
> Note this can happen given the multithreaded nature of the system.

That's different case then the one I was trying to reproduce by simulating sync delay and ipc:484 (should give you the same). And after 5-6 hrs, I'm pretty sure that this part of ipc is well written and really bullet prof.

Anyway, tomorrow I have meeting with ppecka, and he will show me how to reproduce that bug.

Comment 8 Jan Friesse 2011-05-24 14:20:07 UTC
Few additional notes:
- actual hang from coredump is Overload, tell library to retry section
- it also seems thet right before response_send, conn_info service was in different state then in time of call
- machine where bug appeared was not updated openais-0.80.6-28.el5 -> trying to update to current Z. Hopefully problem is already fixed
- ppecka was not able to reproduce core dump, he was only able to reproduce original bug https://bugzilla.redhat.com/show_bug.cgi?id=703590
- there are "weird" OOM conditions

Comment 9 Jan Friesse 2011-06-08 15:11:50 UTC
Because nether Allan, PPecka nor me were able to reproduce this bug after upgrade to latest Z-stream package 0.80.6-28.1, closing this bug as a duplicate.

*** This bug has been marked as a duplicate of bug 694181 ***

Comment 10 jackie 2011-11-23 14:16:53 UTC
(In reply to comment #9)
> Because nether Allan, PPecka nor me were able to reproduce this bug after
> upgrade to latest Z-stream package 0.80.6-28.1, closing this bug as a
> duplicate.
> 
> *** This bug has been marked as a duplicate of bug 694181 ***

How did u solve this problem? The same problem occured in my site,could the following patch solves this problem ? I'm not familiar with the code of openais,
could you review for me ? thank u.

Comment 11 jackie 2011-11-23 14:21:49 UTC
Index: openais-0.80.6/exec/ipc.c
===================================================================
--- openais-0.80.6/exec/ipc.c	(revision 2186)
+++ openais-0.80.6/exec/ipc.c	(working copy)
@@ -954,11 +954,10 @@
 
 		conn_info = list_entry (list, struct conn_info, list);
 
-		shmdt (conn_info->mem);
-		shmctl (conn_info->shmid, IPC_RMID, NULL);
-		semctl (conn_info->semid, 0, IPC_RMID);
-	
-		pthread_kill (conn_info->thread, SIGUSR1);
+        ipc_disconnect(conn_info);
+
+        while(conn_info_destroy(conn_info) != -1)
+            ;
 	}
 }


Note You need to log in before you can comment on or make changes to this bug.