Bug 547511 - corosync IPC shutdown issue leads to several problems
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: corosync
Version: 12
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Jan Friesse
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 547813
 
Reported: 2009-12-14 20:32 UTC by Fabio Massimo Di Nitto
Modified: 2010-02-16 12:28 UTC (History)
3 users

Fixed In Version:
Clone Of:
Clones: 547813 (view as bug list)
Environment:
Last Closed: 2010-02-16 12:28:35 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
code for 2 test cases (1.65 KB, text/plain) - 2009-12-14 20:32 UTC, Fabio Massimo Di Nitto
Proposed patch (4.87 KB, patch) - 2009-12-17 12:17 UTC, Jan Friesse
Proposed patch - take 2 (4.81 KB, patch) - 2010-01-07 16:01 UTC, Jan Friesse
Test case (487 bytes, application/x-sh) - 2010-01-07 16:04 UTC, Jan Friesse
proposed patchset - 1/6 (2.34 KB, patch) - 2010-02-16 12:25 UTC, Jan Friesse
proposed patchset - 2/6 (3.88 KB, patch) - 2010-02-16 12:25 UTC, Jan Friesse
proposed patchset - 3/6 (759 bytes, patch) - 2010-02-16 12:26 UTC, Jan Friesse
proposed patchset - 4/6 (896 bytes, patch) - 2010-02-16 12:26 UTC, Jan Friesse
proposed patchset - 5/6 (4.23 KB, patch) - 2010-02-16 12:27 UTC, Jan Friesse
proposed patchset - 6/6 (11.05 KB, patch) - 2010-02-16 12:27 UTC, Jan Friesse

Description Fabio Massimo Di Nitto 2009-12-14 20:32:18 UTC
Created attachment 378340 [details]
code for 2 test cases

Description of problem:

The corosync shutdown process leaves open a window in which the daemon has started shutting down but still accepts new IPC connections.

Because of this incorrect behavior, several issues can be seen:

a) corosync crashes on shutdown (always)
b) clients connected to corosync hang (almost always - very racy)
c) corosync does not shut down at all (rare - also racy)

Version-Release number of selected component (if applicable):

1.2.0-1

How reproducible:

Depends on the test case and on which race we hit from one test run to the next.

Steps to Reproduce:
1. Build the test case in the attachment (it defaults to triggering #a and #b).
2. Start corosync -f (or plain corosync) in one terminal.
3. Run the test case in another terminal.
4a. kill -TERM $(pidof corosync)
4b. ctrl+c (if running corosync -f)
  
Actual results:

a) corosync will always segfault on exit:

^CDec 14 20:58:34 corosync [SERV  ] Unloading all Corosync service engines.
Dec 14 20:58:34 corosync [SERV  ] Service engine unloaded: corosync extended virtual synchrony service
Dec 14 20:58:34 corosync [SERV  ] Service engine unloaded: corosync configuration service
Dec 14 20:58:34 corosync [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Dec 14 20:58:34 corosync [SERV  ] Service engine unloaded: corosync cluster config database access v1.01
Dec 14 20:58:34 corosync [SERV  ] Service engine unloaded: corosync profile loading service
Dec 14 20:58:34 corosync [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Segmentation fault

b) the test case will hang forever waiting for a reply from corosync (which is already dead):

0x00a68416 in __kernel_vsyscall ()
Missing separate debuginfos, use: debuginfo-install glibc-2.11-2.i686
(gdb) bt
#0  0x00a68416 in __kernel_vsyscall ()
#1  0x007d0f85 in sem_wait@@GLIBC_2.1 () from /lib/libpthread.so.0
#2  0x00952628 in reply_receive (res_len=<value optimized out>, 
    res_msg=<value optimized out>, ipc_instance=<value optimized out>)
    at coroipcc.c:465
#3  coroipcc_msg_send_reply_receive (res_len=<value optimized out>, 
    res_msg=<value optimized out>, ipc_instance=<value optimized out>)
    at coroipcc.c:973
#4  0x00ca924c in confdb_key_increment (handle=<value optimized out>, 
    parent_object_handle=<value optimized out>, 
    key_name=<value optimized out>, key_name_len=<value optimized out>, 
    value=<value optimized out>) at confdb.c:1001
#5  0x0804892d in test_objdb ()
#6  0x0804897b in main ()
(gdb) q

c) I don't have a reduced test case for this yet, but we (Chrissie and I) have seen corosync stop the shutdown process with all threads locked on a mutex in objdb. We reproduce this one using cluster stable3 on top of corosync. Note that this might be unrelated to the same issue (it's difficult to discern the issues), in which case we will file another bug.

d) Even more rarely, we have seen this backtrace (always in conjunction with a shutdown operation and active clients):

*** buffer overflow detected ***: corosync terminated
======= Backtrace: =========
/lib/libc.so.6(__fortify_fail+0x4d)[0x4bec5d]
/lib/libc.so.6(+0xf1d7a)[0x4bcd7a]
/usr/libexec/lcrso/service_confdb.lcrso(+0x1788)[0x257788]
/usr/lib/libcoroipcs.so.4(+0x2105)[0xdda105]
/lib/libpthread.so.0(+0x5ab5)[0x29dab5]
/lib/libc.so.6(clone+0x5e)[0x4a583e]

Expected results:

- corosync should not crash
- IPC should be locked (to refuse new connections) before the shutdown process starts
- current IPC connections should be notified of the shutdown (so clients can fail gracefully rather than hang)
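The IPC-locking behavior requested here could be sketched as follows. This is a minimal illustration of gating the accept path behind a shutdown flag; the names (ipc_refuse_new_connections, ipc_try_accept) are hypothetical and are not corosync's actual API:

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical sketch -- not corosync's real IPC code. */
static pthread_mutex_t ipc_lock = PTHREAD_MUTEX_INITIALIZER;
static bool shutting_down = false;

/* Called once, at the very start of the shutdown sequence,
 * before any service engine is unloaded. */
void ipc_refuse_new_connections(void)
{
	pthread_mutex_lock(&ipc_lock);
	shutting_down = true;
	pthread_mutex_unlock(&ipc_lock);
}

/* Accept path: returns -1 (refuse) once shutdown has begun,
 * 0 (accept) otherwise. */
int ipc_try_accept(void)
{
	int ret;

	pthread_mutex_lock(&ipc_lock);
	ret = shutting_down ? -1 : 0;
	pthread_mutex_unlock(&ipc_lock);
	return ret;
}
```

With this shape, any connection attempt arriving after the shutdown flag is set is rejected up front instead of being handed to a service engine that is about to be unloaded.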

Additional info:

As mentioned, there might be multiple bugs involved. Please split as required and we will follow up as necessary.

This is a major blocker at the moment.

Comment 1 Steven Dake 2009-12-14 20:48:04 UTC
The reason this is a critical issue: if the config file is invalid or too old, corosync is told to exit immediately, which can trigger an init lockup.

Comment 3 Jan Friesse 2009-12-17 12:17:01 UTC
Created attachment 378976 [details]
Proposed patch

Patch sent to the mailing list.

Comment 4 Fabio Massimo Di Nitto 2009-12-17 13:13:58 UTC
The patch makes some progress but doesn't fix everything.

Here are the test results using the test cases attached to this bug.

test_objdb seems to pass now, and the client connection is closed without problems, but this might simply be luck, because the next test shows there is still a race condition.

test_conn still triggers a segfault.

term1:
corosync -f
....
Dec 17 14:12:20 corosync [MAIN  ] Completed service synchronization, ready to provide service.

term2:
./test

term3:
killall -TERM corosync

and term1 segfaults.

Comment 5 Fabio Massimo Di Nitto 2009-12-17 13:17:08 UTC
(In reply to comment #0)

> c) corosync does not shutdown at all (rare - also racy)

---

> c) i don't have a reduced test case for it yet, but we (chrissie and I) have
> seen that corosync will stop the shutdown process and all threads are locked on
> a mutex in objdb. We reproduce this one using cluster stable3 on top of
> corosync. Note that this bug might be unrelated to this same issue (it's
> difficult to discern the issues) but in case we will file another bug.

I cannot find a way to reproduce this in a small test case.

This is a way to do it:

- setup a cman cluster (2/3 nodes)

on one of the nodes:

term1:
cman_tool -d join

term2:
while [ ! -f stop ]; do ccs_tool query /cluster/@name; done

term1:
cman_tool leave

This will cause ccs_tool query to hang while that objdb mutex lock is held, and corosync will _never_ shut down.

Probably worth a separate bug; if so, let's just clone this one.

Fabio

Comment 6 Jan Friesse 2009-12-17 15:04:19 UTC
The segfaults are caused by pthread_join in recently added code. Some additional tweaks will be needed for that code to work properly. One option is a mutex together with a variable: the variable can be a pointer to pthread_t, or NULL if the thread does not exist.

Of course, maybe Steve will find a better and nicer solution, but I'm about to leave on PTO.
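The mutex-plus-variable idea above might look like the following minimal sketch. The names (worker_start, worker_join, worker_fn) are hypothetical and this is not the actual corosync fix; it only illustrates how a NULL-able pthread_t pointer keeps shutdown from joining a thread that never existed or was already joined:

```c
#include <pthread.h>
#include <stdlib.h>

/* Sketch: track the worker thread with a mutex-protected pointer that
 * is NULL whenever no thread exists, so shutdown never calls
 * pthread_join() on an absent or already-joined thread. */
static pthread_mutex_t thread_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_t *worker = NULL;

static void *worker_fn(void *arg)
{
	(void)arg;
	return NULL;
}

/* Start the worker if none exists; returns 0 on success, -1 otherwise. */
int worker_start(void)
{
	int ret = -1;

	pthread_mutex_lock(&thread_lock);
	if (worker == NULL) {
		worker = malloc(sizeof(*worker));
		if (worker != NULL &&
		    pthread_create(worker, NULL, worker_fn, NULL) == 0) {
			ret = 0;
		} else {
			free(worker);
			worker = NULL;
		}
	}
	pthread_mutex_unlock(&thread_lock);
	return ret;
}

/* Join only if a thread actually exists; safe to call repeatedly.
 * Returns pthread_join's result, or -1 if there was nothing to join. */
int worker_join(void)
{
	int ret = -1;

	pthread_mutex_lock(&thread_lock);
	if (worker != NULL) {
		ret = pthread_join(*worker, NULL);
		free(worker);
		worker = NULL;
	}
	pthread_mutex_unlock(&thread_lock);
	return ret;
}
```

Because the pointer is reset to NULL under the lock after a successful join, a second call to worker_join() is a harmless no-op rather than a crash.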

Comment 7 Jan Friesse 2010-01-07 16:01:54 UTC
Created attachment 382272 [details]
Proposed patch - take 2

Fabio,
this is a different patch. It should apply to the current corosync trunk. In my testing (test case in the next attachment) it never fails. I didn't have time to test case c).

Comment 8 Jan Friesse 2010-01-07 16:04:41 UTC
Created attachment 382274 [details]
Test case

This is the test case I use for testing patch take 2. testconn and testobjdb are the compiled "code for 2 test cases".

Comment 9 Fabio Massimo Di Nitto 2010-01-08 10:14:15 UTC
So the patch looks a lot better.

I don't see any more hangs on shutdown, and IPC seems to exit properly.

I can still see a segfault in the executive at shutdown, when running the test case code.

Anyway, this is a big step forward.

Fabio

Comment 10 Jan Friesse 2010-02-16 12:25:37 UTC
Created attachment 394513 [details]
proposed patchset - 1/6

Comment 11 Jan Friesse 2010-02-16 12:25:59 UTC
Created attachment 394514 [details]
proposed patchset - 2/6

Comment 12 Jan Friesse 2010-02-16 12:26:53 UTC
Created attachment 394515 [details]
proposed patchset - 3/6

Comment 13 Jan Friesse 2010-02-16 12:26:59 UTC
Created attachment 394516 [details]
proposed patchset - 4/6

Comment 14 Jan Friesse 2010-02-16 12:27:05 UTC
Created attachment 394517 [details]
proposed patchset - 5/6

Comment 15 Jan Friesse 2010-02-16 12:27:10 UTC
Created attachment 394518 [details]
proposed patchset - 6/6

Comment 16 Jan Friesse 2010-02-16 12:28:35 UTC
I hope that with all the patches (the proposed patchset) the problem is gone; all the patches are already in trunk.

Closing this as UPSTREAM.

