Bug 787208

Summary: libqb crash in qb_ipcc_event_recv
Product: [Fedora] Fedora
Reporter: Fabio Massimo Di Nitto <fdinitto>
Component: libqb
Assignee: Angus Salkeld <asalkeld>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Docs Contact:
Priority: urgent
Version: rawhide
CC: asalkeld, sdake, teigland
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-05-29 00:46:59 UTC
Type: ---

Description Fabio Massimo Di Nitto 2012-02-03 14:59:10 UTC
This is not super simple to reproduce.

447 set_members lockspace rmdir "/sys/kernel/config/dlm/cluster/spaces/clvmd"
447 write "0" to "/sys/kernel/dlm/clvmd/event_done"
447 clvmd purged 0 plocks for 2

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7364630 in sem_trywait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.15-2.fc17.x86_64 libgcc-4.7.0-0.10.fc17.x86_64 libqb-0.9.0-1.10.42d2.dirty.fc17.x86_64 libxml2-2.7.8-7.fc17.x86_64 zlib-1.2.5-6.fc17.x86_64
(gdb) bt
#0  0x00007ffff7364630 in sem_trywait () from /lib64/libpthread.so.0
#1  0x00007ffff5f19584 in ?? () from /usr/lib64/libqb.so.0
#2  0x00007ffff5f18f46 in qb_rb_chunk_read () from /usr/lib64/libqb.so.0
#3  0x00007ffff5f1b6de in qb_ipcc_event_recv () from /usr/lib64/libqb.so.0
#4  0x00007ffff6f4bae4 in cpg_dispatch (handle=1822089774534492161, 
    dispatch_types=CS_DISPATCH_ALL) at cpg.c:355
#5  0x000055555555e770 in process_cpg_lockspace (ci=6) at cpg.c:1922
#6  0x0000555555563290 in loop () at main.c:964
#7  0x0000555555563f66 in main (argc=3, argv=0x7fffffffe538) at main.c:1300
(gdb) 

To reproduce, you need libqb with no timerfd, corosync master, dlm-3.99.0-3 from rawhide, and lvm2-cluster from rawhide.

quorum {
    provider: corosync_votequorum
    two_node: 1
    wait_for_all: 0
    last_man_standing: 0
    auto_tie_breaker: 0
}

nodelist {
        node {
                ring0_addr: 192.168.2.193
                nodeid: 1
        }
        node {
                ring0_addr: 192.168.2.194
                nodeid: 2
        }
}

start corosync -f on both nodes
start dlm_controld -f0 -D on both nodes
start clvmd -d1 on both nodes

kill clvmd (ctrl+c) on node2 and then on node1

Almost always, one of the two kills causes dlm_controld to segfault.

Comment 1 Fabio Massimo Di Nitto 2012-02-06 12:22:46 UTC
New backtrace with libqb 0.9.0-2:

Program received signal SIGSEGV, Segmentation fault.
sem_trywait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_trywait.S:31
31              movl    (%rdi), %eax
(gdb) bt
#0  sem_trywait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_trywait.S:31
#1  0x00007ffff5f1d314 in my_posix_sem_timedwait (rb=0x555555c80e70, 
    ms_timeout=0) at ringbuffer_helper.c:39
#2  0x00007ffff5f1ccd6 in qb_rb_chunk_read (rb=0x555555c80e70, 
    data_out=0x7fffffefe380, len=1048576, timeout=<optimized out>)
    at ringbuffer.c:541
#3  0x00007ffff5f1fb2e in qb_ipcc_event_recv (c=0x555555b808b0, 
    msg_pt=msg_pt@entry=0x7fffffefe380, msg_len=msg_len@entry=1048576, 
    ms_timeout=ms_timeout@entry=0) at ipcc.c:290
#4  0x00007ffff6f4c9f1 in cpg_dispatch (handle=1822089774534492161, 
    dispatch_types=CS_DISPATCH_ALL)
    at /home/fabbione/work/cluster/corosync/corosync/lib/cpg.c:355
#5  0x000055555555e770 in process_cpg_lockspace (ci=6) at cpg.c:1922
#6  0x0000555555563290 in loop () at main.c:964
#7  0x0000555555563f66 in main (argc=3, argv=0x7fffffffe538) at main.c:1300

Comment 2 Angus Salkeld 2012-02-07 03:17:34 UTC
With the latest corosync and libqb I don't get a crash, but I see
dlm_controld not handling an error:

	if (error != CPG_OK)
		log_error("daemon cpg_dispatch error %d", error);
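
For illustration only, here is a minimal sketch of how a dispatch caller could tell a transient result apart from a real failure, rather than logging every non-OK value as an error. This is not dlm_controld's actual code: the demo_* types, their values, and demo_classify_dispatch are hypothetical stand-ins for the corosync result codes named in this report.

```c
#include <stdio.h>

/* Stand-ins for the corosync result codes named in this report.
 * Names and values are illustrative, not the real cs_error_t. */
typedef enum {
    DEMO_OK,
    DEMO_ERR_TRY_AGAIN,
    DEMO_ERR_TIMEOUT,
    DEMO_ERR_OTHER
} demo_error_t;

typedef enum {
    DEMO_ACTION_CONTINUE,
    DEMO_ACTION_RETRY,
    DEMO_ACTION_FAIL
} demo_action_t;

/* Hypothetical helper: decide what a dispatch caller does with a result. */
static demo_action_t demo_classify_dispatch(demo_error_t error)
{
    switch (error) {
    case DEMO_OK:
        return DEMO_ACTION_CONTINUE;
    case DEMO_ERR_TRY_AGAIN:
        /* Transient condition: nothing to read right now, poll again later. */
        return DEMO_ACTION_RETRY;
    default:
        /* Anything else, including a timeout, is logged as an error. */
        fprintf(stderr, "daemon cpg_dispatch error %d\n", (int)error);
        return DEMO_ACTION_FAIL;
    }
}
```

With this split, only a try-again result is treated as retryable; a timeout still ends up on the error path.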

cpg_dispatch returns CS_ERR_TIMEOUT. This comes from libqb, but since it is a
change in behavior, we can change the wrapper function to return
CS_ERR_TRY_AGAIN instead.

(this seems to work well)

diff --git a/include/corosync/corotypes.h b/include/corosync/corotypes.h
index 74183d8..c67bf29 100644
--- a/include/corosync/corotypes.h
+++ b/include/corosync/corotypes.h
@@ -151,6 +151,7 @@ static inline cs_error_t qb_to_cs_error (int result)
        case ENOMEM:
                err = CS_ERR_NO_MEMORY;
                break;
+       case ETIMEDOUT:
        case EAGAIN:
                err = CS_ERR_TRY_AGAIN;
                break;
@@ -158,7 +159,6 @@ static inline cs_error_t qb_to_cs_error (int result)
                err = CS_ERR_FAILED_OPERATION;
                break;
        case ETIME:
-       case ETIMEDOUT:
                err = CS_ERR_TIMEOUT;
                break;
        case EINVAL:
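
The effect of the patch on the mapping can be sketched in isolation as follows. This is a reduced stand-in, not the real qb_to_cs_error: it covers only the cases visible in the diff, the sk_* enum names and values are hypothetical, and it assumes the input follows the libqb convention of returning a negative errno on failure. The fallback to a generic library error is also an assumption.

```c
#include <errno.h>

/* Stand-in for corosync's cs_error_t; names and values are illustrative only. */
typedef enum {
    SK_OK,
    SK_ERR_LIBRARY,
    SK_ERR_TIMEOUT,
    SK_ERR_TRY_AGAIN,
    SK_ERR_NO_MEMORY
} sk_error_t;

/* Assumes libqb-style results: >= 0 is success, negative is -errno. */
static sk_error_t sk_qb_to_cs_error(int result)
{
    if (result >= 0)
        return SK_OK;

    switch (-result) {
    case ENOMEM:
        return SK_ERR_NO_MEMORY;
    case ETIMEDOUT: /* moved next to EAGAIN by the patch */
    case EAGAIN:
        return SK_ERR_TRY_AGAIN;
    case ETIME:     /* still reported as a timeout */
        return SK_ERR_TIMEOUT;
    default:
        return SK_ERR_LIBRARY; /* assumption: generic fallback */
    }
}
```

After the change, a caller that sees -ETIMEDOUT from libqb gets a try-again result instead of a timeout, while -ETIME keeps its old mapping.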

Comment 3 Steven Dake 2012-02-13 17:02:29 UTC
wfm - close out bug when pushed.

Regards
-steve