This is not super simple to reproduce.

447 set_members lockspace rmdir "/sys/kernel/config/dlm/cluster/spaces/clvmd"
447 write "0" to "/sys/kernel/dlm/clvmd/event_done"
447 clvmd purged 0 plocks for 2

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7364630 in sem_trywait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.15-2.fc17.x86_64 libgcc-4.7.0-0.10.fc17.x86_64 libqb-0.9.0-1.10.42d2.dirty.fc17.x86_64 libxml2-2.7.8-7.fc17.x86_64 zlib-1.2.5-6.fc17.x86_64
(gdb) bt
#0  0x00007ffff7364630 in sem_trywait () from /lib64/libpthread.so.0
#1  0x00007ffff5f19584 in ?? () from /usr/lib64/libqb.so.0
#2  0x00007ffff5f18f46 in qb_rb_chunk_read () from /usr/lib64/libqb.so.0
#3  0x00007ffff5f1b6de in qb_ipcc_event_recv () from /usr/lib64/libqb.so.0
#4  0x00007ffff6f4bae4 in cpg_dispatch (handle=1822089774534492161, dispatch_types=CS_DISPATCH_ALL) at cpg.c:355
#5  0x000055555555e770 in process_cpg_lockspace (ci=6) at cpg.c:1922
#6  0x0000555555563290 in loop () at main.c:964
#7  0x0000555555563f66 in main (argc=3, argv=0x7fffffffe538) at main.c:1300
(gdb)

You need libqb with no timerfd, corosync master, dlm-3.99.0-3 from rawhide and lvm2-cluster from rawhide.

corosync configuration:

quorum {
        provider: corosync_votequorum
        two_node: 1
        wait_for_all: 0
        last_man_standing: 0
        auto_tie_breaker: 0
}

nodelist {
        node {
                ring0_addr: 192.168.2.193
                nodeid: 1
        }
        node {
                ring0_addr: 192.168.2.194
                nodeid: 2
        }
}

Steps to reproduce:

1. start corosync -f on both nodes
2. start dlm_controld -f0 -D on both nodes
3. start clvmd -d1 on both nodes
4. kill clvmd (ctrl+c) on node2 and then on node1

One of the two kills will, almost always, cause dlm_controld to segfault.
New backtrace with 0.9.0-2:

Program received signal SIGSEGV, Segmentation fault.
sem_trywait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_trywait.S:31
31              movl    (%rdi), %eax
(gdb) bt
#0  sem_trywait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_trywait.S:31
#1  0x00007ffff5f1d314 in my_posix_sem_timedwait (rb=0x555555c80e70, ms_timeout=0) at ringbuffer_helper.c:39
#2  0x00007ffff5f1ccd6 in qb_rb_chunk_read (rb=0x555555c80e70, data_out=0x7fffffefe380, len=1048576, timeout=<optimized out>) at ringbuffer.c:541
#3  0x00007ffff5f1fb2e in qb_ipcc_event_recv (c=0x555555b808b0, msg_pt=msg_pt@entry=0x7fffffefe380, msg_len=msg_len@entry=1048576, ms_timeout=ms_timeout@entry=0) at ipcc.c:290
#4  0x00007ffff6f4c9f1 in cpg_dispatch (handle=1822089774534492161, dispatch_types=CS_DISPATCH_ALL) at /home/fabbione/work/cluster/corosync/corosync/lib/cpg.c:355
#5  0x000055555555e770 in process_cpg_lockspace (ci=6) at cpg.c:1922
#6  0x0000555555563290 in loop () at main.c:964
#7  0x0000555555563f66 in main (argc=3, argv=0x7fffffffe538) at main.c:1300
With the latest corosync and libqb I don't get a crash, but I see dlm_controld not handling an error:

        if (error != CPG_OK)
                log_error("daemon cpg_dispatch error %d", error);

cpg returns CS_ERR_TIMEOUT. This comes from libqb, but as this is a change in behavior we can change the wrapper function to return TRY_AGAIN instead (this seems to work well):

diff --git a/include/corosync/corotypes.h b/include/corosync/corotypes.h
index 74183d8..c67bf29 100644
--- a/include/corosync/corotypes.h
+++ b/include/corosync/corotypes.h
@@ -151,6 +151,7 @@ static inline cs_error_t qb_to_cs_error (int result)
 	case ENOMEM:
 		err = CS_ERR_NO_MEMORY;
 		break;
+	case ETIMEDOUT:
 	case EAGAIN:
 		err = CS_ERR_TRY_AGAIN;
 		break;
@@ -158,7 +159,6 @@ static inline cs_error_t qb_to_cs_error (int result)
 		err = CS_ERR_FAILED_OPERATION;
 		break;
 	case ETIME:
-	case ETIMEDOUT:
 		err = CS_ERR_TIMEOUT;
 		break;
 	case EINVAL:
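
For illustration, here is a minimal standalone sketch of the remapping the patch above performs. It does not use the corosync headers: the sk_error_t enum, sk_map_result() and sk_should_log() are hypothetical stand-ins, and the convention that a failing result is a negative errno value is an assumption of this sketch. The caller-side helper is only one possible way a daemon could treat TRY_AGAIN as non-fatal; it is not what dlm_controld currently does.

```c
#include <errno.h>

/* Hypothetical stand-ins for the cs_error_t values named in the patch. */
typedef enum {
	SK_OK,
	SK_ERR_TRY_AGAIN,
	SK_ERR_TIMEOUT,
	SK_ERR_NO_MEMORY,
	SK_ERR_LIBRARY
} sk_error_t;

/*
 * Map a result code to a sketch error value.
 * Assumption: non-negative means success, negative means -errno.
 * Only the cases touched by the patch (plus ENOMEM) are modelled.
 */
sk_error_t sk_map_result(int result)
{
	if (result >= 0)
		return SK_OK;

	switch (-result) {
	case ENOMEM:
		return SK_ERR_NO_MEMORY;
	case ETIMEDOUT:	/* moved here by the patch */
	case EAGAIN:
		return SK_ERR_TRY_AGAIN;
	case ETIME:	/* still reported as a timeout */
		return SK_ERR_TIMEOUT;
	default:
		return SK_ERR_LIBRARY;
	}
}

/*
 * Hypothetical caller-side policy: log everything except success
 * and TRY_AGAIN, which the caller would simply retry later.
 */
int sk_should_log(sk_error_t err)
{
	return err != SK_OK && err != SK_ERR_TRY_AGAIN;
}
```

With this mapping, a dispatch call that returns -ETIMEDOUT because nothing was ready surfaces as "try again" rather than as a timeout error, while -ETIME keeps its old meaning.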
wfm - close out bug when pushed.

Regards
-steve