| Summary: | libqb crash in qb_ipcc_event_recv | ||
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Fabio Massimo Di Nitto <fdinitto> |
| Component: | libqb | Assignee: | Angus Salkeld <asalkeld> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | rawhide | CC: | asalkeld, sdake, teigland |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2012-05-29 00:46:59 UTC | Type: | --- |
New backtrace with 0.9.0-2
Program received signal SIGSEGV, Segmentation fault.
sem_trywait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_trywait.S:31
31 movl (%rdi), %eax
(gdb) bt
#0 sem_trywait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_trywait.S:31
#1 0x00007ffff5f1d314 in my_posix_sem_timedwait (rb=0x555555c80e70,
ms_timeout=0) at ringbuffer_helper.c:39
#2 0x00007ffff5f1ccd6 in qb_rb_chunk_read (rb=0x555555c80e70,
data_out=0x7fffffefe380, len=1048576, timeout=<optimized out>)
at ringbuffer.c:541
#3 0x00007ffff5f1fb2e in qb_ipcc_event_recv (c=0x555555b808b0,
msg_pt=msg_pt@entry=0x7fffffefe380, msg_len=msg_len@entry=1048576,
ms_timeout=ms_timeout@entry=0) at ipcc.c:290
#4 0x00007ffff6f4c9f1 in cpg_dispatch (handle=1822089774534492161,
dispatch_types=CS_DISPATCH_ALL)
at /home/fabbione/work/cluster/corosync/corosync/lib/cpg.c:355
#5 0x000055555555e770 in process_cpg_lockspace (ci=6) at cpg.c:1922
#6 0x0000555555563290 in loop () at main.c:964
#7 0x0000555555563f66 in main (argc=3, argv=0x7fffffffe538) at main.c:1300
With the latest corosync and libqb I don't get a crash, but I see
dlm_controld not handling an error:
if (error != CPG_OK)
log_error("daemon cpg_dispatch error %d", error);
cpg_dispatch returns CS_ERR_TIMEOUT. The timeout comes from libqb, but
since this is a change in behavior we can change the wrapper function to
return CS_ERR_TRY_AGAIN instead (this seems to work well):
diff --git a/include/corosync/corotypes.h b/include/corosync/corotypes.h
index 74183d8..c67bf29 100644
--- a/include/corosync/corotypes.h
+++ b/include/corosync/corotypes.h
@@ -151,6 +151,7 @@ static inline cs_error_t qb_to_cs_error (int result)
case ENOMEM:
err = CS_ERR_NO_MEMORY;
break;
+ case ETIMEDOUT:
case EAGAIN:
err = CS_ERR_TRY_AGAIN;
break;
@@ -158,7 +159,6 @@ static inline cs_error_t qb_to_cs_error (int result)
err = CS_ERR_FAILED_OPERATION;
break;
case ETIME:
- case ETIMEDOUT:
err = CS_ERR_TIMEOUT;
break;
case EINVAL:
wfm - close out bug when pushed. Regards -steve
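The remapping proposed in the patch above can be sketched as a standalone function. This is only an illustration, not the corosync source: the `sketch_` names and the enum values are hypothetical stand-ins for `cs_error_t` and `qb_to_cs_error`, and the convention that libqb results arrive as negative errno values is an assumption. It shows ETIMEDOUT sharing the EAGAIN branch, so that a caller which already retries on "try again" no longer sees a timeout error.

```c
#include <errno.h>

/* Hypothetical stand-ins for the cs_error_t codes named in the patch.
 * The numeric values are illustrative and NOT the real corosync values. */
typedef enum {
    SKETCH_OK,
    SKETCH_ERR_NO_MEMORY,
    SKETCH_ERR_TRY_AGAIN,
    SKETCH_ERR_TIMEOUT,
    SKETCH_ERR_INVALID_PARAM,
    SKETCH_ERR_LIBRARY
} sketch_error_t;

/* Sketch of the patched mapping. Assumption: results are negative errno
 * values on failure and non-negative on success. */
sketch_error_t sketch_qb_to_cs_error(int result)
{
    if (result >= 0) {
        return SKETCH_OK;
    }

    switch (-result) {
    case ENOMEM:
        return SKETCH_ERR_NO_MEMORY;
    case ETIMEDOUT: /* moved into this branch by the patch */
    case EAGAIN:
        return SKETCH_ERR_TRY_AGAIN;
    case ETIME:     /* still reported as a timeout */
        return SKETCH_ERR_TIMEOUT;
    case EINVAL:
        return SKETCH_ERR_INVALID_PARAM;
    default:
        return SKETCH_ERR_LIBRARY;
    }
}

/* Hypothetical caller-side helper: returns 1 when the error only means
 * "nothing to dispatch right now", so the caller can retry on the next
 * poll instead of logging a failure. */
int sketch_is_retryable(sketch_error_t err)
{
    return err == SKETCH_ERR_TRY_AGAIN;
}
```

With this mapping, a zero-timeout receive that finds nothing to read is reported as retryable, while ETIME still maps to a timeout.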
This is not super simple to reproduce.
447 set_members lockspace rmdir "/sys/kernel/config/dlm/cluster/spaces/clvmd"
447 write "0" to "/sys/kernel/dlm/clvmd/event_done"
447 clvmd purged 0 plocks for 2
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7364630 in sem_trywait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.15-2.fc17.x86_64 libgcc-4.7.0-0.10.fc17.x86_64 libqb-0.9.0-1.10.42d2.dirty.fc17.x86_64 libxml2-2.7.8-7.fc17.x86_64 zlib-1.2.5-6.fc17.x86_64
(gdb) bt
#0 0x00007ffff7364630 in sem_trywait () from /lib64/libpthread.so.0
#1 0x00007ffff5f19584 in ?? () from /usr/lib64/libqb.so.0
#2 0x00007ffff5f18f46 in qb_rb_chunk_read () from /usr/lib64/libqb.so.0
#3 0x00007ffff5f1b6de in qb_ipcc_event_recv () from /usr/lib64/libqb.so.0
#4 0x00007ffff6f4bae4 in cpg_dispatch (handle=1822089774534492161,
dispatch_types=CS_DISPATCH_ALL) at cpg.c:355
#5 0x000055555555e770 in process_cpg_lockspace (ci=6) at cpg.c:1922
#6 0x0000555555563290 in loop () at main.c:964
#7 0x0000555555563f66 in main (argc=3, argv=0x7fffffffe538) at main.c:1300
(gdb)
You need libqb with no timerfd.
corosync master, dlm-3.99.0-3 from rawhide and lvm2-cluster from rawhide.
quorum {
    provider: corosync_votequorum
    two_node: 1
    wait_for_all: 0
    last_man_standing: 0
    auto_tie_breaker: 0
}
nodelist {
    node {
        ring0_addr: 192.168.2.193
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.2.194
        nodeid: 2
    }
}
start corosync -f on both nodes
start dlm_controld -f0 -D on both nodes
start clvmd -d1 on both nodes
kill clvmd (ctrl+c) on node2 and then on node1
One of the two kills, in a very high amount of cases (almost always), will cause dlm_controld to segfault.