+++ This bug was initially created as a clone of Bug #625601 +++ Description of problem: Aug 19 21:36:32 taft-01 clvmd: Cluster LVM daemon started - connected to CMAN Aug 19 21:36:32 taft-01 kernel: device-mapper: dm-log-userspace: version 1.0.0 loaded Aug 19 21:36:32 taft-01 dmeventd[2446]: dmeventd ready for processing. Aug 19 21:36:32 taft-01 lvm[2446]: Monitoring mirror device TAFT-ha1 for events. Aug 19 21:36:32 taft-01 lvm[2446]: TAFT-ha1 is now in-sync. Aug 19 21:36:33 taft-01 abrt[2449]: saved core dump of pid 2057 (/usr/sbin/corosync) to /var/spool/abrt/ccpp-1282271793-2057.new/coredump (69464064 bytes) Aug 19 21:36:33 taft-01 abrtd: Directory 'ccpp-1282271793-2057' creation detected Aug 19 21:36:33 taft-01 fenced[2112]: cluster is down, exiting Aug 19 21:36:33 taft-01 gfs_controld[2148]: daemon cpg_dispatch error 2 Aug 19 21:36:33 taft-01 dlm_controld[2128]: cluster is down, exiting Aug 19 21:36:33 taft-01 gfs_controld[2148]: cluster is down, exiting Aug 19 21:36:33 taft-01 abrtd: Registered Database plugin 'SQLite3' Aug 19 21:36:33 taft-01 abrtd: Package 'corosync' isn't signed with proper key Aug 19 21:36:33 taft-01 abrtd: Corrupted or bad crash /var/spool/abrt/ccpp-1282271793-2057 (res:5), deleting Aug 19 21:36:35 taft-01 cmirrord[2268]: Sync checkpoint section creation failed: SA_AIS_ERR_LIBRARY Aug 19 21:36:37 taft-01 cmirrord[2268]: [0tBLqj9e] Failed to export checkpoint for 4 Aug 19 21:36:37 taft-01 kernel: dlm: closing connection to node 4 Aug 19 21:36:37 taft-01 kernel: dlm: closing connection to node 3 Aug 19 21:36:37 taft-01 kernel: dlm: closing connection to node 2 Aug 19 21:36:37 taft-01 kernel: dlm: closing connection to node 1 Aug 19 21:36:37 taft-01 kernel: dlm: clvmd: no userland control daemon, stopping lockspace Aug 19 21:36:39 taft-01 cmirrord[2268]: [0tBLqj9e] Failed to open checkpoint for 4: SA_AIS_ERR_LIBRARY Aug 19 21:36:39 taft-01 cmirrord[2268]: [0tBLqj9e] Failed to export checkpoint for 4 Aug 19 21:36:41 taft-01 cmirrord[2268]: [0tBLqj9e] Failed to open checkpoint for 4: SA_AIS_ERR_LIBRARY Aug 19 21:36:41 taft-01 cmirrord[2268]: [0tBLqj9e] Failed to export checkpoint for 4 Aug 19 21:36:43 taft-01 cmirrord[2268]: [0tBLqj9e] Failed to open checkpoint for 4: SA_AIS_ERR_LIBRARY Aug 19 21:36:43 taft-01 cmirrord[2268]: [0tBLqj9e] Failed to export checkpoint for 4 (gdb) where #0 0x0000003def2329b5 in raise () from /lib64/libc.so.6 #1 0x0000003def234195 in abort () from /lib64/libc.so.6 #2 0x0000003def26fe1b in __libc_message () from /lib64/libc.so.6 #3 0x0000003def2fb447 in __fortify_fail () from /lib64/libc.so.6 #4 0x0000003def2f9340 in __chk_fail () from /lib64/libc.so.6 #5 0x0000003def2f8799 in _IO_str_chk_overflow () from /lib64/libc.so.6 #6 0x0000003def273f1c in _IO_default_xsputn_internal () from /lib64/libc.so.6 #7 0x0000003def247b36 in vfprintf () from /lib64/libc.so.6 #8 0x0000003def2f883d in __vsprintf_chk () from /lib64/libc.so.6 #9 0x00007fc81a525ee2 in vsprintf (rec_ident=4294965351, function_name=0x7fc818331c80 "checkpoint_section_find", file_name=0x7fc818330d00 "ckpt.c", file_line=919, format=<value optimized out>, ap=<value optimized out>) at /usr/include/bits/stdio2.h:47 #10 _logsys_log_vprintf (rec_ident=4294965351, function_name=0x7fc818331c80 "checkpoint_section_find", file_name=0x7fc818330d00 "ckpt.c", file_line=919, format=<value optimized out>, ap=<value optimized out>) at logsys.c:1325 #11 0x00007fc81a5261da in _logsys_log_printf (rec_ident=<value optimized out>, function_name=<value optimized out>, file_name=<value optimized out>, file_line=<value optimized out>, format=<value optimized out>) at logsys.c:1398 #12 0x00007fc818329663 in checkpoint_section_find (checkpoint=0x19b2320, id=0x7ffffd28c4d8 "sync_bits\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377"..., id_len=9) at ckpt.c:919 #13 0x00007fc81832dd9a in message_handler_req_exec_ckpt_sectioncreate (message=0x7ffffd28c390, nodeid=<value optimized out>) at ckpt.c:1801 #14 0x0000000000406ec0 in deliver_fn (nodeid=1, msg=0x7ffffd28c390, msg_len=<value optimized out>, endian_conversion_required=0) at main.c:758 #15 0x00007fc81a74122f in app_deliver_fn (nodeid=1, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_required=0) at totempg.c:506 #16 0x00007fc81a7417b3 in totempg_deliver_fn (nodeid=1, msg=0x19b3112, msg_len=0, endian_conversion_required=0) at totempg.c:618 #17 0x00007fc81a73958f in messages_deliver_to_app (instance=0x7fc8135cb010, skip=0, end_point=<value optimized out>) at totemsrp.c:3701 #18 0x00007fc81a73f594 in message_handler_orf_token (instance=<value optimized out>, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:3575 #19 0x00007fc81a7357c3 in rrp_deliver_fn (context=0x194e610, msg=0x1974b3c, msg_len=71) at totemrrp.c:1393 #20 0x00007fc81a7346b6 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x1974470) at totemudp.c:1244 #21 0x00007fc81a73068a in poll_run (handle=5429246941735157760) at coropoll.c:435 #22 0x00000000004056ab in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:1607 Version-Release number of selected component (if applicable): openais-1.1.1-5.el6.x86_64 corosync-1.2.3-20.el6.x86_64 2.6.32-59.1.el6.x86_64 lvm2-2.02.72-8.el6 BUILT: Wed Aug 18 10:41:52 CDT 2010 lvm2-libs-2.02.72-8.el6 BUILT: Wed Aug 18 10:41:52 CDT 2010 lvm2-cluster-2.02.72-8.el6 BUILT: Wed Aug 18 10:41:52 CDT 2010 udev-147-2.22.el6 BUILT: Fri Jul 23 07:21:33 CDT 2010 device-mapper-1.02.53-8.el6 BUILT: Wed Aug 18 10:41:52 CDT 2010 device-mapper-libs-1.02.53-8.el6 BUILT: Wed Aug 18 10:41:52 CDT 2010 device-mapper-event-1.02.53-8.el6 BUILT: Wed Aug 18 10:41:52 CDT 2010 device-mapper-event-libs-1.02.53-8.el6 BUILT: Wed Aug 18 10:41:52 CDT 2010 cmirror-2.02.72-8.el6 BUILT: Wed Aug 18 10:41:52 CDT 2010 How reproducible: Everytime I attempt to start clvmd with the cluster in "this state". --- Additional comment from sdake on 2010-08-19 18:12:50 EDT --- The issue is sync_bits section id length as given to the ckpt service is = 9. This does not account for the null termination that the pretty printing in ckpt does. We can't add a null termination inside ckpt because it will break wire compat. The two possible solutions are to fix the numerous pretty printing that takes place inside the ckpt service or alternatively increase the "sync_bits" and possibly other length parameters used by cmirror passed into ckpt. --- Additional comment from jbrassow on 2010-08-20 15:50:37 EDT --- Don't add support for NULL terminated strings to the structures - or as others to pass in NULL terminated strings - just print the strings as allowed by printf ("%.*s" format). --- Additional comment from sdake on 2010-08-20 17:25:48 EDT --- General solution of patch looks reasonable, although attachment 440026 [details] has several FIXMEs which remain unaddressed. Ryan will have to resolve those issues in an updated patch.
Steve, since this hasn't been hit in RHEL5 can you provide a reproducer and test results?
When I finish the patch I am planning to test with saftest (~100 ckpt test cases). Don't have any idea how to reproduce, however code in rhel5 is nearly identical to code in rhel6 for this service.
tested with debug: on to ensure all the printing statements were triggered saftest ckpt test cases run on openais-0.80.6-16.el5 (rhel5.5 base release) [root@mrg-02 AIS-ckpt-B.01.01]# ./run_tests.sh all ./run_tests.sh: line 32: 17861 Alarm clock $TIMEOUT_EXE 0 > /dev/null 2>&1 ./report.sh all [root@mrg-02 AIS-ckpt-B.01.01]# ./report.sh all [root@mrg-02 AIS-ckpt-B.01.01]# more result.txt Total run test cases: 403 Pass: 394 Fail: 7 Block: 0 Notsupport: 0 Noresolved: 2 Unknown reason: 0 Finish testing [root@mrg-02 AIS-ckpt-B.01.01]# ls find-test LDFLAGS Makefile result.txt src include log report.sh run_tests.sh saftest ckpt test cases run on scratch build including attached patch: (scratch build at https://brewweb.devel.redhat.com/taskinfo?taskID=2713013 [root@mrg-02 AIS-ckpt-B.01.01]# ./report.sh all [root@mrg-02 AIS-ckpt-B.01.01]# ls find-test LDFLAGS Makefile result.txt src include log report.sh run_tests.sh [root@mrg-02 AIS-ckpt-B.01.01]# more result.txt Total run test cases: 403 Pass: 395 Fail: 6 Block: 0 Notsupport: 0 Noresolved: 2 Unknown reason: 0 Finish testing [root@mrg-02 AIS-ckpt-B.01.01]# Patch does not appear to introduce regressions.
Created attachment 441310 [details] revision 2159 to fix problem
Steve, Does the newly passing test case cover the fix you're proposing here?
There is no test case that is available to reproduce this issue - however the code is clearly defective. The ~400 test cases run in comment #4 provide high degree of confidence there is no regression as a result of the patch.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: When a checkpoint name was not terminated with the NULL character, the aisexec process may have terminated unexpectedly with a segmentation fault, causing a cluster outage. With this update, the underlying source code has been modified to resolve this issue, and such strings no longer cause aisexec to crash.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0100.html