Bug 625601

Summary: openais should handle non null terminated chkpoint strings
Product: Red Hat Enterprise Linux 6 Reporter: Corey Marthaler <cmarthal>
Component: openaisAssignee: Ryan O'Hara <rohara>
Status: CLOSED CURRENTRELEASE QA Contact: Cluster QE <mspqa-list>
Severity: high Docs Contact:
Priority: high    
Version: 6.0CC: agk, antillon.maurizio, cluster-maint, fdinitto, lhh, michael.hagmann, sdake, ssaha
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: openais-1.1.1-6.el6 Doc Type: Bug Fix
Doc Text:
When a checkpoint name was not terminated with the NULL character, the aisexec process may have terminated unexpectedly with a segmentation fault, causing a cluster outage. With this update, the underlying source code has been modified to resolve this issue, and such strings no longer cause aisexec to crash.
Story Points: ---
Clone Of:
: 625947 (view as bug list) Environment:
Last Closed: 2010-11-10 22:13:01 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 625947    
Attachments:
Description Flags
Fix printing of non-NULL terminated strings
none
Minor change to previous patch.
none
Resolve FIXME issues, changes printf statement to use length. none

Description Corey Marthaler 2010-08-19 22:08:44 UTC
Description of problem:

Aug 19 21:36:32 taft-01 clvmd: Cluster LVM daemon started - connected to CMAN
Aug 19 21:36:32 taft-01 kernel: device-mapper: dm-log-userspace: version 1.0.0 loaded
Aug 19 21:36:32 taft-01 dmeventd[2446]: dmeventd ready for processing.
Aug 19 21:36:32 taft-01 lvm[2446]: Monitoring mirror device TAFT-ha1 for events.
Aug 19 21:36:32 taft-01 lvm[2446]: TAFT-ha1 is now in-sync.
Aug 19 21:36:33 taft-01 abrt[2449]: saved core dump of pid 2057 (/usr/sbin/corosync) to /var/spool/abrt/ccpp-1282271793-2057.new/coredump (69464064 bytes)
Aug 19 21:36:33 taft-01 abrtd: Directory 'ccpp-1282271793-2057' creation detected
Aug 19 21:36:33 taft-01 fenced[2112]: cluster is down, exiting
Aug 19 21:36:33 taft-01 gfs_controld[2148]: daemon cpg_dispatch error 2
Aug 19 21:36:33 taft-01 dlm_controld[2128]: cluster is down, exiting
Aug 19 21:36:33 taft-01 gfs_controld[2148]: cluster is down, exiting
Aug 19 21:36:33 taft-01 abrtd: Registered Database plugin 'SQLite3'
Aug 19 21:36:33 taft-01 abrtd: Package 'corosync' isn't signed with proper key
Aug 19 21:36:33 taft-01 abrtd: Corrupted or bad crash /var/spool/abrt/ccpp-1282271793-2057 (res:5), deleting
Aug 19 21:36:35 taft-01 cmirrord[2268]: Sync checkpoint section creation failed: SA_AIS_ERR_LIBRARY
Aug 19 21:36:37 taft-01 cmirrord[2268]: [0tBLqj9e] Failed to export checkpoint for 4
Aug 19 21:36:37 taft-01 kernel: dlm: closing connection to node 4
Aug 19 21:36:37 taft-01 kernel: dlm: closing connection to node 3
Aug 19 21:36:37 taft-01 kernel: dlm: closing connection to node 2
Aug 19 21:36:37 taft-01 kernel: dlm: closing connection to node 1
Aug 19 21:36:37 taft-01 kernel: dlm: clvmd: no userland control daemon, stopping lockspace
Aug 19 21:36:39 taft-01 cmirrord[2268]: [0tBLqj9e] Failed to open checkpoint for 4: SA_AIS_ERR_LIBRARY
Aug 19 21:36:39 taft-01 cmirrord[2268]: [0tBLqj9e] Failed to export checkpoint for 4
Aug 19 21:36:41 taft-01 cmirrord[2268]: [0tBLqj9e] Failed to open checkpoint for 4: SA_AIS_ERR_LIBRARY
Aug 19 21:36:41 taft-01 cmirrord[2268]: [0tBLqj9e] Failed to export checkpoint for 4
Aug 19 21:36:43 taft-01 cmirrord[2268]: [0tBLqj9e] Failed to open checkpoint for 4: SA_AIS_ERR_LIBRARY
Aug 19 21:36:43 taft-01 cmirrord[2268]: [0tBLqj9e] Failed to export checkpoint for 4


(gdb) where
#0  0x0000003def2329b5 in raise () from /lib64/libc.so.6
#1  0x0000003def234195 in abort () from /lib64/libc.so.6
#2  0x0000003def26fe1b in __libc_message () from /lib64/libc.so.6
#3  0x0000003def2fb447 in __fortify_fail () from /lib64/libc.so.6
#4  0x0000003def2f9340 in __chk_fail () from /lib64/libc.so.6
#5  0x0000003def2f8799 in _IO_str_chk_overflow () from /lib64/libc.so.6
#6  0x0000003def273f1c in _IO_default_xsputn_internal () from /lib64/libc.so.6
#7  0x0000003def247b36 in vfprintf () from /lib64/libc.so.6
#8  0x0000003def2f883d in __vsprintf_chk () from /lib64/libc.so.6
#9  0x00007fc81a525ee2 in vsprintf (rec_ident=4294965351, function_name=0x7fc818331c80 "checkpoint_section_find", file_name=0x7fc818330d00 "ckpt.c",
    file_line=919, format=<value optimized out>, ap=<value optimized out>) at /usr/include/bits/stdio2.h:47
#10 _logsys_log_vprintf (rec_ident=4294965351, function_name=0x7fc818331c80 "checkpoint_section_find", file_name=0x7fc818330d00 "ckpt.c", file_line=919,
    format=<value optimized out>, ap=<value optimized out>) at logsys.c:1325
#11 0x00007fc81a5261da in _logsys_log_printf (rec_ident=<value optimized out>, function_name=<value optimized out>, file_name=<value optimized out>,
    file_line=<value optimized out>, format=<value optimized out>) at logsys.c:1398
#12 0x00007fc818329663 in checkpoint_section_find (checkpoint=0x19b2320,
    id=0x7ffffd28c4d8 "sync_bits\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377"..., id_len=9) at ckpt.c:919
#13 0x00007fc81832dd9a in message_handler_req_exec_ckpt_sectioncreate (message=0x7ffffd28c390, nodeid=<value optimized out>) at ckpt.c:1801
#14 0x0000000000406ec0 in deliver_fn (nodeid=1, msg=0x7ffffd28c390, msg_len=<value optimized out>, endian_conversion_required=0) at main.c:758
#15 0x00007fc81a74122f in app_deliver_fn (nodeid=1, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_required=0)
    at totempg.c:506
#16 0x00007fc81a7417b3 in totempg_deliver_fn (nodeid=1, msg=0x19b3112, msg_len=0, endian_conversion_required=0) at totempg.c:618
#17 0x00007fc81a73958f in messages_deliver_to_app (instance=0x7fc8135cb010, skip=0, end_point=<value optimized out>) at totemsrp.c:3701
#18 0x00007fc81a73f594 in message_handler_orf_token (instance=<value optimized out>, msg=<value optimized out>, msg_len=<value optimized out>,
    endian_conversion_needed=<value optimized out>) at totemsrp.c:3575
#19 0x00007fc81a7357c3 in rrp_deliver_fn (context=0x194e610, msg=0x1974b3c, msg_len=71) at totemrrp.c:1393
#20 0x00007fc81a7346b6 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x1974470)
    at totemudp.c:1244
#21 0x00007fc81a73068a in poll_run (handle=5429246941735157760) at coropoll.c:435
#22 0x00000000004056ab in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:1607


Version-Release number of selected component (if applicable):
openais-1.1.1-5.el6.x86_64
corosync-1.2.3-20.el6.x86_64

2.6.32-59.1.el6.x86_64

lvm2-2.02.72-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
lvm2-libs-2.02.72-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
lvm2-cluster-2.02.72-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
udev-147-2.22.el6    BUILT: Fri Jul 23 07:21:33 CDT 2010
device-mapper-1.02.53-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
device-mapper-libs-1.02.53-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
device-mapper-event-1.02.53-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
device-mapper-event-libs-1.02.53-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
cmirror-2.02.72-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010



How reproducible:
Everytime I attempt to start clvmd with the cluster in "this state".

Comment 1 Steven Dake 2010-08-19 22:12:50 UTC
The issue is sync_bits section id length as given to the ckpt service is = 9.  This does not account for the null termination that the pretty printing in ckpt does.

We can't add a null termination inside ckpt because it will break wire compat.

The two possible solutions are to fix the numerous pretty printing that takes place inside the ckpt service or alternatively increase the "sync_bits" and possibly other length parameters used by cmirror passed into ckpt.

Comment 2 Jonathan Earl Brassow 2010-08-20 19:50:37 UTC
Created attachment 440026 [details]
Fix printing of non-NULL terminated strings

Don't add support for NULL terminated strings to the structures - or as others to pass in NULL terminated strings - just print the strings as allowed by printf ("%.*s" format).

Comment 3 Steven Dake 2010-08-20 21:25:48 UTC
General solution of patch looks reasonable, although attachment 440026 [details] has several FIXMEs which remain unaddressed.  Ryan will have to resolve those issues in an updated patch.

Comment 5 Ryan O'Hara 2010-08-20 21:51:13 UTC
Created attachment 440058 [details]
Minor change to previous patch.

Removed an extra ')' in the previous patch, line 1531.

Comment 6 Ryan O'Hara 2010-08-21 01:08:18 UTC
I have FIXMEs resolved. Updated patch will be posted soon. Need to recreate bug and test fix before handing-off to QE.

Comment 8 Ryan O'Hara 2010-08-23 15:38:28 UTC
Created attachment 440415 [details]
Resolve FIXME issues, changes printf statement to use length.

This should resolve all issues with log_printf and unterminated string. The issues tagged with FIXME in the original patch have been resolved. Please review this patch.

Comment 9 Ryan O'Hara 2010-08-23 19:27:16 UTC
Fix pushed upstream and pulled into RHEL6 build. Marking this as POST.

Comment 10 Steven Dake 2010-08-23 19:31:30 UTC
Patch reviewed by sdake - is good to go.

Comment 11 Ryan O'Hara 2010-08-23 19:38:25 UTC
Fixed in openais-1.1.1-6 build. Marking this MODIFIED.

Comment 12 Corey Marthaler 2010-08-23 20:34:46 UTC
No longer seeing clvmd start-up issues. Marking verified with the latest build.

openais-1.1.1-6.el6.x86_64
openaislib-1.1.1-6.el6.x86_64

Comment 13 releng-rhel@redhat.com 2010-11-10 22:13:01 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.

Comment 14 Jaromir Hradilek 2010-12-07 16:54:37 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
When a checkpoint name was not terminated with the NULL character, the aisexec process may have terminated unexpectedly with a segmentation fault, causing a cluster outage. With this update, the underlying source code has been modified to resolve this issue, and such strings no longer cause aisexec to crash.