Description of problem:
resend_event_notifications uses unaddressable byte

Version-Release number of selected component (if applicable):
Master

How reproducible:
100%

Steps to Reproduce:
1. corosync from master (or f17 updates)
2. libqb from master (or f17 updates -> 0.14.0)
3. valgrind corosync -f
4. cpg_test_agent
5. (echo "1:0:cpg_initialize:" ; sleep 0.2; echo "2:0:cpg_join:0:cts_group:"; sleep 1; echo "1:0:record_messages:" ; sleep 0.1; echo "2:0:msg_blaster:0:9000:"; sleep 1; for i in `seq 1 1`;do echo "2:0:read_messages:0:50:" ; sleep 0.01;done) | nc 127.0.0.1 9034

Actual results:
==7169== Syscall param socketcall.sendto(msg) points to unaddressable byte(s)
==7169==    at 0x58DCACC: send (in /lib64/libpthread-2.12.so)
==7169==    by 0x526C572: qb_ipc_us_send (ipc_us.c:98)
==7169==    by 0x526A467: resend_event_notifications (ipcs.c:333)
==7169==    by 0x526B364: qb_ipcs_dispatch_connection_request (ipcs.c:733)
==7169==    by 0x526682E: _poll_dispatch_and_take_back_ (loop_poll.c:98)
==7169==    by 0x526602C: qb_loop_run_level (loop.c:45)
==7169==    by 0x5266606: qb_loop_run (loop.c:206)
==7169==    by 0x41C938: main (main.c:1229)
==7169==  Address 0x73842c0 is 0 bytes after a block of size 1,328 alloc'd
==7169==    at 0x4C25A28: calloc (vg_replace_malloc.c:467)
==7169==    by 0x526AA8F: qb_ipcs_connection_alloc (ipcs.c:496)
==7169==    by 0x526D56E: handle_new_connection (ipc_us.c:601)
==7169==    by 0x526DF15: qb_ipcs_us_connection_acceptor (ipc_us.c:910)
==7169==    by 0x526682E: _poll_dispatch_and_take_back_ (loop_poll.c:98)
==7169==    by 0x526602C: qb_loop_run_level (loop.c:45)
==7169==    by 0x5266606: qb_loop_run (loop.c:206)
==7169==    by 0x41C938: main (main.c:1229)

Expected results:
No error

Additional info:
Also, please note that without valgrind this causes data corruption in the message (on the client side), so it either ends up with an incorrect SHA1 hash or (most likely) a segfault, because NSS is not able to process the message: the msg len (in the structure, not in the callback) is corrupted and claims a HUGE message.

This is a BLOCKER for corosync.
Created attachment 598007 [details]
Patch to ensure problem is really in libqb event

Angus,
attached is a patch to confirm that the problem is really in libqb and not in corosync itself.

Also, the problem is not 100% reproducible, but with the following few commands it is very easy to reproduce in a short time (seconds):
1. corosync -f
2. while true;do ./cpg_test_agent;done
3. while true;do (echo "1:0:cpg_initialize:" ; sleep 0.2; echo "2:0:cpg_join:0:cts_group:"; sleep 1; echo "1:0:record_messages:" ; sleep 0.1; echo "2:0:msg_blaster:0:9000:"; sleep 1; for i in `seq 1 1`;do echo "2:0:read_messages:0:50:" ; sleep 0.01;done) | nc 127.0.0.1 9034;done

Result (in cpg_test_agent):
ERR: nid = 1797661194, pid = 12508, seq = 2081, size = (15046755946816602112 0xd0d0d0d000000000) msg_len = 532

This is followed by "Segmentation fault" (because NSS is trying to compute SHA1 over HUGE unallocated data).

In other words, the main problem is DATA CORRUPTION of events. As you can see, msg_pt->len is totally corrupted (in my test environment usually with the pattern 0xd0d0d0d000000000).

I'm able to reproduce the problem (independently) on multiple computers (all with FB DIMM ECC memory) and/or VMs, on RHEL 6.3 and/or FC17.

This problem must be solved ASAP (Corosync 2.0 can't be used in production with this problem).
Lon added to CC because this bug blocks me.
The valgrind error might be an irritation, but it will not cause any issues. The socket is only used as a notifier (1 byte means 1 message) so that the client can put a socket in a poll loop. The data is never used. That said, I'll sort it out.

The real issue is that there is no actual message in the ringbuffer.

#define QB_RB_CHUNK_MAGIC       0xA1A1A1A1
#define QB_RB_CHUNK_MAGIC_DEAD  0xD0D0D0D0
#define QB_RB_CHUNK_MAGIC_ALLOC 0xA110CED0

QB_RB_CHUNK_MAGIC_DEAD indicates that the space being looked at has already been reclaimed (like freed).

I'll have a look on Monday.
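To connect the magic values above to the corrupted size reported earlier (0xd0d0d0d000000000), here is a minimal C sketch. It only splits a 64-bit value into its 32-bit halves and compares each half against QB_RB_CHUNK_MAGIC_DEAD. The helper names upper_half, lower_half and contains_dead_magic are hypothetical and are not part of the libqb API; the sketch assumes nothing about how libqb lays out a chunk in the ringbuffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Magic values as quoted in this comment. */
#define QB_RB_CHUNK_MAGIC       0xA1A1A1A1
#define QB_RB_CHUNK_MAGIC_DEAD  0xD0D0D0D0
#define QB_RB_CHUNK_MAGIC_ALLOC 0xA110CED0

/* Hypothetical helper (not libqb API): upper 32 bits of a 64-bit value. */
uint32_t upper_half(uint64_t v)
{
    return (uint32_t)(v >> 32);
}

/* Hypothetical helper (not libqb API): lower 32 bits of a 64-bit value. */
uint32_t lower_half(uint64_t v)
{
    return (uint32_t)(v & 0xFFFFFFFFu);
}

/*
 * Hypothetical helper (not libqb API): true if either 32-bit half of v
 * equals QB_RB_CHUNK_MAGIC_DEAD, which hints that the value was read
 * from ringbuffer space that had already been reclaimed.
 */
bool contains_dead_magic(uint64_t v)
{
    return upper_half(v) == QB_RB_CHUNK_MAGIC_DEAD
        || lower_half(v) == QB_RB_CHUNK_MAGIC_DEAD;
}
```

For the reported size, contains_dead_magic(0xd0d0d0d000000000) is true because the upper half is exactly QB_RB_CHUNK_MAGIC_DEAD, while the accompanying msg_len of 532 is not flagged.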
This is now fixed upstream: https://github.com/asalkeld/libqb/commit/e5be0396a7510d24b7e5e7a315c7f2f955e31452 I'll work to get it into fedora.
libqb-0.14.1-1.fc17 has been submitted as an update for Fedora 17. https://admin.fedoraproject.org/updates/libqb-0.14.1-1.fc17
Package libqb-0.14.1-1.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing libqb-0.14.1-1.fc17'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-10851/libqb-0.14.1-1.fc17
then log in and leave karma (feedback).
libqb-0.14.1-1.fc17 has been pushed to the Fedora 17 stable repository. If problems still persist, please make note of it in this bug report.