Bug 839605 - event message corrupted (may be because of valgrind: socketcall.sendto(msg) points to unaddressable byte(s))
Alias: None
Product: Fedora
Classification: Fedora
Component: libqb
Version: 17
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Angus Salkeld
QA Contact: Fedora Extras Quality Assurance
Depends On:
Reported: 2012-07-12 12:25 UTC by Jan Friesse
Modified: 2014-01-13 01:40 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2012-07-26 22:23:12 UTC
Type: Bug

Attachments
Patch to ensure problem is really in libqb event (4.47 KB, patch)
2012-07-13 08:26 UTC, Jan Friesse

Description Jan Friesse 2012-07-12 12:25:50 UTC
Description of problem:
resend_event_notifications uses unaddressable byte

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. corosync from master (or f17 updates)
2. libqb from master (or f17 updates -> 0.14.0)
3. valgrind corosync -f
4. cpg_test_agent
5. (echo "1:0:cpg_initialize:" ; sleep 0.2; echo "2:0:cpg_join:0:cts_group:"; sleep 1; echo "1:0:record_messages:" ; sleep 0.1; echo "2:0:msg_blaster:0:9000:"; sleep 1; for i in `seq 1 1`;do echo "2:0:read_messages:0:50:" ; sleep 0.01;done) | nc 9034

Actual results:
==7169== Syscall param socketcall.sendto(msg) points to unaddressable byte(s)
==7169==    at 0x58DCACC: send (in /lib64/libpthread-2.12.so)
==7169==    by 0x526C572: qb_ipc_us_send (ipc_us.c:98)
==7169==    by 0x526A467: resend_event_notifications (ipcs.c:333)
==7169==    by 0x526B364: qb_ipcs_dispatch_connection_request (ipcs.c:733)
==7169==    by 0x526682E: _poll_dispatch_and_take_back_ (loop_poll.c:98)
==7169==    by 0x526602C: qb_loop_run_level (loop.c:45)
==7169==    by 0x5266606: qb_loop_run (loop.c:206)
==7169==    by 0x41C938: main (main.c:1229)
==7169==  Address 0x73842c0 is 0 bytes after a block of size 1,328 alloc'd
==7169==    at 0x4C25A28: calloc (vg_replace_malloc.c:467)
==7169==    by 0x526AA8F: qb_ipcs_connection_alloc (ipcs.c:496)
==7169==    by 0x526D56E: handle_new_connection (ipc_us.c:601)
==7169==    by 0x526DF15: qb_ipcs_us_connection_acceptor (ipc_us.c:910)
==7169==    by 0x526682E: _poll_dispatch_and_take_back_ (loop_poll.c:98)
==7169==    by 0x526602C: qb_loop_run_level (loop.c:45)
==7169==    by 0x5266606: qb_loop_run (loop.c:206)
==7169==    by 0x41C938: main (main.c:1229)

Expected results:
No error

Additional info:

Comment 1 Jan Friesse 2012-07-12 14:13:19 UTC
Also please note that without valgrind, this causes data corruption in the message (on the client side), so it either ends up with an incorrect SHA1 hash or (most likely) a segfault, because NSS is not able to process the message (the msg len in the structure, not the one in the callback, is corrupted and claims a HUGE message).

This is a BLOCKER for corosync.

Comment 2 Jan Friesse 2012-07-13 08:26:03 UTC
Created attachment 598007
Patch to ensure problem is really in libqb event

Attached is a patch to ensure that the problem is really in libqb and not in corosync itself.

Also, the problem is not 100% reproducible, but with the following few commands it is very easy to reproduce in a short time (seconds):

1. corosync -f
2. while true;do ./cpg_test_agent;done
3. while true;do (echo "1:0:cpg_initialize:" ; sleep 0.2; echo "2:0:cpg_join:0:cts_group:"; sleep 1; echo "1:0:record_messages:" ; sleep 0.1; echo "2:0:msg_blaster:0:9000:"; sleep 1; for i in `seq 1 1`;do echo "2:0:read_messages:0:50:" ; sleep 0.01;done) | nc 9034;done

Result (in cpg_test_agent):
ERR: nid = 1797661194, pid = 12508, seq = 2081, size = (15046755946816602112 0xd0d0d0d000000000) msg_len = 532

Followed by "Segmentation fault" (because NSS is trying to compute SHA1 over HUGE unallocated data).

In other words, the main problem is DATA CORRUPTION of events.

As you can see, msg_pt->len is totally corrupted (in my test environment usually with the pattern 0xd0d0d0d000000000).
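The mismatch in the ERR line above (embedded size huge, callback msg_len = 532) suggests a cheap sanity check a receiver could run before hashing. The following is only a sketch under assumptions: `demo_msg_hdr` and `demo_msg_len_ok` are hypothetical names, and the header layout (a single 64-bit length at the start of the message) is invented for illustration, not taken from cpg_test_agent or libqb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header: a 64-bit payload length at the start of the message. */
struct demo_msg_hdr {
    uint64_t len;
};

/*
 * Returns 1 if the length stored inside the message fits in the number of
 * bytes the callback actually delivered (msg_len), 0 otherwise.
 */
static int demo_msg_len_ok(const void *msg, size_t msg_len)
{
    struct demo_msg_hdr hdr;

    assert(msg != NULL);
    if (msg_len < sizeof(hdr)) {
        return 0; /* too short to even hold the header */
    }
    /* Copy out the header to avoid unaligned access. */
    memcpy(&hdr, msg, sizeof(hdr));
    return hdr.len <= msg_len - sizeof(hdr);
}
```

With a 532-byte message whose embedded length is 0xd0d0d0d000000000, the check fails and the receiver can report the corruption instead of passing the length to the hash routine. It only detects the symptom; it does not address the cause in libqb.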

I'm able to reproduce the problem (independently) on multiple computers (all with FB-DIMM ECC memory) and/or VMs, on RHEL 6.3 and/or FC17.

This problem must be solved ASAP (Corosync 2.0 can't be used in production with this problem).

Comment 3 Jan Friesse 2012-07-13 08:27:54 UTC
Lon added to CC because this bug blocks me.

Comment 4 Angus Salkeld 2012-07-13 13:07:35 UTC
The valgrind error might be an irritation, but will not cause any issues.
The socket is only used as a notifier (1 byte means 1 message) so that the
client can put a socket in a poll loop. The data is never used. That
said, I'll sort it out.
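The notifier pattern described above can be sketched as follows. This is a minimal illustration, not libqb's code: `notify_events`, `wait_for_events` and `notifier_demo` are hypothetical names, and a `socketpair` stands in for the real client/server connection. The sender writes from a buffer that is fully addressable, which is the property the valgrind report says the original `send` call lacked.

```c
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Signal "n messages are ready" by sending n bytes. Only the byte count
 * matters; the contents are never interpreted by the receiver. */
static ssize_t notify_events(int fd, size_t n)
{
    static const char wakeup[64]; /* zero-filled and fully addressable */

    assert(n <= sizeof(wakeup));
    return send(fd, wakeup, n, MSG_NOSIGNAL);
}

/* Wait until the socket is readable, then drain it. Returns the number of
 * notification bytes read, 0 on timeout, or -1 on error. */
static ssize_t wait_for_events(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    char sink[64];
    int ready = poll(&pfd, 1, timeout_ms);

    if (ready <= 0) {
        return ready;
    }
    return recv(fd, sink, sizeof(sink), 0);
}

/* Send three notifications over a local socket pair and count them. */
static int notifier_demo(void)
{
    int sv[2];
    ssize_t got;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return -1;
    }
    notify_events(sv[0], 3);
    got = wait_for_events(sv[1], 1000);
    close(sv[0]);
    close(sv[1]);
    return (int)got;
}
```

The actual messages would travel through the shared ring buffer; the socket only wakes the client's poll loop, so the count of bytes read tells it how many messages to fetch.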

The real issue is that there is no actual message in the ringbuffer.
#define QB_RB_CHUNK_MAGIC		0xA1A1A1A1

QB_RB_CHUNK_MAGIC_DEAD indicates that the space being looked at has already
been reclaimed (like freed). I'll have a look on Monday.
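To illustrate how a chunk magic can flag reads of already reclaimed space, here is a small sketch. It is not libqb's ring buffer: `demo_chunk`, `demo_chunk_reclaim` and `demo_chunk_peek` are hypothetical, the chunk layout is invented, and the value used for QB_RB_CHUNK_MAGIC_DEAD is an assumption chosen to match the 0xd0d0d0d0 pattern reported earlier in this bug (only QB_RB_CHUNK_MAGIC is quoted above).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QB_RB_CHUNK_MAGIC      0xA1A1A1A1u /* value quoted in this comment */
#define QB_RB_CHUNK_MAGIC_DEAD 0xD0D0D0D0u /* ASSUMED value, not stated here */

/* Hypothetical chunk layout, for illustration only. */
struct demo_chunk {
    uint32_t magic;    /* QB_RB_CHUNK_MAGIC while the chunk is live */
    uint32_t len;      /* payload length in bytes */
    char     data[32]; /* payload */
};

/* Mark a chunk as reclaimed so that later readers can detect stale access. */
static void demo_chunk_reclaim(struct demo_chunk *c)
{
    assert(c != NULL);
    c->magic = QB_RB_CHUNK_MAGIC_DEAD;
}

/* Return the payload length of a live chunk, or -1 if the chunk is not
 * live (for example already reclaimed) or its length is implausible. */
static long demo_chunk_peek(const struct demo_chunk *c)
{
    assert(c != NULL);
    if (c->magic != QB_RB_CHUNK_MAGIC) {
        return -1;
    }
    if (c->len > sizeof(c->data)) {
        return -1;
    }
    return (long)c->len;
}
```

A reader that skips the magic check and trusts `len` from a reclaimed chunk would see whatever the reclaim step left behind, which is consistent with the corrupted length observed on the client side.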

Comment 5 Angus Salkeld 2012-07-17 11:33:36 UTC
This is now fixed upstream:

I'll work to get it into fedora.

Comment 6 Fedora Update System 2012-07-18 02:43:30 UTC
libqb-0.14.1-1.fc17 has been submitted as an update for Fedora 17.

Comment 7 Fedora Update System 2012-07-19 09:15:30 UTC
Package libqb-0.14.1-1.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing libqb-0.14.1-1.fc17'
as soon as you are able to.
Please go to the following url:
then log in and leave karma (feedback).

Comment 8 Fedora Update System 2012-07-26 22:23:12 UTC
libqb-0.14.1-1.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.
