Bug 772666

Summary: flow-to-disk queue stuck when two subscribers are draining messages
Product: Red Hat Enterprise MRG
Component: qpid-cpp
Version: 2.1
Hardware: All
OS: Linux
Severity: high
Priority: unspecified
Status: CLOSED DUPLICATE
Reporter: Pavel Moravec <pmoravec>
Assignee: messaging-bugs <messaging-bugs>
QA Contact: MRG Quality Engineering <mrgqe-bugs>
CC: jross
Doc Type: Bug Fix
Last Closed: 2012-01-09 15:24:12 UTC

Attachments:
coredump of segfault (see actual results, point 3)

Description Pavel Moravec 2012-01-09 15:09:44 UTC
Description of problem:
If more than one subscriber is bound to a flow-to-disk (FTD) queue, the queue may become stuck after raising a JERR__UNEXPRESPONSE / RHM_IORES_EMPTY exception.

Version-Release number of selected component (if applicable):
# rpm -qa | grep qpid
qpid-java-example-0.10-9.el6.noarch
qpid-cpp-client-devel-docs-0.10-8.el6_1.noarch
python-qpid-0.12-1.el6.noarch
qpid-tools-0.12-2.el6.noarch
qpid-cpp-client-ssl-0.12-6.el6.x86_64
qpid-cpp-client-0.12-6.el6.x86_64
qpid-qmf-0.12-6.el6.x86_64
python-qpid-qmf-0.12-6.el6.x86_64
qpid-cpp-server-devel-0.12-6.el6.x86_64
qpid-cpp-server-store-0.12-6.el6.x86_64
qpid-java-client-0.10-9.el6.noarch
qpid-java-common-0.10-9.el6.noarch
qpid-cpp-server-0.12-6.el6.x86_64
qpid-cpp-client-devel-0.12-6.el6.x86_64
qpid-cpp-server-xml-0.12-6.el6.x86_64
#

How reproducible:
100%, within a few minutes

Steps to Reproduce:
Run this script:
./drain "ftd_queue; { create:always, node:{ type:queue, durable:true, x-declare:{ arguments:{ qpid.policy_type:flow_to_disk, qpid.max_count:100, qpid.max_size:0, qpid.flow_stop_count:0, qpid.flow_stop_size:0, qpid.file_count:4, qpid.file_size:512 } } }}"
while true; do
        echo "$(date): starting next round"
        ./spout -c 10000 --content "some content some content some content some content some content some content some content some content some content some content some content some content some content some content some content some content some content some content" "ftd_queue" &
        ./drain -f -c 5000 "ftd_queue" &
        ./drain -f -c 5000 "ftd_queue" &
        wait
done

(The script creates an FTD queue and, in parallel, sends messages to it on one connection while draining messages on two parallel connections; every loop iteration sends and drains 10,000 messages.)
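
For reference, a rough qpid.messaging (python-qpid) equivalent of the shell loop above. This is a sketch only, assuming the broker listens on localhost:5672; the address string and message counts are copied from the reproducer, and the CONTENT constant only approximates the spout --content payload:

#!/usr/bin/env python
# Sketch of the reproducer using qpid.messaging (python-qpid).
from threading import Thread
from qpid.messaging import Connection, Message

ADDR = ("ftd_queue; {create: always, node: {type: queue, durable: True, "
        "x-declare: {arguments: {'qpid.policy_type': 'flow_to_disk', "
        "'qpid.max_count': 100, 'qpid.max_size': 0, 'qpid.flow_stop_count': 0, "
        "'qpid.flow_stop_size': 0, 'qpid.file_count': 4, 'qpid.file_size': 512}}}}")
CONTENT = "some content " * 19   # approximates the spout --content string

def spout(count):
    # One producer connection, like ./spout -c 10000.
    conn = Connection("localhost:5672")
    conn.open()
    try:
        snd = conn.session().sender(ADDR)
        for _ in range(count):
            snd.send(Message(content=CONTENT))
    finally:
        conn.close()

def drain(count):
    # One consumer connection, like ./drain -f -c 5000.
    conn = Connection("localhost:5672")
    conn.open()
    try:
        ssn = conn.session()
        rcv = ssn.receiver(ADDR)
        for _ in range(count):
            rcv.fetch()           # blocks until a message is available
            ssn.acknowledge()
    finally:
        conn.close()

while True:
    print("starting next round")
    workers = [Thread(target=spout, args=(10000,)),
               Thread(target=drain, args=(5000,)),
               Thread(target=drain, args=(5000,))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()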

Actual results:
A few issues were seen:
1) After a while, the drain program raises an exception:
2012-01-09 16:04:14 warning Broker closed connection: 320, Queue ftd_queue: loadContent() failed: jexception 0x0108 JournalImpl::loadMsgContent() threw JERR__UNEXPRESPONSE: Unexpected response to call or event. (read_data_record() returned RHM_IORES_EMPTY) (MessageStoreImpl.cpp:1249)
connection-forced: Queue ftd_queue: loadContent() failed: jexception 0x0108 JournalImpl::loadMsgContent() threw JERR__UNEXPRESPONSE: Unexpected response to call or event. (read_data_record() returned RHM_IORES_EMPTY) (MessageStoreImpl.cpp:1249)
2) Sometimes (easily reproduced by repeating the cycle in the script), the queue becomes "stuck": it still holds messages (per qpid-stat -q), but draining fails with the same exception even when fetching a single message (a quick programmatic check is sketched after this list). The queue is then completely unusable.
3) Rarely, the bug even segfaults the qpidd process; a coredump will be attached.
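
A quick way to confirm the stuck state from point 2 -- a minimal sketch, assuming python-qpid (qpid.messaging) and a broker on localhost:5672, with ftd_queue already showing a non-zero depth in qpid-stat -q. On a healthy queue the single fetch succeeds; on a stuck queue the broker closes the connection with the same loadContent() jexception:

from qpid.messaging import Connection
from qpid.messaging.exceptions import MessagingError

conn = Connection("localhost:5672")
conn.open()
try:
    ssn = conn.session()
    rcv = ssn.receiver("ftd_queue")
    msg = rcv.fetch(timeout=10)    # on a stuck queue even one fetch fails
    ssn.acknowledge()
    print("fetched one message: %r" % msg.content)
except MessagingError as e:        # broker force-closes the connection with the loadContent() error
    print("fetch failed: %s" % e)
finally:
    conn.close()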

Expected results:
None of the behaviors above (1, 2, or 3) occur.

Additional info:

Comment 1 Pavel Moravec 2012-01-09 15:14:44 UTC
Created attachment 551581 [details]
coredump of segfault (see actual results, point 3)

/var/log/messages recorded the segfault with these entries:

Jan  9 14:46:19 pmoravec-mrg-train6 qpidd[16887]: 2012-01-09 14:46:19 error Connection 127.0.0.1:5672-127.0.0.1:58718 closed by error: Queue ftd_queue: loadContent() failed: jexception 0x0901 rmgr::read() threw JERR_RMGR_RIDMISMATCH: RID mismatch between current record and dtok RID (rid=0xcd; dtok_rid=0xcc; dtok_id=0x7) (MessageStoreImpl.cpp:1249)(320)
Jan  9 14:46:40 pmoravec-mrg-train6 kernel: qpidd[16896] general protection ip:3863833fc4 sp:7f483b821930 error:0 in libc-2.12.so[3863800000+197000]
Jan  9 14:46:41 pmoravec-mrg-train6 abrt[17432]: saved core dump of pid 16887 (/usr/sbin/qpidd) to /var/spool/abrt/ccpp-1326116800-16887.new/coredump (56401920 bytes)
Jan  9 14:46:41 pmoravec-mrg-train6 abrtd: Directory 'ccpp-1326116800-16887' creation detected
Jan  9 14:46:41 pmoravec-mrg-train6 abrtd: New crash /var/spool/abrt/ccpp-1326116800-16887, processing

Comment 2 Pavel Moravec 2012-01-09 15:24:12 UTC

*** This bug has been marked as a duplicate of bug 769828 ***