Bug 1494952 - Sending durable messages via AMQP 1.0 on 1.36.0-9 broker gets stuck when destination queue is suddenly empty
Summary: Sending durable messages via AMQP 1.0 on 1.36.0-9 broker gets stuck when dest...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 3.2
Hardware: x86_64
OS: Linux
urgent
high
Target Milestone: 3.2.6
: ---
Assignee: Kim van der Riet
QA Contact: Petr Matousek
URL:
Whiteboard:
Depends On:
Blocks: 1487729
TreeView+ depends on / blocked
 
Reported: 2017-09-24 10:17 UTC by Pavel Moravec
Modified: 2021-06-10 13:06 UTC (History)
4 users (show)

Fixed In Version: qpid-cpp-1.36.0-11
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-24 20:24:17 UTC
Target Upstream Version:


Attachments (Terms of Use)

Description Pavel Moravec 2017-09-24 10:17:30 UTC
Description of problem:
Sending durable AMQP 1.0 messages to a durable queue that gets empty (consumer fetched and ACKed all messages), broker stops acknowledging messages from the sender (no further disposition performatives are seen). That starts to block the sender after it deplenishes its capacity/prefetch.

Note that this problem requires:
- sending _durable_ messages (to durable queue)
- using AMQP 1.0
- qpid-cpp 1.36.0-9 (regression from -6)


Version-Release number of selected component (if applicable):
1.36.0-9


How reproducible:
100%


Steps to Reproduce:
1)
rm -rf /var/lib/qpidd/*
service qpidd restart
qpid-config add queue PerfQueue --durable
qpid-receive -a PerfQueue -m 1000000 -f --print-content=no --print-headers=no --connection-options "{protocol:'amqp1.0'}" --capacity=1 &
qpid-send -a PerfQueue --report-every=30 --send-rate=3 --durable=yes -m 1000000 --content-size=700 --capacity=1 --connection-options "{protocol:'amqp1.0'}"

2) in another terminal, time to time check queue depth.


Actual results:
Some time after the start - often after sending 101 or 201 or 301 messages - the message flow gets stuck, producer gets stuck (no new "reports" on msg throughput), queue's message depth is 1 forever (1 because capacity=1 on sender; capacity on consumer doesnt matter afaik)

(where the "multiply of 100" threshold to get stuck comes from? maybe some consumer activity?

Expected results:
No such producer stuck


Additional info:
The above happens when using broker parameter journal-flush-timeout=1ms . Without that (e.g. having qpidd.conf empty / default), just 14 messages are sent (every time 14*capacity, doesnt depend on content size or send rate).

Comment 1 Pavel Moravec 2017-09-27 07:19:02 UTC
I can confirm the same behaviour when using Java client (as sender).

When using journal-flush-timeout=1ms , broker stops accepting messages from the sender after random hundreds of messages (why hundreds?). If not using that option, the "no more ACK" comes everytime after 14*capacity messages received.

Comment 2 Pavel Moravec 2017-09-27 07:29:30 UTC
(In reply to Pavel Moravec from comment #1)
> I can confirm the same behaviour when using Java client (as sender).
> 

Just for the sake of completeness, I run:

msgSize=700
capacity=1
messages=1000000
sndRate=10

msgContent=""
for i in $(seq 1 $msgSize); do
        msgContent="${msgContent}X"
done

java -jar /root/dtests/node_data/clients/java/aac/target/aac-staging-0.21.0.redhat-1.jar sender -a PerfQueue --capacity=$capacity -c $messages --msg-content=$msgContent -d $((messages/sndRate))

instead of the qpid-send.

Comment 4 Kim van der Riet 2017-09-29 18:06:27 UTC
I have tracked this problem down to a race condition in linearstore. When the store flush timer is set to a low value, it triggers the race condition which has the effect of causing the flush timer to stop flushing.  When using AMQP 1.0, this will have the effect of hanging the application.

I have checked in a fix and tested it using the method above (https://bugzilla.redhat.com/show_bug.cgi?id=1494952#c0) and it works without a problem.

The checkins for this issue can be seen at QPID-7895 (https://issues.apache.org/jira/browse/QPID-7895).

Comment 5 Pavel Moravec 2017-09-30 11:21:57 UTC
(In reply to Kim van der Riet from comment #4)
> I have tracked this problem down to a race condition in linearstore. When
> the store flush timer is set to a low value, it triggers the race condition
> which has the effect of causing the flush timer to stop flushing.  When
> using AMQP 1.0, this will have the effect of hanging the application.
> 
> I have checked in a fix and tested it using the method above
> (https://bugzilla.redhat.com/show_bug.cgi?id=1494952#c0) and it works
> without a problem.
> 
> The checkins for this issue can be seen at QPID-7895
> (https://issues.apache.org/jira/browse/QPID-7895).

Successfully tested in qpid-cpp-server-1.36.0-11.el6.x86_64 build. Both reproducers (consumer stopped => producer blocked very soon; consumer running => producer stuck after hundreds of msgs) using both clients (C++ either Java).

Comment 7 Petr Matousek 2017-10-03 15:29:05 UTC
The issue has been fixed. Verified with both above available reproducers. Internal test case has been linked.

packages under test:
* qpid-cpp-1.36.0-11

-> VERIFIED

Comment 8 Mike Cressman 2017-10-24 20:24:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2892


Note You need to log in before you can comment on or make changes to this bug.