Bug 1494952

Summary:	Sending durable messages via AMQP 1.0 on 1.36.0-9 broker gets stuck when destination queue is suddenly empty
Product:	Red Hat Enterprise MRG	Reporter:	Pavel Moravec <pmoravec>
Component:	qpid-cpp	Assignee:	Kim van der Riet <kim.vdriet>
Status:	CLOSED ERRATA	QA Contact:	Petr Matousek <pematous>
Severity:	high	Docs Contact:
Priority:	urgent
Version:	3.2	CC:	jross, kim.vdriet, mcressma, pematous
Target Milestone:	3.2.6
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	qpid-cpp-1.36.0-11	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-10-24 20:24:17 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1487729

Description Pavel Moravec 2017-09-24 10:17:30 UTC

Description of problem:
Sending durable AMQP 1.0 messages to a durable queue that gets empty (consumer fetched and ACKed all messages), broker stops acknowledging messages from the sender (no further disposition performatives are seen). That starts to block the sender after it deplenishes its capacity/prefetch.

Note that this problem requires:
- sending _durable_ messages (to durable queue)
- using AMQP 1.0
- qpid-cpp 1.36.0-9 (regression from -6)


Version-Release number of selected component (if applicable):
1.36.0-9


How reproducible:
100%


Steps to Reproduce:
1)
rm -rf /var/lib/qpidd/*
service qpidd restart
qpid-config add queue PerfQueue --durable
qpid-receive -a PerfQueue -m 1000000 -f --print-content=no --print-headers=no --connection-options "{protocol:'amqp1.0'}" --capacity=1 &
qpid-send -a PerfQueue --report-every=30 --send-rate=3 --durable=yes -m 1000000 --content-size=700 --capacity=1 --connection-options "{protocol:'amqp1.0'}"

2) in another terminal, time to time check queue depth.


Actual results:
Some time after the start - often after sending 101 or 201 or 301 messages - the message flow gets stuck, producer gets stuck (no new "reports" on msg throughput), queue's message depth is 1 forever (1 because capacity=1 on sender; capacity on consumer doesnt matter afaik)

(where the "multiply of 100" threshold to get stuck comes from? maybe some consumer activity?

Expected results:
No such producer stuck


Additional info:
The above happens when using broker parameter journal-flush-timeout=1ms . Without that (e.g. having qpidd.conf empty / default), just 14 messages are sent (every time 14*capacity, doesnt depend on content size or send rate).

Comment 1 Pavel Moravec 2017-09-27 07:19:02 UTC

I can confirm the same behaviour when using Java client (as sender).

When using journal-flush-timeout=1ms , broker stops accepting messages from the sender after random hundreds of messages (why hundreds?). If not using that option, the "no more ACK" comes everytime after 14*capacity messages received.

Comment 2 Pavel Moravec 2017-09-27 07:29:30 UTC

(In reply to Pavel Moravec from comment #1)
> I can confirm the same behaviour when using Java client (as sender).
> 

Just for the sake of completeness, I run:

msgSize=700
capacity=1
messages=1000000
sndRate=10

msgContent=""
for i in $(seq 1 $msgSize); do
        msgContent="${msgContent}X"
done

java -jar /root/dtests/node_data/clients/java/aac/target/aac-staging-0.21.0.redhat-1.jar sender -a PerfQueue --capacity=$capacity -c $messages --msg-content=$msgContent -d $((messages/sndRate))

instead of the qpid-send.

Comment 4 Kim van der Riet 2017-09-29 18:06:27 UTC

I have tracked this problem down to a race condition in linearstore. When the store flush timer is set to a low value, it triggers the race condition which has the effect of causing the flush timer to stop flushing.  When using AMQP 1.0, this will have the effect of hanging the application.

I have checked in a fix and tested it using the method above (https://bugzilla.redhat.com/show_bug.cgi?id=1494952#c0) and it works without a problem.

The checkins for this issue can be seen at QPID-7895 (https://issues.apache.org/jira/browse/QPID-7895).

Comment 5 Pavel Moravec 2017-09-30 11:21:57 UTC

(In reply to Kim van der Riet from comment #4)
> I have tracked this problem down to a race condition in linearstore. When
> the store flush timer is set to a low value, it triggers the race condition
> which has the effect of causing the flush timer to stop flushing.  When
> using AMQP 1.0, this will have the effect of hanging the application.
> 
> I have checked in a fix and tested it using the method above
> (https://bugzilla.redhat.com/show_bug.cgi?id=1494952#c0) and it works
> without a problem.
> 
> The checkins for this issue can be seen at QPID-7895
> (https://issues.apache.org/jira/browse/QPID-7895).

Successfully tested in qpid-cpp-server-1.36.0-11.el6.x86_64 build. Both reproducers (consumer stopped => producer blocked very soon; consumer running => producer stuck after hundreds of msgs) using both clients (C++ either Java).

Comment 7 Petr Matousek 2017-10-03 15:29:05 UTC

The issue has been fixed. Verified with both above available reproducers. Internal test case has been linked.

packages under test:
* qpid-cpp-1.36.0-11

-> VERIFIED

Comment 8 Mike Cressman 2017-10-24 20:24:17 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2892