Bug 1502855
Summary: | sending durable messages concurrently from 0-10 and 1.0 producers causes high latency on ACKing the 1.0 ones | |
---|---|---|---
Product: | Red Hat Enterprise MRG | Reporter: | Pavel Moravec <pmoravec>
Component: | qpid-cpp | Assignee: | Kim van der Riet <kim.vdriet>
Status: | CLOSED ERRATA | QA Contact: | Petr Matousek <pematous>
Severity: | high | Docs Contact: |
Priority: | urgent | |
Version: | 3.2 | CC: | esammons, gsim, jross, kim.vdriet, mcressma, pematous
Target Milestone: | 3.2.7 | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | qpid-cpp-1.36.0-12 | Doc Type: | Bug Fix
Doc Text: | **Cause:** When durable messages are sent to the same queue using both AMQP 1.0 and 0-10 concurrently, a condition can be triggered in which the store no longer flushes the 1.0 messages. **Consequence:** When this condition is triggered, the 1.0 messages are flushed only by 0-10 activity on the queue; if there is none, the messages are never flushed. **Fix:** The code was changed so that any combination of 0-10 and 1.0 messages is processed successfully without causing messages to be delayed. **Result:** | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2017-11-20 16:52:55 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Pavel Moravec
2017-10-16 20:34:28 UTC
Durable queues are mandatory: no durable queues or messages, no reproducer.

Smaller reproducer using just one queue:

1) In one terminal, run the commands below in a script:

```shell
rm -rf /var/lib/qpidd/*
killall qpid-send qpid-receive
service qpidd restart

# create durable queue
qpid-receive -a "DurablePerf; {create:always, node:{type:queue, durable:true}}"

# have a 1.0 consumer running forever
qpid-receive -a DurablePerf -m 1000000 --print-content=no --print-headers=no \
    --connection-options "{protocol:'amqp1.0'}" --capacity=1 -f &

# repeatedly send a 1.0 message and print the real time spent on each send
while true; do
    (time qpid-send -a DurablePerf --capacity=1 --durable=yes -m 1 \
        --content-size=$((RANDOM%500+1000)) \
        --connection-options "{protocol:'amqp1.0'}") 2>&1 | grep real
    sleep 1
done &
```

2) In another terminal, send some 0-10 messages to the same queue from time to time:

```shell
while true; do
    qpid-send -a DurablePerf --capacity=1 --durable=yes -m 10 --content-size=$((RANDOM%1500))
    sleep 1
    ps aux | grep -v grep | grep qpid-send
    sleep 1
    ps aux | grep -v grep | grep qpid-send
    date
done
```

3) Check the output of the first terminal: the "real" time (how long each qpid-send takes to send its message and terminate) will be below 0.1s. But once a race condition is hit (concurrent sending of a 0-10 and a 1.0 message?), the times stay above 1s forever.
Since that time, the second terminal shows:

```
Tue Oct 17 12:17:30 CEST 2017
root 21615 0.0 0.0 202836 5636 pts/1 Sl 12:17 0:00 qpid-send -a DurablePerf --capacity=1 --durable=yes -m 1 --content-size=1206 --connection-options {protocol:'amqp1.0'}
root 21615 0.0 0.0 202836 5636 pts/1 Sl 12:17 0:00 qpid-send -a DurablePerf --capacity=1 --durable=yes -m 1 --content-size=1206 --connection-options {protocol:'amqp1.0'}
Tue Oct 17 12:17:33 CEST 2017
root 21632 0.0 0.0 202836 5640 pts/1 Sl 12:17 0:00 qpid-send -a DurablePerf --capacity=1 --durable=yes -m 1 --content-size=1327 --connection-options {protocol:'amqp1.0'}
root 21632 0.0 0.0 202836 5640 pts/1 Sl 12:17 0:00 qpid-send -a DurablePerf --capacity=1 --durable=yes -m 1 --content-size=1327 --connection-options {protocol:'amqp1.0'}
Tue Oct 17 12:17:35 CEST 2017
```

i.e. the same qpid-send (compare the PIDs) is still running after the 1-second sleep.

Interesting observation: if I stop the 0-10 sender, the 1.0 sender gets stuck. Forever. Well, until I send another 0-10 message, which triggers a disposition frame from qpidd to the 1.0 qpid-send, and that sender can finally terminate.

So, why does qpidd wait to send the 1.0 disposition performative until _another_ 0-10 message arrives?

> So, why does qpidd wait to send the 1.0 disposition performative until _another_ 0-10 message arrives?
In 1.0, there are no explicit syncs defined in the protocol, and so at present there are no flushes on the journal. With 0-10 there are. I suspect this is another bug in the journal, where it fails to flush automatically in some cases. For durable AMQP 1.0 messages, the disposition is sent back only after the journal calls back to indicate successful completion of the enqueue operation.
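The completion flow described above can be sketched as follows. This is a minimal illustration only; the class and function names are assumptions, not qpid-cpp's actual linearstore API:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Sketch: for a durable AMQP 1.0 message, the broker can send the
// disposition only from the journal's completion callback, which runs
// when the journal flushes. If nothing ever flushes the journal, the
// callback never runs and the 1.0 sender waits indefinitely.
class JournalSketch {
    std::vector<std::function<void()>> pending;  // completions awaiting a flush
public:
    // Record an enqueue; onComplete fires once the write is durable.
    void enqueue(std::function<void()> onComplete) {
        pending.push_back(std::move(onComplete));  // buffered, not yet flushed
    }

    // Flush the journal: in the broker this is driven either by an
    // inactivity timer or by an explicit sync from the 0-10 codepath.
    void flush() {
        for (auto& cb : pending)
            cb();          // data is durable now; complete the enqueues
        pending.clear();
    }
};
```

In this model, if the automatic (timer-driven) flush stops happening, `flush()` is only ever reached via 0-10 activity, which matches the observed behaviour: the 1.0 disposition arrives only after another 0-10 message.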
(In reply to Gordon Sim from comment #2)

> > So, why does qpidd wait to send the 1.0 disposition performative until _another_ 0-10 message arrives?
>
> In 1.0, there are no explicit syncs defined in the protocol, and so at
> present there are no flushes on the journal. With 0-10 there are. I suspect
> this is another bug in the journal, where it fails to flush automatically in
> some cases. For durable AMQP 1.0 messages, the disposition is sent back only
> after the journal calls back to indicate successful completion of the enqueue
> operation.

That makes sense. If the 1.0 disposition is sent back only after a trigger from linearstore, then linearstore must be the guilty party. But how can it be buggy if it doesn't care about, or even know, the protocol version of a received message, while concurrent sending of both versions is the key factor in the reproducer?

Idea:
- a 1.0 message is received
- a 0-10 message is received
- the 1.0 message is enqueued to the queue and stored in the journal
- the 0-10 message as well
- the journal flush trigger fires now, but is "processed"(?) by the 0-10 sender, not by the 1.0 one

I am not sure about the last point, but I am quite confident that if a 0-10 message is enqueued after a 1.0 one and before the journal flush is triggered for the 1.0 message (or, for an observer, before the disposition frame is sent to the 1.0 sender), then this bug is hit. Since:

> Interesting observation: if I stop the 0-10 sender, the 1.0 sender gets stuck.
> Forever. Well, until I send another 0-10 message that triggers disposition
> frame is sent from qpidd to the 1.0 qpid-send and that sender can finally terminate.

the same "disposition frame stuck" state is triggered again. Every time. (i.e. once we hit this BZ, the "workaround" of sending a 0-10 message is also its trigger.)

The 1.0 codepath *never* flushes the queue explicitly. There is no explicit protocol 'hint' for it to do so. It would need to infer that flushing made sense from the absence of any more data incoming on that connection (which at present it has no knowledge of either).
So my guess is that something is happening whereby the automatic flushing of the journal by itself stops, so completion relies on explicit flushes, which come only from the 0-10 codepath.

Gordon's analysis is correct. The problem seems to be an error in the state machine of the InactivityFireEvent TimerTask. In cases where an external flush (which is provided by the presence of AMQP 0-10 messages) arrives just before the timer fires, the state machine moves from state RUNNING to state FLUSHED. A pathway for the state to move from FLUSHED to FIRED once the timer actually fires was missing, and this error put the timer into a state from which it could never be reset. By adding this pathway to the state machine, the problem has been solved, and the reproducer in comment #1 now works.

I can confirm that the qpid-cpp-server-1.36.0-12.el6.x86_64 package fixes this bug. Ran a few other tests (i.e. the "previous" BZ reproducers) and found no issue. Thanks for the prompt patch.

I've been running the reproducer from comment #1 for more than 20 hours. The issue did not appear on any of the supported platforms, so the issue is considered fixed. An internal test case has been linked.

packages under test:
* qpid-cpp-1.36.0-13 -> VERIFIED

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3253
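The state-machine error described in the fix analysis above can be sketched as follows. The states, transitions, and names here are assumptions reconstructed from the comment, not the actual qpid-cpp linearstore source:

```cpp
#include <cassert>

// Sketch of the InactivityFireEvent TimerTask state machine.
// The bug: an external flush moved the state from RUNNING to FLUSHED,
// but when the timer then fired there was no FLUSHED -> FIRED
// transition, so the timer could never be reset (rearmed) again.
enum class State { RUNNING, FLUSHED, FIRED };

class InactivityFireEventSketch {
    State state = State::RUNNING;
public:
    // An external flush (driven by 0-10 traffic) arrives before the timer fires.
    void externalFlush() {
        if (state == State::RUNNING)
            state = State::FLUSHED;
    }

    // The inactivity timer fires.
    void fire() {
        if (state == State::RUNNING) {
            state = State::FIRED;   // normal path: the timer performs the flush
        } else if (state == State::FLUSHED) {
            state = State::FIRED;   // the pathway that was missing before the fix;
                                    // without it, the state stays FLUSHED forever
                                    // and reset() below can never rearm the timer
        }
    }

    // Rearm the timer for the next inactivity period; only legal once FIRED.
    bool reset() {
        if (state != State::FIRED)
            return false;
        state = State::RUNNING;
        return true;
    }

    State current() const { return state; }
};
```

With the FLUSHED -> FIRED branch in place, a 0-10 flush racing the timer no longer wedges the task, so the automatic flush keeps running and 1.0 dispositions are no longer dependent on further 0-10 traffic.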