Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1142866

Summary: Purging TTL expired messages 5times longer than before
Product: Red Hat Enterprise MRG Reporter: Pavel Moravec <pmoravec>
Component: qpid-cppAssignee: Gordon Sim <gsim>
Status: CLOSED ERRATA QA Contact: Leonid Zhaldybin <lzhaldyb>
Severity: high Docs Contact:
Priority: high    
Version: 2.5CC: esammons, iboverma, jross, lzhaldyb, mcressma, pematous, zkraus
Target Milestone: 2.5.5Keywords: Regression
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: qpid-cpp-0.18-32 Doc Type: Bug Fix
Doc Text:
It was discovered that a recently fixed portion of debug code, which was implemented to avoid a segfault, was executing even if debug mode was not enabled. This caused purging expired messages from a queue to take longer than it did prior to the fix. The code was reworked so that the debug code is only executed when debug mode is enabled. Purging messages now happens at the same speed it did before the erroneous change.
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-10-22 16:34:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1140368, 1140369    
Attachments:
Description Flags
reproducer script none

Description Pavel Moravec 2014-09-17 14:23:45 UTC
Description of problem:
for background, read first 2 paragraphs in bz1093996.

For $summary, there is a regression in the time the thread blocks heartbeat timers to be fired. For short times, the blocking is acceptable but not in the amount seen now. The regression was introduced between 0.18-18 (still ok) and 0.18-19 (5times higher execution time).


Version-Release number of selected component (if applicable):
0.18-31 (seen since 0.18-19)


How reproducible:
100%


Steps to Reproduce:
1. Run attached script (replace 10.34.84.78 by your IP address there, optionally modify "queue-purge-interval=240" to be sure all enqueues happened within the limit)

2. Wait for queue-purge-interval seconds, logs like:

2014-09-17 14:31:00 [Client] warning Connection [10.34.1.148:60941-10.34.84.78:5672] closed

should start to appear (this is the unwanted behaviour, but the main problem is _duration_ of it).

3. Wait until the logs stop to appear.
4. Count time diff between 1st and last such log and compare it with qpid-cpp 0.18-18 results (or any older broker version)


Actual results:
on my VM, 0.18-18 was not sending heartbeats for approx 10 seconds (the time diff was around 9 seconds). 0.18-19 and any higher 0.18-* version provided 5times worse times (time diff 53 seconds).


Expected results:
Similar time of execution of the thread / similar time diff in step 4 seen.


Additional info:
For QE: If bz1093996 is backported, there might not be any such timeout.

Comment 1 Pavel Moravec 2014-09-17 14:26:02 UTC
Created attachment 938493 [details]
reproducer script

- replace IP address of the broker (10.34.84.78) by yours everywhere
- it runs tcpdump for detailed analysis - tis is irrelevant for basic observation/reproducing

Comment 3 Pavel Moravec 2014-09-17 14:33:20 UTC
(In reply to Pavel Moravec from comment #1)
> Created attachment 938493 [details]
> reproducer script
> 
> - replace IP address of the broker (10.34.84.78) by yours everywhere
> - it runs tcpdump for detailed analysis - tis is irrelevant for basic
> observation/reproducing

FYI the reproducer sends 5millions messages to the broker. That requires 8GB RAM machine.

It is not desired to test with much lower number of messages, as the times (like 9 seconds) would be comparably smaller, then.

Comment 4 Pavel Moravec 2014-09-18 09:49:13 UTC
Diff between 0.18-18 (works well) and 0.18-19 (performance degraded):

> - BZ  966004 - use correct exception object for older ssl implementations
> - BZ 1030406 - segfault in Queue::isExpired() when debug logging for queue is enabled
> - BZ 1030608 - qpid segfaults in qpid::ha::BrokerReplicator::disconnected
> - BZ 1030637 - cluster restart with cluster-size=3, auth=yes
> - BZ 1036650 - ensure that replaced LVQ messages are deleted from store
> - BZ 1041575 - don't delete subscription queue when closing link if it was explicitly declared non-exclusive
> - BZ 1042224 - durable LVQ raises journal error when only transient messages are sent
> - BZ 1044002 - setting heartbeat has no effect on connection establishment timeout


From that list, either of these two is most probably causing the degradation:

> - BZ 1030406 - segfault in Queue::isExpired() when debug logging for queue is enabled
> - BZ 1044002 - setting heartbeat has no effect on connection establishment timeout

/me trying now to isolate the code behind those BZs that is relevant for the regerssion

Comment 5 Pavel Moravec 2014-09-18 10:52:55 UTC
bz1030406, esp. http://git.app.eng.bos.redhat.com/git/rh-qpid.git/diff/qpid/cpp/src/qpid/broker/Queue.cpp?h=0.18-mrg&id=885515f3db9010462d0dad61b7bacbcf8800c0b2 prepares some debug statement in all cases  - but it prints it only when debugs are enabled.

So, for debugs disabled (like by default), redundant code is executed. And this is new in 0.18-19.

Comment 13 Leonid Zhaldybin 2014-10-14 12:42:53 UTC
Tested on RHEL5.11 and RHEL6.6 (both i386 and x86_64). This issue has been fixed, the purging process takes 4 - 5 times less time on the latest build.

Packages used for testing:

RHEL5:
python-qpid-0.18-13.el5_10
python-qpid-qmf-0.18-28.el5_10
qpid-cpp-0.18-35.el5_10
qpid-qmf-0.18-28.el5_10
qpid-tools-0.18-10.el5_9

RHEL6:
python-qpid-0.18-13.el6
python-qpid-qmf-0.18-28
qpid-cpp-0.18-35.el6
qpid-qmf-0.18-28.el6
qpid-tools-0.18-10.el6_4

-> VERIFIED

Comment 15 errata-xmlrpc 2014-10-22 16:34:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2014-1682.html