| Summary: | MRG 3 produces poor latency results when compared to MRG 2.1 testing 10Gb networks |
|---|---|
| Product: | Red Hat Enterprise MRG |
| Component: | qpid-cpp |
| Version: | 3.0 |
| Target Milestone: | 3.0 |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | medium |
| Hardware: | x86_64 |
| OS: | Linux |
| Reporter: | Tom Tracy <ttracy> |
| Assignee: | mick <mgoulish> |
| QA Contact: | Matej Lesko <mlesko> |
| CC: | aconway, esammons, freznice, gsim, iboverma, jross, mgoulish, perfbz |
| Keywords: | Performance, Regression |
| Fixed In Version: | qpid-cpp-0.22-44 |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Last Closed: | 2015-01-21 12:57:24 UTC |
Description
Tom Tracy
2013-09-23 20:03:25 UTC
Created attachment 899962 [details]
graph showing throughput regression
png file showing throughput regression, comparing versions 0.14, 0.18, 0.22
The root cause of the latency regression appears to be a throughput regression.
At some point, as you push more and more data per second between a client and a broker, the system's ability to handle the increasing throughput fails and messages build up in their queues. A symptom of this is a very large increase in message latencies, which is what the reporter has seen -- i.e., not an increase of 20%, but increases of two or three orders of magnitude.
The problem here is not that this happens -- but that it happens at a lower level of throughput in more recent versions than in earlier versions.
The message latencies that result when these backups begin to happen are a good match for what the reporter observed.
In the graph in attachment 899962 [details], I quantify this effect, with the following method.
For each version, I try several different message sizes: 8, 64, 512, 2048, 4096, 8192, 16384, and 32768 bytes. For each message size, I run several separate tests, increasing the send rate until I see a sudden sharp increase in message latency -- that's the point that I graph.
The picture shows that the throughput-saturation point has decreased significantly both from 0.14 to 0.18, and from 0.18 to 0.22 .
Here is the same information as the above graph, in table form, with 0.14 throughput saturation points normalized to "100".

Percent change in throughput saturation point
=============================================
defining the 0.14 saturation point as "100"

                throughput saturation point
               -----------------------------
message size       0.14      0.18      0.22
--------------+----------+----------+----------
           8  |    100   |     84   |     79
          64  |    100   |     82   |     79
         512  |    100   |     77   |     73
        2048  |    100   |     84   |     76
        4096  |    100   |     92   |     82
        8192  |    100   |     96   |     90
       16384  |    100   |     94   |     94
       32768  |    100   |     90   |     90
--------------+----------+----------+----------
Using perf to measure instructions per second, testing the brokers with a run of qpid-latency-test that sends 1,000,000 messages of 256 bytes each. I ran each test 3 times and took the average of the three.

0.14 instructions per second == 1.988 billion
0.22 instructions per second == 1.543 billion

1.543 / 1.988 == 0.78

This exactly corresponds to the throughput reduction I am seeing from 0.14 to 0.22. I think the only way this can happen is if the processors are spending more time doing nothing at all. And the only way that can happen, I think, is lock contention.

Now doing profiling to look at how many calls the different brokers make to do the same amount of work -- again with a run of qpid-latency-test that sends 1,000,000 messages of 256 bytes each.
These numbers are perfectly repeatable.
Choosing a few of the functions that seem probably-mostly independent of each other:
                          0.14        0.22     diff %
--------------------+------------+------------+--------
strcmp              |    258383  |    268950  |   + 4
malloc              |    179870  |    226743  |   + 26
std::string::assign |     21326  |     27126  |   + 27
--------------------+------------+------------+--------
The increases in calls to malloc and std::string::assign from 0.14 to 0.22 are very similar to the decreases in throughput I am seeing.
These data make it clear that 0.22 is simply doing more work than 0.14 per unit of actual useful work (i.e., receiving and sending a message).
The conclusions to be drawn from the data in comments 5 and 6 are:
1. inescapable
2. mutually exclusive

So ... I'm doomed.

Producer Flow Control makes a difference! Thanks for the idea, Justin! In the 0.22 broker, turning off producer flow control removes exactly half of the throughput decrease from 0.18 to 0.22. That still leaves a good ways to go to reach 0.14 -- but it is definitely significant.

Created attachment 901470 [details]
Performance of version 0.22A
0.22A is 0.22 except with producer flow control turned off, and with all "ScopedUse" locks commented out of Queue.cpp
Here is the same information, except in table form:
0.22A is 0.22 code except:
================================
1. "ScopedUse" locks in Queue.cpp are commented out.
2. Tests are run with producer flow control turned off.
Note: The ScopedUse locks are to prevent the Store from
prematurely deleting queues. (thanks, gsim)
                                            improved
message size    0.14     0.22   regression    0.22a   regression
-----------------------------------------------------------------
          8    61000    48000      21%       52000       15%
         64    62000    49000      21%       51000       18%
        512    60000    44000      27%       46000       23%
       2048    50000    38000      24%       43000       14%
       4096    40000    33000      18%       36000       10%
       8192    29000    26000      10%       26000       10%
To sum up, 0.22A eliminates about 26% of the regression that we see from 0.14 --> 0.22 .

Created attachment 901514 [details]
aconway's stap script
to run this,
1. have qpidd running already
2. as root: stap /home/mick/mutexes.stp -x 1342 -d /usr/local/sbin/qpidd --ldd -v
   (where the -x arg is the PID of the running qpidd, and the -d arg is the path to the executable that is running)
3. after stap script starts, start the script that exercises the broker. ( in my case, this is a script that runs qpid-latency-test )
4. When you are done exercising the broker, do *not* stop the broker. Instead, stop the stap script, with CTRL-C. It will print out its data on exit.
5. Finally you can stop the broker.
Before using stap, I had to do this:
A. yum install systemtap kernel-devel yum-utils
   debuginfo-install kernel
B. debuginfo-install glibc
When I run aconway's script the same way he did, I get very similar results.
This is reassuring.
*However* this is not how I will run it going forward.
Going forward, I will run only with producer flow control disabled,
to give a more fair comparison with 0.14 .
1. run broker this way -------------------------------------------
#! /bin/bash
QPIDD=qpidd
rm -rf /tmp/mick
mkdir /tmp/mick
$QPIDD \
--data-dir /tmp/mick/data_1 \
--auth=no \
-d
echo "started broker from $QPIDD "
2. run latency test this way ---------------------------------------
#! /bin/bash
# This has to be changed for different installs.
# They go to different paths.
LATENCY_TEST=/usr/local/libexec/qpid/tests/qpid-latency-test
TEN_GB_ADDR=192.168.10.38
echo "emulate Alan version..."
${LATENCY_TEST} -b ${TEN_GB_ADDR} --count 50000 --csv
3. Use recent qpid trunk code -----------------
=======================================
And here is what I get:
contended 32664 times, 947121us total, 28us avg
qpid::sys::Mutex::lock()
qpid::sys::ScopedLock<qpid::sys::Mutex>::ScopedLock(qpid::sys::Mutex&)
qpid::broker::Queue::enqueue(qpid::broker::TransactionContext*, qpid::broker::Message&)
qpid::broker::Queue::deliverTo(qpid::broker::Message, qpid::broker::TxBuffer*)
qpid::broker::Queue::deliver(qpid::broker::Message, qpid::broker::TxBuffer*)
qpid::broker::DeliverableMessage::deliverTo(boost::shared_ptr<qpid::broker::Queue> const&)
contended 19344 times, 398531us total, 20us avg
qpid::sys::Mutex::lock()
qpid::sys::ScopedLock<qpid::sys::Mutex>::ScopedLock(qpid::sys::Mutex&)
qpid::broker::Queue::UsageBarrier::release()
qpid::broker::Queue::ScopedUse::~ScopedUse()
qpid::broker::Queue::enqueue(qpid::broker::TransactionContext*, qpid::broker::Message&)
qpid::broker::Queue::deliverTo(qpid::broker::Message, qpid::broker::TxBuffer*)
I continue to get contradictory results from "perf" and "systemtap".

perf results clearly suggest a large amount of extra lock contention in 0.22, because instructions per second are lower by 28% in 0.22 as compared to 0.14. However, a systemtap lock-contention script shows that 0.14 has *twice* as much lock contention (measured in microseconds, not contentions) as 0.22.

So! Hypothesis: the systemtap technology is changing qpidd's behavior too much to get meaningful results.

To test this hypothesis: use valgrind+callgrind to look for extra work being done in 0.22 in areas where locks are held. If we are doing substantial extra work in those areas, this will strongly support the idea that we are causing lock contention, which causes the instructions-per-second reduction seen by "perf".

Created attachment 905459 [details]
improvement from disabling credit calculation
In file MessageTransfer.cpp, the functions getRequiredCredit() and computeRequiredCredit() both have this line:
frames.map_if(sum, qpid::framing::TypeFilter2<qpid::framing::HEADER_BODY, qpid::framing::CONTENT_BODY>());
If I comment that out and just return or store (whichever is appropriate for the function) the value 0, I get the results in the attached graph.
One half of the throughput regression from 0.14 to 0.22 goes away.
Is this because that expression is somehow shockingly inefficient?
Or.... is it an effect of the value that is returned?
aha. next test -- find out what value those exprs are returning... and just do the same, hardcoded.
There are two issues:

1. It scans the entire frame set each time; the frame set always starts with the header frames, so it is wasting time skipping over these.

2. The message size doesn't change (I don't think) over the message lifetime, so this could be computed once when we decode the message.

Note there are usually only a few frames in a message; small messages probably have only one header and one body frame. However, if this code is getting called many times per message, then it might be an issue. If it is being called many times per message, it would also be instructive to find out why -- there's no obvious reason why we should repeatedly calculate the size of the same message.

Here is a function template in FrameSet.h:
template <class F, class P> void map_if(F& functor, P predicate) const {
for(Frames::const_iterator i = parts.begin(); i != parts.end(); i++) {
if (predicate(*i)) functor(*i);
}
}
That fn is called from MessageTransfer::getRequiredCredit() and MessageTransfer::computeRequiredCredit().
I wrote a Plain Old Function that I think does the same thing:
uint32_t contentAndHeaderSize ( ) const {
uint32_t sum = 0;
for(Frames::const_iterator i = parts.begin(); i != parts.end(); ++i ) {
uint8_t type = (*i).getBody()->type();
if ((type == qpid::framing::HEADER_BODY ) || (type == qpid::framing::CONTENT_BODY ))
sum += (*i).encodedSize();
}
return sum;
}
My function always returns 24 bytes more than the template version.
If I substitute calls to my function in MessageTransfer::{get,compute}RequiredCredit(), I get rid of about 50% of the throughput regression from qpid 0.14 to qpid 0.22, in mid-range message sizes.
By "mid-range" I mean 2K to 8K message sizes. At 512 bytes and below this change has little effect (although the small effect it does have is positive.)
At higher message sizes all versions get very close to each other in performance in my tests.
(In reply to mick from comment #17)

> sum += (*i).encodedSize();

This should be i->getBody()->encodedSize(). Otherwise it includes the frame header, which is 12 bytes per frame (with two frames counted).

Created attachment 907702 [details]
version 0.22d results
In this version:
1. replace map_if call in MessageTransfer::computeRequiredCredit and getRequiredCredit with this function in FrameSet.h:
uint32_t contentAndHeaderSize ( ) const {
uint32_t sum = 0;
for(Frames::const_iterator i = parts.begin(); i != parts.end(); ++i ) {
uint8_t type = (*i).getBody()->type();
if ((type == qpid::framing::HEADER_BODY ) || (type == qpid::framing::CONTENT_BODY ))
sum += (*i).getBody()->encodedSize();
}
return sum;
}
NOTE: change to (*i).getBody()->encodedSize() to give correct size -- thanks, gsim!
2. In function SessionState::handleContent, put call to msgBuilder.end(); *before* call to semanticState.route().
This eliminates calls to MessageTransfer::getRequiredCredit() that were happening *before* calls to MessageTransfer::computeRequiredCredit(), which caused an extra calculation of the required credit -- I think with the lock held.
I am planning to turn this one into a patch, submit for review, and go on from there.
Created attachment 912758 [details]
trunk throughput with & without patch
The attached graph shows latency averages for trunk code (as of 26 June 2014) with and without patch.
The patch was originally intended for an older code version, but it has a reasonable effect on trunk: with this patch, latency does not skyrocket as quickly as before.
Tested with messages of 2048 bytes.
Different failure-mode on trunk.
-----------------------------------------------------
Recent (yesterday) trunk code behaves very differently under my testing than does the 0.14 or 0.22 code.
In the 0.14 and 0.22 versions, I was doing this:
* choose a message size
* at that size, increase messages per second until throughput saturates.
* throughput saturates when you see a latency "runaway". i.e. the latency
numbers that qpid-latency-test prints out will look something like this:
3 10 37 112 250 500 1000 ....
They increase rapidly, without limit.
So at message size == 2048, if I see the runaway at 37000 messages per second, I put the point (37000, 2048) on my graph.
** but ******
That is not how trunk behaves!
The latest trunk code never runs away like that. It seems like this could only be due to some kind of producer flow control -- except that I have that turned off!
But -- it does "run away" in a different sense. For messages of a certain size, e.g. 2048, as you increase messages per second, you reach a point where average latency increases massively. It does *not* run away without limit -- it is always stable -- but it becomes very large.
See my attachment 912758 [details] (comment 20) for an illustration of what i mean.
And also note that in this newest code -- the number of messages per second at which point the latency goes high -- is quite a bit *lower* ( i.e. worse ) than the runaway point in previous code.
Bummer.
Fix checked in on git branch remotes/origin/trunk-mick-bz1011221 .

(In reply to mick from comment #22)
> Fix checked in on git branch remotes/origin/trunk-mick-bz1011221 .

I'm confused by this. We want the commit upstream, in Qpid's authoritative repo. Is it going to get there?

(In reply to Justin Ross from comment #23)
> I'm confused by this. We want the commit upstream, in Qpid's authoritative
> repo. Is it going to get there?

Yes, sorry -- I didn't understand where it should go.

Created attachment 919951 [details]
latency comparison, 0.14 and 0.22-with-patch
To make these charts, I used the qpid-latency-test from version 0.14 .
It puts out a result every 1 second.
For each point on the graph, I took 30 seconds of output, and averaged the results for latency.
On my 2 boxes, the best results I got were close to 0.2 msec . So when latency reached 10 times that number, I stopped. ( Or when I had a latency runaway, where the numbers increased without limit. )
I had separate boxes for the broker and the client -- the two boxes were attached with 10 GbE. This was as close as I could simulate the hardware used by the original reporter of the bug.
I will provide the actual scripts I used in a separate attachment.
Created attachment 920122 [details]
the scripts I used -- final versions
These are the scripts I used to start broker and client during testing for this bug. Please see README file, included.
Untarring this archive will create a new directory: 1011221_scripts .
Created attachment 929675 [details]
results
Every result in this table represents an average value from 30 test runs.
Deformations are caused by one or two extreme values among the 30 results; therefore those results can't be taken as valid.
In addition to comment 31, I would like to add that there is a visible latency improvement.

Here are the packages used, which contain the already-mentioned fix.

Packages from the machine with the broker:
qpid-cpp-client-0.22-46.el6.x86_64
qpid-cpp-client-devel-0.22-46.el6.x86_64
qpid-cpp-client-devel-docs-0.22-46.el6.noarch
qpid-cpp-client-rdma-0.22-46.el6.x86_64
qpid-cpp-debuginfo-0.22-46.el6.x86_64
qpid-cpp-server-0.22-46.el6.x86_64
qpid-cpp-server-devel-0.22-46.el6.x86_64
qpid-cpp-server-ha-0.22-46.el6.x86_64
qpid-cpp-server-linearstore-0.22-46.el6.x86_64
qpid-cpp-server-rdma-0.22-46.el6.x86_64
qpid-cpp-server-xml-0.22-46.el6.x86_64
rh-qpid-cpp-tests-0.22-46.el6.x86_64

Packages from the machine with the clients:
qpid-cpp-client-0.22-46.el6.i686
qpid-cpp-client-devel-0.22-46.el6.i686
qpid-cpp-client-devel-docs-0.22-46.el6.noarch
qpid-cpp-client-rdma-0.22-46.el6.i686
qpid-cpp-debuginfo-0.22-46.el6.i686
qpid-cpp-server-0.22-46.el6.i686
qpid-cpp-server-devel-0.22-46.el6.i686
qpid-cpp-server-ha-0.22-46.el6.i686
qpid-cpp-server-linearstore-0.22-46.el6.i686
qpid-cpp-server-rdma-0.22-46.el6.i686
qpid-cpp-server-xml-0.22-46.el6.i686