Bug 1011221 - MRG 3 produces poor latency results compared to MRG 2.1 when testing 10Gb networks
Summary: MRG 3 produces poor latency results compared to MRG 2.1 when testing 10Gb networks
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 3.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: 3.0
Assignee: mick
QA Contact: Matej Lesko
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-09-23 20:03 UTC by Tom Tracy
Modified: 2015-01-21 12:57 UTC
CC List: 8 users

Fixed In Version: qpid-cpp-0.22-44
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-01-21 12:57:24 UTC
Target Upstream Version:


Attachments
  graph showing throughput regression (65.29 KB, image/png) -- 2014-05-28 12:29 UTC, mick
  Performance of version 0.22A (57.07 KB, image/jpeg) -- 2014-06-02 14:36 UTC, mick
  aconway's stap script (3.10 KB, text/plain) -- 2014-06-02 17:56 UTC, mick
  improvement from disabling credit calculation (44.53 KB, image/jpeg) -- 2014-06-09 17:56 UTC, mick
  version 0.22d results (43.19 KB, image/jpeg) -- 2014-06-11 15:51 UTC, mick
  trunk throughput with & without patch (37.53 KB, image/jpeg) -- 2014-06-27 10:50 UTC, mick
  latency comparison, 0.14 and 0.22-with-patch (230.55 KB, image/jpeg) -- 2014-07-22 15:40 UTC, mick
  the scripts I used -- final versions (1.56 KB, application/x-gzip) -- 2014-07-23 07:14 UTC, mick
  results (20.71 KB, application/vnd.oasis.opendocument.spreadsheet) -- 2014-08-22 14:30 UTC, Matej Lesko


Links
  Apache JIRA QPID-5910

Description Tom Tracy 2013-09-23 20:03:25 UTC
Description of problem: When testing latency against 10Gb network cards, we get poor results compared to the MRG 2.1 14-16 rpm packages. The testing method is the same for each set of MRG packages. The baseline was RHEL 6.4 with MRG 2.1; we do not see any major regression with RHEL 6.5 itself. We see the same issue with the MRG 2.2 18-14 rpm packages.


Version-Release number of selected component (if applicable):

qpid-cpp-server-0.22-16.el6.x86_64

How reproducible:
Every time

Steps to Reproduce:
1. Isolate sockets using tuna; pin the network interrupts to the same socket as qpidd.
2. Use numactl to start the qpidd process on the desired socket.
3. Run qpid-latency-test on the client machine, pinned the same way as qpidd (a sketch of these commands follows below).
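
A sketch of these steps (the IRQ list, socket number, and broker address are placeholders, and the tuna/numactl arguments and the --size flag are my assumptions, not the reporter's literal commands):

#! /bin/bash
# Sketch only: pin NIC interrupts and qpidd to the same socket,
# then drive the latency test from an identically pinned client.

# 1. Move the 10Gb NIC's interrupts to socket 0 (IRQ list is hardware-specific).
tuna --irqs=<nic-irq-list> --sockets=0 --move

# 2. Start the broker bound to socket 0.
numactl --cpunodebind=0 --membind=0 qpidd --auth=no -d

# 3. On the client box, pin the same way and run the latency test.
numactl --cpunodebind=0 --membind=0 \
    qpid-latency-test -b <broker-10gb-addr> --size 256 --count 1000000 --csv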

Actual results:

Tested both Solarflare and Broadcom 10Gb cards, and saw similar results on both Westmere- and SandyBridge-class machines.

Expected results:

Latency results should be low. Tested small, medium, and large messages (8, 256, 1024, and 2048 bytes).

Additional info:

The results are best viewed here, where they are graphed out:
http://perf1.perf.lab.eng.bos.redhat.com/network/MRG%20Comparions%20with%20Different%20Network%20Cards.ods

Comment 2 mick 2014-05-28 12:29:32 UTC
Created attachment 899962 [details]
graph showing throughput regression

png file showing throughput regression, comparing versions 0.14, 0.18, 0.22

Comment 3 mick 2014-05-28 13:13:28 UTC
The root cause of the latency regression appears to be a throughput regression.

At some point, as more and more data per second is pushed between a client and a broker, the system's ability to handle the increasing throughput fails, and messages build up in their queues.  A symptom of this is a very large increase in message latencies, which is what the reporter has seen -- i.e. not an increase of 20%, but increases of two or three orders of magnitude.

The problem here is not that this happens -- it is that it happens at a lower level of throughput in more recent versions than in earlier versions.

The message latencies that result when these backups begin are a good match for what the reporter observed.

In the graph in attachment 899962 [details], I quantify this effect with the following method.

For each version, try several different message sizes: 8, 64, 512, 2048, 4096, 8192, 16384, and 32768 bytes.  For each message size, run several separate tests, increasing the send rate until I see a sudden, sharp increase in message latency -- that is the point I graph.

The picture shows that the throughput-saturation point has decreased significantly both from 0.14 to 0.18 and from 0.18 to 0.22.
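
For illustration, the sweep amounts to something like the following (the --size/--rate flag names and the rate steps are assumptions, not the literal test script):

#! /bin/bash
# Sketch: for each message size, step the send rate upward and watch the
# per-second latency output for the runaway point.
BROKER=192.168.10.38   # placeholder broker address on the 10Gb network

for size in 8 64 512 2048 4096 8192 16384 32768; do
    for rate in $(seq 10000 2000 80000); do
        echo "size=${size} rate=${rate}"
        qpid-latency-test -b ${BROKER} --size ${size} --rate ${rate} \
                          --count 100000 --csv
    done
done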

Comment 4 mick 2014-05-28 14:58:39 UTC
Here is the same information as the above graph, in table form, with 0.14 throughput saturation points normalized to "100".



  Percent change in throughput saturation point
  =============================================

  defining 0.14 saturation point as "100"


                     throughput saturation point
                     ---------------------------
  message size       0.14       0.18        0.22

  --------------+-----------+----------+-----------+
                |           |          |           |
        8       |    100    |    84    |    79     |
                |           |          |           |
       64       |    100    |    82    |    79     |
                |           |          |           |
      512       |    100    |    77    |    73     |
                |           |          |           |
     2048       |    100    |    84    |    76     |
                |           |          |           |
     4096       |    100    |    92    |    82     |
                |           |          |           |
     8192       |    100    |    96    |    90     |
                |           |          |           |
    16384       |    100    |    94    |    94     |
                |           |          |           |
    32768       |    100    |    90    |    90     |
                |           |          |           |
  --------------+-----------+----------+-----------+

Comment 5 mick 2014-05-28 16:46:56 UTC
Using perf to measure instructions per second.

Testing the brokers with a run of qpid-latency-test that sends 1,000,000 messages of 256 bytes each.

I ran each test 3 times and took the average of the three.

0.14 instructions per second == 1.988 billion

0.22 instructions per second == 1.543 billion

1.543 / 1.988 == 0.78

This exactly corresponds to the throughput reduction I am seeing from 0.14 to 0.22.

I think the only way this can happen is if the processors are spending more time doing nothing at all.  And the only way that can happen, I think, is lock contention.
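
The measurement was along these lines (a sketch; the event list and the 60-second window are assumptions):

# Measure the running broker while qpid-latency-test drives it.
perf stat -e instructions,cycles -p $(pidof qpidd) -- sleep 60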

Comment 6 mick 2014-05-28 17:01:44 UTC
Doing profiling to look at how many calls the different brokers make to do the same amount of work -- with a run of  qpid-latency-test that sends 1,000,000 messages of 256 bytes each.

These numbers are perfectly repeatable.

Choosing a few of the functions that seem probably-mostly independent of each other:

                         0.14        0.22        diff %
--------------------+-----------+-----------+------------
                    |           |           |
strcmp              |  258383   |  268950   |  + 4
                    |           |           |
malloc              |  179870   |  226743   |  + 26
                    |           |           |
std::string::assign |   21326   |   27126   |  + 27
                    |           |           |
--------------------+-----------+-----------+------------


The increases in calls to malloc and std::string::assign from 0.14 to 0.22 are very similar to the decreases in throughput I am seeing.

These data make it clear that 0.22 is simply doing more work per unit of actual useful work (i.e. receiving and sending a message) than 0.14.
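
The call counts came from this kind of callgrind run (a sketch; the output file name and grep pattern are illustrative):

# Run the broker in the foreground under callgrind (no -d, so the profile
# covers the broker process itself), exercise it with qpid-latency-test,
# stop it, then annotate the collected profile.
valgrind --tool=callgrind --callgrind-out-file=qpidd.callgrind \
         qpidd --auth=no
callgrind_annotate qpidd.callgrind | grep -E 'strcmp|malloc|assign'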

Comment 7 mick 2014-05-28 17:03:40 UTC
The conclusions to be drawn from the data in comments 5 and 6 are

  1. inescapable

  2. mutually exclusive

So ... I'm doomed.

Comment 8 mick 2014-05-29 02:03:58 UTC
Producer Flow Control makes a difference!
Thanks for the idea, Justin!

In the 0.22 broker, turning off producer flow control removes exactly half of the throughput decrease from 0.18 to 0.22.

That still leaves a good ways to go to reach 0.14 -- but it is definitely significant.
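
For reference, producer flow control can be turned off broker-side roughly like this (a sketch; I am assuming these qpidd option names for this version -- verify with qpidd --help):

# Setting both default thresholds to 0 disables producer flow control
# for queues that do not override them.
qpidd --auth=no -d \
      --default-flow-stop-threshold=0 \
      --default-flow-resume-threshold=0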

Comment 9 mick 2014-06-02 14:36:25 UTC
Created attachment 901470 [details]
Performance of version 0.22A

0.22A is 0.22, except with producer flow control turned off and with all "ScopedUse" locks commented out of Queue.cpp.

Comment 10 mick 2014-06-02 14:38:03 UTC
Here is the same information, in table form:


   0.22A is 0.22 code except:
   ================================

      1. "ScopedUse" locks in Queue.cpp are commented out.
      2. Tests are run with producer flow control turned off.

      Note: The ScopedUse locks are to prevent the Store from
            prematurely deleting queues. (thanks, gsim)


                                                  improved
         0.14       0.22    regression  0.22a    regression
  ----------------------------------------------------------------
  8     61000      48000      21%       52000       15%
  64    62000      49000      21%       51000       18%
  512   60000      44000      27%       46000       23%
  2048  50000      38000      24%       43000       14%
  4096  40000      33000      18%       36000       10%
  8192  29000      26000      10%       26000       10%

Comment 11 mick 2014-06-02 14:42:41 UTC
To sum up, 0.22A eliminates about 26% of the regression that we see from 0.14 --> 0.22.

Comment 12 mick 2014-06-02 17:56:18 UTC
Created attachment 901514 [details]
aconway's stap script

to run this, 

  1. have qpidd running already
  2. as root:   stap /home/mick/mutexes.stp -x 1342 -d /usr/local/sbin/qpidd --ldd -v    (where the -x arg is the PID of the running qpidd, and the -d arg is the path to the running executable)
  3. after the stap script starts, start the script that exercises the broker (in my case, this is a script that runs qpid-latency-test)
  4. when you are done exercising the broker, do *not* stop the broker.  Instead, stop the stap script with CTRL-C.  It will print out its data on exit.
  5. finally, you can stop the broker.

Before using stap, I had to do this:

  A. yum install systemtap kernel-devel yum-utils; debuginfo-install kernel
  B. debuginfo-install glibc

Comment 13 mick 2014-06-03 17:12:33 UTC

When I run aconway's script the same way he did, I get very similar results.
This is reassuring.

*However*, this is not how I will run it going forward.
Going forward, I will run only with producer flow control disabled,
to give a fairer comparison with 0.14.



  1. run broker this way -------------------------------------------

#! /bin/bash

QPIDD=qpidd

rm -rf /tmp/mick
mkdir  /tmp/mick


$QPIDD                                           \
  --data-dir /tmp/mick/data_1                    \
  --auth=no                                      \
  -d


echo "started broker from $QPIDD  "



2. run latency test this way ---------------------------------------

#! /bin/bash

# This has to be changed for different installs.
# They go to different paths.
LATENCY_TEST=/usr/local/libexec/qpid/tests/qpid-latency-test


TEN_GB_ADDR=192.168.10.38


echo "emulate Alan version..."


${LATENCY_TEST} -b ${TEN_GB_ADDR} --count 50000 --csv



3. Use recent qpid trunk code -----------------


=======================================


And here is what I get:

    contended 32664 times, 947121us total, 28us avg
    qpid::sys::Mutex::lock()
    qpid::sys::ScopedLock<qpid::sys::Mutex>::ScopedLock(qpid::sys::Mutex&)
    qpid::broker::Queue::enqueue(qpid::broker::TransactionContext*, qpid::broker::Message&)
    qpid::broker::Queue::deliverTo(qpid::broker::Message, qpid::broker::TxBuffer*)
    qpid::broker::Queue::deliver(qpid::broker::Message, qpid::broker::TxBuffer*)
    qpid::broker::DeliverableMessage::deliverTo(boost::shared_ptr<qpid::broker::Queue> const&)


    contended 19344 times, 398531us total, 20us avg
    qpid::sys::Mutex::lock()
    qpid::sys::ScopedLock<qpid::sys::Mutex>::ScopedLock(qpid::sys::Mutex&)
    qpid::broker::Queue::UsageBarrier::release()
    qpid::broker::Queue::ScopedUse::~ScopedUse()
    qpid::broker::Queue::enqueue(qpid::broker::TransactionContext*, qpid::broker::Message&)
    qpid::broker::Queue::deliverTo(qpid::broker::Message, qpid::broker::TxBuffer*)

Comment 14 mick 2014-06-04 16:43:57 UTC
I continue to get contradictory results from "perf" and "systemtap".

The perf results clearly suggest a large amount of extra lock contention in 0.22, because instructions per second are lower by 28% in 0.22 compared to 0.14.

However, a systemtap lock-contention script shows that 0.14 has *twice* as much lock contention (measured in microseconds, not contention counts) as 0.22.


So!

Hypothesis:  the systemtap instrumentation changes qpidd's behavior too much to give meaningful results.

To test this hypothesis:  use valgrind+callgrind to look for extra work being done in 0.22 in areas where locks are held.  If we are doing substantial extra work in those areas, this will strongly support the idea that we are causing lock contention, which causes the instructions-per-second reduction seen by "perf".

Comment 15 mick 2014-06-09 17:56:12 UTC
Created attachment 905459 [details]
improvement from disabling credit calculation

In file MessageTransfer.cpp, the functions getRequiredCredit() and computeRequiredCredit() both have this line:

  frames.map_if(sum, qpid::framing::TypeFilter2<qpid::framing::HEADER_BODY, qpid::framing::CONTENT_BODY>());

If I comment that out and just return or store (whichever is appropriate for the function) the value 0, I get the results in the attached graph.
One half of the throughput regression from 0.14 to 0.22 goes away.

Is this because that expression is somehow shockingly inefficient?
Or is it an effect of the value that is returned?

Aha.  Next test: find out what value those expressions are returning, and return the same value, hardcoded.

Comment 16 Alan Conway 2014-06-09 19:34:35 UTC
There are two issues:

1. It scans the entire frame set each time; the frame set always starts with the header frames, so it is wasting time skipping over these.

2. The message size doesn't change (I don't think) over the message lifetime, so this could be computed once, when we decode the message.

Note that there are usually only a few frames in a message; small messages probably have only one header frame and one body frame.  However, if this code is getting called many times per message, then it might be an issue.  If it is being called many times per message, it would also be instructive to find out why -- there is no obvious reason why we should repeatedly calculate the size of the same message.

Comment 17 mick 2014-06-11 13:38:48 UTC
Here is a function template in FrameSet.h:

    template <class F, class P> void map_if(F& functor, P predicate) const {
        for(Frames::const_iterator i = parts.begin(); i != parts.end(); i++) {
            if (predicate(*i)) functor(*i);
        }
    }

That function is called from MessageTransfer::getRequiredCredit() and MessageTransfer::computeRequiredCredit().



I wrote a Plain Old Function that I think does the same thing:

    uint32_t contentAndHeaderSize ( ) const {
        uint32_t sum = 0;
        for(Frames::const_iterator i = parts.begin(); i != parts.end(); ++i ) {
            uint8_t type = (*i).getBody()->type();
            if ((type == qpid::framing::HEADER_BODY ) || (type == qpid::framing::CONTENT_BODY ))
                sum += (*i).encodedSize();
        }
        return sum;
    }


My function always returns 24 bytes more than the template version.


If I substitute calls to my function in MessageTransfer::{get,compute}RequiredCredit(), I get rid of about 50% of the throughput regression from qpid 0.14 to qpid 0.22 at mid-range message sizes.

By "mid-range" I mean 2K to 8K message sizes.  At 512 bytes and below, this change has little effect (although the small effect it does have is positive).

At higher message sizes, all versions come very close to each other in performance in my tests.

Comment 18 Gordon Sim 2014-06-11 13:42:48 UTC
(In reply to mick from comment #17)
> I wrote a Plain Old Function that I think does the same thing:
> ...
>                 sum += (*i).encodedSize();

This should be i->getBody()->encodedSize(). Otherwise it includes the frame header, which is 12 bytes per frame -- with two frames counted, that accounts for the extra 24 bytes.

Comment 19 mick 2014-06-11 15:51:34 UTC
Created attachment 907702 [details]
version 0.22d results

In this version:

  1. Replace the map_if call in MessageTransfer::computeRequiredCredit and getRequiredCredit with this function in FrameSet.h:

    uint32_t contentAndHeaderSize ( ) const {
        uint32_t sum = 0;
        for(Frames::const_iterator i = parts.begin(); i != parts.end(); ++i ) {
            uint8_t type = (*i).getBody()->type();
            if ((type == qpid::framing::HEADER_BODY ) || (type == qpid::framing::CONTENT_BODY ))
                sum += (*i).getBody()->encodedSize();
        }
        return sum;
    }

 NOTE: changed to    (*i).getBody()->encodedSize()   to give the correct size -- thanks, gsim!



2. In the function SessionState::handleContent, put the call to msgBuilder.end() *before* the call to semanticState.route().
   This eliminates the calls to MessageTransfer::getRequiredCredit() that were happening *before* the calls to MessageTransfer::computeRequiredCredit(), which caused an extra calculation of the required credit -- I think with a lock held.


I am planning to turn this into a patch, submit it for review, and go on from there.

Comment 20 mick 2014-06-27 10:50:53 UTC
Created attachment 912758 [details]
trunk throughput with & without patch

The attached graph shows latency averages for trunk code (as of 26 June 2014), with and without the patch.

The patch was originally intended for an older code version, but it has a reasonable effect on trunk.  With this patch, latency does not skyrocket as quickly as it did before.

Tested with messages of 2048 bytes.

Comment 21 mick 2014-06-27 14:22:15 UTC
Different failure-mode on trunk.
-----------------------------------------------------


Recent (as of yesterday) trunk code behaves very differently under my testing than the 0.14 or 0.22 code.

In the 0.14 and 0.22 versions, I was doing this:

  * choose a message size
  * at that size, increase messages per second until throughput saturates.
  * throughput saturates when you see a latency "runaway".  i.e. the latency
    numbers that qpid-latency-test prints out will look something like this:
      3   10   37  112  250  500  1000 .... 
    They increase rapidly, without limit.

So at message size == 2048, if I saw the runaway at 37000 messages per second, I would put the point ( 37000, 2048 ) on my graph.

** but ******

That is not how trunk behaves!

The latest trunk code never runs away like that.  It seems like this could only be due to some kind of producer flow control -- except that I have that turned off!

But it does run away in a different sense.  For messages of a given size, e.g. 2048 bytes, as you increase messages per second you reach a point where the average latency increases massively.  It does *not* run away -- it remains stable -- but it becomes very large.

See my attachment 912758 [details] (comment 20) for an illustration of what I mean.

And also note that in this newest code, the number of messages per second at which the latency goes high is quite a bit *lower* (i.e. worse) than the runaway point in previous code.

Bummer.

Comment 22 mick 2014-07-21 17:33:05 UTC
Fix checked in on git branch remotes/origin/trunk-mick-bz1011221 .

Comment 23 Justin Ross 2014-07-21 17:49:12 UTC
(In reply to mick from comment #22)
> Fix checked in on git branch remotes/origin/trunk-mick-bz1011221 .

I'm confused by this.  We want the commit upstream, in Qpid's authoritative repo.  Is it going to get there?

Comment 24 mick 2014-07-22 11:27:21 UTC
(In reply to Justin Ross from comment #23)
> (In reply to mick from comment #22)
> > Fix checked in on git branch remotes/origin/trunk-mick-bz1011221 .
> 
> I'm confused by this.  We want the commit upstream, in Qpid's authoritative
> repo.  Is it going to get there?


Yes, sorry -- I didn't understand where it should go.

Comment 25 Justin Ross 2014-07-22 13:25:11 UTC
https://svn.apache.org/r1612559

Comment 27 mick 2014-07-22 15:40:06 UTC
Created attachment 919951 [details]
latency comparison, 0.14 and 0.22-with-patch

To make these charts, I used the qpid-latency-test from version 0.14.

It puts out a result every second.
For each point on the graph, I took 30 seconds of output and averaged the latency results.
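
The averaging step amounts to something like this (a sketch; the --rate flag and the latency column position in the --csv output are my assumptions):

# Take 30 one-second CSV samples and average the latency column.
qpid-latency-test -b ${BROKER} --size 2048 --rate 40000 --csv |
    head -n 30 |
    awk -F, 'NF { sum += $2; n++ } END { if (n) print sum / n }'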

On my two boxes, the best results I got were close to 0.2 msec.  So when latency reached 10 times that number, I stopped (or when I had a latency runaway, where the numbers increased without limit).

I had separate boxes for the broker and the client; the two boxes were connected by 10 GbE.  This was as close as I could come to simulating the hardware used by the original reporter of the bug.

I will provide the actual scripts I used in a separate attachment.

Comment 28 mick 2014-07-23 07:14:39 UTC
Created attachment 920122 [details]
the scripts I used -- final versions

These are the scripts I used to start the broker and the client during testing for this bug.  Please see the README file, included.

Untarring this archive will create a new directory: 1011221_scripts.

Comment 31 Matej Lesko 2014-08-22 14:30:05 UTC
Created attachment 929675 [details]
results

Every result in this table represents an average over 30 test runs.
Deformations are caused by one or two extreme values among the 30 results; therefore, those results cannot be taken as valid.

Comment 32 Matej Lesko 2014-08-22 14:50:22 UTC
In addition to comment 31, I would like to add that there is a visible latency improvement.

Here are the packages used, which contain the already-mentioned fix:

Packages from the machine with the broker:
qpid-cpp-client-0.22-46.el6.x86_64
qpid-cpp-client-devel-0.22-46.el6.x86_64
qpid-cpp-client-devel-docs-0.22-46.el6.noarch
qpid-cpp-client-rdma-0.22-46.el6.x86_64
qpid-cpp-debuginfo-0.22-46.el6.x86_64
qpid-cpp-server-0.22-46.el6.x86_64
qpid-cpp-server-devel-0.22-46.el6.x86_64
qpid-cpp-server-ha-0.22-46.el6.x86_64
qpid-cpp-server-linearstore-0.22-46.el6.x86_64
qpid-cpp-server-rdma-0.22-46.el6.x86_64
qpid-cpp-server-xml-0.22-46.el6.x86_64
rh-qpid-cpp-tests-0.22-46.el6.x86_64

Packages from the machine with the clients:
qpid-cpp-client-0.22-46.el6.i686
qpid-cpp-client-devel-0.22-46.el6.i686
qpid-cpp-client-devel-docs-0.22-46.el6.noarch
qpid-cpp-client-rdma-0.22-46.el6.i686
qpid-cpp-debuginfo-0.22-46.el6.i686
qpid-cpp-server-0.22-46.el6.i686
qpid-cpp-server-devel-0.22-46.el6.i686
qpid-cpp-server-ha-0.22-46.el6.i686
qpid-cpp-server-linearstore-0.22-46.el6.i686
qpid-cpp-server-rdma-0.22-46.el6.i686
qpid-cpp-server-xml-0.22-46.el6.i686

