Bug 507421

Summary: Cluster flow control does not appear to be working properly.
Product: Red Hat Enterprise MRG Reporter: Alan Conway <aconway>
Component: qpid-cppAssignee: Alan Conway <aconway>
Status: CLOSED ERRATA QA Contact: Tomas Rusnak <trusnak>
Severity: high Docs Contact:
Priority: high    
Version: 1.1.1CC: freznice, gsim, ppecka, trusnak
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-10-14 16:02:10 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Patch to verify read-credit on recent code.
none
Patch to verify read-credit on revision 752560.
none
Correct patch to verify read-credit on revision 752560. none

Description Alan Conway 2009-06-22 17:52:56 UTC
Description of problem:

While investigating bug 507388 it appears that cluster flow control is not working properly. By adding log statements to cluster::Connection::decode and giveReadCredit I saw long runs of calls to decode without any credit. These are interleaved with runs where credit appears to be workig properly.

Comment 1 Alan Conway 2009-09-09 21:36:36 UTC
Cluster flow control fixed as of r813100

Comment 2 Alan Conway 2009-09-10 14:57:52 UTC
The fix to cluser flow control requires both of the following commits
813100 and 810591

Comment 4 ppecka 2010-06-07 14:52:43 UTC

Can term "long runs of calls" be specified with some integer approximation. 

does "credit" in "...decode without any credit..." mean, that line "giveReadCredit(1);" is actually not evaluated within qpid::cluster::Connection::decode?

Actually i've tried to check on described that behaviour with r810573 over gdb, started similar configuration as in bug 507388, and then set up break on line where giveReadCredit(1); is inside qpid::cluster::Connection::decode while running perftest --durable=true --size=8
During my last try i saw 914 calls of qpid::cluster::Connection::decode which never resulted in calling giveReadCredit(1);


otherwise when i checked with just one "break qpid::cluster::Connection::giveReadCredit"
then many calls of giveReadCredit method (1500254) were observed

Can you please clarify this issue?

Comment 5 Alan Conway 2010-06-08 12:33:12 UTC
Long runs means > --cluster-read-max which is 10 by default. I was seeing runs of 20-100 I believe. I demonstrated the problem by adding log or printf statements to giveReadCredit and decode. If flow control is working then this should always be true:

0 <= (number_of_decode - number_of_giveReadCredit) <= --cluster-read-max

Not sure what you were seeing in the debugger.

Comment 6 ppecka 2010-06-29 17:03:51 UTC
I'm still not able to see described behaviour, even when I rolled to little bit older version of qpid. 

qpidd-0.5.752581-3.el5
qpidc-rdma-0.5.752581-3.el5
qpidd-acl-0.5.752581-3.el5
qpidd-ssl-0.5.752581-3.el5
qpidc-perftest-0.5.752581-3.el5
qpid-java-qman-0.5.751061-1.el5
python-qpid-0.5.752581-1.el5
qpid-dotnet-0.4.738274-2.el5
qpidc-ssl-0.5.752581-3.el5
ruby-qpid-0.4.749380-2.el5
qpidd-cluster-0.5.752581-3.el5
qpid-java-client-0.5.751061-1.el5
qpidd-devel-0.5.752581-3.el5
qpidc-devel-0.5.752581-3.el5
qpidd-rdma-0.5.752581-3.el5
qpidc-debuginfo-0.5.752581-3.el5
qpidc-0.5.752581-3.el5
qpid-java-common-0.5.751061-1.el5
qpidd-xml-0.5.752581-3.el5

As i don't have such a deep knowledge about whole credit mechanism can you please attach Connection.cpp file containing, as you've said "log or printf statements to giveReadCredit and decode", to show how exactly this issue can be reproduced. I'm using the same reproducer as posted in bug 507388

Comment 8 Frantisek Reznicek 2010-10-05 08:16:28 UTC
Hi Alan,
could you possibly specify in more detail how you would like us to test that cluster flow control is working properly, please?
Petr spent some time ago more than week w/o success.

Thanks upfront.

Comment 9 Alan Conway 2010-10-05 17:30:54 UTC
Created attachment 451736 [details]
Patch to verify read-credit on recent code.

Comment 10 Alan Conway 2010-10-05 17:37:03 UTC
Created attachment 451738 [details]
Patch to verify read-credit on revision 752560.

How to verify this bug: Unfortunately I can't think of any way to verify this bug without instrumenting the code so I've attached 2 patches. 

Apply rccheck-752560.patch to revision 752560, start a clustered broker and run 
perftest --size 8 --durable

You should see a lot of debugging messages and an assertion failure: 

qpidd: ../../qpid/cpp/src/qpid/cluster/Connection.cpp:218: size_t qpid::cluster::Connection::decode(const char*, size_t): Assertion `cc >= 0' failed.

Now apply rccheck-trunk.patch to the version you want to verify and run perftest as before. You should see a lot of debugging messages but no assertion failure.

The patches are functionally the same, but the code has changed so much that  separate patch files are needed for the different revisions.

Comment 11 Tomas Rusnak 2010-10-06 12:23:13 UTC
I tried to compile qpid at revision 752560 with patch from attachment to reproduce the issue:

qpid/cluster/Connection.cpp: In member function 'void qpid::cluster::Connection::giveReadCredit(int)':
qpid/cluster/Connection.cpp:101: error: 'class qpid::cluster::Cluster' has no member named 'getSettings'
make[3]: *** [qpid/cluster/Connection.lo] Error 1

void Connection::giveReadCredit(int credit) {
    if (cluster.getReadMax() && credit) {
        // FIXME aconway 2010-10-05: TEST ONLY
        int cc = (readCreditCheck += credit);
        QPID_LOG(critical, "FIXME " << this << " giveReadCredit(" << credit << ") " << cc);
        assert(cc <= int(cluster.getSettings().readMax));

        output.giveReadCredit(credit);
    }
}

The getSettings method is missing in particular Class. Please could you take a look at it, and found the right revision which can be compiled with your patch?

Comment 12 Alan Conway 2010-10-06 13:03:04 UTC
Created attachment 451884 [details]
Correct patch to verify read-credit on revision 752560.

This is the correct patch for revision 752560, previous one was a mistake.

Comment 13 Tomas Rusnak 2010-10-08 09:50:07 UTC
Restested over current packages on RHEL5/x86 and x86_64 with:

qpid-cpp-server-store-0.7.946106-17.el5
qpid-tests-0.7.946106-1.el5
qpid-cpp-server-0.7.946106-17.el5
qpid-cpp-client-devel-0.7.946106-17.el5
python-qpid-0.7.946106-14.el5
qpid-java-common-0.7.946106-10.el5
qpid-cpp-client-devel-docs-0.7.946106-17.el5
qpid-cpp-server-devel-0.7.946106-17.el5
qpid-tools-0.7.946106-11.el5
qpid-cpp-server-xml-0.7.946106-17.el5
qpid-cpp-client-0.7.946106-17.el5
qpid-cpp-server-ssl-0.7.946106-17.el5
qpid-java-client-0.7.946106-10.el5
qpid-cpp-server-cluster-0.7.946106-17.el5
qpid-cpp-mrg-debuginfo-0.7.946106-17.el5
qpid-cpp-client-ssl-0.7.946106-17.el5

No regression found.

>>> VERIFIED

Comment 14 Alan Conway 2010-10-12 19:18:54 UTC
Tech note not required, this problem was internally found & fixed there are no known cases of it causing a user-visible problem.

Comment 16 errata-xmlrpc 2010-10-14 16:02:10 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html