Description of problem:

Scenario: two machines, RHEL 5.7 and RHEL 6.2 (latest snapshot), with the latest bits from MRG (Messaging and Grid), which means qpid 0.10 on RHEL5 and qpid 0.12 on RHEL6.

The broker runs on the RHEL5 machine and is configured to use DIGEST-MD5 (mech_list: DIGEST-MD5 in /etc/sasl2/qpidd.conf). A cumin user (password cumin) has been created in the sasl db, and likewise for grid and sesame (realm: QPID). Sesame and condor have been configured to connect to the broker using DIGEST-MD5. python-saslwrapper is installed on both machines.

When the broker is started, but condor and sesame are not running on either machine, the command

  qpid-stat -q --sasl-mechanism=DIGEST-MD5 cumin/cumin@broker-machine-name:5672

(or -b, or -c) works correctly from both machines.

If sesame is started on the machine where the broker runs, qpid-stat (with the same parameters) still works.

When condor is started on the machine where the broker runs, qpid-stat:
- works when invoked from the machine where the broker runs (RHEL5)
- fails when invoked from the other machine (RHEL6) with:
  Failed: SessionException - (None, 'connection aborted')

This, incidentally, prevents cumin (running on the RHEL6 machine) from connecting to the broker using DIGEST-MD5.

Version-Release number of selected component (if applicable):

RHEL5:
python-qpid-0.10-1.el5
python-qpid-qmf-0.10-11.el5
qpid-cpp-client-0.10-9.el5
qpid-cpp-server-ssl-0.10-9.el5
condor-7.6.5-0.6.el5
qpid-tools-0.10-6.el5
condor-classads-7.6.5-0.6.el5
qpid-cpp-server-0.10-9.el5
condor-qmf-7.6.5-0.6.el5
qpid-cpp-client-ssl-0.10-9.el5
qpid-qmf-0.10-11.el5

RHEL6:
qpid-cpp-client-ssl-0.12-6.el6.x86_64
condor-7.6.5-0.6.el6.x86_64
python-qpid-0.12-1.el6.noarch
qpid-cpp-client-0.12-6.el6.x86_64
qpid-qmf-0.12-6.el6.x86_64
condor-classads-7.6.5-0.6.el6.x86_64
qpid-tools-0.12-2.el6.noarch
python-qpid-qmf-0.12-6.el6.x86_64
condor-qmf-7.6.5-0.6.el6.x86_64
qpid-cpp-server-0.12-6.el6.x86_64
qpid-cpp-server-ssl-0.12-6.el6.x86_64
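For reference, the broker-side SASL setup described above amounts to roughly the following. This is a sketch, not taken from the affected systems; the sasldb path shown is an assumed default and is not stated in this report:

/etc/sasl2/qpidd.conf (relevant lines):
  mech_list: DIGEST-MD5
  sasldb_path: /var/lib/qpidd/qpidd.sasldb    <-- assumed default location

Creating the cumin user in the sasl db under realm QPID (path again assumed):
  saslpasswd2 -c -f /var/lib/qpidd/qpidd.sasldb -u QPID cumin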
I tried the same command line (qpid-stat...) from another RHEL5 machine, and it reported the same error.
After switching the roles of RHEL5 and RHEL6 (broker on RHEL6), qpid-tool works from both machines even when condor is started. This may be an issue specific to 0.10.
We believe this is fixed upstream as of 0.12. We'd like QE to consider verifying it for 2.2 when we have new builds.
I have a different reproducer which may be the same issue. It involves only a single machine (it happens to be the grid CM) and multiple local clients.

RHEL 5.5
qpid version is 0.14
broker mech_list is ANONYMOUS PLAIN DIGEST-MD5
broker port changed to 5777 to keep the rest of the grid from finding it
sasldb with cumin user as described above

1) Run about 5 local clients connected to the broker. I did this by running 5 qpid-tools, and also by running cumin and controlling the number of data and web instances with the [webs] and [datas] settings in cumin.conf. Which client is used doesn't seem to matter as long as the connection is held open. (A scripted stand-in for these clients is sketched after this comment.)

2) Run qpid-stat -c cumin/cumin@localhost

   qpid-stat exits with:
   Failed: SessionException - (None, 'connection aborted')

   I got this result 100% of the time with 5 or more other clients running, maybe 50% (?) with 4 clients. With fewer than 4 clients, qpid-stat succeeded 100% of the time.

3) Note, the following two invocations work regardless of the number of clients:
   qpid-stat -c localhost:5777
   qpid-stat -c localhost:5777 --sasl-mechanism='PLAIN'

For example --

# ps -ef | grep -i qpid-tool
root  9029   379 1 20:38 pts/5  00:00:14 /usr/bin/python /usr/bin/qpid-tool localhost:5777
root 15807 15419 1 20:41 pts/11 00:00:11 /usr/bin/python /usr/bin/qpid-tool localhost:5777
root 18595 18306 1 20:42 pts/13 00:00:10 /usr/bin/python /usr/bin/qpid-tool localhost:5777
root 22364 11174 2 20:57 pts/9  00:00:00 /usr/bin/python /usr/bin/qpid-tool localhost:5777
root 22633 25311 1 20:57 pts/3  00:00:00 /usr/bin/python /usr/bin/qpid-tool localhost:5777
root 24073  1457 0 20:57 pts/12 00:00:00 grep -i qpid-tool

# qpid-stat -c cumin/cumin@localhost:5777
Failed: SessionException - (None, 'connection aborted')

# kill -9 15807 18595

# qpid-stat -c cumin/cumin@localhost:5777
Connections
  client-addr                     cproc      cpid   auth            connected  idle  msgIn  msgOut
  ==================================================================================================
  127.0.0.1:5777-127.0.0.1:46292  qpid-tool  22633  anonymous@QPID  1m 40s     0s      437    986
  127.0.0.1:5777-127.0.0.1:46230  qpid-tool  22364  anonymous@QPID  1m 54s     0s      509   1.20k
  127.0.0.1:5777-127.0.0.1:47871  qpid-stat  27170  cumin@QPID      0s         0s      210    265
  127.0.0.1:5777-127.0.0.1:54322  qpid-tool   9029  anonymous@QPID  20m 0s     0s    3.55k  10.3k
#
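As a stand-in for the background clients in step 1, something like the following can hold several connections open while qpid-stat is run from another shell. This is only a sketch: it assumes python-qpid's qpid.messaging API and reuses the broker port and credentials from the reproducer above; the original reproducer used qpid-tool instances, and per the comment above the choice of client does not appear to matter.

#!/usr/bin/env python
# Sketch: hold N client connections open against the broker so that
# qpid-stat can be run from another shell while they are active.
# Assumes python-qpid's qpid.messaging API; broker address and
# credentials are taken from the reproducer above.
import time
from qpid.messaging import Connection

BROKER = "localhost:5777"
N_CLIENTS = 5

connections = []
for i in range(N_CLIENTS):
    conn = Connection(BROKER,
                      username="cumin",
                      password="cumin",
                      sasl_mechanisms="DIGEST-MD5")
    conn.open()
    connections.append(conn)

print "holding %d connections open; run qpid-stat now" % len(connections)
try:
    time.sleep(600)          # keep the connections alive for ten minutes
finally:
    for conn in connections:
        conn.close()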
Note, reproduced with the formula from comment 6 on another box, for completeness.

RHEL 5.7
qpid version is 0.10
(In reply to comment #7)
> note, reproduced with the formula from comment 6 on another box for
> completeness
>
> RHEL 5.7
> qpid version is 0.10

Forgot to mention: RHEL 5.7, qpid version is 0.14.
Trevor was able to configure a system to reproduce this problem, and I've dug in a bit. We can reproduce the problem with a simple "qpid-stat" call when passing "cumin/cumin" as the credentials:

qpid-stat -c   ### works fine:

Connections
  client-addr                     cproc       cpid   auth            connected  idle  msgIn  msgOut
  ===================================================================================================
  127.0.0.1:5672-127.0.0.1:47723  cumin-data  4768   cumin@QPID      32s        0s      96    169
  127.0.0.1:5672-127.0.0.1:47722  cumin-data  4767   cumin@QPID      33s        0s     102    175
  127.0.0.1:5672-127.0.0.1:47725  cumin-web   4766   cumin@QPID      32s        0s      88    153
  127.0.0.1:5672-127.0.0.1:47724  cumin-data  4769   cumin@QPID      32s        0s     102    175
  127.0.0.1:5672-127.0.0.1:47726  cumin-web   4766   cumin@QPID      31s        0s      88    153
  127.0.0.1:5672-127.0.0.1:47729  qpid-stat   31720  anonymous@QPID  1s         0s     208    263

qpid-stat -c cumin/cumin@localhost   ### hits the problem:

Failed: SessionException - (None, 'connection aborted')

Some preliminary observations:

1) qpid-stat -c cumin/cumin@localhost works correctly IFF the other cumin users are -not- active. Once cumin-web/cumin-data are running, the failure starts to manifest itself.

2) The python client appears to authenticate fine prior to the failure. From the broker log:

Dec 19 08:49:15 dell-pesc1420-01 qpidd[31577]: 2011-12-19 08:49:15 info SASL: Mechanism list: DIGEST-MD5 ANONYMOUS
Dec 19 08:49:15 dell-pesc1420-01 qpidd[31577]: 2011-12-19 08:49:15 info SASL: Starting authentication with mechanism: DIGEST-MD5
Dec 19 08:49:15 dell-pesc1420-01 qpidd[31577]: 2011-12-19 08:49:15 info 127.0.0.1:5672-127.0.0.1:35175 SASL: Authentication succeeded for: cumin@QPID

3) The problem appears to be in the saslwrapper client. I've traced the failure to this point in the python client code, /usr/lib/python2.4/site-packages/qpid/connection.py::run():

    while not self.closed:
      try:
        data = self.sock.recv(64*1024)
        if self.security_layer_rx and data:
          status, data = self.security_layer_rx.decode(data)
        if not data:
          self.detach_all()
          break

At the point of failure, the decode call is returning (True, None); the expected behavior is to return the decoded data. Note that this happens well into the exchange: prior to this point, much of the connection data has decoded correctly. The security_layer_rx is set to the saslwrapper Client() object, IIRC.
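One way to confirm the (True, None) observation above is to wrap the security layer object and log each decode result before the connection thread acts on it. This is a hypothetical debugging shim, not part of qpid; it simply delegates to whatever decode/encode object the client already uses (the security_layer_rx attribute name is taken from the traced code above):

# Hypothetical debugging shim: wrap the client's security layer and log
# what decode() returns, to see (status, output length) for each chunk.
class LoggingSecurityLayer(object):
    def __init__(self, wrapped):
        self._wrapped = wrapped      # e.g. the saslwrapper Client() object

    def decode(self, data):
        status, decoded = self._wrapped.decode(data)
        if decoded:
            out_len = len(decoded)
        else:
            out_len = 0
        print "decode: status=%r, in=%d bytes, out=%d bytes" % (
            status, len(data), out_len)
        return status, decoded

    def encode(self, data):
        return self._wrapped.encode(data)

# Usage sketch (attribute name as seen in connection.py::run above):
#   conn.security_layer_rx = LoggingSecurityLayer(conn.security_layer_rx)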
This does not appear to be directly related to any recent changes in the default qpidd sasl config file. I do not use that file in my testing, but use instead the one that is created during 'make check' testing in the cpp tree -- which has not changed. And I am reproducing the bug just fine. Right now it also looks like it may not be related to the sasl handshake at all, but more on that presently.
Here's what I've seen so far.

1. The SASL handshake completes successfully for MD5, and the MD5 client begins to communicate with the broker. Both the broker code and the saslwrapper code on the client side seem to be perfectly happy.

2. The Doomed Client gets a SessionAttachBody, and then more types of messages, including:
   ExchangeBindBody
   MessageSubscribeBody
   MessageSetFlowModeBody
   QueueDeclareBody
   MessageSubscribeBody
   ExchangeQueryBody

3. When the Doomed Client (qpid-stat) dies, there is no indication of any error in the broker log. (The word "error" does not occur in any form.)

4. The final lines in the broker log concerning the Doomed Client do not seem to indicate anything unusual. The SessionState code receives a "sender marker completed" message; this has already happened 71 times in the log for this client. The next log message shows the broker receiving a SessionDetachBody, indicating that the client is leaving. From the broker's point of view, nothing bad has happened.

I am continuing now to look more closely at the client to see what it's upset about.
The current implementation of Connection::run in python/qpid/connection.py errors out if it receives no data in response to its call to

  self.security_layer_rx.decode(data)

But that decode function, in src/cyrus/saslwrapper.cpp, calls sasl_decode() in the cyrus sasl library. The man page for that function says:

  Note that sasl_decode can succeed and outputlen can be zero. If this is
  the case simply wait for more data and call sasl_decode again.

This is happening in the failure case. If I alter the python Connection::run code to only error out if the return code indicates failure, the observed bug happens 0 out of 5 times. If I do not make that change, the bug happens 5 out of 5 times. So -- this change definitely should be made.

The next order of business is to see whether this bug, which we see with a simplified reproducer, is actually the same thing that was originally reported.

Here is the diff:

Index: python/qpid/connection.py
===================================================================
--- python/qpid/connection.py   (revision 1228344)
+++ python/qpid/connection.py   (working copy)
@@ -161,9 +161,12 @@
         data = self.sock.recv(64*1024)
         if self.security_layer_rx and data:
           status, data = self.security_layer_rx.decode(data)
-        if not data:
-          self.detach_all()
-          break
+        # zero-length data is OK, as long as return code is good.
+        # when that happens, just keep trying. the sasl library
+        # will send us data eventually. ( or an error code. )
+        if not status:
+          self.detach_all()
+          break
       except socket.timeout:
         if self.aborted():
           self.close_code = (None, "connection timed out")
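For illustration only, here is a small standalone simulation of the receive-loop change above: a fake security layer returns a successful status with empty output for the first chunk (as cyrus sasl_decode may legitimately do) and decoded data on the next. The pre-fix test ("if not data") tears the session down on the first chunk; the fixed test ("if not status") keeps reading. All names here are invented for the example and are not part of qpid.

# Standalone illustration: treat (status=True, data='') from decode() as
# "wait for more input", and only give up when status is False.
class FakeSecurityLayer(object):
    """Succeeds with no output on the first chunk, then returns data.
    (A real SASL layer would buffer the first fragment internally.)"""
    def __init__(self):
        self.calls = 0

    def decode(self, data):
        self.calls += 1
        if self.calls == 1:
            return True, ""          # success, but nothing decodable yet
        return True, "decoded:" + data

def old_receive(layer, chunks):
    out = []
    for chunk in chunks:
        status, data = layer.decode(chunk)
        if not data:                 # pre-fix test: empty output == give up
            return out, "detached"
        out.append(data)
    return out, "ok"

def new_receive(layer, chunks):
    out = []
    for chunk in chunks:
        status, data = layer.decode(chunk)
        if not status:               # post-fix test: only a bad status detaches
            return out, "detached"
        if data:
            out.append(data)
    return out, "ok"

if __name__ == "__main__":
    chunks = ["frag1", "frag2"]
    print old_receive(FakeSecurityLayer(), chunks)   # ([], 'detached')
    print new_receive(FakeSecurityLayer(), chunks)   # (['decoded:frag2'], 'ok')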
Also note that if we were simply sending too much data, i.e. more than the buffer size that has been negotiated, we would receive back the error code SASL_BADPARAM, rather than SASL_OK.
JIRA QPID-3778: fixed (with the above diff) in svn rev 1228189.
I had to back out the above checkin because it broke two tests that I missed:

  qpid.tests.connection.ConnectionTest.testCloseGet
  qpid.tests.connection.ConnectionTest.testCloseListen

The new & improved fix went in as rev 1236729.
Reproduction:

root@sgraf-rhel5i:~# rpm -qa | grep -i -e md5 -e python-qpid -e cumin -e saslwrapper | sort
cumin-0.1.5192-1.el5.noarch
cyrus-sasl-md5-2.1.22-5.el5_4.3.i386
python-qpid-0.10-1.el5.noarch
python-qpid-qmf-0.10-11.el5.i386
python-saslwrapper-0.10-4.el5.i386
saslwrapper-0.10-4.el5.i386

root@sgraf-rhel6i:~# rpm -qa | grep -i -e md5 -e python-qpid -e cumin -e saslwrapper | sort
cumin-0.1.5192-1.el6.noarch
cyrus-sasl-md5-2.1.23-13.el6.i686
python-qpid-0.12-1.el6.noarch
python-qpid-qmf-0.12-6.el6.i686
python-saslwrapper-0.10-2.el6.i686
saslwrapper-0.10-2.el6.i686

I used a complete setup of condor+wallaby+sesame+cumin+qpidd, all using DIGEST-MD5. After all of them connect to qpidd, I wasn't able to perform qpid-stat:

# qpid-stat --connections --sasl-mechanism=DIGEST-MD5 cumin/cumin@localhost:5672
Failed: SessionException - (None, 'connection aborted')

I will also test the scenarios from Comment 0 and Comment 6 for verification.
Tested scenarios:
- Comment 0 - DIGEST-MD5 connection failure between RHEL5 and RHEL6 with more connections
- Comment 6 - multiple qpid-stat
- Comment 17 - complete setup of condor+wallaby+sesame+cumin+qpidd

I also verified pointing all daemons from Comment 17 at only one machine: 5i+5x+6i+6x all connected to 5i, then all to 5x, 6i, and 6x (64 MD5 connections to qpid at the same time). During the tests I also checked that cumin was working properly.

VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: Concurrent active use of SASL authentication
Consequence: The connection aborts
Fix: Correct SASL decoding to distinguish between failures and empty results
Result: SASL authentication completes and the connection remains operational