Description of problem:

Scenario: two machines, RHEL 5.7 and RHEL 6.2 (latest snapshot), with the latest bits from MRG (Messaging and Grid), which means qpid 0.10 on RHEL5 and qpid 0.12 on RHEL6.

The broker runs on the RHEL5 machine and is configured to use DIGEST-MD5 (mech_list: DIGEST-MD5 in /etc/sasl2/qpidd.conf). A cumin user (password cumin) has been created in the sasl db, and likewise for grid and sesame (realm: QPID). Sesame and condor have been configured to connect to the broker using DIGEST-MD5. python-saslwrapper is installed on both machines.

When the broker is started, but condor and sesame are not running on either machine, the command

  qpid-stat -q --sasl-mechanism=DIGEST-MD5 cumin/cumin@broker-machine-name:5672

(or -b, or -c) works correctly from both machines.

If sesame is started on the machine where the broker runs, qpid-stat (with the same parameters) still works.

When condor is started on the machine where the broker runs, qpid-stat:
- works when invoked from the machine where the broker runs (RHEL5)
- fails when invoked from the other machine (RHEL6) with:
  Failed: SessionException - (None, 'connection aborted')

This, incidentally, prevents cumin (running on the RHEL6 machine) from connecting to the broker using DIGEST-MD5.

Version-Release number of selected component (if applicable):

RHEL5:
python-qpid-0.10-1.el5
python-qpid-qmf-0.10-11.el5
qpid-cpp-client-0.10-9.el5
qpid-cpp-server-ssl-0.10-9.el5
condor-7.6.5-0.6.el5
qpid-tools-0.10-6.el5
condor-classads-7.6.5-0.6.el5
qpid-cpp-server-0.10-9.el5
condor-qmf-7.6.5-0.6.el5
qpid-cpp-client-ssl-0.10-9.el5
qpid-qmf-0.10-11.el5

RHEL6:
qpid-cpp-client-ssl-0.12-6.el6.x86_64
condor-7.6.5-0.6.el6.x86_64
python-qpid-0.12-1.el6.noarch
qpid-cpp-client-0.12-6.el6.x86_64
qpid-qmf-0.12-6.el6.x86_64
condor-classads-7.6.5-0.6.el6.x86_64
qpid-tools-0.12-2.el6.noarch
python-qpid-qmf-0.12-6.el6.x86_64
condor-qmf-7.6.5-0.6.el6.x86_64
qpid-cpp-server-0.12-6.el6.x86_64
qpid-cpp-server-ssl-0.12-6.el6.x86_64
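For reference, the broker-side SASL setup described above amounts to roughly the following. This is a sketch, not taken from the affected systems; the sasldb path shown is an assumed default and is not stated in this report:

/etc/sasl2/qpidd.conf (relevant lines):
  mech_list: DIGEST-MD5
  sasldb_path: /var/lib/qpidd/qpidd.sasldb    <-- assumed default location

Creating the cumin user in the sasl db under realm QPID (path again assumed):
  saslpasswd2 -c -f /var/lib/qpidd/qpidd.sasldb -u QPID cumin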
I tried the same command line (qpid-stat...) from another RHEL5 machine, and it reported the same error.
After switching the roles of RHEL5 and RHEL6 (broker on RHEL6), qpid-tool works from both machines even when condor is started. This may be an issue specific to 0.10.
We believe this is fixed upstream as of 0.12. We'd like QE to consider verifying it for 2.2 when we have new builds.
I have a different reproducer which may be the same issue. It involves only a single machine (it happens to be the grid CM) and multiple local clients.

RHEL 5.5
qpid version is 0.14
broker mech_list is ANONYMOUS PLAIN DIGEST-MD5
broker port changed to 5777 to keep the rest of the grid from finding it
sasldb with cumin user as described above

1) Run about 5 local clients connected to the broker. I did this by running 5 qpid-tools, and also by running cumin and controlling the number of data and web instances with the [webs] and [datas] settings in cumin.conf. Which client is used doesn't seem to matter as long as the connection is held open. (A scripted stand-in for these clients is sketched after this comment.)

2) Run qpid-stat -c cumin/cumin@localhost

   qpid-stat exits with:
   Failed: SessionException - (None, 'connection aborted')

   I got this result 100% of the time with 5 or more other clients running, maybe 50% (?) with 4 clients. With fewer than 4 clients, qpid-stat succeeded 100% of the time.

3) Note, the following two invocations work regardless of the number of clients:
   qpid-stat -c localhost:5777
   qpid-stat -c localhost:5777 --sasl-mechanism='PLAIN'

For example --

# ps -ef | grep -i qpid-tool
root  9029   379 1 20:38 pts/5  00:00:14 /usr/bin/python /usr/bin/qpid-tool localhost:5777
root 15807 15419 1 20:41 pts/11 00:00:11 /usr/bin/python /usr/bin/qpid-tool localhost:5777
root 18595 18306 1 20:42 pts/13 00:00:10 /usr/bin/python /usr/bin/qpid-tool localhost:5777
root 22364 11174 2 20:57 pts/9  00:00:00 /usr/bin/python /usr/bin/qpid-tool localhost:5777
root 22633 25311 1 20:57 pts/3  00:00:00 /usr/bin/python /usr/bin/qpid-tool localhost:5777
root 24073  1457 0 20:57 pts/12 00:00:00 grep -i qpid-tool

# qpid-stat -c cumin/cumin@localhost:5777
Failed: SessionException - (None, 'connection aborted')

# kill -9 15807 18595

# qpid-stat -c cumin/cumin@localhost:5777
Connections
  client-addr                     cproc      cpid   auth            connected  idle  msgIn  msgOut
  ==================================================================================================
  127.0.0.1:5777-127.0.0.1:46292  qpid-tool  22633  anonymous@QPID  1m 40s     0s      437    986
  127.0.0.1:5777-127.0.0.1:46230  qpid-tool  22364  anonymous@QPID  1m 54s     0s      509   1.20k
  127.0.0.1:5777-127.0.0.1:47871  qpid-stat  27170  cumin@QPID      0s         0s      210    265
  127.0.0.1:5777-127.0.0.1:54322  qpid-tool   9029  anonymous@QPID  20m 0s     0s    3.55k  10.3k
#
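As a stand-in for the background clients in step 1, something like the following can hold several connections open while qpid-stat is run from another shell. This is only a sketch: it assumes python-qpid's qpid.messaging API and reuses the broker port and credentials from the reproducer above; the original reproducer used qpid-tool instances, and per the comment above the choice of client does not appear to matter.

#!/usr/bin/env python
# Sketch: hold N client connections open against the broker so that
# qpid-stat can be run from another shell while they are active.
# Assumes python-qpid's qpid.messaging API; broker address and
# credentials are taken from the reproducer above.
import time
from qpid.messaging import Connection

BROKER = "localhost:5777"
N_CLIENTS = 5

connections = []
for i in range(N_CLIENTS):
    conn = Connection(BROKER,
                      username="cumin",
                      password="cumin",
                      sasl_mechanisms="DIGEST-MD5")
    conn.open()
    connections.append(conn)

print "holding %d connections open; run qpid-stat now" % len(connections)
try:
    time.sleep(600)          # keep the connections alive for ten minutes
finally:
    for conn in connections:
        conn.close()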
Note, reproduced with the formula from comment 6 on another box, for completeness.

RHEL 5.7
qpid version is 0.10
(In reply to comment #7)
> note, reproduced with the formula from comment 6 on another box for
> completeness
>
> RHEL 5.7
> qpid version is 0.10

Forgot to mention: RHEL 5.7, qpid version is 0.14.
Trevor was able to configure a system to reproduce this problem, and I've dug in a bit. We can reproduce the problem with a simple "qpid-stat" call when passing "cumin/cumin" as the credentials:

qpid-stat -c   ### works fine:

Connections
  client-addr                     cproc       cpid   auth            connected  idle  msgIn  msgOut
  ===================================================================================================
  127.0.0.1:5672-127.0.0.1:47723  cumin-data  4768   cumin@QPID      32s        0s      96    169
  127.0.0.1:5672-127.0.0.1:47722  cumin-data  4767   cumin@QPID      33s        0s     102    175
  127.0.0.1:5672-127.0.0.1:47725  cumin-web   4766   cumin@QPID      32s        0s      88    153
  127.0.0.1:5672-127.0.0.1:47724  cumin-data  4769   cumin@QPID      32s        0s     102    175
  127.0.0.1:5672-127.0.0.1:47726  cumin-web   4766   cumin@QPID      31s        0s      88    153
  127.0.0.1:5672-127.0.0.1:47729  qpid-stat   31720  anonymous@QPID  1s         0s     208    263

qpid-stat -c cumin/cumin@localhost   ### hits the problem:

Failed: SessionException - (None, 'connection aborted')

Some preliminary observations:

1) qpid-stat -c cumin/cumin@localhost works correctly IFF the other cumin users are -not- active. Once cumin-web/cumin-data are running, the failure starts to manifest itself.

2) The python client appears to authenticate fine prior to the failure. From the broker log:

Dec 19 08:49:15 dell-pesc1420-01 qpidd[31577]: 2011-12-19 08:49:15 info SASL: Mechanism list: DIGEST-MD5 ANONYMOUS
Dec 19 08:49:15 dell-pesc1420-01 qpidd[31577]: 2011-12-19 08:49:15 info SASL: Starting authentication with mechanism: DIGEST-MD5
Dec 19 08:49:15 dell-pesc1420-01 qpidd[31577]: 2011-12-19 08:49:15 info 127.0.0.1:5672-127.0.0.1:35175 SASL: Authentication succeeded for: cumin@QPID

3) The problem appears to be in the saslwrapper client. I've traced the failure to this point in the python client code, /usr/lib/python2.4/site-packages/qpid/connection.py::run():

    while not self.closed:
      try:
        data = self.sock.recv(64*1024)
        if self.security_layer_rx and data:
          status, data = self.security_layer_rx.decode(data)
        if not data:
          self.detach_all()
          break

At the point of failure, the decode call is returning (True, None); the expected behavior is to return the decoded data. Note that this happens well into the exchange: prior to this point, much of the connection data has decoded correctly. The security_layer_rx is set to the saslwrapper Client() object, IIRC.
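One way to confirm the (True, None) observation above is to wrap the security layer object and log each decode result before the connection thread acts on it. This is a hypothetical debugging shim, not part of qpid; it simply delegates to whatever decode/encode object the client already uses (the security_layer_rx attribute name is taken from the traced code above):

# Hypothetical debugging shim: wrap the client's security layer and log
# what decode() returns, to see (status, output length) for each chunk.
class LoggingSecurityLayer(object):
    def __init__(self, wrapped):
        self._wrapped = wrapped      # e.g. the saslwrapper Client() object

    def decode(self, data):
        status, decoded = self._wrapped.decode(data)
        if decoded:
            out_len = len(decoded)
        else:
            out_len = 0
        print "decode: status=%r, in=%d bytes, out=%d bytes" % (
            status, len(data), out_len)
        return status, decoded

    def encode(self, data):
        return self._wrapped.encode(data)

# Usage sketch (attribute name as seen in connection.py::run above):
#   conn.security_layer_rx = LoggingSecurityLayer(conn.security_layer_rx)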
This does not appear to be directly related to any recent changes in the default qpidd sasl config file. I do not use that file in my testing, but use instead the one that is created during 'make check' testing in the cpp tree -- which has not changed. And I am reproducing the bug just fine. Right now it also looks like it may not be related to the sasl handshake at all, but more on that presently.
Here's what I've seen so far.

1. The SASL handshake completes successfully for MD5, and the MD5 client begins to communicate with the broker. Both the broker code and the saslwrapper code on the client side seem to be perfectly happy.

2. The Doomed Client gets a SessionAttachBody, and then more types of messages, including:
   ExchangeBindBody
   MessageSubscribeBody
   MessageSetFlowModeBody
   QueueDeclareBody
   MessageSubscribeBody
   ExchangeQueryBody

3. When the Doomed Client (qpid-stat) dies, there is no indication of any error in the broker log. (The word "error" does not occur in any form.)

4. The final lines in the broker log concerning the Doomed Client do not seem to indicate anything unusual. The SessionState code receives a "sender marker completed" message; this has already happened 71 times in the log for this client. The next log message shows the broker receiving a SessionDetachBody, indicating that the client is leaving. From the broker's point of view, nothing bad has happened.

I am continuing now to look more closely at the client to see what it's upset about.
The current implementation of Connection::run in python/qpid/connection.py errors out if it receives no data in response to its call to

  self.security_layer_rx.decode(data)

But that decode function, in src/cyrus/saslwrapper.cpp, calls sasl_decode() in the cyrus sasl library. The man page for that function says:

  Note that sasl_decode can succeed and outputlen can be zero. If this is
  the case simply wait for more data and call sasl_decode again.

This is happening in the failure case. If I alter the python Connection::run code to only error out if the return code indicates failure, the observed bug happens 0 out of 5 times. If I do not make that change, the bug happens 5 out of 5 times. So -- this change definitely should be made.

The next order of business is to see whether this bug, which we see with a simplified reproducer, is actually the same thing that was originally reported.

Here is the diff:

Index: python/qpid/connection.py
===================================================================
--- python/qpid/connection.py   (revision 1228344)
+++ python/qpid/connection.py   (working copy)
@@ -161,9 +161,12 @@
         data = self.sock.recv(64*1024)
         if self.security_layer_rx and data:
           status, data = self.security_layer_rx.decode(data)
-        if not data:
-          self.detach_all()
-          break
+        # zero-length data is OK, as long as return code is good.
+        # when that happens, just keep trying. the sasl library
+        # will send us data eventually. ( or an error code. )
+        if not status:
+          self.detach_all()
+          break
       except socket.timeout:
         if self.aborted():
           self.close_code = (None, "connection timed out")
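For illustration only, here is a small standalone simulation of the receive-loop change above: a fake security layer returns a successful status with empty output for the first chunk (as cyrus sasl_decode may legitimately do) and decoded data on the next. The pre-fix test ("if not data") tears the session down on the first chunk; the fixed test ("if not status") keeps reading. All names here are invented for the example and are not part of qpid.

# Standalone illustration: treat (status=True, data='') from decode() as
# "wait for more input", and only give up when status is False.
class FakeSecurityLayer(object):
    """Succeeds with no output on the first chunk, then returns data.
    (A real SASL layer would buffer the first fragment internally.)"""
    def __init__(self):
        self.calls = 0

    def decode(self, data):
        self.calls += 1
        if self.calls == 1:
            return True, ""          # success, but nothing decodable yet
        return True, "decoded:" + data

def old_receive(layer, chunks):
    out = []
    for chunk in chunks:
        status, data = layer.decode(chunk)
        if not data:                 # pre-fix test: empty output == give up
            return out, "detached"
        out.append(data)
    return out, "ok"

def new_receive(layer, chunks):
    out = []
    for chunk in chunks:
        status, data = layer.decode(chunk)
        if not status:               # post-fix test: only a bad status detaches
            return out, "detached"
        if data:
            out.append(data)
    return out, "ok"

if __name__ == "__main__":
    chunks = ["frag1", "frag2"]
    print old_receive(FakeSecurityLayer(), chunks)   # ([], 'detached')
    print new_receive(FakeSecurityLayer(), chunks)   # (['decoded:frag2'], 'ok')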
Also note that if we were simply sending too much data, i.e. more than the buffer size that has been negotiated, we would receive back the error code SASL_BADPARAM, rather than SASL_OK.
JIRA QPID-3778: fixed (with the above diff) in svn rev 1228189.
I had to back out the above checkin because it broke two tests that I missed:

  qpid.tests.connection.ConnectionTest.testCloseGet
  qpid.tests.connection.ConnectionTest.testCloseListen

The new & improved fix went in as rev 1236729.
Reproduction:

root@sgraf-rhel5i:~# rpm -qa | grep -i -e md5 -e python-qpid -e cumin -e saslwrapper | sort
cumin-0.1.5192-1.el5.noarch
cyrus-sasl-md5-2.1.22-5.el5_4.3.i386
python-qpid-0.10-1.el5.noarch
python-qpid-qmf-0.10-11.el5.i386
python-saslwrapper-0.10-4.el5.i386
saslwrapper-0.10-4.el5.i386

root@sgraf-rhel6i:~# rpm -qa | grep -i -e md5 -e python-qpid -e cumin -e saslwrapper | sort
cumin-0.1.5192-1.el6.noarch
cyrus-sasl-md5-2.1.23-13.el6.i686
python-qpid-0.12-1.el6.noarch
python-qpid-qmf-0.12-6.el6.i686
python-saslwrapper-0.10-2.el6.i686
saslwrapper-0.10-2.el6.i686

I used a complete setup of condor+wallaby+sesame+cumin+qpidd, all using DIGEST-MD5. After all of them connect to qpidd, I wasn't able to perform qpid-stat:

# qpid-stat --connections --sasl-mechanism=DIGEST-MD5 cumin/cumin@localhost:5672
Failed: SessionException - (None, 'connection aborted')

I will also test the scenarios from Comment 0 and Comment 6 for verification.
Tested scenarios:
- Comment 0 - DIGEST-MD5 connection failure between RHEL5 and RHEL6 with more connections
- Comment 6 - multiple qpid-stat
- Comment 17 - complete setup of condor+wallaby+sesame+cumin+qpidd

I also verified pointing all daemons from Comment 17 at only one machine: 5i+5x+6i+6x all connected to 5i, then all to 5x, 6i, and 6x (64 MD5 connections to qpid at the same time). During the tests I also checked that cumin was working properly.

VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: Concurrent active use of SASL authentication
Consequence: The connection aborts
Fix: Correct SASL decoding to distinguish between failures and empty results
Result: SASL authentication completes and the connection remains operational