Bug 470080

Summary:	Cluster integration with security.
Product:	Red Hat Enterprise MRG	Reporter:	Alan Conway <aconway>
Component:	qpid-cpp	Assignee:	mick <mgoulish>
Status:	CLOSED ERRATA	QA Contact:	Lubos Trilety <ltrilety>
Severity:	medium	Docs Contact:
Priority:	high
Version:	1.0	CC:	gsim, kgiusti, ltrilety
Target Milestone:	1.3
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Previously, a broker in a cluster could receive an encrypted frame before the appropriate codec was installed. Unable to interpret the frame, the broker terminated itself. This error has been fixed: an additional locking mechanism in the cluster ensures that the security handshake is completed before encrypted frames are processed.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2010-10-14 16:09:26 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	471632
Bug Blocks:

Description Alan Conway 2008-11-05 17:22:15 UTC

Description of problem:

Three aspects: 

1. Cluster members create insecure Qpid connections to new members in order to provide a state update. Some users may want to secure these connections to prevent "impostors" from joining a cluster.

2. Clients may want to make secure connections to the cluster. We need to verify that this works; there may be some issues to sort out.

3. openAIS provides security features, including encryption, to secure the cluster communication. We need to document how to use these features for Qpid.

Comment 1 Alan Conway 2009-01-09 14:51:17 UTC

The cluster currently only advertises TCP connections for failover.

To advertise other protocols (ssl, infiniband etc.) we need to:
 - define a URL format for the protocol
 - maybe update the code that generates the default URL.

Not sure if we want to include other protocols in the default URL or just allow the user to specify them manually.

Comment 2 Alan Conway 2009-02-17 14:07:07 UTC

The following has been implemented but is not properly tested: 
 - SASL ID is set on updated connections when a new member joins.
 - Authn data for update connections is specified by --cluster-username, --cluster-password, --cluster-mechanism 

These tests are missing:
 - ACL test - take ACL file relative to pwd if no-data-dir. auth=no.
 - update authn test: enable security & verify updates use correct mech/user/pwd.

These features remain to be done:
 - Enable SSL on update connections.
 - include SSL in failover URLs.  This depends on bug 471632.
 - documentation pointers to openais encryption configuration.

Comment 3 Alan Conway 2009-03-10 19:14:39 UTC

ACL test is written and passing in cluster_test.cpp, but it exposes a memory leak in the SASL client code. The test is disabled till the leak is fixed.

Comment 5 Ken Giusti 2009-11-04 16:28:19 UTC

FYI: I have posted a feature request upstream for allowing secure/auth connections from a client.  See https://issues.apache.org/jira/browse/QPID-2187

Comment 6 mick 2010-05-14 10:58:30 UTC

fixed in svn rev 944158

Reproduction Notes
==========================================

There are three separate problems to check.


1. When a cluster is started with authentication enabled, and a client requests a secure connection -- make sure that no brokers in the cluster shut themselves down.

Test this by using  cpp/src/tests/cluster_authentication_soak.  Run it this way:

  sudo ./cluster_authentication_soak 1

It starts 3 brokers.  If any of them shuts down, it will report an error.



2. If clients make a secure connection to a cluster, make sure that there is no low-frequency hang.  Initially perftests with 20,000 messages were observed to hang about 2% of the time.

Test this with the same program as in (1) above, but run it this way:

  sudo ./cluster_authentication_soak 500

This will run a perftest with 20,000 messages 500 times.  If any of them hang it will report an error.  ( Run this on an otherwise unloaded system.  It judges them to be hanging if they take longer than 60 seconds to complete. )
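
The 60-second hang criterion described above can be sketched in Python as follows. This is a minimal illustration only, not the code of cluster_authentication_soak: the function names run_with_hang_check and count_outcomes are hypothetical, and the command line to run is left to the caller.

```python
import subprocess
import sys

HANG_LIMIT_SECONDS = 60  # threshold quoted in the notes above


def run_with_hang_check(argv, limit=HANG_LIMIT_SECONDS):
    """Run argv once (hypothetical helper, not part of qpid).

    Returns "ok" on exit status 0, "failed" on a non-zero exit status,
    and "hung" if the process had to be killed after `limit` seconds.
    """
    try:
        # subprocess.run kills the child itself when the timeout expires.
        result = subprocess.run(argv, timeout=limit)
    except subprocess.TimeoutExpired:
        return "hung"
    return "ok" if result.returncode == 0 else "failed"


def count_outcomes(argv, trials, limit=HANG_LIMIT_SECONDS):
    """Run argv `trials` times and tally the outcomes."""
    counts = {"ok": 0, "failed": 0, "hung": 0}
    for _ in range(trials):
        counts[run_with_hang_check(argv, limit)] += 1
    return counts


if __name__ == "__main__":
    # Placeholder command; substitute the perftest command line under test.
    print(count_outcomes([sys.executable, "-c", "pass"], trials=3))
```

As with the soak program, the timing is only meaningful on an otherwise unloaded system.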



3. While a perftest is running on a secure connection, make sure that all brokers in the cluster have the same user ID for that connection.

To test this, alter the cluster_authentication_soak.cpp program so that its perftests send 2,000,000 messages instead of 20,000.   Then, while the first instance is running, use the command

  qpid-stat -c localhost:PORT 

and examine the output to make sure that all 3 brokers have the same user name for the perftest connection.  ( Get the PORT by doing  ps -aef | grep qpidd , and see which port is being used by any one of the brokers in your test.)

( The cluster_authentication_soak may report the first perftest as hanging, but that's expected since you increased the number of messages. )
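
The manual comparison in (3) could be partly automated along these lines. This is a rough sketch under stated assumptions: it only assumes that qpid-stat -c localhost:PORT prints the user name somewhere in its output, so it does a plain substring check rather than parsing columns, and it does not tie the name to one specific connection. The helper names qpid_stat_connections and brokers_missing_user are hypothetical.

```python
import subprocess


def qpid_stat_connections(port, host="localhost"):
    """Return the raw text printed by `qpid-stat -c host:port`."""
    completed = subprocess.run(
        ["qpid-stat", "-c", f"{host}:{port}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


def brokers_missing_user(outputs, user):
    """Given {port: qpid-stat text}, return the sorted ports whose
    output does not mention `user` at all (substring check only)."""
    return sorted(port for port, text in outputs.items() if user not in text)
```

With the three broker ports taken from ps -aef | grep qpidd, one would collect {port: qpid_stat_connections(port)} for each port and expect brokers_missing_user to return an empty list for the perftest user. The output should still be inspected by eye to confirm that the name belongs to the perftest connection.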

Comment 8 mick 2010-10-05 15:27:57 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause:  
In a cluster -- as currently implemented -- one broker will "own" a connection, and the other brokers will "shadow" it -- going through all the motions of operating the connection without actually doing anything.  Occasionally, the "shadowing" brokers would receive encrypted frames *before* they had been able to install the security codec.  ( Race. )

Consequence: 
When one of the shadowing brokers receives a frame that it cannot interpret -- because it has not yet installed the proper codec -- it experiences an error that none of the other brokers have experienced -- so it shuts itself down.

Fix: 
Extra locking code in the cluster (not in regular broker code), and a cluster callback that gets fired by broker::ConnectionHandler::Handler so that the cluster code will know when the security handshake is complete.  The broker that has performed the security handshake then multicasts a message to all other brokers.  They will not start reading frames again until they receive that message and have their codecs installed.

Result: 
No more occasional broker shutdowns, after 500-trial test.  Previous frequency was about 2%.
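
The idea behind the fix can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not the qpid-cpp implementation: the class ShadowConnection and the XOR "codec" are hypothetical stand-ins, and buffering frames here stands in for "not reading frames" in the real cluster code.

```python
import threading
from collections import deque


def make_xor_codec(key):
    """Stand-in for a security codec: XOR every byte with `key`.
    XOR is its own inverse, so the same function encodes and decodes."""
    return lambda data: bytes(b ^ key for b in data)


class ShadowConnection:
    """Toy model of a shadowing broker's view of one connection."""

    def __init__(self):
        self._lock = threading.Lock()
        self._codec = None        # installed only after the handshake
        self._pending = deque()   # frames held back until then
        self.delivered = []       # decoded frames, in arrival order

    def on_frame(self, frame):
        """Called for every frame that arrives on the connection."""
        with self._lock:
            if self._codec is None:
                # Handshake not finished: hold the frame, do not interpret it.
                self._pending.append(frame)
                return
            self.delivered.append(self._codec(frame))

    def on_handshake_complete(self, codec):
        """Called when the owning broker announces that the security
        handshake is done; installs the codec and drains held frames."""
        with self._lock:
            self._codec = codec
            while self._pending:
                self.delivered.append(codec(self._pending.popleft()))
```

An encrypted frame that arrives early is held rather than interpreted, and is decoded in order once the handshake notification installs the codec.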

Comment 9 Jaromir Hradilek 2010-10-05 21:43:06 UTC

    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,11 +1 @@
-Cause:  
+Previously, it was possible for a broker in a cluster to receive an encrypted frame before an appropriate codec was installed. Consequent to this, being unable to interpret such frame, the broker terminated itself in response. This error has been fixed, and an additional locking mechanism has been introduced to the cluster, ensuring the security handshake is completed before processing encrypted frames.
-In a cluster -- as currently implemented -- one broker will "own" a connection, and the other brokers will "shadow" it -- going through all the motions of operating the connection without actually doing anything.  Occasionally, the "shadowing" brokers would receive encrypted frames *before* they have been able to install the security codec.  ( Race. )
-
-Consequence: 
-When one of the shadowing brokers receives a frame that it cannot interpret -- because it has not yet installed the proper codec -- it experiences an error that none of the other brokers have experienced -- so it shuts itself down.
-
-Fix: 
-Extra locking code in the cluster (not in regular broker code), and a cluster callback that gets fired by broker::ConnectionHandler::Handler so that the cluster code will know when the security handshake is complete.  The broker that has performed the secret handshake then multicasts a message to all other brokers.  They will not start reading frames again until they receive that message and have their codecs installed.
-
-Result: 
-No more occasional broker shutdowns, after 500-trial test.  Previous frequence was about 2%.

Comment 10 Lubos Trilety 2010-10-06 11:58:19 UTC

I ran ./cluster_authentication_soak 1 100 times in a loop.
Many of the runs fail with:

qpid-perftest pid 30539 hanging: killed.
qpid-perftest 0 failed.

Sometimes I even get this from cluster:

2010-10-06 07:30:37 critical cluster(10.16.66.66:8908 UPDATEE) catch-up connection closed prematurely 10.16.66.139:53136(10.16.66.66:8908-1 local,catchup)

Comment 12 Lubos Trilety 2010-10-07 12:39:39 UTC

New bug 640978 was created for brokers ending with: 'catch-up connection closed prematurely'

All three points from comment 6 were checked.

Tested with (version):
qpid-cpp-mrg-debuginfo-0.7.946106-17.el5
qpid-cpp-client-ssl-0.7.946106-17.el5
qpid-cpp-client-devel-docs-0.7.946106-17.el5
python-qpid-0.7.946106-14.el5
qpid-cpp-server-devel-0.7.946106-17.el5
qpid-cpp-server-xml-0.7.946106-17.el5
qpid-cpp-server-cluster-0.7.946106-17.el5
qpid-cpp-client-0.7.946106-17.el5
qpid-java-common-0.7.946106-10.el5
qpid-java-client-0.7.946106-10.el5
qpid-cpp-server-0.7.946106-17.el5
qpid-tools-0.7.946106-11.el5
qpid-cpp-server-ssl-0.7.946106-17.el5
qpid-tests-0.7.946106-1.el5
qpid-cpp-client-devel-0.7.946106-17.el5

Tested on:
RHEL5 x86_64, i386  - passed

>>> VERIFIED

Comment 14 errata-xmlrpc 2010-10-14 16:09:26 UTC

An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html