Description of problem: Three aspects:

1. Cluster members create insecure Qpid connections to new members in order to provide a state update. Some users may want to secure these connections to prevent "impostors" from joining a cluster.
2. Clients may want to make secure connections to the cluster. We need to verify that this works; there may be some issues to sort out.
3. openAIS provides security features, including encryption, to secure the cluster communication. We need to document how to use these features with Qpid.
The cluster currently only advertises TCP connections for failover. To advertise other protocols (ssl, InfiniBand, etc.) we need to:
- define a URL format for each protocol
- possibly update the code that generates the default URL.
It is not yet decided whether other protocols should be included in the default URL or only advertised when the user specifies them manually.
The following has been implemented but is not properly tested:
- SASL ID is set on update connections when a new member joins.
- Authentication data for update connections is specified by --cluster-username, --cluster-password, --cluster-mechanism.

These tests are missing:
- ACL test: take the ACL file relative to the working directory if no-data-dir is set, with auth=no.
- Update auth test: enable security and verify that updates use the correct mechanism/user/password.

These features remain to be done:
- Enable SSL on update connections.
- Include SSL in failover URLs (depends on bug 471632).
- Documentation pointers to openAIS encryption configuration.
The ACL test is written and passing in cluster_test.cpp, but it exposes a memory leak in the SASL client code. The test is disabled until the leak is fixed.
FYI: I have posted a feature request upstream for allowing secure/auth connections from a client. See https://issues.apache.org/jira/browse/QPID-2187
Fixed in svn rev 944158.

Reproduction Notes
==========================================
There are three separate problems to check.

1. When a cluster is started with authentication enabled and a client requests a secure connection, make sure that no broker in the cluster shuts itself down. Test this using cpp/src/tests/cluster_authentication_soak. Run it this way:

    sudo ./cluster_authentication_soak 1

It starts 3 brokers. If any of them shuts down, it reports an error.

2. If clients make a secure connection to a cluster, make sure that there is no low-frequency hang. Initially, perftests with 20,000 messages were observed to hang about 2% of the time. Test this with the same program as in (1) above, but run it this way:

    sudo ./cluster_authentication_soak 500

This runs a perftest with 20,000 messages 500 times. If any of them hangs, it reports an error. (Run this on an otherwise unloaded system; a perftest is judged to be hanging if it takes longer than 60 seconds to complete.)

3. While a perftest is running on a secure connection, make sure that all brokers in the cluster have the same user ID for that connection. To test this, alter the cluster_authentication_soak.cpp program so that its perftests send 2,000,000 messages instead of 20,000. Then, while the first instance is running, use the command

    qpid-stat -c localhost:PORT

and examine the output to make sure that all 3 brokers show the same user name for the perftest connection. (Get the PORT by running "ps -aef | grep qpidd" and seeing which port is used by any one of the brokers in your test.) (cluster_authentication_soak may report the first perftest as hanging, but that is expected since you increased the number of messages.)
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: In a cluster, as currently implemented, one broker "owns" a connection and the other brokers "shadow" it, going through all the motions of operating the connection without actually doing anything. Occasionally, the shadowing brokers would receive encrypted frames *before* they had been able to install the security codec. (A race.)

Consequence: When one of the shadowing brokers receives a frame that it cannot interpret, because it has not yet installed the proper codec, it experiences an error that none of the other brokers have experienced, so it shuts itself down.

Fix: Extra locking code in the cluster (not in regular broker code), and a cluster callback fired by broker::ConnectionHandler::Handler so that the cluster code knows when the security handshake is complete. The broker that has performed the security handshake then multicasts a message to all other brokers; they will not start reading frames again until they have received that message and installed their codecs.

Result: No more occasional broker shutdowns over a 500-trial test. The previous failure frequency was about 2%.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents: the detailed Cause/Consequence/Fix/Result note above was replaced with the following summary:

Previously, it was possible for a broker in a cluster to receive an encrypted frame before an appropriate codec was installed. Being unable to interpret such a frame, the broker terminated itself in response. This error has been fixed, and an additional locking mechanism has been introduced to the cluster, ensuring the security handshake is completed before encrypted frames are processed.
I tried running ./cluster_authentication_soak 1 in a loop 100 times. Many of the runs fail with:

    qpid-perftest pid 30539 hanging: killed. qpid-perftest 0 failed.

Sometimes I even get this from the cluster:

    2010-10-06 07:30:37 critical cluster(10.16.66.66:8908 UPDATEE) catch-up connection closed prematurely 10.16.66.139:53136(10.16.66.66:8908-1 local,catchup)
A new bug, 640978, was created for brokers ending with 'catch-up connection closed prematurely'. All three points from comment 6 were checked.

Tested with (versions):
qpid-cpp-mrg-debuginfo-0.7.946106-17.el5
qpid-cpp-client-ssl-0.7.946106-17.el5
qpid-cpp-client-devel-docs-0.7.946106-17.el5
python-qpid-0.7.946106-14.el5
qpid-cpp-server-devel-0.7.946106-17.el5
qpid-cpp-server-xml-0.7.946106-17.el5
qpid-cpp-server-cluster-0.7.946106-17.el5
qpid-cpp-client-0.7.946106-17.el5
qpid-java-common-0.7.946106-10.el5
qpid-java-client-0.7.946106-10.el5
qpid-cpp-server-0.7.946106-17.el5
qpid-tools-0.7.946106-11.el5
qpid-cpp-server-ssl-0.7.946106-17.el5
qpid-tests-0.7.946106-1.el5
qpid-cpp-client-devel-0.7.946106-17.el5

Tested on: RHEL5 x86_64, i386 - passed

>>> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0773.html