Description of problem:
We ran into this today in STG. Basically, if the /etc/qpid dir perms aren't right, then mcollective client calls will hang indefinitely. It's also quite hard to figure this out since the hang makes it so nothing is logged. This directly affects the broker. The broker is basically useless when this happens.

Version-Release number of selected component (if applicable):
qpid-cpp-client-ssl-0.12-6.el6.x86_64
qpid-cpp-client-0.12-6.el6.x86_64
mcollective-common-1.1.2-4.2.el6_0.noarch
qpid-qmf-0.12-6.el6.x86_64
mcollective-client-1.1.2-4.2.el6_0.noarch
ruby-qpid-qmf-0.12-6.el6.x86_64

How reproducible:
Very

Steps to Reproduce:
1. Create a devenv
2. Run: sudo -u libra_passenger mc-ping
3. Notice that it works correctly
4. Run: chmod o-rx /etc/qpid
5. Run: sudo -u libra_passenger mc-ping
6. Notice that this time the command hangs indefinitely

Actual results:
Command hangs forever.

Expected results:
This should be an error condition and failure message.
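The expected result above (an error instead of a hang) could be approximated with a pre-flight check that runs before the client tries to connect. The following is a minimal sketch in Python, assuming the only requirement is that the calling user can read and traverse the config directory. `check_config_dir` is a hypothetical helper for illustration and is not part of mcollective or Qpid.

```python
import os


def check_config_dir(path="/etc/qpid"):
    """Return None if the current user can read and traverse `path`,
    otherwise a short message describing the problem.

    Hypothetical helper: not an mcollective or Qpid function.
    """
    if not os.path.isdir(path):
        return f"{path}: not a directory or does not exist"
    # R_OK allows listing the directory; X_OK allows opening files inside it.
    if not os.access(path, os.R_OK | os.X_OK):
        return f"{path}: not readable/traversable by uid {os.getuid()}"
    return None
```

A caller would run `check_config_dir()` as the same user that runs mc-ping and exit with the returned message if it is not None. Note that `os.access` checks against the real uid, so the result is only meaningful when run as the affected user (for example under `sudo -u libra_passenger`).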
Can we clarify how this affects the broker? Besides the client (mc-ping) command hanging, what other symptoms are we seeing that demonstrate that the broker itself is hung? It's not clear that the broker is actually "basically useless when this happens." Can you run the mc-ping command from a different machine with the correct permissions while the original mc-ping command is hung?
OK, so from Mike: "broker" here means the OpenShift broker, not the qpidd broker. That makes more sense. So this is essentially the mcollective driver. The next step is to see whether the problem is in the Qpid layers or in the mcollective driver layer.
I've reproduced the bug. We noticed in the log that the client isn't really hanging but is attempting to reconnect indefinitely, despite reconnect-timeout being set to 5. I added reconnect-limit=5 to the connection args in the hope that it would override the alleged Ruby 1.8 timeout issue. However, it had no effect. Working with the Qpid team for more ideas.
This might be related to the alleged Ruby 1.8 vs. Ruby 1.9 timeout.rb issues. Output snippet from the client-side mcollective log:

D, [2012-05-16T18:48:17.828268 #28087] DEBUG -- : amqp.rb:69:in `connect' Connecting to localhost.localdomain:5671, {transport:ssl, reconnect:true, reconnect-timeout:5, reconnect-limit:5, heartbeat:1}

You can see that the reconnect timeout and limit are both set. However, this message is written to the log continuously and indefinitely. (I added the reconnect-limit to see if it would override the timeout. I also tried with only the limit set and the timeout removed. It had no effect.)
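For reference, the behaviour we expected from a reconnect limit can be sketched as a bounded retry loop that gives up with an error after a fixed number of failed attempts. This is a generic Python illustration only, not the Qpid client's implementation or API. `connect_with_limit` and `ReconnectLimitExceeded` are hypothetical names, and treating connection failures as `OSError` is an assumption.

```python
import time


class ReconnectLimitExceeded(Exception):
    """Raised when every connection attempt has failed (hypothetical)."""


def connect_with_limit(connect, limit=5, interval=5.0, sleep=time.sleep):
    """Call `connect()` until it succeeds or `limit` attempts have failed.

    `connect` is any callable that returns a connection or raises OSError.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    last_error = None
    for attempt in range(1, limit + 1):
        try:
            return connect()
        except OSError as exc:
            last_error = exc
            if attempt < limit:
                sleep(interval)  # wait before the next attempt
    # All attempts failed: surface an error instead of retrying forever.
    raise ReconnectLimitExceeded(
        f"giving up after {limit} attempts: {last_error}"
    ) from last_error
```

With a loop like this, an unreadable /etc/qpid would produce a failure message after `limit` attempts rather than the endless "Connecting to ..." lines seen in the log.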
I've created a BZ for MRG Messaging: https://bugzilla.redhat.com/show_bug.cgi?id=825075
A workaround for this has been found and we continue to investigate our messaging setup. (just doing bug cleanup)