Description of problem:
When running broker / mcollective based scripts, these scripts sometimes hang for no apparent reason. I've left them alone for over an hour to see whether they would eventually succeed or time out, but neither ever seems to happen. The scripts seem to hang more often the more load there is on the broker / mcollective. I've gone through the mcollective logs with Dan McPherson when hangs occur, and nothing in the logs seems relevant to the hangs.

Examples of scripts that hang periodically:
* rhc-admin-move hangs a lot if you run more than 2 or 3 at a time.
* migrate-X.X.X will hang if MAX_THREADS is > 1

We're not 100% sure of the cause of this, but we highly suspect that the hangs are happening in mcollective / qpid. The reason we believe this is that we have 2 nagios checks based solely on mcollective commands (mc-ping and mc-facts). We've noticed that from time to time these calls will hang and we'll have to manually kill the processes (by the time we notice, there are usually around 5 hung processes).

Version-Release number of selected component (if applicable):
rhc-broker-0.85.33-1.el6_2.noarch
mcollective-1.1.2-4.2.el6_0.noarch
mcollective-client-1.1.2-4.2.el6_0.noarch
mcollective-common-1.1.2-4.2.el6_0.noarch
qpid-qmf-0.12-6.el6.x86_64
ruby-qpid-qmf-0.12-6.el6.x86_64
qpid-cpp-client-ssl-0.12-6.el6.x86_64
qpid-cpp-client-0.12-6.el6.x86_64

How reproducible:
We can repro this in PROD pretty often, depending on what we're doing. We've had several migrator runs hang, as well as several rhc-move runs. Unfortunately it's sporadic enough that we can't really reproduce it at will.

Steps to Reproduce:
1. Note: I have no idea if these steps will work, I'm just assuming they will
2. Have a lot of nodes connected through mcollective (like > 20)
3. Run a lot of mc-facts calls (like 10 a minute or so)
4. Make sure the mc-facts call is complex (mc-ping fails less often than mc-facts)
5. Wait for it to hang.
Actual results:
Sporadic hangs on broker / mcollective commands

Expected results:
No hangs

Additional info:
I don't know how to recreate this in dev. But wherever this happens we need to:
1. require 'thread-dump' in the source (note you need the latest broker to do this)
2. when it hangs, kill -3 the process

That should let us know where it is hanging.
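For anyone unfamiliar with the technique, here is a rough sketch of what a dump on kill -3 looks like. It is not the thread-dump library mentioned above, and it assumes Ruby 1.9 or later, where Thread#backtrace is available. The helper name dump_threads is made up for illustration.

```ruby
# Hypothetical helper (not part of thread-dump): write one backtrace
# per live thread to +io+. Assumes Ruby >= 1.9 for Thread#backtrace.
def dump_threads(io)
  Thread.list.each do |thread|
    io.puts "Thread #{thread.object_id} status=#{thread.status.inspect}"
    # backtrace is nil for a thread that has already finished
    (thread.backtrace || []).each { |frame| io.puts "    #{frame}" }
  end
end

# SIGQUIT is the signal that `kill -3 <pid>` delivers.
Signal.trap('QUIT') { dump_threads($stderr) }
```

With this loaded in the hung script, kill -3 on its pid should print where each thread is blocked to stderr without stopping the process.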
*** Bug 783175 has been marked as a duplicate of this bug. ***
Created attachment 564016 [details]
Stack trace / thread-dump from a hung mc-ping process

I was able to re-create the mc-ping hang with an mc-ping that had thread-dump enabled. Attached is the output from mc-ping and the thread-dump.
Just to update the bug: we've tracked this down to a bug in Ruby 1.8 and its "green threads". This basically makes timeout.rb unreliable. See an explanation here: http://ph7spot.com/musings/system-timer

We're looking at fixing this issue in 2 ways:
1) Upgrade to Ruby 1.9 on the brokers (and devenv and whatnot).
2) See if we can get SystemTimer working with Ruby 1.8 (from the link above).
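Separately from either fix, the hung nagios checks could be guarded from outside the hanging process, so that nothing depends on timeout.rb inside it. A minimal sketch, assuming the wrapper itself runs on Ruby 2.1 or later (for Process.spawn and Process.clock_gettime). The helper name run_with_deadline is made up for illustration.

```ruby
# Hypothetical helper: run the command in +argv+ as a child process and
# kill it if it is still running after +limit+ seconds.
# Returns the Process::Status, or nil if the child had to be killed.
def run_with_deadline(argv, limit)
  pid = Process.spawn(*argv)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + limit

  loop do
    # WNOHANG makes wait2 return nil right away while the child is running.
    _, status = Process.wait2(pid, Process::WNOHANG)
    return status if status
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep 0.1
  end

  Process.kill('KILL', pid) # the child cannot trap or ignore SIGKILL
  Process.wait(pid)         # reap it so no zombie is left behind
  nil
end
```

For example, run_with_deadline(['mc-ping'], 30) would return nil if mc-ping was still running after 30 seconds, and the check could report that as a failure instead of leaving a hung process behind.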
Wasn't this fixed a long time ago?
The broker was updated to ruby-1.9 a long time ago, so this bug should already have been fixed. It didn't reproduce on devenv, so moving this bug to verified.