788302 – Broker / mcollective commands sometimes hang

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OKD
Classification:	Red Hat
Component:	Containers
Sub Component:
Version:	2.x
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Dan McPherson
QA Contact:	libra bugs
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	783175
Depends On:
Blocks:

Reported:	2012-02-08 01:22 UTC by Thomas Wiest
Modified:	2015-05-14 22:51 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-02-14 00:59:33 UTC
Target Upstream Version:
Embargoed:

Attachments
Stack trace / thread-dump from a hung mc-ping process (4.51 KB, text/plain) 2012-02-17 23:50 UTC, Thomas Wiest

Description Thomas Wiest 2012-02-08 01:22:58 UTC

Description of problem:
Broker / mcollective based scripts sometimes hang for no apparent reason. I've left hung scripts alone for over an hour to see whether they would eventually succeed or time out, but neither ever seems to happen.

The scripts seem to hang more often as the load on the broker / mcollective increases.

I've gone through the mcollective logs with Dan McPherson when hangs occur, and nothing in the logs seems relevant to the hangs.

Examples of scripts that hang periodically:
* rhc-admin-move hangs a lot, if you run more than 2 or 3 at a time.
* migrate-X.X.X will hang if MAX_THREADS is > 1

We're not 100% sure of the cause of this, but we highly suspect that the hangs are happening in mcollective / qpid.

The reason we believe this is that we have 2 nagios checks based solely on mcollective commands (mc-ping and mc-facts). We've noticed that from time to time these calls hang and we have to kill the processes manually (by the time we notice, there are usually around 5 hung processes).

Version-Release number of selected component (if applicable):
rhc-broker-0.85.33-1.el6_2.noarch
mcollective-1.1.2-4.2.el6_0.noarch
mcollective-client-1.1.2-4.2.el6_0.noarch
mcollective-common-1.1.2-4.2.el6_0.noarch
qpid-qmf-0.12-6.el6.x86_64
ruby-qpid-qmf-0.12-6.el6.x86_64
qpid-cpp-client-ssl-0.12-6.el6.x86_64
qpid-cpp-client-0.12-6.el6.x86_64

How reproducible:
We can repro this in PROD pretty often, depending on what we're doing. We've had several migrator runs hang, as well as several rhc-move runs.

Unfortunately it's sporadic enough that we can't really reproduce it at will.

Steps to Reproduce:
1. Note: I have no idea if these steps will work, I'm just assuming they will
2. Have a lot of nodes connected through mcollective (say, more than 20)
3. Run a lot of mc-facts calls (say, around 10 a minute)
4. Make sure the mc-facts call is complex (mc-ping fails less often than mc-facts)
5. Wait for it to hang.

Actual results:
Sporadic hangs on broker / mcollective commands

Expected results:
No hangs

Additional info:

Comment 1 Dan McPherson 2012-02-10 15:22:41 UTC

I don't know how to recreate this in dev. But wherever this happens, we need to:

require 'thread-dump'
in the source  (note you need the latest broker to do this)

when it hangs
kill -3 the process.

That should let us know where it is hanging.
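
For illustration only, here is roughly what such a dump looks like on Ruby 1.9 or later, where Thread#backtrace is built in. This is a sketch, not the thread-dump gem and not what the broker does; format_thread_dump is a made-up helper name.

```ruby
# Hypothetical helper (not part of the broker or the thread-dump gem):
# builds a text dump of every live thread and its backtrace.
def format_thread_dump
  Thread.list.map do |t|
    frames = t.backtrace || []   # backtrace is nil for a dead thread
    "Thread #{t.object_id} status=#{t.status.inspect}\n" +
      frames.map { |f| "    #{f}" }.join("\n")
  end.join("\n\n")
end

# kill -3 <pid> sends SIGQUIT; print the dump to stderr when it arrives.
Signal.trap("QUIT") { $stderr.puts format_thread_dump }
```

With that handler installed, kill -3 on the hung process should print one block per thread, and the innermost frame of each block shows where that thread is stuck.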

Comment 2 Rob Millner 2012-02-16 19:36:32 UTC

*** Bug 783175 has been marked as a duplicate of this bug. ***

Comment 3 Thomas Wiest 2012-02-17 23:50:45 UTC

Created attachment 564016
Stack trace / thread-dump from a hung mc-ping process

I was able to re-create the mc-ping hang with an mc-ping that had thread-dump enabled.

Attached is the output from mc-ping and the thread-dump.

Comment 4 Thomas Wiest 2012-03-23 18:05:20 UTC

Just to update the bug, we've tracked this down to a bug in Ruby 1.8 and its "green threads". This basically makes timeout.rb unreliable.

See an explanation here:
http://ph7spot.com/musings/system-timer

We're looking at fixing this issue in 2 ways:
1) Upgrade to Ruby 1.9 on the brokers (and devenv and what not).
2) See if we can get SystemTimer working with Ruby 1.8 (from the link above).
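
Separately from those two fixes, a hard deadline can be enforced from outside the hung call: run the command as a child process and kill it once the deadline passes, so nothing depends on timeout.rb interrupting a blocked thread. A minimal sketch, assuming the mc-* command can be run as a separate process and a Ruby with Process.spawn (1.9 or later; not tested on 1.8). run_with_deadline and its parameters are hypothetical names, not anything in the broker.

```ruby
# Hypothetical watchdog helper: runs cmd (an array of program + args) as a
# child process and kills it if it has not exited within `seconds`.
# Returns the Process::Status on normal completion, or nil if the child
# had to be killed.
def run_with_deadline(cmd, seconds, poll = 0.1)
  pid = Process.spawn(*cmd)
  deadline = Time.now + seconds
  loop do
    # WNOHANG makes wait2 return nil instead of blocking while the child runs.
    done = Process.wait2(pid, Process::WNOHANG)
    return done[1] if done
    break if Time.now >= deadline
    sleep poll
  end
  Process.kill("KILL", pid)   # deadline passed: hard-kill the hung child
  Process.wait(pid)           # reap it so no zombie is left behind
  nil
end
```

A nagios check could then call something like run_with_deadline(["mc-ping"], 60) and treat a nil result as a hang; the 60 second value is only an example.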

Comment 5 Dan McPherson 2013-01-29 00:43:00 UTC

Wasn't this fixed a long time ago?

Comment 6 Jianwei Hou 2013-01-29 02:09:21 UTC

Broker was updated to ruby-1.9 a long time ago,
so this bug should have been fixed for a long time.
It didn't reproduce on devenv, so I'm moving this bug to verified.
