Bug 1028177
| Summary: | oo-stats causes new threads to be created for activemq that are never reaped | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Timothy Williams <tiwillia> |
| Component: | Installer | Assignee: | Luke Meyer <lmeyer> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.0.0 | CC: | gpei, libra-bugs |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-01-25 11:47:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Timothy Williams
2013-11-07 20:34:56 UTC
Just getting up to speed here... a couple of questions:

1. Have you observed whether this happens on a network of ActiveMQ brokers only, or if it also occurs with a single ActiveMQ instance?
2. Any idea how activemq.xml looked before the workaround, or what procedure was followed to create it? Our product docs, sample activemq-network.xml file, and openshift.sh now all create policies to expire queues, but I'm wondering if these queues get named such that they're not covered by the policy.

---

Hi Luke,

> 1. Have you observed whether this happens on a network of activemq brokers only, or if it also occurs with a single activemq?

The customer has a network of ActiveMQ brokers, but I have tested this on both an all-in-one system with one ActiveMQ instance and a system with a network of three ActiveMQ instances. The symptom appears in both cases.

> 2. Any idea how activemq.xml looked before the workaround, or what procedure was followed to create it?

Unfortunately, I do not have the customer's activemq.xml from before he implemented the workaround, as he did it very quickly. However, the installation script (for both 2.0 and 1.2) creates an activemq.xml that exhibits the issue. I have a couple of systems up now that are exhibiting the issue; feel free to ping me and I will pass on the credentials/IPs.

- Tim

---

What we were missing is the broker schedulePeriodForDestinationPurge attribute. Without it, evidently nothing checks the timeout to remove inactive queues. The ActiveMQ docs give the details: http://activemq.apache.org/delete-inactive-destinations.html The MCollective docs don't mention that particular detail: http://docs.puppetlabs.com/mcollective/deploy/middleware/activemq.html#reply-queue-pruning although it is included in the example file. I'll add it to our script and our example files, and check the docs to see whether any changes are needed (I don't think so).
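For context, a minimal sketch of what the fix looks like in activemq.xml, assuming a standard ActiveMQ 5.x broker; the broker name, intervals, and queue pattern here are illustrative, not the exact values shipped in the OpenShift sample files:

```xml
<!-- Hypothetical fragment of /etc/activemq/activemq.xml -->
<broker xmlns="http://activemq.apache.org/schema/core"
        brokerName="activemq.example.com"
        schedulePeriodForDestinationPurge="60000"> <!-- run the purge task every 60 s -->
  <destinationPolicy>
    <policyMap>
      <policyEntries>
        <!-- Garbage-collect MCollective reply queues once they have been idle
             for 5 minutes. Note: ActiveMQ really does spell this attribute
             "inactiveTimoutBeforeGC" (no second "e"). -->
        <policyEntry queue="*.reply.>" gcInactiveDestinations="true"
                     inactiveTimoutBeforeGC="300000"/>
      </policyEntries>
    </policyMap>
  </destinationPolicy>
</broker>
```

Without schedulePeriodForDestinationPurge on the <broker> element, the per-queue timeout is configured but the purge task never runs, which matches the symptom described in this bug.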
This is nothing specific to oo-stats, by the way; oo-accept-systems or even `oo-mco ping` creates a new reply queue, with a name based on the broker host and process. The reason we don't normally see this balloon out of control is that the processes making MCollective requests are typically the OpenShift broker ones, which don't change much, so the same queues get reused a lot.

Fix: https://github.com/openshift/openshift-extras/pull/257 — the specific commit is https://github.com/openshift/openshift-extras/commit/38d8055a9abe78fadb6f75777cdcb50827510f9c

It does seem to take a little longer than the configured 300 seconds (5 minutes) to prune the inactive queues, but they do go away; you could set the limits lower to test. A similar improvement was made for the 1.2 scripts/examples.

---

Verified this bug on 2.0.z/2013-12-23.1 with the new installation script. Set inactiveTimoutBeforeGC="30000" and schedulePeriodForDestinationPurge="10000" to test this issue more easily.

```
[root@broker ~]# date; ps -eLf | grep activemq | wc -l; oo-stats >> /dev/null; date; ps -eLf | grep activemq | wc -l
Tue Dec 24 04:06:41 EST 2013
56
Tue Dec 24 04:06:50 EST 2013
58
[root@broker ~]# date; ps -eLf | grep activemq | wc -l
Tue Dec 24 04:07:33 EST 2013
56
```

Monitoring the ActiveMQ log with `tailf /var/log/activemq/activemq.log`:

```
2013-12-24 04:07:29,510 | INFO | mcollective.reply.broker.ose20-1216-com.cn_16374 Inactive for longer than 30000 ms - removing ... | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[activemq.ose20-1216-com.cn] Scheduler
```

Sanitizing this bug of customer information so it can be public.
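One small caveat about the verification transcript: `ps -eLf | grep activemq | wc -l` also counts the grep process itself. A common refinement (not part of the original report) is to bracket the first character of the pattern so grep's own command line no longer matches; sketched here with a `sleep` process standing in for ActiveMQ:

```shell
# Count the threads of a target process without matching the grep command
# itself: the pattern '[s]leep 300' matches "sleep 300" in ps output, but
# grep's own command line contains the literal "[s]leep 300", so it is
# excluded from the count.
sleep 300 &                            # stand-in for the activemq process
pid=$!
n=$(ps -eLf | grep -c '[s]leep 300')   # threads of the target, grep excluded
kill "$pid"
echo "matched $n thread(s)"
```

On a real broker the equivalent would be `ps -eLf | grep -c '[a]ctivemq'`, which removes the off-by-one from the counts shown above.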