Red Hat Bugzilla – Bug 1028177
oo-stats causes new threads to be created for activemq that are never reaped
Last modified: 2017-03-08 12:35 EST
Description of problem:
Running oo-stats creates two new activemq threads on the activemq machine, and they are never reaped. If oo-stats is run from a cron script every hour, the process eventually reaches the default limit of 1024 PIDs.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Run ps -eLf | grep activemq | wc -l on the activemq machine
2. Run oo-stats on the broker
3. Run ps -eLf | grep activemq | wc -l again on the activemq machine
Actual results:
There are two new threads created that are never reaped.
Expected results:
Two new threads are created and then destroyed after oo-stats is complete.
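A note on the measurement: `ps -eLf | grep activemq | wc -l` can also count the `grep` process itself (and, when run over ssh, the remote shell whose command line contains the pattern), so the absolute number may be slightly high even though the growth between runs is real. A minimal sketch of a filter that avoids this, assuming the ActiveMQ process command line contains the string `activemq`; the function name is made up for illustration:

```shell
#!/bin/sh
# Count lines of `ps -eLf` output (one line per thread) that mention activemq.
# The [a] bracket expression keeps the pattern from matching the command line
# of the grep process (or of a wrapping shell) that contains the pattern itself.
# grep -c prints 0 and exits non-zero when nothing matches, hence the `|| true`.
count_activemq_threads() {
    grep -c '[a]ctivemq' || true
}
```

Usage on the activemq machine would be `ps -eLf | count_activemq_threads`.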
I tested this on a 9-VM OpenShift instance with a set of 3 activemq machines. I ran the following for loop:
for i in $(seq 1 100); do oo-stats > /dev/null; date >> /tmp/testing; echo $(ssh email@example.com 'ps -eLf | grep activemq | wc -l') >> /tmp/testing; echo "" >> /tmp/testing; done
And got the following (condensed) results:
Thu Nov 7 17:51:01 EST 2013
Thu Nov 7 17:51:16 EST 2013
Thu Nov 7 17:51:31 EST 2013
Thu Nov 7 17:51:46 EST 2013
Thu Nov 7 17:52:00 EST 2013
Thu Nov 7 17:52:15 EST 2013
Thu Nov 7 17:52:30 EST 2013
Thu Nov 7 17:52:44 EST 2013
Just getting up to speed here... a couple questions:
1. Have you observed whether this happens on a network of activemq brokers only, or if it also occurs with a single activemq?
2. Any idea how activemq.xml looked before the workaround, or what procedure was followed to create it?
Our product docs, sample activemq-network.xml file, and openshift.sh now all create policies to expire queues, but I'm wondering if these get named such that they're not covered by the policy.
> 1. Have you observed whether this happens on a network of activemq brokers only, or if it also occurs with a single activemq?
The customer has a network of activemq brokers but I have tested this on both an all-in-one system with one activemq instance and a system with a network (3) of activemq instances. The symptom is exhibited both times.
> 2. Any idea how activemq.xml looked before the workaround, or what procedure was followed to create it?
Unfortunately I do not have the customer's activemq.xml from before he implemented the workaround, as he did it very quickly. However, the installation script (for 2.0 and 1.2) creates an activemq.xml that exhibits the issue. I have a couple of systems up now that exhibit the issue; feel free to ping me and I will pass on the credentials/IPs.
What we were missing is the broker schedulePeriodForDestinationPurge attribute. Without it, evidently nothing checks for the timeout to remove inactive queues.
ActiveMQ docs give us the details: http://activemq.apache.org/delete-inactive-destinations.html
MCollective docs don't mention that particular detail: http://docs.puppetlabs.com/mcollective/deploy/middleware/activemq.html#reply-queue-pruning although it is included in the example file.
I'll add it to our script, our example files, and check docs to see if any changes are needed (I don't think so).
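For anyone checking an existing install: per the ActiveMQ page linked above, purging needs settings in two places, `schedulePeriodForDestinationPurge` on the `<broker>` element, plus `gcInactiveDestinations` and `inactiveTimoutBeforeGC` (ActiveMQ's own spelling) on a `<policyEntry>`. A rough sketch of a check follows. The function name is hypothetical, and the fragment in the comment uses illustrative values rather than ones copied from our installer:

```shell
#!/bin/sh
# Report which attributes needed for inactive-queue purging are missing
# from an activemq.xml. Hypothetical helper, not part of OpenShift.
#
# A fragment that would pass (values illustrative; both timeouts are in ms):
#   <broker ... schedulePeriodForDestinationPurge="60000">
#     <policyEntry queue="*.reply.>" gcInactiveDestinations="true"
#                  inactiveTimoutBeforeGC="300000" />
check_queue_gc_config() {
    file=$1
    missing=0
    for attr in schedulePeriodForDestinationPurge \
                gcInactiveDestinations \
                inactiveTimoutBeforeGC; do
        # Plain text match on the attribute name; this does not parse the XML.
        if ! grep -q "$attr=" "$file"; then
            echo "missing: $attr"
            missing=1
        fi
    done
    return "$missing"
}
```

Usage would be something like `check_queue_gc_config /etc/activemq/activemq.xml` (path assumed; adjust for your install). It prints nothing and returns 0 when all three attributes are present. Because it only matches attribute names as text, it would not notice an attribute that sits inside an XML comment.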
This is nothing specific to oo-stats, BTW; oo-accept-systems or even oo-mco ping create a new reply queue, with name based on the broker host and process. The reason we don't see this balloon out of control normally is that typically the processes making MCollective requests are the OpenShift broker ones, which don't change much, so the same queues get reused a lot.
The specific commit is https://github.com/openshift/openshift-extras/commit/38d8055a9abe78fadb6f75777cdcb50827510f9c
It does seem to take a little longer than the configured 300 secs (5 minutes) to prune the inactive queues. But they do go away; you could set the limits lower to test.
Similar improvement made for 1.2 scripts/examples.
Verified this bug on 2.0.z/2013-12-23.1 with the new installation script.
Set inactiveTimoutBeforeGC="30000" and schedulePeriodForDestinationPurge="10000" in activemq.xml to test this issue more easily.
[root@broker ~]# date;ps -eLf | grep activemq | wc -l ; oo-stats >> /dev/null ; date ; ps -eLf | grep activemq | wc -l
Tue Dec 24 04:06:41 EST 2013
Tue Dec 24 04:06:50 EST 2013
[root@broker ~]# date;ps -eLf | grep activemq | wc -l
Tue Dec 24 04:07:33 EST 2013
Monitoring the activemq log shows the inactive reply queue being removed:
2013-12-24 04:07:29,510 | INFO | mcollective.reply.broker.ose20-1216-com.cn_16374 Inactive for longer than 30000 ms - removing ... | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[activemq.ose20-1216-com.cn] Scheduler
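To confirm purging without reading the log by eye, a small filter can pull out the names of the removed queues. This is only a sketch and assumes the pipe-separated log line format shown above; the function name is made up for illustration:

```shell
#!/bin/sh
# Print the names of queues that ActiveMQ reports as removed for inactivity.
# Reads activemq log lines on stdin. Fields are assumed to be separated by
# " | ", with the message in the third field and the queue name as its first word.
removed_queues() {
    awk -F' *[|] *' '
        /Inactive for longer than .* - removing/ {
            split($3, words, " ")
            print words[1]
        }'
}
```

Usage would be `removed_queues < activemq.log`, with the log path depending on the install.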
Sanitizing this bug of customer information so it can be public.