Bug 923400

Summary:	Sigar creates high number of blocked threads (unbounded) if mount is gone
Product:	[Other] RHQ Project	Reporter:	Elias Ross <genman>
Component:	Plugins	Assignee:	Thomas Segismont <tsegismo>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Mike Foley <mfoley>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	4.5	CC:	hrupp
Target Milestone:	---
Target Release:	RHQ 4.9
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2014-03-26 08:32:01 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Elias Ross 2013-03-19 18:33:25 UTC

Description of problem:

RHQ agent has now about 14,000 'active threads'. Many of them are blocked as so:

"ResourceContainer.invoker.daemon-4352" daemon prio=10 tid=0x00002aaab137d000 nid=0x275d waiting for monitor entry [0x000000004a454000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.rhq.core.system.SigarAccess$SigarAccessHandler.invoke(SigarAccess.java:100)
        - waiting to lock <0x00000000f0b07420> (a org.rhq.core.system.SigarAccess$SigarAccessHandler)
        at $Proxy38.getCpuInfoList(Unknown Source)
        at org.rhq.core.system.CpuInformation.refresh(CpuInformation.java:79)
        at org.rhq.plugins.platform.CpuComponent.getAvailability(CpuComponent.java:74)
        at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocationThread.call(ResourceContainer.java:634)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

Threads grew over the past few days.

One thread is blocked getting disk info:

"ResourceContainer.invoker.daemon-4226" daemon prio=10 tid=0x00002aaab1bc0800 nid=0x1c65 runnable [0x0000000045505000]
   java.lang.Thread.State: RUNNABLE
        at org.hyperic.sigar.FileSystemUsage.gather(Native Method)
^^^ This does not complete
        at org.hyperic.sigar.FileSystemUsage.fetch(FileSystemUsage.java:30)
        at org.hyperic.sigar.Sigar.getMountedFileSystemUsage(Sigar.java:712)
        at sun.reflect.GeneratedMethodAccessor50.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.rhq.core.system.SigarAccess$SigarAccessHandler.invoke(SigarAccess.java:104)
        - locked <0x00000000f0b07420> (a org.rhq.core.system.SigarAccess$SigarAccessHandler)
        at $Proxy38.getMountedFileSystemUsage(Unknown Source)
        at org.rhq.core.system.FileSystemInfo.refresh(FileSystemInfo.java:64)
        at org.rhq.core.system.FileSystemInfo.<init>(FileSystemInfo.java:47)
        at org.rhq.core.system.NativeSystemInfo.getFileSystem(NativeSystemInfo.java:339)
        at org.rhq.plugins.platform.FileSystemComponent.getFileSystemInfo(FileSystemComponent.java:95)
        at org.rhq.plugins.platform.FileSystemComponent.getValues(FileSystemComponent.java:71)
        at sun.reflect.GeneratedMethodAccessor51.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocationThread.call(ResourceContainer.java:634)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)


Version-Release number of selected component (if applicable): RHQ 4.5.1


How reproducible:

It appears if there is a bad disk or other issue, then Sigar is blocked. At least the ResourceContainer.invoker.daemon-X threads should be bound to prevent the agent from adding to the misery.

Comment 1 Heiko W. Rupp 2013-06-29 15:47:37 UTC

Thomas, could you please have a look at this?

Comment 2 Thomas Segismont 2013-07-01 13:30:10 UTC

Can you confirm the root cause of the never ending FileSystemUsage#gather call?

What do you mean by "mount is gone"?

Comment 3 Elias Ross 2013-07-01 16:36:01 UTC

Basically the filesystem failed on the machine, e.g. ls /mntX would hang on this volume.

I'm guessing something similar might happen with an NFS mount that is hanging as well.

Comment 4 Thomas Segismont 2013-07-03 12:12:54 UTC

From rhq-devel ML:

Hi,

I've worked on a fix for Bug 923400 - Sigar creates high number of blocked threads (unbounded) if mount is gone.

The problem is that we use a shared instance of Sigar and threads may be blocked when trying to use it if a call lasts too long or never returns.

The fix is in a bug branch:
https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?h=bug/923400

It consists in a behavior change of SigarAccessHandler. Let me paste here the new Javadoc of the class:

****
An InvocationHandler for a org.hyperic.sigar.SigarProxy.

A single instance of this class will be created by the SigarAccess class.

This class holds a shared Sigar instance and serializes calls. If a thread waits more than 'sharedSigarLockMaxWait' seconds, it will be given a new Sigar instance, which will be destroyed at the end of the call.

Every 5 minutes, a background task checks that 'sigarInstancesThreshold' has not been exceeded. It it has, a warning message will be logged, optionally with a thread dump.
This class is configurable with System properties:
* sharedSigarLockMaxWait: maximum time a thread will wait for the shared Sigar lock acquistion; defaults to 5
* sigarInstancesThreshold: threshold of currently living Sigar instances at which the background task will print warning messages; defaults to 50
* threadDumpOnSigarInstancesThreshold: if set to true (case insensitive), the background task will also log a thread dump when sigarInstancesThreshold is met
****

This change will not prevent problems like the one Elias reported (a call to #getMountedFileSystemUsage never returning because of a bad FS mount). But it will put the agent in some sort of degraded mode where other calls to Sigar will succeed and warnings will be logged. It should even be possible to fire an alert on this log event if the agent is inventoried.

What's your opinion?

Thanks and regards,
Thomas

Comment 5 Thomas Segismont 2013-07-03 12:13:19 UTC

From rhq-devel ML:

One additional change may be to set a limit for the number of on demand Sigar instances and reject calls when this limit is reached.

Comment 6 Thomas Segismont 2013-07-15 18:10:15 UTC

Fixed in master

commit fe38a28bd0ba7df967ec8b6a7f5f2b4a6bb839d6
Author: Thomas Segismont <tsegismo>
Date:   Wed Jul 3 12:27:55 2013 +0200

Comment 7 Heiko W. Rupp 2014-03-26 08:32:01 UTC

Bulk closing now that 4.10 is out.

If you think an issue is not resolved, please open a new BZ and link to the existing one.