Description of problem:

The RHQ agent now has about 14,000 "active" threads. Many of them are blocked like this:

"ResourceContainer.invoker.daemon-4352" daemon prio=10 tid=0x00002aaab137d000 nid=0x275d waiting for monitor entry [0x000000004a454000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.rhq.core.system.SigarAccess$SigarAccessHandler.invoke(SigarAccess.java:100)
	- waiting to lock <0x00000000f0b07420> (a org.rhq.core.system.SigarAccess$SigarAccessHandler)
	at $Proxy38.getCpuInfoList(Unknown Source)
	at org.rhq.core.system.CpuInformation.refresh(CpuInformation.java:79)
	at org.rhq.plugins.platform.CpuComponent.getAvailability(CpuComponent.java:74)
	at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocationThread.call(ResourceContainer.java:634)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)

The thread count has grown over the past few days.
One thread is blocked getting disk info:

"ResourceContainer.invoker.daemon-4226" daemon prio=10 tid=0x00002aaab1bc0800 nid=0x1c65 runnable [0x0000000045505000]
   java.lang.Thread.State: RUNNABLE
	at org.hyperic.sigar.FileSystemUsage.gather(Native Method)  <-- this call never completes
	at org.hyperic.sigar.FileSystemUsage.fetch(FileSystemUsage.java:30)
	at org.hyperic.sigar.Sigar.getMountedFileSystemUsage(Sigar.java:712)
	at sun.reflect.GeneratedMethodAccessor50.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.rhq.core.system.SigarAccess$SigarAccessHandler.invoke(SigarAccess.java:104)
	- locked <0x00000000f0b07420> (a org.rhq.core.system.SigarAccess$SigarAccessHandler)
	at $Proxy38.getMountedFileSystemUsage(Unknown Source)
	at org.rhq.core.system.FileSystemInfo.refresh(FileSystemInfo.java:64)
	at org.rhq.core.system.FileSystemInfo.<init>(FileSystemInfo.java:47)
	at org.rhq.core.system.NativeSystemInfo.getFileSystem(NativeSystemInfo.java:339)
	at org.rhq.plugins.platform.FileSystemComponent.getFileSystemInfo(FileSystemComponent.java:95)
	at org.rhq.plugins.platform.FileSystemComponent.getValues(FileSystemComponent.java:71)
	at sun.reflect.GeneratedMethodAccessor51.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocationThread.call(ResourceContainer.java:634)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)

Version-Release number of selected component (if applicable):
RHQ 4.5.1

How reproducible:
It appears that if there is a bad disk or some similar issue, Sigar blocks. At the very least, the number of ResourceContainer.invoker.daemon-X threads should be bounded to keep the agent from adding to the misery.
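One way to bound the invoker threads, as suggested above, is a fixed-size pool with a bounded work queue that rejects new invocations when saturated. The sketch below is illustrative only (class name, pool sizes, and queue capacity are invented assumptions, not RHQ's actual configuration):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a bounded invoker pool. With a fixed maximum thread
// count and a bounded work queue, a flood of stuck component invocations is
// rejected (RejectedExecutionException) instead of growing the thread count
// without limit.
public final class BoundedInvokerPool {
    public static ThreadPoolExecutor create(int maxThreads, int queueCapacity) {
        return new ThreadPoolExecutor(
                1, maxThreads,                           // core and max pool size
                60, TimeUnit.SECONDS,                    // idle thread keep-alive
                new ArrayBlockingQueue<>(queueCapacity), // bounded queue
                new ThreadPoolExecutor.AbortPolicy());   // reject when saturated
    }
}
```

The trade-off is that rejected invocations fail fast rather than piling up behind a hung native call.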
Thomas, could you please have a look at this?
Can you confirm the root cause of the never-ending FileSystemUsage#gather call? What do you mean by "mount is gone"?
Basically the filesystem failed on the machine, e.g. ls /mntX would hang on this volume. I'm guessing something similar might happen with an NFS mount that is hanging as well.
From rhq-devel ML:

Hi,

I've worked on a fix for Bug 923400 - Sigar creates high number of blocked threads (unbounded) if mount is gone.

The problem is that we use a shared instance of Sigar, and threads may block waiting to use it if a call takes too long or never returns.

The fix is in a bug branch: https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?h=bug/923400

It consists of a behavior change in SigarAccessHandler. Let me paste here the new Javadoc of the class:

****
An InvocationHandler for a org.hyperic.sigar.SigarProxy. A single instance of this class will be created by the SigarAccess class.

This class holds a shared Sigar instance and serializes calls to it. If a thread waits more than 'sharedSigarLockMaxWait' seconds, it will be given a new Sigar instance, which will be destroyed at the end of the call.

Every 5 minutes, a background task checks that 'sigarInstancesThreshold' has not been exceeded. If it has, a warning message will be logged, optionally with a thread dump.

This class is configurable with system properties:

* sharedSigarLockMaxWait: maximum time a thread will wait for acquisition of the shared Sigar lock; defaults to 5
* sigarInstancesThreshold: threshold of currently live Sigar instances at which the background task will print warning messages; defaults to 50
* threadDumpOnSigarInstancesThreshold: if set to true (case insensitive), the background task will also log a thread dump when sigarInstancesThreshold is reached
****

This change will not prevent problems like the one Elias reported (a call to #getMountedFileSystemUsage never returning because of a bad FS mount). But it will put the agent into a degraded mode in which other calls to Sigar still succeed and warnings are logged. It should even be possible to fire an alert on this log event if the agent is inventoried.

What's your opinion?

Thanks and regards,
Thomas
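The serialize-with-fallback behavior the Javadoc describes can be sketched with a ReentrantLock and a timed tryLock. This is a simplified illustration of the pattern, not the actual SigarAccessHandler code: the class and method names are invented, and a counter stands in for real Sigar instances.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: try to acquire the shared-instance lock for a bounded
// time; on timeout, fall back to a throwaway per-call instance so the caller
// is never blocked indefinitely behind a hung call.
public final class TimedLockFallback {
    private final ReentrantLock sharedLock = new ReentrantLock();
    private final AtomicInteger liveOnDemandInstances = new AtomicInteger();
    private final long maxWaitSeconds; // analogous to sharedSigarLockMaxWait

    public TimedLockFallback(long maxWaitSeconds) {
        this.maxWaitSeconds = maxWaitSeconds;
    }

    // Returns "shared" if the call ran under the shared lock, or "on-demand"
    // if a fallback instance had to be created for this call.
    public String invoke(Runnable call) throws InterruptedException {
        if (sharedLock.tryLock(maxWaitSeconds, TimeUnit.SECONDS)) {
            try {
                call.run();                          // use the shared instance
                return "shared";
            } finally {
                sharedLock.unlock();
            }
        }
        liveOnDemandInstances.incrementAndGet();     // a background task could
        try {                                        // warn past a threshold
            call.run();                              // use a throwaway instance
            return "on-demand";
        } finally {
            liveOnDemandInstances.decrementAndGet(); // "destroy" it after the call
        }
    }

    public int liveOnDemandInstances() {
        return liveOnDemandInstances.get();
    }
}
```

The instance counter is what the 5-minute background task would compare against sigarInstancesThreshold; a steadily high value signals that the shared instance is stuck.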
From rhq-devel ML:

One additional change may be to set a limit on the number of on-demand Sigar instances and reject calls when that limit is reached.
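That suggested safeguard, a hard cap with rejection, could be sketched with a Semaphore. Again this is an illustration only; the class name and the idea of returning false on rejection are assumptions, not taken from the actual fix:

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: cap the number of concurrently live on-demand
// instances and reject calls once the cap is reached, instead of creating
// instances without bound.
public final class BoundedOnDemandCalls {
    private final Semaphore permits;

    public BoundedOnDemandCalls(int maxConcurrentInstances) {
        this.permits = new Semaphore(maxConcurrentInstances);
    }

    // Runs the call with a permit held; returns false (rejection) if the
    // instance limit has already been reached.
    public boolean tryCall(Runnable call) {
        if (!permits.tryAcquire()) {
            return false;          // limit reached: reject rather than pile up
        }
        try {
            call.run();
            return true;
        } finally {
            permits.release();     // instance "destroyed" at end of call
        }
    }
}
```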
Fixed in master, commit fe38a28bd0ba7df967ec8b6a7f5f2b4a6bb839d6

Author: Thomas Segismont <tsegismo>
Date:   Wed Jul 3 12:27:55 2013 +0200
Bulk closing now that 4.10 is out. If you think an issue is not resolved, please open a new BZ and link to the existing one.