Description of problem:
If a platform contains one or more NFS file system resources and the remote system for the NFS mount silently drops TCP packets to port 111, the NFS ping that is expected to quickly test an NFS server's availability hangs, and the platform's runtime discovery takes over 5 minutes to execute. Even then, the File System resource type is blacklisted and no file systems are discovered.

The following log messages are captured in agent.log:

2015-03-24 20:57:28,103 INFO [InventoryManager.discovery-1] (rhq.core.pc.inventory.RuntimeDiscoveryExecutor)- Executing runtime discovery scan rooted at [platform]...
2015-03-24 21:02:28,108 WARN [InventoryManager.discovery-1] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- The discovery component for resource type [ResourceType[id=0, name=File System, plugin=Platforms, category=Service]] has been blacklisted
2015-03-24 21:02:28,109 WARN [InventoryManager.discovery-1] (rhq.core.pc.inventory.InventoryManager)- Discovery for Resources of [ResourceType[id=0, name=File System, plugin=Platforms, category=Service]] has been running for more than 300000 milliseconds. This may be a plugin bug.
org.rhq.core.pc.inventory.TimeoutException: Call to [org.rhq.plugins.platform.FileSystemDiscoveryComponent.discoverResources()] with args [[org.rhq.core.pluginapi.inventory.ResourceDiscoveryContext@1f4d0999]] timed out. Invocation thread will be interrupted.
	at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ResourceDiscoveryComponentInvocationHandler.invokeInNewThread(DiscoveryComponentProxyFactory.java:256)
	at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ResourceDiscoveryComponentInvocationHandler.invoke(DiscoveryComponentProxyFactory.java:217)
	at com.sun.proxy.$Proxy43.discoverResources(Unknown Source)
	at org.rhq.core.pc.inventory.InventoryManager.invokeDiscoveryComponent(InventoryManager.java:385)
	at org.rhq.core.pc.inventory.InventoryManager.executeComponentDiscovery(InventoryManager.java:3001)
	at org.rhq.core.pc.inventory.RuntimeDiscoveryExecutor.discoverForResource(RuntimeDiscoveryExecutor.java:281)
	at org.rhq.core.pc.inventory.RuntimeDiscoveryExecutor.runtimeDiscover(RuntimeDiscoveryExecutor.java:146)
	at org.rhq.core.pc.inventory.RuntimeDiscoveryExecutor.call(RuntimeDiscoveryExecutor.java:104)
	at org.rhq.core.pc.inventory.RuntimeDiscoveryExecutor.run(RuntimeDiscoveryExecutor.java:92)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.Exception: Thread[ResourceDiscoveryComponent.invoker.daemon-1,5,main] with id [21] is hung. This exception contains its stack trace.
	at org.hyperic.sigar.RPC.ping(Native Method)
	at org.hyperic.sigar.NfsFileSystem.ping(NfsFileSystem.java:52)
	at org.hyperic.sigar.Sigar.getMountedFileSystemUsage(Sigar.java:707)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.rhq.core.system.SigarAccessHandler.invoke(SigarAccessHandler.java:128)
	at com.sun.proxy.$Proxy42.getMountedFileSystemUsage(Unknown Source)
	at org.rhq.core.system.FileSystemInfo.refresh(FileSystemInfo.java:60)
	at org.rhq.core.system.FileSystemInfo.<init>(FileSystemInfo.java:43)
	at org.rhq.core.system.NativeSystemInfo.getFileSystems(NativeSystemInfo.java:325)
	at org.rhq.plugins.platform.FileSystemDiscoveryComponent.discoverResources(FileSystemDiscoveryComponent.java:62)
	at sun.reflect.GeneratedMethodAccessor30.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ComponentInvocationThread.call(DiscoveryComponentProxyFactory.java:305)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	... 3 more

Version-Release number of selected component (if applicable):
3.3 build 4f16df3:e347f77

How reproducible:
Always

Steps to Reproduce:
1. On the remote host, install, configure, and start an NFS v3 server:
yum install -y nfs-utils
mkdir -p /export/home
cat > /etc/exports << EOF
/export 10.0.0.0/12(rw,sync,no_wdelay,fsid=0,insecure,no_subtree_check)
/export/home 10.0.0.0/12(rw,sync,no_wdelay,fsid=2,insecure,nohide,no_subtree_check)
EOF
cat >> /etc/fstab << EOF
# Exports
/home /export/home none rbind 0 0
EOF
mount -a
exportfs -rv
chkconfig --level 345 nfs on
service rpcbind restart
service nfs restart

2. On the NFS server, use iptables to silently drop UDP traffic to RPC:
iptables -I INPUT 1 -m state --state NEW -m udp -p udp --dport 111 -j DROP

3. On the JBoss ON agent/client host, configure and mount using NFS v3:
yum install -y nfs-utils
mkdir -p /mnt/nfs/v3/home
mount -t nfs -o nolock vm130.gsslab.rdu2.redhat.com:/export/home /mnt/nfs/v3/home/

4. On the NFS server, use iptables to silently drop TCP traffic to RPC:
iptables -I INPUT 1 -m state --state NEW -m tcp -p tcp --dport 111 -j DROP

5. Install, configure, and start the JBoss ON system.

6. From the agent installed on the NFS client, import the platform resource.

Actual results:
The platform's child resources (such as networking, CPUs, and the bundle handler) are missing for over five minutes after the platform has been imported. Once the child resources finally appear, all file systems (such as /, /dev/shm, and /boot) are missing.

Expected results:
The platform's child resources show up along with the other platform-level servers. This should include all available file systems except NFS. Alternatively, the NFS file system could be discovered but reported as unavailable.

Additional info:
This is due to the NFS ping that Sigar performs to check the availability of an NFS mount. Previous versions of Sigar had a bug that caused Sigar to hang when it encountered an NFS mount that was offline or unreachable. This was fixed by performing an RPC info request before attempting to read the file system stats of the NFS mount. However, no timeout is specified for the NFS ping, so the ping has to wait for the network timeout to occur. Although the reproducer described here is not a typical network configuration for an environment where an actual NFS mount is in use, it demonstrates the problem very clearly.
If a user even attempts to create an NFS mount to a remote server that does not support RPC, or if a later firewall configuration change causes RPC to stop working, even temporarily, it can have an adverse effect on the JBoss ON agent and its other resources.
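The missing piece described above is an upper bound on how long a single NFS probe may block. As an illustration only, the following Java sketch shows one general way to bound a blocking availability check with java.util.concurrent, so that an unresponsive mount is reported as unavailable after a short timeout instead of stalling the whole discovery until the agent's 300000 ms limit blacklists the File System type. It is not the change made in commit c99fdee; the class BoundedProbe, the method isAvailable, the simulated pings, and the 2-second timeout are all hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: not part of RHQ or Sigar.
public class BoundedProbe {

    // Daemon threads, so a probe stuck in native code (which may ignore
    // interruption) cannot keep the JVM alive.
    private static final ExecutorService PROBES = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "nfs-probe");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs a potentially blocking availability check with an upper bound.
     * Returns true only if the check completes in time and reports success;
     * a timeout, an exception, or an interruption counts as "unavailable".
     */
    public static boolean isAvailable(Callable<Boolean> check, long timeout, TimeUnit unit) {
        Future<Boolean> future = PROBES.submit(check);
        try {
            return Boolean.TRUE.equals(future.get(timeout, unit));
        } catch (TimeoutException e) {
            // Interrupt the worker; a blocked native call may keep running anyway.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            // The check itself failed, e.g. an RPC error.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulates an NFS ping that hangs, as when SYNs to port 111 are silently dropped.
        Callable<Boolean> hangingPing = () -> {
            Thread.sleep(60_000);
            return true;
        };
        // Simulates a server that answers the ping immediately.
        Callable<Boolean> healthyPing = () -> true;

        long start = System.nanoTime();
        boolean hung = isAvailable(hangingPing, 2, TimeUnit.SECONDS);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        System.out.println("hanging mount available: " + hung + " (gave up after ~" + elapsedMs + " ms)");
        System.out.println("healthy mount available: " + isAvailable(healthyPing, 2, TimeUnit.SECONDS));
    }
}
```

Because Future.cancel(true) only interrupts the worker thread, a thread blocked in native code such as Sigar's RPC.ping may stay blocked until the network timeout; the sketch therefore runs probes on daemon threads and simply abandons a stuck one. A real implementation would probably also want to limit how many stuck probe threads can accumulate for a mount that stays unreachable.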
commit c7014c2fa26791dbc37e0d3daa2c00cc650b7ab6
Merge: 835cca5 c99fdee
Author: Michael Burman <yak>
Date:   Tue May 31 17:09:54 2016 +0300

    Merge pull request #261 from rubenvp8510/Bug/1205429

    Bug 1205429 - Platform's file system resources are blacklisted and al…

commit c99fdee4422e1079916c3a5166d8e52efe882940
Author: Ruben Vargas <ruben.vp8510>
Date:   Fri May 27 09:43:19 2016 -0500

    Bug 1205429 - Platform's file system resources are blacklisted and all other child resources take 5 minutes to discover if NFS mount exists to host that is blocking RPC port
Moving to ON_QA as available to test with the following build:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=502442

Note: jon-server-patch-3.3.0.GA.zip maps to JON 3.3.6 (jon-server-3.3.0.GA-update-06.zip)
To start the NFS server successfully, the 'Steps to Reproduce' were slightly modified. (In the original, step 1 failed on the line "chkconfig --level 345 nfs on".) Instead, I used "chkconfig --level 345 nfs-server on" and manually launched/restarted the services nfs.service, nfs-lock.service, rpc-statd.service, rpcbind.service, and nfs-idmapd.service.

The file /etc/exports was also modified:
/export *(rw,sync,no_wdelay,fsid=1,insecure,no_subtree_check)
/export/home *(rw,sync,no_wdelay,fsid=2,insecure,nohide,no_subtree_check)

The iptables rules were used as is, without modification. With these changes, the setup worked. The results were as expected: child resources show up with the other platform-level servers.
In agent.log, no 'blacklisted' or 'is hung' messages are observed. There is no delay in resource discovery.
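The agent.log check above can be scripted when re-verifying on other hosts. The following is a minimal Java sketch, assuming the agent log is plain text with one message per line; the class AgentLogCheck, the method findDiscoveryFailures, and the default logs/agent.log path are hypothetical. The two markers are taken from the messages quoted in the original description.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical verification helper: not part of JBoss ON.
public class AgentLogCheck {

    // Markers from the failure messages in the original report.
    private static final String[] MARKERS = {"has been blacklisted", "is hung"};

    /** Returns every line of the log that contains one of the failure markers. */
    public static List<String> findDiscoveryFailures(Path agentLog) throws IOException {
        List<String> hits = new ArrayList<>();
        // ISO-8859-1 decodes any byte sequence, so stray non-UTF-8 bytes cannot abort the scan.
        try (Stream<String> lines = Files.lines(agentLog, StandardCharsets.ISO_8859_1)) {
            lines.forEach(line -> {
                for (String marker : MARKERS) {
                    if (line.contains(marker)) {
                        hits.add(line);
                        break; // record each line at most once
                    }
                }
            });
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass the agent's actual log path as the first argument.
        Path log = Paths.get(args.length > 0 ? args[0] : "logs/agent.log");
        List<String> failures = findDiscoveryFailures(log);
        if (failures.isEmpty()) {
            System.out.println("No blacklisted or hung discovery components found.");
        } else {
            failures.forEach(System.out::println);
            System.exit(1);
        }
    }
}
```

A non-zero exit status when any marker is found makes the check easy to use from a shell script after importing the platform.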
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-1519.html