Bug 1205429 - Platform's file system resources are blacklisted and all other child resources take 5 minutes to discover if NFS mount exists to host that is blocking RPC port
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Plugin -- Other
Version: JON 3.3.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ER01
Target Release: JON 3.3.6
Assignee: Ruben Vargas Palma
QA Contact: vsorokin
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-03-24 21:45 UTC by Larry O'Leary
Modified: 2019-05-20 11:38 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-27 15:29:37 UTC
Type: Bug
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 1392003 | 0 | None | None | None | Never
Red Hat Product Errata | RHSA-2016:1519 | 0 | normal | SHIPPED_LIVE | Critical: Red Hat JBoss Operations Network 3.3.6 update | 2016-08-26 00:44:36 UTC

Description Larry O'Leary 2015-03-24 21:45:46 UTC
Description of problem:
If a platform contains one or more NFS file system resource types and the remote system for the NFS mount silently drops TCP packets to port 111, the NFS ping that is expected to quickly test an NFS server's availability hangs, and the platform's runtime discovery takes over 5 minutes to execute. Even then, the file system resource type is blacklisted and no file systems are discovered. The following log messages are captured in agent.log:

    2015-03-24 20:57:28,103 INFO  [InventoryManager.discovery-1] (rhq.core.pc.inventory.RuntimeDiscoveryExecutor)- Executing runtime discovery scan rooted at [platform]...
    2015-03-24 21:02:28,108 WARN  [InventoryManager.discovery-1] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- The discovery component for resource type [ResourceType[id=0, name=File System, plugin=Platforms, category=Service]] has been blacklisted
    2015-03-24 21:02:28,109 WARN  [InventoryManager.discovery-1] (rhq.core.pc.inventory.InventoryManager)- Discovery for Resources of [ResourceType[id=0, name=File System, plugin=Platforms, category=Service]] has been running for more than 300000 milliseconds. This may be a plugin bug.
    org.rhq.core.pc.inventory.TimeoutException: Call to [org.rhq.plugins.platform.FileSystemDiscoveryComponent.discoverResources()] with args [[org.rhq.core.pluginapi.inventory.ResourceDiscoveryContext@1f4d0999]] timed out. Invocation thread will be interrupted.
        at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ResourceDiscoveryComponentInvocationHandler.invokeInNewThread(DiscoveryComponentProxyFactory.java:256)
        at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ResourceDiscoveryComponentInvocationHandler.invoke(DiscoveryComponentProxyFactory.java:217)
        at com.sun.proxy.$Proxy43.discoverResources(Unknown Source)
        at org.rhq.core.pc.inventory.InventoryManager.invokeDiscoveryComponent(InventoryManager.java:385)
        at org.rhq.core.pc.inventory.InventoryManager.executeComponentDiscovery(InventoryManager.java:3001)
        at org.rhq.core.pc.inventory.RuntimeDiscoveryExecutor.discoverForResource(RuntimeDiscoveryExecutor.java:281)
        at org.rhq.core.pc.inventory.RuntimeDiscoveryExecutor.runtimeDiscover(RuntimeDiscoveryExecutor.java:146)
        at org.rhq.core.pc.inventory.RuntimeDiscoveryExecutor.call(RuntimeDiscoveryExecutor.java:104)
        at org.rhq.core.pc.inventory.RuntimeDiscoveryExecutor.run(RuntimeDiscoveryExecutor.java:92)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.Exception: Thread[ResourceDiscoveryComponent.invoker.daemon-1,5,main] with id [21] is hung. This exception contains its stack trace.
        at org.hyperic.sigar.RPC.ping(Native Method)
        at org.hyperic.sigar.NfsFileSystem.ping(NfsFileSystem.java:52)
        at org.hyperic.sigar.Sigar.getMountedFileSystemUsage(Sigar.java:707)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.rhq.core.system.SigarAccessHandler.invoke(SigarAccessHandler.java:128)
        at com.sun.proxy.$Proxy42.getMountedFileSystemUsage(Unknown Source)
        at org.rhq.core.system.FileSystemInfo.refresh(FileSystemInfo.java:60)
        at org.rhq.core.system.FileSystemInfo.<init>(FileSystemInfo.java:43)
        at org.rhq.core.system.NativeSystemInfo.getFileSystems(NativeSystemInfo.java:325)
        at org.rhq.plugins.platform.FileSystemDiscoveryComponent.discoverResources(FileSystemDiscoveryComponent.java:62)
        at sun.reflect.GeneratedMethodAccessor30.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ComponentInvocationThread.call(DiscoveryComponentProxyFactory.java:305)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        ... 3 more
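
The two timestamps in the log above bracket the hang. As a small illustration (not part of the agent code; the class and method names here are made up for this example), the elapsed time can be computed directly from the agent.log timestamp format, which confirms that the scan was cut off right at the 300000 millisecond limit mentioned in the warning:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class DiscoveryElapsed {
    // agent.log timestamps look like "2015-03-24 20:57:28,103" (comma before the milliseconds).
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Returns the milliseconds elapsed between two agent.log timestamps. */
    public static long elapsedMillis(String start, String end) {
        LocalDateTime from = LocalDateTime.parse(start, LOG_TS);
        LocalDateTime to = LocalDateTime.parse(end, LOG_TS);
        return Duration.between(from, to).toMillis();
    }

    public static void main(String[] args) {
        // Start of the runtime discovery scan and the blacklist warning from the log above.
        long ms = elapsedMillis("2015-03-24 20:57:28,103", "2015-03-24 21:02:28,108");
        System.out.println(ms + " ms"); // prints "300005 ms"
    }
}
```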


Version-Release number of selected component (if applicable):
3.3 build 4f16df3:e347f77

How reproducible:
Always

Steps to Reproduce:
1.  On remote host, install, configure, and start NFS v3 server.

        yum install -y nfs-utils
	mkdir -p /export/home
	cat > /etc/exports << EOF
/export               10.0.0.0/12(rw,sync,no_wdelay,fsid=0,insecure,no_subtree_check)
/export/home          10.0.0.0/12(rw,sync,no_wdelay,fsid=2,insecure,nohide,no_subtree_check)
EOF
	cat >> /etc/fstab << EOF
# Exports
/home                   /export/home           none    rbind           0 0
EOF
	mount -a
	exportfs -rv
	chkconfig --level 345 nfs on
	service rpcbind restart
	service nfs restart

2.  On NFS server, use iptables to silently drop UDP traffic to RPC:

	iptables -I INPUT 1 -m state --state NEW -m udp -p udp --dport 111 -j DROP

3.  On JBoss ON agent/client host, configure and mount using NFS v3:

	yum install -y nfs-utils
	mkdir -p /mnt/nfs/v3/home
	mount -t nfs -o nolock vm130.gsslab.rdu2.redhat.com:/export/home /mnt/nfs/v3/home/

4.  On NFS server, use iptables to also silently drop TCP traffic to RPC:

	iptables -I INPUT 1 -m state --state NEW -m tcp -p tcp --dport 111 -j DROP

5.  Install, configure, and start JBoss ON system.
6.  From agent installed on NFS client, import platform resource.
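
Before importing the platform, it can help to confirm from the agent host that port 111 on the NFS server really is dropping packets rather than refusing them. The following is only a sketch of such a check, assuming a plain TCP connect with an explicit timeout is an acceptable stand-in for the RPC request; `RpcPortProbe`, `probe`, and the `Result` values are hypothetical names, not anything from JBoss ON or Sigar:

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class RpcPortProbe {
    public enum Result { OPEN, REFUSED, TIMED_OUT, ERROR }

    /** Attempts a TCP connect with an explicit timeout and classifies the outcome. */
    public static Result probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return Result.OPEN;
        } catch (SocketTimeoutException e) {
            return Result.TIMED_OUT; // no answer at all, e.g. packets dropped by "iptables -j DROP"
        } catch (ConnectException e) {
            return Result.REFUSED;   // typically "Connection refused": nothing is listening
        } catch (IOException e) {
            return Result.ERROR;     // e.g. unknown host or no route to host
        }
    }

    public static void main(String[] args) {
        // The default host is a placeholder; pass the NFS server name as the first argument.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + ":111 -> " + probe(host, 111, 2000));
    }
}
```

With the iptables rule from step 4 in place, the expectation is a TIMED_OUT result after roughly the configured two seconds, which is the situation that the unbounded NFS ping never recovers from quickly.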

Actual results:
Platform's child resources -- such as networking, CPUs, bundle handler, etc. -- are missing for over five minutes after the platform has been imported.

Once child resources finally appear, all file systems are missing -- such as /, /dev/shm, /boot, etc.

Expected results:
Platform's child resources show up with other platform-level servers. This should include all available file systems except NFS. Alternatively, NFS could be discovered, but it should be reported as unavailable.
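
To make the second alternative concrete, here is a rough sketch of discovery that reports every mount but never makes a remote call for NFS entries, so one unreachable server cannot stall or blacklist the rest. This is not the plugin's code: `Mount`, `Discovered`, and `discover` are hypothetical types and methods invented for this example, and treating only "nfs" and "nfs4" as network types is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FileSystemDiscoverySketch {
    /** Hypothetical mount entry: mount point plus file system type (e.g. "ext4", "nfs"). */
    public static final class Mount {
        final String mountPoint;
        final String type;
        public Mount(String mountPoint, String type) {
            this.mountPoint = mountPoint;
            this.type = type;
        }
    }

    /** Hypothetical discovery result for a single file system. */
    public static final class Discovered {
        public final String mountPoint;
        public final boolean available;
        Discovered(String mountPoint, boolean available) {
            this.mountPoint = mountPoint;
            this.available = available;
        }
        @Override
        public String toString() {
            return mountPoint + " available=" + available;
        }
    }

    // Assumption: only these types need a remote round trip to determine usage.
    static boolean isNetworkFileSystem(String type) {
        return type.equals("nfs") || type.equals("nfs4");
    }

    /** Discovers all mounts; network mounts are listed as unavailable without being probed. */
    public static List<Discovered> discover(List<Mount> mounts) {
        List<Discovered> result = new ArrayList<>();
        for (Mount m : mounts) {
            result.add(new Discovered(m.mountPoint, !isNetworkFileSystem(m.type)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Mount> mounts = Arrays.asList(
                new Mount("/", "ext4"),
                new Mount("/boot", "ext4"),
                new Mount("/mnt/nfs/v3/home", "nfs"));
        discover(mounts).forEach(System.out::println);
    }
}
```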

Additional info:
This is due to the NFS ping that is performed by Sigar to check the availability of NFS. In previous versions of Sigar there was a bug that would result in Sigar hanging when it encountered an NFS mount that was offline or unreachable. This was fixed by performing an RPC info request before attempting to read the file system stats of the NFS mount. However, there is no timeout specified for the NFS ping. This means that the ping will have to wait for the network timeout to occur. 
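
One general way to bound a blocking check like this is to run it on a worker thread and stop waiting after a fixed time. The sketch below shows that pattern only; it is not the Sigar API and not necessarily what the eventual fix does. `BoundedPing` and `pingWithTimeout` are hypothetical names, and the sleeping task merely stands in for a ping that hangs. Note that a thread blocked inside a native call may not react to the interrupt, so the worker is a daemon thread and the caller simply moves on.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedPing {
    /**
     * Runs a potentially blocking check and gives up after timeoutMillis.
     * Returns false on timeout or failure instead of blocking the caller.
     */
    public static boolean pingWithTimeout(Callable<Boolean> ping, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-ping");
            t.setDaemon(true); // a stuck call must not keep the JVM alive
            return t;
        });
        Future<Boolean> future = executor.submit(ping);
        try {
            return Boolean.TRUE.equals(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; a blocked native call may ignore this
            return false;
        } catch (ExecutionException e) {
            return false;        // the check itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a ping that hangs: sleeps far longer than the timeout.
        boolean slow = pingWithTimeout(() -> { Thread.sleep(60_000); return true; }, 200);
        // Stand-in for a ping that answers immediately.
        boolean fast = pingWithTimeout(() -> true, 200);
        System.out.println("slow=" + slow + ", fast=" + fast);
    }
}
```

With a bound of a few seconds per mount, an unreachable NFS server would cost the discovery scan seconds rather than the full five-minute limit.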

Although the reproducer described here is not the typical or expected network configuration where an actual NFS mount is in use, it does demonstrate the problem very clearly. If a user even attempts to create an NFS mount to a remote server that does not support RPC, or if a later firewall configuration change causes RPC to stop working, even temporarily, it can have an adverse effect on the JBoss ON agent and its other resources.

Comment 4 Josejulio Martínez 2016-05-31 22:35:21 UTC
commit c7014c2fa26791dbc37e0d3daa2c00cc650b7ab6
Merge: 835cca5 c99fdee
Author: Michael Burman <yak>
Date:   Tue May 31 17:09:54 2016 +0300

    Merge pull request #261 from rubenvp8510/Bug/1205429
    
    Bug 1205429 - Platform's file system resources are blacklisted and al…

commit c99fdee4422e1079916c3a5166d8e52efe882940
Author: Ruben Vargas <ruben.vp8510>
Date:   Fri May 27 09:43:19 2016 -0500

    Bug 1205429 - Platform's file system resources are blacklisted and all other child resources take 5 minutes to discover if NFS mount exists to host that is blocking RPC port

Comment 6 Simeon Pinder 2016-07-07 08:23:00 UTC
Moving to ON_QA as available to test with the following build:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=502442

Note: 	jon-server-patch-3.3.0.GA.zip maps to JON 3.3.6(jon-server-3.3.0.GA-update-06.zip)

Comment 8 vsorokin 2016-07-13 21:15:10 UTC
To start the NFS server successfully, the 'Steps to Reproduce' were slightly modified.
(The original steps failed in step 1 at the line "chkconfig --level 345 nfs on".)

Instead, I used:
"chkconfig --level 345 nfs-server on"
and manually launched/restarted these services:
nfs.service,
nfs-lock.service,
rpc-statd.service,
rpcbind.service,
nfs-idmapd.service

The file '/etc/exports' was also modified:
/export               *(rw,sync,no_wdelay,fsid=1,insecure,no_subtree_check)
/export/home          *(rw,sync,no_wdelay,fsid=2,insecure,nohide,no_subtree_check)

The iptables rules were used as is, without modification.
With these changes, the setup worked.

Results turned out to be as expected: child resources show up with other platform level servers.

Comment 9 vsorokin 2016-07-14 11:35:05 UTC
No 'blacklisted' or 'is hung' messages are observed in agent.log.
There is no delay in resource discovery.
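
For anyone repeating this verification, the same check can be scripted. This is only a sketch under the assumption that the two marker strings quoted above are sufficient; `AgentLogCheck` and `findSuspectLines` are hypothetical names, and the default log path is a placeholder:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class AgentLogCheck {
    // Marker strings taken from the log excerpt in the bug description.
    private static final List<String> MARKERS = Arrays.asList("blacklisted", "is hung");

    /** Returns every line that contains at least one of the marker strings. */
    public static List<String> findSuspectLines(List<String> lines) {
        return lines.stream()
                .filter(line -> MARKERS.stream().anyMatch(line::contains))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // The path is an assumption; pass the real agent.log location as the first argument.
        Path log = Paths.get(args.length > 0 ? args[0] : "agent.log");
        List<String> hits = findSuspectLines(Files.readAllLines(log));
        if (hits.isEmpty()) {
            System.out.println("No 'blacklisted' or 'is hung' messages found.");
        } else {
            hits.forEach(System.out::println);
        }
    }
}
```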

Comment 11 errata-xmlrpc 2016-07-27 15:29:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-1519.html

