Bug 1632960

Summary:	Hundreds of gvfsd-trash processes are spawned when user runs Xsession/Gnome after an NFS session failed
Product:	Red Hat Enterprise Linux 7	Reporter:	Robert Verstandig <r.verstandig>
Component:	gvfs	Assignee:	Ondrej Holy <oholy>
Status:	CLOSED ERRATA	QA Contact:	Desktop QE <desktop-qa-list>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	7.5	CC:	jwright, mboisver, r.verstandig, tpelka
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	gvfs-1.36.2-2.el7	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1739117		Environment:
Last Closed:	2019-08-06 12:57:59 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1656436, 1739117

Description Robert Verstandig 2018-09-25 23:02:55 UTC

Description of problem:

I have had this issue come up twice in RHEL 7.5 in the last three months. Both events occurred when NFS shares failed during normal operation. The NFS failures were from shared folders on two different RHEL servers.

In the first event, the problem occurred after an exported NFS3 share on a local el6.8 server was removed from service when the server was decommissioned, but an entry for the share was left in the /etc/fstab file on the primary node. After some time the gvfsd-trash process on the primary node continuously respawned for multiple users and eventually crashed the node. The entry in the fstab file was removed and the server restarted to fix the issue.
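A stale NFS entry in /etc/fstab, as in the first event, can be spotted by listing which servers the file still refers to. The following is only a minimal sketch in Python; the function name nfs_entries is made up for illustration, and it assumes the usual whitespace-separated fstab layout with a plain "server:/export" device field (bracketed IPv6 addresses are not handled).

```python
def nfs_entries(fstab_text):
    """Return (server, export, mount_point) for each nfs/nfs4 line in fstab text.

    Hypothetical helper for illustration; it is not part of gvfs or nfs-utils.
    """
    entries = []
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # fstab fields: device, mount point, filesystem type, options, ...
        if len(fields) < 3 or fields[2] not in ("nfs", "nfs4"):
            continue
        server, sep, export = fields[0].partition(":")
        if not sep:
            continue  # not in server:/export form
        entries.append((server, export, fields[1]))
    return entries
```

Passing the contents of /etc/fstab to nfs_entries() gives the list of servers to check by hand (for example, whether a decommissioned host is still listed).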

In the second event, the issue occurred after two NFS3 shares from a local el5.8 node stopped responding on the primary node. The gvfsd-trash process on the primary node continuously respawned for a single user and eventually crashed the node. The node was not recoverable and had to be restarted. The restart took around 15-20 minutes probably due to NFS timeouts. The el5.8 server was not restarted.

Both of these issues were limited to the primary node on a cluster of 12 servers (eight running el7.5 and four running el6.8). The same shared folders on the other nodes continued to work as normal.

There were no errors logged on either the el7.5 or el5.8 servers relating to the second NFS share failure, but any type of access to the shares from the primary el7.5 node seemed to hang (probably due to NFS timeouts).

The second event also broke the root login. It appeared to hang during the login process; pressing Ctrl-C produced a bash prompt, but the shell was broken and kept throwing vte errors until I manually sourced the .bashrc file in /root. Subsequent logins failed the same way. The actual error displayed from the shell was:

bash: __vte_prompt_command: command not found.

An additional NFS shared folder mapped from Dell EMC Isilon storage on the main network also continued to work across all nodes including the primary node, i.e., only the two local el5.8 shares failed on the primary node. Note also that the user home folders are located on the Isilon storage that continued to work ok during both events.

It is unknown at this point why the NFS shares mapped from the el5.8 server stopped working in the second event, but it seemed to be an NFS failure on the el7.5 node rather than an issue on the el5.8 server, since the problem was limited to the primary node. Prior to el7.5 we had no issues with the NFS shared folders on any of the nodes.

Version-Release number of selected component (if applicable):

The gvfs components on the el7 server are:
gvfs-1.30.4-5.el7.x86_64
gvfs-fuse-1.30.4-5.el7.x86_64
gvfs-gphoto2-1.30.4-5.el7.x86_64
gvfs-afc-1.30.4-5.el7.x86_64
gvfs-client-1.30.4-5.el7.x86_64
gvfs-mtp-1.30.4-5.el7.x86_64
gvfs-smb-1.30.4-5.el7.x86_64
gvfs-goa-1.30.4-5.el7.x86_64
gvfs-archive-1.30.4-5.el7.x86_64
gvfs-afp-1.30.4-5.el7.x86_64

The NFS components on the EL5 server are:
nfs-utils-lib-1.0.8-7.9.el5
nfs-utils-1.0.9-60.el5

The server VNC components installed are:
tigervnc-server-1.8.0-5.el7.x86_64
tigervnc-icons-1.8.0-5.el7.noarch
tigervnc-license-1.8.0-5.el7.noarch
tigervnc-1.8.0-5.el7.x86_64
gvnc-0.7.0-3.el7.x86_64
tigervnc-server-minimal-1.8.0-5.el7.x86_64
gtk-vnc2-0.7.0-3.el7.x86_64

How reproducible:
Not reproducible at this point.

Steps to Reproduce:
1. Unable to test on the production system
2.
3.

Actual results:

Expected results:

Additional info:

After each failure, the primary el7.5 node was not recoverable and had to be restarted.

The el7.5 nodes are all VMware-based, running in an ESXi 6.5 environment. CPU and memory are mapped on a one-to-one basis with the physical hardware (one VM per ESXi server) so there is no over-subscription of resources.

The remaining el6.5 nodes are also all based on a master VM, with resources assigned as with the el7.5 nodes. The el5.8 server is not part of the cluster environment and only provides the legacy VNX shares (40TB of data).

The users connect to the primary node via Putty, set up a TurboVNC client 2.1/2.2 session then work from there on the primary node. The nodes all have the TigerVNC server components installed.

The primary node also runs Ganglia, Torque PBS 6.1.2, Open MPI and MPICH3 servers. There are some local apps installed on each of the nodes as well as on the Isilon shared folder. The el5.8 shared folders are used for project data only and are on Dell EMC VNX backend hardware.

cat /etc/redhat-release:
Red Hat Enterprise Linux Server release 7.5 (Maipo)

uname -a
Linux u3rb09s1 3.10.0-862.11.6.el7.x86_64 #1 SMP Fri Aug 10 16:55:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Comment 3 Ondrej Holy 2018-10-01 14:17:59 UTC

Thanks for your detailed report. I think we have enough info here to reproduce and propose some fix thanks to the fact that it relates to NFS failures. This was fixed upstream for some time simply by ignoring network filesystems (because they were marked as system-internal). However, this upstream change has been recently reverted and thus this bug has to be fixed in another way...

Comment 5 Ondrej Holy 2018-10-04 15:38:52 UTC

Actually, it is still not clear what is causing the large number of requests to the trash backend. I don't see it in my environment; it is probably a bug in some client application, and it would be nice to fix it, but I don't see an easy way to find the culprit. In any case, this does not block fixing the bug in the gvfs infrastructure that allows such a large number of gvfsd-trash processes to be spawned...

Comment 6 Ondrej Holy 2018-10-29 12:31:33 UTC

I am still trying to find the source of the requests. You are talking about hundreds of gvfsd-trash processes, but can you please also provide info about how many user sessions are running on that server, or rather how many processes are spawned per user session?

Comment 7 Robert Verstandig 2018-11-19 22:31:43 UTC

The NFS hangs seemed to be completely random. During semester we have around 20-30 user Turbo VNC sessions running. These seem to generate around 80 processes each including the main application the students run during classes. Right now there are a little over 2300 processes active on the frontend and only around the equivalent of 3 CPUs in use.

There is little actual processing load on the frontend node as most of the processes are idle. In Ganglia during even the busiest periods there are around 4-6 of the 28 CPUs in use at any one time. The nodes have 320 GB RAM with about 40 GB in use. The remaining memory is buffered/cached with around 3GB completely unused. There is 8 GB of swap installed; however, this is usually 100% free. The only time I have seen it change is after this issue occurs and the thousands of trash processes are generated. At that point the whole server is compromised anyway...

The actual processing workload is distributed across the 15 worker nodes via the Torque PBS batch queuing system so only these nodes are under load.

It is end of semester now so classes are over. I will go through and clean up the leftover student sessions early next month. It is difficult to find a time window when the cluster can be restarted as it is still quite busy with several researchers running heavy workloads over long periods of time across the worker nodes.

We will be decommissioning the VNX shared storage over the next two months and replacing it with new storage, which will be connected directly to the frontend. The old RHEL5 node that currently provides the VNX shares will be decommissioned so I expect (am hoping) that this problem will go away.

Let me know if there is anything else you need.

Comment 8 Ondrej Holy 2018-11-20 09:55:55 UTC

Thanks for the info.

I meant how many gvfsd-trash processes were spawned per user session, not processes in general. I'm sorry that I didn't write that clearly at first. Do you know that?

Anyway, the proposed fix ensures that only one gvfsd-trash is spawned per user at max.
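The general idea of "at most one instance per user" can be illustrated with an exclusive lock file. This is only a sketch of the concept in Python and is not how the gvfs fix is implemented; the names try_acquire and default_lock_path and the lock file name are hypothetical, and the use of XDG_RUNTIME_DIR as a per-user directory is an assumption.

```python
import fcntl
import os
import tempfile


def default_lock_path(name="trash-backend.lock"):
    """Pick a per-user lock location (assumption: XDG_RUNTIME_DIR is per user)."""
    base = os.environ.get("XDG_RUNTIME_DIR", tempfile.gettempdir())
    return os.path.join(base, name)


def try_acquire(lock_path):
    """Return an open file holding an exclusive lock, or None if already held.

    The lock is released when the returned file is closed or the process exits.
    """
    fh = open(lock_path, "a")
    try:
        # Non-blocking exclusive lock: fails at once if another holder exists.
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        fh.close()
        return None
    return fh
```

A process that gets None back from try_acquire() would exit instead of starting a second instance, so repeated spawn requests cannot pile up.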

Comment 9 Robert Verstandig 2018-11-20 11:22:59 UTC

OK, when the VNX NFS shares crashed, every user session began to spawn the trash processes until the frontend crashed, i.e., hundreds... I tried killing them all off via a search and destroy script but they just kept coming back. From memory one would spawn around every 30 seconds for each user. Times that by 30 users...
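For reference, the per-user count asked about in comment 8 could be collected with something like the sketch below. It is written in Python and is not taken from this report; the function names are made up, and it assumes the procps-ng ps found on RHEL 7 (options -e, -o user:32,comm and --no-headers).

```python
import subprocess
from collections import Counter


def count_by_user(ps_output, name="gvfsd-trash"):
    """Count processes called `name` per user from 'user comm' lines."""
    counts = Counter()
    for line in ps_output.splitlines():
        parts = line.split(None, 1)  # user, then the command name
        if len(parts) == 2 and parts[1].strip() == name:
            counts[parts[0]] += 1
    return counts


def snapshot(name="gvfsd-trash"):
    """Run ps once and return the per-user counts (needs procps-ng ps)."""
    out = subprocess.run(
        ["ps", "-eo", "user:32,comm", "--no-headers"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    ).stdout
    return count_by_user(out, name)
```

Calling snapshot() a few times a minute would show how fast the count grows. At the rate recalled above (about one new process every 30 seconds per user, with 30 users) that would be roughly 60 new processes per minute.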

Well hopefully your fix will stop that from occurring. I still have no idea why the NFS VNX shares crashed on the frontend only though.

Comment 13 errata-xmlrpc 2019-08-06 12:57:59 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2145