This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there.

Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry; the e-mail creates a ServiceNow ticket with Red Hat.

Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link will be available in a blue banner at the top of the page informing you that the bug has been migrated.
Bug 2152538 - Increase resilience of gvfsd against unresponsive NFS shares
Summary: Increase resilience of gvfsd against unresponsive NFS shares
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: gvfs
Version: 8.7
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Ondrej Holy
QA Contact: Tomas Pelka
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-12-12 09:31 UTC by Phil Jasbutis
Modified: 2023-09-11 08:41 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-11 08:41:58 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
GNOME Gitlab GNOME gvfs merge_requests 169 0 None merged Increase trash backend resilience against stale NFS mounts 2023-08-24 06:50:23 UTC
Red Hat Issue Tracker RHEL-2824 0 None Migrated None 2023-09-11 08:42:02 UTC
Red Hat Issue Tracker RHELPLAN-141857 0 None None None 2022-12-12 09:33:46 UTC

Description Phil Jasbutis 2022-12-12 09:31:11 UTC
In enterprise environments where a RHEL 8.7 workstation uses autofs with hundreds or thousands of
NFS shares in the back end, unresponsive or misbehaving shares can seriously impact applications
relying on gvfsd-trash.

Symptoms seen on users' desktops:

  (1) The trash and home icons are not shown on the desktop (requires the Desktop Icons extension to be in use).
  (2) The lock screen can hang forever, so a user is trapped there.

These symptoms could be narrowed down to gvfsd-trash (gvfs-1.36.2-14.el8.x86_64), which seems to
struggle when facing non-responsive or misbehaving NFS endpoints (even ones that are not actively
mounted but are known to autofs).

Killing gvfsd-trash causes the screen to unlock immediately, and the icons become operational again.


This bugzilla is meant to enhance the resilience of gvfsd along the following lines:

  (A) gvfsd-trash should only check / stat NFS shares when it is really required
    -> Reduce or disable proactive NFS endpoint lookups done for no reason (a rough caching sketch follows below)

  (B) gvfsd-trash should answer application requests even when there are unresponsive NFS shares
    -> It should return only the responsive endpoints, to keep applications unaffected and break
       the issue chain.
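
A minimal sketch of idea (A), assuming a hypothetical helper mount_seems_responsive() and an
injected probe callback (none of these names exist in gvfs); the point is only that repeated
icon refreshes could reuse a cached verdict instead of touching the share every time:

#include <glib.h>

/* Hypothetical: remember whether a mount answered recently so that icon
 * refreshes do not stat it again and again. All names are illustrative. */
typedef struct {
  gint64 checked_at;     /* monotonic time of the last probe */
  gboolean responsive;   /* result of that probe */
} MountState;

#define PROBE_TTL (30 * G_TIME_SPAN_SECOND)

static GHashTable *mount_states;   /* mount path -> MountState */

static gboolean
mount_seems_responsive (const char *path,
                        gboolean (*probe) (const char *path))
{
  gint64 now;
  MountState *state;

  if (mount_states == NULL)
    mount_states = g_hash_table_new_full (g_str_hash, g_str_equal,
                                          g_free, g_free);

  now = g_get_monotonic_time ();
  state = g_hash_table_lookup (mount_states, path);

  if (state != NULL && now - state->checked_at < PROBE_TTL)
    return state->responsive;      /* trust the cached verdict */

  if (state == NULL)
    {
      state = g_new0 (MountState, 1);
      g_hash_table_insert (mount_states, g_strdup (path), state);
    }

  state->checked_at = now;
  state->responsive = probe (path);  /* e.g. a stat with a timeout */
  return state->responsive;
}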

Comment 6 Ondrej Holy 2023-01-17 15:28:36 UTC
I did some experiments and can reproduce the unlock-screen hang more or less reliably in the following way:

1) Enable the Desktop Icons extension.
2) Mount an NFS share, e.g. in /media/test, with default options.
3) Stop the NFS server.
4) Open the trash folder in Nautilus to confirm that the folder is not loading. It sometimes takes some time (probably until a rescan is scheduled).
5) Lock the screen.
6) Now it is not possible to show the password prompt and/or unlock the screen after entering the password; or rather, it takes 30 seconds before that is possible...

I can obtain the following gnome-shell backtrace at that moment:
Thread 1 (Thread 0x7f002fdcd500 (LWP 6409)):
#0  0x00007f002d005ae1 in poll () at /lib64/libc.so.6
#1  0x00007f002eec8c86 in g_main_context_iterate.isra () at /lib64/libglib-2.0.so.0
#2  0x00007f002eec9042 in g_main_loop_run () at /lib64/libglib-2.0.so.0
#3  0x00007f002f4b02e7 in g_dbus_connection_send_message_with_reply_sync () at /lib64/libgio-2.0.so.0
#4  0x00007f002f4b06df in g_dbus_connection_call_sync_internal () at /lib64/libgio-2.0.so.0
#5  0x00007f002f4bcce9 in g_dbus_proxy_call_sync_internal () at /lib64/libgio-2.0.so.0
#6  0x00007f002f4be0e8 in g_dbus_proxy_call_sync () at /lib64/libgio-2.0.so.0
#7  0x00007efff40514a6 in gvfs_dbus_mount_call_query_info_sync (proxy=proxy@entry=0x561bd1a70600, arg_path_data=<optimized out>, arg_attributes=arg_attributes@entry=0x561bd3e89230 "metadata::*,standard::*,access::*,time::modified,unix::mode", arg_flags=arg_flags@entry=0, arg_uri=arg_uri@entry=0x561bd2634380 "trash:///", out_info=out_info@entry=0x7ffe030d60f8, cancellable=0x7efff813a4a0, error=0x7ffe030d6100) at gvfsdbus.c:11662
#8  0x00007efff42819c4 in g_daemon_file_query_info (file=0x7efff8068290, attributes=0x561bd3e89230 "metadata::*,standard::*,access::*,time::modified,unix::mode", flags=G_FILE_QUERY_INFO_NONE, cancellable=0x7efff813a4a0, error=0x7ffe030d63e0) at gdaemonfile.c:807
#9  0x00007f0029e9114e in ffi_call_unix64 () at /lib64/libffi.so.6
#10 0x00007f0029e90aff in ffi_call () at /lib64/libffi.so.6
#11 0x00007f002de5a777 in gjs_invoke_c_function(JSContext*, Function*, JS::HandleObject, JS::HandleValueArray const&, mozilla::Maybe<JS::MutableHandle<JS::Value> >, GIArgument*) (context=<optimized out>, function=0x561bd3e04760, obj=..., args=..., js_rval=..., r_value=<optimized out>) at gi/function.cpp:1110
#12 0x00007f002de5c0bb in function_call(JSContext*, unsigned int, JS::Value*) (context=0x561bcffee500, js_argc=3, vp=0x561bd27052c8) at /usr/include/mozjs-60/js/RootingAPI.h:1090
#13 0x00007f0024f42534 in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) () at /lib64/libmozjs-60.so.0
#14 0x00007f0024f355bf in Interpret(JSContext*, js::RunState&) () at /lib64/libmozjs-60.so.0
#15 0x00007f0024f41ef6 in js::RunScript(JSContext*, js::RunState&) () at /lib64/libmozjs-60.so.0
#16 0x00007f0024f42499 in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) () at /lib64/libmozjs-60.so.0
#17 0x00007f0024f355bf in Interpret(JSContext*, js::RunState&) () at /lib64/libmozjs-60.so.0
#18 0x00007f0024f41ef6 in js::RunScript(JSContext*, js::RunState&) () at /lib64/libmozjs-60.so.0
#19 0x00007f0024f42499 in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) () at /lib64/libmozjs-60.so.0
#20 0x00007f0024f426fd in js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>) () at /lib64/libmozjs-60.so.0
#21 0x00007f00253db022 in js::CallSelfHostedFunction(JSContext*, JS::Handle<js::PropertyName*>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>) () at /lib64/libmozjs-60.so.0
#22 0x00007f00252a7c19 in AsyncFunctionResume(JSContext*, JS::Handle<js::PromiseObject*>, JS::Handle<JS::Value>, ResumeKind, JS::Handle<JS::Value>) () at /lib64/libmozjs-60.so.0
#23 0x00007f0024fae065 in PromiseReactionJob(JSContext*, unsigned int, JS::Value*) () at /lib64/libmozjs-60.so.0
#24 0x00007f0024f42314 in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) () at /lib64/libmozjs-60.so.0
#25 0x00007f0024f426fd in js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>) () at /lib64/libmozjs-60.so.0
#26 0x00007f002527c26d in JS::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>) () at /lib64/libmozjs-60.so.0
#27 0x00007f002de7f530 in JS::Call (thisv=..., rval=..., args=..., funObj=..., cx=<optimized out>) at /usr/include/mozjs-60/js/RootingAPI.h:1090
#28 0x00007f002de7f530 in GjsContextPrivate::run_jobs() (this=0x561bcffe2120) at gjs/context.cpp:701
#29 0x00007f002de7f63d in GjsContextPrivate::drain_job_queue_idle_handler(void*) (data=0x561bcffe2120) at gjs/context.cpp:627
#30 0x00007f002eec527b in g_idle_dispatch () at /lib64/libglib-2.0.so.0
#31 0x00007f002eec895d in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#32 0x00007f002eec8d18 in g_main_context_iterate.isra () at /lib64/libglib-2.0.so.0
#33 0x00007f002eec9042 in g_main_loop_run () at /lib64/libglib-2.0.so.0
#34 0x00007f002d35bb60 in meta_run () at /lib64/libmutter-4.so.0
#35 0x0000561bcec1a56f in main (argc=<optimized out>, argv=<optimized out>) at ../src/main.c:503

So it is not hung completely, but this is still something that should not happen on the unlock screen. I can't reproduce it without the Desktop Icons extension enabled, so it seems related to it. Can one synchronous call in an extension really hang the whole gnome-shell? Oops! If so, the Desktop Icons extension definitely needs to be ported to use asynchronous methods instead of synchronous ones (the general GIO pattern is sketched below). Florian?
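
For illustration only, here is the general GIO pattern in C (the extension itself is JavaScript,
where the same g_dbus_proxy_call API is available through GJS). "QueryInfo" is the D-Bus method
seen in the backtrace above; its real arguments (path data, attributes, flags, URI) are elided
here for brevity, so this is a sketch of the call shape, not the extension's actual code:

#include <gio/gio.h>

static void
query_done (GObject *source, GAsyncResult *res, gpointer user_data)
{
  GError *error = NULL;
  GVariant *ret = g_dbus_proxy_call_finish (G_DBUS_PROXY (source), res, &error);

  if (ret == NULL)
    {
      g_warning ("trash query failed: %s", error->message);
      g_clear_error (&error);
      return;
    }
  /* ...update the trash icon from the returned info... */
  g_variant_unref (ret);
}

static void
query_trash_async (GDBusProxy *proxy)
{
  /* BAD: g_dbus_proxy_call_sync() blocks the calling thread -- and with it
   * the whole shell -- until gvfsd-trash answers, which it may never do
   * while an NFS share is dead. */

  /* GOOD: returns immediately; query_done() runs later from the main loop,
   * and the timeout guarantees an error instead of an endless wait. */
  g_dbus_proxy_call (proxy, "QueryInfo",
                     NULL /* real arguments elided for brevity */,
                     G_DBUS_CALL_FLAGS_NONE,
                     5000 /* timeout in ms */,
                     NULL /* cancellable */,
                     query_done, NULL);
}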

Comment 7 Ondrej Holy 2023-01-17 15:30:37 UTC
(In reply to Phil Jasbutis from comment #0)
...
> This bugzilla is meant to enhance the resilience of gvfsd along the
> following lines:
> 
>   (A) gvfsd-trash should only check / stat NFS shares when it is really
> required
>     -> Reduce or disable proactive NFS endpoint lookups done for no reason

Unfortunately, there is a reason: the Desktop Icons extension, the Nautilus application, and the File Chooser dialog make requests to the trash backend to determine which icon should be shown (full or empty). I know this is a really stupid reason, but it is what it is currently. I am not saying this can't be optimized. Definitely, a wrong icon is better than a hung system...
 
>   (B) gvfsd-trash should answer application requests even when there are
> unresponsive NFS shares
>     -> It should return only the responsive endpoints, to keep applications
> unaffected and break the issue chain.

The main problem with NFS is that, by default, all operations hang forever and can't be interrupted. I am not aware of any safe way to determine whether an NFS share is responsive without risking a hang. But we can perhaps sacrifice one thread instead of blocking the whole backend; a rough sketch of that idea follows. I will have to look at the code more deeply...
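
A minimal sketch of the "sacrifice one thread" idea: the potentially hanging stat() runs in a
worker thread, and the caller waits only up to a deadline. On timeout the worker is abandoned
and cleans up after itself whenever the kernel finally lets the stat() return. This is not the
code that landed in gvfs merge request 169; all names are illustrative:

#include <glib.h>
#include <sys/stat.h>

typedef struct {
  char *path;
  GMutex lock;
  GCond cond;
  gboolean done;
  gboolean ok;
  gint refs;             /* accessed atomically */
} StatJob;

static void
stat_job_unref (StatJob *job)
{
  if (g_atomic_int_dec_and_test (&job->refs))
    {
      g_mutex_clear (&job->lock);
      g_cond_clear (&job->cond);
      g_free (job->path);
      g_free (job);
    }
}

static gpointer
stat_worker (gpointer data)
{
  StatJob *job = data;
  struct stat st;
  gboolean ok = stat (job->path, &st) == 0;   /* may block indefinitely */

  g_mutex_lock (&job->lock);
  job->ok = ok;
  job->done = TRUE;
  g_cond_signal (&job->cond);
  g_mutex_unlock (&job->lock);
  stat_job_unref (job);
  return NULL;
}

/* Returns TRUE iff PATH answered a stat() within TIMEOUT_MS. */
static gboolean
stat_with_timeout (const char *path, guint timeout_ms)
{
  StatJob *job = g_new0 (StatJob, 1);
  gint64 deadline;
  gboolean ok;

  job->path = g_strdup (path);
  job->refs = 2;                              /* caller + worker */
  g_mutex_init (&job->lock);
  g_cond_init (&job->cond);

  g_thread_unref (g_thread_new ("stat-probe", stat_worker, job));

  deadline = g_get_monotonic_time () + timeout_ms * G_TIME_SPAN_MILLISECOND;
  g_mutex_lock (&job->lock);
  while (!job->done &&
         g_cond_wait_until (&job->cond, &job->lock, deadline))
    ;
  ok = job->done && job->ok;
  g_mutex_unlock (&job->lock);
  stat_job_unref (job);
  return ok;
}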

Comment 8 Florian Müllner 2023-01-17 16:15:52 UTC
(In reply to Ondrej Holy from comment #6)
> Can one synchronous call in an extension really hang the whole gnome-shell?

Other than worker threads or (in newer mutter versions) the input thread: Yes.

In particular, all JavaScript code runs in a single thread.


> If so, the Desktop Icons extension definitely needs to be ported to use
> asynchronous methods instead of synchronous ones.

Agreed, I'll look into it.

Comment 10 Florian Müllner 2023-01-18 00:29:53 UTC
@oholy As it sounds like you have a test system set up, could you test the scratch build at http://brew-task-repos.usersys.redhat.com/repos/scratch/fmuellne/gnome-shell-extensions/3.32.1/33.el8/?

Comment 11 Ondrej Holy 2023-01-18 10:57:12 UTC
I've just tested your scratch build and can't reproduce the unlock-screen issue anymore, great!

Comment 13 Florian Müllner 2023-01-18 15:47:55 UTC
I opened https://bugzilla.redhat.com/show_bug.cgi?id=2162017 (RHEL 8) and https://bugzilla.redhat.com/show_bug.cgi?id=2162019 (RHEL 9) to address the extension issue.

Comment 19 RHEL Program Management 2023-09-11 08:39:34 UTC
Issue migration from Bugzilla to Jira is in progress at this time. This will be the last message copied from the Bugzilla bug to Jira.

Comment 20 RHEL Program Management 2023-09-11 08:41:58 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between the systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of two footprints next to it and begin with "RHEL-" followed by an integer. You can also find the issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing the issue, you can file an issue by sending mail to rh-issues@redhat.com. You can also visit https://access.redhat.com/articles/7032570 for general account information.

