Bug 1664215 - Toggling readdir-ahead translator off causes some clients to umount some of its volumes
Summary: Toggling readdir-ahead translator off causes some clients to umount some of its volumes
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: GlusterFS
Classification: Community
Component: read-ahead
Version: 3.12
Hardware: other
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Assignee: Sheetal Pamecha
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-08 04:27 UTC by Amgad
Modified: 2023-09-14 04:44 UTC
CC: 3 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2019-10-04 04:59:05 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
Glusterfs client logs (251.37 KB, application/gzip)
2019-01-08 04:27 UTC, Amgad

Description Amgad 2019-01-08 04:27:41 UTC
Created attachment 1519119 [details]
Glusterfs client logs

Description of problem:

After turning the Read ahead translator "off", some of the (fuse) clients got disconnected (unmounted) from one of the data volumes. The glusterfs logs from the client that experienced the disconnect are attached.

The following is an excerpt of the messages in the glusterfs/data.log.* logfiles:
---
[2019-01-07 07:40:44.625789] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to graph 8
[2019-01-07 07:40:44.629594] I [MSGID: 114021] [client.c:2369:notify] 6-el_data-client-0: current graph is no longer active, destroying rpc_client
[2019-01-07 07:40:44.629651] I [MSGID: 114021] [client.c:2369:notify] 6-el_data-client-1: current graph is no longer active, destroying rpc_client
[2019-01-07 07:40:44.629668] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 6-el_data-client-0: disconnected from el_data-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2019-01-07 07:40:44.629724] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 6-el_data-client-1: disconnected from el_data-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2019-01-07 07:40:44.629732] E [MSGID: 108006] [afr-common.c:5118:__afr_handle_child_down_event] 6-el_data-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2019-01-07 07:40:44.869481] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2019-01-07 07:40:44.916540] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
----

Version-Release number of selected component (if applicable):
3.12.13

How reproducible:
Turn the Read ahead translator "off" on the server side.
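
For reference, both translators are toggled server-side through volume options; a minimal sketch, assuming the el_data volume name seen in the logs above (performance.readdir-ahead and performance.read-ahead are the standard option names for these two translators):
---
# readdir-ahead, the translator named in the bug title:
gluster volume set el_data performance.readdir-ahead off

# read-ahead is a different translator with its own option:
gluster volume set el_data performance.read-ahead off
---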


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Nithya Balachandran 2019-01-08 13:37:01 UTC
(In reply to Amgad from comment #0)
> Created attachment 1519119 [details]
> Glusterfs client logs
> [...]
> How reproducible:
> Turn the Read ahead translator "off" on the server side.

Do you mean read-ahead or readdir-ahead? They are two different translators, and the memory leak was in readdir-ahead.
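
For reference, the current value of each option can be checked per volume; a minimal sketch, assuming the el_data volume from the logs:
---
gluster volume get el_data performance.readdir-ahead
gluster volume get el_data performance.read-ahead
---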

Do clients lose access to the volume and do you see errors on the mount point? The graph switch messages in the logs are expected.


@Amar, please assign the BZ to the appropriate person.

Comment 2 Amgad 2019-01-08 19:21:50 UTC
Thanks Nithya:

It is the "readdir-ahead" translator. Yes, the client loses access to the volume, and we saw the graph switch messages.
For instance, in the attached logs, you can see:
-------
[2019-01-04 04:55:51.453074] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to graph 4
[2019-01-04 04:55:51.454990] I [MSGID: 114021] [client.c:2369:notify] 2-el_data-client-0: current graph is no longer active, destroying rpc_client
[2019-01-04 04:55:51.455028] I [MSGID: 114021] [client.c:2369:notify] 2-el_data-client-1: current graph is no longer active, destroying rpc_client
[2019-01-04 04:55:51.455037] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 2-el_data-client-0: disconnected from el_data-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2019-01-04 04:55:51.455080] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 2-el_data-client-1: disconnected from el_data-client-1. Client process will keep trying to connect to glusterd until brick's port is available
--------

Regards,
Amgad

Comment 3 Amar Tumballi 2019-01-09 04:47:36 UTC
> [2019-01-04 04:55:51.453074] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to graph 4

Notice the graph number is 4 here.

> [2019-01-04 04:55:51.454990] I [MSGID: 114021] [client.c:2369:notify] 2-el_data-client-0: current graph is no longer active, destroying rpc_client
> [2019-01-04 04:55:51.455028] I [MSGID: 114021] [client.c:2369:notify] 2-el_data-client-1: current graph is no longer active, destroying rpc_client
> [2019-01-04 04:55:51.455037] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 2-el_data-client-0: disconnected from el_data-client-0. Client process will keep trying to connect to glusterd until brick's port is available
> [2019-01-04 04:55:51.455080] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 2-el_data-client-1: disconnected from el_data-client-1. Client process will keep trying to connect to glusterd until brick's port is available

All the above logs are from graph 2 ("2-el_data-client-0" signifies graph 2, volume el_data, and client-0, the protocol/client translator), so this is expected.
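
Spelling that prefix out:
---
2        -> graph number
el_data  -> volume name
client-0 -> protocol/client translator instance (brick 0)
---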

I don't think this is the reason for the unmount. These logs can show up with any option change that triggers a graph switch (like toggling write-behind or read-ahead on/off). I am not sure these logs alone are enough to debug the unmount issue.
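
If it happens again, more verbose client logs would help; a minimal sketch, assuming the same el_data volume (DEBUG is verbose, so revert once the logs are captured):
---
gluster volume set el_data diagnostics.client-log-level DEBUG
# reproduce the toggle, collect the logs, then restore the default:
gluster volume set el_data diagnostics.client-log-level INFO
---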

Comment 4 Amgad 2019-01-09 14:59:44 UTC
Thanks Amar:

The full log set from the client that experienced the disconnect was attached to the ticket for your investigation. Please let us know which additional traces can be turned on when it happens again, to help with debugging on your side.

Regards,
Amgad

Comment 5 Amar Tumballi 2019-05-11 11:50:50 UTC
Amgad, we recommend that you upgrade to glusterfs-6.1 (or 6.2, which comes out in another week) so you can pick up some of the latest fixes.

Comment 6 Sheetal Pamecha 2019-09-26 06:59:13 UTC
Hi Amgad,

I tried to reproduce this issue on 3.12.13 but could not see any problem when toggling the readdir-ahead translator: none of the clients had a volume unmounted.
There was nothing suspicious in my setup, and the same goes for the client logs you provided.
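
For reference, the reproduction attempt boils down to toggling the option while watching the mount; a minimal sketch, assuming a fuse mount of el_data at the hypothetical mount point /mnt/el_data:
---
gluster volume set el_data performance.readdir-ahead off
ls /mnt/el_data              # the mount should still respond
gluster volume set el_data performance.readdir-ahead on
grep el_data /proc/mounts    # the fuse mount should still be present
---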

As Amar already mentioned, these graph switch messages are expected when toggling any translator.

Are you still using 3.12.13 and experiencing the disconnect?

Comment 7 Sheetal Pamecha 2019-10-01 06:35:41 UTC
Hi Amgad,
As 3.12 is already EOL, we recommend being on a release that is still maintained. If you are still facing this issue, do let us know; otherwise we would like to close it.

Comment 8 Sheetal Pamecha 2019-10-04 04:59:05 UTC
I tried to reproduce this with several cases and didn't run into a scenario where toggling readdir-ahead causes clients to unmount. Closing this bug as WORKSFORME.

Comment 9 Amgad 2019-11-11 14:53:40 UTC
Upgraded to release 5.5, then 6.5+, to avoid the issue.

Comment 10 Red Hat Bugzilla 2023-09-14 04:44:37 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

