Bug 1793852 - Mounts fails after reboot of 1/3 gluster nodes
Summary: Mounts fails after reboot of 1/3 gluster nodes
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Mohit Agrawal
QA Contact:
URL:
Whiteboard:
Depends On: 1793035
Blocks: 1788913 1794019 1794020 1804512
 
Reported: 2020-01-22 05:00 UTC by Mohit Agrawal
Modified: 2020-03-02 07:53 UTC

Fixed In Version:
Clone Of: 1793035
Clones: 1794019 1794020 1804512
Environment:
Last Closed: 2020-01-22 14:05:28 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Gluster.org Gerrit 24053 0 None Merged server: Mount fails after reboot 1/3 gluster nodes 2020-01-22 14:05:25 UTC

Comment 1 Mohit Agrawal 2020-01-22 05:04:17 UTC
Reproducer:
1) Created a 3-node cluster (host10/11/12) with brick multiplexing (bmux) enabled.
2) Created 500 volumes, each either arbiter or x3.
3) Mounted all 500 volumes on 3 clients (rhsqa6/7/8).
4) Started Linux untar on 4 volumes in parallel on each of the 3 clients (that is, 4 screen sessions per client, each running Linux untar sequentially over a set of 25 volumes).
5) Rebooted node rhsqa11.
6) After the reboot, checked volume mount sanity on all clients:
client_host -> all vols mounted
client_host -> arb_b68n0p9myx2id failed to mount
[root@client_host glusterfs]# grep -r "failed: Authentication failed"
mnt-arb_b68n0p9myx2id.log:[2020-01-21 12:32:06.281545] E [MSGID: 114044] [client-handshake.c:1031:client_setvolume_cbk] 0-arb_b68n0p9myx2id-client-1: SETVOLUME on remote-host failed: Authentication failed [Permission denied]

RCA:
Below are the client logs from the time client_setvolume_cbk failed.
The client gets the error only for one brick (client-1), not for the other client xlators,
which means the others are already connected. The log also shows the client sending an AUTH_FAILED event;
when fuse receives AUTH_FAILED it calls fini, so the volume is unmounted.

>>>>>>>>>>>>>>>.

[2020-01-21 11:43:01.806402] I [fuse-bridge.c:5840:fuse_graph_sync] 0-fuse: switched to graph 0
[2020-01-21 12:26:48.626004] I [MSGID: 114018] [client.c:2331:client_rpc_notify] 0-arb_b68n0p9myx2id-client-1: disconnected from arb_b68n0p9myx2id-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2020-01-21 12:31:56.095015] E [MSGID: 114058] [client-handshake.c:1449:client_query_portmap_cbk] 0-arb_b68n0p9myx2id-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2020-01-21 12:31:56.095094] I [MSGID: 114018] [client.c:2331:client_rpc_notify] 0-arb_b68n0p9myx2id-client-1: disconnected from arb_b68n0p9myx2id-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2020-01-21 12:32:06.071586] I [rpc-clnt.c:2035:rpc_clnt_reconfig] 0-arb_b68n0p9myx2id-client-1: changing port to 49152 (from 0)
[2020-01-21 12:32:06.281470] W [MSGID: 114043] [client-handshake.c:997:client_setvolume_cbk] 0-arb_b68n0p9myx2id-client-1: failed to set the volume [Permission denied]
[2020-01-21 12:32:06.281528] W [MSGID: 114007] [client-handshake.c:1026:client_setvolume_cbk] 0-arb_b68n0p9myx2id-client-1: failed to get 'process-uuid' from reply dict [Invalid argument]
[2020-01-21 12:32:06.281545] E [MSGID: 114044] [client-handshake.c:1031:client_setvolume_cbk] 0-arb_b68n0p9myx2id-client-1: SETVOLUME on remote-host failed: Authentication failed [Permission denied]
[2020-01-21 12:32:06.281558] I [MSGID: 114049] [client-handshake.c:1115:client_setvolume_cbk] 0-arb_b68n0p9myx2id-client-1: sending AUTH_FAILED event
[2020-01-21 12:32:06.281596] E [fuse-bridge.c:6358:notify] 0-fuse: Server authenication failed. Shutting down.
[2020-01-21 12:32:06.281609] I [fuse-bridge.c:6900:fini] 0-fuse: Unmounting '/mnt/arb_b68n0p9myx2id'.
[2020-01-21 12:32:06.309745] I [fuse-bridge.c:6106:fuse_thread_proc] 0-fuse: initating unmount of /mnt/arb_b68n0p9myx2id
[2020-01-21 12:32:06.309916] I [fuse-bridge.c:6905:fini] 0-fuse: Closing fuse connection to '/mnt/arb_b68n0p9myx2id'.
[2020-01-21 12:32:06.311119] W [glusterfsd.c:1581:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7ea5) [0x7f3a88e0cea5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x55a28f6002b5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x55a28f60011b] ) 0-: received signum (15), shutting down

>>>>>>>>>>>>>>>>>>>>

The client got Permission denied because the brick was not attached at that moment. server_setvolume executes the code below before authenticating a client request.
If get_xlator_by_name returns NULL, xl is set to this. In other words, when no xlator (volname) is found in the graph, the brick process assumes the client should connect to the already running brick; gf_authenticate then fails and returns EPERM.

    LOCK(&ctx->volfile_lock);
    {
        xl = get_xlator_by_name(this, name);
        if (!xl)
            xl = this;
    }
    UNLOCK(&ctx->volfile_lock);


We need to correct this condition to avoid the issue. This code was changed by this patch (https://review.gluster.org/#/c/glusterfs/+/18048/).
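
To make the condition described above concrete, here is a minimal standalone sketch. It is not the merged patch and does not use GlusterFS headers. The toy_xlator_t type, toy_get_xlator_by_name, resolve_with_fallback and resolve_strict are hypothetical names made up for illustration, and returning -ENOENT for a brick that is not attached yet is an assumption about one possible correction, not a statement of what review 24053 does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a translator node; not the real xlator_t. */
typedef struct toy_xlator {
    const char *name;
    struct toy_xlator *next;
} toy_xlator_t;

/* Hypothetical lookup: walk the list of attached bricks by name. */
toy_xlator_t *
toy_get_xlator_by_name(toy_xlator_t *top, const char *name)
{
    assert(name != NULL);
    for (toy_xlator_t *x = top; x != NULL; x = x->next) {
        if (strcmp(x->name, name) == 0)
            return x;
    }
    return NULL;
}

/* Condition from the RCA: an unknown brick silently falls back to the
 * top xlator, so authentication later runs against the wrong brick. */
toy_xlator_t *
resolve_with_fallback(toy_xlator_t *top, const char *name)
{
    toy_xlator_t *xl = toy_get_xlator_by_name(top, name);
    if (!xl)
        xl = top;
    return xl;
}

/* Assumed correction: report "not attached yet" as its own error so the
 * caller can tell it apart from a real authentication failure (EPERM). */
int
resolve_strict(toy_xlator_t *top, const char *name, toy_xlator_t **out)
{
    toy_xlator_t *xl = toy_get_xlator_by_name(top, name);
    if (!xl) {
        *out = NULL;
        return -ENOENT;
    }
    *out = xl;
    return 0;
}
```

With the fallback variant, a request for a brick that has not been attached yet resolves to an unrelated xlator and the failure surfaces only later as an authentication error. The strict variant fails at lookup time with a distinct error code, which the caller could treat as a retryable condition.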


Thanks,
Mohit Agrawal

Comment 2 Mohit Agrawal 2020-01-22 05:04:55 UTC
A patch has been posted to resolve this issue:
https://review.gluster.org/24053

Comment 3 Worker Ant 2020-01-22 14:05:28 UTC
REVIEW: https://review.gluster.org/24053 (server: Mount fails after reboot 1/3 gluster nodes) merged (#6) on master by MOHIT AGRAWAL
