Reproducer:
1) Created a 3-node cluster (host10/11/12) with bmux (brick multiplexing) enabled
2) Created 500 volumes, each either arbiter or x3
3) Mounted all 500 volumes on 3 clients (rhsqa6/7/8)
4) Started linux untar on 4 volumes in parallel on each of the 3 clients (basically, 4 screen sessions on each client, with linux untar run sequentially on a set of 25 volumes in each session)
5) Rebooted node rhsqa11
6) Post reboot, checked all clients for volume mount sanity

client_host->all vols mounted
client_host-> arb_b68n0p9myx2id failed to mount

[root@client_host glusterfs]# grep -r "failed: Authentication failed"
mnt-arb_b68n0p9myx2id.log:[2020-01-21 12:32:06.281545] E [MSGID: 114044] [client-handshake.c:1031:client_setvolume_cbk] 0-arb_b68n0p9myx2id-client-1: SETVOLUME on remote-host failed: Authentication failed [Permission denied]

RCA:
Below are the client logs from the time client_setvolume_cbk failed. The client gets the error only for one brick (client-1), not for the other client xlators, which means the others were already connected. The client receives an AUTH_FAILED event, and when fuse gets AUTH_FAILED it calls fini, so the volume is unmounted on the client.

>>>>>>>>>>>>>>>
[2020-01-21 11:43:01.806402] I [fuse-bridge.c:5840:fuse_graph_sync] 0-fuse: switched to graph 0
[2020-01-21 12:26:48.626004] I [MSGID: 114018] [client.c:2331:client_rpc_notify] 0-arb_b68n0p9myx2id-client-1: disconnected from arb_b68n0p9myx2id-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2020-01-21 12:31:56.095015] E [MSGID: 114058] [client-handshake.c:1449:client_query_portmap_cbk] 0-arb_b68n0p9myx2id-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2020-01-21 12:31:56.095094] I [MSGID: 114018] [client.c:2331:client_rpc_notify] 0-arb_b68n0p9myx2id-client-1: disconnected from arb_b68n0p9myx2id-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2020-01-21 12:32:06.071586] I [rpc-clnt.c:2035:rpc_clnt_reconfig] 0-arb_b68n0p9myx2id-client-1: changing port to 49152 (from 0)
[2020-01-21 12:32:06.281470] W [MSGID: 114043] [client-handshake.c:997:client_setvolume_cbk] 0-arb_b68n0p9myx2id-client-1: failed to set the volume [Permission denied]
[2020-01-21 12:32:06.281528] W [MSGID: 114007] [client-handshake.c:1026:client_setvolume_cbk] 0-arb_b68n0p9myx2id-client-1: failed to get 'process-uuid' from reply dict [Invalid argument]
[2020-01-21 12:32:06.281545] E [MSGID: 114044] [client-handshake.c:1031:client_setvolume_cbk] 0-arb_b68n0p9myx2id-client-1: SETVOLUME on remote-host failed: Authentication failed [Permission denied]
[2020-01-21 12:32:06.281558] I [MSGID: 114049] [client-handshake.c:1115:client_setvolume_cbk] 0-arb_b68n0p9myx2id-client-1: sending AUTH_FAILED event
[2020-01-21 12:32:06.281596] E [fuse-bridge.c:6358:notify] 0-fuse: Server authenication failed. Shutting down.
[2020-01-21 12:32:06.281609] I [fuse-bridge.c:6900:fini] 0-fuse: Unmounting '/mnt/arb_b68n0p9myx2id'.
[2020-01-21 12:32:06.309745] I [fuse-bridge.c:6106:fuse_thread_proc] 0-fuse: initating unmount of /mnt/arb_b68n0p9myx2id
[2020-01-21 12:32:06.309916] I [fuse-bridge.c:6905:fini] 0-fuse: Closing fuse connection to '/mnt/arb_b68n0p9myx2id'.
[2020-01-21 12:32:06.311119] W [glusterfsd.c:1581:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7ea5) [0x7f3a88e0cea5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x55a28f6002b5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x55a28f60011b] ) 0-: received signum (15), shutting down
>>>>>>>>>>>>>>>

The client got Permission denied because the brick was not attached at that moment. server_setvolume executes the code below before authenticating a client request. If get_xlator_by_name returns NULL, xl is set to this. In other words, if no xlator matching the requested name (volname) is found in the graph, the brick process assumes the client should connect to the already running brick; gf_authenticate then fails and returns EPERM.

    LOCK(&ctx->volfile_lock);
    {
        xl = get_xlator_by_name(this, name);
        if (!xl)
            xl = this;
    }
    UNLOCK(&ctx->volfile_lock);

We need to correct this condition to avoid the issue. This code was changed by this patch (https://review.gluster.org/#/c/glusterfs/+/18048/).

Thanks,
Mohit Agrawal
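To illustrate the RCA above, here is a minimal, self-contained sketch in C of why the fallback turns a "brick not attached yet" situation into a fatal authentication failure. It is not the GlusterFS code and not the content of the patch: struct brick, find_brick, setvolume_with_fallback, setvolume_strict and client_should_unmount are hypothetical names, and the use of ENOENT for the "not attached yet" case is an assumption chosen only to show a non-EPERM, retryable error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a brick xlator inside a multiplexed brick process. */
struct brick {
    const char *name;
    int attached; /* non-zero once the brick has been attached to the graph */
};

/* Hypothetical lookup, loosely modeled on get_xlator_by_name():
 * returns the attached brick with the given name, or NULL if none. */
static const struct brick *
find_brick(const struct brick *bricks, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (bricks[i].attached && strcmp(bricks[i].name, name) == 0)
            return &bricks[i];
    }
    return NULL;
}

/* Behaviour described in the RCA: on a miss the server falls back to the
 * top-level xlator, authentication runs against the wrong target and the
 * client sees EPERM. */
static int
setvolume_with_fallback(const struct brick *bricks, size_t n, const char *name)
{
    const struct brick *xl = find_brick(bricks, n, name);
    if (!xl)
        return EPERM; /* fallback target cannot authenticate this client */
    return 0;
}

/* Assumed alternative: report a miss as "not available yet" (ENOENT here is
 * an illustrative choice) instead of letting authentication fail. */
static int
setvolume_strict(const struct brick *bricks, size_t n, const char *name)
{
    const struct brick *xl = find_brick(bricks, n, name);
    if (!xl)
        return ENOENT;
    return 0;
}

/* Simplified client policy: only an authentication failure (EPERM) is fatal
 * and triggers the unmount; anything else leaves the client retrying. */
static int
client_should_unmount(int op_errno)
{
    return op_errno == EPERM;
}

/* Small usage example: brick-1 has not been attached yet after the reboot. */
static void
sketch_demo(void)
{
    struct brick demo_bricks[] = {{"brick-0", 1}, {"brick-1", 0}};

    assert(client_should_unmount(setvolume_with_fallback(demo_bricks, 2, "brick-1")));
    assert(!client_should_unmount(setvolume_strict(demo_bricks, 2, "brick-1")));
}
```

In this sketch the only difference between the two server-side variants is the error returned while the brick is still missing from the graph. With the fallback the client treats the reply as an authentication failure and unmounts; with a distinct error it can keep retrying until the brick is attached.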
A patch has been posted to resolve this issue: https://review.gluster.org/24053
REVIEW: https://review.gluster.org/24053 (server: Mount fails after reboot 1/3 gluster nodes) merged (#6) on master by MOHIT AGRAWAL