Description of problem:
=======================
When we delete a base volume (the volume that was created first, and whose volfile name and log name the glusterfsd process is running under) and then remove that base volume's brick directory, the deletion affects all volumes sharing the same glusterfsd process: their mounts fail with "Transport endpoint is not connected" errors.

Version-Release number of selected component (if applicable):
=============================================================
3.8.4-25

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Enable brick multiplexing on a cluster setup (a 3-node setup in this case) and have multiple LVs available for creating bricks.
2. Create v1 as a 1x3 volume, creating the bricks in the recommended way by using a directory under the LV mount rather than the LV mount path directly.
3. Create v2, also 1x3, using different LVs.
4. v2 should be using the same glusterfsd PID as v1 due to brick multiplexing.
5. Fuse-mount v2 and keep performing IO.
6. Stop v1 and delete v1.
7. Delete the brick directory of the deleted base volume v1.
8. Observe that IO on v2 (or any other mounted volume) stops and errors out, with "Transport endpoint is not connected" logged in the logs.
9. Try to create a new volume v3 and mount it; the mount too will fail with "Transport endpoint is not connected".

Actual results:
===============
When you delete the brick directory of the deleted base volume v1, all mounted volumes that were using the same glusterfsd as v1 have their IO error out with "Transport endpoint is not connected".

Expected results:
=================
The directory should be detached from glusterfsd and no impact should be seen, since we are deleting a directory that has nothing to do with Gluster anymore.

Additional info:
================
The following is the fuse mount log:

[2017-05-16 13:30:58.069280] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-xyz-5-client-0: Connected to xyz-5-client-0, attached to remote volume '/rhs/brick5/xyz-5'.
[2017-05-16 13:30:58.069307] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-xyz-5-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-16 13:30:58.069408] I [MSGID: 114064] [client-handshake.c:148:client_notify_parents_child_up] 0-xyz-5-client-0: Defering sending CHILD_UP message as the client translators are not yet ready to serve.
[2017-05-16 13:30:58.069568] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-xyz-5-client-0: Server lk version = 1
[2017-05-16 13:30:58.070004] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-xyz-5-client-2: changing port to 49152 (from 0)
[2017-05-16 13:30:58.074365] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-xyz-5-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-16 13:30:58.074631] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-xyz-5-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-16 13:30:58.075266] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-xyz-5-client-1: Connected to xyz-5-client-1, attached to remote volume '/rhs/brick5/xyz-5'.
[2017-05-16 13:30:58.075288] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-xyz-5-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-16 13:30:58.075389] I [MSGID: 114064] [client-handshake.c:148:client_notify_parents_child_up] 0-xyz-5-client-1: Defering sending CHILD_UP message as the client translators are not yet ready to serve.
[2017-05-16 13:30:58.075572] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-xyz-5-client-1: Server lk version = 1
[2017-05-16 13:30:58.075609] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-xyz-5-client-2: Connected to xyz-5-client-2, attached to remote volume '/rhs/brick5/xyz-5'.
[2017-05-16 13:30:58.075620] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-xyz-5-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-16 13:30:58.075697] I [MSGID: 114064] [client-handshake.c:148:client_notify_parents_child_up] 0-xyz-5-client-2: Defering sending CHILD_UP message as the client translators are not yet ready to serve.
[2017-05-16 13:30:58.075877] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-xyz-5-client-2: Server lk version = 1
[2017-05-16 13:31:09.033495] I [fuse-bridge.c:5251:fuse_graph_setup] 0-fuse: switched to graph 0
[2017-05-16 13:31:09.036893] I [fuse-bridge.c:4153:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.22
[2017-05-16 13:31:09.037235] I [MSGID: 108006] [afr-common.c:4827:afr_local_init] 0-xyz-5-replicate-0: no subvolumes up
[2017-05-16 13:31:09.037616] W [fuse-bridge.c:767:fuse_attr_cbk] 0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)
[2017-05-16 13:31:09.041393] I [fuse-bridge.c:5092:fuse_thread_proc] 0-fuse: unmounting /mnt/xyz-5
The message "I [MSGID: 108006] [afr-common.c:4827:afr_local_init] 0-xyz-5-replicate-0: no subvolumes up" repeated 2 times between [2017-05-16 13:31:09.037235] and [2017-05-16 13:31:09.040337]
[2017-05-16 13:31:09.041943] W [glusterfsd.c:1291:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7fe5678cadc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7fe568f60f45] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7fe568f60d6b] ) 0-: received signum (15), shutting down
[2017-05-16 13:31:09.041973] I [fuse-bridge.c:5803:fini] 0-fuse: Unmounting '/mnt/xyz-5'.
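The reproduction steps above can be sketched as the following gluster CLI sequence. The node names (n1..n3), LV mount paths (/rhs/brick1, /rhs/brick2), and mount point are illustrative assumptions, not the actual test setup; and because these commands need a live trusted storage pool, the sketch only echoes each command so the flow can be reviewed offline.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps; run() echoes instead of
# executing, since a real 3-node Gluster cluster is required.
run() { echo "+ $*"; }

# Step 1: enable brick multiplexing cluster-wide.
run gluster volume set all cluster.brick-multiplexing on

# Step 2: v1 (1x3), bricks created as a directory under each LV mount
# rather than the LV mount point itself (the recommended layout).
run gluster volume create v1 replica 3 \
    n1:/rhs/brick1/v1 n2:/rhs/brick1/v1 n3:/rhs/brick1/v1
run gluster volume start v1

# Steps 3-4: v2 (1x3) on different LVs; with brick mux on, its bricks
# attach to the glusterfsd process already running for v1.
run gluster volume create v2 replica 3 \
    n1:/rhs/brick2/v2 n2:/rhs/brick2/v2 n3:/rhs/brick2/v2
run gluster volume start v2

# Step 5: fuse-mount v2 and keep IO going (IO loop omitted).
run mount -t glusterfs n1:/v2 /mnt/v2

# Steps 6-7: stop and delete the base volume v1, then remove its
# brick directory on each node; this is what errors out IO on v2.
run gluster volume stop v1
run gluster volume delete v1
run rm -rf /rhs/brick1/v1
```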
This bug will block the testing of bug 1444926 - Brick Multiplexing: creating a volume with same base name and base brick after it was deleted brings down all the bricks associated with the same brick process
upstream patch: https://review.gluster.org/17356
downstream patch: https://code.engineering.redhat.com/gerrit/#/c/108021/
Validation on 3.8.4-27
======================
I don't see the transport endpoint error anymore, and IO continues smoothly. However, I still see the posix warnings below when I delete the brick directory of the deleted volume:

Broadcast message from systemd-journald.eng.blr.redhat.com (Wed 2017-06-07 19:12:25 IST):

rhs-brick30-test3_30[23121]: [2017-06-07 13:42:25.016770] M [MSGID: 113075] [posix-helpers.c:1905:posix_health_check_thread_proc] 0-test3_30-posix: health-check failed, going down

Message from syslogd@localhost at Jun  7 19:12:25 ...
rhs-brick30-test3_30[23121]: [2017-06-07 13:42:25.016770] M [MSGID: 113075] [posix-helpers.c:1905:posix_health_check_thread_proc] 0-test3_30-posix: health-check failed, going down

Moving to VERIFIED nonetheless, as the posix errors are tracked separately in bz#1451602 - Brick Multiplexing: Even clean deleting of the brick directories of base volume is resulting in posix health check errors (just as we see in ungraceful delete methods).

I moved bz#1451602 to FAILED_QA.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774