Bug 1451598 - Brick Multiplexing: Deleting brick directories of the base volume must gracefully detach from glusterfsd without impacting other volumes' IO (currently seeing transport endpoint error)
Summary: Brick Multiplexing: Deleting brick directories of the base volume must gracefully detach from glusterfsd without impacting other volumes' IO (currently seeing transport endpoint error)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Mohit Agrawal
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard: brick-multiplexing
Depends On: 1453977
Blocks: 1417151 1444926 1458113
 
Reported: 2017-05-17 06:35 UTC by Nag Pavan Chilakam
Modified: 2017-09-21 04:43 UTC (History)
CC List: 3 users

Fixed In Version: glusterfs-3.8.4-27
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1453977 1458113 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:43:23 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2774 0 normal SHIPPED_LIVE glusterfs bug fix and enhancement update 2017-09-21 08:16:29 UTC

Description Nag Pavan Chilakam 2017-05-17 06:35:41 UTC
Description of problem:
================
When we delete a base volume (the volume that was created first, and whose volfile name and log file name the shared glusterfsd process runs under) and then remove that volume's brick directory, the deletion affects all volumes served by the same glusterfsd process: their mounts fail with a transport endpoint error.
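
For context, with brick multiplexing enabled all bricks on a node attach to a single glusterfsd process. A quick way to confirm that two volumes share one brick process (the volume names v1 and v2 here match the reproduction steps below, not necessarily the exact names used on this setup):

gluster volume status v1     # note the Pid column for each brick
gluster volume status v2     # v2's bricks report the same Pid as v1's bricks
ps -ef | grep glusterfsd     # only one brick process per node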



Version-Release number of selected component (if applicable):
========
3.8.4-25

How reproducible:
======
always

Steps to Reproduce (a command-level sketch of these steps follows the list):
1. Enable brick multiplexing on a cluster (a 3-node setup in my case) and have multiple LVs available for creating bricks.
2. Create volume v1 as 1x3, creating the bricks in the recommended way by using a directory under each LV mount point rather than the LV mount path directly.
3. Create volume v2, also 1x3, on different LVs.
4. v2's bricks must be running under the same glusterfsd PID as v1's, due to brick multiplexing.
5. Fuse mount v2 and keep performing IO on it.
6. Stop v1 and delete v1.
7. Delete the brick directory of the deleted base volume v1.
8. The IO on v2 (or any other mounted volume) stops and errors out, with a transport endpoint error logged in the logs.
9. Try to create a new volume v3 and mount it; that mount too fails with a transport endpoint error.
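
A minimal command-level sketch of the steps above; the node names (n1, n2, n3), brick paths under /rhs/brickN, and mount point are placeholders, not the exact ones used in this report:

# enable brick multiplexing cluster-wide
gluster volume set all cluster.brick-multiplex on

# v1: 1x3 replica, bricks are directories under the LV mount points
gluster volume create v1 replica 3 n1:/rhs/brick1/v1 n2:/rhs/brick1/v1 n3:/rhs/brick1/v1
gluster volume start v1

# v2: 1x3 replica on different LVs; its bricks attach to v1's glusterfsd
gluster volume create v2 replica 3 n1:/rhs/brick2/v2 n2:/rhs/brick2/v2 n3:/rhs/brick2/v2
gluster volume start v2
mount -t glusterfs n1:/v2 /mnt/v2        # fuse mount v2 and keep IO running here

# delete the base volume and then its brick directory
gluster volume stop v1
gluster volume delete v1
rm -rf /rhs/brick1/v1                    # on each node; this is the step that broke IO on v2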

Actual results:
================
When you delete the brick directory of the deleted base volume v1, all volumes that are mounted and were using the same glusterfsd process as v1 have their IO error out with a transport endpoint error.

Expected results:
===============
The directory should be gracefully detached from glusterfsd and no impact should be seen, since we are deleting a directory that no longer has anything to do with Gluster.
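
A quick check for the expected behaviour after removing v1's brick directory (names as in the steps above; the probe file is only illustrative):

gluster volume status v2                          # v2's bricks should still show Online = Y under the same Pid
dd if=/dev/zero of=/mnt/v2/probe bs=1M count=10   # IO on the surviving mount should keep succeeding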

Additional info:
Following is fuse mount log

[2017-05-16 13:30:58.069280] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-xyz-5-client-0: Connected to xyz-5-client-0, attached to remote volume '/rhs/brick5/xyz-5'.
[2017-05-16 13:30:58.069307] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-xyz-5-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-16 13:30:58.069408] I [MSGID: 114064] [client-handshake.c:148:client_notify_parents_child_up] 0-xyz-5-client-0: Defering sending CHILD_UP message as the client translators are not yet ready to serve.
[2017-05-16 13:30:58.069568] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-xyz-5-client-0: Server lk version = 1
[2017-05-16 13:30:58.070004] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-xyz-5-client-2: changing port to 49152 (from 0)
[2017-05-16 13:30:58.074365] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-xyz-5-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-16 13:30:58.074631] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-xyz-5-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-16 13:30:58.075266] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-xyz-5-client-1: Connected to xyz-5-client-1, attached to remote volume '/rhs/brick5/xyz-5'.
[2017-05-16 13:30:58.075288] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-xyz-5-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-16 13:30:58.075389] I [MSGID: 114064] [client-handshake.c:148:client_notify_parents_child_up] 0-xyz-5-client-1: Defering sending CHILD_UP message as the client translators are not yet ready to serve.
[2017-05-16 13:30:58.075572] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-xyz-5-client-1: Server lk version = 1
[2017-05-16 13:30:58.075609] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-xyz-5-client-2: Connected to xyz-5-client-2, attached to remote volume '/rhs/brick5/xyz-5'.
[2017-05-16 13:30:58.075620] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-xyz-5-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-16 13:30:58.075697] I [MSGID: 114064] [client-handshake.c:148:client_notify_parents_child_up] 0-xyz-5-client-2: Defering sending CHILD_UP message as the client translators are not yet ready to serve.
[2017-05-16 13:30:58.075877] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-xyz-5-client-2: Server lk version = 1
[2017-05-16 13:31:09.033495] I [fuse-bridge.c:5251:fuse_graph_setup] 0-fuse: switched to graph 0
[2017-05-16 13:31:09.036893] I [fuse-bridge.c:4153:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.22
[2017-05-16 13:31:09.037235] I [MSGID: 108006] [afr-common.c:4827:afr_local_init] 0-xyz-5-replicate-0: no subvolumes up
[2017-05-16 13:31:09.037616] W [fuse-bridge.c:767:fuse_attr_cbk] 0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)
[2017-05-16 13:31:09.041393] I [fuse-bridge.c:5092:fuse_thread_proc] 0-fuse: unmounting /mnt/xyz-5
The message "I [MSGID: 108006] [afr-common.c:4827:afr_local_init] 0-xyz-5-replicate-0: no subvolumes up" repeated 2 times between [2017-05-16 13:31:09.037235] and [2017-05-16 13:31:09.040337]
[2017-05-16 13:31:09.041943] W [glusterfsd.c:1291:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7fe5678cadc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7fe568f60f45] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7fe568f60d6b] ) 0-: received signum (15), shutting down
[2017-05-16 13:31:09.041973] I [fuse-bridge.c:5803:fini] 0-fuse: Unmounting '/mnt/xyz-5'.

Comment 2 Nag Pavan Chilakam 2017-05-17 06:37:35 UTC
This bug will block the testing of  1444926 - Brick Multiplexing: creating a volume with same base name and base brick after it was deleted brings down all the bricks associated with the same brick process

Comment 5 Atin Mukherjee 2017-05-22 12:06:35 UTC
upstream patch : https://review.gluster.org/17356

Comment 6 Atin Mukherjee 2017-06-05 04:50:16 UTC
downstream patch :https://code.engineering.redhat.com/gerrit/#/c/108021/

Comment 8 Nag Pavan Chilakam 2017-06-07 13:47:49 UTC
Validation
3.8.4-27

I don't see the transport endpoint error any more, and IO continues smoothly.


However, I still see the posix warnings below when I delete the brick directory of the deleted volume:


Broadcast message from systemd-journald.eng.blr.redhat.com (Wed 2017-06-07 19:12:25 IST):

rhs-brick30-test3_30[23121]: [2017-06-07 13:42:25.016770] M [MSGID: 113075] [posix-helpers.c:1905:posix_health_check_thread_proc] 0-test3_30-posix: health-check failed, going down


Message from syslogd@localhost at Jun  7 19:12:25 ...
 rhs-brick30-test3_30[23121]:[2017-06-07 13:42:25.016770] M [MSGID: 113075] [posix-helpers.c:1905:posix_health_check_thread_proc] 0-test3_30-posix: health-check failed, going down


Moving to VERIFIED, since the posix errors are tracked separately under bz#1451602 - Brick Multiplexing: Even clean deletion of the brick directories of the base volume results in posix health check errors (just as we see with ungraceful delete methods).
I have moved bz#1451602 to failed_qa.

Comment 10 errata-xmlrpc 2017-09-21 04:43:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

