Bug 1451598 - Brick Multiplexing: Deleting brick directories of the base volume must gracefully detach from glusterfsd without impacting other volumes' IO (currently seeing transport endpoint error)
Summary: Brick Multiplexing: Deleting brick directories of the base volume must gracefully detach from glusterfsd without impacting other volumes' IO (currently seeing transport endpoint error)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Mohit Agrawal
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard: brick-multiplexing
Depends On: 1453977
Blocks: 1417151 1444926 1458113
 
Reported: 2017-05-17 06:35 UTC by Nag Pavan Chilakam
Modified: 2017-09-21 04:43 UTC (History)
CC List: 3 users

Fixed In Version: glusterfs-3.8.4-27
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1453977 1458113 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:43:23 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2774 0 normal SHIPPED_LIVE glusterfs bug fix and enhancement update 2017-09-21 08:16:29 UTC

Description Nag Pavan Chilakam 2017-05-17 06:35:41 UTC
Description of problem:
================
When we delete a base volume (the volume that was created first, and whose volfile name and log file name the shared glusterfsd process runs under) and then remove that volume's brick directory, the deletion affects all volumes served by the same glusterfsd process: their mounts fail with a transport endpoint error.
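
For context, with brick multiplexing enabled all bricks on a node attach to a single glusterfsd process. A quick way to confirm that two volumes share one brick process (the volume names v1 and v2 here match the reproduction steps below, not necessarily the exact names used on this setup):

gluster volume status v1     # note the Pid column for each brick
gluster volume status v2     # v2's bricks report the same Pid as v1's bricks
ps -ef | grep glusterfsd     # only one brick process per node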



Version-Release number of selected component (if applicable):
========
3.8.4-25

How reproducible:
======
always

Steps to Reproduce (a command-level sketch of these steps follows the list):
1. Enable brick multiplexing on a cluster (a 3-node setup in my case) and have multiple LVs available for creating bricks.
2. Create volume v1 as 1x3, creating the bricks in the recommended way by using a directory under each LV mount point rather than the LV mount path directly.
3. Create volume v2, also 1x3, on different LVs.
4. v2's bricks must be running under the same glusterfsd PID as v1's, due to brick multiplexing.
5. Fuse mount v2 and keep performing IO on it.
6. Stop v1 and delete v1.
7. Delete the brick directory of the deleted base volume v1.
8. The IO on v2 (or any other mounted volume) stops and errors out, with a transport endpoint error logged in the logs.
9. Try to create a new volume v3 and mount it; that mount too fails with a transport endpoint error.
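
A minimal command-level sketch of the steps above; the node names (n1, n2, n3), brick paths under /rhs/brickN, and mount point are placeholders, not the exact ones used in this report:

# enable brick multiplexing cluster-wide
gluster volume set all cluster.brick-multiplex on

# v1: 1x3 replica, bricks are directories under the LV mount points
gluster volume create v1 replica 3 n1:/rhs/brick1/v1 n2:/rhs/brick1/v1 n3:/rhs/brick1/v1
gluster volume start v1

# v2: 1x3 replica on different LVs; its bricks attach to v1's glusterfsd
gluster volume create v2 replica 3 n1:/rhs/brick2/v2 n2:/rhs/brick2/v2 n3:/rhs/brick2/v2
gluster volume start v2
mount -t glusterfs n1:/v2 /mnt/v2        # fuse mount v2 and keep IO running here

# delete the base volume and then its brick directory
gluster volume stop v1
gluster volume delete v1
rm -rf /rhs/brick1/v1                    # on each node; this is the step that broke IO on v2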

Actual results:
================
When you delete the brick directory of the deleted base volume v1, all volumes that are mounted and were using the same glusterfsd process as v1 have their IO error out with a transport endpoint error.

Expected results:
===============
The directory should be gracefully detached from glusterfsd and no impact should be seen, since we are deleting a directory that no longer has anything to do with Gluster.
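
A quick check for the expected behaviour after removing v1's brick directory (names as in the steps above; the probe file is only illustrative):

gluster volume status v2                          # v2's bricks should still show Online = Y under the same Pid
dd if=/dev/zero of=/mnt/v2/probe bs=1M count=10   # IO on the surviving mount should keep succeeding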

Additional info:
Following is fuse mount log

[2017-05-16 13:30:58.069280] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-xyz-5-client-0: Connected to xyz-5-client-0, attached to remote volume '/rhs/brick5/xyz-5'.
[2017-05-16 13:30:58.069307] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-xyz-5-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-16 13:30:58.069408] I [MSGID: 114064] [client-handshake.c:148:client_notify_parents_child_up] 0-xyz-5-client-0: Defering sending CHILD_UP message as the client translators are not yet ready to serve.
[2017-05-16 13:30:58.069568] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-xyz-5-client-0: Server lk version = 1
[2017-05-16 13:30:58.070004] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-xyz-5-client-2: changing port to 49152 (from 0)
[2017-05-16 13:30:58.074365] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-xyz-5-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-16 13:30:58.074631] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-xyz-5-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-16 13:30:58.075266] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-xyz-5-client-1: Connected to xyz-5-client-1, attached to remote volume '/rhs/brick5/xyz-5'.
[2017-05-16 13:30:58.075288] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-xyz-5-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-16 13:30:58.075389] I [MSGID: 114064] [client-handshake.c:148:client_notify_parents_child_up] 0-xyz-5-client-1: Defering sending CHILD_UP message as the client translators are not yet ready to serve.
[2017-05-16 13:30:58.075572] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-xyz-5-client-1: Server lk version = 1
[2017-05-16 13:30:58.075609] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-xyz-5-client-2: Connected to xyz-5-client-2, attached to remote volume '/rhs/brick5/xyz-5'.
[2017-05-16 13:30:58.075620] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-xyz-5-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-16 13:30:58.075697] I [MSGID: 114064] [client-handshake.c:148:client_notify_parents_child_up] 0-xyz-5-client-2: Defering sending CHILD_UP message as the client translators are not yet ready to serve.
[2017-05-16 13:30:58.075877] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-xyz-5-client-2: Server lk version = 1
[2017-05-16 13:31:09.033495] I [fuse-bridge.c:5251:fuse_graph_setup] 0-fuse: switched to graph 0
[2017-05-16 13:31:09.036893] I [fuse-bridge.c:4153:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.22
[2017-05-16 13:31:09.037235] I [MSGID: 108006] [afr-common.c:4827:afr_local_init] 0-xyz-5-replicate-0: no subvolumes up
[2017-05-16 13:31:09.037616] W [fuse-bridge.c:767:fuse_attr_cbk] 0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)
[2017-05-16 13:31:09.041393] I [fuse-bridge.c:5092:fuse_thread_proc] 0-fuse: unmounting /mnt/xyz-5
The message "I [MSGID: 108006] [afr-common.c:4827:afr_local_init] 0-xyz-5-replicate-0: no subvolumes up" repeated 2 times between [2017-05-16 13:31:09.037235] and [2017-05-16 13:31:09.040337]
[2017-05-16 13:31:09.041943] W [glusterfsd.c:1291:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7fe5678cadc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7fe568f60f45] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7fe568f60d6b] ) 0-: received signum (15), shutting down
[2017-05-16 13:31:09.041973] I [fuse-bridge.c:5803:fini] 0-fuse: Unmounting '/mnt/xyz-5'.

Comment 2 Nag Pavan Chilakam 2017-05-17 06:37:35 UTC
This bug will block the testing of  1444926 - Brick Multiplexing: creating a volume with same base name and base brick after it was deleted brings down all the bricks associated with the same brick process

Comment 5 Atin Mukherjee 2017-05-22 12:06:35 UTC
upstream patch : https://review.gluster.org/17356

Comment 6 Atin Mukherjee 2017-06-05 04:50:16 UTC
downstream patch :https://code.engineering.redhat.com/gerrit/#/c/108021/

Comment 8 Nag Pavan Chilakam 2017-06-07 13:47:49 UTC
Validation
3.8.4-27

I don't see the transport endpoint error any more, and IO continues smoothly.


However, I still see the posix warnings below when I delete the brick directory of the deleted volume:


Broadcast message from systemd-journald.eng.blr.redhat.com (Wed 2017-06-07 19:12:25 IST):

rhs-brick30-test3_30[23121]: [2017-06-07 13:42:25.016770] M [MSGID: 113075] [posix-helpers.c:1905:posix_health_check_thread_proc] 0-test3_30-posix: health-check failed, going down


Message from syslogd@localhost at Jun  7 19:12:25 ...
 rhs-brick30-test3_30[23121]:[2017-06-07 13:42:25.016770] M [MSGID: 113075] [posix-helpers.c:1905:posix_health_check_thread_proc] 0-test3_30-posix: health-check failed, going down


Moving to VERIFIED, since the posix errors are tracked separately under bz#1451602 - Brick Multiplexing: Even clean deletion of the brick directories of the base volume results in posix health check errors (just as we see with ungraceful delete methods).
I have moved bz#1451602 to failed_qa.

Comment 10 errata-xmlrpc 2017-09-21 04:43:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

