Bug 1447389 - Brick Multiplexing: seeing Input/Output Error for .trashcan
Summary: Brick Multiplexing: seeing Input/Output Error for .trashcan
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Jiffin
QA Contact:
URL:
Whiteboard: brick-multiplexing
Depends On:
Blocks: 1443941 1450728 1450729
 
Reported: 2017-05-02 15:19 UTC by Jiffin
Modified: 2018-03-24 07:20 UTC
CC List: 10 users

Fixed In Version: glusterfs-3.12.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1450728 1450729
Environment:
Last Closed: 2017-09-05 17:28:29 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1447390 0 unspecified CLOSED Brick Multiplexing :- .trashcan not able to heal after replace brick 2021-02-22 00:41:40 UTC

Internal Links: 1447390

Description Jiffin 2017-05-02 15:19:54 UTC
+++ This bug was initially created as a clone of Bug #1443941 +++

Description of problem:
=====================
Observation: I had enabled brick multiplexing, and I am seeing EIO for the .trashcan folder on the mount of an EC volume, as below:
# ls -la
total 12
drwxr-xr-x.  4 root root 4096 Apr 20 15:22 .
drwxr-xr-x. 15 root root 4096 Apr 20 15:17 ..
drwxr-xr-x.  2 root root 4096 Apr 20 15:22 dir1
# ls -lA
ls: cannot access .trashcan: Input/output error
total 4
drwxr-xr-x. 2 root root 4096 Apr 20 15:22 dir1
d?????????? ? ?    ?       ?            ? .trashcan
 



Steps
=====
Step1:
I had a 6-node setup on which I created the volumes below.
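
For context, the trusted pool would have been formed roughly as below (run from 10.70.35.138; only five of the six node addresses appear in the logs of this report, so the sixth probe is not shown):

# gluster peer probe 10.70.35.130
# gluster peer probe 10.70.35.122
# gluster peer probe 10.70.35.23
# gluster peer probe 10.70.35.112
# gluster peer status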

Step2:
enable brick multiplexing
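
In CLI terms this is a cluster-wide option; a sketch of the command (the option name matches what shows up under Options Reconfigured in the volume info below):

# gluster volume set all cluster.brick-multiplex on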

Step3:
Created the volumes below:
ecv82 --> an EC volume, 2 x (8+2), spanning nodes n1..n5
distrep3 --> a distributed-replicate (replica 3) volume, 3 x 3, spanning nodes n1..n3
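
The exact create commands were not captured; reconstructed from the brick layout in the gluster v info output below, they would have looked roughly like this (brace expansion is done by bash; 'force' is likely needed for ecv82 because each disperse set ends up with two bricks per host):

# gluster volume create distrep3 replica 3 \
      10.70.35.{138,130,122}:/rhs/brick11/distrep3 \
      10.70.35.{138,130,122}:/rhs/brick12/distrep3 \
      10.70.35.{138,130,122}:/rhs/brick13/distrep3
# gluster volume create ecv82 disperse 10 redundancy 2 \
      10.70.35.{138,130,122,23,112}:/rhs/brick1/ecv82 \
      10.70.35.{138,130,122,23,112}:/rhs/brick2/ecv82 \
      10.70.35.{138,130,122,23,112}:/rhs/brick3/ecv82 \
      10.70.35.{138,130,122,23,112}:/rhs/brick4/ecv82 force
# gluster volume start distrep3
# gluster volume start ecv82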

Now, as expected, the brick PIDs for all bricks hosted on a given node are the same because brick multiplexing is enabled (see the logs below).
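
One way to verify this (a sketch; run on any of the storage nodes):

# gluster volume status distrep3
# gluster volume status ecv82
# ps -ef | grep glusterfsd | grep -v grep

With multiplexing on, the status output should report the same PID for every brick hosted on a given node, and ps should show a single glusterfsd process per node.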

Step4:
I then changed the brick log level to DEBUG for distrep3 (which should use the same brick log as ecv82).
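
The option used for this is the one shown under Options Reconfigured for distrep3 below:

# gluster volume set distrep3 diagnostics.brick-log-level DEBUG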

Step5:
I then mounted distrep3 on a FUSE client.
=======> NOTE: I am not seeing the .trashcan folder on this mount; I don't know why.
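
For reference, the mount would have looked something like this (the server address and mount point are illustrative; any node of the pool can serve the volfile):

# mkdir -p /mnt/distrep3
# mount -t glusterfs 10.70.35.138:/distrep3 /mnt/distrep3
# ls -la /mnt/distrep3

The ecv82 mount in Step 9 is analogous.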

Step6:
Did some IOs
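
The exact I/O pattern was not recorded; something along these lines would fill the bricks (file names, count and size are illustrative):

# for i in $(seq 1 100); do dd if=/dev/zero of=/mnt/distrep3/file$i bs=1M count=100; done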

Step7:
Set min-free disk limit to 50% for distrep3
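
In CLI terms (the value matches cluster.min-free-disk: 50 in the volume info below):

# gluster volume set distrep3 cluster.min-free-disk 50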

Step8:
Did I/O to see whether I would get a warning for breaching min-free-disk, and got the below in the client FUSE log:

[2017-04-20 09:45:58.717617] W [MSGID: 109033] [dht-diskusage.c:263:dht_is_subvol_filled] 0-distrep3-dht: disk space on subvolume 'distrep3-replicate-1' is getting full (55.00 %), consider adding more bricks
[2017-04-20 09:46:50.749409] W [MSGID: 109033] [dht-diskusage.c:263:dht_is_subvol_filled] 0-distrep3-dht: disk space on subvolume 'distrep3-replicate-2' is getting full (54.00 %), consider adding more bricks


Step9:
Now mounted ecv82 on a fuse client

Step10:
did an ls -lA and got the EIO


[root@dhcp35-103 ecv82]# ls -lA
ls: cannot access .trashcan: Input/output error
total 4
drwxr-xr-x. 2 root root 4096 Apr 20 15:22 dir1
d?????????? ? ?    ?       ?            ? .trashcan
[root@dhcp35-103 ecv82]# 






#############logs ##############

Task Status of Volume ecv82
------------------------------------------------------------------------------
There are no active volume tasks
 # gluster v info
 
Volume Name: distrep3
Type: Distributed-Replicate
Volume ID: 28a6c08e-b7a0-4135-88fa-4b9ae250d609
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.70.35.138:/rhs/brick11/distrep3
Brick2: 10.70.35.130:/rhs/brick11/distrep3
Brick3: 10.70.35.122:/rhs/brick11/distrep3
Brick4: 10.70.35.138:/rhs/brick12/distrep3
Brick5: 10.70.35.130:/rhs/brick12/distrep3
Brick6: 10.70.35.122:/rhs/brick12/distrep3
Brick7: 10.70.35.138:/rhs/brick13/distrep3
Brick8: 10.70.35.130:/rhs/brick13/distrep3
Brick9: 10.70.35.122:/rhs/brick13/distrep3
Options Reconfigured:
cluster.min-free-disk: 50
cluster.quorum-count: 1
diagnostics.brick-log-level: DEBUG
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: enable
 
Volume Name: ecv82
Type: Distributed-Disperse
Volume ID: c2a84a0f-a95f-4264-984b-2e0879da7f99
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (8 + 2) = 20
Transport-type: tcp
Bricks:
Brick1: 10.70.35.138:/rhs/brick1/ecv82
Brick2: 10.70.35.130:/rhs/brick1/ecv82
Brick3: 10.70.35.122:/rhs/brick1/ecv82
Brick4: 10.70.35.23:/rhs/brick1/ecv82
Brick5: 10.70.35.112:/rhs/brick1/ecv82
Brick6: 10.70.35.138:/rhs/brick2/ecv82
Brick7: 10.70.35.130:/rhs/brick2/ecv82
Brick8: 10.70.35.122:/rhs/brick2/ecv82
Brick9: 10.70.35.23:/rhs/brick2/ecv82
Brick10: 10.70.35.112:/rhs/brick2/ecv82
Brick11: 10.70.35.138:/rhs/brick3/ecv82
Brick12: 10.70.35.130:/rhs/brick3/ecv82
Brick13: 10.70.35.122:/rhs/brick3/ecv82
Brick14: 10.70.35.23:/rhs/brick3/ecv82
Brick15: 10.70.35.112:/rhs/brick3/ecv82
Brick16: 10.70.35.138:/rhs/brick4/ecv82
Brick17: 10.70.35.130:/rhs/brick4/ecv82
Brick18: 10.70.35.122:/rhs/brick4/ecv82
Brick19: 10.70.35.23:/rhs/brick4/ecv82
Brick20: 10.70.35.112:/rhs/brick4/ecv82
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: enable
[root@dhcp35-45 ~]# 




Rationale of the testing:
I wanted to check the behavior when brick multiplexing is in effect and we try to change some brick settings.




--- Additional comment from nchilaka on 2017-04-20 06:28:57 EDT ---

fuse mount log:
[2017-04-20 09:48:19.851779] W [fuse-resolve.c:61:fuse_resolve_entry_cbk] 0-fuse: 00000000-0000-0000-0000-000000000001/.trashcan: failed to resolve (Input/output error)
[2017-04-20 09:48:19.854563] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-ecv82-dht: Found anomalies in /.trashcan (gfid = 00000000-0000-0000-0000-000000000005). Holes=1 overlaps=0
[2017-04-20 09:48:19.855996] W [MSGID: 109065] [dht-selfheal.c:1410:dht_selfheal_dir_mkdir_lock_cbk] 0-ecv82-dht: acquiring inodelk failed for /.trashcan [Input/output error]
[2017-04-20 09:48:19.856077] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 23: LOOKUP() /.trashcan => -1 (Input/output error)
[2017-04-20 09:52:35.145993] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-ecv82-dht: Found anomalies in /.trashcan (gfid = 00000000-0000-0000-0000-000000000005). Holes=1 overlaps=0
[2017-04-20 09:52:35.147543] W [MSGID: 109065] [dht-selfheal.c:1410:dht_selfheal_dir_mkdir_lock_cbk] 0-ecv82-dht: acquiring inodelk failed for /.trashcan [Input/output error]
[2017-04-20 09:52:35.147583] W [fuse-resolve.c:61:fuse_resolve_entry_cbk] 0-fuse: 00000000-0000-0000-0000-000000000001/.trashcan: failed to resolve (Input/output error)
[2017-04-20 09:52:35.152434] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 866: LOOKUP() /.trashcan => -1 (Input/output error)
[2017-04-20 09:52:35.150938] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-ecv82-dht: Found anomalies in /.trashcan (gfid = 00000000-0000-0000-0000-000000000005). Holes=1 overlaps=0
[2017-04-20 09:52:35.152401] W [MSGID: 109065] [dht-selfheal.c:1410:dht_selfheal_dir_mkdir_lock_cbk] 0-ecv82-dht: acquiring inodelk failed for /.trashcan [Input/output error]

--- Additional comment from Jiffin on 2017-04-27 06:17:27 EDT ---

While trying out this bug, I found the following: when the volume is started with brick multiplexing enabled, ".trashcan" was created only on three bricks out of the 20 (2 x (8+2)), not on all the subvolumes. I have a gut feeling that this might be the cause of this bug and bz1443939.

P.S.: I don't have enough knowledge about brick multiplexing to comment on why it is happening, and the test was performed on my workstation.
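
A quick way to check which bricks actually have the directory (a sketch; the node addresses and brick paths are taken from the ecv82 layout above, and the loop assumes passwordless ssh from the node it is run on):

# for h in 10.70.35.138 10.70.35.130 10.70.35.122 10.70.35.23 10.70.35.112; do for b in 1 2 3 4; do ssh $h "ls -ld /rhs/brick$b/ecv82/.trashcan"; done; done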

Also, if possible, I request QA to retest the above scenario using the following steps (a command-level sketch follows the list):
1.) create the volume
2.) start the volume
3.) enable brick-multiplexing
4.) restart the volume (stop and start)
5.) then retest the case
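
Roughly, in CLI terms (the volume name, hosts and brick paths are placeholders):

# gluster volume create testvol replica 3 n1:/bricks/b1/testvol n2:/bricks/b1/testvol n3:/bricks/b1/testvol
# gluster volume start testvol
# gluster volume set all cluster.brick-multiplex on
# gluster volume stop testvol
# gluster volume start testvol

Then mount the volume and check whether .trashcan resolves (e.g. ls -lA on the mount).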

Comment 1 Worker Ant 2017-05-09 15:47:40 UTC
REVIEW: https://review.gluster.org/17225 (glusterfsd: send PARENT_UP on brick attach) posted (#1) for review on master by Atin Mukherjee (amukherj)

Comment 2 Worker Ant 2017-05-11 05:11:47 UTC
REVIEW: https://review.gluster.org/17225 (glusterfsd: send PARENT_UP on brick attach) posted (#2) for review on master by Atin Mukherjee (amukherj)

Comment 3 Worker Ant 2017-05-11 11:55:23 UTC
REVIEW: https://review.gluster.org/17225 (glusterfsd: send PARENT_UP on brick attach) posted (#3) for review on master by Atin Mukherjee (amukherj)

Comment 4 Worker Ant 2017-05-11 18:10:52 UTC
REVIEW: https://review.gluster.org/17225 (glusterfsd: send PARENT_UP on brick attach) posted (#4) for review on master by Atin Mukherjee (amukherj)

Comment 5 Worker Ant 2017-05-12 08:36:01 UTC
REVIEW: https://review.gluster.org/17225 (glusterfsd: send PARENT_UP on brick attach) posted (#5) for review on master by Atin Mukherjee (amukherj)

Comment 6 Worker Ant 2017-05-12 15:56:39 UTC
REVIEW: https://review.gluster.org/17225 (glusterfsd: send PARENT_UP on brick attach) posted (#6) for review on master by Atin Mukherjee (amukherj)

Comment 7 Worker Ant 2017-05-14 13:29:26 UTC
REVIEW: https://review.gluster.org/17225 (glusterfsd: send PARENT_UP on brick attach) posted (#7) for review on master by Atin Mukherjee (amukherj)

Comment 8 Worker Ant 2017-05-14 21:10:34 UTC
COMMIT: https://review.gluster.org/17225 committed in master by Jeff Darcy (jeff.us) 
------
commit 86ad032949cb80b6ba3df9dc8268243529d4eb84
Author: Atin Mukherjee <amukherj>
Date:   Tue May 9 21:05:50 2017 +0530

    glusterfsd: send PARENT_UP on brick attach
    
    With brick multiplexing being enabled, if a brick is instance attached to a
    process then a PARENT_UP event is needed so that it reaches right till
    posix layer and then from posix CHILD_UP event is sent back to all the
    children.
    
    Change-Id: Ic341086adb3bbbde0342af518e1b273dd2f669b9
    BUG: 1447389
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: https://review.gluster.org/17225
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jeff.us>

Comment 9 Atin Mukherjee 2017-05-19 05:04:00 UTC
https://review.gluster.org/17225 patch is broken, moving it back to ASSIGNED.

Comment 10 Shyamsundar 2017-09-05 17:28:29 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/

