Description of problem:
------------------------
On a 6x3 volume, some bricks were brought down while rebalance was in progress. This caused the mount to become read-only (client quorum was enabled). While rebalance was still in progress, the bricks were brought back up, which triggered self-heal on the volume. While self-heal was running, an attempt was made to remove a directory from the mount point. After a while, one of the bricks was found to have crashed.

The following is from the logs of the brick that crashed:

[2015-04-22 12:31:23.720702] E [posix.c:4433:posix_removexattr] 0-vol-posix: null gfid for path /linux-3.19.4/include/net/netprio_cgroup.h
frame : type(0) op(0)
frame : type(0) op(20)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(40)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2015-04-22 12:31:23
configuration details:
[2015-04-22 12:31:23.731428] E [posix.c:4433:posix_removexattr] 0-vol-posix: null gfid for path /linux-3.19.4/include/net/raw.h
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7dev
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x3ad30221c6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x3ad303de2f]
/lib64/libc.so.6[0x3ad14326a0]
/lib64/libpthread.so.0(pthread_spin_lock+0x0)[0x3ad180c380]
/usr/lib64/glusterfs/3.7dev/xlator/features/quota.so(+0x589b)[0x7f484d94689b]
/usr/lib64/glusterfs/3.7dev/xlator/features/quota.so(quota_fill_inodectx+0x1fa)[0x7f484d94f6aa]
/usr/lib64/glusterfs/3.7dev/xlator/features/quota.so(quota_readdirp_cbk+0x13e)[0x7f484d94fa9e]
/usr/lib64/glusterfs/3.7dev/xlator/features/marker.so(marker_readdirp_cbk+0x13e)[0x7f484db71bbe]
/usr/lib64/libglusterfs.so.0(default_readdirp_cbk+0xc2)[0x3ad302e622]
/usr/lib64/glusterfs/3.7dev/xlator/features/locks.so(pl_readdirp_cbk+0x18b)[0x7f484e5b6cfb]
/usr/lib64/glusterfs/3.7dev/xlator/features/access-control.so(posix_acl_readdirp_cbk+0x27a)[0x7f484e7d0b7a]
/usr/lib64/glusterfs/3.7dev/xlator/features/bitrot-stub.so(br_stub_readdirp_cbk+0x214)[0x7f484e9db304]
/usr/lib64/glusterfs/3.7dev/xlator/storage/posix.so(posix_do_readdir+0x1b8)[0x7f484f871498]
/usr/lib64/glusterfs/3.7dev/xlator/storage/posix.so(posix_readdirp+0x1ee)[0x7f484f872fde]
/usr/lib64/libglusterfs.so.0(default_readdirp+0x83)[0x3ad3027333]
/usr/lib64/libglusterfs.so.0(default_readdirp+0x83)[0x3ad3027333]
/usr/lib64/libglusterfs.so.0(default_readdirp+0x83)[0x3ad3027333]
/usr/lib64/glusterfs/3.7dev/xlator/features/bitrot-stub.so(br_stub_readdirp+0x259)[0x7f484e9d8e29]
/usr/lib64/glusterfs/3.7dev/xlator/features/access-control.so(posix_acl_readdirp+0x19d)[0x7f484e7cd4bd]
/usr/lib64/glusterfs/3.7dev/xlator/features/locks.so(pl_readdirp+0x204)[0x7f484e5b5d94]
/usr/lib64/libglusterfs.so.0(default_readdirp+0x83)[0x3ad3027333]
/usr/lib64/libglusterfs.so.0(default_readdirp_resume+0x142)[0x3ad3029db2]
/usr/lib64/libglusterfs.so.0(call_resume+0x80)[0x3ad3046470]
/usr/lib64/glusterfs/3.7dev/xlator/performance/io-threads.so(iot_worker+0x158)[0x7f484e1a1388]
/lib64/libpthread.so.0[0x3ad18079d1]
/lib64/libc.so.6(clone+0x6d)[0x3ad14e88fd]
---------

The attempt to remove a directory from the mount point failed:

# rm -fr linux-3.19.4
rm: cannot remove `linux-3.19.4/include/crypto': Directory not empty
rm: cannot remove `linux-3.19.4/include/drm': Directory not empty
rm: cannot remove `linux-3.19.4/include/media': Directory not empty
rm: cannot remove `linux-3.19.4/include/net/netfilter': Directory not empty
rm: cannot remove `linux-3.19.4/include/net/bluetooth': Directory not empty

I also see a lot of the following messages in the brick logs:

<snip>
[2015-04-22 12:31:14.461836] E [posix.c:4433:posix_removexattr] 0-vol-posix: null gfid for path /linux-3.19.4/include/net/netns
[2015-04-22 12:31:18.132176] E [posix.c:4433:posix_removexattr] 0-vol-posix: null gfid for path /linux-3.19.4/include/net/caif/caif_layer.h
[2015-04-22 12:31:23.675448] E [posix.c:4433:posix_removexattr] 0-vol-posix: null gfid for path /linux-3.19.4/include/net/ip6_fib.h
[2015-04-22 12:31:23.691089] E [posix.c:4433:posix_removexattr] 0-vol-posix: null gfid for path /linux-3.19.4/include/net/ip_fib.h
[2015-04-22 12:31:23.699589] E [posix.c:4433:posix_removexattr] 0-vol-posix: null gfid for path /linux-3.19.4/include/net/lib80211.h
[2015-04-22 12:31:23.711344] E [posix.c:4433:posix_removexattr] 0-vol-posix: null gfid for path /linux-3.19.4/include/net/neighbour.h
</snip>

See volume info below:

# gluster volume info vol

Volume Name: vol
Type: Distributed-Replicate
Volume ID: 133fe4f3-987c-474d-9904-c28475d4812f
Status: Started
Number of Bricks: 6 x 3 = 18
Transport-type: tcp
Bricks:
Brick1: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick2: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick3: vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick4: vm6-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick5: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick2/b1
Brick6: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick2/b1
Brick7: vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick2/b1
Brick8: vm6-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick2/b1
Brick9: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick3/b1
Brick10: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick3/b1
Brick11: vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick3/b1
Brick12: vm6-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick3/b1
Brick13: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick14: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick15: vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick16: vm6-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick17: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick5/b1
Brick18: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick5/b1
Options Reconfigured:
cluster.quorum-type: auto
client.event-threads: 4
server.event-threads: 5
features.uss: enable
features.quota: on
cluster.consistent-metadata: on

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
On the server: glusterfs-3.7dev-0.965.git2788ddd.el6.x86_64
On the client: glusterfs-3.7dev-0.1009.git8b987be.el6.x86_64

How reproducible:
------------------
Saw it once.

Steps to Reproduce:
--------------------
1. On a 6x3 volume, started a remove-brick operation on one replica set.
2. After data migration for the remove-brick operation completed, executed remove-brick stop.
3. Started a rebalance operation on the volume.
4. While rebalance was in progress, killed two bricks in each of three replica sets.
5. After a while, with rebalance still running, started the volume using force.
6. While monitoring volume heal info output, noticed that one of the bricks was not connected (see BZ#1214169 for details).
7. While self-heal was in progress, tried to remove a directory from the mount point: # rm -fr linux-3.19.4
8. After a while, a brick was found to have crashed (the brick showed up as disconnected in heal info output).

Actual results:
----------------
Brick process crashed.
Expected results:
------------------
Brick process is not expected to crash.

Additional info:
REVIEW: http://review.gluster.org/10416 (quota: Validate NULL inode from the entries received in readdirp_cbk) posted (#1) for review on master by Vijaikumar Mallikarjuna (vmallika)
COMMIT: http://review.gluster.org/10416 committed in master by Raghavendra G (rgowdapp)
------
commit a152fa7ad96053093b88a010bff20e48aa5e5b70
Author: vmallika <vmallika>
Date:   Tue Apr 28 12:52:56 2015 +0530

    quota: Validate NULL inode from the entries received in readdirp_cbk

    In quota readdirp_cbk, the inode ctx is filled for all the entries received.
    In marker readdirp_cbk, files/directories are inspected for dirty.
    There is no guarantee that entry->inode is populated. If entry->inode is
    NULL, the entry needs to be treated as readdir.

    Change-Id: Id2d17bb89e4770845ce1f13d73abc2b3c5826c06
    BUG: 1215550
    Signed-off-by: vmallika <vmallika>
    Reviewed-on: http://review.gluster.org/10416
    Tested-by: NetBSD Build System
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Raghavendra G <rgowdapp>
    Tested-by: Raghavendra G <rgowdapp>
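For context, the guard the commit message describes can be illustrated with a minimal sketch. This is a simplified, hypothetical rendition, not the actual quota translator code: the toy_inode/toy_dirent types and the fill_inodectx_sketch/readdirp_cbk_sketch names are stand-ins invented for illustration. The only point it demonstrates is that a readdirp callback must not dereference entry->inode (here, to take a per-inode lock, as the pthread_spin_lock frame in the backtrace suggests) when the inode was never populated; such entries are skipped, i.e. treated like plain readdir entries.

/*
 * Minimal, self-contained sketch of the NULL-inode guard described in the
 * commit message.  Types and names are simplified stand-ins, NOT the real
 * GlusterFS structures or APIs.  Build with: gcc -pthread sketch.c
 */
#include <pthread.h>
#include <stdio.h>

/* Toy stand-in for an inode whose quota context is protected by a spinlock. */
struct toy_inode {
    pthread_spinlock_t lock;
    long long          size;   /* pretend quota accounting state */
};

/* Toy stand-in for one entry returned by readdirp. */
struct toy_dirent {
    const char       *name;
    struct toy_inode *inode;   /* may be NULL if the entry was not linked */
};

/* Stand-in for filling per-inode context: updates state under the lock. */
static void fill_inodectx_sketch(struct toy_inode *inode, long long size)
{
    pthread_spin_lock(&inode->lock);   /* would crash if inode were NULL */
    inode->size = size;
    pthread_spin_unlock(&inode->lock);
}

/* Stand-in for a readdirp callback: walk the entries and fill contexts. */
static void readdirp_cbk_sketch(struct toy_dirent *entries, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (entries[i].inode == NULL) {
            /* The fix: no inode was populated for this entry, so treat it
             * like a plain readdir entry and skip the ctx update instead
             * of dereferencing a NULL pointer. */
            printf("skipping %s: no inode\n", entries[i].name);
            continue;
        }
        fill_inodectx_sketch(entries[i].inode, 4096);
    }
}

int main(void)
{
    struct toy_inode linked;
    pthread_spin_init(&linked.lock, PTHREAD_PROCESS_PRIVATE);
    linked.size = 0;

    struct toy_dirent entries[] = {
        { "netprio_cgroup.h", &linked },
        { "raw.h",            NULL    },  /* entry without a populated inode */
    };

    readdirp_cbk_sketch(entries, sizeof(entries) / sizeof(entries[0]));
    pthread_spin_destroy(&linked.lock);
    return 0;
}

Without the NULL check, the second entry in this sketch would hit the same class of failure seen in the backtrace: a spinlock taken through a pointer derived from a NULL inode.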
This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user