Bug 1488354
| Summary: | gluster-blockd process crashed and core generated | |||
|---|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Pranith Kumar K <pkarampu> | |
| Component: | sharding | Assignee: | Pranith Kumar K <pkarampu> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | bugs <bugs> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | mainline | CC: | amukherj, bugs, kdhananj, knarra, kramdoss, rhs-bugs, storage-qa-internal | |
| Target Milestone: | --- | |||
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-3.13.0 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | 1488152 | |||
| : | 1488387 (view as bug list) | Environment: | ||
| Last Closed: | 2017-12-08 17:39:41 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1488152, 1488381 | |||
| Bug Blocks: | 1488387, 1488391 | |||
REVIEW: https://review.gluster.org/18203 (features/shard: Increment counts in locks) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

COMMIT: https://review.gluster.org/18203 committed in master by Pranith Kumar Karampuri (pkarampu)

------

commit e50fc8f4e7eb51386f47bea9e6ca8d8490c09003
Author: Pranith Kumar K <pkarampu>
Date:   Tue Sep 5 13:30:53 2017 +0530

    features/shard: Increment counts in locks

    Problem:
    Because create_count/eexist_count are incremented without locks,
    all the shards may not be created, as call_count will be less than
    what it needs to be. This can lead to a crash in
    shard_common_inode_write_do() because the inode on which we want to
    do fd_anonymous() is NULL.

    Fix:
    Increment the counts inside frame->lock.

    Change-Id: Ibc87dcb1021e9f4ac2929f662da07aa7662ab0d6
    BUG: 1488354
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: https://review.gluster.org/18203
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Krutika Dhananjay <kdhananj>
    CentOS-regression: Gluster Build System <jenkins.org>

This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.13.0, please open a new bug report.

glusterfs-3.13.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/
(gdb) bt
#0  0x00007f549eaf3c30 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007f549e207f15 in fd_anonymous () from /lib64/libglusterfs.so.0
#2  0x00007f54869d1927 in shard_common_inode_write_do () from /usr/lib64/glusterfs/3.8.4/xlator/features/shard.so
#3  0x00007f54869d1c7d in shard_common_inode_write_post_mknod_handler () from /usr/lib64/glusterfs/3.8.4/xlator/features/shard.so
#4  0x00007f54869ca77f in shard_common_mknod_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/shard.so
#5  0x00007f5486c1164b in dht_newfile_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so
#6  0x00007f5486e71ab1 in afr_mknod_unwind () from /usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so
#7  0x00007f5486e73eeb in __afr_dir_write_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so
#8  0x00007f5486e7482d in afr_mknod_wind_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so
#9  0x00007f54870f6168 in client3_3_mknod_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/protocol/client.so
#10 0x00007f549dfac840 in rpc_clnt_handle_reply () from /lib64/libgfrpc.so.0
#11 0x00007f549dfacb27 in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#12 0x00007f549dfa89e3 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#13 0x00007f5490be63d6 in socket_event_poll_in () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#14 0x00007f5490be897c in socket_event_handler () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#15 0x00007f549e23e1e6 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#16 0x00007f549eaf1e25 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f549d8b034d in clone () from /lib64/libc.so.6

Based on the core file, the only way this can happen is if not all of the shards were created.
(gdb) fr 2
#2  0x00007f54869d1927 in shard_common_inode_write_do (frame=0x7f548c0dbbe0, this=0x7f54800120d0) at shard.c:3883
3883            anon_fd = fd_anonymous (local->inode_list[i]);
(gdb) p i
$1 = 255
(gdb) p local->inode_list[i]
$2 = (inode_t *) 0x0
(gdb) p lical->inode_list[i-1]
No symbol "lical" in current context.
(gdb) p local->inode_list[i-1]
$3 = (inode_t *) 0x7f5474765440
(gdb) p local->offset
$4 = 0
(gdb) p local->num_blocks
$5 = 256

Based on this data, I went through the code and found two races:
1) In shard_common_mknod_cbk(), local->eexist_count is incremented without frame->lock.
2) In shard_common_lookup_shards_cbk(), local->create_count is incremented without frame->lock.

This can leave the counts lower than they need to be, so mknod is done on only 255 shards instead of 256.