Bug 1488387 - gluster-blockd process crashed and core generated
Summary: gluster-blockd process crashed and core generated
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: sharding
Version: 3.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact: bugs@gluster.org
URL:
Whiteboard:
Depends On: 1488152 1488354 1488381
Blocks: 1488391 glusterfs-3.12.1
 
Reported: 2017-09-05 09:14 UTC by Pranith Kumar K
Modified: 2017-09-14 07:42 UTC
CC List: 7 users

Fixed In Version: glusterfs-3.12.1
Clone Of: 1488354
Clones: 1488391
Environment:
Last Closed: 2017-09-14 07:42:56 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Comment 1 Pranith Kumar K 2017-09-05 09:15:25 UTC
(gdb) bt
#0  0x00007f549eaf3c30 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007f549e207f15 in fd_anonymous () from /lib64/libglusterfs.so.0
#2  0x00007f54869d1927 in shard_common_inode_write_do ()
   from /usr/lib64/glusterfs/3.8.4/xlator/features/shard.so
#3  0x00007f54869d1c7d in shard_common_inode_write_post_mknod_handler ()
   from /usr/lib64/glusterfs/3.8.4/xlator/features/shard.so
#4  0x00007f54869ca77f in shard_common_mknod_cbk ()
   from /usr/lib64/glusterfs/3.8.4/xlator/features/shard.so
#5  0x00007f5486c1164b in dht_newfile_cbk ()
   from /usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so
#6  0x00007f5486e71ab1 in afr_mknod_unwind ()
   from /usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so
#7  0x00007f5486e73eeb in __afr_dir_write_cbk ()
   from /usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so
#8  0x00007f5486e7482d in afr_mknod_wind_cbk ()
   from /usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so
#9  0x00007f54870f6168 in client3_3_mknod_cbk ()
   from /usr/lib64/glusterfs/3.8.4/xlator/protocol/client.so
#10 0x00007f549dfac840 in rpc_clnt_handle_reply () from /lib64/libgfrpc.so.0
#11 0x00007f549dfacb27 in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#12 0x00007f549dfa89e3 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#13 0x00007f5490be63d6 in socket_event_poll_in ()
   from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#14 0x00007f5490be897c in socket_event_handler ()
   from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#15 0x00007f549e23e1e6 in event_dispatch_epoll_worker ()
   from /lib64/libglusterfs.so.0
#16 0x00007f549eaf1e25 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f549d8b034d in clone () from /lib64/libc.so.6


Based on the core file, the only way this can happen is if not all of the shards were created.
(gdb) fr 2
#2  0x00007f54869d1927 in shard_common_inode_write_do (frame=0x7f548c0dbbe0, 
    this=0x7f54800120d0) at shard.c:3883
3883	                        anon_fd = fd_anonymous (local->inode_list[i]);
(gdb) p i
$1 = 255
(gdb) p local->inode_list[i]
$2 = (inode_t *) 0x0
(gdb) p local->inode_list[i-1]
$3 = (inode_t *) 0x7f5474765440
(gdb) p local->offset
$4 = 0
(gdb) p local->num_blocks
$5 = 256
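
The frame above shows shard_common_inode_write_do() passing a NULL inode to fd_anonymous(). A minimal sketch of why that faults inside pthread_mutex_lock() (frame #0 of the backtrace): the simplified inode_t and fd_anonymous_sketch() below are illustrative stand-ins, not the GlusterFS implementation.

#include <pthread.h>

typedef struct inode {
    pthread_mutex_t lock;             /* stand-in for the real inode's gf_lock_t */
} inode_t;

/* Like the real fd_anonymous(), this locks the inode before anything else,
 * so a NULL inode faults inside pthread_mutex_lock(). */
static void fd_anonymous_sketch(inode_t *inode)
{
    pthread_mutex_lock(&inode->lock); /* SIGSEGV when inode == NULL */
    /* ... an anonymous fd would be set up here ... */
    pthread_mutex_unlock(&inode->lock);
}

int main(void)
{
    static inode_t inodes[255];
    inode_t *inode_list[256] = { NULL };  /* plays the role of local->inode_list */

    for (int i = 0; i < 255; i++) {
        pthread_mutex_init(&inodes[i].lock, NULL);
        inode_list[i] = &inodes[i];       /* shards 0..254 got their inodes */
    }
    /* inode_list[255] stays NULL: the shard whose mknod never ran. */

    for (int i = 0; i < 256; i++)         /* mirrors the num_blocks loop */
        fd_anonymous_sketch(inode_list[i]);   /* crashes at i == 255 */
    return 0;
}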

Based on this data, I went through the code and found two races:
1) In shard_common_mknod_cbk(),
local->eexist_count is incremented without taking frame->lock.
2) In shard_common_lookup_shards_cbk(),
local->create_count is incremented without taking frame->lock.

Either race can leave the counts lower than they need to be, so mknod is done on only 255 shards instead of all 256.
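
A hedged sketch of the race and the fix, with plain pthreads standing in for the LOCK(&frame->lock)/UNLOCK(&frame->lock) the patch uses; shard_cbk_sketch() and its counters are made up for illustration, not the shard translator's code:

#include <pthread.h>
#include <stdio.h>

static int create_count;                  /* analogue of local->create_count */
static pthread_mutex_t frame_lock = PTHREAD_MUTEX_INITIALIZER;
static int use_lock;                      /* 0 = pre-fix race, 1 = fixed */

/* Each thread plays one callback racing to bump the shared count. */
static void *shard_cbk_sketch(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        if (use_lock) {
            pthread_mutex_lock(&frame_lock);   /* the fix: LOCK(&frame->lock) */
            create_count++;
            pthread_mutex_unlock(&frame_lock);
        } else {
            create_count++;   /* unlocked read-modify-write: updates get lost */
        }
    }
    return NULL;
}

static int run(int locked)
{
    pthread_t t[4];
    use_lock = locked;
    create_count = 0;
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, shard_cbk_sketch, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return create_count;
}

int main(void)
{
    /* The racy run typically falls short of 400000, just as the racy counts
     * left one shard (index 255) without an mknod; the locked run does not. */
    printf("racy:   %d (expected 400000)\n", run(0));
    printf("locked: %d (expected 400000)\n", run(1));
    return 0;
}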

Comment 2 Worker Ant 2017-09-05 09:16:51 UTC
REVIEW: https://review.gluster.org/18204 (features/shard: Increment counts in locks) posted (#1) for review on release-3.12 by Pranith Kumar Karampuri (pkarampu)

Comment 3 Worker Ant 2017-09-06 07:48:15 UTC
COMMIT: https://review.gluster.org/18204 committed in release-3.12 by Jiffin Tony Thottan (jthottan)
------
commit f5170d49e44d0327020335de0b0fc2999a455aad
Author: Pranith Kumar K <pkarampu>
Date:   Tue Sep 5 13:30:53 2017 +0530

    features/shard: Increment counts in locks
    
           Backport of https://review.gluster.org/18203
    
    Problem:
    Because create_count/eexist_count are incremented without locks, all the shards may not
    be created because call_count will be less than what it needs to be. This can lead
    to a crash in shard_common_inode_write_do() because the inode on which we want to do
    fd_anonymous() is NULL
    
    Fix:
    Increment the counts in frame->lock
    
     >Change-Id: Ibc87dcb1021e9f4ac2929f662da07aa7662ab0d6
     >BUG: 1488354
     >Signed-off-by: Pranith Kumar K <pkarampu>
    
    Change-Id: Ibc87dcb1021e9f4ac2929f662da07aa7662ab0d6
    BUG: 1488387
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: https://review.gluster.org/18204
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Krutika Dhananjay <kdhananj>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 4 Jiffin 2017-09-14 07:42:56 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.1, please open a new bug report.

glusterfs-3.12.1 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-September/032441.html
[2] https://www.gluster.org/pipermail/gluster-users/

