Bug 1456331 - [Bitrot]: Brick process crash observed while trying to recover a bad file in disperse volume
Summary: [Bitrot]: Brick process crash observed while trying to recover a bad file in disperse volume
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: bitrot
Version: 3.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kotresh HR
QA Contact:
bugs@gluster.org
URL:
Whiteboard:
Depends On: 1454317
Blocks: 1451280
 
Reported: 2017-05-29 05:39 UTC by Kotresh HR
Modified: 2017-05-30 18:53 UTC (History)
5 users

Fixed In Version: glusterfs-3.11.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1454317
Environment:
Last Closed: 2017-05-30 18:53:41 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Kotresh HR 2017-05-29 05:39:04 UTC
+++ This bug was initially created as a clone of Bug #1454317 +++

+++ This bug was initially created as a clone of Bug #1451280 +++

Description of problem:
=======================
Had a 6-node cluster on the 3.8.4-23 build. Created a 1 x (4+2) EC volume and mounted it via FUSE. Created two files, 'test1' and 'test2', and corrupted both. The scrubber detected both files as corrupted. Updated the build to 3.8.4-25 and restarted glusterd. Followed the steps for recovering the files as described in the admin guide. 'test2' recovered successfully, but 'test1' failed with 'Input/output error' on the mount point. Volume status showed 2 brick processes down.


Version-Release number of selected component (if applicable):
===========================================================



How reproducible:
=================
1:1


Additional info:
================

[root@dhcp47-121 ~]# gluster peer status
Number of Peers: 5

Hostname: dhcp47-113.lab.eng.blr.redhat.com
Uuid: a0557927-4e5e-4ff7-8dce-94873f867707
State: Peer in Cluster (Connected)

Hostname: dhcp47-114.lab.eng.blr.redhat.com
Uuid: c0dac197-5a4d-4db7-b709-dbf8b8eb0896
State: Peer in Cluster (Connected)

Hostname: dhcp47-115.lab.eng.blr.redhat.com
Uuid: f828fdfa-e08f-4d12-85d8-2121cafcf9d0
State: Peer in Cluster (Connected)

Hostname: dhcp47-116.lab.eng.blr.redhat.com
Uuid: a96e0244-b5ce-4518-895c-8eb453c71ded
State: Peer in Cluster (Connected)

Hostname: dhcp47-117.lab.eng.blr.redhat.com
Uuid: 17eb3cef-17e7-4249-954b-fc19ec608304
State: Peer in Cluster (Connected)
[root@dhcp47-121 ~]# 
[root@dhcp47-121 ~]# 
[root@dhcp47-121 ~]# gluster v status disp2
Status of volume: disp2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.121:/bricks/brick8/disp2_0   49154     0          Y       5552 
Brick 10.70.47.113:/bricks/brick8/disp2_1   N/A       N/A        N       N/A  
Brick 10.70.47.114:/bricks/brick8/disp2_2   49154     0          Y       30916
Brick 10.70.47.115:/bricks/brick8/disp2_3   49154     0          Y       23469
Brick 10.70.47.116:/bricks/brick8/disp2_4   49153     0          Y       27754
Brick 10.70.47.117:/bricks/brick8/disp2_5   N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       5497 
Bitrot Daemon on localhost                  N/A       N/A        Y       5515 
Scrubber Daemon on localhost                N/A       N/A        Y       5525 
Self-heal Daemon on dhcp47-113.lab.eng.blr.redhat.com    N/A       N/A        Y       5893 
Bitrot Daemon on dhcp47-113.lab.eng.blr.redhat.com       N/A       N/A        Y       5911 
Scrubber Daemon on dhcp47-113.lab.eng.blr.redhat.com     N/A       N/A        Y       5921 
Self-heal Daemon on dhcp47-114.lab.eng.blr.redhat.com    N/A       N/A        Y       30858
Bitrot Daemon on dhcp47-114.lab.eng.blr.redhat.com       N/A       N/A        Y       30876
Scrubber Daemon on dhcp47-114.lab.eng.blr.redhat.com     N/A       N/A        Y       30886
Self-heal Daemon on dhcp47-116.lab.eng.blr.redhat.com    N/A       N/A        Y       27708
Bitrot Daemon on dhcp47-116.lab.eng.blr.redhat.com       N/A       N/A        Y       27726
Scrubber Daemon on dhcp47-116.lab.eng.blr.redhat.com     N/A       N/A        Y       27736
Self-heal Daemon on dhcp47-117.lab.eng.blr.redhat.com    N/A       N/A        Y       9684 
Bitrot Daemon on dhcp47-117.lab.eng.blr.redhat.com       N/A       N/A        Y       9702 
Scrubber Daemon on dhcp47-117.lab.eng.blr.redhat.com     N/A       N/A        Y       9712 
Self-heal Daemon on dhcp47-115.lab.eng.blr.redhat.com    N/A       N/A        Y       23411
Bitrot Daemon on dhcp47-115.lab.eng.blr.redhat.com       N/A       N/A        Y       23429
Scrubber Daemon on dhcp47-115.lab.eng.blr.redhat.com     N/A       N/A        Y       23439
 
Task Status of Volume disp2
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@dhcp47-121 ~]# 
[root@dhcp47-121 ~]# gluster v info disp2
 
Volume Name: disp2
Type: Disperse
Volume ID: d7b0d170-f0e0-4e26-9369-f0a52dc92d38
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.47.121:/bricks/brick8/disp2_0
Brick2: 10.70.47.113:/bricks/brick8/disp2_1
Brick3: 10.70.47.114:/bricks/brick8/disp2_2
Brick4: 10.70.47.115:/bricks/brick8/disp2_3
Brick5: 10.70.47.116:/bricks/brick8/disp2_4
Brick6: 10.70.47.117:/bricks/brick8/disp2_5
Options Reconfigured:
performance.stat-prefetch: off
nfs.disable: on
transport.address-family: inet
features.bitrot: on
features.scrub: Active
features.scrub-freq: hourly
cluster.brick-multiplex: disable
[root@dhcp47-121 ~]# 
[root@dhcp47-121 ~]# gluster v  bitrot disp2 scrub status

Volume name : disp2

State of scrub: Active (In Progress)

Scrub impact: lazy

Scrub frequency: hourly

Bitrot error log location: /var/log/glusterfs/bitd.log

Scrubber error log location: /var/log/glusterfs/scrub.log


=========================================================

Node: localhost

Number of Scrubbed files: 2

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 09:35:12

Duration of last scrub (D:M:H:M:S): 0:0:0:7

Error count: 0


=========================================================

Node: dhcp47-114.lab.eng.blr.redhat.com

Number of Scrubbed files: 1

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 09:35:12

Duration of last scrub (D:M:H:M:S): 0:0:0:7

Error count: 0


=========================================================

Node: dhcp47-116.lab.eng.blr.redhat.com

Number of Scrubbed files: 2

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 09:35:14

Duration of last scrub (D:M:H:M:S): 0:0:0:7

Error count: 0


=========================================================

Node: dhcp47-113.lab.eng.blr.redhat.com

Number of Scrubbed files: 0

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 08:35:24

Duration of last scrub (D:M:H:M:S): 0:0:0:7

Error count: 0


=========================================================

Node: dhcp47-115.lab.eng.blr.redhat.com

Number of Scrubbed files: 2

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 09:35:11

Duration of last scrub (D:M:H:M:S): 0:0:0:7

Error count: 0


=========================================================

Node: dhcp47-117.lab.eng.blr.redhat.com

Number of Scrubbed files: 0

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 08:35:23

Duration of last scrub (D:M:H:M:S): 0:0:0:7

Error count: 0

=========================================================

[root@dhcp47-121 ~]# gluster v heal disp2 info
Brick 10.70.47.121:/bricks/brick8/disp2_0
/d1/d2/d3/d4/test2 
Status: Connected
Number of entries: 1

Brick 10.70.47.113:/bricks/brick8/disp2_1
Status: Transport endpoint is not connected
Number of entries: -

Brick 10.70.47.114:/bricks/brick8/disp2_2
/d1/d2/d3/d4/test2 
Status: Connected
Number of entries: 1

Brick 10.70.47.115:/bricks/brick8/disp2_3
/d1/d2/d3/d4/test2 
Status: Connected
Number of entries: 1

Brick 10.70.47.116:/bricks/brick8/disp2_4
/d1/d2/d3/d4/test2 
Status: Connected
Number of entries: 1

Brick 10.70.47.117:/bricks/brick8/disp2_5
Status: Transport endpoint is not connected
Number of entries: -

[root@dhcp47-121 ~]# 


[2017-05-16 08:54:10.160132] E [MSGID: 115070] [server-rpc-fops.c:1474:server_open_cbk] 0-disp2-server: 4619: OPEN /d1/d2/d3/d4/test2 (3673eecb-e5b5-4014-9bc6-a2fc007f08cb) ==> (Input/output error) [Input/output error]
pending frames:
frame : type(0) op(29)
frame : type(0) op(11)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2017-05-16 08:55:01
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7f0e805201b2]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f0e80529bd4]
/lib64/libc.so.6(+0x35250)[0x7f0e7ec02250]
/usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so(+0xadf4)[0x7f0e7174cdf4]
/usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so(+0xde56)[0x7f0e7174fe56]
/usr/lib64/glusterfs/3.8.4/xlator/features/access-control.so(+0x5815)[0x7f0e71535815]
/usr/lib64/glusterfs/3.8.4/xlator/features/locks.so(+0x6dc8)[0x7f0e71312dc8]
/usr/lib64/glusterfs/3.8.4/xlator/features/worm.so(+0x7e59)[0x7f0e71106e59]
/usr/lib64/glusterfs/3.8.4/xlator/features/read-only.so(+0x4478)[0x7f0e70efb478]
/usr/lib64/glusterfs/3.8.4/xlator/features/leases.so(+0x50b4)[0x7f0e70ce70b4]
/usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so(+0xf143)[0x7f0e70ad7143]
/lib64/libglusterfs.so.0(default_open_resume+0x1c9)[0x7f0e805b1269]
/lib64/libglusterfs.so.0(call_resume+0x75)[0x7f0e80542b25]
/usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so(+0x4957)[0x7f0e708c1957]
/lib64/libpthread.so.0(+0x7dc5)[0x7f0e7f37fdc5]
/lib64/libc.so.6(clone+0x6d)[0x7f0e7ecc473d]

BT:

Program terminated with signal 11, Segmentation fault.
#0  list_add_tail (head=0x7f0e28001908, new=0x18) at ../../../../../libglusterfs/src/list.h:40
40		new->next = head;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libacl-2.2.51-12.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-12.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7_3.2.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 sqlite-3.7.17-8.el7.x86_64 sssd-client-1.14.0-43.el7_3.14.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  list_add_tail (head=0x7f0e28001908, new=0x18) at ../../../../../libglusterfs/src/list.h:40
#1  br_stub_add_fd_to_inode (this=this@entry=0x7f0e6c012440, fd=fd@entry=0x7f0e6c0a5050, ctx=ctx@entry=0x0) at bit-rot-stub.c:2398
#2  0x00007f0e7174fe56 in br_stub_open (frame=0x7f0e28000ca0, this=0x7f0e6c012440, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at bit-rot-stub.c:2352
#3  0x00007f0e71535815 in posix_acl_open (frame=0x7f0e280014b0, this=0x7f0e6c013d70, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at posix-acl.c:1129
#4  0x00007f0e71312dc8 in pl_open (frame=frame@entry=0x7f0e28000ac0, this=this@entry=0x7f0e6c015320, loc=loc@entry=0x7f0e6c0ccf90, flags=flags@entry=2, fd=fd@entry=0x7f0e6c0a5050, 
    xdata=xdata@entry=0x0) at posix.c:1698
#5  0x00007f0e71106e59 in worm_open (frame=0x7f0e28000ac0, this=<optimized out>, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at worm.c:43
#6  0x00007f0e70efb478 in ro_open (frame=0x7f0e28001740, this=0x7f0e6c018130, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at read-only-common.c:341
#7  0x00007f0e70ce70b4 in leases_open (frame=0x7f0e28001b50, this=0x7f0e6c019880, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at leases.c:75
#8  0x00007f0e70ad7143 in up_open (frame=0x7f0e28002250, this=0x7f0e6c01af20, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at upcall.c:75
#9  0x00007f0e805b1269 in default_open_resume (frame=0x7f0e6c002020, this=0x7f0e6c01c690, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at defaults.c:1726
#10 0x00007f0e80542b25 in call_resume (stub=0x7f0e6c0ccf40) at call-stub.c:2508
#11 0x00007f0e708c1957 in iot_worker (data=0x7f0e6c0550e0) at io-threads.c:220
#12 0x00007f0e7f37fdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f0e7ecc473d in clone () from /lib64/libc.so.6
(gdb)
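
To make the faulting addresses easier to read: frame #1 shows br_stub_add_fd_to_inode() being entered with ctx=0x0, and frame #0 then faults in list_add_tail() at list.h:40 with new=0x18. A small value like 0x18 is what you get when a list member's address is computed from a NULL struct pointer. The following is a self-contained illustration of that mechanism; the struct layout, member names, and the 0x18 offset are assumptions for demonstration only, not the real bit-rot-stub definitions.

#include <stdio.h>
#include <stddef.h>

/* Minimal doubly linked list, mirroring the list_add_tail() shape in
 * libglusterfs/src/list.h (line 40 is the "new->next = head" write). */
struct list_head { struct list_head *next, *prev; };

static void list_add_tail(struct list_head *new, struct list_head *head)
{
        new->next = head;        /* <-- the write that takes SIGSEGV */
        new->prev = head->prev;
        head->prev->next = new;
        head->prev = new;
}

/* Illustrative stand-in for the per-inode bitrot context; the real
 * br_stub_inode_ctx_t layout differs, this only shows the offset effect. */
struct demo_inode_ctx {
        unsigned long state[3];  /* 24 bytes of state before the list head */
        struct list_head fd_list;
};

int main(void)
{
        struct demo_inode_ctx *ctx = NULL;          /* context was never set */
        struct list_head fd_node = { &fd_node, &fd_node };

        /* With ctx == NULL, &ctx->fd_list evaluates to just the member
         * offset (0x18 in this demo), which is the near-NULL "new=0x18"
         * pointer visible in frame #0 of the backtrace. */
        printf("offset of fd_list: 0x%zx\n",
               offsetof(struct demo_inode_ctx, fd_list));

        list_add_tail(&ctx->fd_list, &fd_node);     /* segfaults, as the brick did */
        return 0;
}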

--- Additional comment from Worker Ant on 2017-05-22 09:00:24 EDT ---

REVIEW: https://review.gluster.org/17357 (features/bitrot: Fix glusterfsd crash) posted (#1) for review on master by Kotresh HR (khiremat)

--- Additional comment from Worker Ant on 2017-05-28 23:52:14 EDT ---

COMMIT: https://review.gluster.org/17357 committed in master by Atin Mukherjee (amukherj) 
------
commit 6908e962f6293d38f0ee65c088247a66f2832e4a
Author: Kotresh HR <khiremat>
Date:   Mon May 22 08:47:07 2017 -0400

    features/bitrot: Fix glusterfsd crash
    
    With object versioning being optional, it can
    so happen that the bitrot stub context is not
    always set. When it's not found, it's initialized,
    but it was not being assigned for use in the local
    function. This was leading to a brick crash.
    Fixed the same.
    
    Change-Id: I0dab6435cdfe16a8c7f6a31ffec1a370822597a8
    BUG: 1454317
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: https://review.gluster.org/17357
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra Bhat <raghavendra>
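
In outline, the change is the classic "initialised but never reassigned" fix: because object versioning is optional, the context lookup on open can legitimately miss, the open path then creates the missing per-inode context, but the local pointer handed to br_stub_add_fd_to_inode() was left NULL. Below is a hedged, self-contained sketch of the before/after shape; every name in it (demo_ctx_t, demo_inode_t, init_ctx, add_fd_to_ctx, open_buggy, open_fixed) is an illustrative stand-in, not the actual bit-rot-stub code.

#include <stdio.h>
#include <stdlib.h>

typedef struct {
        int fd_count;                    /* placeholder for per-inode state */
} demo_ctx_t;

typedef struct {
        demo_ctx_t *ctx;                 /* NULL until initialised */
} demo_inode_t;

/* Create a context, store it on the inode, and hand it back to the caller. */
static int init_ctx(demo_inode_t *inode, demo_ctx_t **out)
{
        demo_ctx_t *ctx = calloc(1, sizeof(*ctx));
        if (!ctx)
                return -1;
        inode->ctx = ctx;
        *out = ctx;
        return 0;
}

static int add_fd_to_ctx(demo_ctx_t *ctx)
{
        return ++ctx->fd_count;          /* dereferences ctx: NULL means SIGSEGV */
}

/* Pre-fix shape: the fresh context is stored on the inode, but the local
 * 'ctx' used below is never reassigned, so add_fd_to_ctx() still gets NULL. */
static int open_buggy(demo_inode_t *inode)
{
        demo_ctx_t *ctx = inode->ctx;
        if (!ctx) {
                demo_ctx_t *fresh = NULL;
                if (init_ctx(inode, &fresh) != 0)
                        return -1;
                /* BUG: missing "ctx = fresh;" */
        }
        return add_fd_to_ctx(ctx);       /* crashes whenever the lookup missed */
}

/* Post-fix shape: assign the newly created context before using it. */
static int open_fixed(demo_inode_t *inode)
{
        demo_ctx_t *ctx = inode->ctx;
        if (!ctx && init_ctx(inode, &ctx) != 0)
                return -1;
        return add_fd_to_ctx(ctx);
}

int main(void)
{
        demo_inode_t inode = { 0 };
        printf("fixed path:  fd_count = %d\n", open_fixed(&inode));
        /* Once a context exists the buggy path works too, which is why the
         * crash only shows up when the lookup misses (versioning unset). */
        printf("buggy path:  fd_count = %d\n", open_buggy(&inode));
        /* Calling open_buggy() on a fresh inode would segfault, as the brick did. */
        return 0;
}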

Comment 1 Worker Ant 2017-05-29 05:46:58 UTC
REVIEW: https://review.gluster.org/17406 (features/bitrot: Fix glusterfsd crash) posted (#1) for review on release-3.11 by Kotresh HR (khiremat)

Comment 2 Worker Ant 2017-05-29 13:57:15 UTC
COMMIT: https://review.gluster.org/17406 committed in release-3.11 by Shyamsundar Ranganathan (srangana) 
------
commit a89449a1b831b626a6ba170cf3b16af6df190471
Author: Kotresh HR <khiremat>
Date:   Mon May 22 08:47:07 2017 -0400

    features/bitrot: Fix glusterfsd crash
    
    With object versioning being optional, it can
    so happen that the bitrot stub context is not
    always set. When it's not found, it's initialized,
    but it was not being assigned for use in the local
    function. This was leading to a brick crash.
    Fixed the same.
    
    > Change-Id: I0dab6435cdfe16a8c7f6a31ffec1a370822597a8
    > BUG: 1454317
    > Signed-off-by: Kotresh HR <khiremat>
    > Reviewed-on: https://review.gluster.org/17357
    > Smoke: Gluster Build System <jenkins.org>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Raghavendra Bhat <raghavendra>
    (cherry picked from commit 6908e962f6293d38f0ee65c088247a66f2832e4a)
    
    Change-Id: I0dab6435cdfe16a8c7f6a31ffec1a370822597a8
    BUG: 1456331
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: https://review.gluster.org/17406
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana>

Comment 3 Shyamsundar 2017-05-30 18:53:41 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and on the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/

