Bug 1451280 - [Bitrot]: Brick process crash observed while trying to recover a bad file in disperse volume
Summary: [Bitrot]: Brick process crash observed while trying to recover a bad file in disperse volume
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: bitrot
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Kotresh HR
QA Contact: Sweta Anandpara
URL:
Whiteboard:
Depends On: 1454317 1456331
Blocks: 1417151
 
Reported: 2017-05-16 09:56 UTC by Sweta Anandpara
Modified: 2017-09-21 04:43 UTC
4 users

Fixed In Version: glusterfs-3.8.4-27
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1454317 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:43:23 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2774 0 normal SHIPPED_LIVE glusterfs bug fix and enhancement update 2017-09-21 08:16:29 UTC

Description Sweta Anandpara 2017-05-16 09:56:09 UTC
Description of problem:
=======================
Had a 6-node cluster with the 3.8.4-23 build. Created a 1 x (4+2) disperse (EC) volume and mounted it via FUSE. Created two files 'test1' and 'test2' and corrupted both; the scrubber detected both files as corrupted. Updated the build to 3.8.4-25 and restarted glusterd. Followed the steps for recovering a bad file as mentioned in the admin guide. 'test2' recovered successfully, but 'test1' failed with 'Input/output error' on the mountpoint, and volume status showed 2 brick processes down. Core file and sosreports will be copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/
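For reference, the setup and corruption steps above can be sketched roughly as follows. The gluster CLI steps need a live multi-node cluster, so they are left as comments (hostnames, brick paths, and the exact create syntax are assumptions); the corruption step itself is ordinary byte-level I/O directly on a brick path:

```shell
# Hypothetical repro sketch -- cluster-side steps shown as comments:
#
#   gluster volume create disp2 disperse 6 redundancy 2 <hosts>:/bricks/brick8/disp2_{0..5}
#   gluster volume start disp2
#   gluster volume bitrot disp2 enable
#   gluster volume bitrot disp2 scrub-frequency hourly
#   mount -t glusterfs <host>:/disp2 /mnt/disp2

# Corrupt a file by flipping a byte in place on the brick, bypassing gluster:
echo "original contents" > test1
before=$(sha256sum test1 | cut -d' ' -f1)

printf 'X' | dd of=test1 bs=1 count=1 conv=notrunc 2>/dev/null
after=$(sha256sum test1 | cut -d' ' -f1)

echo "before=$before"
echo "after=$after"
test "$before" != "$after" && echo "checksums differ: scrubber would flag the file as bad"
```

Once a scrub run flags the file, the admin-guide recovery procedure (remove the bad copy from the brick and trigger a heal) is the step that led to the crash reported here.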


Version-Release number of selected component (if applicable):
===========================================================
3.8.4-25


How reproducible:
=================
1:1


Additional info:
================

[root@dhcp47-121 ~]# rpm -qa | grep gluster
glusterfs-libs-3.8.4-25.el7rhgs.x86_64
glusterfs-events-3.8.4-25.el7rhgs.x86_64
glusterfs-cli-3.8.4-25.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-25.el7rhgs.x86_64
glusterfs-server-3.8.4-25.el7rhgs.x86_64
glusterfs-rdma-3.8.4-25.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-api-3.8.4-25.el7rhgs.x86_64
python-gluster-3.8.4-25.el7rhgs.noarch
glusterfs-debuginfo-3.8.4-24.el7rhgs.x86_64
glusterfs-fuse-3.8.4-25.el7rhgs.x86_64
glusterfs-3.8.4-25.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-25.el7rhgs.x86_64
[root@dhcp47-121 ~]# gluster peer status
Number of Peers: 5

Hostname: dhcp47-113.lab.eng.blr.redhat.com
Uuid: a0557927-4e5e-4ff7-8dce-94873f867707
State: Peer in Cluster (Connected)

Hostname: dhcp47-114.lab.eng.blr.redhat.com
Uuid: c0dac197-5a4d-4db7-b709-dbf8b8eb0896
State: Peer in Cluster (Connected)

Hostname: dhcp47-115.lab.eng.blr.redhat.com
Uuid: f828fdfa-e08f-4d12-85d8-2121cafcf9d0
State: Peer in Cluster (Connected)

Hostname: dhcp47-116.lab.eng.blr.redhat.com
Uuid: a96e0244-b5ce-4518-895c-8eb453c71ded
State: Peer in Cluster (Connected)

Hostname: dhcp47-117.lab.eng.blr.redhat.com
Uuid: 17eb3cef-17e7-4249-954b-fc19ec608304
State: Peer in Cluster (Connected)
[root@dhcp47-121 ~]# 
[root@dhcp47-121 ~]# 
[root@dhcp47-121 ~]# gluster v status disp2
Status of volume: disp2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.121:/bricks/brick8/disp2_0   49154     0          Y       5552 
Brick 10.70.47.113:/bricks/brick8/disp2_1   N/A       N/A        N       N/A  
Brick 10.70.47.114:/bricks/brick8/disp2_2   49154     0          Y       30916
Brick 10.70.47.115:/bricks/brick8/disp2_3   49154     0          Y       23469
Brick 10.70.47.116:/bricks/brick8/disp2_4   49153     0          Y       27754
Brick 10.70.47.117:/bricks/brick8/disp2_5   N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       5497 
Bitrot Daemon on localhost                  N/A       N/A        Y       5515 
Scrubber Daemon on localhost                N/A       N/A        Y       5525 
Self-heal Daemon on dhcp47-113.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       5893 
Bitrot Daemon on dhcp47-113.lab.eng.blr.red
hat.com                                     N/A       N/A        Y       5911 
Scrubber Daemon on dhcp47-113.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       5921 
Self-heal Daemon on dhcp47-114.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       30858
Bitrot Daemon on dhcp47-114.lab.eng.blr.red
hat.com                                     N/A       N/A        Y       30876
Scrubber Daemon on dhcp47-114.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       30886
Self-heal Daemon on dhcp47-116.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       27708
Bitrot Daemon on dhcp47-116.lab.eng.blr.red
hat.com                                     N/A       N/A        Y       27726
Scrubber Daemon on dhcp47-116.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       27736
Self-heal Daemon on dhcp47-117.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       9684 
Bitrot Daemon on dhcp47-117.lab.eng.blr.red
hat.com                                     N/A       N/A        Y       9702 
Scrubber Daemon on dhcp47-117.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       9712 
Self-heal Daemon on dhcp47-115.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       23411
Bitrot Daemon on dhcp47-115.lab.eng.blr.red
hat.com                                     N/A       N/A        Y       23429
Scrubber Daemon on dhcp47-115.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       23439
 
Task Status of Volume disp2
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@dhcp47-121 ~]# 
[root@dhcp47-121 ~]# gluster v info disp2
 
Volume Name: disp2
Type: Disperse
Volume ID: d7b0d170-f0e0-4e26-9369-f0a52dc92d38
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.47.121:/bricks/brick8/disp2_0
Brick2: 10.70.47.113:/bricks/brick8/disp2_1
Brick3: 10.70.47.114:/bricks/brick8/disp2_2
Brick4: 10.70.47.115:/bricks/brick8/disp2_3
Brick5: 10.70.47.116:/bricks/brick8/disp2_4
Brick6: 10.70.47.117:/bricks/brick8/disp2_5
Options Reconfigured:
performance.stat-prefetch: off
nfs.disable: on
transport.address-family: inet
features.bitrot: on
features.scrub: Active
features.scrub-freq: hourly
cluster.brick-multiplex: disable
[root@dhcp47-121 ~]# 
[root@dhcp47-121 ~]# gluster v  bitrot disp2 scrub status

Volume name : disp2

State of scrub: Active (In Progress)

Scrub impact: lazy

Scrub frequency: hourly

Bitrot error log location: /var/log/glusterfs/bitd.log

Scrubber error log location: /var/log/glusterfs/scrub.log


=========================================================

Node: localhost

Number of Scrubbed files: 2

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 09:35:12

Duration of last scrub (D:M:H:M:S): 0:0:0:7

Error count: 0


=========================================================

Node: dhcp47-114.lab.eng.blr.redhat.com

Number of Scrubbed files: 1

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 09:35:12

Duration of last scrub (D:M:H:M:S): 0:0:0:7

Error count: 0


=========================================================

Node: dhcp47-116.lab.eng.blr.redhat.com

Number of Scrubbed files: 2

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 09:35:14

Duration of last scrub (D:M:H:M:S): 0:0:0:7

Error count: 0


=========================================================

Node: dhcp47-113.lab.eng.blr.redhat.com

Number of Scrubbed files: 0

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 08:35:24

Duration of last scrub (D:M:H:M:S): 0:0:0:7

Error count: 0


=========================================================

Node: dhcp47-115.lab.eng.blr.redhat.com

Number of Scrubbed files: 2

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 09:35:11

Duration of last scrub (D:M:H:M:S): 0:0:0:7

Error count: 0


=========================================================

Node: dhcp47-117.lab.eng.blr.redhat.com

Number of Scrubbed files: 0

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 08:35:23

Duration of last scrub (D:M:H:M:S): 0:0:0:7

Error count: 0

=========================================================

[root@dhcp47-121 ~]# gluster v heal disp2 info
Brick 10.70.47.121:/bricks/brick8/disp2_0
/d1/d2/d3/d4/test2 
Status: Connected
Number of entries: 1

Brick 10.70.47.113:/bricks/brick8/disp2_1
Status: Transport endpoint is not connected
Number of entries: -

Brick 10.70.47.114:/bricks/brick8/disp2_2
/d1/d2/d3/d4/test2 
Status: Connected
Number of entries: 1

Brick 10.70.47.115:/bricks/brick8/disp2_3
/d1/d2/d3/d4/test2 
Status: Connected
Number of entries: 1

Brick 10.70.47.116:/bricks/brick8/disp2_4
/d1/d2/d3/d4/test2 
Status: Connected
Number of entries: 1

Brick 10.70.47.117:/bricks/brick8/disp2_5
Status: Transport endpoint is not connected
Number of entries: -

[root@dhcp47-121 ~]# 


[2017-05-16 08:54:10.160132] E [MSGID: 115070] [server-rpc-fops.c:1474:server_open_cbk] 0-disp2-server: 4619: OPEN /d1/d2/d3/d4/test2 (3673eecb-e5b5-4014-9bc6-a2fc007f08cb) ==> (Input/output error) [Input/output error]
pending frames:
frame : type(0) op(29)
frame : type(0) op(11)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2017-05-16 08:55:01
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7f0e805201b2]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f0e80529bd4]
/lib64/libc.so.6(+0x35250)[0x7f0e7ec02250]
/usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so(+0xadf4)[0x7f0e7174cdf4]
/usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so(+0xde56)[0x7f0e7174fe56]
/usr/lib64/glusterfs/3.8.4/xlator/features/access-control.so(+0x5815)[0x7f0e71535815]
/usr/lib64/glusterfs/3.8.4/xlator/features/locks.so(+0x6dc8)[0x7f0e71312dc8]
/usr/lib64/glusterfs/3.8.4/xlator/features/worm.so(+0x7e59)[0x7f0e71106e59]
/usr/lib64/glusterfs/3.8.4/xlator/features/read-only.so(+0x4478)[0x7f0e70efb478]
/usr/lib64/glusterfs/3.8.4/xlator/features/leases.so(+0x50b4)[0x7f0e70ce70b4]
/usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so(+0xf143)[0x7f0e70ad7143]
/lib64/libglusterfs.so.0(default_open_resume+0x1c9)[0x7f0e805b1269]
/lib64/libglusterfs.so.0(call_resume+0x75)[0x7f0e80542b25]
/usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so(+0x4957)[0x7f0e708c1957]
/lib64/libpthread.so.0(+0x7dc5)[0x7f0e7f37fdc5]
/lib64/libc.so.6(clone+0x6d)[0x7f0e7ecc473d]

Comment 2 Sweta Anandpara 2017-05-16 09:57:19 UTC
[qe@rhsqe-repo 1451280]$ 
[qe@rhsqe-repo 1451280]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1451280]$ 
[qe@rhsqe-repo 1451280]$ pwd
/home/repo/sosreports/1451280
[qe@rhsqe-repo 1451280]$ 
[qe@rhsqe-repo 1451280]$ ll
total 708976
-rwxr-xr-x. 1 qe qe 157433856 May 16 15:13 core.5950
-rwxr-xr-x. 1 qe qe 157433856 May 16 15:13 core.9730
-rwxr-xr-x. 1 qe qe  73012628 May 16 15:12 sosreport-sysreg-prod-20170516050748.tar.xz_dhcp47_121
-rwxr-xr-x. 1 qe qe  69134612 May 16 15:12 sosreport-sysreg-prod-20170516050917.tar.xz_dhcp47_113
-rwxr-xr-x. 1 qe qe  69795020 May 16 15:12 sosreport-sysreg-prod-20170516051025.tar.xz_dhcp47_114
-rwxr-xr-x. 1 qe qe  69256712 May 16 15:12 sosreport-sysreg-prod-20170516051259.tar.xz_dhcp47_115
-rwxr-xr-x. 1 qe qe  65140528 May 16 15:12 sosreport-sysreg-prod-20170516051545.tar.xz_dhcp47_116
-rwxr-xr-x. 1 qe qe  64772920 May 16 15:12 sosreport-sysreg-prod-20170516051639.tar.xz_dhcp47_117
[qe@rhsqe-repo 1451280]$

Comment 3 Kotresh HR 2017-05-22 12:52:23 UTC
Following is the backtrace:

Program terminated with signal 11, Segmentation fault.
#0  list_add_tail (head=0x7f0e28001908, new=0x18) at ../../../../../libglusterfs/src/list.h:40
40		new->next = head;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libacl-2.2.51-12.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-12.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7_3.2.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 sqlite-3.7.17-8.el7.x86_64 sssd-client-1.14.0-43.el7_3.14.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  list_add_tail (head=0x7f0e28001908, new=0x18) at ../../../../../libglusterfs/src/list.h:40
#1  br_stub_add_fd_to_inode (this=this@entry=0x7f0e6c012440, fd=fd@entry=0x7f0e6c0a5050, ctx=ctx@entry=0x0) at bit-rot-stub.c:2398
#2  0x00007f0e7174fe56 in br_stub_open (frame=0x7f0e28000ca0, this=0x7f0e6c012440, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at bit-rot-stub.c:2352
#3  0x00007f0e71535815 in posix_acl_open (frame=0x7f0e280014b0, this=0x7f0e6c013d70, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at posix-acl.c:1129
#4  0x00007f0e71312dc8 in pl_open (frame=frame@entry=0x7f0e28000ac0, this=this@entry=0x7f0e6c015320, loc=loc@entry=0x7f0e6c0ccf90, flags=flags@entry=2, fd=fd@entry=0x7f0e6c0a5050, 
    xdata=xdata@entry=0x0) at posix.c:1698
#5  0x00007f0e71106e59 in worm_open (frame=0x7f0e28000ac0, this=<optimized out>, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at worm.c:43
#6  0x00007f0e70efb478 in ro_open (frame=0x7f0e28001740, this=0x7f0e6c018130, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at read-only-common.c:341
#7  0x00007f0e70ce70b4 in leases_open (frame=0x7f0e28001b50, this=0x7f0e6c019880, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at leases.c:75
#8  0x00007f0e70ad7143 in up_open (frame=0x7f0e28002250, this=0x7f0e6c01af20, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at upcall.c:75
#9  0x00007f0e805b1269 in default_open_resume (frame=0x7f0e6c002020, this=0x7f0e6c01c690, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at defaults.c:1726
#10 0x00007f0e80542b25 in call_resume (stub=0x7f0e6c0ccf40) at call-stub.c:2508
#11 0x00007f0e708c1957 in iot_worker (data=0x7f0e6c0550e0) at io-threads.c:220
#12 0x00007f0e7f37fdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f0e7ecc473d in clone () from /lib64/libc.so.6

Comment 4 Kotresh HR 2017-05-22 13:01:44 UTC
Upstream Patch:

https://review.gluster.org/17357

Comment 7 Kotresh HR 2017-05-29 06:17:44 UTC
Upstream Patches:
https://review.gluster.org/17357      (master)
https://review.gluster.org/#/c/17406/ (release-3.11)

Downstream Patches:
https://code.engineering.redhat.com/gerrit/#/c/107534/

Comment 9 Sweta Anandpara 2017-07-31 10:22:59 UTC
Tested and verified this on the build glusterfs-3.8.4-35.

A round of bitrot testing has taken place on the said build, and I have not seen this crash again in my logs.

Moving this to verified in 3.3.0.

Comment 11 errata-xmlrpc 2017-09-21 04:43:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

