Bug 1453160 - [Bitrot]: Multiple bricks crash seen on a cifs setup with nl-cache and II-readdirp enabled volume
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: bitrot
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Kotresh HR
QA Contact: Sweta Anandpara
Blocks: 1417151
Reported: 2017-05-22 09:42 UTC by Sweta Anandpara
Modified: 2017-09-21 04:58 UTC (History)

Fixed In Version: glusterfs-3.8.4-27
Last Closed: 2017-09-21 04:45:37 UTC

Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2017:2774 | 0 | normal | SHIPPED_LIVE | glusterfs bug fix and enhancement update | 2017-09-21 08:16:29 UTC

Description Sweta Anandpara 2017-05-22 09:42:23 UTC
Description of problem:
======================
Had a 4*2 volume on a 4-node cluster with parallel-readdirp and negative-lookup-cache enabled. It was mounted on 3 clients via CIFS, and a mix of large files, small files, hardlinks, softlinks and directories was created on it.

Went through a bitrot life cycle: corrupted files/links at different nodes and levels, then followed the entire process of recovering/restoring a bad file. After some time, volume status showed brick processes down.

4 cores were generated on 3 nodes, all of which appear to be similar in nature. Sosreports and core files will be copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/

Version-Release number of selected component (if applicable):
===========================================================
3.8.4-25


How reproducible:
==================
1:1


Additional info:
=================

[root@dhcp47-127 ~]# rpm -qa | grep -i glusterfs
glusterfs-libs-3.8.4-25.el7rhgs.x86_64
glusterfs-cli-3.8.4-25.el7rhgs.x86_64
samba-vfs-glusterfs-4.6.3-0.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-25.el7rhgs.x86_64
glusterfs-server-3.8.4-25.el7rhgs.x86_64
glusterfs-events-3.8.4-25.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-24.el7rhgs.x86_64
glusterfs-api-3.8.4-25.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-25.el7rhgs.x86_64
glusterfs-fuse-3.8.4-25.el7rhgs.x86_64
glusterfs-3.8.4-25.el7rhgs.x86_64
glusterfs-rdma-3.8.4-25.el7rhgs.x86_64
[root@dhcp47-127 ~]# gluster v status
Status of volume: ctdb
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.127:/bricks/brick8/ctdb      49152     0          Y       22460
Brick 10.70.46.181:/bricks/brick8/ctdb      49152     0          Y       3734
Brick 10.70.46.47:/bricks/brick8/ctdb       49152     0          Y       12996
Brick 10.70.47.140:/bricks/brick8/ctdb      49152     0          Y       18260
Self-heal Daemon on localhost               N/A       N/A        Y       22449
Self-heal Daemon on dhcp46-181.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       3723
Self-heal Daemon on dhcp47-140.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       18249
Self-heal Daemon on dhcp46-47.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       12985
 
Task Status of Volume ctdb
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: saturday-saturday
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp47-127.lab.eng.blr.redhat.com:/br
icks/brick0/saturday-saturday_brick0        49153     0          Y       22466
Brick dhcp46-181.lab.eng.blr.redhat.com:/br
icks/brick0/saturday-saturday_brick1        49153     0          Y       3740
Brick dhcp46-47.lab.eng.blr.redhat.com:/bri
cks/brick0/saturday-saturday_brick2         49153     0          Y       13002
Brick dhcp47-140.lab.eng.blr.redhat.com:/br
icks/brick0/saturday-saturday_brick3        49153     0          Y       18266
Brick dhcp47-127.lab.eng.blr.redhat.com:/br
icks/brick1/saturday-saturday_brick4        N/A       N/A        N       N/A  
Brick dhcp46-181.lab.eng.blr.redhat.com:/br
icks/brick1/saturday-saturday_brick5        49154     0          Y       3750
Brick dhcp46-47.lab.eng.blr.redhat.com:/bri
cks/brick1/saturday-saturday_brick6         N/A       N/A        N       N/A  
Brick dhcp47-140.lab.eng.blr.redhat.com:/br
icks/brick1/saturday-saturday_brick7        49154     0          Y       18276
Brick dhcp47-127.lab.eng.blr.redhat.com:/br
icks/brick2/saturday-saturday_brick8        N/A       N/A        N       N/A  
Brick dhcp46-181.lab.eng.blr.redhat.com:/br
icks/brick2/saturday-saturday_brick9        N/A       N/A        N       N/A  
Brick dhcp46-47.lab.eng.blr.redhat.com:/bri
cks/brick2/saturday-saturday_brick10        49155     0          Y       13020
Brick dhcp47-140.lab.eng.blr.redhat.com:/br
icks/brick2/saturday-saturday_brick11       49155     0          Y       18284
Snapshot Daemon on localhost                49162     0          Y       22546
Self-heal Daemon on localhost               N/A       N/A        Y       22449
Bitrot Daemon on localhost                  N/A       N/A        Y       9001
Scrubber Daemon on localhost                N/A       N/A        Y       9018
Snapshot Daemon on dhcp46-181.lab.eng.blr.r
edhat.com                                   49162     0          Y       3821
Self-heal Daemon on dhcp46-181.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       3723
Bitrot Daemon on dhcp46-181.lab.eng.blr.red
hat.com                                     N/A       N/A        Y       3023
Scrubber Daemon on dhcp46-181.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       3034
Snapshot Daemon on dhcp47-140.lab.eng.blr.r
edhat.com                                   49162     0          Y       18348
Self-heal Daemon on dhcp47-140.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       18249
Bitrot Daemon on dhcp47-140.lab.eng.blr.red
hat.com                                     N/A       N/A        Y       14952
Scrubber Daemon on dhcp47-140.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       14963
Snapshot Daemon on dhcp46-47.lab.eng.blr.re
dhat.com                                    49162     0          Y       13083
Self-heal Daemon on dhcp46-47.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       12985
Bitrot Daemon on dhcp46-47.lab.eng.blr.redh
at.com                                      N/A       N/A        Y       10810
Scrubber Daemon on dhcp46-47.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       10822
 
Task Status of Volume saturday-saturday
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@dhcp47-127 ~]# gluster v info
 
Volume Name: ctdb
Type: Replicate
Volume ID: 64980fe9-85ea-487e-8d0d-39b70c8626b0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.47.127:/bricks/brick8/ctdb
Brick2: 10.70.46.181:/bricks/brick8/ctdb
Brick3: 10.70.46.47:/bricks/brick8/ctdb
Brick4: 10.70.47.140:/bricks/brick8/ctdb
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
 
Volume Name: saturday-saturday
Type: Distributed-Replicate
Volume ID: 4a24c34c-1144-4f07-9763-6e232c037a67
Status: Started
Snapshot Count: 2
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: dhcp47-127.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick0
Brick2: dhcp46-181.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick1
Brick3: dhcp46-47.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick2
Brick4: dhcp47-140.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick3
Brick5: dhcp47-127.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick4
Brick6: dhcp46-181.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick5
Brick7: dhcp46-47.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick6
Brick8: dhcp47-140.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick7
Brick9: dhcp47-127.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick8
Brick10: dhcp46-181.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick9
Brick11: dhcp46-47.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick10
Brick12: dhcp47-140.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick11
Options Reconfigured:
features.scrub: Active
features.bitrot: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.nl-cache: on
features.barrier: disable
features.show-snapshot-directory: enable
features.uss: enable
transport.address-family: inet
nfs.disable: on
server.allow-insecure: on
performance.stat-prefetch: on
storage.batch-fsync-delay-usec: 0
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000
performance.cache-samba-metadata: on
performance.parallel-readdir: on
[root@dhcp47-127 ~]# gluster v list
ctdb
saturday-saturday
[root@dhcp47-127 ~]#
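
For triage, the bricks that are down can be pulled out of the `gluster v status` text above with a short script. The following is a minimal Python sketch and not part of the original report. It assumes the column layout shown in this transcript (name, TCP port, RDMA port, Online, Pid, with long brick names wrapped onto the next line); the function name `offline_bricks` is hypothetical.

```python
import re

# Tail of a status row: TCP port, RDMA port, Online flag (Y/N), Pid.
ROW_TAIL = re.compile(r"\s+(\S+)\s+(\S+)\s+([YN])\s+(\S+)\s*$")


def offline_bricks(status_text):
    """Return the names of bricks whose Online column is 'N'.

    Assumption: a long brick name may wrap onto the following line, and
    the four columns always follow the last fragment of the name.
    """
    offline = []
    pending = None  # name fragments of a brick row still being assembled
    for line in status_text.splitlines():
        if pending is None:
            if not line.startswith("Brick "):
                continue  # daemon rows, headers and separators are ignored
            pending = ""
            line = line[len("Brick "):]
        match = ROW_TAIL.search(line)
        if match:
            name = pending + line[:match.start()]
            if match.group(3) == "N":
                offline.append(name)
            pending = None
        else:
            # Wrapped name: keep the fragment and wait for the next line.
            pending += line.strip()
    return offline
```

Feeding it the saved output of `gluster v status` for this setup should list the four bricks marked N above (brick4, brick6, brick8 and brick9 of saturday-saturday).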

Comment 2 Sweta Anandpara 2017-05-22 09:45:10 UTC
[qe@rhsqe-repo 1453160]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1453160]$ pwd
/home/repo/sosreports/1453160
[qe@rhsqe-repo 1453160]$ ll
total 980952
-rwxr-xr-x. 1 qe qe      3473 May 22 15:03 client_logs
-rwxr-xr-x. 1 qe qe 239718400 May 22 15:00 core.13011
-rwxr-xr-x. 1 qe qe 233590784 May 22 15:00 core.22469
-rwxr-xr-x. 1 qe qe 222420992 May 22 15:00 core.22484
-rwxr-xr-x. 1 qe qe 199450624 May 22 15:00 core.3758
-rwxr-xr-x. 1 qe qe  30375496 May 22 14:54 sosreport-sysreg-prod-20170522090552_dhcp_47_127.tar.xz
-rwxr-xr-x. 1 qe qe  26819980 May 22 14:54 sosreport-sysreg-prod-20170522090554_dhcp_46_181.tar.xz
-rwxr-xr-x. 1 qe qe  26213468 May 22 14:54 sosreport-sysreg-prod-20170522090556_dhcp_46_47.tar.xz
-rwxr-xr-x. 1 qe qe  25884564 May 22 14:54 sosreport-sysreg-prod-20170522090558_dhcp_47_140.tar.xz
[qe@rhsqe-repo 1453160]$

Comment 4 Atin Mukherjee 2017-05-22 13:03:23 UTC
upstream patch : https://review.gluster.org/17357

Comment 7 Kotresh HR 2017-05-22 13:06:42 UTC
Backtrace:

#0  br_stub_need_versioning (this=this@entry=0x7fcda8012aa0, fd=fd@entry=0x7fcd94008480, versioning=versioning@entry=0x7fcda19a4058, modified=modified@entry=0x7fcda19a405c, 
    ctx=ctx@entry=0x7fcda19a4060) at bit-rot-stub.c:466
466	        br_stub_inode_ctx_t *c = NULL;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libacl-2.2.51-12.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-12.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7_3.2.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 sqlite-3.7.17-8.el7.x86_64 sssd-client-1.14.0-43.el7_3.14.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  br_stub_need_versioning (this=this@entry=0x7fcda8012aa0, fd=fd@entry=0x7fcd94008480, versioning=versioning@entry=0x7fcda19a4058, modified=modified@entry=0x7fcda19a405c, 
    ctx=ctx@entry=0x7fcda19a4060) at bit-rot-stub.c:466
#1  0x00007fcdae8465d8 in br_stub_writev (frame=frame@entry=0x7fcd6400b5f0, this=this@entry=0x7fcda8012aa0, fd=fd@entry=0x7fcd94008480, vector=vector@entry=0x7fcd94030a30, 
    count=count@entry=1, offset=offset@entry=1577648128, flags=flags@entry=0, iobref=iobref@entry=0x7fcd940086b0, xdata=xdata@entry=0x7fcd9402e710) at bit-rot-stub.c:1959
#2  0x00007fcdbd8955f9 in default_writev (frame=0x7fcd6400b5f0, this=<optimized out>, fd=0x7fcd94008480, vector=0x7fcd94030a30, count=1, off=1577648128, flags=0, iobref=0x7fcd940086b0, 
    xdata=0x7fcd9402e710) at defaults.c:2543
#3  0x00007fcdae411d24 in pl_writev (frame=frame@entry=0x7fcd6400b920, this=this@entry=0x7fcda8015980, fd=fd@entry=0x7fcd94008480, vector=vector@entry=0x7fcd94030a30, count=count@entry=1, 
    offset=offset@entry=1577648128, flags=flags@entry=0, iobref=iobref@entry=0x7fcd940086b0, xdata=xdata@entry=0x7fcd9402e710) at posix.c:2030
#4  0x00007fcdae1fc4c4 in worm_writev (frame=frame@entry=0x7fcd6400b920, this=this@entry=0x7fcda8016f60, fd=fd@entry=0x7fcd94008480, vector=vector@entry=0x7fcd94030a30, 
    count=count@entry=1, offset=offset@entry=1577648128, flags=flags@entry=0, iobref=iobref@entry=0x7fcd940086b0, xdata=xdata@entry=0x7fcd9402e710) at worm.c:386
#5  0x00007fcdadff1ea7 in ro_writev (frame=0x7fcd6400b920, this=<optimized out>, fd=0x7fcd94008480, vector=0x7fcd94030a30, count=1, off=1577648128, flags=0, iobref=0x7fcd940086b0, 
    xdata=0x7fcd9402e710) at read-only-common.c:383
#6  0x00007fcdadddd83d in leases_writev (frame=0x7fcd64006ce0, this=0x7fcda8019ef0, fd=0x7fcd94008480, vector=0x7fcd94030a30, count=1, off=1577648128, flags=0, iobref=0x7fcd940086b0, 
    xdata=0x7fcd9402e710) at leases.c:127
#7  0x00007fcdadbcc8ad in up_writev (frame=0x7fcd64006f00, this=0x7fcda801b630, fd=0x7fcd94008480, vector=0x7fcd94030a30, count=1, off=1577648128, flags=0, iobref=0x7fcd940086b0, 
    xdata=0x7fcd9402e710) at upcall.c:131
#8  0x00007fcdbd8ae736 in default_writev_resume (frame=0x7fcd94036c30, this=0x7fcda801cf60, fd=0x7fcd94008480, vector=0x7fcd94030a30, count=1, off=1577648128, flags=0, 
    iobref=0x7fcd940086b0, xdata=0x7fcd9402e710) at defaults.c:1849
#9  0x00007fcdbd83e698 in call_resume_wind (stub=0x7fcd9404de20) at call-stub.c:2045
#10 0x00007fcdbd83eb25 in call_resume (stub=0x7fcd9404de20) at call-stub.c:2508
#11 0x00007fcdad9b7957 in iot_worker (data=0x7fcda8056490) at io-threads.c:220
#12 0x00007fcdbc67bdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fcdbbfc073d in clone () from /lib64/libc.so.6
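
Since the description mentions four cores that appear to be similar, one way to compare them is to reduce each backtrace to its top few function names. The following is a minimal Python sketch under stated assumptions, not something taken from the report: it expects the plain text of a single gdb `bt` output in the layout shown above, and the helper names `parse_backtrace` and `signature` are hypothetical.

```python
import re

# "#N  [0xADDR in ]func (" starts a frame; its arguments may wrap over lines.
FRAME_START = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(\S+) \(")
# "... at file.c:123" closes a frame that has source information.
LOCATION = re.compile(r"\bat (\S+):(\d+)\s*$")
# "... from /lib64/libfoo.so" closes a frame without debuginfo.
LIBRARY = re.compile(r"\bfrom (\S+)\s*$")


def parse_backtrace(text):
    """Return one dict per frame: index, function name and location."""
    frames = []
    for line in text.splitlines():
        start = FRAME_START.match(line)
        if start:
            frames.append(
                {"index": int(start.group(1)), "func": start.group(2), "where": None}
            )
        if not frames:
            continue  # skip anything printed before the first frame
        loc = LOCATION.search(line)
        lib = LIBRARY.search(line)
        if loc:
            frames[-1]["where"] = f"{loc.group(1)}:{loc.group(2)}"
        elif lib:
            frames[-1]["where"] = lib.group(1)
    return frames


def signature(frames, depth=3):
    """Top `depth` function names, usable as a rough crash fingerprint."""
    return tuple(frame["func"] for frame in frames[:depth])
```

Two cores whose signatures match at a small depth probably crashed on the same path, but a matching signature is only a hint and does not replace reading the full backtraces.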

Comment 8 Kotresh HR 2017-05-22 13:08:56 UTC
The fix sent for BZ 1451280 solves this crash as well. Keeping this bug open because the backtrace is different.

Upstream Patch:
https://review.gluster.org/17357

Comment 10 Atin Mukherjee 2017-05-29 06:30:31 UTC
Upstream Patches:
https://review.gluster.org/17357      (master)
https://review.gluster.org/#/c/17406/ (release-3.11)

Downstream Patches:
https://code.engineering.redhat.com/gerrit/#/c/107534/

Comment 12 Sweta Anandpara 2017-07-31 10:22:04 UTC
Tested and verified this on the build glusterfs-3.8.4-35. Ran a round of bitrot testing and did not see this crash again in my logs.

Moving this bug to verified in 3.3.0.
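
To check whether a given installed build already contains the fix, its version-release can be compared against the "Fixed In Version" field (glusterfs-3.8.4-27). The following is a minimal Python sketch, not an official tool. It assumes names of the form name-version-release as seen in the `rpm -qa` output in this report and compares only the numeric parts, which is a simplification of real RPM version ordering; `build_key` and `has_fix` are hypothetical names.

```python
def build_key(nvr):
    """Split e.g. 'glusterfs-3.8.4-27' into a comparable (version, release) key."""
    _name, version, release = nvr.rsplit("-", 2)
    # Keep only the leading number of the release: '25.el7rhgs.x86_64' -> 25.
    release_num = int(release.split(".", 1)[0])
    return tuple(int(part) for part in version.split(".")), release_num


def has_fix(installed, fixed_in="glusterfs-3.8.4-27"):
    """True if `installed` is at or above the 'Fixed In Version' build."""
    return build_key(installed) >= build_key(fixed_in)


print(has_fix("glusterfs-3.8.4-25.el7rhgs.x86_64"))  # build the bug was reported on
print(has_fix("glusterfs-3.8.4-35"))                 # build used for verification
```

With the builds named in this report, the first call prints False and the second prints True.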

Comment 14 errata-xmlrpc 2017-09-21 04:45:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

