Bug 1635784

Summary: brick process segfault
Product: [Community] GlusterFS
Reporter: kyung-pyo,kim <hgichon>
Component: index
Assignee: bugs <bugs>
Status: CLOSED INSUFFICIENT_DATA
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: mainline
CC: aspandey, bugs, jahernan, pasik, sankarshan
Target Milestone: ---
Flags: hgichon: needinfo+
Target Release: ---
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-03 13:34:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  GDB dump log (flags: none)
  brick.log (flags: none)

Description kyung-pyo,kim 2018-10-03 16:17:36 UTC
Created attachment 1490178 [details]
GDB dump log

Description of problem:

- A few days ago I found that my EC (4+2) volume was degraded.
- I am running glusterfs 3.12.13-1.el7.x86_64.
- One brick was down; the brick log is below.
- I suspect a loc->inode bug in index.c (see the attached picture and the sketch below).
- In GDB, loc->inode is NULL at the crashing call:
  inode_find (loc->inode->table, loc->gfid);
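
To make the suspected crash path concrete, here is a paraphrased C sketch. It is not a verbatim excerpt of index.c; the local variable name is illustrative, and only the call pattern reported above and implied by the log is assumed:

    /* Paraphrased sketch of the suspected path in index.c (3.12.x).
     * inode_is_linked() tolerates a NULL inode (it only logs a warning and
     * returns false), but the fallback then dereferences loc->inode to reach
     * the inode table, which segfaults when loc->inode is NULL. */
    inode_t *linked = NULL;

    if (!inode_is_linked (loc->inode)) {
            /* SIGSEGV here when loc->inode == NULL */
            linked = inode_find (loc->inode->table, loc->gfid);
    }

    /* A defensive variant would guard the dereference first, e.g.: */
    if (loc->inode && !inode_is_linked (loc->inode)) {
            linked = inode_find (loc->inode->table, loc->gfid);
    }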


Version-Release number of selected component (if applicable):
- 3.12.13

How reproducible:
- I don't know.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

glusterfsd brick error log
[2018-09-29 13:22:36.536532] W [inode.c:942:inode_find] (-->/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xd01c) [0x7f9bd249401c] -->/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xc638) [0x7f9bd2493638] -->/lib64/libglusterfs.so.0(inode_find+0x92) [0x7f9be7090a82] ) 0-gluvol02-05-server: table not found
[2018-09-29 13:22:36.536579] W [inode.c:680:inode_new] (-->/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xd048) [0x7f9bd2494048] -->/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xc14d) [0x7f9bd249314d] -->/lib64/libglusterfs.so.0(inode_new+0x8a) [0x7f9be70900ba] ) 0-gluvol02-05-server: inode not found
[2018-09-29 13:22:36.537568] W [inode.c:2305:inode_is_linked] (-->/usr/lib64/glusterfs/3.12.13/xlator/features/quota.so(+0x4fc6) [0x7f9bd2b1cfc6] -->/usr/lib64/glusterfs/3.12.13/xlator/features/index.so(+0x4bb9) [0x7f9bd2d43bb9] -->/lib64/libglusterfs.so.0(inode_is_linked+0x8a) [0x7f9be70927ea] ) 0-gluvol02-05-index: inode not found
pending frames:
frame : type(0) op(18)
frame : type(0) op(18)
frame : type(0) op(28)
--snip --
frame : type(0) op(28)
frame : type(0) op(28)
frame : type(0) op(18)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-09-29 13:22:36
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.13
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f9be70804c0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f9be708a3f4]
/lib64/libc.so.6(+0x362f0)[0x7f9be56e02f0]
/usr/lib64/glusterfs/3.12.13/xlator/features/index.so(+0x4bc4)[0x7f9bd2d43bc4]
/usr/lib64/glusterfs/3.12.13/xlator/features/quota.so(+0x4fc6)[0x7f9bd2b1cfc6]
/usr/lib64/glusterfs/3.12.13/xlator/debug/io-stats.so(+0x4e53)[0x7f9bd28eee53]
/lib64/libglusterfs.so.0(default_lookup+0xbd)[0x7f9be70fddfd]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xc342)[0x7f9bd2493342]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xd048)[0x7f9bd2494048]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xd2c0)[0x7f9bd24942c0]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xc89e)[0x7f9bd249389e]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xd354)[0x7f9bd2494354]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0x2f829)[0x7f9bd24b6829]
/lib64/libgfrpc.so.0(rpcsvc_request_handler+0x96)[0x7f9be6e42246]
/lib64/libpthread.so.0(+0x7e25)[0x7f9be5edfe25]
/lib64/libc.so.6(clone+0x6d)[0x7f9be57a8bad]
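
For context on the three inode.c warnings at the top of the log: inode_find(), inode_new() and inode_is_linked() in libglusterfs each contain a defensive NULL check that logs and returns early instead of crashing. A paraphrased sketch of that pattern (not a verbatim copy of inode.c; message IDs and locking are omitted):

    inode_t *
    inode_find (inode_table_t *table, uuid_t gfid)
    {
            inode_t *inode = NULL;

            if (!table) {
                    /* source of the "table not found" warning above */
                    gf_msg_callingfn (THIS->name, GF_LOG_WARNING, 0, 0,
                                      "table not found");
                    return NULL;
            }
            /* ... hash-table lookup by gfid under table->lock ... */
            return inode;
    }

The segfault itself happens one level up, in index.c, where loc->inode is dereferenced without such a guard (see the sketch in the description).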

Comment 2 kyung-pyo,kim 2018-10-06 06:58:16 UTC
Created attachment 1490994 [details]
brick.log

Comment 3 kyung-pyo,kim 2018-10-06 06:59:33 UTC
Yesterday, another brick died with the same symptom.
core file : http://ac2repo.gluesys.com/ac2repo/down/core.50570.tgz 

Volume configuration:
- We have 10 EC (4+2) volumes across 6 nodes.
- Each brick is 37 TB.
- The information below is for one of the 10 EC volumes in the cluster.
- All volumes have the same configuration.

Data characteristics of each volume:
- df -i: 8M (inodes in use)
- 95% mp4 files (~10 MB each), plus some txt information files

File/Dir layout:
- /
  └── indexdir (about 1000)
       └── datadir (about 800)
            └── data: about 10 files (txt and mp4)
NOTE:
- Aggressive self-healing is currently in progress.
- We deleted all bricks on one node and are now monitoring the self-heal status.
- The self-heal daemon's memory usage reaches 75 GB.
- Before the brick segfaulted, there were many gfid fd cleanup calls.

Volume Name: xxxxxxx-01
Type: Disperse
Volume ID: cac0ab6a-55bd-48ed-ac7a-92f0cb4aca80
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: xxxxxxx-GLUSTER2-1:/gluster/brick1/data
Brick2: xxxxxxx-GLUSTER2-2:/gluster/brick1/data
Brick3: xxxxxxx-GLUSTER2-3:/gluster/brick1/data
Brick4: xxxxxxx-GLUSTER2-4:/gluster/brick1/data
Brick5: xxxxxxx-GLUSTER2-5:/gluster/brick1/data
Brick6: xxxxxxx-GLUSTER2-6:/gluster/brick1/data
Options Reconfigured:
performance.io-thread-count: 64
performance.least-prio-threads: 64
performance.high-prio-threads: 64
performance.normal-prio-threads: 64
performance.low-prio-threads: 64
server.event-threads: 1024
client.event-threads: 32
cluster.lookup-optimize: on
performance.parallel-readdir: on
cluster.use-compound-fops: on
performance.nl-cache: on
performance.nl-cache-positive-entry: on
performance.nl-cache-limit: 1GB
network.inode-lru-limit: 200000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
disperse.shd-wait-qlength: 32768
disperse.shd-max-threads: 16
disperse.self-heal-window-size: 16
disperse.heal-wait-qlength: 2048
disperse.background-heals: 64
performance.write-behind-window-size: 50MB
performance.cache-size: 4GB
cluster.shd-wait-qlength: 32768
cluster.background-self-heal-count: 64
cluster.self-heal-window-size: 16
transport.address-family: inet
nfs.disable: on
cluster.localtime-logging: enable

Comment 5 Shyamsundar 2018-10-23 14:55:20 UTC
Release 3.12 has been EOLed and this bug was still in the NEW state, so the version is being moved to mainline in order to triage it and take appropriate action.

Comment 6 Yaniv Kaul 2019-04-18 09:56:03 UTC
Does it still happen on newer releases?

Comment 7 Yaniv Kaul 2019-06-03 13:34:29 UTC
(In reply to Yaniv Kaul from comment #6)
> Does it still happen on newer releases?

Closing for the time being. Please re-open if you have more information.

Comment 8 kyung-pyo,kim 2020-10-12 04:58:21 UTC
This issue no longer occurs on my current system (glusterfs-6.x).