Bug 1635784 - brick process segfault [NEEDINFO]
Summary: brick process segfault
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: GlusterFS
Classification: Community
Component: index
Version: mainline
Hardware: All
OS: All
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-03 16:17 UTC by kyung-pyo,kim
Modified: 2019-06-03 13:34 UTC
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-03 13:34:29 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
ykaul: needinfo? (hgichon)


Attachments
GDB dump log (1.57 MB, image/jpeg)
2018-10-03 16:17 UTC, kyung-pyo,kim
brick.log (268.92 KB, text/plain)
2018-10-06 06:58 UTC, kyung-pyo,kim

Description kyung-pyo,kim 2018-10-03 16:17:36 UTC
Created attachment 1490178 [details]
GDB dump log

Description of problem:

- A few days ago I found that my EC (4+2) volume was degraded.
- I am using 3.12.13-1.el7.x86_64.
- One brick was down; the brick log is below.
- I suspect a bug involving loc->inode in index.c (see the attached picture).
- In GDB, loc->inode is NULL.
- The call in question: inode_find (loc->inode->table, loc->gfid);
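
To illustrate the suspicion above, here is a minimal sketch of a NULL guard in front of that dereference. It is not the actual index.c code and not a proposed patch: the struct definitions are simplified stand-ins for the GlusterFS types, and loc_table_or_null and guarded_lookup are hypothetical helper names invented for this example.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real GlusterFS types; only the fields
 * needed for this illustration are modelled. */
typedef struct inode_table { const char *name; } inode_table_t;
typedef struct inode { inode_table_t *table; } inode_t;
typedef struct loc {
    inode_t *inode;            /* NULL in the reported GDB session */
    unsigned char gfid[16];
} loc_t;

/* Hypothetical helper: return the inode table reachable from loc, or NULL
 * when loc or loc->inode is missing, instead of dereferencing blindly. */
static inode_table_t *
loc_table_or_null (const loc_t *loc)
{
    if (loc == NULL || loc->inode == NULL)
        return NULL;
    return loc->inode->table;
}

/* Hypothetical caller: fails cleanly with -1 when no table is available. */
static int
guarded_lookup (const loc_t *loc)
{
    inode_table_t *table = loc_table_or_null (loc);

    if (table == NULL) {
        fprintf (stderr, "loc->inode is NULL, skipping inode_find\n");
        return -1;
    }
    /* The real code would call inode_find (table, loc->gfid) here. */
    return 0;
}
```

With a check like this, a loc whose inode is NULL would produce an error return rather than a SIGSEGV. Whether an error return is the right behaviour for index.c is an assumption; the real fix may belong earlier, where loc is populated.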


Version-Release number of selected component (if applicable):
- 3.12.13

How reproducible:
- I don't know.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

glusterfsd brick error log
[2018-09-29 13:22:36.536532] W [inode.c:942:inode_find] (-->/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xd01c) [0x7f9bd249401c] -->/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xc638) [0x7f9bd2493638] -->/lib64/libglusterfs.so.0(inode_find+0x92) [0x7f9be7090a82] ) 0-gluvol02-05-server: table not found
[2018-09-29 13:22:36.536579] W [inode.c:680:inode_new] (-->/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xd048) [0x7f9bd2494048] -->/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xc14d) [0x7f9bd249314d] -->/lib64/libglusterfs.so.0(inode_new+0x8a) [0x7f9be70900ba] ) 0-gluvol02-05-server: inode not found
[2018-09-29 13:22:36.537568] W [inode.c:2305:inode_is_linked] (-->/usr/lib64/glusterfs/3.12.13/xlator/features/quota.so(+0x4fc6) [0x7f9bd2b1cfc6] -->/usr/lib64/glusterfs/3.12.13/xlator/features/index.so(+0x4bb9) [0x7f9bd2d43bb9] -->/lib64/libglusterfs.so.0(inode_is_linked+0x8a) [0x7f9be70927ea] ) 0-gluvol02-05-index: inode not found
pending frames:
frame : type(0) op(18)
frame : type(0) op(18)
frame : type(0) op(28)
--snip --
frame : type(0) op(28)
frame : type(0) op(28)
frame : type(0) op(18)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-09-29 13:22:36
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.13
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f9be70804c0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f9be708a3f4]
/lib64/libc.so.6(+0x362f0)[0x7f9be56e02f0]
/usr/lib64/glusterfs/3.12.13/xlator/features/index.so(+0x4bc4)[0x7f9bd2d43bc4]
/usr/lib64/glusterfs/3.12.13/xlator/features/quota.so(+0x4fc6)[0x7f9bd2b1cfc6]
/usr/lib64/glusterfs/3.12.13/xlator/debug/io-stats.so(+0x4e53)[0x7f9bd28eee53]
/lib64/libglusterfs.so.0(default_lookup+0xbd)[0x7f9be70fddfd]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xc342)[0x7f9bd2493342]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xd048)[0x7f9bd2494048]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xd2c0)[0x7f9bd24942c0]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xc89e)[0x7f9bd249389e]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xd354)[0x7f9bd2494354]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0x2f829)[0x7f9bd24b6829]
/lib64/libgfrpc.so.0(rpcsvc_request_handler+0x96)[0x7f9be6e42246]
/lib64/libpthread.so.0(+0x7e25)[0x7f9be5edfe25]
/lib64/libc.so.6(clone+0x6d)[0x7f9be57a8bad]

Comment 2 kyung-pyo,kim 2018-10-06 06:58:16 UTC
Created attachment 1490994 [details]
brick.log

Comment 3 kyung-pyo,kim 2018-10-06 06:59:33 UTC
Yesterday, another brick died with the same symptom.
core file : http://ac2repo.gluesys.com/ac2repo/down/core.50570.tgz 

Volume configuration:
- We have 10 EC volumes across 6 nodes (4+2).
- Each brick size is 37TB
- The volume information below is for one of the 10 EC volumes.
- All volumes have the same configuration.

Data characteristics of each volume
- df -i: 8M
- 95% mp4 files (~10MB), plus some txt information files

File/Dir Layout
- /
  └── indexdir ( about 1000)
       └── datadir ( about 800) 
            └── data : about 10 (txt and mp4)      
NOTE:
- An aggressive self-heal job is currently running.
- We deleted all bricks on one node and are now monitoring the self-heal status.
- Self-heal daemon memory usage reaches 75GB.
- Before the brick segfaulted, there were many gfid fd cleanup calls.

Volume Name: xxxxxxx-01
Type: Disperse
Volume ID: cac0ab6a-55bd-48ed-ac7a-92f0cb4aca80
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: xxxxxxx-GLUSTER2-1:/gluster/brick1/data
Brick2: xxxxxxx-GLUSTER2-2:/gluster/brick1/data
Brick3: xxxxxxx-GLUSTER2-3:/gluster/brick1/data
Brick4: xxxxxxx-GLUSTER2-4:/gluster/brick1/data
Brick5: xxxxxxx-GLUSTER2-5:/gluster/brick1/data
Brick6: xxxxxxx-GLUSTER2-6:/gluster/brick1/data
Options Reconfigured:
performance.io-thread-count: 64
performance.least-prio-threads: 64
performance.high-prio-threads: 64
performance.normal-prio-threads: 64
performance.low-prio-threads: 64
server.event-threads: 1024
client.event-threads: 32
cluster.lookup-optimize: on
performance.parallel-readdir: on
cluster.use-compound-fops: on
performance.nl-cache: on
performance.nl-cache-positive-entry: on
performance.nl-cache-limit: 1GB
network.inode-lru-limit: 200000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
disperse.shd-wait-qlength: 32768
disperse.shd-max-threads: 16
disperse.self-heal-window-size: 16
disperse.heal-wait-qlength: 2048
disperse.background-heals: 64
performance.write-behind-window-size: 50MB
performance.cache-size: 4GB
cluster.shd-wait-qlength: 32768
cluster.background-self-heal-count: 64
cluster.self-heal-window-size: 16
transport.address-family: inet
nfs.disable: on
cluster.localtime-logging: enable

Comment 5 Shyamsundar 2018-10-23 14:55:20 UTC
Release 3.12 has reached EOL and this bug was still in the NEW state, so I am moving the version to mainline for triage and appropriate action.

Comment 6 Yaniv Kaul 2019-04-18 09:56:03 UTC
Does it still happen on newer releases?

Comment 7 Yaniv Kaul 2019-06-03 13:34:29 UTC
(In reply to Yaniv Kaul from comment #6)
> Does it still happen on newer releases?

Closing for the time being. Please re-open if you have more information.

