1635784 – brick process segfault

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	index
Sub Component:
Version:	mainline
Hardware:	All
OS:	All
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	bugs@gluster.org
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-10-03 16:17 UTC by kyung-pyo,kim
Modified:	2020-10-12 04:58 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2019-06-03 13:34:29 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:
Flags:	hgichon: needinfo+

Attachments
GDB dump log (1.57 MB, image/jpeg) 2018-10-03 16:17 UTC, kyung-pyo,kim
brick.log (268.92 KB, text/plain) 2018-10-06 06:58 UTC, kyung-pyo,kim

Description kyung-pyo,kim 2018-10-03 16:17:36 UTC

Created attachment 1490178
GDB dump log

Description of problem:

- A few days ago I found that my EC (4+2) volume was degraded.
- I am using glusterfs 3.12.13-1.el7.x86_64.
- One brick was down; the brick log is below.
- I suspect a bug involving loc->inode in index.c (see the attached picture).
- In GDB, loc->inode is NULL.
- The suspect line is: inode_find (loc->inode->table, loc->gfid);
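
The suspected failure mode can be illustrated with a minimal, self-contained sketch. The types `sketch_table_t`, `sketch_inode_t` and `sketch_loc_t` and the helper `safe_table_of` are hypothetical stand-ins invented for this example; they are not the GlusterFS definitions and this is not the upstream fix. The sketch only shows that `loc->inode` has to be checked before `loc->inode->table` is read.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; NOT the GlusterFS types. */
typedef struct sketch_table { int id; } sketch_table_t;
typedef struct sketch_inode { sketch_table_t *table; } sketch_inode_t;
typedef struct sketch_loc   { sketch_inode_t *inode; } sketch_loc_t;

/* Return the inode table reachable from loc, or NULL when loc or
 * loc->inode is NULL. Reading loc->inode->table without this check
 * dereferences a NULL pointer when loc->inode is NULL, which is the
 * situation reported from GDB above. */
sketch_table_t *safe_table_of(const sketch_loc_t *loc)
{
    if (loc == NULL || loc->inode == NULL)
        return NULL;
    return loc->inode->table;
}
```

A caller would test the return value and take an error path when it is NULL instead of passing the table on to a lookup.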


Version-Release number of selected component (if applicable):
- 3.12.13

How reproducible:
- I don't know

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

glusterfsd brick error log
[2018-09-29 13:22:36.536532] W [inode.c:942:inode_find] (-->/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xd01c) [0x7f9bd249401c] -->/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xc638) [0x7f9bd2493638] -->/lib64/libglusterfs.so.0(inode_find+0x92) [0x7f9be7090a82] ) 0-gluvol02-05-server: table not found
[2018-09-29 13:22:36.536579] W [inode.c:680:inode_new] (-->/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xd048) [0x7f9bd2494048] -->/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xc14d) [0x7f9bd249314d] -->/lib64/libglusterfs.so.0(inode_new+0x8a) [0x7f9be70900ba] ) 0-gluvol02-05-server: inode not found
[2018-09-29 13:22:36.537568] W [inode.c:2305:inode_is_linked] (-->/usr/lib64/glusterfs/3.12.13/xlator/features/quota.so(+0x4fc6) [0x7f9bd2b1cfc6] -->/usr/lib64/glusterfs/3.12.13/xlator/features/index.so(+0x4bb9) [0x7f9bd2d43bb9] -->/lib64/libglusterfs.so.0(inode_is_linked+0x8a) [0x7f9be70927ea] ) 0-gluvol02-05-index: inode not found
pending frames:
frame : type(0) op(18)
frame : type(0) op(18)
frame : type(0) op(28)
--snip --
frame : type(0) op(28)
frame : type(0) op(28)
frame : type(0) op(18)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-09-29 13:22:36
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.13
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f9be70804c0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f9be708a3f4]
/lib64/libc.so.6(+0x362f0)[0x7f9be56e02f0]
/usr/lib64/glusterfs/3.12.13/xlator/features/index.so(+0x4bc4)[0x7f9bd2d43bc4]
/usr/lib64/glusterfs/3.12.13/xlator/features/quota.so(+0x4fc6)[0x7f9bd2b1cfc6]
/usr/lib64/glusterfs/3.12.13/xlator/debug/io-stats.so(+0x4e53)[0x7f9bd28eee53]
/lib64/libglusterfs.so.0(default_lookup+0xbd)[0x7f9be70fddfd]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xc342)[0x7f9bd2493342]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xd048)[0x7f9bd2494048]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xd2c0)[0x7f9bd24942c0]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xc89e)[0x7f9bd249389e]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0xd354)[0x7f9bd2494354]
/usr/lib64/glusterfs/3.12.13/xlator/protocol/server.so(+0x2f829)[0x7f9bd24b6829]
/lib64/libgfrpc.so.0(rpcsvc_request_handler+0x96)[0x7f9be6e42246]
/lib64/libpthread.so.0(+0x7e25)[0x7f9be5edfe25]
/lib64/libc.so.6(clone+0x6d)[0x7f9be57a8bad]
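
Most frames in the backtrace above have the form MODULE(+0xOFFSET)[0xADDRESS], with no symbol name. As a rough aid for reading them, the sketch below splits such a frame into its module path and offset. The function name `parse_frame` is made up for this example and is not part of GlusterFS or any library.

```c
#include <stdio.h>
#include <string.h>

/* Parse one backtrace frame of the form  MODULE(+0xOFFSET)[0xADDRESS].
 * On success, copy the module path into module (at most module_len bytes
 * including the terminator), store the offset, and return 1.
 * Return 0 for frames that do not match, for example frames that carry a
 * symbol name such as "(clone+0x6d)". */
int parse_frame(const char *line, char *module, size_t module_len,
                unsigned long *offset)
{
    const char *paren = strstr(line, "(+0x");
    size_t n;

    if (paren == NULL)
        return 0;
    n = (size_t)(paren - line);
    if (n == 0 || n >= module_len)
        return 0;
    memcpy(module, line, n);
    module[n] = '\0';
    /* The literal "0x" in the format is consumed first; %lx reads the digits. */
    return sscanf(paren, "(+0x%lx)", offset) == 1;
}
```

For the crashing frame, this yields the module /usr/lib64/glusterfs/3.12.13/xlator/features/index.so and the offset 0x4bc4. With matching debuginfo installed, that pair is the kind of input a symbolizer such as `addr2line -e MODULE OFFSET` takes to map the offset back to a source line.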

Comment 2 kyung-pyo,kim 2018-10-06 06:58:16 UTC

Created attachment 1490994
brick.log

Comment 3 kyung-pyo,kim 2018-10-06 06:59:33 UTC

Yesterday, another brick died with the same symptom.
core file : http://ac2repo.gluesys.com/ac2repo/down/core.50570.tgz

Volume configuration:
- We have 10 EC volumes across 6 nodes (4+2).
- Each brick is 37TB.
- The volume information below is for one of the 10 EC volumes.
- All volumes have the same configuration.

Data characteristics of each volume:
- df -i : 8M
- 95% mp4 files (~10MB), plus some txt information files

File/Dir layout:
- /
  └── indexdir (about 1000)
       └── datadir (about 800)
            └── data : about 10 (txt and mp4)

NOTE:
- An aggressive self-heal job is currently running.
- We deleted all bricks on one node and are now monitoring the self-heal status.
- Self-heal daemon memory usage reaches 75GB.
- Before the brick segfaulted, there were many gfid fd cleanup calls.

Volume Name: xxxxxxx-01
Type: Disperse
Volume ID: cac0ab6a-55bd-48ed-ac7a-92f0cb4aca80
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: xxxxxxx-GLUSTER2-1:/gluster/brick1/data
Brick2: xxxxxxx-GLUSTER2-2:/gluster/brick1/data
Brick3: xxxxxxx-GLUSTER2-3:/gluster/brick1/data
Brick4: xxxxxxx-GLUSTER2-4:/gluster/brick1/data
Brick5: xxxxxxx-GLUSTER2-5:/gluster/brick1/data
Brick6: xxxxxxx-GLUSTER2-6:/gluster/brick1/data
Options Reconfigured:
performance.io-thread-count: 64
performance.least-prio-threads: 64
performance.high-prio-threads: 64
performance.normal-prio-threads: 64
performance.low-prio-threads: 64
server.event-threads: 1024
client.event-threads: 32
cluster.lookup-optimize: on
performance.parallel-readdir: on
cluster.use-compound-fops: on
performance.nl-cache: on
performance.nl-cache-positive-entry: on
performance.nl-cache-limit: 1GB
network.inode-lru-limit: 200000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
disperse.shd-wait-qlength: 32768
disperse.shd-max-threads: 16
disperse.self-heal-window-size: 16
disperse.heal-wait-qlength: 2048
disperse.background-heals: 64
performance.write-behind-window-size: 50MB
performance.cache-size: 4GB
cluster.shd-wait-qlength: 32768
cluster.background-self-heal-count: 64
cluster.self-heal-window-size: 16
transport.address-family: inet
nfs.disable: on
cluster.localtime-logging: enable

Comment 5 Shyamsundar 2018-10-23 14:55:20 UTC

Release 3.12 has been EOLd and this bug was still found to be in the NEW state, hence moving the version to mainline, to triage the same and take appropriate actions.

Comment 6 Yaniv Kaul 2019-04-18 09:56:03 UTC

Does it still happen on newer releases?

Comment 7 Yaniv Kaul 2019-06-03 13:34:29 UTC

(In reply to Yaniv Kaul from comment #6)
> Does it still happen on newer releases?

Closing for the time being. Please re-open if you have more information.

Comment 8 kyung-pyo,kim 2020-10-12 04:58:21 UTC

This issue no longer occurs on my current system (glusterfs-6.x).
