Bug 1749625 - [GlusterFS 6.1] GlusterFS brick process crash
Summary: [GlusterFS 6.1] GlusterFS brick process crash
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: rpc
Version: 6
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Xavi Hernandez
QA Contact:
URL:
Whiteboard:
Duplicates: 1739884 1740413
Depends On: 1782495
Blocks:
 
Reported: 2019-09-06 03:06 UTC by joe.chan
Modified: 2020-01-27 11:16 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1782495 (view as bug list)
Environment:
Last Closed: 2019-12-24 05:20:55 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
bricks log file (122.75 KB, text/plain), 2019-09-06 03:06 UTC, joe.chan
glusterd.log (22.42 KB, text/plain), 2019-09-06 03:07 UTC, joe.chan
glustershd.log (117.89 KB, text/plain), 2019-09-06 03:07 UTC, joe.chan
2 coredump files (15.73 MB, application/zip), 2019-09-23 04:35 UTC, joe.chan


Links
Gluster.org Gerrit 23872 (Merged): socket: fix error handling. Last Updated: 2019-12-24 05:20:54 UTC

Description joe.chan 2019-09-06 03:06:05 UTC
Created attachment 1612173 [details]
bricks log file

Description of problem:
The glusterfs process somehow crashes after running for a period (1-2 weeks). I would like to know whether this is a bug in GlusterFS 6.1 or some other problem.

First, GlusterFS was installed using the RPMs from http://mirror.centos.org/centos/7/storage/x86_64/gluster-4.0/

client rpm:
glusterfs-6.1-1.el7.x86_64.rpm

server rpm:
glusterfs-server-6.1-1.el7.x86_64.rpm

Below is the structure of the GlusterFS in our servers:

Volume Name: k8s-volume
Type: Replicate
Volume ID: d5d673d6-a1bb-4d14-bc91-1ceab7ad761d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: hk2hp057:/gfs/k8s_pv
Brick2: hk2hp058:/gfs/k8s_pv
Brick3: hk2hp059:/gfs/k8s_pv
Options Reconfigured:
auth.reject: 172.20.117.144
auth.allow: 172.20.117.*
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on


As shown above, there are three servers in total using a "Replicate" volume. Three replicas were created for better reliability and data redundancy, following the official recommendation.


Version-Release number of selected component (if applicable):


However, the gluster process goes down after running for a period. This is easy to see by executing the "gluster" command: $ gluster volume status

Status of volume: k8s-volume
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick hk2hp057:/gfs/k8s_pv                  49152     0          Y       25141
Brick hk2hp058:/gfs/k8s_pv                  N/A       N/A        N       N/A
Brick hk2hp059:/gfs/k8s_pv                  49152     0          Y       6031
Self-heal Daemon on localhost               N/A       N/A        Y       6065
Self-heal Daemon on hk2hp057                N/A       N/A        Y       25150
Self-heal Daemon on hk2hp059                N/A       N/A        Y       6048

Task Status of Volume k8s-volume
------------------------------------------------------------------------------
There are no active volume tasks

The brick on "hk2hp058" is offline.
When I check the log file under "/var/log/glusterfs/bricks/", the messages are as below:

[2019-09-04 07:33:50.004661] W [MSGID: 113117] [posix-metadata.c:627:posix_set_ctime] 0-k8s-volume-posix: posix set mdata failed, No ctime : /gfs/k8s_pv/.glusterfs/dd/56/dd56eca4-e7e8-4208-b589-763a428408e1 gfid:dd56eca4-e7e8-4208-b589-763a428408e1
[2019-09-04 07:33:50.004776] W [MSGID: 113117] [posix-metadata.c:627:posix_set_ctime] 0-k8s-volume-posix: posix set mdata failed, No ctime : /gfs/k8s_pv/.glusterfs/dd/56/dd56eca4-e7e8-4208-b589-763a428408e1 gfid:dd56eca4-e7e8-4208-b589-763a428408e1
pending frames:
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2019-09-04 10:53:02
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.1
/lib64/libglusterfs.so.0(+0x26db0)[0x7fa1fa2d1db0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7fa1fa2dc7b4]
/lib64/libc.so.6(+0x36340)[0x7fa1f8911340]
/usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7fa1ee6964cc]
/lib64/libglusterfs.so.0(+0x8c286)[0x7fa1fa337286]
/lib64/libpthread.so.0(+0x7dd5)[0x7fa1f9111dd5]
/lib64/libc.so.6(clone+0x6d)[0x7fa1f89d902d]
---------

The glusterfs process resumes after manually restarting the "glusterd" service:
[root@hk2hp058 /var/log/glusterfs/bricks] systemctl restart glusterd
[root@hk2hp058 /var/log/glusterfs/bricks] gluster volume status
Status of volume: k8s-volume
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick hk2hp057:/gfs/k8s_pv                  49152     0          Y       25141
Brick hk2hp058:/gfs/k8s_pv                  49152     0          Y       18814
Brick hk2hp059:/gfs/k8s_pv                  49152     0          Y       6031
Self-heal Daemon on localhost               N/A       N/A        Y       18846
Self-heal Daemon on hk2hp059                N/A       N/A        Y       6048
Self-heal Daemon on hk2hp057                N/A       N/A        Y       25150

Task Status of Volume k8s-volume
------------------------------------------------------------------------------
There are no active volume tasks

So, is this issue caused by a bug or by some other reason?

Comment 1 joe.chan 2019-09-06 03:07:07 UTC
Created attachment 1612174 [details]
glusterd.log

Comment 2 joe.chan 2019-09-06 03:07:35 UTC
Created attachment 1612175 [details]
glustershd.log

Comment 3 Nithya Balachandran 2019-09-09 03:24:46 UTC
Do you have a coredump? If yes, please install the debuginfo and send the bt for the core.

Comment 4 joe.chan 2019-09-09 04:03:38 UTC
Thank you for your reply.
Unfortunately, I cannot find any coredump file on the servers matching the pattern "core.*".

Comment 5 Nithya Balachandran 2019-09-12 03:12:40 UTC
In that case, please use the steps in https://github.com/gluster/glusterfs/blob/master/extras/devel-tools/print-backtrace.sh and the backtrace in the log file to get more information about what happened.

Comment 6 joe.chan 2019-09-12 06:46:08 UTC
(In reply to Nithya Balachandran from comment #5)
> In that case please use the steps in
> https://github.com/gluster/glusterfs/blob/master/extras/devel-tools/print-
> backtrace.sh and the backtrace in the  log file to get more information as
> to what happened.

Thank you for your suggestion; however, I encountered an error while running the corresponding script. Is there something wrong?

[hk2hp060@projadm /home/projadm/user/joechan/gfs]  ./print-backtrace_v2.sh /home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64.rpm glusterfs/bricks/20190912.log
eu-addr2line --functions --exe=/home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64/usr/lib/debug/usr/sbin/glusterfsd.debug /usr/sbin/glusterfsd
eu-addr2line: cannot find symbol '/usr/sbin/glusterfsd'
/usr/sbin/glusterfsd
eu-addr2line --functions --exe=/home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64/usr/lib/debug/usr/sbin/glusterfsd.debug /usr/sbin/glusterfsd
eu-addr2line: cannot find symbol '/usr/sbin/glusterfsd'
/usr/sbin/glusterfsd

Comment 7 joe.chan 2019-09-12 06:52:57 UTC
(In reply to joe.chan from comment #6)
> (In reply to Nithya Balachandran from comment #5)
> > In that case please use the steps in
> > https://github.com/gluster/glusterfs/blob/master/extras/devel-tools/print-
> > backtrace.sh and the backtrace in the  log file to get more information as
> > to what happened.
> 
> Thank you for your suggestion, however, I have encountered the error while
> running the corresponding script. Is there something wrong?
> 
> [hk2hp060@projadm /home/projadm/user/joechan/gfs]  ./print-backtrace_v2.sh
> /home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64.rpm
> glusterfs/bricks/20190912.log
> eu-addr2line --functions
> --exe=/home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64/
> usr/lib/debug/usr/sbin/glusterfsd.debug /usr/sbin/glusterfsd
> eu-addr2line: cannot find symbol '/usr/sbin/glusterfsd'
> /usr/sbin/glusterfsd
> eu-addr2line --functions
> --exe=/home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64/
> usr/lib/debug/usr/sbin/glusterfsd.debug /usr/sbin/glusterfsd
> eu-addr2line: cannot find symbol '/usr/sbin/glusterfsd'
> /usr/sbin/glusterfsd

Sorry, that was my mistake. Below is the backtrace from the log file:

[hk2hp060@projadm /home/projadm/user/joechan/gfs]  ./print-backtrace.sh /home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64.rpm glusterfs/bricks/20190912.log
/usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f5061bdf4cc] socket_is_connected inlined at /usr/src/debug/glusterfs-6.1/rpc/rpc-transport/socket/src/socket.c:2887 in socket_event_handler /usr/src/debug/glusterfs-6.1/rpc/rpc-transport/socket/src/socket.c:2619

Comment 8 Nithya Balachandran 2019-09-12 07:55:33 UTC
Can you confirm that you copied the stacktrace alone to a different file and ran the script on that file? You should see one output line for each line in the stacktrace.

Comment 9 joe.chan 2019-09-12 08:10:09 UTC
(In reply to Nithya Balachandran from comment #8)
> Can you confirm that you have copied the stacktrace alone to a different
> file and run the script on thta file? You should be seeing one line per line
> in the stacktrace.

Yes, I confirm that the procedure is correct.

stacktrace from log:
/lib64/libglusterfs.so.0(+0x26db0)[0x7f506d81adb0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f506d8257b4]
/lib64/libc.so.6(+0x36340)[0x7f506be5a340]
/usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f5061bdf4cc]
/lib64/libglusterfs.so.0(+0x8c286)[0x7f506d880286]
/lib64/libpthread.so.0(+0x7dd5)[0x7f506c65add5]
/lib64/libc.so.6(clone+0x6d)[0x7f506bf2202d]

Only one line is produced while running the script:
/usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f5061bdf4cc] socket_is_connected inlined at /usr/src/debug/glusterfs-6.1/rpc/rpc-transport/socket/src/socket.c:2887 in socket_event_handler /usr/src/debug/glusterfs-6.1/rpc/rpc-transport/socket/src/socket.c:2619

Comment 10 joe.chan 2019-09-12 08:35:50 UTC
Attached more debug results obtained by manually executing the "eu-addr2line" command:

/lib64/libglusterfs.so.0(+0x26db0)[0x7f506d81adb0] _gf_msg_backtrace_nomem /usr/src/debug/glusterfs-6.1/libglusterfs/src/logging.c:1124

/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f506d8257b4] sprintf inlined at /usr/src/debug/glusterfs-6.1/libglusterfs/src/common-utils.c:954 in gf_print_trace /usr/include/bits/stdio2.h:33

/lib64/libglusterfs.so.0(+0x8c286)[0x7f506d880286] event_dispatch_epoll_handler inlined at /usr/src/debug/glusterfs-6.1/libglusterfs/src/event-epoll.c:761 in event_dispatch_epoll_worker /usr/src/debug/glusterfs-6.1/libglusterfs/src/event-epoll.c:648


I am not able to find "/lib64/libc.so.6" and "/lib64/libpthread.so.0" under "glusterfs-debuginfo-6.1-1.el7.x86_64/usr/lib/".

Comment 11 Nithya Balachandran 2019-09-13 03:46:41 UTC
libc and libpthread are standard system libraries and not part of gluster.

I'm moving this to the rpc component for someone to look at.

Comment 12 joe.chan 2019-09-23 04:35:12 UTC
Created attachment 1618042 [details]
2 coredump files

Attached the coredump files.

Comment 13 joe.chan 2019-09-23 10:17:58 UTC
Program terminated with signal 11, Segmentation fault.
#0  socket_is_connected (this=0x7f90d80018a0) at socket.c:2619
2619        if (priv->use_ssl) {
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64 libacl-2.2.51-14.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcom_err-1.42.9-13.el7.x86_64 libgcc-4.8.5-36.el7_6.2.x86_64 libselinux-2.5-14.1.el7.x86_64 libuuid-2.23.2-59.el7_6.1.x86_64 openssl-libs-1.0.2k-16.el7_6.1.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  socket_is_connected (this=0x7f90d80018a0) at socket.c:2619
#1  socket_event_handler (fd=5, idx=11, gen=16, data=0x7f90d80018a0, poll_in=1, poll_out=0, poll_err=0, event_thread_died=0 '\000')
    at socket.c:2887
#2  0x00007f90ec34c286 in event_dispatch_epoll_handler (event=0x7f90d7ffee70, event_pool=0x5632b0727560) at event-epoll.c:648
#3  event_dispatch_epoll_worker (data=0x5632b077a420) at event-epoll.c:761
#4  0x00007f90eb126dd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f90ea9ee02d in clone () from /lib64/libc.so.6
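
For context: frame #0 points at socket.c:2619, the line "if (priv->use_ssl)", so priv (the transport's private data) must have been NULL when the event handler fired. Below is a minimal, simplified C sketch of that pattern; the struct layout and names here are stand-ins, not the real socket.c types:

#include <stddef.h>
#include <stdio.h>

/* Simplified stand-ins for the transport and its socket private data. */
typedef struct {
    int use_ssl;
    int connected;
} sock_priv_t;

typedef struct {
    sock_priv_t *private; /* may become NULL once the transport is torn down */
} transport_t;

/* Same pattern as socket_is_connected() at socket.c:2619: priv is
 * dereferenced without checking whether the private data still exists.
 * If an error path already destroyed it, this segfaults (signal 11),
 * which matches frame #0 of the backtrace above. */
static int socket_is_connected_sketch(transport_t *this)
{
    sock_priv_t *priv = this->private;

    if (priv->use_ssl) {  /* crashes here when priv == NULL */
        /* SSL-specific connection check would go here */
    }
    return priv->connected;
}

int main(void)
{
    transport_t t = { .private = NULL };  /* transport already cleaned up */
    /* Running this segfaults, reproducing the same failure mode. */
    printf("%d\n", socket_is_connected_sketch(&t));
    return 0;
}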

Comment 14 Xavi Hernandez 2019-12-09 16:16:33 UTC
@joe, do you have any port scanning tool that could try to probe gluster ports ?

Comment 15 joe.chan 2019-12-11 03:41:04 UTC
(In reply to Xavi Hernandez from comment #14)
> @joe, do you have any port scanning tool that could try to probe gluster
> ports ?

Yes, I use a security scanner called Rapid7 to verify the security hardening of the whole system. I think the root cause of this crash is most likely the scanning, as the crash can be reproduced that way.
Thank you.
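
For reference, below is a minimal C sketch of the kind of short-lived TCP probe a scanner sends to a brick port. The host name and port (hk2hp058, 49152) are taken from the volume status output above; whether a bare connect-and-close alone is enough to trigger the crash depends on the exact error path the scanner exercises, so treat this only as an illustration, not a guaranteed reproducer:

#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;

    int rc = getaddrinfo("hk2hp058", "49152", &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return 1;
    }

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) == 0) {
        /* Close immediately without speaking the Gluster RPC protocol,
         * similar to what a port scanner's TCP probe does. */
        close(fd);
        puts("probe sent");
    } else {
        perror("connect");
    }

    if (fd >= 0)
        close(fd);
    freeaddrinfo(res);
    return 0;
}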

Comment 16 Worker Ant 2019-12-11 17:37:47 UTC
REVIEW: https://review.gluster.org/23861 (socket: fix error handling) posted (#1) for review on master by Xavi Hernandez

Comment 17 Xavi Hernandez 2019-12-11 17:44:36 UTC
I've found the problem and sent a patch to fix it. As soon as it is accepted, I will backport it so that it will be fixed in the next release.
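
For illustration only, here is a hypothetical sketch of the kind of guard a "socket: fix error handling" change could add: detect the error condition first and stop using the transport before its private data is dereferenced. This is an assumption about the general shape of such a fix, not the merged patch (see Gerrit 23861/23872 for the real change):

#include <stddef.h>

/* Stand-in types, same simplification as in the earlier sketch. */
typedef struct {
    int use_ssl;
} sock_priv_t;

typedef struct {
    sock_priv_t *private;   /* NULL once the transport has been torn down */
} transport_t;

static int handle_incoming(transport_t *this) { (void)this; return 0; } /* stub */

int socket_event_handler_sketch(transport_t *this, int poll_in, int poll_err)
{
    /* Guard first: an errored or already-destroyed connection (for example
     * an aborted probe from a port scanner) must not reach the code below,
     * so the private data is never dereferenced after teardown. */
    if (poll_err || this == NULL || this->private == NULL)
        return -1;

    if (poll_in)
        return handle_incoming(this);

    return 0;
}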

Comment 18 Worker Ant 2019-12-11 17:50:13 UTC
REVISION POSTED: https://review.gluster.org/23861 (socket: fix error handling) posted (#2) for review on master by Xavi Hernandez

Comment 19 joe.chan 2019-12-12 02:37:48 UTC
Thank you for your support. As a workaround for now, I have blocked the corresponding port and only allow the GlusterFS members to access it.

Comment 20 Worker Ant 2019-12-13 11:35:31 UTC
REVIEW: https://review.gluster.org/23872 (socket: fix error handling) posted (#1) for review on release-6 by Xavi Hernandez

Comment 21 Worker Ant 2019-12-24 05:20:55 UTC
REVIEW: https://review.gluster.org/23872 (socket: fix error handling) merged (#2) on release-6 by Xavi Hernandez

Comment 22 Xavi Hernandez 2020-01-27 10:25:21 UTC
*** Bug 1740413 has been marked as a duplicate of this bug. ***

Comment 23 Xavi Hernandez 2020-01-27 11:16:42 UTC
*** Bug 1739884 has been marked as a duplicate of this bug. ***

