Bug 1782495 - GlusterFS brick process crash
Summary: GlusterFS brick process crash
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: rpc
Version: mainline
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Xavi Hernandez
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1739884 1749625 1783227
 
Reported: 2019-12-11 17:47 UTC by Xavi Hernandez
Modified: 2020-01-14 10:23 UTC (History)
5 users (show)

Fixed In Version:
Clone Of: 1749625
: 1783227 (view as bug list)
Environment:
Last Closed: 2019-12-12 17:56:51 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Gluster.org Gerrit 23861 0 None Merged socket: fix error handling 2019-12-12 17:56:50 UTC

Description Xavi Hernandez 2019-12-11 17:47:03 UTC
+++ This bug was initially created as a clone of Bug #1749625 +++

Description of problem:
The gluster process somehow crashes after running for a period (1~2 weeks). I would like to know whether this is a bug in GlusterFS 6.1 or some other problem.

Firstly, GlusterFS was installed using the RPMs from http://mirror.centos.org/centos/7/storage/x86_64/gluster-4.0/

client rpm:
glusterfs-6.1-1.el7.x86_64.rpm

server rpm:
glusterfs-server-6.1-1.el7.x86_64.rpm

Below is the structure of the GlusterFS in our servers:

Volume Name: k8s-volume
Type: Replicate
Volume ID: d5d673d6-a1bb-4d14-bc91-1ceab7ad761d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: hk2hp057:/gfs/k8s_pv
Brick2: hk2hp058:/gfs/k8s_pv
Brick3: hk2hp059:/gfs/k8s_pv
Options Reconfigured:
auth.reject: 172.20.117.144
auth.allow: 172.20.117.*
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on


As shown above, there are three servers in total using a "Replicated Volume". Three replicas were created in pursuit of better reliability and data redundancy, in line with the official recommendation.
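
For reference, a minimal sketch of how such a replica-3 volume is typically created (hostnames and brick paths are taken from the brick list above; the exact commands used on these servers are an assumption):

$ gluster volume create k8s-volume replica 3 \
      hk2hp057:/gfs/k8s_pv hk2hp058:/gfs/k8s_pv hk2hp059:/gfs/k8s_pv
$ gluster volume start k8s-volume
# the "Options Reconfigured" values above are then set individually, e.g.:
$ gluster volume set k8s-volume auth.allow '172.20.117.*'
$ gluster volume set k8s-volume nfs.disable on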


Version-Release number of selected component (if applicable):


However, the gluster process goes down after running for a period. This is easy to see by executing the "gluster" command - $ gluster volume status

Status of volume: k8s-volume
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick hk2hp057:/gfs/k8s_pv                  49152     0          Y       25141
Brick hk2hp058:/gfs/k8s_pv                  N/A       N/A        N       N/A
Brick hk2hp059:/gfs/k8s_pv                  49152     0          Y       6031
Self-heal Daemon on localhost               N/A       N/A        Y       6065
Self-heal Daemon on hk2hp057                N/A       N/A        Y       25150
Self-heal Daemon on hk2hp059                N/A       N/A        Y       6048

Task Status of Volume k8s-volume
------------------------------------------------------------------------------
There are no active volume tasks

The brick on "hk2hp058" is offline.
When I check the log file under "/var/log/glusterfs/bricks/", the messages are as below:

[2019-09-04 07:33:50.004661] W [MSGID: 113117] [posix-metadata.c:627:posix_set_ctime] 0-k8s-volume-posix: posix set mdata failed, No ctime : /gfs/k8s_pv/.glusterfs/dd/56/dd56eca4-e7e8-4208-b589-763a428408e1 gfid:dd56eca4-e7e8-4208-b589-763a428408e1
[2019-09-04 07:33:50.004776] W [MSGID: 113117] [posix-metadata.c:627:posix_set_ctime] 0-k8s-volume-posix: posix set mdata failed, No ctime : /gfs/k8s_pv/.glusterfs/dd/56/dd56eca4-e7e8-4208-b589-763a428408e1 gfid:dd56eca4-e7e8-4208-b589-763a428408e1
pending frames:
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2019-09-04 10:53:02
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.1
/lib64/libglusterfs.so.0(+0x26db0)[0x7fa1fa2d1db0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7fa1fa2dc7b4]
/lib64/libc.so.6(+0x36340)[0x7fa1f8911340]
/usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7fa1ee6964cc]
/lib64/libglusterfs.so.0(+0x8c286)[0x7fa1fa337286]
/lib64/libpthread.so.0(+0x7dd5)[0x7fa1f9111dd5]
/lib64/libc.so.6(clone+0x6d)[0x7fa1f89d902d]
---------

The glusterfs process resumes after manually restarting "glusterd":
[root@hk2hp058 /var/log/glusterfs/bricks] systemctl restart glusterd
[root@hk2hp058 /var/log/glusterfs/bricks] gluster volume status
Status of volume: k8s-volume
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick hk2hp057:/gfs/k8s_pv                  49152     0          Y       25141
Brick hk2hp058:/gfs/k8s_pv                  49152     0          Y       18814
Brick hk2hp059:/gfs/k8s_pv                  49152     0          Y       6031
Self-heal Daemon on localhost               N/A       N/A        Y       18846
Self-heal Daemon on hk2hp059                N/A       N/A        Y       6048
Self-heal Daemon on hk2hp057                N/A       N/A        Y       25150

Task Status of Volume k8s-volume
------------------------------------------------------------------------------
There are no active volume tasks

Thus, is this a bug, or does some other reason lead to this issue?

--- Additional comment from Nithya Balachandran on 2019-09-09 05:24:46 CEST ---

Do you have a coredump? If yes, please install the debuginfo and send the bt for the core.
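
A minimal sketch of the usual steps on CentOS 7 (the core file path is a placeholder; debuginfo-install comes from yum-utils):

$ debuginfo-install glusterfs
$ gdb /usr/sbin/glusterfsd /path/to/core.<pid>
(gdb) bt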

--- Additional comment from  on 2019-09-09 06:03:38 CEST ---

Thank you for your reply.
Unfortunately, I cannot find any coredump file in the servers with pattern "core.*"

--- Additional comment from Nithya Balachandran on 2019-09-12 05:12:40 CEST ---

In that case, please use the steps in https://github.com/gluster/glusterfs/blob/master/extras/devel-tools/print-backtrace.sh and the backtrace in the log file to get more information as to what happened.

--- Additional comment from  on 2019-09-12 08:46:08 CEST ---

(In reply to Nithya Balachandran from comment #5)
> In that case please use the steps in
> https://github.com/gluster/glusterfs/blob/master/extras/devel-tools/print-
> backtrace.sh and the backtrace in the  log file to get more information as
> to what happened.

Thank you for your suggestion; however, I encountered an error while running the corresponding script. Is there something wrong?

[hk2hp060@projadm /home/projadm/user/joechan/gfs]  ./print-backtrace_v2.sh /home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64.rpm glusterfs/bricks/20190912.log
eu-addr2line --functions --exe=/home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64/usr/lib/debug/usr/sbin/glusterfsd.debug /usr/sbin/glusterfsd
eu-addr2line: cannot find symbol '/usr/sbin/glusterfsd'
/usr/sbin/glusterfsd
eu-addr2line --functions --exe=/home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64/usr/lib/debug/usr/sbin/glusterfsd.debug /usr/sbin/glusterfsd
eu-addr2line: cannot find symbol '/usr/sbin/glusterfsd'
/usr/sbin/glusterfsd

--- Additional comment from  on 2019-09-12 08:52:57 CEST ---

(In reply to joe.chan from comment #6)
> (In reply to Nithya Balachandran from comment #5)
> > In that case please use the steps in
> > https://github.com/gluster/glusterfs/blob/master/extras/devel-tools/print-
> > backtrace.sh and the backtrace in the  log file to get more information as
> > to what happened.
> 
> Thank you for your suggestion, however, I have encountered the error while
> running the corresponding script. Is there something wrong?
> 
> [hk2hp060@projadm /home/projadm/user/joechan/gfs]  ./print-backtrace_v2.sh
> /home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64.rpm
> glusterfs/bricks/20190912.log
> eu-addr2line --functions
> --exe=/home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64/
> usr/lib/debug/usr/sbin/glusterfsd.debug /usr/sbin/glusterfsd
> eu-addr2line: cannot find symbol '/usr/sbin/glusterfsd'
> /usr/sbin/glusterfsd
> eu-addr2line --functions
> --exe=/home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64/
> usr/lib/debug/usr/sbin/glusterfsd.debug /usr/sbin/glusterfsd
> eu-addr2line: cannot find symbol '/usr/sbin/glusterfsd'
> /usr/sbin/glusterfsd

Sorry, that was my mistake. Below is the backtrace from the log file:

[hk2hp060@projadm /home/projadm/user/joechan/gfs]  ./print-backtrace.sh /home/projadm/user/joechan/gfs/glusterfs-debuginfo-6.1-1.el7.x86_64.rpm glusterfs/bricks/20190912.log
/usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f5061bdf4cc] socket_is_connected inlined at /usr/src/debug/glusterfs-6.1/rpc/rpc-transport/socket/src/socket.c:2887 in socket_event_handler /usr/src/debug/glusterfs-6.1/rpc/rpc-transport/socket/src/socket.c:2619

--- Additional comment from Nithya Balachandran on 2019-09-12 09:55:33 CEST ---

Can you confirm that you copied the stacktrace alone to a different file and ran the script on that file? You should see one output line per line of the stacktrace.

--- Additional comment from  on 2019-09-12 10:10:09 CEST ---

(In reply to Nithya Balachandran from comment #8)
> Can you confirm that you have copied the stacktrace alone to a different
> file and run the script on thta file? You should be seeing one line per line
> in the stacktrace.

Yes, it is confirmed that the procedure is correct.

stacktrace from log:
/lib64/libglusterfs.so.0(+0x26db0)[0x7f506d81adb0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f506d8257b4]
/lib64/libc.so.6(+0x36340)[0x7f506be5a340]
/usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f5061bdf4cc]
/lib64/libglusterfs.so.0(+0x8c286)[0x7f506d880286]
/lib64/libpthread.so.0(+0x7dd5)[0x7f506c65add5]
/lib64/libc.so.6(clone+0x6d)[0x7f506bf2202d]

only one line while running the script:
/usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f5061bdf4cc] socket_is_connected inlined at /usr/src/debug/glusterfs-6.1/rpc/rpc-transport/socket/src/socket.c:2887 in socket_event_handler /usr/src/debug/glusterfs-6.1/rpc/rpc-transport/socket/src/socket.c:2619

--- Additional comment from  on 2019-09-12 10:35:50 CEST ---

Attached more debug results obtained by manually executing the "eu-addr2line" command:

/lib64/libglusterfs.so.0(+0x26db0)[0x7f506d81adb0] _gf_msg_backtrace_nomem /usr/src/debug/glusterfs-6.1/libglusterfs/src/logging.c:1124

/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f506d8257b4] sprintf inlined at /usr/src/debug/glusterfs-6.1/libglusterfs/src/common-utils.c:954 in gf_print_trace /usr/include/bits/stdio2.h:33

/lib64/libglusterfs.so.0(+0x8c286)[0x7f506d880286] event_dispatch_epoll_handler inlined at /usr/src/debug/glusterfs-6.1/libglusterfs/src/event-epoll.c:761 in event_dispatch_epoll_worker /usr/src/debug/glusterfs-6.1/libglusterfs/src/event-epoll.c:648


I am not able to find "/lib64/libc.so.6" and "/lib64/libpthread.so.0" under "glusterfs-debuginfo-6.1-1.el7.x86_64/usr/lib/".

--- Additional comment from Nithya Balachandran on 2019-09-13 05:46:41 CEST ---

libc and libpthread are standard libraries and not part of gluster.

I'm moving this to the rpc component for someone to look at.

--- Additional comment from  on 2019-09-23 06:35:12 CEST ---

Attached the coredump files.

--- Additional comment from  on 2019-09-23 12:17:58 CEST ---

Program terminated with signal 11, Segmentation fault.
#0  socket_is_connected (this=0x7f90d80018a0) at socket.c:2619
2619        if (priv->use_ssl) {
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64 libacl-2.2.51-14.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcom_err-1.42.9-13.el7.x86_64 libgcc-4.8.5-36.el7_6.2.x86_64 libselinux-2.5-14.1.el7.x86_64 libuuid-2.23.2-59.el7_6.1.x86_64 openssl-libs-1.0.2k-16.el7_6.1.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  socket_is_connected (this=0x7f90d80018a0) at socket.c:2619
#1  socket_event_handler (fd=5, idx=11, gen=16, data=0x7f90d80018a0, poll_in=1, poll_out=0, poll_err=0, event_thread_died=0 '\000')
    at socket.c:2887
#2  0x00007f90ec34c286 in event_dispatch_epoll_handler (event=0x7f90d7ffee70, event_pool=0x5632b0727560) at event-epoll.c:648
#3  event_dispatch_epoll_worker (data=0x5632b077a420) at event-epoll.c:761
#4  0x00007f90eb126dd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f90ea9ee02d in clone () from /lib64/libc.so.6
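
To illustrate why the fault lands exactly on the "if (priv->use_ssl)" check, here is a standalone sketch (this is not the glusterfs source and not the Gerrit 23861 patch; the struct and field names are simplified assumptions):

/* crash_sketch.c - illustrates a NULL private-data dereference like the
 * one shown in the backtrace above (socket.c:2619). */
#include <stdio.h>

typedef struct {
    int use_ssl;
    int connected;
} socket_private_t;

typedef struct {
    socket_private_t *private_data;   /* hypothetical field name */
} transport_t;

/* If an error path has already torn down the connection and left
 * private_data NULL (or freed) while the epoll handler still fires
 * for the fd, this dereference is the SIGSEGV seen above. */
static int socket_is_connected_sketch(transport_t *this)
{
    socket_private_t *priv = this->private_data;
    if (priv->use_ssl)                /* faults here when priv == NULL */
        return priv->connected;
    return priv->connected;
}

int main(void)
{
    socket_private_t live = { .use_ssl = 0, .connected = 1 };
    transport_t ok = { .private_data = &live };
    transport_t torn_down = { .private_data = NULL };

    printf("connected: %d\n", socket_is_connected_sketch(&ok));  /* fine */
    /* socket_is_connected_sketch(&torn_down) would segfault here,
     * matching the crash in the backtrace above. */
    (void)torn_down;
    return 0;
}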

--- Additional comment from Xavi Hernandez on 2019-12-09 17:16:33 CET ---

@joe, do you have any port scanning tool that could try to probe gluster ports?

--- Additional comment from  on 2019-12-11 04:41:04 CET ---

(In reply to Xavi Hernandez from comment #14)
> @joe, do you have any port scanning tool that could try to probe gluster
> ports ?

Yes, I have used a security scanner called Rapid7 to ensure the security hardening of the whole system. I think the root cause of this crash is most likely the scanning, as the crash can be reproduced that way.
Thank you.

--- Additional comment from Worker Ant on 2019-12-11 18:37:47 CET ---

REVIEW: https://review.gluster.org/23861 (socket: fix error handling) posted (#1) for review on master by Xavi Hernandez

--- Additional comment from Xavi Hernandez on 2019-12-11 18:44:36 CET ---

I've found the problem and sent a patch to fix it. As soon as it is accepted, I will backport it so that it is fixed in the next release.

Comment 1 Worker Ant 2019-12-11 17:50:17 UTC
REVIEW: https://review.gluster.org/23861 (socket: fix error handling) posted (#2) for review on master by Xavi Hernandez

Comment 2 Worker Ant 2019-12-12 17:56:51 UTC
REVIEW: https://review.gluster.org/23861 (socket: fix error handling) merged (#3) on master by Raghavendra G

Comment 3 Oonkwee Lim 2020-01-13 19:47:22 UTC
Customer is asking:

Can you possibly help us out and see if there are any logs referencing the IP addresses of whatever else could be scanning those ports?

Comment 4 Xavi Hernandez 2020-01-14 10:23:16 UTC
I see there's a message logged each time a connection is accepted, but it is only logged at the TRACE logging level. I don't recommend switching to the TRACE log level in production for a long period.
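
For completeness, a sketch of how the brick log level would be raised (the option name should be verified with "gluster volume set help"; revert it promptly):

$ gluster volume set k8s-volume diagnostics.brick-log-level TRACE
# collect logs, then revert:
$ gluster volume set k8s-volume diagnostics.brick-log-level INFO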

It would be better, for example, to run tcpdump with a filter that only monitors SYN packets to some ports, and then collect the source addresses.
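
A minimal sketch of such a capture (the brick port range is an assumption; use the ports reported by "gluster volume status"):

$ tcpdump -nn -i any \
    'tcp[tcpflags] & (tcp-syn) != 0 and tcp[tcpflags] & (tcp-ack) == 0 and dst portrange 49152-49251'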

