Bug 1272436
| Summary: | glusterd crashing | | |
| --- | --- | --- | --- |
| Product: | [Community] GlusterFS | Reporter: | gene |
| Component: | glusterd | Assignee: | Atin Mukherjee <amukherj> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.7.4 | CC: | amukherj, bugs, florian.leduc, gene, mselvaga, nicolas, smohan |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-06-22 05:06:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1288060 | | |
| Bug Blocks: | | | |
Description
gene, 2015-10-16 12:05:28 UTC
Hello guys, I've been experiencing the same issue lately. I've got a 255 GB core dump ... that I couldn't exploit. The symptoms are much the same as explained in the original report (https://www.gluster.org/pipermail/gluster-users/2015-October/023784.html):

    glustershd.log.2.gz:[2015-11-26 06:47:59.053991] W [glusterfsd.c:1236:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7fbe53b8f182] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fbe548cc7c5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x69) [0x7fbe548cc659] ) 0-: received signum (15), shutting down

After that message, glusterd has crashed. I'm running glusterfs on Ubuntu 14.04:

    ii glusterfs-client 3.7.6-ubuntu1~trusty1 amd64 clustered file-system (client package)
    ii glusterfs-common 3.7.6-ubuntu1~trusty1 amd64 GlusterFS common libraries and translator modules
    ii glusterfs-server 3.7.6-ubuntu1~trusty1 amd64 clustered file-system (server package)

I will follow this thread; if you need more input, feel free to let me know.

Hello guys, Any updates on this bug? What do you suggest? Should I wait for an imminent patch or downgrade my servers to a prior version (<= 3.7.4)?

Could you mention the configuration values for the following options in the glusterd.vol file?

    ping-timeout
    event-threads

We observed a few crashes when multi-threaded epoll support was enabled in glusterd, and I suspect this could be one of them. We had decided to revert the settings. You shouldn't be seeing this crash from 3.7.6 onwards.

Hello, thanks for your quick answers. Here's a sample of glusterd.vol:

    volume management
        type mgmt/glusterd
        option working-directory /var/lib/glusterd
        option transport-type socket,rdma
        option transport.socket.keepalive-time 10
        option transport.socket.keepalive-interval 2
        option transport.socket.read-fail-log off
        option ping-timeout 30
    #   option base-port 49152
    end-volume

Are you OK to upgrade to 3.7.6 and try it out?

Hello, The main problem is that I'm already using that version (see comment #1). Should I downgrade?

(In reply to florian.leduc from comment #6)
> Hello,
>
> The main problem is that I'm already using that version (see comment #1).
> Should I downgrade?

Hi Florian, Could you make a configuration change in the glusterd.vol file, which is present at /usr/local/etc/glusterfs/glusterd.vol? In that file, add/modify the entries below:

    option event-threads 1
    option ping-timeout 0

Then restart glusterd and let me know if you face the glusterd crash problem again.

I have one more point to share. I thought we had already disabled multi-threaded epoll support in GlusterD, but it seems we missed doing so; we will surely do it in the next 3.7.x release. #c7 is actually a workaround to disable it.
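For reference, a minimal sketch of what the management volume definition could look like with this workaround applied, based on the sample posted earlier. The file path varies between installations (Debian/Ubuntu packages usually ship it under /etc/glusterfs/, source installs under /usr/local/etc/glusterfs/), so treat this as illustrative rather than authoritative:

```
# Sketch of glusterd.vol with the suggested workaround applied.
# Existing options are copied from the sample above; adjust to match
# your own installation before using this.
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    # workaround: single epoll thread, ping timer disabled
    option event-threads 1
    option ping-timeout 0
end-volume
```

glusterd has to be restarted afterwards for the options to take effect; on Ubuntu 14.04 the packaged service is typically named glusterfs-server.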
Hello guys, I'll do that today or tomorrow. I'll keep you up to date.

Hi Florian, the patch http://review.gluster.org/#/c/12874/ will be available soon in the gluster codebase. Meanwhile, you can apply the glusterd.vol configuration manually and let us know whether the issue still reproduces after the change.

Perfect, I've just modified the settings. We will monitor our systems intensively and let you know if the crashes still occur. Thanks for your quick replies.

Hello guys, No glusterd crashes during the whole weekend :). Should I maintain those options in my CMDB or should I wait for the next patch to get it? Regards,

(In reply to florian.leduc from comment #12)
> Hello Guys,
>
> No glusterd crashing during the whole weekend :). Should I maintain those
> options in my CMDB or should I wait for the next patch to get it?
>
> Regards,

Florian, we'd encourage you to maintain the same configuration till we release 3.7.7. Thanks, Atin

Hello guys, For some time there were no crashes, but after enabling the quota feature we started to see crashes of glusterfsd (though no more glusterd crashes) and we experienced weird behavior:

1. glusterfsd crashes from time to time (see backtrace below)
2. after enabling quotas, a lot of CPU was consumed (around 60% of 32 vCPUs)
3. a lot of split-brain and unsynced entries appeared in 'gluster volume heal info'

    [2015-12-15 17:35:54.236684] I [glusterfsd-mgmt.c:57:mgmt_cbk_spec] 0-mgmt: Volume file changed
    [2015-12-15 17:35:54.241767] I [graph.c:269:gf_add_cmdline_options] 0-data-01-server: adding option 'listen-port' for volume 'data-01-server' with value '49154'
    [2015-12-15 17:35:54.241810] I [graph.c:269:gf_add_cmdline_options] 0-data-01-posix: adding option 'glusterd-uuid' for volume 'data-01-posix' with value 'e2a44035-0e7d-4796-819a-062f916b0d49'
    [2015-12-15 17:35:54.248617] I [MSGID: 121037] [changetimerecorder.c:1686:reconfigure] 0-data-01-changetimerecorder: set!
    [2015-12-15 17:35:54.249140] W [socket.c:3636:reconfigure] 0-data-01-quota: NBIO on -1 failed (Bad file descriptor)
    [2015-12-15 17:35:54.249388] I [MSGID: 115034] [server.c:403:_check_for_auth_option] 0-/var/opt/hosting/data/volume_data-01: skip format check for non-addr auth option auth.login./var/opt/hosting/data/volume_data-01.allow
    [2015-12-15 17:35:54.249442] I [MSGID: 115034] [server.c:403:_check_for_auth_option] 0-/var/opt/hosting/data/volume_data-01: skip format check for non-addr auth option auth.login.8d63107f-2fe9-40ce-99e6-6a7a6ac0d49e.password
    [2015-12-15 17:35:54.249648] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 8d63107f-2fe9-40ce-99e6-6a7a6ac0d49e
    [2015-12-15 17:35:54.249686] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 8d63107f-2fe9-40ce-99e6-6a7a6ac0d49e
    [2015-12-15 17:35:54.249713] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 8d63107f-2fe9-40ce-99e6-6a7a6ac0d49e
    [2015-12-15 17:35:54.249741] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 8d63107f-2fe9-40ce-99e6-6a7a6ac0d49e
    [2015-12-15 17:35:54.249771] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 8d63107f-2fe9-40ce-99e6-6a7a6ac0d49e
    [2015-12-15 17:35:54.249795] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 8d63107f-2fe9-40ce-99e6-6a7a6ac0d49e
    pending frames:
    frame : type(0) op(14)
    frame : type(0) op(0)
    patchset: git://git.gluster.com/glusterfs.git
    signal received: 11
    time of crash:
    2015-12-15 17:35:54
    configuration details:
    argp 1
    backtrace 1
    dlfcn 1
    libpthread 1
    llistxattr 1
    setfsid 1
    spinlock 1
    epoll.h 1
    xattr.h 1
    st_atim.tv_nsec 1
    package-string: glusterfs 3.7.6
    /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x92)[0x7f9aced33562]
    /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x31d)[0x7f9aced4f51d]
    /lib/x86_64-linux-gnu/libc.so.6(+0x36d40)[0x7f9ace131d40]
    /lib/x86_64-linux-gnu/libpthread.so.0(pthread_spin_lock+0x0)[0x7f9ace4cd0f0]
    ---------

Hi Vijaikumar, Can you please take a look at it?

Hi Florian, Could you please provide the stack trace from the glusterfsd core dump? Thanks, Vijay
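As an illustration, one generic way to collect such a stack trace from a core file with gdb is sketched below; this is not a step prescribed in the thread, and the core file path and output file name are placeholders, assuming gdb and the matching glusterfs debug symbols are installed:

```sh
# Sketch: dump all thread backtraces from a glusterfsd core file.
# Replace /path/to/core with the actual core file location.
gdb --batch \
    -ex "set pagination off" \
    -ex "thread apply all bt full" \
    /usr/sbin/glusterfsd /path/to/core > glusterfsd-backtrace.txt
```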
Hello Vijaikumar, thanks for your reply. After a quick look at the system, I couldn't find any core dumps; can you give me a hint of where they should be located? (I tried to google it, but no luck so far.) I once got a core dump in a brick, which is: /var/opt/hosting/data/volume_data-01.

BTW, here's our configuration:

    Volume Name: data-01
    Type: Replicate
    Volume ID: 4b2b4dbe-a8dd-4988-b76e-0e1fc7c0dda9
    Status: Started
    Number of Bricks: 1 x 2 = 2
    Transport-type: tcp
    Bricks:
    Brick1: 10.234.208.154:/var/opt/hosting/data/volume_data-01
    Brick2: 10.234.208.155:/var/opt/hosting/data/volume_data-01
    Options Reconfigured:
    features.quota-deem-statfs: on
    features.inode-quota: on
    features.quota: on
    performance.readdir-ahead: on
    nfs.disable: on
    cluster.self-heal-window-size: 128
    cluster.data-self-heal-algorithm: diff
    cluster.min-free-disk: 5
    network.frame-timeout: 600
    network.ping-timeout: 60
    performance.write-behind-window-size: 128MB
    performance.cache-max-file-size: 100MB
    performance.cache-min-file-size: 1KB
    performance.cache-size: 10GB
    performance.cache-refresh-timeout: 5
    cluster.self-heal-daemon: on

Hi Florian, Usually the core file will be generated under the root dir '/' (which is the cwd of a brick process). If the core pattern is set in the kernel parameters to generate the core file in a directory other than the cwd, it will be in the specified dir. In RHEL, the core pattern may be set to '/var/crash' or '/var/log/crash'. The command to check the core pattern is 'sysctl kernel.core_pattern'. Also check 'ulimit -c'; if it is zero, then no core file would have been generated. We will also try to re-create this problem in-house. Thanks, Vijay
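Spelled out as commands, those checks might look roughly like the following sketch; the target directory in the last line is only an example and must already exist:

```sh
# Where does the kernel write core files? A plain 'core' means the
# process's current working directory ('/' for a brick process).
sysctl kernel.core_pattern

# Maximum core file size for this shell; 0 means no core is written.
ulimit -c

# Example of temporarily allowing cores and steering them to a fixed
# directory (illustrative values; run as root, directory must exist).
ulimit -c unlimited
sysctl -w kernel.core_pattern=/var/crash/core.%e.%p
```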
Hi, I haven't found any trace of core files on that system (they should be named "core" according to sysctl). I'll do more searching on the next crash. Here's a pastebin of the alerts sent through syslog: http://pastebin.com/1JZZuz86

Hi everyone, We're still experiencing a lot of severe crashes (no trace of a core dump on the volume) and then a lot of unsynced entries after healing has passed, even after reinstalling the whole volume from scratch.

==== Logs:

    red-ack Dec 22 21:10:30: Program: ssh%3A%2F%2Froot%4010.234.208.15 [2015-12-22 20, Facility: daemon, Level: crit
    10:30.601517] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-data-01-client-0: server 10.234.240.57:49153 has not responded in the last 60 seconds, disconnecting.
    10:30.601517] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-data-01-client-0: server 10.234.240.57:49153 has not responded in the last 60 seconds, disconnecting.
    red-ack Dec 22 21:10:21: Program: glustershd[40694], Facility: daemon, Level: crit
    [2015-12-22 20:10:21.209994] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-data-01-client-0: server 10.234.240.57:49153 has not responded in the last 60 seconds, disconnecting.
    [2015-12-22 20:10:21.209994] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-data-01-client-0: server 10.234.240.57:49153 has not responded in the last 60 seconds, disconnecting.
    red-ack Dec 22 21:10:15: Program: ssh%3A%2F%2Froot%4010.234.144.57 [2015-12-22 20, Facility: daemon, Level: crit
    10:15.976956] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-data-01-client-0: server 10.234.240.57:49153 has not responded in the last 60 seconds, disconnecting.
    10:15.976956] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-data-01-client-0: server 10.234.240.57:49153 has not responded in the last 60 seconds, disconnecting.
    red-ack Dec 22 21:09:30: Program: var-opt-hosting-shared-volumes-d [2015-12-22 20, Facility: daemon, Level: crit
    09:30.414887] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-data-01-client-0: server 10.234.240.57:49153 has not responded in the last 60 seconds, disconnecting.
    09:30.414887] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-data-01-client-0: server 10.234.240.57:49153 has not responded in the last 60 seconds, disconnecting.

==== Volume heal info output:

    ....
    <gfid:e2d18ab9-a607-499d-babf-8fdaa90dd0bb>
    <gfid:199ba193-0788-4e3b-8951-26f0841c7e45>
    <gfid:77e2401a-2b98-4713-99b3-444bff26a222>
    <gfid:aa47948d-cd91-4d70-941d-21342d4acf06>
    <gfid:ef1f3a4f-6c7b-4741-a846-e8e78174369a>
    <gfid:38856f67-d776-4000-ab42-e548a0ab5f09>
    <gfid:7aa8f688-a53b-4962-81da-ffe5c45ac025>
    <gfid:b9d4bef4-bdee-45dc-bac5-85fdb45f6f41>
    <gfid:ba930fd2-3f46-4c32-99f4-6b6f344b649b>
    <gfid:4d6b8109-cf72-4837-bc48-45158785227a>
    <gfid:62025fc2-e011-4ce0-a3bb-2815bceaaac4>
    Number of entries: 853

Could you please advise. Thanks.

Is the crash from glusterd or the brick process?

Hello Atin, I'd say the brick process, but I have the feeling that ping-timeout set to 0 may be related to those crashes/timeouts. What do you suggest? Keep feeding this thread or open a new one?

(In reply to florian.leduc from comment #23)
> Hello Atin,
>
> I'd say the brick process but I have the feeling that ping-timeout set to 0
> may be related to those crashes/timeouts.

I don't think the ping timeout will contribute to it.

> What do you suggest ? keep feeding this thread or opening a new one ?

I highly recommend opening a new bug for this, as otherwise it will be misleading, since this bug talks about a crash in the glusterd process.

Since I've not received any further details around this bug, I am closing it right now; feel free to reopen if the issue persists.