Bug 1790670

Summary: glusterd crashed when trying to add node
Product: [Community] GlusterFS
Component: glusterd
Version: 7
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: urgent
Priority: unspecified
Keywords: Reopened
Reporter: Rick Pizzi <pizzi>
Assignee: Sanju <srakonde>
CC: bugs, srakonde
Type: Bug
Last Closed: 2020-01-14 11:15:24 UTC
Attachments: requested core file

Description Rick Pizzi 2020-01-13 21:14:23 UTC
Description of problem:

Trying to add a node to an existing cluster causes glusterd to crash when the peer probe is issued from any node of the existing cluster.

Tried several times, from all nodes, using both the public and the private interface address for the probe, to no avail.

The join actually seems to succeed (peer status on the other nodes says it joined), but since glusterd crashes, something is left corrupted and glusterd will not start on the next execution.
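
(For reference, the probe is issued in the usual way, with the new node's hostname or address, e.g.:

gluster peer probe <new-node-hostname-or-IP>

where <new-node-hostname-or-IP> is a placeholder for the node being added.)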

Log of crashing glusterd:

[2020-01-13 20:45:25.057344] I [glusterd.c:1998:init] 0-management: Regenerating volfiles due to a max op-version mismatch or glusterd.upgrade file not being present, op_version retrieved:0, max op_version: 70000
Final graph:
+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option rpc-auth.auth-glusterfs on
  4:     option rpc-auth.auth-unix on
  5:     option rpc-auth.auth-null on
  6:     option rpc-auth-allow-insecure on
  7:     option transport.listen-backlog 1024
  8:     option max-port 60999
  9:     option event-threads 1
 10:     option ping-timeout 0
 11:     option transport.rdma.listen-port 24008
 12:     option transport.socket.listen-port 24007
 13:     option transport.socket.read-fail-log off
 14:     option transport.socket.keepalive-interval 2
 15:     option transport.socket.keepalive-time 10
 16:     option transport-type rdma
 17:     option working-directory /var/lib/glusterd
 18: end-volume
 19:  
+------------------------------------------------------------------------------+
[2020-01-13 20:45:25.069364] I [MSGID: 101190] [event-epoll.c:682:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0 
[2020-01-13 20:45:27.338940] I [MSGID: 106487] [glusterd-handler.c:1339:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req 
[2020-01-13 20:45:53.609592] I [MSGID: 106163] [glusterd-handshake.c:1433:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 40100 
[2020-01-13 20:45:53.609631] E [MSGID: 101032] [store.c:493:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2020-01-13 20:45:53.609666] I [MSGID: 106477] [glusterd.c:182:glusterd_uuid_generate_save] 0-management: generated UUID: a4bc89ed-5100-4c82-942c-2e23126d8fef 
[2020-01-13 20:45:53.668763] I [MSGID: 106490] [glusterd-handler.c:2789:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: 092a6cb9-b90d-4f21-a51d-c74a543e9dd8 
[2020-01-13 20:45:53.670396] I [MSGID: 106128] [glusterd-handler.c:2824:__glusterd_handle_probe_query] 0-glusterd: Unable to find peerinfo for host: 185-52-0-8.glusterfs-dynamic-98b69362-076a-11e9-b16e-00163cd18fae.default.svc.cluster.local (24007) 
[2020-01-13 20:45:53.677593] W [MSGID: 106061] [glusterd-handler.c:3315:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout 
[2020-01-13 20:45:53.677635] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2020-01-13 20:45:53.680479] I [MSGID: 106498] [glusterd-handler.c:3470:glusterd_friend_add] 0-management: connect returned 0 
[2020-01-13 20:45:53.680536] I [MSGID: 106493] [glusterd-handler.c:2850:__glusterd_handle_probe_query] 0-glusterd: Responded to 185-52-0-8.glusterfs-dynamic-98b69362-076a-11e9-b16e-00163cd18fae.default.svc.cluster.local, op_ret: 0, op_errno: 0, ret: 0 
[2020-01-13 20:45:53.681379] I [MSGID: 106490] [glusterd-handler.c:2434:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 092a6cb9-b90d-4f21-a51d-c74a543e9dd8 
[2020-01-13 20:45:53.689395] I [MSGID: 106493] [glusterd-handler.c:3715:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 185-52-0-8.glusterfs-dynamic-98b69362-076a-11e9-b16e-00163cd18fae.default.svc.cluster.local (0), ret: 0, op_ret: 0 
[2020-01-13 20:45:53.723341] I [MSGID: 106511] [glusterd-rpc-ops.c:250:__glusterd_probe_cbk] 0-management: Received probe resp from uuid: 092a6cb9-b90d-4f21-a51d-c74a543e9dd8, host: 185-52-0-8.glusterfs-dynamic-98b69362-076a-11e9-b16e-00163cd18fae.default.svc.cluster.local 
[2020-01-13 20:45:53.723376] I [MSGID: 106511] [glusterd-rpc-ops.c:403:__glusterd_probe_cbk] 0-glusterd: Received resp to probe req 
[2020-01-13 20:45:53.740791] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 092a6cb9-b90d-4f21-a51d-c74a543e9dd8, host: 185-52-0-8.glusterfs-dynamic-98b69362-076a-11e9-b16e-00163cd18fae.default.svc.cluster.local, port: 0 
[2020-01-13 20:45:53.746122] I [MSGID: 106492] [glusterd-handler.c:2619:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 092a6cb9-b90d-4f21-a51d-c74a543e9dd8 
[2020-01-13 20:45:53.751125] W [MSGID: 106061] [glusterd-handler.c:3315:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout 
[2020-01-13 20:45:53.751174] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2020-01-13 20:45:53.753764] I [MSGID: 106498] [glusterd-handler.c:3519:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0 
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2020-01-13 20:45:53
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 7.1

Sometimes a stack trace is printed:

/lib64/libglusterfs.so.0(+0x277ff)[0x7f19e70437ff]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f19e704e234]
/lib64/libc.so.6(+0x363b0)[0x7f19e56843b0]
/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x26c42)[0x7f19e134cc42]
/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x279ae)[0x7f19e134d9ae]
/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x2840f)[0x7f19e134e40f]
/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x2362e)[0x7f19e134962e]
/lib64/libgfrpc.so.0(+0x9695)[0x7f19e6de7695]
/lib64/libgfrpc.so.0(+0x9a0b)[0x7f19e6de7a0b]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f19e6de9a93]
/usr/lib64/glusterfs/7.1/rpc-transport/socket.so(+0x4468)[0x7f19e0564468]
/usr/lib64/glusterfs/7.1/rpc-transport/socket.so(+0xb861)[0x7f19e056b861]
/lib64/libglusterfs.so.0(+0x8e246)[0x7f19e70aa246]
/lib64/libpthread.so.0(+0x7e65)[0x7f19e5e86e65]
/lib64/libc.so.6(clone+0x6d)[0x7f19e574c88d]

If I restart glusterd, it will abort due to some corruption:

[2020-01-13 21:02:53.619924] I [MSGID: 100030] [glusterfsd.c:2867:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 7.1 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO) 
[2020-01-13 21:02:53.620277] I [glusterfsd.c:2594:daemonize] 0-glusterfs: Pid of current running process is 80
[2020-01-13 21:02:53.622166] I [MSGID: 106478] [glusterd.c:1426:init] 0-management: Maximum allowed open file descriptors set to 65536 
[2020-01-13 21:02:53.622193] I [MSGID: 106479] [glusterd.c:1482:init] 0-management: Using /var/lib/glusterd as working directory 
[2020-01-13 21:02:53.622201] I [MSGID: 106479] [glusterd.c:1488:init] 0-management: Using /var/run/gluster as pid file working directory 
[2020-01-13 21:02:53.625927] I [socket.c:1014:__socket_server_bind] 0-socket.management: process started listening on port (24007)
[2020-01-13 21:02:53.627074] W [MSGID: 103071] [rdma.c:4472:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2020-01-13 21:02:53.627092] W [MSGID: 103055] [rdma.c:4782:init] 0-rdma.management: Failed to initialize IB Device 
[2020-01-13 21:02:53.627101] W [rpc-transport.c:366:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2020-01-13 21:02:53.627171] W [rpcsvc.c:1981:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2020-01-13 21:02:53.627179] E [MSGID: 106244] [glusterd.c:1781:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport 
[2020-01-13 21:02:53.628336] I [socket.c:957:__socket_server_bind] 0-socket.management: closing (AF_UNIX) reuse check socket 12
[2020-01-13 21:02:54.544119] I [MSGID: 106513] [glusterd-store.c:2257:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100 
[2020-01-13 21:02:54.544330] I [MSGID: 106498] [glusterd-handler.c:3519:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0 
[2020-01-13 21:02:54.544595] E [glusterd-handler.c:3275:glusterd_transport_inet_options_build] (-->/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x8aebe) [0x7fd0ac39eebe] -->/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x26c7a) [0x7fd0ac33ac7a] -->/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x26b56) [0x7fd0ac33ab56] ) 0-: Assertion failed: hostname 
[2020-01-13 21:02:54.544620] E [rpc-transport.c:655:rpc_transport_inet_options_build] (-->/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x26c7a) [0x7fd0ac33ac7a] -->/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x267fc) [0x7fd0ac33a7fc] -->/lib64/libgfrpc.so.0(rpc_transport_inet_options_build+0x2b6) [0x7fd0b1dd8156] ) 0-: Assertion failed: hostname 
The message "I [MSGID: 106498] [glusterd-handler.c:3519:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0" repeated 2 times between [2020-01-13 21:02:54.544330] and [2020-01-13 21:02:54.544420]
[2020-01-13 21:02:54.544637] E [MSGID: 101019] [xlator.c:629:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again 
[2020-01-13 21:02:54.544649] E [MSGID: 101066] [graph.c:425:glusterfs_graph_init] 0-management: initializing translator failed 
[2020-01-13 21:02:54.544654] E [MSGID: 101176] [graph.c:779:glusterfs_graph_activate] 0-graph: init failed 
[2020-01-13 21:02:54.544729] W [glusterfsd.c:1596:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x55b5b2bbc19d] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x21d) [0x55b5b2bbc08d] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55b5b2bbb48b] ) 0-: received signum (-1), shutting down 
[2020-01-13 21:02:54.544764] W [mgmt-pmap.c:132:rpc_clnt_mgmt_pmap_signout] 0-glusterfs: failed to create XDR payload


The firewall is open (all three nodes have an ACCEPT rule for any port and protocol in iptables).


Version-Release number of selected component (if applicable):

glusterfs-4.1.6-1.el7.x86_64
glusterfs-fuse-4.1.6-1.el7.x86_64
glusterfs-libs-4.1.6-1.el7.x86_64
glusterfs-client-xlators-4.1.6-1.el7.x86_64

Also tried with the latest (4.1.7); the issue is the same.



How reproducible:

It happens on a new node that I am trying to add to the cluster.
It did not happen on an identical previous node!



Comment 1 Sanju 2020-01-14 08:47:28 UTC
Please share the output of "t a a bt" (thread apply all bt) from the core file. Please attach the core file if possible.
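
(If it helps, the core can be opened with gdb against the glusterd binary; the core path below is illustrative and depends on where your system writes cores:

gdb /usr/sbin/glusterd /path/to/core
(gdb) thread apply all bt

"t a a bt" is just the abbreviated form of "thread apply all bt".)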

Thanks,
Sanju

Comment 2 Rick Pizzi 2020-01-14 08:59:12 UTC
(gdb) t a a bt

Thread 9 (Thread 0x7fe14c8db700 (LWP 83)):
#0  0x00007fe14f76eda2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe1509689a0 in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007fe150969850 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007fe14f76ae65 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe14f03088d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fe14827c700 (LWP 142)):
#0  0x00007fe14f76e9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe14acedecb in hooks_worker () from /usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so
#2  0x00007fe14f76ae65 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe14f03088d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fe150e2e4c0 (LWP 79)):
#0  0x00007fe14f76bfd7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fe15098d4c8 in event_dispatch_epoll () from /lib64/libglusterfs.so.0
#2  0x00005616983d3a01 in main ()

Thread 6 (Thread 0x7fe14c0da700 (LWP 84)):
#0  0x00007fe14f76eda2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe1509689a0 in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007fe150969850 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007fe14f76ae65 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe14f03088d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fe14b8d9700 (LWP 85)):
#0  0x00007fe14f027953 in select () from /lib64/libc.so.6
#1  0x00007fe1509a8814 in runner () from /lib64/libglusterfs.so.0
#2  0x00007fe14f76ae65 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe14f03088d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fe14d0dc700 (LWP 82)):
#0  0x00007fe14eff780d in nanosleep () from /lib64/libc.so.6
#1  0x00007fe14eff76a4 in sleep () from /lib64/libc.so.6
#2  0x00007fe150953948 in pool_sweeper () from /lib64/libglusterfs.so.0
#3  0x00007fe14f76ae65 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe14f03088d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fe14d8dd700 (LWP 81)):
#0  0x00007fe14f772381 in sigwait () from /lib64/libpthread.so.0
#1  0x00005616983d75db in glusterfs_sigwaiter ()
#2  0x00007fe14f76ae65 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe14f03088d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fe14e0de700 (LWP 80)):
#0  0x00007fe14f76eda2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe15093691b in gf_timer_proc () from /lib64/libglusterfs.so.0
#2  0x00007fe14f76ae65 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe14f03088d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fe147a7b700 (LWP 143)):
#0  0x00007fe1470740b8 in ?? () from /lib64/libgcc_s.so.1
#1  0x00007fe147074fb9 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2  0x00007fe14f046a56 in backtrace () from /lib64/libc.so.6
#3  0x00007fe1509277ff in _gf_msg_backtrace_nomem () from /lib64/libglusterfs.so.0
#4  0x00007fe150932234 in gf_print_trace () from /lib64/libglusterfs.so.0
#5  <signal handler called>
#6  0x00007fe1470740b8 in ?? () from /lib64/libgcc_s.so.1
#7  0x00007fe147074fb9 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#8  0x00007fe14f046a56 in backtrace () from /lib64/libc.so.6
#9  0x00007fe14efaaea4 in __libc_message () from /lib64/libc.so.6
#10 0x00007fe14f04a527 in __fortify_fail () from /lib64/libc.so.6
#11 0x00007fe14f04a4e2 in __stack_chk_fail () from /lib64/libc.so.6
#12 0x00007fe14ac94ab5 in glusterd_store_perform_peer_store () from /usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so
#13 0x73756c672e30312d in ?? ()
#14 0x79642d7366726574 in ?? ()
#15 0x38392d63696d616e in ?? ()
#16 0x302d323633393662 in ?? ()
#17 0x396531312d613637 in ?? ()
#18 0x30302d653631622d in ?? ()
#19 0x6638316463333631 in ?? ()
#20 0x75616665642e6561 in ?? ()
#21 0x632e6376732e746c in ?? ()
#22 0x6c2e72657473756c in ?? ()
#23 0x0000000a6c61636f in ?? ()
#24 0x00007fe14ad43c08 in ?? () from /usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so
#25 0x00007fe147a7b290 in ?? ()
#26 0x0000000000000000 in ?? ()
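
As an aside, the "??" frames #13-#23 above are not code addresses: read as little-endian ASCII they spell out a fragment of a Kubernetes service hostname of the same form as the one in the logs (ending in ".default.svc.cluster.local" plus a newline), which suggests the hostname string itself was written over the stack. A minimal standalone sketch, not glusterfs code, that decodes those words:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* word values copied verbatim from frames #13-#23 of the backtrace above */
    uint64_t words[] = {
        0x73756c672e30312d, 0x79642d7366726574, 0x38392d63696d616e,
        0x302d323633393662, 0x396531312d613637, 0x30302d653631622d,
        0x6638316463333631, 0x75616665642e6561, 0x632e6376732e746c,
        0x6c2e72657473756c, 0x0000000a6c61636f,
    };
    for (unsigned i = 0; i < sizeof(words) / sizeof(words[0]); i++) {
        const unsigned char *p = (const unsigned char *)&words[i];
        for (int j = 0; j < 8; j++)   /* x86_64 stores the low byte first */
            if (p[j])
                putchar(p[j]);
    }
    return 0;   /* prints a "...glusterfs-dynamic-...default.svc.cluster.local" fragment */
}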

Comment 3 Rick Pizzi 2020-01-14 09:00:40 UTC
Created attachment 1652135 [details]
requested core file

the core file you have requested

Comment 4 Sanju 2020-01-14 09:30:56 UTC
From comment 2:

Thread 1 (Thread 0x7fe147a7b700 (LWP 143)):
#0  0x00007fe1470740b8 in ?? () from /lib64/libgcc_s.so.1
#1  0x00007fe147074fb9 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2  0x00007fe14f046a56 in backtrace () from /lib64/libc.so.6
#3  0x00007fe1509277ff in _gf_msg_backtrace_nomem () from /lib64/libglusterfs.so.0
#4  0x00007fe150932234 in gf_print_trace () from /lib64/libglusterfs.so.0
#5  <signal handler called>
#6  0x00007fe1470740b8 in ?? () from /lib64/libgcc_s.so.1
#7  0x00007fe147074fb9 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#8  0x00007fe14f046a56 in backtrace () from /lib64/libc.so.6
#9  0x00007fe14efaaea4 in __libc_message () from /lib64/libc.so.6
#10 0x00007fe14f04a527 in __fortify_fail () from /lib64/libc.so.6
#11 0x00007fe14f04a4e2 in __stack_chk_fail () from /lib64/libc.so.6
#12 0x00007fe14ac94ab5 in glusterd_store_perform_peer_store () from /usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so

We can see a no-memory indication here:
#3  0x00007fe1509277ff in _gf_msg_backtrace_nomem ()

This might have caused the peer probe to fail, because the peer probe writes information about the new peer into the store under /var/lib/glusterd/. As there was no memory available, glusterd crashed.
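
For context, this store write creates or updates a file under /var/lib/glusterd/peers/ named after the peer's UUID; its contents are a few key=value lines, roughly of this form (values illustrative, taken from the log above; the exact field set depends on the glusterd version):

uuid=092a6cb9-b90d-4f21-a51d-c74a543e9dd8
state=3
hostname1=185-52-0-8.glusterfs-dynamic-98b69362-076a-11e9-b16e-00163cd18fae.default.svc.cluster.local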

Closing this as not a bug. Please feel free to reopen if you hit this issue even when memory is available.

Thanks,
Sanju

Comment 5 Rick Pizzi 2020-01-14 09:47:57 UTC
I have put triple the memory in the POD (300M now) and the problem persists.

Besides, I don't think it is correct for an out-of-memory condition to crash the glusterd daemon; it should just fail gracefully (malloc returns NULL, so just trap that).
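
A minimal sketch of the kind of graceful handling I mean (illustrative only, not actual glusterd code):

#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate a copy of the hostname and report failure
 * instead of crashing when memory is exhausted. */
static char *dup_hostname(const char *hostname)
{
    size_t len = strlen(hostname) + 1;
    char *copy = malloc(len);
    if (copy == NULL)
        return NULL;        /* caller turns this into an op error, not a SIGSEGV */
    memcpy(copy, hostname, len);
    return copy;
}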

Again, it doesn't look memory dependent to me. Also, the previous node was working fine with an identical configuration, including memory.

I believe there is some sort of buffer overflow going on, as something is overwriting the stack.

One thing I did do is change the hostname of the previous node, but that did not cause problems for the other existing nodes; it is only the peer probe that crashes.

Can you please have another look? 

Thank you.
Rick

Comment 6 Sanju 2020-01-14 09:55:16 UTC
Hi Rick,

Can you please share the output of "rpm -qa | grep glusterfs" from all the nodes? I see op-version 40100 in the logs, but the core was generated by release 7.1 bits.

Thanks,
Sanju

Comment 7 Rick Pizzi 2020-01-14 09:57:17 UTC
Situation:

kubectl describe pod glusterfs-vsfbr   -n kube-system | grep memory
      memory:   900Mi


[2020-01-14 09:50:45.536179] I [MSGID: 100030] [glusterfsd.c:2867:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 7.1 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO) 
[2020-01-14 09:50:45.536653] I [glusterfsd.c:2594:daemonize] 0-glusterfs: Pid of current running process is 79
[2020-01-14 09:50:45.538261] I [MSGID: 106478] [glusterd.c:1426:init] 0-management: Maximum allowed open file descriptors set to 65536 
[2020-01-14 09:50:45.538300] I [MSGID: 106479] [glusterd.c:1482:init] 0-management: Using /var/lib/glusterd as working directory 
[2020-01-14 09:50:45.538403] I [MSGID: 106479] [glusterd.c:1488:init] 0-management: Using /var/run/gluster as pid file working directory 
[2020-01-14 09:50:45.541895] I [socket.c:1014:__socket_server_bind] 0-socket.management: process started listening on port (24007)
[2020-01-14 09:50:45.543039] W [MSGID: 103071] [rdma.c:4472:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2020-01-14 09:50:45.543076] W [MSGID: 103055] [rdma.c:4782:init] 0-rdma.management: Failed to initialize IB Device 
[2020-01-14 09:50:45.543091] W [rpc-transport.c:366:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2020-01-14 09:50:45.543181] W [rpcsvc.c:1981:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2020-01-14 09:50:45.543197] E [MSGID: 106244] [glusterd.c:1781:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport 
[2020-01-14 09:50:45.544310] I [socket.c:957:__socket_server_bind] 0-socket.management: closing (AF_UNIX) reuse check socket 12
[2020-01-14 09:50:46.430992] I [MSGID: 106513] [glusterd-store.c:2257:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100 
[2020-01-14 09:50:46.432660] I [MSGID: 106498] [glusterd-handler.c:3519:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0 
The message "I [MSGID: 106498] [glusterd-handler.c:3519:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0" repeated 2 times between [2020-01-14 09:50:46.432660] and [2020-01-14 09:50:46.432760]
[2020-01-14 09:50:46.432800] W [MSGID: 106061] [glusterd-handler.c:3315:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout 
[2020-01-14 09:50:46.432823] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2020-01-14 09:50:46.437859] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2020-01-14 09:50:46.442293] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
Final graph:
+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option rpc-auth.auth-glusterfs on
  4:     option rpc-auth.auth-unix on
  5:     option rpc-auth.auth-null on
  6:     option rpc-auth-allow-insecure on
  7:     option transport.listen-backlog 1024
  8:     option event-threads 1
  9:     option ping-timeout 0
 10:     option transport.socket.read-fail-log off
 11:     option transport.socket.keepalive-interval 2
 12:     option transport.socket.keepalive-time 10
 13:     option transport-type rdma
 14:     option working-directory /var/lib/glusterd
 15: end-volume
 16:  
+------------------------------------------------------------------------------+
The message "W [MSGID: 106061] [glusterd-handler.c:3315:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout" repeated 2 times between [2020-01-14 09:50:46.432800] and [2020-01-14 09:50:46.442281]
[2020-01-14 09:50:46.450965] I [MSGID: 101190] [event-epoll.c:682:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0 
[2020-01-14 09:50:46.451744] I [MSGID: 106544] [glusterd.c:152:glusterd_uuid_init] 0-management: retrieved UUID: bc728fa4-f496-43f0-8974-9def63e989c3 
[2020-01-14 09:50:46.453163] I [MSGID: 106163] [glusterd-handshake.c:1433:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 40100 
[2020-01-14 09:50:46.530141] I [MSGID: 106490] [glusterd-handler.c:2434:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: d120bc11-e32d-42f5-bb3d-d9feefab396d 
[2020-01-14 09:50:46.535728] I [MSGID: 106493] [glusterd-handler.c:3715:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to nl2.leopardus.com (0), ret: 0, op_ret: 0 
[2020-01-14 09:50:46.543698] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: d120bc11-e32d-42f5-bb3d-d9feefab396d, host: nl2.leopardus.com, port: 0 
[2020-01-14 09:50:46.547771] I [MSGID: 106492] [glusterd-handler.c:2619:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: d120bc11-e32d-42f5-bb3d-d9feefab396d 
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2020-01-14 09:50:46
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 7.1


Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f9de00d60b8 in ?? () from /lib64/libgcc_s.so.1
Missing separate debuginfos, use: debuginfo-install glusterfs-server-7.1-1.el7.x86_64
(gdb) bt
#0  0x00007f9de00d60b8 in ?? () from /lib64/libgcc_s.so.1
#1  0x00007f9de00d6fb9 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2  0x00007f9de82a9a56 in backtrace () from /lib64/libc.so.6
#3  0x00007f9de9b8a7ff in _gf_msg_backtrace_nomem () from /lib64/libglusterfs.so.0
#4  0x00007f9de9b95234 in gf_print_trace () from /lib64/libglusterfs.so.0
#5  <signal handler called>
#6  0x00007f9de00d60b8 in ?? () from /lib64/libgcc_s.so.1
#7  0x00007f9de00d6fb9 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#8  0x00007f9de82a9a56 in backtrace () from /lib64/libc.so.6
#9  0x00007f9de820dea4 in __libc_message () from /lib64/libc.so.6
#10 0x00007f9de82ad527 in __fortify_fail () from /lib64/libc.so.6
#11 0x00007f9de82ad4e2 in __stack_chk_fail () from /lib64/libc.so.6
#12 0x00007f9de3ef7ab5 in glusterd_store_perform_peer_store () from /usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so
#13 0x73756c672e30312d in ?? ()
#14 0x79642d7366726574 in ?? ()
#15 0x38392d63696d616e in ?? ()
#16 0x302d323633393662 in ?? ()
#17 0x396531312d613637 in ?? ()
#18 0x30302d653631622d in ?? ()
#19 0x6638316463333631 in ?? ()
#20 0x75616665642e6561 in ?? ()
#21 0x632e6376732e746c in ?? ()
#22 0x6c2e72657473756c in ?? ()
#23 0x0000000a6c61636f in ?? ()
#24 0x00007f9de3fa6c08 in ?? () from /usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so
#25 0x00007f9de0cde290 in ?? ()
#26 0x0000000000000000 in ?? ()
(gdb) 



So, I do see the "nomem" function you mentioned in frame #3, but the crash is actually happening further up the chain, in frame #11, and it is the stack-smashing check that is being triggered. Anyway, the pod has about 1G of memory and it still happens. Please help.
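
To illustrate the class of bug I am suggesting (purely hypothetical code, not taken from glusterd): a fixed-size stack buffer overrun by the long Kubernetes service hostname would overwrite the saved stack exactly as seen in frames #13-#23, and the compiler's stack protector then aborts via __stack_chk_fail on return:

#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of the suspected bug class -- not glusterd source. */
static void write_peer_entry(const char *hostname)
{
    char buf[32];              /* far too small for a ~90-character k8s hostname */
    strcpy(buf, hostname);     /* unbounded copy smashes the stack canary */
    printf("hostname1=%s\n", buf);
}                              /* stack-protector check fails when this returns */

int main(void)
{
    write_peer_entry("185-52-0-8.glusterfs-dynamic-98b69362-076a-11e9-b16e-"
                     "00163cd18fae.default.svc.cluster.local");
    return 0;
}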

Comment 8 Rick Pizzi 2020-01-14 09:59:46 UTC
(In reply to Sanju from comment #6)
> Hi Rick,
> 
> can you please share output of "rpm -qa | grep glusterfs" from all the
> nodes. I see the op-version 40100 in the logs but core generated is by
> release 7.1 bits.
> 
> Thanks,
> Sanju

This is what is installed in the POD where glusterd is deployed:

glusterfs-rdma-7.1-1.el7.x86_64
glusterfs-7.1-1.el7.x86_64
glusterfs-api-7.1-1.el7.x86_64
glusterfs-fuse-7.1-1.el7.x86_64
glusterfs-geo-replication-7.1-1.el7.x86_64
glusterfs-libs-7.1-1.el7.x86_64
glusterfs-client-xlators-7.1-1.el7.x86_64
glusterfs-cli-7.1-1.el7.x86_64
glusterfs-server-7.1-1.el7.x86_64

Comment 9 Rick Pizzi 2020-01-14 10:01:07 UTC
Yes, the cluster is running op-version 40100.

Comment 10 Rick Pizzi 2020-01-14 10:05:39 UTC
It looks like the "nomem" frame is just an out-of-memory condition hit while trying to print the stack trace, probably due to memory messed up by the buffer overrun; as such, it should be ignored...

Comment 11 Rick Pizzi 2020-01-14 10:15:44 UTC
The uuid field doesn't look right to me...

(gdb) ptype peerinfo
type = struct glusterd_peerinfo_ {
    uuid_t uuid;
    char uuid_str[50];
    glusterd_peer_state_info_t state;
    char *hostname;
    struct cds_list_head hostnames;
    int port;
    struct cds_list_head uuid_list;
    struct cds_list_head op_peers_list;
    struct rpc_clnt *rpc;
    rpc_clnt_prog_t *mgmt;
    rpc_clnt_prog_t *peer;
    rpc_clnt_prog_t *mgmt_v3;
    int connected;
    gf_store_handle_t *shandle;
    glusterd_sm_tr_log_t sm_log;
    _Bool quorum_action;
    gd_quorum_contrib_t quorum_contrib;
    _Bool locked;
    _Bool detaching;
    gd_rcu_head rcu_head;
    pthread_mutex_t delete_lock;
    uint32_t generation;
} *


(gdb) p *peerinfo
$2 = {uuid = "6\002\261\243r\027M\373\230\313\370\265G\246\241", <incomplete sequence \343>, uuid_str = '\000' <repeats 49 times>, state = {
    state = GD_FRIEND_STATE_BEFRIENDED, transition_time = {tv_sec = 0, tv_usec = 0}}, hostname = 0x7fffe800d320 "nl-kn1.leopardus.com", hostnames = {
    next = 0x5555558438c8, prev = 0x7fffe800d918}, port = 0, uuid_list = {next = 0x555555828450, prev = 0x555555844940}, op_peers_list = {next = 0x0, prev = 0x0}, 
  rpc = 0x555555855bd0, mgmt = 0x7ffff21e38a0 <gd_mgmt_prog>, peer = 0x7ffff21e3820 <gd_peer_prog>, mgmt_v3 = 0x7ffff21e37e0 <gd_mgmt_v3_prog>, connected = 1, 
  shandle = 0x555555844fe0, sm_log = {transitions = 0x55555584fb40, current = 0, size = 50, count = 1, 
    state_name_get = 0x7ffff1dfc160 <glusterd_friend_sm_state_name_get>, event_name_get = 0x7ffff1dfc180 <glusterd_friend_sm_event_name_get>}, quorum_action = true, 
  quorum_contrib = QUORUM_NONE, locked = false, detaching = false, rcu_head = {head = {next = {next = 0x0}, func = 0x0}, this = 0x0}, delete_lock = {__data = {
      __lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = '\000' <repeats 39 times>, __align = 0}, generation = 6}
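
Note that uuid is a raw 16-byte uuid_t, so non-printable bytes there are expected, and uuid_str has simply not been filled in at this point in the dump. If useful, the raw bytes can be inspected with, e.g.:

(gdb) x/16xb peerinfo->uuid
(gdb) p peerinfo->uuid_str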

Comment 12 Rick Pizzi 2020-01-14 10:38:04 UTC
An update -- I found out that I have an unfortunate combination of versions.

Node 1 has 4.1.5
Node 2 has 4.1.5
Node 3 (updated the other day) has 7.1.1 (!!), but it joined fine with the other two...
Node 4 (the one I am trying to add now) has 7.1.1 and crashes.

I will try with 4.1.5

Comment 13 Rick Pizzi 2020-01-14 11:15:24 UTC
OK, so it looks like I am a lucky man.
The gluster team updated the Docker image for the first time in more than a year, and this happened EXACTLY 3 days ago, when I installed the new kube node.

Their new image is incompatible with the previous one, as it jumps from version 4.1 to version 7.1, and I see a lot of kube clusters out there in danger from this move: any pod pointing at "latest" will pick up the new version when it is redeployed.

Thank you for the hint and for your time. After going back to the previous image, the problem went away.

For sure, a version check in the code to avoid the buffer overflow would be recommended.
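
For instance, a bounds-checked write would make the overflow impossible regardless of the peer's version (illustrative sketch only, hypothetical helper, not a patch):

#include <stdio.h>

/* Hypothetical defensive variant: report an error instead of overrunning. */
static int format_peer_hostname(char *buf, size_t buflen, const char *hostname)
{
    int n = snprintf(buf, buflen, "hostname1=%s\n", hostname);
    if (n < 0 || (size_t)n >= buflen)
        return -1;    /* does not fit: fail the operation instead of corrupting the stack */
    return 0;
}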

Comment 14 Sanju 2020-01-14 11:23:48 UTC
(In reply to Rick Pizzi from comment #13)
> OK, so it looks like I am a lucky man.
> The gluster team updated the Docker image for the first time in more than a
> year, and this happened EXACTLY 3 days ago, when I installed the new kube
> node.
> 
> Their new image is incompatible with the previous one, as it jumps from
> version 4.1 to version 7.1, and I see a lot of kube clusters out there in
> danger from this move: any pod pointing at "latest" will pick up the new
> version when it is redeployed.
> 
> Thank you for the hint and for your time. After going back to the previous
> image, the problem went away.
> 
> For sure, a version check in the code to avoid the buffer overflow would be
> recommended.

Noted, Rick.