Description of problem:
==============
When we try to peer probe a node using an IP address in which an octet is greater than 255, glusterd crashes consistently (at least 95% of the time; checked on 5 different setups).

Issue a
  gluster peer probe 10.70.35.1221    ===> note that the last part has 4 digits
and glusterd crashes.

This is consistent and can easily happen if the admin makes a typo, which is quite possible.

I checked on 3.1.3 (3.7.9-10) and couldn't reproduce it. On 3.8.4-18, any value above 255 makes it crash.

Core details:
[root@dhcp35-138 ~]# file /core.30402
/core.30402: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/glusterd', platform: 'x86_64'
[root@dhcp35-138 ~]# gdb /usr/sbin/glusterd /core.30402
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.

warning: core file may not match specified executable file.
[New LWP 29703]
[New LWP 30405]
[New LWP 30403]
[New LWP 30404]
[New LWP 30406]
[New LWP 30402]
[New LWP 30607]
[New LWP 30608]
[New LWP 29704]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fd5da47aea5 in __gf_free (free_ptr=0x7fd5b5620040) at mem-pool.c:314
314	    GF_ASSERT (GF_MEM_TRAILER_MAGIC ==
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 device-mapper-event-libs-1.02.135-1.el7_3.3.x86_64 device-mapper-libs-1.02.135-1.el7_3.3.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libattr-2.4.46-12.el7.x86_64 libblkid-2.23.2-33.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libsepol-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 lvm2-libs-2.02.166-1.el7_3.3.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7_3.7.x86_64 userspace-rcu-0.7.9-2.el7rhgs.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007fd5da47aea5 in __gf_free (free_ptr=0x7fd5b5620040) at mem-pool.c:314
#1  0x00007fd5da21c9e7 in saved_frames_destroy (frames=<optimized out>) at rpc-clnt.c:388
#2  0x00007fd5da21e140 in rpc_clnt_connection_cleanup (conn=conn@entry=0x7fd5b53a4390) at rpc-clnt.c:557
#3  0x00007fd5da21ec00 in rpc_clnt_handle_disconnect (conn=0x7fd5b53a4390, clnt=0x7fd5b53a4360) at rpc-clnt.c:900
#4  rpc_clnt_notify (trans=<optimized out>, mydata=0x7fd5b53a4390, event=<optimized out>, data=0x7fd5b5610f30) at rpc-clnt.c:953
#5  0x00007fd5da21a9f3 in rpc_transport_notify (this=<optimized out>, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=<optimized out>) at rpc-transport.c:538
#6  0x00007fd5cc032b2d in socket_connect_error_cbk (opaque=0x7fd5b55b2070) at socket.c:2927
#7  0x00007fd5d92b5dc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fd5d8bfa73d in clone () from /lib64/libc.so.6
(gdb)

Version-Release number of selected component (if applicable):
===
3.8.4-18

How reproducible:
====
Always (or say, about 95% of the time)

Steps to Reproduce:
1. Set up a gluster node.
2. Issue a peer probe to, say, 10.70.35.x (where x is > 255).
3. glusterd crashes.
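
Until a fixed build is installed, the typo case in the steps above can be caught on the admin side before the probe is issued. The following is only a minimal Python sketch of such a pre-check; it is not the glusterd fix, and the function name looks_like_bad_ipv4 is made up for illustration. It treats an argument made only of digits and dots as an intended IPv4 address and rejects it if it does not parse as one. Anything else (for example a hostname) is left for the resolver to decide.

```python
import ipaddress
import re

# Digits separated by dots, e.g. "10.70.35.1221". Anything else is
# assumed to be a hostname and is not judged here.
_DOTTED_NUMERIC = re.compile(r"^[0-9]+(\.[0-9]+)*$")


def looks_like_bad_ipv4(host: str) -> bool:
    """Return True if host is purely digits-and-dots but not a valid IPv4 address."""
    if not _DOTTED_NUMERIC.match(host):
        return False  # not numeric: treat as a hostname, let DNS decide
    try:
        ipaddress.IPv4Address(host)
    except ipaddress.AddressValueError:
        return True  # e.g. an octet above 255 or a wrong number of octets
    return False


if __name__ == "__main__":
    for candidate in ("10.70.35.138", "10.70.35.1221", "10.70.37.12345"):
        verdict = "reject" if looks_like_bad_ipv4(candidate) else "ok"
        print(candidate, verdict)
```

A wrapper script could call this check first and only run gluster peer probe when it returns False.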
I hit this on my setup as well just now.

[root@localhost bricks]# gluster peer probe 10.70.37.12345
peer probe: failed: Probe returned with Transport endpoint is not connected
[root@localhost bricks]#

The weird thing is that I see this file getting created with the wrong/random hostname:

[root@localhost peers]# ll -h /var/lib/glusterd/peers/
total 12K
-rw-------. 1 root root 73 Mar 17 05:52 02ef4e27-a38e-4e1e-8b75-a0657c2eae6b
-rw-------. 1 root root 75 Mar 17 05:52 10.70.37.12345   -----> BAD
-rw-------. 1 root root 94 Mar 17 05:52 f6384f3a-ab69-4757-8fc8-eda43bd17c2e
[root@localhost peers]#
[root@localhost peers]# cat 10.70.37.12345
uuid=00000000-0000-0000-0000-000000000000
state=0
hostname1=10.70.37.12345
[root@localhost peers]#

Peer status fails on the crashed node as well:

[root@localhost peers]# gluster peer status
peer status: failed
[root@localhost peers]#

Though it works fine on other nodes:

[root@localhost /]# gluster peer status
Number of Peers: 2

Hostname: 10.70.37.65
Uuid: 32095651-cbda-40e8-941c-6b75c260610e
State: Peer in Cluster (Connected)

Hostname: 10.70.37.116
Uuid: 02ef4e27-a38e-4e1e-8b75-a0657c2eae6b
State: Peer in Cluster (Connected)
[root@localhost /]#
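
To spot leftover entries like the one above, a small script can scan the peers directory for files whose uuid is all zeros. This is a rough Python sketch, not a glusterd tool: the key=value layout is inferred only from the cat output in this comment, the helper names are hypothetical, and treating an all-zero uuid as "stale" is an assumption. It only reports candidates and deletes nothing.

```python
import uuid
from pathlib import Path

NIL_UUID = uuid.UUID(int=0)  # 00000000-0000-0000-0000-000000000000


def parse_peer_file(text: str) -> dict:
    """Parse key=value lines (layout assumed from the output shown above)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def is_stale_peer(fields: dict) -> bool:
    """Assumption: a peer entry whose uuid is all zeros came from a failed probe."""
    raw = fields.get("uuid")
    if raw is None:
        return False
    try:
        return uuid.UUID(raw) == NIL_UUID
    except ValueError:
        return False  # unparseable uuid: do not guess


def find_stale_peers(peers_dir: str = "/var/lib/glusterd/peers") -> list:
    """Return the names of peer files that look stale; nothing is removed."""
    stale = []
    for path in sorted(Path(peers_dir).iterdir()):
        if path.is_file() and is_stale_peer(parse_peer_file(path.read_text())):
            stale.append(path.name)
    return stale
```

Running find_stale_peers() on the affected node would be expected to list the 10.70.37.12345 file; whether such a file can be removed safely should be confirmed separately.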
The issue is also reproducible if I run peer probe with "abcd". Samikshan shared a similar upstream BZ, https://bugzilla.redhat.com/show_bug.cgi?id=770048, which was later closed as WFM because no one could reproduce it. But it is very consistent now.
Downstream patch: https://code.engineering.redhat.com/gerrit/101366
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774