+++ This bug was initially created as a clone of Bug #1433276 +++

Description of problem:
==============
When we try to peer probe a node whose IP address has an octet greater than 255, glusterd crashes consistently (at least 95% of the time, checked on 5 different setups).

Issue a "gluster peer probe 10.70.35.1221" ===> note that the last octet has 4 digits; glusterd crashes.

This is consistent and can easily happen if the admin makes a typo, which is quite possible.

On 3.1.3 (3.7.9-10) I couldn't reproduce it. On 3.8.4-18, give it anything above 255 and it crashes.

Core details:

[root@dhcp35-138 ~]# file /core.30402
/core.30402: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/glusterd', platform: 'x86_64'

[root@dhcp35-138 ~]# gdb /usr/sbin/glusterd /core.30402
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
warning: core file may not match specified executable file.
[New LWP 29703]
[New LWP 30405]
[New LWP 30403]
[New LWP 30404]
[New LWP 30406]
[New LWP 30402]
[New LWP 30607]
[New LWP 30608]
[New LWP 29704]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fd5da47aea5 in __gf_free (free_ptr=0x7fd5b5620040) at mem-pool.c:314
314             GF_ASSERT (GF_MEM_TRAILER_MAGIC ==
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 device-mapper-event-libs-1.02.135-1.el7_3.3.x86_64 device-mapper-libs-1.02.135-1.el7_3.3.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libattr-2.4.46-12.el7.x86_64 libblkid-2.23.2-33.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libsepol-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 lvm2-libs-2.02.166-1.el7_3.3.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7_3.7.x86_64 userspace-rcu-0.7.9-2.el7rhgs.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007fd5da47aea5 in __gf_free (free_ptr=0x7fd5b5620040) at mem-pool.c:314
#1  0x00007fd5da21c9e7 in saved_frames_destroy (frames=<optimized out>) at rpc-clnt.c:388
#2  0x00007fd5da21e140 in rpc_clnt_connection_cleanup (conn=conn@entry=0x7fd5b53a4390) at rpc-clnt.c:557
#3  0x00007fd5da21ec00 in rpc_clnt_handle_disconnect (conn=0x7fd5b53a4390, clnt=0x7fd5b53a4360) at rpc-clnt.c:900
#4  rpc_clnt_notify (trans=<optimized out>, mydata=0x7fd5b53a4390, event=<optimized out>, data=0x7fd5b5610f30) at rpc-clnt.c:953
#5  0x00007fd5da21a9f3 in rpc_transport_notify (this=<optimized out>, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=<optimized out>) at rpc-transport.c:538
#6  0x00007fd5cc032b2d in socket_connect_error_cbk (opaque=0x7fd5b55b2070) at socket.c:2927
#7  0x00007fd5d92b5dc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fd5d8bfa73d in clone () from /lib64/libc.so.6
(gdb)
Version-Release
number of selected component (if applicable):
===
3.8.4-18

How reproducible:
====
Always (or say 95% of the time)

Steps to Reproduce:
1. Set up a gluster node
2. Issue a peer probe to, say, 10.70.35.x (where x > 255)
3. glusterd crashes

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-03-17 05:52:01 EDT ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs-3.3.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Ambarish on 2017-03-17 05:59:36 EDT ---

I hit this on my setup as well just now.

[root@localhost bricks]# gluster peer probe 10.70.37.12345
peer probe: failed: Probe returned with Transport endpoint is not connected
[root@localhost bricks]#

The weird thing is I see this file getting created with the wrong/random hostname:

[root@localhost peers]# ll -h /var/lib/glusterd/peers/
total 12K
-rw-------. 1 root root 73 Mar 17 05:52 02ef4e27-a38e-4e1e-8b75-a0657c2eae6b
-rw-------. 1 root root 75 Mar 17 05:52 10.70.37.12345    -----> BAD
-rw-------. 1 root root 94 Mar 17 05:52 f6384f3a-ab69-4757-8fc8-eda43bd17c2e
[root@localhost peers]#

[root@localhost peers]# cat 10.70.37.12345
uuid=00000000-0000-0000-0000-000000000000
state=0
hostname1=10.70.37.12345
[root@localhost peers]#

Peer status fails on the crashed node as well:

[root@localhost peers]# gluster peer status
peer status: failed
[root@localhost peers]#

Though it works fine on other nodes:

[root@localhost /]# gluster peer status
Number of Peers: 2

Hostname: 10.70.37.65
Uuid: 32095651-cbda-40e8-941c-6b75c260610e
State: Peer in Cluster (Connected)

Hostname: 10.70.37.116
Uuid: 02ef4e27-a38e-4e1e-8b75-a0657c2eae6b
State: Peer in Cluster (Connected)
[root@localhost /]#

--- Additional comment from Ambarish on 2017-03-17 06:03:30 EDT ---

The issue is reproducible if I give peer probe "abcd" as well.
Samikshan shared a similar upstream BZ - https://bugzilla.redhat.com/show_bug.cgi?id=770048 - which was later closed as WFM since no one could reproduce it. But it's very consistent now.

--- Additional comment from Atin Mukherjee on 2017-03-17 11:39:12 EDT ---

https://review.gluster.org/#/c/15916 has caused this regression; further analysis to follow.

--- Additional comment from Atin Mukherjee on 2017-03-17 11:47:13 EDT ---

(In reply to Atin Mukherjee from comment #4)
> https://review.gluster.org/#/c/15916 has caused this regression, further
> analysis to follow on.

Ignore this. It doesn't look like that patch is the culprit.

--- Additional comment from Milind Changire on 2017-03-17 15:31:29 EDT ---

When the erroneous IP address fails the valid_ipv4_address() test, the valid_host_name() test still passes, so the mistyped IP address is assumed to be a dotted FQDN and is handed over to glusterd for processing.

We could mitigate this forwarding of erroneous input by ensuring in the CLI that the host name resolves to a valid IP address before passing it on to glusterd. However, we still need to RCA the assertion failure during saved_frames_destroy().

I wonder if this result can also be seen on a ping-timer expiry when FOP processing is held up for a long time in a gdb debug session on another node, to simulate a busy brick.
REVIEW: https://review.gluster.org/16914 (rpc: bump up conn->cleanup_gen in rpc_clnt_reconnect_cleanup) posted (#1) for review on master by Atin Mukherjee (amukherj)
REVIEW: https://review.gluster.org/16914 (rpc: bump up conn->cleanup_gen in rpc_clnt_reconnect_cleanup) posted (#2) for review on master by Atin Mukherjee (amukherj)
COMMIT: https://review.gluster.org/16914 committed in master by Jeff Darcy (jeff.us)

------

commit 39e09ad1e0e93f08153688c31433c38529f93716
Author: Atin Mukherjee <amukherj>
Date:   Sat Mar 18 16:29:10 2017 +0530

    rpc: bump up conn->cleanup_gen in rpc_clnt_reconnect_cleanup

    Commit 086436a introduced a generation number (cleanup_gen) to ensure
    that the rpc layer doesn't end up cleaning up the connection object if
    the application layer has already destroyed it. Bumping up cleanup_gen
    was done only in rpc_clnt_connection_cleanup (). However the same is
    needed in rpc_clnt_reconnect_cleanup () too, as without it, if the
    object gets destroyed through the reconnect event in the application
    layer, the rpc layer will still try to delete the object, resulting in
    a double free and crash.

    Peer probing an invalid host/IP was the basic test to catch this issue.

    Change-Id: Id5332f3239cb324cead34eb51cf73d426733bd46
    BUG: 1433578
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: https://review.gluster.org/16914
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Milind Changire <mchangir>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jeff.us>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/