+++ This bug was initially created as a clone of Bug #1433276 +++

Description of problem:
==============
When we try to peer probe a node whose IP address has an octet greater than 255, glusterd crashes consistently (at least 95% of the time, checked on 5 different setups).

Issue a "gluster peer probe 10.70.35.1221" ===> note that the last octet has 4 digits; glusterd crashes.

This is consistent and can easily happen if the admin makes a typo, which is quite possible.

On 3.1.3 (3.7.9-10) I couldn't reproduce it. On 3.8.4-18, give it anything above 255 and it crashes.

Core details:

[root@dhcp35-138 ~]# file /core.30402
/core.30402: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/glusterd', platform: 'x86_64'

[root@dhcp35-138 ~]# gdb /usr/sbin/glusterd /core.30402
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
warning: core file may not match specified executable file.
[New LWP 29703]
[New LWP 30405]
[New LWP 30403]
[New LWP 30404]
[New LWP 30406]
[New LWP 30402]
[New LWP 30607]
[New LWP 30608]
[New LWP 29704]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fd5da47aea5 in __gf_free (free_ptr=0x7fd5b5620040) at mem-pool.c:314
314             GF_ASSERT (GF_MEM_TRAILER_MAGIC ==
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 device-mapper-event-libs-1.02.135-1.el7_3.3.x86_64 device-mapper-libs-1.02.135-1.el7_3.3.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libattr-2.4.46-12.el7.x86_64 libblkid-2.23.2-33.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libsepol-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 lvm2-libs-2.02.166-1.el7_3.3.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7_3.7.x86_64 userspace-rcu-0.7.9-2.el7rhgs.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007fd5da47aea5 in __gf_free (free_ptr=0x7fd5b5620040) at mem-pool.c:314
#1  0x00007fd5da21c9e7 in saved_frames_destroy (frames=<optimized out>) at rpc-clnt.c:388
#2  0x00007fd5da21e140 in rpc_clnt_connection_cleanup (conn=conn@entry=0x7fd5b53a4390) at rpc-clnt.c:557
#3  0x00007fd5da21ec00 in rpc_clnt_handle_disconnect (conn=0x7fd5b53a4390, clnt=0x7fd5b53a4360) at rpc-clnt.c:900
#4  rpc_clnt_notify (trans=<optimized out>, mydata=0x7fd5b53a4390, event=<optimized out>, data=0x7fd5b5610f30) at rpc-clnt.c:953
#5  0x00007fd5da21a9f3 in rpc_transport_notify (this=<optimized out>, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=<optimized out>) at rpc-transport.c:538
#6  0x00007fd5cc032b2d in socket_connect_error_cbk (opaque=0x7fd5b55b2070) at socket.c:2927
#7  0x00007fd5d92b5dc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fd5d8bfa73d in clone () from /lib64/libc.so.6
(gdb)
Version-Release
number of selected component (if applicable):
===
3.8.4-18

How reproducible:
====
Always (or say 95% of the time)

Steps to Reproduce:
1. Set up a gluster node
2. Issue a peer probe to, say, 10.70.35.x (where x > 255)
3. glusterd crashes

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-03-17 05:52:01 EDT ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs-3.3.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Ambarish on 2017-03-17 05:59:36 EDT ---

I hit this on my setup as well just now.

[root@localhost bricks]# gluster peer probe 10.70.37.12345
peer probe: failed: Probe returned with Transport endpoint is not connected
[root@localhost bricks]#

The weird thing is I see this file getting created with the wrong/random hostname:

[root@localhost peers]# ll -h /var/lib/glusterd/peers/
total 12K
-rw-------. 1 root root 73 Mar 17 05:52 02ef4e27-a38e-4e1e-8b75-a0657c2eae6b
-rw-------. 1 root root 75 Mar 17 05:52 10.70.37.12345    -----> BAD
-rw-------. 1 root root 94 Mar 17 05:52 f6384f3a-ab69-4757-8fc8-eda43bd17c2e
[root@localhost peers]#

[root@localhost peers]# cat 10.70.37.12345
uuid=00000000-0000-0000-0000-000000000000
state=0
hostname1=10.70.37.12345
[root@localhost peers]#

Peer status fails on the crashed node as well:

[root@localhost peers]# gluster peer status
peer status: failed
[root@localhost peers]#

Though it works fine on other nodes:

[root@localhost /]# gluster peer status
Number of Peers: 2

Hostname: 10.70.37.65
Uuid: 32095651-cbda-40e8-941c-6b75c260610e
State: Peer in Cluster (Connected)

Hostname: 10.70.37.116
Uuid: 02ef4e27-a38e-4e1e-8b75-a0657c2eae6b
State: Peer in Cluster (Connected)
[root@localhost /]#

--- Additional comment from Ambarish on 2017-03-17 06:03:30 EDT ---

The issue is reproducible if I give peer probe "abcd" as well.
Samikshan shared a similar upstream BZ - https://bugzilla.redhat.com/show_bug.cgi?id=770048 - which was later closed as WFM since no one could reproduce it. But it's very consistent now.

--- Additional comment from Atin Mukherjee on 2017-03-17 11:39:12 EDT ---

https://review.gluster.org/#/c/15916 has caused this regression; further analysis to follow.

--- Additional comment from Atin Mukherjee on 2017-03-17 11:47:13 EDT ---

(In reply to Atin Mukherjee from comment #4)
> https://review.gluster.org/#/c/15916 has caused this regression, further
> analysis to follow on.

Ignore this. It doesn't look like that patch is the culprit.

--- Additional comment from Milind Changire on 2017-03-17 15:31:29 EDT ---

When the erroneous IP address fails the valid_ipv4_address() test, the valid_host_name() test still passes, so the mistyped IP address is assumed to be a dotted FQDN and is handed over to glusterd for processing.

We could mitigate this forwarding of erroneous input by ensuring in the CLI that the host name resolves to a valid IP address before passing it on to glusterd. However, we still need to RCA the assertion failure during saved_frames_destroy().

I wonder if this result can also be seen on a ping-timer expiry when FOP processing is held up for a long time in a gdb debug session on another node, to simulate a busy brick.
REVIEW: https://review.gluster.org/16914 (rpc: bump up conn->cleanup_gen in rpc_clnt_reconnect_cleanup) posted (#1) for review on master by Atin Mukherjee (amukherj)
REVIEW: https://review.gluster.org/16914 (rpc: bump up conn->cleanup_gen in rpc_clnt_reconnect_cleanup) posted (#2) for review on master by Atin Mukherjee (amukherj)
COMMIT: https://review.gluster.org/16914 committed in master by Jeff Darcy (jeff.us)

------

commit 39e09ad1e0e93f08153688c31433c38529f93716
Author: Atin Mukherjee <amukherj>
Date:   Sat Mar 18 16:29:10 2017 +0530

    rpc: bump up conn->cleanup_gen in rpc_clnt_reconnect_cleanup

    Commit 086436a introduced a generation number (cleanup_gen) to ensure
    that the rpc layer doesn't end up cleaning up the connection object if
    the application layer has already destroyed it. Bumping up cleanup_gen
    was done only in rpc_clnt_connection_cleanup (). However the same is
    needed in rpc_clnt_reconnect_cleanup () too, as without it, if the
    object gets destroyed through the reconnect event in the application
    layer, the rpc layer will still try to delete the object, resulting in
    a double free and crash.

    Peer probing an invalid host/IP was the basic test to catch this issue.

    Change-Id: Id5332f3239cb324cead34eb51cf73d426733bd46
    BUG: 1433578
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: https://review.gluster.org/16914
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Milind Changire <mchangir>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jeff.us>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/