Bug 1314391 - glusterd crashed when probing a node with firewall enabled on only one node
Summary: glusterd crashed when probing a node with firewall enabled on only one node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.1.3
Assignee: Satish Mohan
QA Contact: Byreddy
URL:
Whiteboard:
Depends On: 1310677
Blocks: 1299184
 
Reported: 2016-03-03 13:41 UTC by SATHEESARAN
Modified: 2016-09-17 16:47 UTC
CC: 7 users

Fixed In Version: glusterfs-3.7.9-2
Doc Type: Bug Fix
Doc Text:
When a node was disconnected from a cluster, and a peer probe was sent to that node using an IP address when the initial peering had been done with a hostname (or vice versa), glusterd attempted to return two responses (one for the IP address, one for the hostname), which resulted in a glusterd crash. This has been corrected so that only one response is sent, preventing the crash.
Clone Of: 1310677
Environment:
Last Closed: 2016-06-23 05:10:17 UTC
Embargoed:




Links
System: Red Hat Product Errata
ID: RHBA-2016:1240
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Gluster Storage 3.1 Update 3
Last Updated: 2016-06-23 08:51:28 UTC

Description SATHEESARAN 2016-03-03 13:41:51 UTC
+++ This bug was initially created as a clone of Bug #1310677 +++

Description of problem:
-----------------------
glusterd crashed when probing a node with the firewall enabled on only one of the nodes

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHEL 7.2
RHGS 3.1.2

How reproducible:
-----------------
always

Steps to Reproduce:
-------------------
1. Install RHEL 7.2 + glusterfs on 2 nodes (say node1 and node2)
2. Add a firewall rule to open the glusterd port 24007 on only one node (say node2); see the example commands after these steps
3. Probe a peer - node2 - from node1
4. Probe a peer - node1 - from node2
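
For example, with firewalld on RHEL 7 the setup could look roughly like the following (the hostnames, default zone, and exact firewall-cmd options are assumptions for illustration, not necessarily the reporter's exact commands):

[root@node2 ~]# firewall-cmd --add-port=24007/tcp              # open the glusterd port on node2 only (runtime rule)
[root@node2 ~]# firewall-cmd --add-port=24007/tcp --permanent  # keep it open across firewalld reloads
[root@node1 ~]# systemctl status firewalld                     # firewalld stays active on node1, so 24007/tcp remains blocked there
[root@node1 ~]# gluster peer probe node2                       # step 3: succeeds, node2's port is reachable
[root@node2 ~]# gluster peer probe node1                       # step 4: node1 is unreachable; this is where glusterd on node2 crashed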

Actual results:
---------------
glusterd crashed on node2

Expected results:
-----------------
glusterd should not crash

--- Additional comment from SATHEESARAN on 2016-02-22 08:50:37 EST ---

[root@node2 ~]# gdb -c /core.9717
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
[New LWP 9720]
[New LWP 9725]
[New LWP 9718]
[New LWP 9726]
[New LWP 9719]
[New LWP 9732]
[New LWP 9724]
[New LWP 9723]
[New LWP 9717]
[New LWP 9721]

warning: core file may not match specified executable file.
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
Missing separate debuginfo for 
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/17/a121b1f7bbb010f54735ffde3347b27b33884d
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0  pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
24	1:	LOCK
(gdb) bt
#0  pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
#1  0x00007fdd8397a45d in __gf_free (free_ptr=0x7fdd6c000cb0) at mem-pool.c:316
#2  0x00007fdd8393ee55 in data_destroy (data=<optimized out>) at dict.c:235
#3  0x00007fdd83941b79 in dict_get_str (this=<optimized out>, key=<optimized out>, str=<optimized out>)
    at dict.c:2213
#4  0x00007fdd784adce9 in glusterd_xfer_cli_probe_resp (req=req@entry=0x7fdd85c6811c, op_ret=op_ret@entry=-1, 
    op_errno=0, op_errstr=op_errstr@entry=0x0, hostname=0x7fdd6c000d80 "dhcp37-152", port=24007, 
    dict=0x7fdd83c17be4) at glusterd-handler.c:3894
#5  0x00007fdd784aea57 in __glusterd_handle_cli_probe (req=req@entry=0x7fdd85c6811c) at glusterd-handler.c:1220
#6  0x00007fdd784a7540 in glusterd_big_locked_handler (req=0x7fdd85c6811c, 
    actor_fn=0x7fdd784ae590 <__glusterd_handle_cli_probe>) at glusterd-handler.c:83
#7  0x00007fdd83988e32 in synctask_wrap (old_task=<optimized out>) at syncop.c:380
#8  0x00007fdd82047110 in ?? () from /usr/lib64/libc-2.17.so
#9  0x0000000000000000 in ?? ()

--- Additional comment from SATHEESARAN on 2016-02-22 09:00:42 EST ---

Crash-time error messages as seen in the glusterd logs:

<snip>
The message "I [MSGID: 106004] [glusterd-handler.c:5065:__glusterd_peer_rpc_notify] 0-management: Peer <dhcp37-152.lab.eng.blr.redhat.com> (<4d46cc7a-6d17-460e-82ba-7f5624436fb0>), in state <Accepted peer request>, has disconnected from glusterd." repeated 4 times between [2016-02-22 15:50:38.204058] and [2016-02-22 15:50:50.235773]
[2016-02-22 15:50:51.106009] I [MSGID: 106487] [glusterd-handler.c:1178:__glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req dhcp37-152 24007
The message "I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management:  already stopped" repeated 4 times between [2016-02-22 15:50:16.093916] and [2016-02-22 15:50:16.093939]
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2016-02-22 15:50:51
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.6
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7fdd83947012]
/lib64/libglusterfs.so.0(gf_print_trace+0x31d)[0x7fdd839634dd]
/lib64/libc.so.6(+0x35670)[0x7fdd82035670]
/lib64/libpthread.so.0(pthread_spin_lock+0x0)[0x7fdd827b4210]
---------
(END)


</snip>

--- Additional comment from SATHEESARAN on 2016-02-22 09:08:02 EST ---



Console output from node1
-------------------------
[root@node1 ~]# gluster peer probe node2
peer probe: success.
 
[root@node1 ~]# gluster peer status
Number of Peers: 1

Hostname: node2
Uuid: df339e12-c30f-4a86-9977-ef4ac6d5a190
State: Accepted peer request (Connected)

Console output from node2
-------------------------

[root@node2 ~]# gluster peer status
Number of Peers: 1

Hostname: node1
Uuid: 4d46cc7a-6d17-460e-82ba-7f5624436fb0
State: Accepted peer request (Disconnected)

[root@node2 ~]# gluster peer probe node1
peer probe: success. Host dhcp37-152 port 24007 already in peer list

[root@node2 ~]# gluster peer status
Connection failed. Please check if gluster daemon is operational.
peer status: failed

--- Additional comment from SATHEESARAN on 2016-02-22 09:13:59 EST ---

I could hit this issue consistently

--- Additional comment from Gaurav Kumar Garg on 2016-02-29 05:42:49 EST ---

upstream patch for this bug is available: http://review.gluster.org/#/c/13546/

--- Additional comment from Vijay Bellur on 2016-03-01 00:59:35 EST ---

REVIEW: http://review.gluster.org/13546 (glusterd: glusterd was crashing when peer probing of disconnect node of cluster) posted (#2) for review on master by Gaurav Kumar Garg (ggarg)

Comment 1 SATHEESARAN 2016-03-03 13:42:49 UTC
Priority field came in from the upstream clone, so reverting to 'unspecified'

Comment 3 Gaurav Kumar Garg 2016-03-28 06:24:43 UTC
downstream patch for this bug is available: https://code.engineering.redhat.com/gerrit/70828

Comment 4 Atin Mukherjee 2016-03-28 07:09:09 UTC
The patch is merged now, hence moving the state to MODIFIED.

Comment 6 Byreddy 2016-04-27 05:18:45 UTC
Verified this bug using the build "glusterfs-3.7.9-2.el7" with the steps below.

Steps:
======
1. Had two RHGS nodes (node1 and node2)
2. Opened the glusterd port (24007) on node2 only using firewall-cmd
3. Probed node2 from node1 and checked the peer status on node1.
4. Probed node1 from node2 and checked the peer status on node2.

Result: No glusterd crash happened on either node (a quick sanity check is sketched below).
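
A quick way to confirm that glusterd survived on both nodes could look like this (a sketch; not necessarily the exact checks run during verification):

[root@node1 ~]# systemctl is-active glusterd   # service still running after the probes
[root@node1 ~]# gluster peer status            # CLI still gets a response
[root@node2 ~]# systemctl is-active glusterd
[root@node2 ~]# gluster peer status            # no "Connection failed" error as in the original report
[root@node2 ~]# ls /core.* 2>/dev/null         # no new core file dropped in /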


Moving to verified state.

Comment 11 errata-xmlrpc 2016-06-23 05:10:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240

