Description of problem:
While checking the peer status, the CLI crashed with a segmentation fault.

Version-Release number of selected component (if applicable):
[root@rhs1-bb ~]# rpm -qa | grep gluster
glusterfs-fuse-3.4.0.4rhs-1.el6rhs.x86_64
glusterfs-3.4.0.4rhs-1.el6rhs.x86_64
glusterfs-debuginfo-3.4.0.4rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.4rhs-1.el6rhs.x86_64

How reproducible:

Steps to Reproduce:
1. Started with a 14x2 distributed-replicate volume, added 2 more bricks, and triggered a rebalance.
2. While the rebalance was running, restarted one node from each replica pair, one after the other.
3. When all of them came back up, running `gluster peer status` led to the crash.

Volume Name: vstore
Type: Distributed-Replicate
Volume ID: e8fe6a61-6345-41f0-9329-a802b051a026
Status: Started
Number of Bricks: 16 x 2 = 32
Transport-type: tcp
Bricks:
Brick1: 10.70.37.76:/brick1/vs1
Brick2: 10.70.37.133:/brick1/vs1
Brick3: 10.70.37.76:/brick2/vs2
Brick4: 10.70.37.133:/brick2/vs2
Brick5: 10.70.37.76:/brick3/vs3
Brick6: 10.70.37.133:/brick3/vs3
Brick7: 10.70.37.76:/brick4/vs4
Brick8: 10.70.37.133:/brick4/vs4
Brick9: 10.70.37.76:/brick5/vs5
Brick10: 10.70.37.133:/brick5/vs5
Brick11: 10.70.37.76:/brick6/vs6
Brick12: 10.70.37.133:/brick6/vs6
Brick13: 10.70.37.134:/brick1/vs1
Brick14: 10.70.37.59:/brick1/vs1
Brick15: 10.70.37.134:/brick2/vs7
Brick16: 10.70.37.59:/brick2/vs7
Brick17: 10.70.37.134:/brick3/vs8
Brick18: 10.70.37.59:/brick3/vs8
Brick19: 10.70.37.134:/brick4/vs9
Brick20: 10.70.37.59:/brick4/vs9
Brick21: 10.70.37.134:/brick5/vs10
Brick22: 10.70.37.59:/brick5/vs10
Brick23: 10.70.37.76:/brick6/vs11
Brick24: 10.70.37.133:/brick6/vs11
Brick25: 10.70.37.134:/brick6/vs12
Brick26: 10.70.37.59:/brick6/vs12
Brick27: 10.70.37.134:/brick7/vs13
Brick28: 10.70.37.59:/brick7/vs13
Brick29: 10.70.37.134:/brick7/vs14
Brick30: 10.70.37.59:/brick7/vs14
Brick31: 10.70.37.134:/brick7/vs15
Brick32: 10.70.37.59:/brick7/vs15
Options Reconfigured:
network.remote-dio: on
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
storage.owner-uid: 36
storage.owner-gid: 36

Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `/usr/sbin/gluster peer status'.
Program terminated with signal 11, Segmentation fault.
#0  cli_local_wipe (local=0x1) at cli.c:553
553             GF_FREE (local->get_vol.volname);
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6.x86_64 libcom_err-1.41.12-14.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 libxml2-2.7.6-8.el6_3.4.x86_64 ncurses-libs-5.7-3.20090208.el6.x86_64 openssl-1.0.0-27.el6.x86_64 readline-6.0-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt full
#0  cli_local_wipe (local=0x1) at cli.c:553
No locals.
#1  0x000000000041000c in cli_cmd_peer_status_cbk (state=<value optimized out>, word=<value optimized out>, words=<value optimized out>, wordcount=<value optimized out>) at cli-cmd-peer.c:197
        ret = <value optimized out>
        proc = <value optimized out>
        frame = 0x9ecdc4
        sent = 1
        parse_error = <value optimized out>
        __FUNCTION__ = "cli_cmd_peer_status_cbk"
#2  0x0000000000409c6b in cli_cmd_process (state=0x7fffed4e4e50, argc=2, argv=0x7fffed4e5040) at cli-cmd.c:140
        ret = <value optimized out>
        word = <value optimized out>
        next = <value optimized out>
        i = <value optimized out>
        __FUNCTION__ = "cli_cmd_process"
#3  0x0000000000409710 in cli_batch (d=<value optimized out>) at input.c:34
        state = <value optimized out>
        ret = 0
        __FUNCTION__ = "cli_batch"
#4  0x00000035bb807851 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x00000035bb0e890d in clone () from /lib64/libc.so.6
No symbol table info available.
(gdb) bt
#0  cli_local_wipe (local=0x1) at cli.c:553
#1  0x000000000041000c in cli_cmd_peer_status_cbk (state=<value optimized out>, word=<value optimized out>, words=<value optimized out>, wordcount=<value optimized out>) at cli-cmd-peer.c:197
#2  0x0000000000409c6b in cli_cmd_process (state=0x7fffed4e4e50, argc=2, argv=0x7fffed4e5040) at cli-cmd.c:140
#3  0x0000000000409710 in cli_batch (d=<value optimized out>) at input.c:34
#4  0x00000035bb807851 in start_thread () from /lib64/libpthread.so.0
#5  0x00000035bb0e890d in clone () from /lib64/libc.so.6
The command was executed from 10.70.37.76.
Could you point to the location of the core file?
Unfortunately, the file you pointed to is just a text file, not the core dump. :(

1. [root@rhs1-bb tmp]# file /tmp/core.info.PID=26905UID=0
/tmp/core.info.PID=26905UID=0: ASCII text
[root@rhs1-bb tmp]# cat /tmp/core.info.PID=26905UID=0
PROGRAM HOST=rhs1-bb.lab.eng.blr.redhat.com sig=11 PID=26905UID=0 GID=0
Total bytes in core dump: 35880960

2. Core pattern on the system:
[root@rhs1-bb tmp]# cat /proc/sys/kernel/core_pattern
|/usr/local/bin/qeCoreAlert PROGRAM=%e HOST=%h sig=%s PID=%pUID=%u GID=%g

It looks like instead of saving the dump, the kernel passes the core to this qeCoreAlert binary. I am not sure whether that program saves the dump anywhere. Maybe Amar will have some idea.

Thanks,
Santosh
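For reference, a leading '|' in core_pattern tells the kernel to pipe the dump to the named helper's stdin rather than write a file, so unless that helper stores it, the core is lost. A minimal way to get on-disk dumps back (a config sketch; the target path is illustrative, and the change needs root and reverts on reboot):

```shell
# A leading '|' means the kernel pipes the core to the helper program
# instead of writing a file on disk.
cat /proc/sys/kernel/core_pattern

# Temporarily write cores to /tmp, naming them with the executable
# name (%e) and PID (%p); takes effect immediately:
echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

# Also make sure the core-size resource limit is not zero:
ulimit -c unlimited
```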
Thanks Shylesh, got the core file.
Posted patch upstream (http://review.gluster.org/#/c/4976/).
Verified on 3.4.0.8rhs-1.el6rhs.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html