Description of problem:
While checking the peer status, the CLI crashed with a segmentation fault.

Version-Release number of selected component (if applicable):
[root@rhs1-bb ~]# rpm -qa | grep gluster
glusterfs-fuse-3.4.0.4rhs-1.el6rhs.x86_64
glusterfs-3.4.0.4rhs-1.el6rhs.x86_64
glusterfs-debuginfo-3.4.0.4rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.4rhs-1.el6rhs.x86_64

How reproducible:

Steps to Reproduce:
1. Started with a 14x2 distributed-replicate volume, added 2 more bricks, and triggered a rebalance.
2. While the rebalance was running, restarted one node from each replica pair, one after the other.
3. When all of them came back up, running `gluster peer status` led to the crash.

Volume Name: vstore
Type: Distributed-Replicate
Volume ID: e8fe6a61-6345-41f0-9329-a802b051a026
Status: Started
Number of Bricks: 16 x 2 = 32
Transport-type: tcp
Bricks:
Brick1: 10.70.37.76:/brick1/vs1
Brick2: 10.70.37.133:/brick1/vs1
Brick3: 10.70.37.76:/brick2/vs2
Brick4: 10.70.37.133:/brick2/vs2
Brick5: 10.70.37.76:/brick3/vs3
Brick6: 10.70.37.133:/brick3/vs3
Brick7: 10.70.37.76:/brick4/vs4
Brick8: 10.70.37.133:/brick4/vs4
Brick9: 10.70.37.76:/brick5/vs5
Brick10: 10.70.37.133:/brick5/vs5
Brick11: 10.70.37.76:/brick6/vs6
Brick12: 10.70.37.133:/brick6/vs6
Brick13: 10.70.37.134:/brick1/vs1
Brick14: 10.70.37.59:/brick1/vs1
Brick15: 10.70.37.134:/brick2/vs7
Brick16: 10.70.37.59:/brick2/vs7
Brick17: 10.70.37.134:/brick3/vs8
Brick18: 10.70.37.59:/brick3/vs8
Brick19: 10.70.37.134:/brick4/vs9
Brick20: 10.70.37.59:/brick4/vs9
Brick21: 10.70.37.134:/brick5/vs10
Brick22: 10.70.37.59:/brick5/vs10
Brick23: 10.70.37.76:/brick6/vs11
Brick24: 10.70.37.133:/brick6/vs11
Brick25: 10.70.37.134:/brick6/vs12
Brick26: 10.70.37.59:/brick6/vs12
Brick27: 10.70.37.134:/brick7/vs13
Brick28: 10.70.37.59:/brick7/vs13
Brick29: 10.70.37.134:/brick7/vs14
Brick30: 10.70.37.59:/brick7/vs14
Brick31: 10.70.37.134:/brick7/vs15
Brick32: 10.70.37.59:/brick7/vs15
Options Reconfigured:
network.remote-dio: on
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
storage.owner-uid: 36
storage.owner-gid: 36

Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `/usr/sbin/gluster peer status'.
Program terminated with signal 11, Segmentation fault.
#0  cli_local_wipe (local=0x1) at cli.c:553
553             GF_FREE (local->get_vol.volname);
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6.x86_64 libcom_err-1.41.12-14.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 libxml2-2.7.6-8.el6_3.4.x86_64 ncurses-libs-5.7-3.20090208.el6.x86_64 openssl-1.0.0-27.el6.x86_64 readline-6.0-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt full
#0  cli_local_wipe (local=0x1) at cli.c:553
No locals.
#1  0x000000000041000c in cli_cmd_peer_status_cbk (state=<value optimized out>, word=<value optimized out>, words=<value optimized out>, wordcount=<value optimized out>) at cli-cmd-peer.c:197
        ret = <value optimized out>
        proc = <value optimized out>
        frame = 0x9ecdc4
        sent = 1
        parse_error = <value optimized out>
        __FUNCTION__ = "cli_cmd_peer_status_cbk"
#2  0x0000000000409c6b in cli_cmd_process (state=0x7fffed4e4e50, argc=2, argv=0x7fffed4e5040) at cli-cmd.c:140
        ret = <value optimized out>
        word = <value optimized out>
        next = <value optimized out>
        i = <value optimized out>
        __FUNCTION__ = "cli_cmd_process"
#3  0x0000000000409710 in cli_batch (d=<value optimized out>) at input.c:34
        state = <value optimized out>
        ret = 0
        __FUNCTION__ = "cli_batch"
#4  0x00000035bb807851 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x00000035bb0e890d in clone () from /lib64/libc.so.6
No symbol table info available.
(gdb) bt
#0  cli_local_wipe (local=0x1) at cli.c:553
#1  0x000000000041000c in cli_cmd_peer_status_cbk (state=<value optimized out>, word=<value optimized out>, words=<value optimized out>, wordcount=<value optimized out>) at cli-cmd-peer.c:197
#2  0x0000000000409c6b in cli_cmd_process (state=0x7fffed4e4e50, argc=2, argv=0x7fffed4e5040) at cli-cmd.c:140
#3  0x0000000000409710 in cli_batch (d=<value optimized out>) at input.c:34
#4  0x00000035bb807851 in start_thread () from /lib64/libpthread.so.0
#5  0x00000035bb0e890d in clone () from /lib64/libc.so.6
The command was executed from 10.70.37.76.
Could you point to the location of the core file?
Unfortunately, the file you pointed to is just a text file, not the core dump. :(

1. [root@rhs1-bb tmp]# file /tmp/core.info.PID=26905UID=0
/tmp/core.info.PID=26905UID=0: ASCII text
[root@rhs1-bb tmp]# cat /tmp/core.info.PID=26905UID=0
PROGRAM HOST=rhs1-bb.lab.eng.blr.redhat.com sig=11 PID=26905UID=0 GID=0
Total bytes in core dump: 35880960

2. Core pattern on the system:
[root@rhs1-bb tmp]# cat /proc/sys/kernel/core_pattern
|/usr/local/bin/qeCoreAlert PROGRAM=%e HOST=%h sig=%s PID=%pUID=%u GID=%g

It looks like instead of saving the dump, the kernel passes the core to this qeCoreAlert binary. I am not sure whether that program saves the dump anywhere. Maybe Amar will have some idea.

Thanks,
Santosh
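For reference, a leading '|' in core_pattern tells the kernel to pipe the dump to the named helper's stdin rather than write a file, so unless that helper stores it, the core is lost. A minimal way to get on-disk dumps back (a config sketch; the target path is illustrative, and the change needs root and reverts on reboot):

```shell
# A leading '|' means the kernel pipes the core to the helper program
# instead of writing a file on disk.
cat /proc/sys/kernel/core_pattern

# Temporarily write cores to /tmp, naming them with the executable
# name (%e) and PID (%p); takes effect immediately:
echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

# Also make sure the core-size resource limit is not zero:
ulimit -c unlimited
```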
Thanks Shylesh, got the core file.
Posted patch upstream (http://review.gluster.org/#/c/4976/).
Verified on 3.4.0.8rhs-1.el6rhs.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html