Bug 1121230

Summary: gluster NFS server process crashed three times while mounting a volume using the NFS protocol
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: gluster-nfs
Version: rhgs-3.0
Reporter: Rachana Patel <racpatel>
Assignee: santosh pradhan <spradhan>
QA Contact: amainkar
CC: racpatel, rhs-bugs, ssamanta, vagarwal
Status: CLOSED WORKSFORME
Severity: medium
Priority: medium
Hardware: x86_64
OS: Linux
Type: Bug
Doc Type: Bug Fix
Target Milestone: ---
Target Release: ---
Last Closed: 2014-08-12 09:23:44 UTC

Description Rachana Patel 2014-07-18 17:13:06 UTC
Description of problem:
=======================
While mounting a volume over NFS, the mount hung; the gluster NFS server process was found to have crashed, and a core file was generated.

Tried three times, and the gluster NFS server process crashed every time.

(gdb) bt
#0  list_del_init (iobuf_pool=<value optimized out>, page_size=128) at list.h:88
#1  __iobuf_arena_unprune (iobuf_pool=<value optimized out>, page_size=128) at iobuf.c:227
#2  0x00000035ff24fd85 in __iobuf_pool_add_arena (iobuf_pool=0x1a11970, page_size=128, num_pages=1024) at iobuf.c:251
#3  0x00000035ff2500dd in iobuf_get2 (iobuf_pool=0x1a11970, page_size=44) at iobuf.c:603
#4  0x00007f74060e7050 in __socket_proto_state_machine (this=0x1d03a90) at socket.c:1989
#5  socket_proto_state_machine (this=0x1d03a90) at socket.c:2107
#6  socket_event_poll_in (this=0x1d03a90) at socket.c:2123
#7  0x00007f74060e8c5d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x1d03a90, poll_in=1, 
    poll_out=0, poll_err=0) at socket.c:2240
#8  0x00000035ff275907 in event_dispatch_epoll_handler (event_pool=0x1a2d4d0) at event-epoll.c:384
#9  event_dispatch_epoll (event_pool=0x1a2d4d0) at event-epoll.c:445
#10 0x0000000000407e93 in main (argc=11, argv=0x7fff8e3080f8) at glusterfsd.c:2023


Version-Release number:
=========================
3.6.0.24-1.el6rhs.x86_64


How reproducible:
=================
Hit multiple times (three out of three attempts).


Steps to Reproduce:
====================
Try to mount the volume using the NFS protocol.

[root@OVM4 core]# ls -l
total 593036
-rw------- 1 root root 201437184 Jul 18 11:01 core.21849.1405660637.dump.1
-rw------- 1 root root 201437184 Jul 18 11:02 core.22203.1405661330.dump.1
-rw------- 1 root root 204394496 Jul 18 12:01 core.22378.1405661653.dump.1


(gdb) bt
#0  list_del_init (iobuf_pool=<value optimized out>, page_size=128) at list.h:88
#1  __iobuf_arena_unprune (iobuf_pool=<value optimized out>, page_size=128) at iobuf.c:227
#2  0x00000035ff24fd85 in __iobuf_pool_add_arena (iobuf_pool=0xee3970, page_size=128, num_pages=1024) at iobuf.c:251
#3  0x00000035ff2500dd in iobuf_get2 (iobuf_pool=0xee3970, page_size=44) at iobuf.c:603
#4  0x00007f4feb322050 in __socket_proto_state_machine (this=0x1204c00) at socket.c:1989
#5  socket_proto_state_machine (this=0x1204c00) at socket.c:2107
#6  socket_event_poll_in (this=0x1204c00) at socket.c:2123
#7  0x00007f4feb323c5d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x1204c00, poll_in=1, 
    poll_out=0, poll_err=0) at socket.c:2240
#8  0x00000035ff275907 in event_dispatch_epoll_handler (event_pool=0xeff4d0) at event-epoll.c:384
#9  event_dispatch_epoll (event_pool=0xeff4d0) at event-epoll.c:445
#10 0x0000000000407e93 in main (argc=11, argv=0x7fff310857a8) at glusterfsd.c:2023
(gdb) bt full
#0  list_del_init (iobuf_pool=<value optimized out>, page_size=128) at list.h:88
No locals.
#1  __iobuf_arena_unprune (iobuf_pool=<value optimized out>, page_size=128) at iobuf.c:227
        iobuf_arena = 0x0
        tmp = 0xffffffff00ee3db0
        index = <value optimized out>
        __FUNCTION__ = "__iobuf_arena_unprune"
#2  0x00000035ff24fd85 in __iobuf_pool_add_arena (iobuf_pool=0xee3970, page_size=128, num_pages=1024) at iobuf.c:251
        iobuf_arena = 0x0
        index = 0
        __FUNCTION__ = "__iobuf_pool_add_arena"
#3  0x00000035ff2500dd in iobuf_get2 (iobuf_pool=0xee3970, page_size=44) at iobuf.c:603
        iobuf = 0x0
        iobuf_arena = 0x0
        rounded_size = 128
        __FUNCTION__ = "iobuf_get2"
#4  0x00007f4feb322050 in __socket_proto_state_machine (this=0x1204c00) at socket.c:1989
        ret = <value optimized out>
        iobuf = <value optimized out>
        frag = 0xf21128
        priv = 0xf21100
        iobref = 0x0
        vector = {{iov_base = 0x0, iov_len = 0}, {iov_base = 0x0, iov_len = 0}}
        in = 0xf21120
#5  socket_proto_state_machine (this=0x1204c00) at socket.c:2107
        priv = 0xf21100
        ret = 0
#6  socket_event_poll_in (this=0x1204c00) at socket.c:2123
        ret = -1
        pollin = 0x0
        priv = 0xf21100
#7  0x00007f4feb323c5d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x1204c00, poll_in=1, 
---Type <return> to continue, or q <return> to quit---
    poll_out=0, poll_err=0) at socket.c:2240
        this = 0x1204c00
        priv = 0xf21100
        ret = 0
        __FUNCTION__ = "socket_event_handler"
#8  0x00000035ff275907 in event_dispatch_epoll_handler (event_pool=0xeff4d0) at event-epoll.c:384
        data = <value optimized out>
        idx = <value optimized out>
        ret = -1
        event_data = 0xf1dd84
        handler = <value optimized out>
#9  event_dispatch_epoll (event_pool=0xeff4d0) at event-epoll.c:445
        events = 0xf1dd80
        i = <value optimized out>
        ret = -1
        __FUNCTION__ = "event_dispatch_epoll"
#10 0x0000000000407e93 in main (argc=11, argv=0x7fff310857a8) at glusterfsd.c:2023
        ctx = 0xee2010
        ret = 0
        cmdlinestr = "/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/9a99795244cae05ee896f56bc0a61595.socket", '\000' <repeats 3919 times>
        __FUNCTION__ = "main"

Comment 2 Rachana Patel 2014-07-21 07:11:26 UTC
-> It was a distributed volume, and the mount was attempted using the command

mount -t nfs <IP>:<vol_name> <mnt_point>
-> No I/Os were being done from any other mount.

Comment 3 Rachana Patel 2014-07-21 07:12:14 UTC
No volume options were set on the volume.

Comment 4 santosh pradhan 2014-07-21 11:34:58 UTC
I have tried several times (~20 attempts) on a 2x2 volume (build glusterfs-3.6.0.24-1.el6rhs.x86_64) to reproduce the issue, but with no success.

Saurabh also tried to reproduce the issue on his setup, but no luck yet.

This is the analysis I have at the moment:
=========================================

(1) The NFS server (glusterfsd) crashes while receiving the mount request, i.e. while processing in the socket layer [the glusterfsd/NFS server crashes before control reaches the NFS protocol layer]. The backtrace:

Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/'.
Program terminated with signal 11, Segmentation fault.
#0  list_del_init (iobuf_pool=<value optimized out>, page_size=128) at list.h:88
88		old->next->prev = old->prev;

(gdb) where
#0  list_del_init (iobuf_pool=<value optimized out>, page_size=128) at list.h:88
#1  __iobuf_arena_unprune (iobuf_pool=<value optimized out>, page_size=128) at iobuf.c:227
#2  0x00000035ff24fd85 in __iobuf_pool_add_arena (iobuf_pool=0x1c4c970, page_size=128, num_pages=1024) at iobuf.c:251
#3  0x00000035ff2500dd in iobuf_get2 (iobuf_pool=0x1c4c970, page_size=44) at iobuf.c:603
#4  0x00007f7113df6050 in __socket_proto_state_machine (this=0x1f3ee50) at socket.c:1989
#5  socket_proto_state_machine (this=0x1f3ee50) at socket.c:2107
#6  socket_event_poll_in (this=0x1f3ee50) at socket.c:2123
#7  0x00007f7113df7c5d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x1f3ee50, poll_in=1, poll_out=0, poll_err=0) at socket.c:2240
#8  0x00000035ff275907 in event_dispatch_epoll_handler (event_pool=0x1c684d0) at event-epoll.c:384
#9  event_dispatch_epoll (event_pool=0x1c684d0) at event-epoll.c:445
#10 0x0000000000407e93 in main (argc=11, argv=0x7fffb4964cc8) at glusterfsd.c:2023


(2) Something unusual I could see in the glusterfs ctx iobuf_pool:

(gdb) frame 4
#4  0x00007f7113df6050 in __socket_proto_state_machine (this=0x1f3ee50) at socket.c:1989
1989	                        iobuf = iobuf_get2 (this->ctx->iobuf_pool,
(gdb)  p ((struct iobuf_pool *)this->ctx->iobuf_pool).purge[0]
$1 = {next = 0xffffffff01c4cdb0, prev = 0x1c4c770}

The next pointer seems to be corrupted somehow. 
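
For reference, here is a minimal sketch of why that matters, assuming the standard kernel-style doubly linked list used by libglusterfs's list.h (an illustrative reconstruction, not the verbatim source): the second statement of list_del_init() dereferences old->next, so a garbage next pointer in a purge-list entry segfaults exactly at list.h:88.

struct list_head {
        struct list_head *next;
        struct list_head *prev;
};

static inline void
list_del_init (struct list_head *old)
{
        old->prev->next = old->next;
        old->next->prev = old->prev;   /* list.h:88 -- SIGSEGV when old->next
                                          is garbage, e.g. the corrupted
                                          0xffffffff01c4cdb0 printed above */
        old->next = old;
        old->prev = old;
}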

(3) In __iobuf_select_arena(), __iobuf_pool_add_arena() gets executed if and only if "all arenas were full, find the right count to add" [as per the comment in the code]; see the sketch below.

What can cause all the arenas to be FULL before it goes to fetch one from the purge[] list? This needs to be checked with people who work on the socket layer. Will be sending an email to Avati/gluster-devel.
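
For context, a simplified sketch of the recycle path named in the backtrace, building on the list_head sketch above. It is reconstructed only from the frame names and locals, not from the verbatim iobuf.c source; the struct layouts, the purge[] table size, and the page-size-to-index mapping are assumptions. __iobuf_pool_add_arena() first asks __iobuf_arena_unprune() for a previously pruned arena from purge[], and that unlink is the list_del_init() in frame #0.

#include <stddef.h>

#define IOBUF_ARENA_MAX_INDEX 8        /* assumed size of the purge[] table */

struct iobuf_arena {
        struct list_head list;         /* linked into iobuf_pool->purge[index] */
        size_t           page_size;
};

struct iobuf_pool {
        struct list_head purge[IOBUF_ARENA_MAX_INDEX];
};

/* Recycle a previously pruned arena instead of allocating a new one.
   __iobuf_pool_add_arena() tries this first; only if it yields nothing
   does it go on to allocate a brand-new arena. */
static struct iobuf_arena *
__iobuf_arena_unprune (struct iobuf_pool *iobuf_pool, size_t page_size)
{
        int               index = 0;   /* the real code maps page_size to an
                                          arena index; that mapping is an
                                          assumption here */
        struct list_head *head  = &iobuf_pool->purge[index];
        struct list_head *entry = head->next;

        if (entry == head)
                return NULL;           /* purge list empty; caller allocates */

        /* frame #0 of the crash: unlinking the entry walks its next/prev
           neighbours, which segfaults once the purge list is corrupted */
        list_del_init (entry);

        return (struct iobuf_arena *)
               ((char *)entry - offsetof (struct iobuf_arena, list));
}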

(4) The issue is not reproducible even after several attempts, so the severity should be brought down, IMO. What do you say, Rachana?

Thanks,
Santosh

Comment 5 santosh pradhan 2014-07-22 17:11:25 UTC
The issue is not reproducible after several attempts, not even by QE.

As per my discussion with Rachana, I am lowering the severity to medium so that it can be investigated later if we hit it again; otherwise it will be CLOSED.