Bug 1575840 - brick crash seen while creating and deleting two volumes in loop
Summary: brick crash seen while creating and deleting two volumes in loop
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rpc
Version: rhgs-3.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Mohit Agrawal
QA Contact: Bala Konda Reddy M
URL:
Whiteboard:
Depends On:
Blocks: 1503137 1577574
 
Reported: 2018-05-08 05:26 UTC by Bala Konda Reddy M
Modified: 2018-09-04 06:49 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.12.2-10
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1577574
Environment:
Last Closed: 2018-09-04 06:48:05 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:2607 0 None None None 2018-09-04 06:49:57 UTC

Description Bala Konda Reddy M 2018-05-08 05:26:13 UTC
Description of problem:
On a brick-mux enabled setup with a base 2x3 volume, a brick crash was seen while two other volumes were being created and deleted in a loop.


Version-Release number of selected component (if applicable):
3.12.2-8

How reproducible:
1/1

Steps to Reproduce:
1. On a three-node cluster, enable brick multiplexing and leave bricks-per-process at the default (a command sketch follows these steps).
2. Create a 2x3 volume and start it.
3. Run a script that creates two volumes (pikachu_1 and pikachu_2), then starts, stops, and deletes them in a continuous loop (the script is shown in comment 11).
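
For reference, a minimal command sketch of steps 1 and 2. This is an assumption-based reconstruction rather than part of the original report: the hostnames and brick paths are copied from the "gluster vol info" output below, and the two "set all" commands simply make the multiplexing settings shown there explicit (0 is the default for max-bricks-per-process).

# enable brick multiplexing cluster-wide; leave bricks-per-process at its default
gluster volume set all cluster.brick-multiplex enable
gluster volume set all cluster.max-bricks-per-process 0

# create and start the base 2x3 (distributed-replicate) volume
gluster volume create deadpool replica 3 \
    dhcp37-107.lab.eng.blr.redhat.com:/bricks/brick0/deadpool_1 \
    dhcp37-102.lab.eng.blr.redhat.com:/bricks/brick0/deadpool_1 \
    dhcp37-44.lab.eng.blr.redhat.com:/bricks/brick0/deadpool_1 \
    dhcp37-107.lab.eng.blr.redhat.com:/bricks/brick1/deadpool_1 \
    dhcp37-102.lab.eng.blr.redhat.com:/bricks/brick1/deadpool_1 \
    dhcp37-44.lab.eng.blr.redhat.com:/bricks/brick1/deadpool_1
gluster volume start deadpool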


Actual results:
A brick crash was seen on one of the nodes and a core file was generated.

Expected results:
No crash should be seen


Additional info:
[root@dhcp37-107 ~]# gluster vol info
 
Volume Name: deadpool
Type: Distributed-Replicate
Volume ID: 9a2be3bc-139c-4037-9ebe-8204614b5d65
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: dhcp37-107.lab.eng.blr.redhat.com:/bricks/brick0/deadpool_1
Brick2: dhcp37-102.lab.eng.blr.redhat.com:/bricks/brick0/deadpool_1
Brick3: dhcp37-44.lab.eng.blr.redhat.com:/bricks/brick0/deadpool_1
Brick4: dhcp37-107.lab.eng.blr.redhat.com:/bricks/brick1/deadpool_1
Brick5: dhcp37-102.lab.eng.blr.redhat.com:/bricks/brick1/deadpool_1
Brick6: dhcp37-44.lab.eng.blr.redhat.com:/bricks/brick1/deadpool_1
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.brick-multiplex: enable
cluster.max-bricks-per-process: 0
 
Volume Name: pikachu_1
Type: Distributed-Replicate
Volume ID: 83fa7d64-b1b6-40be-8d38-cd22faec821f
Status: Stopped
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: dhcp37-107.lab.eng.blr.redhat.com:/bricks/brick0/testvol_1
Brick2: dhcp37-102.lab.eng.blr.redhat.com:/bricks/brick0/testvol_1
Brick3: dhcp37-44.lab.eng.blr.redhat.com:/bricks/brick0/testvol_1
Brick4: dhcp37-107.lab.eng.blr.redhat.com:/bricks/brick1/testvol_1
Brick5: dhcp37-102.lab.eng.blr.redhat.com:/bricks/brick1/testvol_1
Brick6: dhcp37-44.lab.eng.blr.redhat.com:/bricks/brick1/testvol_1
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.brick-multiplex: enable
cluster.max-bricks-per-process: 0
 
Volume Name: pikachu_2
Type: Distributed-Replicate
Volume ID: c3bbc872-757c-4c50-b86b-c686be8ee6f6
Status: Stopped
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: dhcp37-107.lab.eng.blr.redhat.com:/bricks/brick0/testvol_2
Brick2: dhcp37-102.lab.eng.blr.redhat.com:/bricks/brick0/testvol_2
Brick3: dhcp37-44.lab.eng.blr.redhat.com:/bricks/brick0/testvol_2
Brick4: dhcp37-107.lab.eng.blr.redhat.com:/bricks/brick1/testvol_2
Brick5: dhcp37-102.lab.eng.blr.redhat.com:/bricks/brick1/testvol_2
Brick6: dhcp37-44.lab.eng.blr.redhat.com:/bricks/brick1/testvol_2
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.brick-multiplex: enable
cluster.max-bricks-per-process: 0

Here is the backtrace (bt) of the core file.
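
A backtrace like this is typically obtained by loading the core into gdb against the brick binary. The exact paths are not part of the original report; the ones below are assumptions (glusterfsd as packaged in RHGS, and the /tmp/cores location used by the verification script in comment 11):

# hypothetical invocation; binary and core paths depend on the setup
gdb /usr/sbin/glusterfsd /tmp/cores/core.<pid>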

(gdb) bt
#0  0x00007f656d2b4de7 in __inode_get_xl_index (xlator=0x7f6558029ff0, inode=0x7f64e00024a0) at inode.c:455
#1  __inode_unref (inode=inode@entry=0x7f64e00024a0) at inode.c:489
#2  0x00007f656d2b5641 in inode_unref (inode=0x7f64e00024a0) at inode.c:559
#3  0x00007f656d2cb533 in fd_destroy (bound=_gf_true, fd=0x7f6504005dd0) at fd.c:532
#4  fd_unref (fd=0x7f6504005dd0) at fd.c:569
#5  0x00007f655c4b00d9 in free_state (state=0x7f6504008580) at server-helpers.c:185
#6  0x00007f655c4ab5fa in server_submit_reply (frame=frame@entry=0x7f6504002370, req=0x7f65580afd30, arg=arg@entry=0x7f650effc910, 
    payload=payload@entry=0x0, payloadcount=payloadcount@entry=0, iobref=0x7f65040015e0, iobref@entry=0x0, 
    xdrproc=0x7f656ce4f6b0 <xdr_gfs3_opendir_rsp>) at server.c:212
#7  0x00007f655c4bfd54 in server_opendir_cbk (frame=frame@entry=0x7f6504002370, cookie=<optimized out>, this=0x7f6558029ff0, 
    op_ret=op_ret@entry=0, op_errno=op_errno@entry=0, fd=fd@entry=0x7f6504005dd0, xdata=xdata@entry=0x0) at server-rpc-fops.c:710
#8  0x00007f655c91f111 in io_stats_opendir_cbk (frame=0x7f6504009650, cookie=<optimized out>, this=<optimized out>, op_ret=0, 
    op_errno=0, fd=0x7f6504005dd0, xdata=0x0) at io-stats.c:2315
#9  0x00007f655cd6019d in index_opendir (frame=frame@entry=0x7f6504004920, this=this@entry=0x7f652c09b920, 
    loc=loc@entry=0x7f6504008598, fd=fd@entry=0x7f6504005dd0, xdata=xdata@entry=0x0) at index.c:2113
#10 0x00007f656d3262bb in default_opendir (frame=0x7f6504004920, this=<optimized out>, loc=0x7f6504008598, fd=0x7f6504005dd0, 
    xdata=0x0) at defaults.c:2956
#11 0x00007f655c90e1bb in io_stats_opendir (frame=frame@entry=0x7f6504009650, this=this@entry=0x7f652c09e190, 
    loc=loc@entry=0x7f6504008598, fd=fd@entry=0x7f6504005dd0, xdata=xdata@entry=0x0) at io-stats.c:3311
#12 0x00007f656d3262bb in default_opendir (frame=0x7f6504009650, this=<optimized out>, loc=0x7f6504008598, fd=0x7f6504005dd0, 
    xdata=0x0) at defaults.c:2956
#13 0x00007f655c4c7ff2 in server_opendir_resume (frame=0x7f6504002370, bound_xl=0x7f652c09f7a0) at server-rpc-fops.c:2672
#14 0x00007f655c4aec99 in server_resolve_done (frame=0x7f6504002370) at server-resolve.c:587
#15 0x00007f655c4aed3d in server_resolve_all (frame=frame@entry=0x7f6504002370) at server-resolve.c:622
#16 0x00007f655c4af755 in server_resolve (frame=0x7f6504002370) at server-resolve.c:571
#17 0x00007f655c4aed7e in server_resolve_all (frame=frame@entry=0x7f6504002370) at server-resolve.c:618
#18 0x00007f655c4af4eb in server_resolve_inode (frame=frame@entry=0x7f6504002370) at server-resolve.c:425
#19 0x00007f655c4af780 in server_resolve (frame=0x7f6504002370) at server-resolve.c:559
#20 0x00007f655c4aed5e in server_resolve_all (frame=frame@entry=0x7f6504002370) at server-resolve.c:611
#21 0x00007f655c4af814 in resolve_and_resume (frame=frame@entry=0x7f6504002370, fn=fn@entry=0x7f655c4c7e00 <server_opendir_resume>)
    at server-resolve.c:642
#22 0x00007f655c4c97c1 in server3_3_opendir (req=<optimized out>) at server-rpc-fops.c:4938
#23 0x00007f656d06666e in rpcsvc_request_handler (arg=0x7f655803f8f0) at rpcsvc.c:1909
#24 0x00007f656c103dd5 in start_thread () from /lib64/libpthread.so.0
#25 0x00007f656b9ccb3d in clone () from /lib64/libc.so.6

Full backtrace:

(gdb) bt full
#0  0x00007f79cb82bde7 in __inode_get_xl_index (xlator=0x7f79b8029ff0, inode=0x7f795c002370) at inode.c:455
        set_idx = -1
#1  __inode_unref (inode=inode@entry=0x7f795c002370) at inode.c:489
        index = 0
        this = 0x7f79b8029ff0
        __FUNCTION__ = "__inode_unref"
#2  0x00007f79cb82c641 in inode_unref (inode=0x7f795c002370) at inode.c:559
        table = 0x7f79b80b3890
#3  0x00007f79cb842533 in fd_destroy (bound=_gf_true, fd=0x7f793c002930) at fd.c:532
        xl = <optimized out>
        i = <optimized out>
        old_THIS = <optimized out>
#4  fd_unref (fd=0x7f793c002930) at fd.c:569
        refcount = <optimized out>
        bound = _gf_true
        __FUNCTION__ = "fd_unref"
#5  0x00007f79b68d30d9 in free_state (state=0x7f793c0013d0) at server-helpers.c:185
No locals.
#6  0x00007f79b68ce5fa in server_submit_reply (frame=frame@entry=0x7f793c0025b0, req=0x7f79680018e0, arg=arg@entry=0x7f79427fb910, 
    payload=payload@entry=0x0, payloadcount=payloadcount@entry=0, iobref=0x7f793c005c70, iobref@entry=0x0, 
    xdrproc=0x7f79cb3c66b0 <xdr_gfs3_opendir_rsp>) at server.c:212
        iob = <optimized out>
        ret = -1
        rsp = {iov_base = 0x7f79cbd00d00, iov_len = 20}
        state = 0x7f793c0013d0
        new_iobref = 1 '\001'
        client = 0x7f79381448b0
        lk_heal = _gf_false
        __FUNCTION__ = "server_submit_reply"
#7  0x00007f79b68e2d54 in server_opendir_cbk (frame=frame@entry=0x7f793c0025b0, cookie=<optimized out>, this=0x7f79b8029ff0, 
    op_ret=op_ret@entry=0, op_errno=op_errno@entry=0, fd=fd@entry=0x7f793c002930, xdata=xdata@entry=0x0) at server-rpc-fops.c:710
        state = <optimized out>
        req = <optimized out>
        rsp = {op_ret = 0, op_errno = 0, fd = 0, xdata = {xdata_len = 0, xdata_val = 0x0}}
        __FUNCTION__ = "server_opendir_cbk"
#8  0x00007f79b6d42111 in io_stats_opendir_cbk (frame=0x7f793c0012a0, cookie=<optimized out>, this=<optimized out>, op_ret=0, 
    op_errno=0, fd=0x7f793c002930, xdata=0x0) at io-stats.c:2315
        fn = 0x7f79b68e2c80 <server_opendir_cbk>
        _parent = 0x7f793c0025b0
        old_THIS = 0x7f79b8026cc0
        iosstat = 0x0
        ret = <optimized out>
        __FUNCTION__ = "io_stats_opendir_cbk"
#9  0x00007f79b718319d in index_opendir (frame=frame@entry=0x7f793c002280, this=this@entry=0x7f79b80235e0, 
    loc=loc@entry=0x7f793c0013e8, fd=fd@entry=0x7f793c002930, xdata=xdata@entry=0x0) at index.c:2113
        fn = 0x7f79b6d41f20 <io_stats_opendir_cbk>
        _parent = 0x7f793c0012a0
        old_THIS = 0x7f79b80235e0
        __FUNCTION__ = "index_opendir"
#10 0x00007f79cb89d2bb in default_opendir (frame=0x7f793c002280, this=<optimized out>, loc=0x7f793c0013e8, fd=0x7f793c002930, 
    xdata=0x0) at defaults.c:2956
        old_THIS = 0x7f79b80251e0
        next_xl = 0x7f79b80235e0
        next_xl_fn = 0x7f79b7183040 <index_opendir>
        __FUNCTION__ = "default_opendir"
#11 0x00007f79b6d311bb in io_stats_opendir (frame=frame@entry=0x7f793c0012a0, this=this@entry=0x7f79b8026cc0, 
    loc=loc@entry=0x7f793c0013e8, fd=fd@entry=0x7f793c002930, xdata=xdata@entry=0x0) at io-stats.c:3311
        _new = 0x7f793c002280
        old_THIS = 0x7f79b8026cc0
        tmp_cbk = 0x7f79b6d41f20 <io_stats_opendir_cbk>
        __FUNCTION__ = "io_stats_opendir"
#12 0x00007f79cb89d2bb in default_opendir (frame=0x7f793c0012a0, this=<optimized out>, loc=0x7f793c0013e8, fd=0x7f793c002930, 
    xdata=0x0) at defaults.c:2956
        old_THIS = 0x7f79b80289e0
        next_xl = 0x7f79b8026cc0
        next_xl_fn = 0x7f79b6d30fb0 <io_stats_opendir>
        __FUNCTION__ = "default_opendir"
#13 0x00007f79b68eaff2 in server_opendir_resume (frame=0x7f793c0025b0, bound_xl=0x7f79b80289e0) at server-rpc-fops.c:2672
        _new = 0x7f793c0012a0
        old_THIS = 0x7f79b8029ff0
        tmp_cbk = 0x7f79b68e2c80 <server_opendir_cbk>
        state = 0x7f793c0013d0
        __FUNCTION__ = "server_opendir_resume"
#14 0x00007f79b68d1c99 in server_resolve_done (frame=0x7f793c0025b0) at server-resolve.c:587
        state = 0x7f793c0013d0
#15 0x00007f79b68d1d3d in server_resolve_all (frame=frame@entry=0x7f793c0025b0) at server-resolve.c:622
        state = <optimized out>
        this = <optimized out>
        __FUNCTION__ = "server_resolve_all"
#16 0x00007f79b68d2755 in server_resolve (frame=0x7f793c0025b0) at server-resolve.c:571
        state = 0x7f793c0013d0
        resolve = 0x7f793c0014f0
        __FUNCTION__ = "server_resolve"
#17 0x00007f79b68d1d7e in server_resolve_all (frame=frame@entry=0x7f793c0025b0) at server-resolve.c:618
        state = <optimized out>
        this = <optimized out>
        __FUNCTION__ = "server_resolve_all"
#18 0x00007f79b68d24eb in server_resolve_inode (frame=frame@entry=0x7f793c0025b0) at server-resolve.c:425
        state = <optimized out>
        ret = <optimized out>
        loc = 0x7f793c0013e8
#19 0x00007f79b68d2780 in server_resolve (frame=0x7f793c0025b0) at server-resolve.c:559
        state = 0x7f793c0013d0
        resolve = 0x7f793c001468
        __FUNCTION__ = "server_resolve"
#20 0x00007f79b68d1d5e in server_resolve_all (frame=frame@entry=0x7f793c0025b0) at server-resolve.c:611
        state = <optimized out>
        this = <optimized out>
        __FUNCTION__ = "server_resolve_all"
#21 0x00007f79b68d2814 in resolve_and_resume (frame=frame@entry=0x7f793c0025b0, fn=fn@entry=0x7f79b68eae00 <server_opendir_resume>)
    at server-resolve.c:642
        state = <optimized out>
#22 0x00007f79b68ec7c1 in server3_3_opendir (req=<optimized out>) at server-rpc-fops.c:4938
        state = 0x7f793c0013d0
        frame = 0x7f793c0025b0
        args = {gfid = "\017\257\263\226\250\306El\243\215r\b\251\034\331\377", xdata = {xdata_len = 0, xdata_val = 0x0}}
        ret = 0
        op_errno = 0
        __FUNCTION__ = "server3_3_opendir"
#23 0x00007f79cb5dd66e in rpcsvc_request_handler (arg=0x7f79b803f9b0) at rpcsvc.c:1909
        program = 0x7f79b803f9b0
        req = 0x7f79680018e0
        actor = <optimized out>
        done = _gf_false
        ret = <optimized out>
        __FUNCTION__ = "rpcsvc_request_handler"
#24 0x00007f79ca67add5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#25 0x00007f79c9f43b3d in clone () from /lib64/libc.so.6
No symbol table info available.

Comment 11 Bala Konda Reddy M 2018-05-17 18:26:57 UTC
Build: 3.12.2-10
Followed the steps mentioned in the description.
On a brick-mux setup with a base 2x3 replicate volume, the script below was used to verify the bug:
host1=hostname1
host2=hostname2
host3=hostname3
count=1
while true
do
    for i in {1..2}
    do
        gluster vol create pikachu_$i replica 3 $host1:/bricks/brick0/testvol_$i $host2:/bricks/brick0/testvol_$i $host3:/bricks/brick0/testvol_$i $host1:/bricks/brick1/testvol_$i $host2:/bricks/brick1/testvol_$i $host3:/bricks/brick1/testvol_$i
        gluster vol start pikachu_$i
    done
    sleep 3
    for i in {1..2}
    do
        gluster vol stop pikachu_$i --mode=script
        sleep 3
        gluster vol delete pikachu_$i --mode=script
        python delete_dirs.py  # deletes the brick directories
    done
    count=$((count+1))
    if ls /tmp/cores/core* 1> /dev/null 2>&1; then exit; fi
done
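
delete_dirs.py is not attached to this bug; per the inline comment it removes the leftover brick directories so the next volume create starts from clean paths. A rough shell equivalent, as a sketch under that assumption (it would need to run on every node, and the paths are the ones used in the script above):

# hypothetical equivalent of delete_dirs.py
for i in 1 2
do
    rm -rf /bricks/brick0/testvol_$i /bricks/brick1/testvol_$i
done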

No brick crash was seen.

Hence, marking the bug as verified.

Comment 13 errata-xmlrpc 2018-09-04 06:48:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

