+++ This bug was initially created as a clone of Bug #1575840 +++

Description of problem:
On a brick-mux enabled setup, a brick crash was seen. The setup has a base 2x3 volume, and two additional volumes were being created and deleted in a loop.

Version-Release number of selected component (if applicable):
3.12.2-8

How reproducible:
1/1

Steps to Reproduce:
1. On a three node cluster, enable brick multiplexing and leave bricks-per-process at the default.
2. Create a 2x3 volume and start it.
3. Run a script that creates two volumes (pikachu_1 and pikachu_2), then starts, stops and deletes them in a continuous loop (a minimal sketch of such a loop is included after the volume info below).

Actual results:
A brick crash was seen on one of the nodes and a core was generated.

Expected results:
No crash should be seen.

Additional info:

[root@dhcp37-107 ~]# gluster vol info

Volume Name: deadpool
Type: Distributed-Replicate
Volume ID: 9a2be3bc-139c-4037-9ebe-8204614b5d65
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: dhcp37-107.lab.eng.blr.redhat.com:/bricks/brick0/deadpool_1
Brick2: dhcp37-102.lab.eng.blr.redhat.com:/bricks/brick0/deadpool_1
Brick3: dhcp37-44.lab.eng.blr.redhat.com:/bricks/brick0/deadpool_1
Brick4: dhcp37-107.lab.eng.blr.redhat.com:/bricks/brick1/deadpool_1
Brick5: dhcp37-102.lab.eng.blr.redhat.com:/bricks/brick1/deadpool_1
Brick6: dhcp37-44.lab.eng.blr.redhat.com:/bricks/brick1/deadpool_1
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.brick-multiplex: enable
cluster.max-bricks-per-process: 0

Volume Name: pikachu_1
Type: Distributed-Replicate
Volume ID: 83fa7d64-b1b6-40be-8d38-cd22faec821f
Status: Stopped
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: dhcp37-107.lab.eng.blr.redhat.com:/bricks/brick0/testvol_1
Brick2: dhcp37-102.lab.eng.blr.redhat.com:/bricks/brick0/testvol_1
Brick3: dhcp37-44.lab.eng.blr.redhat.com:/bricks/brick0/testvol_1
Brick4: dhcp37-107.lab.eng.blr.redhat.com:/bricks/brick1/testvol_1
Brick5: dhcp37-102.lab.eng.blr.redhat.com:/bricks/brick1/testvol_1
Brick6: dhcp37-44.lab.eng.blr.redhat.com:/bricks/brick1/testvol_1
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.brick-multiplex: enable
cluster.max-bricks-per-process: 0

Volume Name: pikachu_2
Type: Distributed-Replicate
Volume ID: c3bbc872-757c-4c50-b86b-c686be8ee6f6
Status: Stopped
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: dhcp37-107.lab.eng.blr.redhat.com:/bricks/brick0/testvol_2
Brick2: dhcp37-102.lab.eng.blr.redhat.com:/bricks/brick0/testvol_2
Brick3: dhcp37-44.lab.eng.blr.redhat.com:/bricks/brick0/testvol_2
Brick4: dhcp37-107.lab.eng.blr.redhat.com:/bricks/brick1/testvol_2
Brick5: dhcp37-102.lab.eng.blr.redhat.com:/bricks/brick1/testvol_2
Brick6: dhcp37-44.lab.eng.blr.redhat.com:/bricks/brick1/testvol_2
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.brick-multiplex: enable
cluster.max-bricks-per-process: 0
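The reproducer script itself is not attached to this bug; the following is only a minimal sketch of the create/start/stop/delete loop described in step 3, assuming the host names and brick paths shown in the vol info above. The --mode=script usage, the force option and the brick-multiplex set command are illustrative choices, not taken from the original script.

#!/bin/bash
# Sketch of the reproducer loop: continuously create, start, stop and delete
# two 2x3 volumes while the base volume stays online with brick-mux enabled.

# Brick multiplexing is a cluster-wide option (already enabled on this setup).
gluster volume set all cluster.brick-multiplex enable

H1=dhcp37-107.lab.eng.blr.redhat.com
H2=dhcp37-102.lab.eng.blr.redhat.com
H3=dhcp37-44.lab.eng.blr.redhat.com

while true; do
    for i in 1 2; do
        # force allows reusing the brick directories across iterations
        gluster --mode=script volume create pikachu_$i replica 3 \
            $H1:/bricks/brick0/testvol_$i $H2:/bricks/brick0/testvol_$i $H3:/bricks/brick0/testvol_$i \
            $H1:/bricks/brick1/testvol_$i $H2:/bricks/brick1/testvol_$i $H3:/bricks/brick1/testvol_$i force
        gluster --mode=script volume start pikachu_$i
    done
    for i in 1 2; do
        gluster --mode=script volume stop pikachu_$i
        gluster --mode=script volume delete pikachu_$i
    done
done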
Here is the bt of the core file:

bt
#0  0x00007f656d2b4de7 in __inode_get_xl_index (xlator=0x7f6558029ff0, inode=0x7f64e00024a0) at inode.c:455
#1  __inode_unref (inode=inode@entry=0x7f64e00024a0) at inode.c:489
#2  0x00007f656d2b5641 in inode_unref (inode=0x7f64e00024a0) at inode.c:559
#3  0x00007f656d2cb533 in fd_destroy (bound=_gf_true, fd=0x7f6504005dd0) at fd.c:532
#4  fd_unref (fd=0x7f6504005dd0) at fd.c:569
#5  0x00007f655c4b00d9 in free_state (state=0x7f6504008580) at server-helpers.c:185
#6  0x00007f655c4ab5fa in server_submit_reply (frame=frame@entry=0x7f6504002370, req=0x7f65580afd30, arg=arg@entry=0x7f650effc910, payload=payload@entry=0x0, payloadcount=payloadcount@entry=0, iobref=0x7f65040015e0, iobref@entry=0x0, xdrproc=0x7f656ce4f6b0 <xdr_gfs3_opendir_rsp>) at server.c:212
#7  0x00007f655c4bfd54 in server_opendir_cbk (frame=frame@entry=0x7f6504002370, cookie=<optimized out>, this=0x7f6558029ff0, op_ret=op_ret@entry=0, op_errno=op_errno@entry=0, fd=fd@entry=0x7f6504005dd0, xdata=xdata@entry=0x0) at server-rpc-fops.c:710
#8  0x00007f655c91f111 in io_stats_opendir_cbk (frame=0x7f6504009650, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, fd=0x7f6504005dd0, xdata=0x0) at io-stats.c:2315
#9  0x00007f655cd6019d in index_opendir (frame=frame@entry=0x7f6504004920, this=this@entry=0x7f652c09b920, loc=loc@entry=0x7f6504008598, fd=fd@entry=0x7f6504005dd0, xdata=xdata@entry=0x0) at index.c:2113
#10 0x00007f656d3262bb in default_opendir (frame=0x7f6504004920, this=<optimized out>, loc=0x7f6504008598, fd=0x7f6504005dd0, xdata=0x0) at defaults.c:2956
#11 0x00007f655c90e1bb in io_stats_opendir (frame=frame@entry=0x7f6504009650, this=this@entry=0x7f652c09e190, loc=loc@entry=0x7f6504008598, fd=fd@entry=0x7f6504005dd0, xdata=xdata@entry=0x0) at io-stats.c:3311
#12 0x00007f656d3262bb in default_opendir (frame=0x7f6504009650, this=<optimized out>, loc=0x7f6504008598, fd=0x7f6504005dd0, xdata=0x0) at defaults.c:2956
#13 0x00007f655c4c7ff2 in server_opendir_resume (frame=0x7f6504002370, bound_xl=0x7f652c09f7a0) at server-rpc-fops.c:2672
#14 0x00007f655c4aec99 in server_resolve_done (frame=0x7f6504002370) at server-resolve.c:587
#15 0x00007f655c4aed3d in server_resolve_all (frame=frame@entry=0x7f6504002370) at server-resolve.c:622
#16 0x00007f655c4af755 in server_resolve (frame=0x7f6504002370) at server-resolve.c:571
#17 0x00007f655c4aed7e in server_resolve_all (frame=frame@entry=0x7f6504002370) at server-resolve.c:618
#18 0x00007f655c4af4eb in server_resolve_inode (frame=frame@entry=0x7f6504002370) at server-resolve.c:425
#19 0x00007f655c4af780 in server_resolve (frame=0x7f6504002370) at server-resolve.c:559
#20 0x00007f655c4aed5e in server_resolve_all (frame=frame@entry=0x7f6504002370) at server-resolve.c:611
#21 0x00007f655c4af814 in resolve_and_resume (frame=frame@entry=0x7f6504002370, fn=fn@entry=0x7f655c4c7e00 <server_opendir_resume>) at server-resolve.c:642
#22 0x00007f655c4c97c1 in server3_3_opendir (req=<optimized out>) at server-rpc-fops.c:4938
#23 0x00007f656d06666e in rpcsvc_request_handler (arg=0x7f655803f8f0) at rpcsvc.c:1909
#24 0x00007f656c103dd5 in start_thread () from /lib64/libpthread.so.0
#25 0x00007f656b9ccb3d in clone () from /lib64/libc.so.6

bt full
(gdb) bt full
#0  0x00007f79cb82bde7 in __inode_get_xl_index (xlator=0x7f79b8029ff0, inode=0x7f795c002370) at inode.c:455
        set_idx = -1
#1  __inode_unref (inode=inode@entry=0x7f795c002370) at inode.c:489
        index = 0
        this = 0x7f79b8029ff0
        __FUNCTION__ = "__inode_unref"
#2  0x00007f79cb82c641 in inode_unref (inode=0x7f795c002370) at inode.c:559
        table = 0x7f79b80b3890
#3  0x00007f79cb842533 in fd_destroy (bound=_gf_true, fd=0x7f793c002930) at fd.c:532
        xl = <optimized out>
        i = <optimized out>
        old_THIS = <optimized out>
#4  fd_unref (fd=0x7f793c002930) at fd.c:569
        refcount = <optimized out>
        bound = _gf_true
        __FUNCTION__ = "fd_unref"
#5  0x00007f79b68d30d9 in free_state (state=0x7f793c0013d0) at server-helpers.c:185
No locals.
#6  0x00007f79b68ce5fa in server_submit_reply (frame=frame@entry=0x7f793c0025b0, req=0x7f79680018e0, arg=arg@entry=0x7f79427fb910, payload=payload@entry=0x0, payloadcount=payloadcount@entry=0, iobref=0x7f793c005c70, iobref@entry=0x0, xdrproc=0x7f79cb3c66b0 <xdr_gfs3_opendir_rsp>) at server.c:212
        iob = <optimized out>
        ret = -1
        rsp = {iov_base = 0x7f79cbd00d00, iov_len = 20}
        state = 0x7f793c0013d0
        new_iobref = 1 '\001'
        client = 0x7f79381448b0
        lk_heal = _gf_false
        __FUNCTION__ = "server_submit_reply"
#7  0x00007f79b68e2d54 in server_opendir_cbk (frame=frame@entry=0x7f793c0025b0, cookie=<optimized out>, this=0x7f79b8029ff0, op_ret=op_ret@entry=0, op_errno=op_errno@entry=0, fd=fd@entry=0x7f793c002930, xdata=xdata@entry=0x0) at server-rpc-fops.c:710
        state = <optimized out>
        req = <optimized out>
        rsp = {op_ret = 0, op_errno = 0, fd = 0, xdata = {xdata_len = 0, xdata_val = 0x0}}
        __FUNCTION__ = "server_opendir_cbk"
#8  0x00007f79b6d42111 in io_stats_opendir_cbk (frame=0x7f793c0012a0, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, fd=0x7f793c002930, xdata=0x0) at io-stats.c:2315
        fn = 0x7f79b68e2c80 <server_opendir_cbk>
        _parent = 0x7f793c0025b0
        old_THIS = 0x7f79b8026cc0
        iosstat = 0x0
        ret = <optimized out>
        __FUNCTION__ = "io_stats_opendir_cbk"
#9  0x00007f79b718319d in index_opendir (frame=frame@entry=0x7f793c002280, this=this@entry=0x7f79b80235e0, loc=loc@entry=0x7f793c0013e8, fd=fd@entry=0x7f793c002930, xdata=xdata@entry=0x0) at index.c:2113
        fn = 0x7f79b6d41f20 <io_stats_opendir_cbk>
        _parent = 0x7f793c0012a0
        old_THIS = 0x7f79b80235e0
        __FUNCTION__ = "index_opendir"
#10 0x00007f79cb89d2bb in default_opendir (frame=0x7f793c002280, this=<optimized out>, loc=0x7f793c0013e8, fd=0x7f793c002930, xdata=0x0) at defaults.c:2956
        old_THIS = 0x7f79b80251e0
        next_xl = 0x7f79b80235e0
        next_xl_fn = 0x7f79b7183040 <index_opendir>
        __FUNCTION__ = "default_opendir"
#11 0x00007f79b6d311bb in io_stats_opendir (frame=frame@entry=0x7f793c0012a0, this=this@entry=0x7f79b8026cc0, loc=loc@entry=0x7f793c0013e8, fd=fd@entry=0x7f793c002930, xdata=xdata@entry=0x0) at io-stats.c:3311
        _new = 0x7f793c002280
        old_THIS = 0x7f79b8026cc0
        tmp_cbk = 0x7f79b6d41f20 <io_stats_opendir_cbk>
        __FUNCTION__ = "io_stats_opendir"
#12 0x00007f79cb89d2bb in default_opendir (frame=0x7f793c0012a0, this=<optimized out>, loc=0x7f793c0013e8, fd=0x7f793c002930, xdata=0x0) at defaults.c:2956
        old_THIS = 0x7f79b80289e0
        next_xl = 0x7f79b8026cc0
        next_xl_fn = 0x7f79b6d30fb0 <io_stats_opendir>
        __FUNCTION__ = "default_opendir"
#13 0x00007f79b68eaff2 in server_opendir_resume (frame=0x7f793c0025b0, bound_xl=0x7f79b80289e0) at server-rpc-fops.c:2672
        _new = 0x7f793c0012a0
        old_THIS = 0x7f79b8029ff0
        tmp_cbk = 0x7f79b68e2c80 <server_opendir_cbk>
        state = 0x7f793c0013d0
        __FUNCTION__ = "server_opendir_resume"
#14 0x00007f79b68d1c99 in server_resolve_done (frame=0x7f793c0025b0) at server-resolve.c:587
        state = 0x7f793c0013d0
#15 0x00007f79b68d1d3d in server_resolve_all (frame=frame@entry=0x7f793c0025b0) at server-resolve.c:622
        state = <optimized out>
        this = <optimized out>
        __FUNCTION__ = "server_resolve_all"
#16 0x00007f79b68d2755 in server_resolve (frame=0x7f793c0025b0) at server-resolve.c:571
        state = 0x7f793c0013d0
        resolve = 0x7f793c0014f0
        __FUNCTION__ = "server_resolve"
#17 0x00007f79b68d1d7e in server_resolve_all (frame=frame@entry=0x7f793c0025b0) at server-resolve.c:618
        state = <optimized out>
        this = <optimized out>
        __FUNCTION__ = "server_resolve_all"
#18 0x00007f79b68d24eb in server_resolve_inode (frame=frame@entry=0x7f793c0025b0) at server-resolve.c:425
        state = <optimized out>
        ret = <optimized out>
        loc = 0x7f793c0013e8
#19 0x00007f79b68d2780 in server_resolve (frame=0x7f793c0025b0) at server-resolve.c:559
        state = 0x7f793c0013d0
        resolve = 0x7f793c001468
        __FUNCTION__ = "server_resolve"
#20 0x00007f79b68d1d5e in server_resolve_all (frame=frame@entry=0x7f793c0025b0) at server-resolve.c:611
        state = <optimized out>
        this = <optimized out>
        __FUNCTION__ = "server_resolve_all"
#21 0x00007f79b68d2814 in resolve_and_resume (frame=frame@entry=0x7f793c0025b0, fn=fn@entry=0x7f79b68eae00 <server_opendir_resume>) at server-resolve.c:642
        state = <optimized out>
#22 0x00007f79b68ec7c1 in server3_3_opendir (req=<optimized out>) at server-rpc-fops.c:4938
        state = 0x7f793c0013d0
        frame = 0x7f793c0025b0
        args = {gfid = "\017\257\263\226\250\306El\243\215r\b\251\034\331\377", xdata = {xdata_len = 0, xdata_val = 0x0}}
        ret = 0
        op_errno = 0
        __FUNCTION__ = "server3_3_opendir"
#23 0x00007f79cb5dd66e in rpcsvc_request_handler (arg=0x7f79b803f9b0) at rpcsvc.c:1909
        program = 0x7f79b803f9b0
        req = 0x7f79680018e0
        actor = <optimized out>
        done = _gf_false
        ret = <optimized out>
        __FUNCTION__ = "rpcsvc_request_handler"
#24 0x00007f79ca67add5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#25 0x00007f79c9f43b3d in clone () from /lib64/libc.so.6
No symbol table info available.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-05-08 01:26:18 EDT ---

This bug is automatically being proposed for the release of Red Hat Gluster Storage 3 under active development and open for bug fixes, by setting the release flag 'rhgs-3.4.0' to '?'. If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Bala Konda Reddy M on 2018-05-08 02:41:06 EDT ---

Sos reports: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/bmekala/bug.1575840/

--- Additional comment from Mohit Agrawal on 2018-05-08 22:47:29 EDT ---

Hi Bala,

I tried to reproduce this after setting up the same environment and running the same reproducer script, but I am not getting any crash. I ran the script for around 20 hrs without success. I have prepared a test build based on a known issue in the brick-mux code at the time of stopping a volume. Please test it in your environment and share the result. Below is the link to the test build:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16128460

Regards,
Mohit Agrawal

--- Additional comment from Bala Konda Reddy M on 2018-05-11 01:48:25 EDT ---

Mohit,

I installed the test build and followed the same steps as in the description. I am able to hit the issue again.
RCA: The brick process crashes because there is no ctx available on the inode: the inode has already been freed via xlator_mem_cleanup, which gets invoked while the transport is being destroyed in free_state, before fd_unref is called. To resolve this, the code in free_state that destroys the transport is moved so that it runs only after all other resources have been freed (a simplified sketch of this reordering is included after the trace below).

>>>>>>>>>>>>>>>>>>>>
#0  0x00007fa5a3c08dc7 in __inode_get_xl_index (xlator=0x7fa590029ff0, inode=0x7fa53c002260) at inode.c:455
#1  __inode_unref (inode=inode@entry=0x7fa53c002260) at inode.c:489
#2  0x00007fa5a3c09621 in inode_unref (inode=0x7fa53c002260) at inode.c:559
#3  0x00007fa5a3c1f502 in fd_destroy (bound=_gf_true, fd=0x7fa538004d20) at fd.c:532
#4  fd_unref (fd=0x7fa538004d20) at fd.c:569
#5  0x00007fa58ed03169 in free_state (state=0x7fa5380013b0) at server-helpers.c:185
#6  0x00007fa58ecfe64a in server_submit_reply (frame=frame@entry=0x7fa538002910, req=0x7fa50c29ade0, arg=arg@entry=0x7fa58e8ec910, payload=payload@entry=0x0, payloadcount=payloadcount@entry=0, iobref=0x7fa538004e50, iobref@entry=0x0, xdrproc=0x7fa5a37a36b0 <xdr_gfs3_opendir_rsp>) at server.c:212
#7  0x00007fa58ed12de4 in server_opendir_cbk (frame=frame@entry=0x7fa538002910, cookie=<optimized out>, this=0x7fa590029ff0, op_ret=op_ret@entry=0, op_errno=op_errno@entry=0, fd=fd@entry=0x7fa538004d20, xdata=xdata@entry=0x0) at server-rpc-fops.c:710
#8  0x00007fa58f173111 in io_stats_opendir_cbk (frame=0x7fa538006f10, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, fd=0x7fa538004d20, xdata=0x0) at io-stats.c:2315
#9  0x00007fa58f5b419d in index_opendir (frame=frame@entry=0x7fa538002480, this=this@entry=0x7fa5640138a0, loc=loc@entry=0x7fa5380013c8, fd=fd@entry=0x7fa538004d20, xdata=xdata@entry=0x0) at index.c:2113
#10 0x00007fa5a3c7a27b in default_opendir (frame=0x7fa538002480, this=<optimized out>, loc=0x7fa5380013c8, fd=0x7fa538004d20, xdata=0x0) at defaults.c:2956
#11 0x00007fa58f1621bb in io_stats_opendir (frame=frame@entry=0x7fa538006f10, this=this@entry=0x7fa564016110, loc=loc@entry=0x7fa5380013c8, fd=fd@entry=0x7fa538004d20, xdata=xdata@entry=0x0) at io-stats.c:3311
#12 0x00007fa5a3c7a27b in default_opendir (frame=0x7fa538006f10, this=<optimized out>, loc=0x7fa5380013c8, fd=0x7fa538004d20, xdata=0x0) at defaults.c:2956
#13 0x00007fa58ed1b082 in server_opendir_resume (frame=0x7fa538002910, bound_xl=0x7fa564017720) at server-rpc-fops.c:2672
#14 0x00007fa58ed01d29 in server_resolve_done (frame=0x7fa538002910) at server-resolve.c:587
#15 0x00007fa58ed01dcd in server_resolve_all (frame=frame@entry=0x7fa538002910) at server-resolve.c:622
#16 0x00007fa58ed027e5 in server_resolve (frame=0x7fa538002910) at server-resolve.c:571
#17 0x00007fa58ed01e0e in server_resolve_all (frame=frame@entry=0x7fa538002910) at server-resolve.c:618
#18 0x00007fa58ed0257b in server_resolve_inode (frame=frame@entry=0x7fa538002910) at server-resolve.c:425
#19 0x00007fa58ed02810 in server_resolve (frame=0x7fa538002910) at server-resolve.c:559
#20 0x00007fa58ed01dee in server_resolve_all (frame=frame@entry=0x7fa538002910) at server-resolve.c:611
#21 0x00007fa58ed028a4 in resolve_and_resume (frame=frame@entry=0x7fa538002910, fn=fn@entry=0x7fa58ed1ae90 <server_opendir_resume>) at server-resolve.c:642
#22 0x00007fa58ed1c851 in server3_3_opendir (req=<optimized out>) at server-rpc-fops.c:4938
#23 0x00007fa5a39ba66e in rpcsvc_request_handler (arg=0x7fa59003f9b0) at rpcsvc.c:1915
#24 0x00007fa5a2a57dd5 in start_thread () from /lib64/libpthread.so.0
#25 0x00007fa5a2320b3d in
clone () from /lib64/libc.so.6

Regards,
Mohit Agrawal
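The fix reorders free_state() so that the transport is torn down only after the fd (and therefore the inode reference) has been dropped. Below is a minimal, self-contained sketch of that ordering issue using hypothetical stand-in types and functions (inode_t, fd_t, state_t, transport_teardown and fd_drop are illustrative only, not the GlusterFS API); the actual change is in the patch linked below.

/* Sketch of the use-after-free: if tearing down the "transport" frees the
 * inode (as xlator_mem_cleanup does for the brick's inode table), a later
 * fd_unref() dereferences freed memory.  The fix is to drop the fd first
 * and destroy the transport last. */
#include <stdio.h>
#include <stdlib.h>

typedef struct inode { int refcount; } inode_t;
typedef struct fd    { inode_t *inode; } fd_t;
typedef struct state { fd_t *fd; inode_t *itable; } state_t;

static void transport_teardown (state_t *s)
{
        /* stands in for the transport cleanup path that ends up freeing
         * the inode table via xlator_mem_cleanup() */
        free (s->itable);
        s->itable = NULL;
}

static void fd_drop (state_t *s)
{
        /* stands in for fd_unref()/fd_destroy(): it touches the inode,
         * so it crashes if the inode was already freed */
        s->fd->inode->refcount--;
        free (s->fd);
        s->fd = NULL;
}

static void free_state_fixed (state_t *s)
{
        fd_drop (s);            /* release fd/inode references first      */
        transport_teardown (s); /* destroy the transport only at the end  */
}

int main (void)
{
        state_t s;
        s.itable    = calloc (1, sizeof (inode_t));
        s.fd        = calloc (1, sizeof (fd_t));
        s.fd->inode = s.itable;

        /* A buggy free_state would call transport_teardown() before
         * fd_drop(), reproducing the inode_unref crash in the traces. */
        free_state_fixed (&s);
        printf ("state freed without touching a freed inode\n");
        return 0;
}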
REVIEW: https://review.gluster.org/20014 (glusterfs: Resolve brick crashes at the time of inode_unref) posted (#1) for review on master by MOHIT AGRAWAL
COMMIT: https://review.gluster.org/20014 committed in master by "Raghavendra G" <rgowdapp> with the commit message:

glusterfs: Resolve brick crashes at the time of inode_unref

Problem: Sometimes the brick process crashes when inode_unref is called from fd_destroy.

Solution: The brick process crashes because the inode has already been freed by xlator_mem_cleanup, which is called by server_rpc_notify. To resolve this, move the code that calls transport_unref to the end of free_state.

BUG: 1577574
Change-Id: Ia517c230d68af4e929b6b753e4c374a26c39dc1a
fixes: bz#1577574
Signed-off-by: Mohit Agrawal <moagrawa>
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/