Description of problem:
With about 550 files in a replicated volume, running "ls" to list the files causes a stack overflow in a glfs_iotwr thread.

Version-Release number of selected component (if applicable):
v6.4

How reproducible:

Steps to Reproduce:
1. Create a replicated volume
2. Mount the replicated volume
3. Touch about 550 files in the replicated volume
4. Run the "ls" command

Actual results:
[  296.815617] glfs_iotwr000[626]: bad frame in setup_rt_frame: 00003fff76d7a720 nip 00003fff80f5a1c4 lr 00003fff81019c74

Expected results:
All the files in the replicated volume are listed.

Additional info:
Core was generated by `/usr/sbin/glusterfsd -s 128.224.95.141 --volfile-id gv0.128.224.95.141.tmp-bric'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00003fff7f89c1c4 in _int_free (av=0x3fff68000020, p=0x3fff68025820, have_lock=0) at malloc.c:3846
3846	{
[Current thread is 1 (Thread 0x3fff7970c440 (LWP 1265))]
#0  0x00003fff7f89c1c4 in _int_free (av=0x3fff68000020, p=0x3fff68025820, have_lock=0) at malloc.c:3846
#1  0x00003fff7f95bc74 in x_inline (xdrs=<optimized out>, len=<optimized out>) at xdr_sizeof.c:88
#2  0x00003fff7fa394e8 in .xdr_gfx_iattx () from /usr/lib64/libgfxdr.so.0
#3  0x00003fff7fa39ee4 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#4  0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8a80, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#5  0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8a80, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#6  0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#7  0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8900, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#8  0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8900, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#9  0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#10 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8780, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#11 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8780, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#12 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#13 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8600, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#14 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8600, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#15 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#16 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8480, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#17 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8480, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#18 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#19 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8300, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#20 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8300, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#21 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#22 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8180, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#23 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8180, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#24 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
...
#1611 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#1612 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680b6e00, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#1613 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680b6e00, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#1614 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#1615 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff7970a300, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#1616 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff7970a300, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#1617 0x00003fff7fa3e4d8 in .xdr_gfx_readdirp_rsp () from /usr/lib64/libgfxdr.so.0
#1618 0x00003fff7f95bdd0 in __GI_xdr_sizeof (func=<optimized out>, data=<optimized out>) at xdr_sizeof.c:157
#1619 0x00003fff7a1d391c in gfs_serialize_reply () from /usr/lib64/glusterfs/6.4/xlator/protocol/server.so
#1620 0x00003fff7a1d3b78 in server_submit_reply () from /usr/lib64/glusterfs/6.4/xlator/protocol/server.so
Hi Liguang Li,

I'm not able to reproduce this issue on v6.4. Here is what I did:

[root@vm2 glusterfs]# gluster --version
glusterfs 6.4
[root@vm2 glusterfs]# gluster vol create testvol replica 3 127.0.0.2:/bricks/brick{1..3} force
volume create: testvol: success: please start the volume to access data
[root@vm2 glusterfs]# gluster v start testvol
volume start: testvol: success
[root@vm2 glusterfs]# mount -t glusterfs 127.0.0.2:testvol /mnt/fuse_mnt
[root@vm2 glusterfs]# cd /mnt/fuse_mnt/
[root@vm2 fuse_mnt]# touch file{1..550}
[root@vm2 fuse_mnt]# ls|wc
    550     550    4292

Am I missing something in the steps? Are all your clients and servers on glusterfs 6.4? Was this a fresh install or did you upgrade from an earlier version?
This issue reproduces easily on v6.4 with your steps.

root@128:/# gluster --version
glusterfs 6.4
root@128:/# gdb /usr/sbin/glusterfsd ./core.638
...
Core was generated by `/usr/sbin/glusterfsd -s 128.224.95.141 --volfile-id gv0.128.224.95.141.tmp-bric'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00003fff9f5201c4 in _int_free (av=0x3fff88000020, p=0x3fff880092f0, have_lock=0) at malloc.c:3846
3846	{
[Current thread is 1 (Thread 0x3fff99390440 (LWP 648))]
(gdb) bt
#0  0x00003fff9f5201c4 in _int_free (av=0x3fff88000020, p=0x3fff880092f0, have_lock=0) at malloc.c:3846
#1  0x00003fff9f5dfc74 in x_inline (xdrs=<optimized out>, len=<optimized out>) at xdr_sizeof.c:88
#2  0x00003fff9f6bd4e8 in .xdr_gfx_iattx () from /usr/lib64/libgfxdr.so.0
#3  0x00003fff9f6bdee4 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#4  0x00003fff9f5df8d8 in __GI_xdr_reference (xdrs=0x3fff9938e040, pp=0x3fff880eacf0, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#5  0x00003fff9f5dfab4 in __GI_xdr_pointer (xdrs=0x3fff9938e040, objpp=0x3fff880eacf0, obj_size=<optimized out>,
...
#1642 0x00003fff9f79a3d4 in .call_resume () from /usr/lib64/libglusterfs.so.0
#1643 0x00003fff9a07e948 in ?? () from /usr/lib64/glusterfs/6.4/xlator/performance/io-threads.so
#1644 0x00003fff9f654b30 in start_thread (arg=0x3fff99390440) at pthread_create.c:462
(gdb) frame 1644
#1644 0x00003fff9f654b30 in start_thread (arg=0x3fff99390440) at pthread_create.c:462
462	THREAD_SETMEM (pd, result, pd->start_routine (pd->arg));
(gdb) p/x $r1
$1 = 0x3fff9938fa20
(gdb) frame 0
#0  0x00003fff9f5201c4 in _int_free (av=0x3fff88000020, p=0x3fff880092f0, have_lock=0) at malloc.c:3846
3846	{
(gdb) p/x $r1
$2 = 0x3fff99353080
(gdb) p $1 - $2
$3 = 248224
(gdb) disassemble
Dump of assembler code for function _int_free:
   0x00003fff903f0160 <+0>:	mflr    r0
   0x00003fff903f0164 <+4>:	std     r30,-16(r1)
   0x00003fff903f0168 <+8>:	std     r0,16(r1)
   0x00003fff903f016c <+12>:	mfcr    r12
   0x00003fff903f0170 <+16>:	std     r29,-24(r1)
   0x00003fff903f0174 <+20>:	mr      r29,r3
   0x00003fff903f0178 <+24>:	std     r31,-8(r1)
   0x00003fff903f017c <+28>:	mr      r31,r4
   0x00003fff903f0180 <+32>:	ld      r10,8(r4)
   0x00003fff903f0184 <+36>:	std     r17,-120(r1)
   0x00003fff903f0188 <+40>:	std     r18,-112(r1)
   0x00003fff903f018c <+44>:	rldicr  r30,r10,0,60
   0x00003fff903f0190 <+48>:	std     r19,-104(r1)
   0x00003fff903f0194 <+52>:	neg     r9,r30
   0x00003fff903f0198 <+56>:	std     r20,-96(r1)
   0x00003fff903f019c <+60>:	cmpld   cr7,r4,r9
   0x00003fff903f01a0 <+64>:	std     r21,-88(r1)
   0x00003fff903f01a4 <+68>:	std     r22,-80(r1)
   0x00003fff903f01a8 <+72>:	std     r23,-72(r1)
   0x00003fff903f01ac <+76>:	std     r24,-64(r1)
   0x00003fff903f01b0 <+80>:	std     r25,-56(r1)
   0x00003fff903f01b4 <+84>:	std     r26,-48(r1)
   0x00003fff903f01b8 <+88>:	std     r27,-40(r1)
   0x00003fff903f01bc <+92>:	std     r28,-32(r1)
   0x00003fff903f01c0 <+96>:	stw     r12,8(r1)
=> 0x00003fff903f01c4 <+100>:	stdu    r1,-256(r1)

Please note, we are using a PowerPC machine.

From the stack pointer register ($r1) in frames 1644 and 0, we know 248224 bytes of the thread's stack have already been used. From the assembly, the crash happens at the "stdu r1,-256(r1)" instruction, which extends the stack frame, so I suspect a stack overflow.
From the source code we know the stack size of the thread is 256K. Can I fix this crash by increasing the stack size?
(In reply to Liguang Li from comment #2)
> We know the stack size of the thread is 256K from the source code, can i fix
> this crash by increasing the stack size.

Ideally, IOT_THREAD_STACK_SIZE (256K) should be enough. What is surprising is that you are getting such a long call stack, 1644 frames, almost as if there is a bug causing looping. Can you attach the entire backtrace of the thread that crashed?

> Please notes, we are using a powerpc machine.

Hmm, I tried setting up a Fedora 31 ppc64le virtual machine using qemu on my x86_64 Fedora laptop, but I'm having trouble getting it to boot :(.
Created attachment 1666340 [details] full backtrace of glusterfsd
> IOT_THREAD_STACK_SIZE (256) must be enough ideally. What is surprising is
> that you are having such a long call stack with 1644 frames, almost as if
> there is some bug due to looping. Can you attach the entire backtrace of the
> thread that crashed?

After changing IOT_THREAD_STACK_SIZE from 256K to 512K, the test passes. Even after increasing the number of files in the gluster volume, no crash occurs again.
This bug is moved to https://github.com/gluster/glusterfs/issues/978, and will be tracked there from now on. Visit GitHub issues URL for further details