Bug 1802947 - Listing about 550 files in a replicated volume causes a glfs_iotwr thread crash
Summary: Listing about 550 files in a replicated volume causes a glfs_iotwr thread crash
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: io-threads
Version: 6
Hardware: ppc64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Ravishankar N
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-14 08:29 UTC by Liguang Li
Modified: 2020-03-12 12:58 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-12 12:58:37 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
full backtrace of glusterfsd (230.32 KB, text/plain)
2020-02-28 08:47 UTC, Liguang Li

Description Liguang Li 2020-02-14 08:29:13 UTC
Description of problem:

With about 550 files in the replicated volume, running "ls" to list them causes a stack overflow in the glfs_iotwr thread.

Version-Release number of selected component (if applicable):
v6.4

How reproducible:

Steps to Reproduce:
1. Create a replicated volume
2. Mount the replicated volume
3. Touch about 550 files in the replicated volume
4. Run "ls" command

Actual results:

[  296.815617] glfs_iotwr000[626]: bad frame in setup_rt_frame: 00003fff76d7a720 nip 00003fff80f5a1c4 lr 00003fff81019c74


Expected results:

List all the files in the replicated volume

Additional info:

Core was generated by `/usr/sbin/glusterfsd -s 128.224.95.141 --volfile-id gv0.128.224.95.141.tmp-bric'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00003fff7f89c1c4 in _int_free (av=0x3fff68000020, p=0x3fff68025820, have_lock=0) at malloc.c:3846
3846    {
[Current thread is 1 (Thread 0x3fff7970c440 (LWP 1265))]
#0  0x00003fff7f89c1c4 in _int_free (av=0x3fff68000020, p=0x3fff68025820, have_lock=0) at malloc.c:3846
#1  0x00003fff7f95bc74 in x_inline (xdrs=<optimized out>, len=<optimized out>) at xdr_sizeof.c:88
#2  0x00003fff7fa394e8 in .xdr_gfx_iattx () from /usr/lib64/libgfxdr.so.0
#3  0x00003fff7fa39ee4 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#4  0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8a80, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#5  0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8a80, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#6  0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#7  0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8900, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#8  0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8900, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#9  0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#10 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8780, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#11 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8780, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#12 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#13 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8600, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#14 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8600, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#15 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#16 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8480, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#17 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8480, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#18 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#19 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8300, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#20 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8300, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#21 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#22 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8180, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#23 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8180, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#24 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
...
#1611 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#1612 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680b6e00, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#1613 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680b6e00, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>)
    at xdr_ref.c:135
#1614 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#1615 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff7970a300, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#1616 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff7970a300, obj_size=<optimized out>, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>)
    at xdr_ref.c:135
#1617 0x00003fff7fa3e4d8 in .xdr_gfx_readdirp_rsp () from /usr/lib64/libgfxdr.so.0
#1618 0x00003fff7f95bdd0 in __GI_xdr_sizeof (func=<optimized out>, data=<optimized out>) at xdr_sizeof.c:157
#1619 0x00003fff7a1d391c in gfs_serialize_reply () from /usr/lib64/glusterfs/6.4/xlator/protocol/server.so
#1620 0x00003fff7a1d3b78 in server_submit_reply () from /usr/lib64/glusterfs/6.4/xlator/protocol/server.so
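
For context on the recursion pattern above: in the GlusterFS protocol, a readdirp reply carries its directory entries as a linked list, and the rpcgen-generated encoder follows each nextentry pointer recursively; __GI_xdr_sizeof (frame #1618) walks the whole reply this way just to compute its length. Below is a minimal sketch of that pattern, assuming a simplified stand-in struct (the real definition lives in libgfxdr's XDR sources):

/* Sketch of the rpcgen-style recursion behind the backtrace above.
 * The struct is a simplified stand-in for the real gfx_dirplist. */
#include <rpc/rpc.h>

struct gfx_dirplist {
    u_int d_len;                     /* entry payload (simplified) */
    struct gfx_dirplist *nextentry;  /* link to the next directory entry */
};

bool_t
xdr_gfx_dirplist(XDR *xdrs, struct gfx_dirplist *objp)
{
    if (!xdr_u_int(xdrs, &objp->d_len))
        return FALSE;
    /* xdr_pointer() -> xdr_reference() -> xdr_gfx_dirplist() again:
     * one round of recursion per directory entry, with no tail-call
     * elimination, so every entry adds three nested stack frames. */
    if (!xdr_pointer(xdrs, (char **)&objp->nextentry,
                     sizeof(struct gfx_dirplist),
                     (xdrproc_t)xdr_gfx_dirplist))
        return FALSE;
    return TRUE;
}

Three frames per entry (.xdr_gfx_dirplist, xdr_pointer, xdr_reference) times roughly 550 entries comes to about 1650 frames, which matches the depth of this backtrace: the recursion scales linearly with the directory's entry count rather than with any fixed protocol bound.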

Comment 1 Ravishankar N 2020-02-17 10:44:55 UTC
Hi Liguang Li, I'm not able to reproduce this issue on v6.4. Here is what I did:

[root@vm2 glusterfs]# gluster --version
glusterfs 6.4

[root@vm2 glusterfs]# gluster vol create testvol replica 3 127.0.0.2:/bricks/brick{1..3} force
volume create: testvol: success: please start the volume to access data
[root@vm2 glusterfs]# gluster v start testvol
volume start: testvol: success
[root@vm2 glusterfs]# mount -t glusterfs 127.0.0.2:testvol /mnt/fuse_mnt
[root@vm2 glusterfs]# cd /mnt/fuse_mnt/
[root@vm2 fuse_mnt]# touch file{1..550}
[root@vm2 fuse_mnt]# ls|wc
    550     550    4292

Am I missing something in the steps? 
Are all your clients and servers on glusterfs 6.4? Was this a fresh install or did you upgrade from an earlier version?

Comment 2 Liguang Li 2020-02-19 03:17:26 UTC
This issue reproduces easily on v6.4 following your exact steps.

root@128:/# gluster --version
glusterfs 6.4

root@128:/# gdb /usr/sbin/glusterfsd ./core.638
...
Core was generated by `/usr/sbin/glusterfsd -s 128.224.95.141 --volfile-id gv0.128.224.95.141.tmp-bric'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00003fff9f5201c4 in _int_free (av=0x3fff88000020, p=0x3fff880092f0, have_lock=0) at malloc.c:3846
3846    {
[Current thread is 1 (Thread 0x3fff99390440 (LWP 648))]
(gdb) bt
#0  0x00003fff9f5201c4 in _int_free (av=0x3fff88000020, p=0x3fff880092f0, have_lock=0) at malloc.c:3846
#1  0x00003fff9f5dfc74 in x_inline (xdrs=<optimized out>, len=<optimized out>) at xdr_sizeof.c:88
#2  0x00003fff9f6bd4e8 in .xdr_gfx_iattx () from /usr/lib64/libgfxdr.so.0
#3  0x00003fff9f6bdee4 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#4  0x00003fff9f5df8d8 in __GI_xdr_reference (xdrs=0x3fff9938e040, pp=0x3fff880eacf0, size=<optimized out>, proc=<optimized out>) at xdr_ref.c:84
#5  0x00003fff9f5dfab4 in __GI_xdr_pointer (xdrs=0x3fff9938e040, objpp=0x3fff880eacf0, obj_size=<optimized out>,
...
#1642 0x00003fff9f79a3d4 in .call_resume () from /usr/lib64/libglusterfs.so.0
#1643 0x00003fff9a07e948 in ?? () from /usr/lib64/glusterfs/6.4/xlator/performance/io-threads.so
#1644 0x00003fff9f654b30 in start_thread (arg=0x3fff99390440) at pthread_create.c:462
(gdb) frame 1644
#1644 0x00003fff9f654b30 in start_thread (arg=0x3fff99390440) at pthread_create.c:462
462           THREAD_SETMEM (pd, result, pd->start_routine (pd->arg));
(gdb) p/x $r1
$1 = 0x3fff9938fa20
(gdb) frame 0
#0  0x00003fff9f5201c4 in _int_free (av=0x3fff88000020, p=0x3fff880092f0, have_lock=0) at malloc.c:3846
3846    {
(gdb) p/x $r1
$2 = 0x3fff99353080
(gdb) p $1 - $2
$3 = 248224
(gdb) disassemble
Dump of assembler code for function _int_free:
   0x00003fff903f0160 <+0>:     mflr    r0
   0x00003fff903f0164 <+4>:     std     r30,-16(r1)
   0x00003fff903f0168 <+8>:     std     r0,16(r1)
   0x00003fff903f016c <+12>:    mfcr    r12
   0x00003fff903f0170 <+16>:    std     r29,-24(r1)
   0x00003fff903f0174 <+20>:    mr      r29,r3
   0x00003fff903f0178 <+24>:    std     r31,-8(r1)
   0x00003fff903f017c <+28>:    mr      r31,r4
   0x00003fff903f0180 <+32>:    ld      r10,8(r4)
   0x00003fff903f0184 <+36>:    std     r17,-120(r1)
   0x00003fff903f0188 <+40>:    std     r18,-112(r1)
   0x00003fff903f018c <+44>:    rldicr  r30,r10,0,60
   0x00003fff903f0190 <+48>:    std     r19,-104(r1)
   0x00003fff903f0194 <+52>:    neg     r9,r30
   0x00003fff903f0198 <+56>:    std     r20,-96(r1)
   0x00003fff903f019c <+60>:    cmpld   cr7,r4,r9
   0x00003fff903f01a0 <+64>:    std     r21,-88(r1)
   0x00003fff903f01a4 <+68>:    std     r22,-80(r1)
   0x00003fff903f01a8 <+72>:    std     r23,-72(r1)
   0x00003fff903f01ac <+76>:    std     r24,-64(r1)
   0x00003fff903f01b0 <+80>:    std     r25,-56(r1)
   0x00003fff903f01b4 <+84>:    std     r26,-48(r1)
   0x00003fff903f01b8 <+88>:    std     r27,-40(r1)
   0x00003fff903f01bc <+92>:    std     r28,-32(r1)
   0x00003fff903f01c0 <+96>:    stw     r12,8(r1)
=> 0x00003fff903f01c4 <+100>:   stdu    r1,-256(r1)


Please note, we are using a PowerPC machine. Comparing the stack pointer register (r1) in frames 1644 and 0 shows that 248224 bytes of the thread's stack had already been consumed when the crash occurred.

From the disassembly, the crash happens at the "stdu r1,-256(r1)" instruction, the PowerPC prologue store that allocates _int_free's 256-byte stack frame by writing at the new, decremented stack pointer. So I suspect a stack overflow: the store landed below the mapped stack.

The source code sets the thread's stack size to 256 KB (262144 bytes), and with 248224 bytes measured between those two frames alone (the remainder presumably taken by thread start-up and TLS overhead at the top of the stack), the next frame no longer fit. Can I fix this crash by increasing the stack size?
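
For reference, a minimal sketch of how a worker thread is given an explicit stack size through the pthreads API, which is the mechanism behind IOT_THREAD_STACK_SIZE; only the pthread calls are real, the worker stub and surrounding names are illustrative:

#include <pthread.h>
#include <stdio.h>

#define IOT_THREAD_STACK_SIZE ((size_t)(256 * 1024)) /* 256 KB budget */

static void *
iot_worker_stub(void *arg)
{
    /* a real io-threads worker would dequeue and resume fops here */
    (void)arg;
    return NULL;
}

int
main(void)
{
    pthread_attr_t attr;
    pthread_t tid;

    pthread_attr_init(&attr);
    /* every frame of the fop call path must fit inside this budget */
    pthread_attr_setstacksize(&attr, IOT_THREAD_STACK_SIZE);

    if (pthread_create(&tid, &attr, iot_worker_stub, NULL) != 0) {
        perror("pthread_create");
        return 1;
    }
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}

With a fixed 256 KB budget, any code path whose depth grows with input size, like the XDR recursion above, can eventually fault in the guard page exactly as seen in frame 0.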

Comment 3 Ravishankar N 2020-02-24 12:49:46 UTC
(In reply to Liguang Li from comment #2)

> We know the stack size of the thread is 256K from the source code, can i fix
> this crash by increasing the stack size.

IOT_THREAD_STACK_SIZE (256 KB) should ideally be enough. What is surprising is that you have such a long call stack, 1644 frames, almost as if there is some bug causing looping. Can you attach the entire backtrace of the thread that crashed?

> Please note, we are using a PowerPC machine.
Hmm, I tried setting up a Fedora 31 ppc64le virtual machine using qemu on my x86_64 Fedora laptop, but I'm having trouble getting it to boot :(.

Comment 4 Liguang Li 2020-02-28 08:47:53 UTC
Created attachment 1666340 [details]
full backtrace of glusterfsd

Comment 5 Liguang Li 2020-02-28 08:49:55 UTC
> IOT_THREAD_STACK_SIZE (256) must be enough ideally. What is surprising is that you are having such a long call stack with 1644 frames, almost as if there is some bug due to looping. Can you attach the entire backtrace of the thread that crashed? 

After changing IOT_THREAD_STACK_SIZE from 256 to 512, the test passes. I then increased the number of files in the gluster volume, and no crash occurred again.
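
For reference, the constant lives in the io-threads translator header, so the change above amounts to doubling the define there (a sketch; the exact path and formatting in the v6 tree are assumed):

/* xlators/performance/io-threads/src/io-threads.h (path assumed) */
#define IOT_THREAD_STACK_SIZE ((size_t)(512 * 1024)) /* previously 256 * 1024 */

Note that this widens the margin rather than removing the root cause: the XDR recursion depth still grows with the number of directory entries, so a sufficiently large directory could in principle exhaust a 512 KB stack as well.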

Comment 6 Worker Ant 2020-03-12 12:58:37 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/978 and will be tracked there from now on. Visit the GitHub issue for further details.

