Bug 1010638
Summary: | qemu-kvm crash during fio test with gluster native mode | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Xiaomei Gao <xigao>
Component: | glusterfs | Assignee: | Bug Updates Notification Mailing List <rhs-bugs>
Status: | CLOSED WONTFIX | QA Contact: | storage-qa-internal <storage-qa-internal>
Severity: | high | Docs Contact: |
Priority: | medium | |
Version: | 7.0 | CC: | areis, chrisw, juzhang, virt-maint, wquan, yama
Target Milestone: | rc | Keywords: | Triaged
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-12-15 07:27:30 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description (Xiaomei Gao, 2013-09-22 08:30:42 UTC)
Hit a different core dump and am not sure if it is the same issue:

```
(gdb) bt
#0  0x00007f79ec757b6a in object_dynamic_cast_assert (obj=obj@entry=0x7f79ed3c3e18 <ioport_read_table+1138488>,
    typename=typename@entry=0x7f79ec8e4c2a "virtio-device",
    file=file@entry=0x7f79ec8f8a00 "/builddir/build/BUILD/qemu-1.5.3/hw/block/virtio-blk.c", line=line@entry=45,
    func=func@entry=0x7f79ec8f8b10 <__func__.27992> "virtio_blk_req_complete") at qom/object.c:438
#1  0x00007f79ec7af9a5 in virtio_blk_req_complete (req=req@entry=0x7f79f148bdf0, status=status@entry=0)
    at /usr/src/debug/qemu-1.5.3/hw/block/virtio-blk.c:45
#2  0x00007f79ec7afbb1 in virtio_blk_rw_complete (opaque=0x7f79f148bdf0, ret=<optimized out>)
    at /usr/src/debug/qemu-1.5.3/hw/block/virtio-blk.c:85
#3  0x00007f79ec634912 in bdrv_co_em_bh (opaque=0x7f79f4bc8830) at block.c:4137
#4  0x00007f79ec624f9a in aio_bh_poll (ctx=ctx@entry=0x7f79ef2d3a00) at async.c:70
#5  0x00007f79ec624b68 in aio_poll (ctx=0x7f79ef2d3a00, blocking=blocking@entry=false) at aio-posix.c:185
#6  0x00007f79ec624e90 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>,
    user_data=<optimized out>) at async.c:167
#7  0x00007f79ebc9de06 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#8  0x00007f79ec72077a in glib_pollfds_poll () at main-loop.c:187
#9  os_host_main_loop_wait (timeout=<optimized out>) at main-loop.c:232
#10 main_loop_wait (nonblocking=<optimized out>) at main-loop.c:464
#11 0x00007f79ec620241 in main_loop () at vl.c:2060
#12 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4451
```

Paolo Bonzini (comment 3):
Please try to reproduce after doing "export MALLOC_PERTURB_=42" in the shell.

Junyi:
(In reply to Paolo Bonzini from comment #3)
> Please try to reproduce after doing "export MALLOC_PERTURB_=42" in the shell.

Hi Xiaomei, would you please reply to comment 3?

Best Regards,
Junyi

Xiaomei Gao:
(In reply to Paolo Bonzini from comment #3)
> Please try to reproduce after doing "export MALLOC_PERTURB_=42" in the shell.

- Tried it on qemu-kvm-1.5.3-19.el7.x86_64 and glusterfs-3.4.0.40rhs-2.el7.x86_64 with "export MALLOC_PERTURB_=42" in the shell.
- I didn't hit the memory leak that caused the qemu-kvm core dump, but qemu reports "main-loop: WARNING: I/O thread spun for 1000 iterations", and the guest reports the following:

```
[   30.712106] Bridge firewalling registered
[   43.071494] EXT4-fs (vdb): mounted filesystem with ordered data mode. Opts: (null)
[   49.316813] nr_pdflush_threads exported in /proc is scheduled for removal
[   49.320518] sysctl: The scan_unevictable_pages sysctl/node-interface has been disabled for lack of a legitimate use case. If you have one, please send an email to linux-mm.
[  142.512487] hrtimer: interrupt took 3901691 ns
```

Jeff Cody (comment 6):
I've been unable to reproduce this bug. I've left it running overnight with 16 threads, up through 64 threads. Using MALLOC_PERTURB_ has not had any effect for me.

I noticed the reproduced ratio is 1/4 - has this been seen more than once? I.e., can it be reproduced easily? If so, could you try it with glusterfs 3.4.1 instead of 3.4.0, and see if it still occurs?

Xiaomei Gao:
(In reply to Jeff Cody from comment #6)
> I've been unable to reproduce this bug. [...]

We tested it with the following parameter permutations and combinations. A full run takes about one hour, but we hit the issue at around the 20th-30th minute:

```
rw      = "read write randread randwrite randrw"
bs      = "4k 16k 64k 256k"
iodepth = "1 8 64"
threads = "16"
```

> I noticed the reproduced ratio is 1/4 - has this been seen more than once?
> I.e., can it be reproduced easily?

It happened more than once. After running for about 20-30 minutes, we could see only ~200M of free memory left on the host, which finally leads to a qemu core dump. Sure, we will test against glusterfs 3.4.1 and see if the issue still happens.
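[Editor's note] The reproduction recipe described above (MALLOC_PERTURB_ plus the fio parameter sweep) can be sketched as a shell loop. This is a hedged illustration only: the commented-out fio command line, device path, and runtime are assumptions for illustration, not the harness's actual invocation.

```shell
#!/bin/sh
# Poison freed heap memory so stale reads surface as corruption (per comment 3).
export MALLOC_PERTURB_=42

# Parameter sweep from the report: 5 x 4 x 3 = 60 combinations.
rw_list="read write randread randwrite randrw"
bs_list="4k 16k 64k 256k"
iodepth_list="1 8 64"

count=0
for rw in $rw_list; do
    for bs in $bs_list; do
        for iodepth in $iodepth_list; do
            count=$((count + 1))
            # Inside the guest, the real run would invoke something like
            # (hypothetical flags, not the harness's exact command):
            #   fio --name=job --filename=/dev/vdb --rw="$rw" --bs="$bs" \
            #       --iodepth="$iodepth" --numjobs=16 --direct=1 --runtime=60
            echo "combo $count: rw=$rw bs=$bs iodepth=$iodepth"
        done
    done
done
echo "total combinations: $count"
```

With a ~60-second runtime per combination, the full sweep lands near the one-hour figure mentioned in the comment.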
Xiaomei Gao (comment 8):
(In reply to Jeff Cody from comment #6)
> [...] could you try it with glusterfs 3.4.1 instead of 3.4.0, and see if it still occurs?

I couldn't find glusterfs 3.4.1 in Brew. Could you help build one, or is there another way to get it?

Thanks
Xiaomei

Jeff Cody:
(In reply to Xiaomei Gao from comment #8)
> I couldn't find glusterfs 3.4.1 in Brew. [...]

Hi Xiaomei,

I've been able to reproduce this now with gluster v3.4.0 and 3.4.1, so I don't need any testing with 3.4.1. The key is the memory leak - as fio keeps running, more memory is leaked by qemu.

Reply:
None of the above backtraces indicate the crash is because of a memory leak. There may be a memory leak, but it doesn't look like the cause of the crash. For example:

```
#3  0x00007f4bb5368f47 in malloc_printerr (action=<optimized out>,
    str=0x7f4bb5468f78 "corrupted double-linked list", ptr=<optimized out>) at malloc.c:4951
```

does not look like a crash from a memory leak. Similarly,

```
#0  0x00007f79ec757b6a in object_dynamic_cast_assert (obj=obj@entry=0x7f79ed3c3e18 <ioport_read_table+1138488>,
    typename=typename@entry=0x7f79ec8e4c2a "virtio-device",
    file=file@entry=0x7f79ec8f8a00 "/builddir/build/BUILD/qemu-1.5.3/hw/block/virtio-blk.c", line=line@entry=45,
    func=func@entry=0x7f79ec8f8b10 <__func__.27992> "virtio_blk_req_complete") at qom/object.c:438
```

does not look like a memory-leak-related crash either. These look like memory corruption issues (unclear yet whether the issue is on the gluster side or the qemu side). Am I missing something?

We can't tell for sure whether the crash is caused by a leak or not, so I'm changing the summary.

Junyi:
Hi Xiaomei, can you reply to Jeff's comment?

Best Regards,
Junyi

Jeff Cody:
Using the QE machine from Xiaomei, I was able to reproduce this with autotest. I turned on core dumps and examined the resulting core. It shows a backtrace similar to the one in comment #0:

```
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f90d7fff700 (LWP 12903))]
#0  0x00007f9204568989 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f9204568989 in raise () from /lib64/libc.so.6
#1  0x00007f920456a098 in abort () from /lib64/libc.so.6
#2  0x00007f92045a90e7 in __libc_message () from /lib64/libc.so.6
#3  0x00007f92045aefb7 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f92045b05ea in _int_free () from /lib64/libc.so.6
#5  0x00007f9204079942 in synctask_destroy () from /lib64/libglusterfs.so.0
#6  0x00007f920407a770 in syncenv_processor () from /lib64/libglusterfs.so.0
#7  0x00007f920791ddf3 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f92046293cd in clone () from /lib64/libc.so.6
(gdb) thread 4
[Switching to thread 4 (Thread 0x7f920989ba40 (LWP 12216))]
#0  0x00007f92046638c2 in _dl_addr () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f92046638c2 in _dl_addr () from /lib64/libc.so.6
#1  0x00007f920463d915 in backtrace_symbols_fd () from /lib64/libc.so.6
#2  0x00007f92045a913f in __libc_message () from /lib64/libc.so.6
#3  0x00007f92045aefb7 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f92045b2062 in _int_malloc () from /lib64/libc.so.6
#5  0x00007f92045b3efa in calloc () from /lib64/libc.so.6
#6  0x00007f920407a249 in synctask_new () from /lib64/libglusterfs.so.0
#7  0x00007f920730175b in glfs_pwritev_async () from /lib64/libgfapi.so.0
#8  0x00007f9209999b56 in qemu_gluster_aio_rw (qiov=0x7f920cc08130, qiov=0x7f920cc08130, write=1,
    opaque=<optimized out>, cb=<optimized out>, nb_sectors=32, sector_num=18397760, bs=<optimized out>)
    at block/gluster.c:566
#9  qemu_gluster_aio_writev (bs=<optimized out>, sector_num=18397760, qiov=0x7f920cc08130, nb_sectors=32,
    cb=<optimized out>, opaque=<optimized out>) at block/gluster.c:608
#10 0x00007f920998de8a in bdrv_co_io_em (bs=0x7f920abbf0c0, sector_num=18397760, nb_sectors=32,
    iov=0x7f920cc08130, is_write=<optimized out>) at block.c:4616
#11 0x00007f9209992b4a in bdrv_aligned_pwritev (flags=0, qiov=0x7f920cc08130, bytes=16384,
    offset=<optimized out>, req=0x7f920fe51730, bs=0x7f920abbf0c0) at block.c:3018
#12 bdrv_co_do_pwritev (bs=0x7f920abbf0c0, offset=<optimized out>, bytes=16384, qiov=<optimized out>,
    flags=(unknown: 0)) at block.c:3147
#13 0x00007f9209992b4a in bdrv_aligned_pwritev (flags=0, qiov=0x7f920cc08130, bytes=16384,
    offset=<optimized out>, req=0x7f920fe51870, bs=0x7f920abbae70) at block.c:3018
#14 bdrv_co_do_pwritev (bs=0x7f920abbae70, offset=<optimized out>, bytes=16384, qiov=<optimized out>,
    flags=(unknown: 0)) at block.c:3147
#15 0x00007f92099935d4 in bdrv_co_do_writev (flags=<optimized out>, qiov=<optimized out>,
    nb_sectors=<optimized out>, sector_num=<optimized out>, bs=<optimized out>) at block.c:3171
#16 bdrv_co_do_rw (opaque=0x7f920daf9580) at block.c:4470
#17 0x00007f92099c701a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>)
    at coroutine-ucontext.c:118
#18 0x00007f920457a570 in ?? () from /lib64/libc.so.6
#19 0x00007fff59d6d630 in ?? ()
#20 0x0000000000000000 in ?? ()
(gdb) frame 8
#8  0x00007f9209999b56 in qemu_gluster_aio_rw (qiov=0x7f920cc08130, qiov=0x7f920cc08130, write=1,
    opaque=<optimized out>, cb=<optimized out>, nb_sectors=32, sector_num=18397760, bs=<optimized out>)
    at block/gluster.c:566
566         ret = glfs_pwritev_async(s->fd, qiov->iov, qiov->niov, offset, 0,
(gdb) print s->fd
$11 = (struct glfs_fd *) 0x7f920abec890
(gdb) print *qiov
$12 = {iov = 0x7f920cc04128, niov = 4, nalloc = -1, size = 16384}
(gdb) print offset
$13 = 9419653120
(gdb) print *acb
$14 = {common = {aiocb_info = 0x7f9209eb0040 <gluster_aiocb_info>, bs = 0x7f920abbf0c0,
    cb = 0x7f920998de10 <bdrv_co_io_em_complete>, opaque = 0x7f920fe51650}, size = 16384, ret = 0,
    finished = 0x0, bh = 0x0}
```

From the backtraces above (and in comment #0), this looks to be a glusterfs issue, with memory corruption occurring across threads in libglusterfs.

Note for reproducing: I had some runs where the autotest ran to completion (2-3 hours?) without failure. I was able to get two failures on the QE machine, and both occurred around the 45-60 minute mark.

This request was resolved in Red Hat Enterprise Linux 7.0. Contact your manager or support representative in case you have further questions about the request.

The comment above is incorrect. The correct version is below. I'm sorry for any inconvenience.

This request was NOT resolved in Red Hat Enterprise Linux 7.0. Contact your manager or support representative in case you need to escalate this bug.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
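[Editor's note] The core-dump triage shown in the comments (thread backtraces, then frame inspection) can be automated with gdb's batch mode, which is useful when a crash appears only after 45-60 minutes of unattended fio runs. A hedged sketch; the core file path and qemu-kvm binary location are assumptions for illustration, and the actual path depends on the host's core_pattern setting.

```shell
#!/bin/sh
# Non-interactive extraction of per-thread backtraces from a qemu-kvm core.
CORE=/var/crash/qemu-kvm.core                           # hypothetical path
BIN=$(command -v qemu-kvm || echo /usr/libexec/qemu-kvm) # hypothetical fallback

if [ -f "$CORE" ]; then
    # "thread apply all bt" dumps every thread's stack, which is how the
    # cross-thread libglusterfs corruption was spotted in the comments above.
    gdb --batch \
        -ex "info threads" \
        -ex "thread apply all bt" \
        "$BIN" "$CORE" > qemu-core-backtraces.txt
    echo "backtraces written to qemu-core-backtraces.txt"
else
    echo "no core file at $CORE; enable cores first with: ulimit -c unlimited"
fi
```

Run after a crash, this leaves a text file that can be attached to the bug instead of pasting an interactive gdb session.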