Description of problem:
-----------------------
Creation of a sparse qcow2 image file with qemu-img (which uses the libgfapi
glusterfs driver in qemu) segfaults.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHGS 3.2.0 (interim build - glusterfs-3.8.4-1.el7rhgs)
RHEL 7.2
qemu-kvm-1.5.3-105.el7_2.7.x86_64
qemu-img-1.5.3-105.el7_2.7.x86_64

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Create a qcow2 image file on the gluster volume (replica 3 or arbiter volume)

Actual results:
---------------
qemu-img slows down initially and finally segfaults

Expected results:
-----------------
qcow2 image creation should be successful using the glusterfs-gfapi driver available in qemu
I could see this issue with:
1. replica 3 volume
2. arbiter volume
3. distribute volume

Looks like the issue lies with libgfapi, and not with any particular volume type.
Seeing the following bt. Backtrace of the thread (that segfaults) doesn't have enough information.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f3c95fb6700 (LWP 3324)]
0x00007f3c963f26f3 in ?? ()
(gdb) bt
#0  0x00007f3c963f26f3 in ?? ()
#1  0x0000000000000000 in ?? ()
(gdb) thread apply all bt

Thread 11 (Thread 0x7f3ca2890700 (LWP 3317)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f3ca5fd69d8 in syncenv_task (proc=proc@entry=0x7f3caeba4040) at syncop.c:603
#2  0x00007f3ca5fd7820 in syncenv_processor (thdata=0x7f3caeba4040) at syncop.c:695
#3  0x00007f3ca8cbddc5 in start_thread (arg=0x7f3ca2890700) at pthread_create.c:308
#4  0x00007f3ca89eaced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 10 (Thread 0x7f3ca208f700 (LWP 3318)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f3ca5fd69d8 in syncenv_task (proc=proc@entry=0x7f3caeba4400) at syncop.c:603
#2  0x00007f3ca5fd7820 in syncenv_processor (thdata=0x7f3caeba4400) at syncop.c:695
#3  0x00007f3ca8cbddc5 in start_thread (arg=0x7f3ca208f700) at pthread_create.c:308
#4  0x00007f3ca89eaced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 9 (Thread 0x7f3c95fb6700 (LWP 3324)):
#0  0x00007f3c963f26f3 in ?? ()
#1  0x0000000000000000 in ?? ()

Thread 8 (Thread 0x7f3c99765700 (LWP 3326)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f3ca5fd69d8 in syncenv_task (proc=proc@entry=0x7f3cb0d24040) at syncop.c:603
#2  0x00007f3ca5fd7820 in syncenv_processor (thdata=0x7f3cb0d24040) at syncop.c:695
#3  0x00007f3ca8cbddc5 in start_thread (arg=0x7f3c99765700) at pthread_create.c:308
#4  0x00007f3ca89eaced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 7 (Thread 0x7f3c98560700 (LWP 3327)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f3ca5fd69d8 in syncenv_task (proc=proc@entry=0x7f3cb0d24400) at syncop.c:603
#2  0x00007f3ca5fd7820 in syncenv_processor (thdata=0x7f3cb0d24400) at syncop.c:695
#3  0x00007f3ca8cbddc5 in start_thread (arg=0x7f3c98560700) at pthread_create.c:308
#4  0x00007f3ca89eaced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 6 (Thread 0x7f3c95eb5700 (LWP 3328)):
#0  0x00007f3ca8cc496d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f3ca5fab816 in gf_timer_proc (data=0x7f3caea26640) at timer.c:176
#2  0x00007f3ca8cbddc5 in start_thread (arg=0x7f3c95eb5700) at pthread_create.c:308
#3  0x00007f3ca89eaced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 5 (Thread 0x7f3c97934700 (LWP 3329)):
#0  0x00007f3ca8cbeef7 in pthread_join (threadid=139898209384192, thread_return=thread_return@entry=0x0) at pthread_join.c:92
#1  0x00007f3ca5ff83b8 in event_dispatch_epoll (event_pool=0x7f3cb0d2e040) at event-epoll.c:758
#2  0x00007f3cab349c64 in glfs_poller (data=<optimized out>) at glfs.c:612
#3  0x00007f3ca8cbddc5 in start_thread (arg=0x7f3c97934700) at pthread_create.c:308
#4  0x00007f3ca89eaced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 4 (Thread 0x7f3c97133700 (LWP 3330)):
#0  0x00007f3ca89eb2c3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f3ca5ff7e10 in event_dispatch_epoll_worker (data=0x7f3caea0ba20) at event-epoll.c:664
#2  0x00007f3ca8cbddc5 in start_thread (arg=0x7f3c97133700) at pthread_create.c:308
#3  0x00007f3ca89eaced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 3 (Thread 0x7f3c9442c700 (LWP 3331)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f3c948686f3 in iot_worker (data=0x7f3cb0382200) at io-threads.c:176
#2  0x00007f3ca8cbddc5 in start_thread (arg=0x7f3c9442c700) at pthread_create.c:308
#3  0x00007f3ca89eaced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 2 (Thread 0x7f3c9432b700 (LWP 3332)):
#0  0x00007f3ca89eb2c3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f3ca5ff7e10 in event_dispatch_epoll_worker (data=0x7f3cb0f363a0) at event-epoll.c:664
#2  0x00007f3ca8cbddc5 in start_thread (arg=0x7f3c9432b700) at pthread_create.c:308
#3  0x00007f3ca89eaced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 1 (Thread 0x7f3cac0058c0 (LWP 3316)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f3ca5fd866b in syncop_lookup (subvol=subvol@entry=0x7f3cb3404c40, loc=loc@entry=0x7f3caeb41a20, iatt=iatt@entry=0x7f3caeb41b50, parent=parent@entry=0x0, xdata_in=xdata_in@entry=0x0, xdata_out=xdata_out@entry=0x0) at syncop.c:1223
#2  0x00007f3cab35b88f in glfs_resolve_base (fs=fs@entry=0x7f3caeb46000, subvol=subvol@entry=0x7f3cb3404c40, inode=inode@entry=0x7f3cb18ec05c, iatt=iatt@entry=0x7f3caeb41b50) at glfs-resolve.c:225
#3  0x00007f3cab35c09a in priv_glfs_resolve_at (fs=0x7f3caeb46000, subvol=0x7f3cb3404c40, at=at@entry=0x0, origpath=origpath@entry=0x7f3cab361c4e "/", loc=loc@entry=0x7f3caeb41ca0, iatt=iatt@entry=0x7f3caeb41ce0, follow=follow@entry=1, reval=reval@entry=0) at glfs-resolve.c:404
#4  0x00007f3cab35d63c in glfs_resolve_path (fs=fs@entry=0x7f3caeb46000, subvol=subvol@entry=0x7f3cb3404c40, origpath=origpath@entry=0x7f3cab361c4e "/", loc=loc@entry=0x7f3caeb41ca0, iatt=iatt@entry=0x7f3caeb41ce0, follow=follow@entry=1, reval=reval@entry=0) at glfs-resolve.c:530
#5  0x00007f3cab35d6d3 in priv_glfs_resolve (fs=fs@entry=0x7f3caeb46000, subvol=subvol@entry=0x7f3cb3404c40, origpath=origpath@entry=0x7f3cab361c4e "/", loc=loc@entry=0x7f3caeb41ca0, iatt=iatt@entry=0x7f3caeb41ce0, reval=reval@entry=0) at glfs-resolve.c:557
#6  0x00007f3cab359df9 in pub_glfs_chdir (fs=fs@entry=0x7f3caeb46000, path=path@entry=0x7f3cab361c4e "/") at glfs-fops.c:3971
#7  0x00007f3cab34b144 in pub_glfs_init (fs=fs@entry=0x7f3caeb46000) at glfs.c:1003
#8  0x00007f3cac0477b3 in qemu_gluster_init (gconf=gconf@entry=0x7f3cae9fa2d0, filename=<optimized out>) at block/gluster.c:219
#9  0x00007f3cac047a03 in qemu_gluster_open (bs=<optimized out>, options=0x7f3cb0d29200, bdrv_flags=66, errp=<optimized out>) at block/gluster.c:341
#10 0x00007f3cac03d0b0 in bdrv_open_common (bs=bs@entry=0x7f3cb0406000, file=file@entry=0x0, options=options@entry=0x7f3cb0d29200, flags=flags@entry=2, drv=drv@entry=0x7f3cac2dce80 <bdrv_gluster>, errp=0x7f3caeb41ea0) at block.c:836
#11 0x00007f3cac042194 in bdrv_file_open (pbs=pbs@entry=0x7f3caeb41f38, filename=filename@entry=0x7f3cae9fa030 "gluster://10.70.37.104/distvol/test3.img", options=0x7f3cb0d29200, options@entry=0x0, flags=flags@entry=2, errp=errp@entry=0x7f3caeb41f40) at block.c:972
#12 0x00007f3cac057850 in qcow2_create2 (errp=0x7f3caeb41f30, version=3, prealloc=<optimized out>, cluster_size=65536, flags=0, backing_format=0x0, backing_file=0x0, total_size=2097152, filename=0x7f3cae9fa030 "gluster://10.70.37.104/distvol/test3.img") at block/qcow2.c:1677
#13 qcow2_create (filename=0x7f3cae9fa030 "gluster://10.70.37.104/distvol/test3.img", options=<optimized out>, errp=0x7f3caeb41f90) at block/qcow2.c:1856
#14 0x00007f3cac03bbd9 in bdrv_create_co_entry (opaque=0x7fff3e460170) at block.c:393
#15 0x00007f3cac077a1a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at coroutine-ucontext.c:118
#16 0x00007f3ca893b110 in ?? () from /lib64/libc.so.6
#17 0x00007fff3e45f9e0 in ?? ()
#18 0x0000000000000000 in ?? ()
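For context on what Thread 1 is doing: frames #8-#13 show qemu-img's gluster block driver tearing the image filename apart into server, volume, and path before handing them to glfs_init. The gluster:// URI takes the form gluster://host[:port]/volume/path. The sketch below illustrates that decomposition only; parse_gluster_uri is a hypothetical helper written for this bug report, not qemu's actual parser (which lives in block/gluster.c), and the default management port of 24007 is an assumption taken from common gluster deployments.

```python
from urllib.parse import urlparse

def parse_gluster_uri(uri):
    """Split gluster://host[:port]/volume/path into its components.

    Hypothetical helper for illustration; qemu's real parsing is done
    in C inside block/gluster.c.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "gluster":
        raise ValueError("not a gluster URI: %s" % uri)
    # The first path component is the volume name; the remainder is the
    # image path inside that volume.
    volume, _, image = parsed.path.lstrip("/").partition("/")
    return {
        "host": parsed.hostname,
        "port": parsed.port or 24007,  # assumed default glusterd port
        "volume": volume,
        "image": "/" + image,
    }

# The filename from frame #11 of the backtrace:
print(parse_gluster_uri("gluster://10.70.37.104/distvol/test3.img"))
```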
To add to that information, this issue is not seen while creating a raw image on a gluster volume of any type.
You mentioned a replica-3 volume; does this occur with a replica-2 volume?

I tried reproducing this with gluster 3.8.4 and a local build of qemu-img-1.5.3-105.el7, and I did not run into this issue. However, my test gluster volume is as follows:

gluster volume info gv0

Volume Name: gv0
Type: Replicate
Volume ID: 6bcb7964-0594-4801-a60b-22dae7f871f6
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 192.168.15.180:/mnt/brick1/brick
Brick2: 192.168.15.180:/mnt/brick2/brick
Options Reconfigured:
performance.readdir-ahead: on

Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.15.180:/mnt/brick1/brick      49157     0          Y       7929
Brick 192.168.15.180:/mnt/brick2/brick      49158     0          Y       7930
NFS Server on localhost                     N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       7916

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

Creating the image:

qemu-img create -f qcow2 gluster://192.168.15.180/gv0/test-bz.qcow2 5G
Formatting 'gluster://192.168.15.180/gv0/test-bz.qcow2', fmt=qcow2 size=5368709120 encryption=off cluster_size=65536 lazy_refcounts=off
[2016-09-28 03:29:42.693494] E [MSGID: 108006] [afr-common.c:4316:afr_notify] 0-gv0-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2016-09-28 03:29:43.812529] E [MSGID: 108006] [afr-common.c:4316:afr_notify] 0-gv0-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2016-09-28 03:29:44.706092] E [MSGID: 108006] [afr-common.c:4316:afr_notify] 0-gv0-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.

Verifying the image:

qemu-img info gluster://192.168.15.180/gv0/test-bz.qcow2
[2016-09-28 03:30:19.361722] E [MSGID: 108006] [afr-common.c:4316:afr_notify] 0-gv0-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
image: gluster://192.168.15.180/gv0/test-bz.qcow2
file format: qcow2
virtual size: 5.0G (5368709120 bytes)
disk size: 193K
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false
(In reply to Jeff Cody from comment #7)
> You mentioned a replica-3 volume; does this occur with a replica-2 volume?
> 
> I tried reproducing this with gluster 3.8.4, and a local build of
> qemu-img-1.5.3-105.el7, and I did not run into this issue. However, my test
> gluster volume is as follows:

I tried with a replica 2 volume as well.
I am hitting the same issue.

@Jeff, are you using the upstream gluster 3.8.4?
I am talking about the interim RHGS downstream build - glusterfs-3.8.4-1.el7rhgs on the server, and glusterfs-3.8.4-1.el7 on the client.
(In reply to SATHEESARAN from comment #9)
> (In reply to Jeff Cody from comment #7)
> > You mentioned a replica-3 volume; does this occur with a replica-2 volume?
> > 
> > I tried reproducing this with gluster 3.8.4, and a local build of
> > qemu-img-1.5.3-105.el7, and I did not run into this issue. However, my test
> > gluster volume is as follows:
> 
> I tried with replica 2 volume as well.
> I am hitting the same issue.
> 
> @Jeff, are you using the upstream gluster 3.8.4 ?
> I am talking about the interim RHGS downstream build -
> glusterfs-3.8.4-1.el7rhgs on server, and glusterfs-3.8.4-1.el7 on client

Yes, I was using the upstream gluster 3.8.4. I will retest with glusterfs-3.8.4-1.el7 and glusterfs-3.8.4-1.el7rhgs.
This reminds me of bug 1350789, which should have been fixed with glusterfs-3.8.1 (and hence in the RHGS-3.2 packages). I am not aware of any backports that could have re-introduced this, though.

The easiest might be to reproduce the problem on a volume that consists of a single brick. If someone has a system available where this happens, please let us know here so that we can debug it a little quicker (make sure that all needed -debuginfo RPMs are installed too).
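For reference, a single-brick reproduction setup as suggested above might look like the following. This is a hedged sketch only: the server name, brick path, and volume name are placeholders, it assumes a running glusterd, and the debuginfo-install step assumes a yum-based system with debuginfo repositories enabled.

```shell
# Single-brick volume for easier debugging (placeholder host/path/name).
gluster volume create repro-vol server1:/bricks/repro/brick force
gluster volume start repro-vol

# Install debuginfo so gdb can resolve gluster and qemu frames.
debuginfo-install glusterfs qemu-img

# Attempt to reproduce the crash against the single-brick volume.
qemu-img create -f qcow2 gluster://server1/repro-vol/test.qcow2 1G
```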
Possibly also reported on the gluster-devel mailing list, with a suggestion of the backported patch that causes the problem:
- http://www.gluster.org/pipermail/gluster-devel/2016-October/051234.html

I am not sure if http://review.gluster.org/15585 (in upstream glusterfs-3.8.5) was backported to glusterfs-3.8.4 in RHGS-3.2.
This looks like an issue with client-io-threads enabled (bug 1381830). I see iot_worker threads -

Thread 3 (Thread 0x7f3c9442c700 (LWP 3331)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f3c948686f3 in iot_worker (data=0x7f3cb0382200) at io-threads.c:176
#2  0x00007f3ca8cbddc5 in start_thread (arg=0x7f3c9442c700) at pthread_create.c:308
#3  0x00007f3ca89eaced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

It's being discussed on the gluster ML as well. I think you need to retest with client-io-threads disabled once.
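Disabling the option for the retest suggested above would be along these lines; the volume name is a placeholder and this assumes the standard gluster CLI on a node of the trusted pool:

```shell
# Turn off the client-side io-threads translator for the volume.
gluster volume set <vol-name> performance.client-io-threads off

# Confirm the change under "Options Reconfigured".
gluster volume info <vol-name>
```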
(In reply to Soumya Koduri from comment #14)
> This looks like an issue with client-io-threads enabled (bug 1381830). I see
> iot_worker threads -
> 
> Thread 3 (Thread 0x7f3c9442c700 (LWP 3331)):
> #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
> #1  0x00007f3c948686f3 in iot_worker (data=0x7f3cb0382200) at
> io-threads.c:176
> #2  0x00007f3ca8cbddc5 in start_thread (arg=0x7f3c9442c700) at
> pthread_create.c:308
> #3  0x00007f3ca89eaced in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
> 
> It's being discussed in the gluster ML as well. I think you need to retest
> with client-io-threads disabled once.

Sas - given this crash has already been addressed and fixed with the latest build (glusterfs-3.8.4-3), can we retest this behaviour (without disabling client.io-threads, of course) and close the bug if the issue doesn't persist?
(In reply to Atin Mukherjee from comment #15)
> (In reply to Soumya Koduri from comment #14)
> > This looks like an issue with client-io-threads enabled (bug 1381830). I
> > see iot_worker threads -
> > 
> > Thread 3 (Thread 0x7f3c9442c700 (LWP 3331)):
> > #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at
> > ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
> > #1  0x00007f3c948686f3 in iot_worker (data=0x7f3cb0382200) at
> > io-threads.c:176
> > #2  0x00007f3ca8cbddc5 in start_thread (arg=0x7f3c9442c700) at
> > pthread_create.c:308
> > #3  0x00007f3ca89eaced in clone () at
> > ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
> > 
> > It's being discussed in the gluster ML as well. I think you need to retest
> > with client-io-threads disabled once.
> 
> Sas - given this crash has already been addressed and fixed with the latest
> build (glusterfs-3.8.4-3), can we retest this behaviour (without disabling
> client.io-threads, of course) and close the bug if the issue doesn't persist?

Atin,

client-io-threads is enabled by default. I have tested with and without client-io-threads, and I still see the same issue.

I have tested with the latest glusterfs downstream interim build - glusterfs-3.8.4-3.el7rhgs - and I still see the same issue.
All,

I tested with the latest downstream RHGS 3.2.0 interim build - glusterfs-3.8.4-5.el7rhgs. I am no longer seeing this issue.

Please provide the patch URL for the fix and move this bug to ON_QA with the proper fixed-in version.
The fix for BZ 1391093 also fixes this issue. Following is the corresponding downstream patch:
https://code.engineering.redhat.com/gerrit/#/c/89229/

Therefore, moving the bug to ON_QA.
Tested with RHGS 3.2.0 interim build (glusterfs-3.8.4-5.el7rhgs)

1. Created a qcow2 image on the replica 3 gluster volume:

# qemu-img create gluster://<server>/<vol-name>/vm.img 10G

I could successfully create qcow2 images.

# qemu-img check testvm.img
No errors were found on the image.
Image end offset: 262144
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html