Created attachment 1069585 [details]
Log of the crash

Description of problem:
Attempting to remove 2 bricks from the 2x2 setup resulted in a crash. With the two bricks back, I've added them back into the cluster and attempted a rebalance:

# gluster rebalance gv1 start

On all bricks that are part of the cluster the rebalance process crashes.

Version-Release number of selected component (if applicable):
3.7.3 RPMs from gluster.org for openSUSE 13.2

How reproducible:
Create a 2x2 volume with 4 bricks. Copy some files. Attempt to remove 2 bricks:

# gluster volume remove-brick gv1 brick1:/export brick2:/export

Or: create a volume with 2 replicas. Copy some files. Add 2 more bricks. Attempt to rebalance:

# gluster volume rebalance gv1 start

Actual results:
# gluster volume rebalance gv1 start
volume rebalance: gv1: success: Rebalance on gv1 has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 1dde3a6f-a5f2-41d0-ac49-03b4dd4ac1c6

# gluster volume status
...
Task Status of Volume gv1
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 1dde3a6f-a5f2-41d0-ac49-03b4dd4ac1c6
Status               : failed

Expected results:
Rebalance succeeding.

Additional info:
Have core files if you want them.

Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id rebalance/gv1 --xlator-option *dh'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x00007f5426d6312b in ?? ()
Here is the backtrace of the thread in question:

Thread 1 (Thread 0x7f542425d700 (LWP 30350)):
#0  0x00007f5426d6312b in __lll_lock_elision () from /lib64/libpthread.so.0
#1  0x00007f5427f15029 in inode_ref (inode=0x7f537e7fc700) at inode.c:545
#2  0x00007f5427ef2ada in loc_copy (dst=dst@entry=0x7f5420c99074, src=src@entry=0x7f542025aec0) at xlator.c:854
#3  0x00007f5421e18c6b in dht_local_init (frame=frame@entry=0x7f54259db724, loc=loc@entry=0x7f542025aec0, fd=fd@entry=0x0, fop=fop@entry=GF_FOP_LOOKUP) at dht-helper.c:484
#4  0x00007f5421e49ca1 in dht_lookup (frame=0x7f54259db724, this=0x7f541c00dd20, loc=0x7f542025aec0, xattr_req=0x0) at dht-common.c:2134
#5  0x00007f5427f39132 in syncop_lookup (subvol=subvol@entry=0x7f541c00dd20, loc=loc@entry=0x7f542025aec0, iatt=iatt@entry=0x7f542025abd0, parent=parent@entry=0x0, xdata_in=xdata_in@entry=0x0, xdata_out=xdata_out@entry=0x0) at syncop.c:1229
#6  0x00007f5421e22137 in gf_defrag_fix_layout (this=this@entry=0x7f541c00dd20, defrag=defrag@entry=0x7f541c034ae0, loc=loc@entry=0x7f542025aec0, fix_layout=fix_layout@entry=0x7f54281bda44, migrate_data=migrate_data@entry=0x7f54281bdb5c) at dht-rebalance.c:2419
#7  0x00007f5421e23433 in gf_defrag_start_crawl (data=0x7f541c00dd20) at dht-rebalance.c:2776
#8  0x00007f5427f35992 in synctask_wrap (old_task=<optimized out>) at syncop.c:381
#9  0x00007f54265cdfd0 in ?? () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()
Hi Vitaliy,
   Would you be able to upload the core file?

Regards,
Susant
Created attachment 1069785 [details]
Core file of the crash.

Here you go.
Any progress on this?
Updated to 3.7.4: same thing, the rebalance process crashes.
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id rebalance/gv1 --xlator-option *dh'.
Program terminated with signal SIGILL, Illegal instruction.
(gdb) bt
#0  0x00007fd2396d012b in __lll_lock_elision () from /lib64/libpthread.so.0
#1  0x00007fd23a881eb9 in inode_ref (inode=0x7fd198ff9700) at inode.c:545
#2  0x00007fd23a85fb0a in loc_copy (dst=dst@entry=0x7fd23405e074, src=src@entry=0x7fd22d69bec0) at xlator.c:854
#3  0x00007fd23478e11b in dht_local_init (frame=frame@entry=0x7fd238348270, loc=loc@entry=0x7fd22d69bec0, fd=fd@entry=0x0, fop=fop@entry=GF_FOP_LOOKUP) at dht-helper.c:484
#4  0x00007fd2347c0f81 in dht_lookup (frame=0x7fd238348270, this=0x7fd23000dd20, loc=0x7fd22d69bec0, xattr_req=0x0) at dht-common.c:2146
#5  0x00007fd23a8a6042 in syncop_lookup (subvol=subvol@entry=0x7fd23000dd20, loc=loc@entry=0x7fd22d69bec0, iatt=iatt@entry=0x7fd22d69bbd0, parent=parent@entry=0x0, xdata_in=xdata_in@entry=0x0, xdata_out=xdata_out@entry=0x0) at syncop.c:1227
#6  0x00007fd234797657 in gf_defrag_fix_layout (this=this@entry=0x7fd23000dd20, defrag=defrag@entry=0x7fd230034ae0, loc=loc@entry=0x7fd22d69bec0, fix_layout=fix_layout@entry=0x7fd23ab2abe8, migrate_data=migrate_data@entry=0x7fd23ab2aad0) at dht-rebalance.c:2427
#7  0x00007fd234798953 in gf_defrag_start_crawl (data=0x7fd23000dd20) at dht-rebalance.c:2784
#8  0x00007fd23a8a28b2 in synctask_wrap (old_task=<optimized out>) at syncop.c:380
#9  0x00007fd238f3afd0 in ?? () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()
Hi Vitaliy,
   Sorry for the delayed response. This looks like a crash in libc. I will get back on this after consulting someone from the libc team.
(In reply to Susant Kumar Palai from comment #7)
> Hi Vitaliy,
>    Sorry for the delayed response. This looks like a crash in libc. I will
> get back on this after consulting someone from the libc team.

(In reply to Vitaliy Margolen from comment #6)
> Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id
> rebalance/gv1 --xlator-option *dh'.
> Program terminated with signal SIGILL, Illegal instruction.
> (gdb) bt
> #0  0x00007fd2396d012b in __lll_lock_elision () from /lib64/libpthread.so.0
> #1  0x00007fd23a881eb9 in inode_ref (inode=0x7fd198ff9700) at inode.c:545
> #2  0x00007fd23a85fb0a in loc_copy (dst=dst@entry=0x7fd23405e074,
> src=src@entry=0x7fd22d69bec0) at xlator.c:854
> #3  0x00007fd23478e11b in dht_local_init (frame=frame@entry=0x7fd238348270,
> loc=loc@entry=0x7fd22d69bec0, fd=fd@entry=0x0, fop=fop@entry=GF_FOP_LOOKUP)
> at dht-helper.c:484
> #4  0x00007fd2347c0f81 in dht_lookup (frame=0x7fd238348270,
> this=0x7fd23000dd20, loc=0x7fd22d69bec0, xattr_req=0x0) at dht-common.c:2146
> #5  0x00007fd23a8a6042 in syncop_lookup (subvol=subvol@entry=0x7fd23000dd20,
> loc=loc@entry=0x7fd22d69bec0, iatt=iatt@entry=0x7fd22d69bbd0,
> parent=parent@entry=0x0, xdata_in=xdata_in@entry=0x0,
> xdata_out=xdata_out@entry=0x0) at syncop.c:1227
> #6  0x00007fd234797657 in gf_defrag_fix_layout
> (this=this@entry=0x7fd23000dd20, defrag=defrag@entry=0x7fd230034ae0,
> loc=loc@entry=0x7fd22d69bec0, fix_layout=fix_layout@entry=0x7fd23ab2abe8,
> migrate_data=migrate_data@entry=0x7fd23ab2aad0) at dht-rebalance.c:2427
> #7  0x00007fd234798953 in gf_defrag_start_crawl (data=0x7fd23000dd20) at
> dht-rebalance.c:2784
> #8  0x00007fd23a8a28b2 in synctask_wrap (old_task=<optimized out>) at
> syncop.c:380
> #9  0x00007fd238f3afd0 in ?? () from /lib64/libc.so.6
> #10 0x0000000000000000 in ?? ()

This is either a bug in the glusterfs locking code (undefined behaviour that previously went unnoticed but triggers a failure under elision), or a defect in the openSUSE glibc POSIX lock elision feature (backed by Intel TSX).

You have three options and one potential workaround.

Options:

(1) Open a bug against the openSUSE 13.2 glibc and have them look at it. You'll need to provide them with a core dump, and ideally a reproducer, so they can investigate the issue.

(2) Reproduce the problem under Fedora or RHEL so that I can look at the issue more closely. I expect you won't be able to reproduce it on stable Fedora or RHEL, because we don't enable elision there; we consider the feature experimental and unstable.

(3) Reproduce the issue under upstream glibc and file an upstream bug for Intel (Andi Kleen) and others like Red Hat (myself) to look at. This is not a recommended course of action, because setting up such a reproducer really requires an expert. Normally we'd test this on Fedora Rawhide (which tracks glibc upstream).

Workaround:

The openSUSE glibc package already ships a no-elision build of libpthread. You should be able to run:

LD_PRELOAD=/lib64/noelision/libpthread.so.0 ./myapplication

to force the application to start with a libpthread that has elision disabled. I warn you, though, that helper processes might not inherit that environment variable, and such processes would again be running with elision enabled.

I hope that helps.
Thanks Carlos for the update. :) Susant
Thanks for the update! Will try some tests to possibly narrow the issue down. Probably won't be able to replicate this under RHEL. Don't have free hardware to install it on.
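One quick test that may help narrow things down: lock elision can only engage if the CPU exposes Intel TSX (the hle/rtm flags in /proc/cpuinfo on x86_64); on a machine without those flags the elision path should never run. A small check, assuming a Linux /proc layout:

```shell
# Look for the Intel TSX feature flags that glibc lock elision relies on.
if grep -qwE 'hle|rtm' /proc/cpuinfo; then
    tsx=present
else
    tsx=absent
fi
echo "TSX: $tsx"
```

If TSX is absent yet the crash still reproduces, elision is unlikely to be the culprit.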
(In reply to Carlos O'Donell from comment #8)
> LD_PRELOAD=/lib64/noelision/libpthread.so.0 ./myapplication

Alternatively, edit /etc/ld.so.conf, add "/lib64/noelision" as the first line, and rerun ldconfig as root. This will ensure the dynamic loader searches /lib64/noelision first for all processes.
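Since editing /etc/ld.so.conf affects every process on the system, it may be worth rehearsing the change on a scratch copy first. A sketch of the prepend (the mktemp dance is just illustration; the final step, done as root, is copying the result back and running ldconfig):

```shell
# Rehearse the /etc/ld.so.conf edit on a temporary copy.
conf=$(mktemp)
# Use the real file if readable; otherwise start from a typical stub.
cp /etc/ld.so.conf "$conf" 2>/dev/null \
    || echo "include /etc/ld.so.conf.d/*.conf" > "$conf"

# Prepend the no-elision directory so it is searched first.
printf '/lib64/noelision\n' | cat - "$conf" > "$conf.new" && mv "$conf.new" "$conf"

head -n1 "$conf"   # prints /lib64/noelision
```

When the copy looks right, install it over /etc/ld.so.conf as root and rerun ldconfig.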
This bug is getting closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS. If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.