Bug 1259511 - Rebalance crashes
Summary: Rebalance crashes
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: unclassified
Version: 3.7.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Susant Kumar Palai
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-09-02 20:26 UTC by Vitaliy Margolen
Modified: 2017-03-08 10:56 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-08 10:56:10 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
Log of the crash (12.63 KB, text/plain)
2015-09-02 20:26 UTC, Vitaliy Margolen
no flags Details
Core file of the crash. (1.86 MB, application/x-gzip)
2015-09-03 12:42 UTC, Vitaliy Margolen
no flags Details

Description Vitaliy Margolen 2015-09-02 20:26:30 UTC
Created attachment 1069585 [details]
Log of the crash

Description of problem:
Attempting to remove 2 bricks from the 2x2 setup resulted in a crash. The two bricks are back now, and I've added them back into the cluster. Attempting a rebalance:
# gluster volume rebalance gv1 start

On all bricks that are part of the cluster, the rebalance process crashes (see the attached log).


Version-Release number of selected component (if applicable):
3.7.3 RPMs from gluster.org for OpenSuSE 13.2

How reproducible:
Create 2x2 volume with 4 bricks. Copy some files. Attempt to remove 2 bricks:
# gluster volume remove-brick gv1 brick1:/export brick2:/export

Create volume with 2 replicas. Copy some files. Add 2 more bricks. Attempt to rebalance:
# gluster volume rebalance gv1 start
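
The full sequence was roughly the following (hostnames and brick paths are placeholders, and the trailing "start" keyword is how the 3.7 remove-brick CLI expects to be invoked):

# gluster volume create gv1 replica 2 brick1:/export brick2:/export brick3:/export brick4:/export
# gluster volume start gv1
  ... mount the volume and copy some files ...
# gluster volume remove-brick gv1 brick1:/export brick2:/export start

and, for the rebalance case, after adding the two bricks back:

# gluster volume add-brick gv1 brick1:/export brick2:/export
# gluster volume rebalance gv1 start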


Actual results:
# gluster volume rebalance gv1 start
volume rebalance: gv1: success: Rebalance on gv1 has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 1dde3a6f-a5f2-41d0-ac49-03b4dd4ac1c6

# gluster volume status
...
Task Status of Volume gv1
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 1dde3a6f-a5f2-41d0-ac49-03b4dd4ac1c6
Status               : failed


Expected results:
Rebalance succeeding.

Additional info:
Have core files if you want them.

Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id rebalance/gv1 --xlator-option *dh'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x00007f5426d6312b in ?? ()

Comment 1 Vitaliy Margolen 2015-09-02 20:50:56 UTC
Here is the backtrace of the thread in question:
Thread 1 (Thread 0x7f542425d700 (LWP 30350)):
#0  0x00007f5426d6312b in __lll_lock_elision () from /lib64/libpthread.so.0
#1  0x00007f5427f15029 in inode_ref (inode=0x7f537e7fc700) at inode.c:545
#2  0x00007f5427ef2ada in loc_copy (dst=dst@entry=0x7f5420c99074, src=src@entry=0x7f542025aec0) at xlator.c:854
#3  0x00007f5421e18c6b in dht_local_init (frame=frame@entry=0x7f54259db724, loc=loc@entry=0x7f542025aec0, fd=fd@entry=0x0, fop=fop@entry=GF_FOP_LOOKUP) at dht-helper.c:484
#4  0x00007f5421e49ca1 in dht_lookup (frame=0x7f54259db724, this=0x7f541c00dd20, loc=0x7f542025aec0, xattr_req=0x0) at dht-common.c:2134
#5  0x00007f5427f39132 in syncop_lookup (subvol=subvol@entry=0x7f541c00dd20, loc=loc@entry=0x7f542025aec0, iatt=iatt@entry=0x7f542025abd0, parent=parent@entry=0x0, xdata_in=xdata_in@entry=0x0, 
    xdata_out=xdata_out@entry=0x0) at syncop.c:1229
#6  0x00007f5421e22137 in gf_defrag_fix_layout (this=this@entry=0x7f541c00dd20, defrag=defrag@entry=0x7f541c034ae0, loc=loc@entry=0x7f542025aec0, fix_layout=fix_layout@entry=0x7f54281bda44, 
    migrate_data=migrate_data@entry=0x7f54281bdb5c) at dht-rebalance.c:2419
#7  0x00007f5421e23433 in gf_defrag_start_crawl (data=0x7f541c00dd20) at dht-rebalance.c:2776
#8  0x00007f5427f35992 in synctask_wrap (old_task=<optimized out>) at syncop.c:381
#9  0x00007f54265cdfd0 in ?? () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()

Comment 2 Susant Kumar Palai 2015-09-03 10:07:12 UTC
Hi Vitaliy,
   Would you be able to upload the core-file?

Regards,
Susant

Comment 3 Vitaliy Margolen 2015-09-03 12:42:45 UTC
Created attachment 1069785 [details]
Core file of the crash.

Here you go.

Comment 4 Vitaliy Margolen 2015-09-27 01:12:31 UTC
Any progress on this?

Comment 5 Vitaliy Margolen 2015-10-03 14:01:40 UTC
Updated to 3.7.4 - same thing. The rebalance process crashes.

Comment 6 Vitaliy Margolen 2015-10-03 14:24:01 UTC
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id rebalance/gv1 --xlator-option *dh'.
Program terminated with signal SIGILL, Illegal instruction.
(gdb) bt
#0  0x00007fd2396d012b in __lll_lock_elision () from /lib64/libpthread.so.0
#1  0x00007fd23a881eb9 in inode_ref (inode=0x7fd198ff9700) at inode.c:545
#2  0x00007fd23a85fb0a in loc_copy (dst=dst@entry=0x7fd23405e074, src=src@entry=0x7fd22d69bec0) at xlator.c:854
#3  0x00007fd23478e11b in dht_local_init (frame=frame@entry=0x7fd238348270, loc=loc@entry=0x7fd22d69bec0, fd=fd@entry=0x0, fop=fop@entry=GF_FOP_LOOKUP) at dht-helper.c:484
#4  0x00007fd2347c0f81 in dht_lookup (frame=0x7fd238348270, this=0x7fd23000dd20, loc=0x7fd22d69bec0, xattr_req=0x0) at dht-common.c:2146
#5  0x00007fd23a8a6042 in syncop_lookup (subvol=subvol@entry=0x7fd23000dd20, loc=loc@entry=0x7fd22d69bec0, iatt=iatt@entry=0x7fd22d69bbd0, parent=parent@entry=0x0, xdata_in=xdata_in@entry=0x0, 
    xdata_out=xdata_out@entry=0x0) at syncop.c:1227
#6  0x00007fd234797657 in gf_defrag_fix_layout (this=this@entry=0x7fd23000dd20, defrag=defrag@entry=0x7fd230034ae0, loc=loc@entry=0x7fd22d69bec0, fix_layout=fix_layout@entry=0x7fd23ab2abe8, 
    migrate_data=migrate_data@entry=0x7fd23ab2aad0) at dht-rebalance.c:2427
#7  0x00007fd234798953 in gf_defrag_start_crawl (data=0x7fd23000dd20) at dht-rebalance.c:2784
#8  0x00007fd23a8a28b2 in synctask_wrap (old_task=<optimized out>) at syncop.c:380
#9  0x00007fd238f3afd0 in ?? () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()

Comment 7 Susant Kumar Palai 2015-10-05 06:50:37 UTC
Hi Vitaliy,
  Sorry for the delayed response. This looks like a crash in libc. I will get back on this after consulting someone from the libc team.

Comment 8 Carlos O'Donell 2015-10-08 13:19:27 UTC
(In reply to Susant Kumar Palai from comment #7)
> Hi Vitaliy,
>   Sorry for delayed reponse. This looks like a crash in libc. I will get
> back on this after consulting someone from libc team.

(In reply to Vitaliy Margolen from comment #6)
> Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id
> rebalance/gv1 --xlator-option *dh'.
> Program terminated with signal SIGILL, Illegal instruction.
> (gdb) bt
> #0  0x00007fd2396d012b in __lll_lock_elision () from /lib64/libpthread.so.0
> #1  0x00007fd23a881eb9 in inode_ref (inode=0x7fd198ff9700) at inode.c:545
> #2  0x00007fd23a85fb0a in loc_copy (dst=dst@entry=0x7fd23405e074,
> src=src@entry=0x7fd22d69bec0) at xlator.c:854
> #3  0x00007fd23478e11b in dht_local_init (frame=frame@entry=0x7fd238348270,
> loc=loc@entry=0x7fd22d69bec0, fd=fd@entry=0x0, fop=fop@entry=GF_FOP_LOOKUP)
> at dht-helper.c:484
> #4  0x00007fd2347c0f81 in dht_lookup (frame=0x7fd238348270,
> this=0x7fd23000dd20, loc=0x7fd22d69bec0, xattr_req=0x0) at dht-common.c:2146
> #5  0x00007fd23a8a6042 in syncop_lookup (subvol=subvol@entry=0x7fd23000dd20,
> loc=loc@entry=0x7fd22d69bec0, iatt=iatt@entry=0x7fd22d69bbd0,
> parent=parent@entry=0x0, xdata_in=xdata_in@entry=0x0, 
>     xdata_out=xdata_out@entry=0x0) at syncop.c:1227
> #6  0x00007fd234797657 in gf_defrag_fix_layout
> (this=this@entry=0x7fd23000dd20, defrag=defrag@entry=0x7fd230034ae0,
> loc=loc@entry=0x7fd22d69bec0, fix_layout=fix_layout@entry=0x7fd23ab2abe8, 
>     migrate_data=migrate_data@entry=0x7fd23ab2aad0) at dht-rebalance.c:2427
> #7  0x00007fd234798953 in gf_defrag_start_crawl (data=0x7fd23000dd20) at
> dht-rebalance.c:2784
> #8  0x00007fd23a8a28b2 in synctask_wrap (old_task=<optimized out>) at
> syncop.c:380
> #9  0x00007fd238f3afd0 in ?? () from /lib64/libc.so.6
> #10 0x0000000000000000 in ?? ()

This is either a bug in glusterfs locking code (undefined behaviour which previously worked but under elision triggers a failure) or a defect in the OpenSUSE glibc POSIX Lock Elision feature (supported by Intel TSX).
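
To illustrate the first possibility with a hypothetical example (this is not GlusterFS code): the classic pattern that happens to work with the ordinary futex-based lock path, but is undefined behaviour and can crash inside the elision code paths, is unlocking a mutex the thread does not actually hold:

#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

int main (void)
{
  pthread_mutex_lock (&m);
  pthread_mutex_unlock (&m);
  /* Undefined behaviour: the mutex is no longer held.  The normal
     lock implementation silently tolerates this, but with elision
     the unlock can run outside any transaction and the process
     dies inside the __lll_*_elision helpers.  */
  pthread_mutex_unlock (&m);
  return 0;
}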

You have three options and one potential workaround:

Options:

(1) Open a bug against the opensuse 13.2 glibc and have them look at the bug. You'll need to provide them with a core dump, and hopefully a reproducer so they can look at the issue.

(2) Reproduce the problem under Fedora or RHEL so that I can look at the issue more closely. I expect you won't be able to reproduce on stable Fedora or RHEL because we don't enable elision since we consider the feature experimental and unstable.

(3) Reproduce the issue under upstream glibc and file an upstream bug for Intel (Andi Kleen) and others like Red Hat (myself) to look at. This is not a recommended course of action because it really requires an expert to set up such a reproducer. Normally to test this out we'd use Fedora Rawhide (which tracks upstream glibc).

Workaround:

The opensuse glibc build already ships a no-elision copy of libpthread (under /lib64/noelision). You should be able to do this:

LD_PRELOAD=/lib64/noelision/libpthread.so.0 ./myapplication

This forces the application to start with a libpthread that has elision disabled. I warn you, though, that helper processes might not inherit that environment variable, and such processes would again be running with elision enabled.
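
Since the rebalance process here is spawned by glusterd rather than from your shell, one way to make the override stick (a sketch only; the unit name glusterd.service is an assumption about your setup) is a systemd drop-in that sets the variable for the management daemon and everything it spawns:

# mkdir -p /etc/systemd/system/glusterd.service.d
# cat > /etc/systemd/system/glusterd.service.d/noelision.conf <<'EOF'
[Service]
Environment=LD_PRELOAD=/lib64/noelision/libpthread.so.0
EOF
# systemctl daemon-reload
# systemctl restart glusterd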

I hope that helps.

Comment 9 Susant Kumar Palai 2015-10-09 06:20:11 UTC
Thanks Carlos for the update. :)

Susant

Comment 10 Vitaliy Margolen 2015-10-09 13:25:41 UTC
Thanks for the update!

Will try some tests to possibly narrow the issue down. Probably won't be able to replicate this under RHEL. Don't have free hardware to install it on.

Comment 11 Carlos O'Donell 2015-10-09 16:26:02 UTC
(In reply to Carlos O'Donell from comment #8)
> LD_PRELOAD=/lib64/noelision/libpthread.so.0 ./myapplication

Alternatively, edit /etc/ld.so.conf, add "/lib64/noelision" as the first line, and rerun ldconfig as root. This will ensure the dynamic loader searches /lib64/noelision first for all processes.
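
For example (a sketch of those steps; back up the file first):

# cp /etc/ld.so.conf /etc/ld.so.conf.bak
# sed -i '1i /lib64/noelision' /etc/ld.so.conf
# ldconfig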

Comment 12 Kaushal 2017-03-08 10:56:10 UTC
This bug is being closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.

