Cause:
------
Rebalance was triggered even when a gfid mismatch was found,
and the rebalance process then crashed.
Fix:
----
Due to a race condition, the gfid obtained in readdirp and the gfid
found by lookup can differ for a given name. In that case, do not allow the rebalance.
Readdirp of an entry brings back the gfid, which is stored in the inode through inode_link. When a lookup is then done and the gfid it returns differs from the one stored in the inode, client3_3_lookup_cbk returns ESTALE, and the rebalance process captures that error.
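The check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual GlusterFS code: the type `sketch_inode` and the functions `sketch_link_gfid`, `sketch_check_lookup_gfid` and `sketch_should_rebalance` are hypothetical stand-ins for the real inode, inode_link and lookup callback logic.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; these are NOT the GlusterFS types. */
typedef unsigned char sketch_gfid_t[16];

struct sketch_inode {
    sketch_gfid_t gfid;   /* gfid recorded when the entry was first linked */
    int           linked; /* non-zero once a gfid has been stored */
};

/* Record the gfid seen by readdirp (plays the role the report gives to inode_link). */
void sketch_link_gfid(struct sketch_inode *inode, const sketch_gfid_t gfid)
{
    assert(inode != NULL);
    memcpy(inode->gfid, gfid, sizeof(sketch_gfid_t));
    inode->linked = 1;
}

/* Compare the gfid returned by a later lookup with the stored one.
 * Returns 0 when they agree (or nothing is stored yet), -ESTALE on mismatch. */
int sketch_check_lookup_gfid(const struct sketch_inode *inode,
                             const sketch_gfid_t lookup_gfid)
{
    assert(inode != NULL);
    if (inode->linked &&
        memcmp(inode->gfid, lookup_gfid, sizeof(sketch_gfid_t)) != 0)
        return -ESTALE;
    return 0;
}

/* Rebalance side: proceed with the entry only if the lookup check passed. */
int sketch_should_rebalance(const struct sketch_inode *inode,
                            const sketch_gfid_t lookup_gfid)
{
    return sketch_check_lookup_gfid(inode, lookup_gfid) == 0;
}
```

In this sketch the caller treats a non-zero result as "do not rebalance this entry", which mirrors the intent of the fix: the mismatch is reported as an error instead of being carried into the layout code.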
Description: Sachidananda Urs, 2014-05-19 09:21:45 UTC
Created attachment 897058: Log and core files
Description of problem:
Start the rebalance process after removing an existing brick and adding a new brick.
After around 30 minutes the rebalance process crashes, and the rebalance status is shown as 'failed'.
[root@g60ds-2 ~]# gluster volume rebalance sixtydrive status
Node          Rebalanced-files    size      scanned    failures    skipped    status    run time in secs
-----------   ----------------    ------    -------    --------    -------    ------    ----------------
localhost     0                   0Bytes    111468     0           0          failed    2153.00
172.17.69.1   0                   0Bytes    97354      0           20149      failed    2162.00
volume rebalance: sixtydrive: success:
Version-Release number of selected component (if applicable):
[root@g60ds-2 ~]# gluster --version
glusterfs 3.6.0.3 built on May 17 2014 10:49:46
How reproducible:
Always.
Steps to Reproduce:
1. Create a large amount of data.
2. Remove a brick (this migrates data).
3. Add a brick and rebalance.
Actual results:
glusterfs rebalance process crashes.
Additional info:
Back trace:
(gdb) bt
#0 0x00007f6e0ef0222f in dht_layout_entry_cmp_volname (layout=0x7f6e04023ec0, i=0, j=<value optimized out>) at dht-layout.c:434
#1 0x00007f6e0ef0228d in dht_layout_sort_volname (layout=0x7f6e04023ec0) at dht-layout.c:506
#2 0x00007f6e0ef0b48b in dht_fix_layout_of_directory (frame=0x7f6e1c90c5ec, loc=0x7f6e0df97800, layout=0x14006d0) at dht-selfheal.c:776
#3 0x00007f6e0ef0cd59 in dht_fix_directory_layout (frame=<value optimized out>, dir_cbk=<value optimized out>, layout=0x14006d0)
at dht-selfheal.c:915
#4 0x00007f6e0ef1ed82 in dht_setxattr (frame=0x7f6e1c90c5ec, this=0x13dded0, loc=0x7f6e0b386000, xattr=0x7f6e1c3060b4, flags=0,
xdata=0x0) at dht-common.c:2621
#5 0x00007f6e1db10761 in syncop_setxattr (subvol=0x13dded0, loc=0x7f6e0b386000, dict=0x7f6e1c3060b4, flags=0) at syncop.c:1314
#6 0x00007f6e0ef07ad1 in gf_defrag_fix_layout (this=0x13dded0, defrag=0x13ff8e0, loc=0x7f6e0b386220, fix_layout=0x7f6e1c3060b4,
migrate_data=0x7f6e1c306140) at dht-rebalance.c:1575
#7 0x00007f6e0ef07af5 in gf_defrag_fix_layout (this=0x13dded0, defrag=0x13ff8e0, loc=0x7f6e0b386440, fix_layout=0x7f6e1c3060b4,
migrate_data=0x7f6e1c306140) at dht-rebalance.c:1586
#8 0x00007f6e0ef07af5 in gf_defrag_fix_layout (this=0x13dded0, defrag=0x13ff8e0, loc=0x7f6e0b386660, fix_layout=0x7f6e1c3060b4,
migrate_data=0x7f6e1c306140) at dht-rebalance.c:1586
#9 0x00007f6e0ef07af5 in gf_defrag_fix_layout (this=0x13dded0, defrag=0x13ff8e0, loc=0x7f6e0b386880, fix_layout=0x7f6e1c3060b4,
migrate_data=0x7f6e1c306140) at dht-rebalance.c:1586
#10 0x00007f6e0ef07af5 in gf_defrag_fix_layout (this=0x13dded0, defrag=0x13ff8e0, loc=0x7f6e0b386aa0, fix_layout=0x7f6e1c3060b4,
migrate_data=0x7f6e1c306140) at dht-rebalance.c:1586
#11 0x00007f6e0ef07af5 in gf_defrag_fix_layout (this=0x13dded0, defrag=0x13ff8e0, loc=0x7f6e0b386cc0, fix_layout=0x7f6e1c3060b4,
migrate_data=0x7f6e1c306140) at dht-rebalance.c:1586
#12 0x00007f6e0ef07af5 in gf_defrag_fix_layout (this=0x13dded0, defrag=0x13ff8e0, loc=0x7f6e0b386ee0, fix_layout=0x7f6e1c3060b4,
migrate_data=0x7f6e1c306140) at dht-rebalance.c:1586
#13 0x00007f6e0ef07af5 in gf_defrag_fix_layout (this=0x13dded0, defrag=0x13ff8e0, loc=0x7f6e0b387100, fix_layout=0x7f6e1c3060b4,
migrate_data=0x7f6e1c306140) at dht-rebalance.c:1586
#14 0x00007f6e0ef07af5 in gf_defrag_fix_layout (this=0x13dded0, defrag=0x13ff8e0, loc=0x7f6e0b387320, fix_layout=0x7f6e1c3060b4,
migrate_data=0x7f6e1c306140) at dht-rebalance.c:1586
#15 0x00007f6e0ef07af5 in gf_defrag_fix_layout (this=0x13dded0, defrag=0x13ff8e0, loc=0x7f6e0b387540, fix_layout=0x7f6e1c3060b4,
migrate_data=0x7f6e1c306140) at dht-rebalance.c:1586
#16 0x00007f6e0ef07af5 in gf_defrag_fix_layout (this=0x13dded0, defrag=0x13ff8e0, loc=0x7f6e0b387760, fix_layout=0x7f6e1c3060b4,
migrate_data=0x7f6e1c306140) at dht-rebalance.c:1586
#17 0x00007f6e0ef07af5 in gf_defrag_fix_layout (this=0x13dded0, defrag=0x13ff8e0, loc=0x7f6e0b387980, fix_layout=0x7f6e1c3060b4,
migrate_data=0x7f6e1c306140) at dht-rebalance.c:1586
#18 0x00007f6e0ef07af5 in gf_defrag_fix_layout (this=0x13dded0, defrag=0x13ff8e0, loc=0x7f6e0b387ba0, fix_layout=0x7f6e1c3060b4,
migrate_data=0x7f6e1c306140) at dht-rebalance.c:1586
#19 0x00007f6e0ef07af5 in gf_defrag_fix_layout (this=0x13dded0, defrag=0x13ff8e0, loc=0x7f6e0b387dc0, fix_layout=0x7f6e1c3060b4,
migrate_data=0x7f6e1c306140) at dht-rebalance.c:1586
#20 0x00007f6e0ef07af5 in gf_defrag_fix_layout (this=0x13dded0, defrag=0x13ff8e0, loc=0x7f6e0b387f60, fix_layout=0x7f6e1c3060b4,
migrate_data=0x7f6e1c306140) at dht-rebalance.c:1586
#21 0x00007f6e0ef08086 in gf_defrag_start_crawl (data=0x13dded0) at dht-rebalance.c:1705
#22 0x00007f6e1db0a5d2 in synctask_wrap (old_task=<value optimized out>) at syncop.c:333
#23 0x0000003558a43bf0 in ?? () from /lib64/libc-2.12.so
#24 0x0000000000000000 in ?? ()
(gdb)
=================
Attached log file and core file.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHEA-2014-1278.html