Description of problem:
=======================
Remove-brick rebalance failed with the below error messages while rm -rf was in progress:

[2016-12-15 10:05:55.958248] E [dht-rebalance.c:3225:gf_defrag_fix_layout] 0-ecvol-dht: Failed to open dir /tar/linux-4.4.1/drivers/base/power
[2016-12-15 10:05:55.958284] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-ecvol-dht: Fix layout failed for /tar/linux-4.4.1/drivers/base/power
[2016-12-15 10:05:55.959295] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-ecvol-dht: Fix layout failed for /tar/linux-4.4.1/drivers/base
[2016-12-15 10:05:55.960522] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-ecvol-dht: Fix layout failed for /tar/linux-4.4.1/drivers
[2016-12-15 10:05:55.961263] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-ecvol-dht: Fix layout failed for /tar/linux-4.4.1
[2016-12-15 10:05:55.962100] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-ecvol-dht: Fix layout failed for /tar

Version-Release number of selected component (if applicable):
3.8.4-8.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
===================
1) Create a Distributed-Disperse volume and start it.
2) FUSE mount the volume on multiple clients.
3) Create files and hardlinks, and untar the Linux kernel on the mount.
4) From a client, open two terminal sessions and run rm -rf * in each.
5) While rm -rf is in progress, remove a few bricks and check the status (see the hedged CLI sketch after this comment).

At some point, the remove-brick rebalance will fail with the above errors.

Actual results:
===============
Remove-brick rebalance failed while files were being removed on the mount.

Expected results:
=================
Rebalance should complete successfully.
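For reference, steps 1 and 5 map roughly onto the following CLI commands. This is a hedged sketch: the volume name matches the "ecvol" in the logs, but the hosts, brick paths, and disperse/redundancy counts are hypothetical.

    # 2 x (4+2) distributed-disperse volume; each brace expansion forms one disperse subvolume
    gluster volume create ecvol disperse 6 redundancy 2 \
        server{1..6}:/bricks/b1/ecvol server{1..6}:/bricks/b2/ecvol
    gluster volume start ecvol
    mount -t glusterfs server1:/ecvol /mnt/ecvol

    # while rm -rf is running on the mount, remove one disperse subvolume
    gluster volume remove-brick ecvol server{1..6}:/bricks/b2/ecvol start
    gluster volume remove-brick ecvol server{1..6}:/bricks/b2/ecvol status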
RCA: Most likely, the rmdir removed the directory entry before the opendir, and so the opendir failed.

Patch: http://review.gluster.org/#/c/15846/ should fix this test case.
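To make the race concrete, here is a minimal illustrative sketch of the idea, not the actual patch: it uses plain libc opendir rather than the xlator's syncop-based open, and the helper name is hypothetical. A directory that a concurrent rm -rf has already removed should be skipped rather than counted as a rebalance failure.

    #include <dirent.h>
    #include <errno.h>
    #include <stddef.h>

    /* Hypothetical helper, for illustration only. */
    static int
    open_dir_tolerating_removal (const char *path, DIR **out)
    {
            *out = opendir (path);
            if (*out == NULL) {
                    if (errno == ENOENT || errno == ESTALE) {
                            /* rmdir won the race: nothing left to
                             * fix-layout here, so don't flag a failure */
                            return 0;
                    }
                    return -1; /* genuine error */
            }
            return 0;
    }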
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/93573/
This issue got reproduced with the glusterfs-3.8.4-10 build when the same scenario was tried, that is, rm -rf while a remove-brick operation was in progress. Moving to the Assigned state.
We have hit an old bug where "errno" is not constant under a synctask function, since rebalance operates under sync threads (gf_defrag_start_crawl). More info at [1].

The patch written to address the current bug uses errno (the per-thread error context) to detect ENOENT and ESTALE and ignore them for a remove-brick process. But because of [1], at the point where the error is checked, the errno we are reading might belong to a different thread context, because of the way syncenv schedules functions, and we will error out.

[1]: Pasting the details from the patch here: http://review.gluster.org/#/c/6475/

"syncop: Change return value of syncop

Problem:
We found a day-1 bug when the syncop_xxx() infra is used inside a synctask with compilation optimization (CFLAGS -O2).

Detailed explanation of the root cause:
We found the bug in 'gf_defrag_migrate_data' in the rebalance operation. Let's look at the interesting parts of the function:

int
gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc,
                        dict_t *migrate_data)
{
        .....
        /* code section - [ Loop ] */
        while ((ret = syncop_readdirp (this, fd, 131072, offset, NULL,
                                       &entries)) != 0) {
                .....
                /* code section - [ ERRNO-1 ]
                   (errno of readdirp is stored in readdir_operrno by a
                   thread) */

                /* Need to keep track of ENOENT errno, that means, there is
                   no need to send more readdirp() */
                readdir_operrno = errno;
                .....
                /* code section - [ SYNCOP-1 ]
                   (syncop_getxattr is called by a thread) */
                ret = syncop_getxattr (this, &entry_loc, &dict,
                                       GF_XATTR_LINKINFO_KEY);

                /* code section - [ ERRNO-2 ]
                   (checking for failures of syncop_getxattr(). This may not
                   always be executed in the same thread which executed
                   [SYNCOP-1]) */
                if (ret < 0) {
                        if (errno != ENODATA) {
                                loglevel = GF_LOG_ERROR;
                                defrag->total_failures += 1;
        .....
}

The function above could be executed by a thread (t1) till [SYNCOP-1], and the code from [ERRNO-2] can be executed by a different thread (t2), because of the way the syncop infra schedules the tasks.

When the code is compiled with -O2 optimization, this is the assembly code that is generated:

[ERRNO-1]
1165            readdir_operrno = errno;  <<---- errno gets expanded as *(__errno_location())
   0x00007fd149d48b60 <+496>:   callq  0x7fd149d410c0 <address@hidden>
   0x00007fd149d48b72 <+514>:   mov    %rax,0x50(%rsp)  <<------ the address returned by __errno_location() is stored in a special location on the stack for later use
   0x00007fd149d48b77 <+519>:   mov    (%rax),%eax
   0x00007fd149d48b79 <+521>:   mov    %eax,0x78(%rsp)
....
[ERRNO-2]
1281                    if (errno != ENODATA) {
   0x00007fd149d492ae <+2366>:  mov    0x50(%rsp),%rax  <<----- because it already stored the address returned by __errno_location(), it just dereferences that address to get the errno value. BUT THIS CODE NEED NOT BE EXECUTED BY THE SAME THREAD!!!
   0x00007fd149d492b3 <+2371>:  mov    $0x9,%ebp
   0x00007fd149d492b8 <+2376>:  mov    (%rax),%edi
   0x00007fd149d492ba <+2378>:  cmp    $0x3d,%edi

The problem is that the __errno_location() values of t1 and t2 are different. So [ERRNO-2] ends up reading the errno of t1 instead of the errno of t2, even though t2 is executing the [ERRNO-2] code section.

When the code is compiled without any optimization, for [ERRNO-2]:

1281                    if (errno != ENODATA) {
   0x00007fd58e7a326f <+2237>:  callq  0x7fd58e797300 <address@hidden>  <<--- as it calls __errno_location() again, it gets the location from t2, so it works as intended
   0x00007fd58e7a3274 <+2242>:  mov    (%rax),%eax
   0x00007fd58e7a3276 <+2244>:  cmp    $0x3d,%eax
   0x00007fd58e7a3279 <+2247>:  je     0x7fd58e7a32a1 <gf_defrag_migrate_data+2287>

Fix:
Make syncop_xxx() return (-errno) as the return value in case of errors, and all the functions which call syncop_xxx() will need to use (-ret) to figure out the reason for the failure in case of syncop_xxx() failures."

Will be sending out a new patch to address the current bug.
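For clarity, under the return-value convention described in the quoted fix, an error check like [ERRNO-2] stops reading errno altogether. A minimal sketch of the caller side, as a fragment reusing the names quoted above (exact signatures vary between GlusterFS versions):

        ret = syncop_getxattr (this, &entry_loc, &dict,
                               GF_XATTR_LINKINFO_KEY);
        if (ret < 0) {
                /* the error reason travels in the return value itself,
                 * so a thread switch cannot clobber it */
                op_errno = -ret;
                if (op_errno != ENODATA) {
                        loglevel = GF_LOG_ERROR;
                        defrag->total_failures += 1;
                }
        }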
Correction: For the above RCA to be true, there needs to be at least one syncop call between setting the errno and checking it. Since there is no syncop function between the return from gf_defrag_get_entry and the check in gf_defrag_fix_layout, discarding the above RCA.
Byreddy,
Not able to reproduce on my machine as well as on the machine on which it got reproduced.

Would you be able to reproduce it once more?
(In reply to Susant Kumar Palai from comment #12)
> Byreddy,
> Not able to reproduce on my machine as well as on the machine on which it
> got reproduced.
>
> Would you be able to reproduce it once more?

Just want to know: are the sosreports shared from all cluster nodes and the client when the issue was reproduced not helpful for debugging?

Anyway, I will try it once again.
(In reply to Byreddy from comment #14)
> (In reply to Susant Kumar Palai from comment #12)
> > Byreddy,
> > Not able to reproduce on my machine as well as on the machine on which it
> > got reproduced.
> >
> > Would you be able to reproduce it once more?
>
> Just want to know: are the sosreports shared from all cluster nodes and the
> client when the issue was reproduced not helpful for debugging?
>
> Anyway, I will try it once again.

The issue got reproduced once again. I will share the live setup details for debugging over mail.

Exact steps to reproduce this issue every time:
===============================================
1. Have a 2-node cluster and one client.
2. Create a 2 x 2 volume using bricks from both nodes and FUSE mount it.
3. Under the mount point, create two sub-directories, say /mnt/terminal{1..2}.
4. Start a Linux kernel untar from both sub-directories, that is, /mnt/terminal1 and /mnt/terminal2.
5. After 50% of the untar is complete, add a couple of bricks to convert the volume to 2*3 (see the hedged CLI sketch after this list).
6. Immediately remove the bricks added in step 5. // this will start the rebalance
7. Wait for the untar to complete in both terminal directories; once it is over, issue rm -rf * from each terminal directory.
8. Once rm -rf is complete, check the remove-brick operation status. // you will see the failure
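A hedged sketch of steps 5, 6, and 8 with hypothetical volume, host, and brick names (one more replica pair is added to the replica-2 volume and then immediately removed, which starts the remove-brick rebalance):

    # step 5: add one replica pair once the untar is ~50% done
    gluster volume add-brick repvol replica 2 \
        node1:/bricks/b3/repvol node2:/bricks/b3/repvol

    # step 6: immediately remove the same pair; this starts the rebalance
    gluster volume remove-brick repvol \
        node1:/bricks/b3/repvol node2:/bricks/b3/repvol start

    # step 8: after rm -rf finishes, check for failures
    gluster volume remove-brick repvol \
        node1:/bricks/b3/repvol node2:/bricks/b3/repvol status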
(In reply to Byreddy from comment #14)
> (In reply to Susant Kumar Palai from comment #12)
> > Byreddy,
> > Not able to reproduce on my machine as well as on the machine on which it
> > got reproduced.
> >
> > Would you be able to reproduce it once more?
>
> Just want to know: are the sosreports shared from all cluster nodes and the
> client when the issue was reproduced not helpful for debugging?
>
> Anyway, I will try it once again.

From the code reading, the errno should have been ENOENT, but it got changed somehow. Since errno is a per-thread error context, it cannot be changed by any other thread, and there is no other function in between that could modify its value. Hence, I wanted to make sure the bug is hit every time.
Sent the upstream patch here: http://review.gluster.org/#/c/16328/1
Just for sanity, pasting the logs of the test case I ran:
1. Created 10000 dirs.
2. Ran rm -rf * from a client and started remove-brick.

From the rebalance log it can be seen that readdirp hit the ENOENT error and the caller, gf_defrag_fix_layout, ignored the error and moved on; hence there are no failures for remove-brick.

[2017-01-05 08:13:22.594141] W [MSGID: 109021] [dht-rebalance.c:2481:gf_defrag_get_entry] 0-test1-dht: Readdirp failed. Aborting data migration for directory: /dir/9991 [No such file or directory]
[2017-01-05 08:13:22.594166] W [dht-rebalance.c:2940:gf_defrag_process_dir] 0-test1-dht: Found error from gf_defrag_get_entry
[2017-01-05 08:13:22.595168] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/9992
[2017-01-05 08:13:22.597440] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/9992
[2017-01-05 08:13:22.599712] I [dht-rebalance.c:2980:gf_defrag_process_dir] 0-test1-dht: Migration operation on dir /dir/9992 took 0.00 secs
[2017-01-05 08:13:22.605515] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/9993
[2017-01-05 08:13:22.607549] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/9993
[2017-01-05 08:13:22.609101] I [dht-rebalance.c:2980:gf_defrag_process_dir] 0-test1-dht: Migration operation on dir /dir/9993 took 0.00 secs
[2017-01-05 08:13:22.614320] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/9994
[2017-01-05 08:13:22.618288] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/9994
[2017-01-05 08:13:22.619472] W [MSGID: 109021] [dht-rebalance.c:2789:gf_defrag_process_dir] 0-test1-dht: Migrate data failed: Failed to open dir /dir/9994
[2017-01-05 08:13:22.621032] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/9995
[2017-01-05 08:13:22.625783] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/9995
[2017-01-05 08:13:22.626583] W [MSGID: 109021] [dht-rebalance.c:2789:gf_defrag_process_dir] 0-test1-dht: Migrate data failed: Failed to open dir /dir/9995
[2017-01-05 08:13:22.628072] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/9996
[2017-01-05 08:13:22.632024] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/9996
[2017-01-05 08:13:22.635951] I [dht-rebalance.c:2980:gf_defrag_process_dir] 0-test1-dht: Migration operation on dir /dir/9996 took 0.00 secs
[2017-01-05 08:13:22.637782] W [MSGID: 114031] [client-rpc-fops.c:2638:client3_3_readdirp_cbk] 0-test1-client-1: remote operation failed [No such file or directory]
[2017-01-05 08:13:22.640368] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/9997
[2017-01-05 08:13:22.643291] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/9997
[2017-01-05 08:13:22.648241] I [dht-rebalance.c:2980:gf_defrag_process_dir] 0-test1-dht: Migration operation on dir /dir/9997 took 0.00 secs
[2017-01-05 08:13:22.652646] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/9998
[2017-01-05 08:13:22.655472] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/9998
[2017-01-05 08:13:22.657819] I [dht-rebalance.c:2980:gf_defrag_process_dir] 0-test1-dht: Migration operation on dir /dir/9998 took 0.00 secs
[2017-01-05 08:13:22.660269] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/9999
[2017-01-05 08:13:22.672088] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/9999
[2017-01-05 08:13:22.673253] W [MSGID: 114031] [client-rpc-fops.c:2638:client3_3_readdirp_cbk] 0-test1-client-1: remote operation failed [No such file or directory]
[2017-01-05 08:13:22.673331] W [MSGID: 109021] [dht-rebalance.c:2481:gf_defrag_get_entry] 0-test1-dht: Readdirp failed. Aborting data migration for directory: /dir/9999 [No such file or directory]
[2017-01-05 08:13:22.673402] W [dht-rebalance.c:2940:gf_defrag_process_dir] 0-test1-dht: Found error from gf_defrag_get_entry
[2017-01-05 08:13:22.674645] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/10000
[2017-01-05 08:13:22.682212] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/10000
[2017-01-05 08:13:22.684272] I [dht-rebalance.c:2980:gf_defrag_process_dir] 0-test1-dht: Migration operation on dir /dir/10000 took 0.00 secs
[2017-01-05 08:13:22.697989] I [dht-rebalance.c:3952:gf_defrag_start_crawl] 0-DHT: crawling file-system completed
[2017-01-05 08:13:22.698037] I [dht-rebalance.c:2360:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 3
[2017-01-05 08:13:22.698094] I [dht-rebalance.c:2360:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 4
[2017-01-05 08:13:22.699254] I [MSGID: 109028] [dht-rebalance.c:4194:gf_defrag_status_get] 0-test1-dht: Rebalance is completed. Time taken is 93.00 secs
[2017-01-05 08:13:22.699994] I [MSGID: 109028] [dht-rebalance.c:4198:gf_defrag_status_get] 0-test1-dht: Files migrated: 0, size: 0, lookups: 2, failures: 0, skipped: 0
[2017-01-05 08:13:22.700998] W [glusterfsd.c:1329:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x75ba) [0x7f964580b5ba] -->/usr/local/sbin/glusterfs(glusterfs_sigwaiter+0xdc) [0x409994] -->/usr/local/sbin/glusterfs(cleanup_and_exit+0x77) [0x407ef4] ) 0-: received signum (15), shutting down
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/94359/
BZ added to erratum https://errata.devel.redhat.com/advisory/24866

Moving to ON_QA.
Verified this BZ on glusterfs version 3.8.4-11.el7rhgs.x86_64.

Test-1:
1) Create a Distributed-Disperse volume and start it.
2) FUSE mount the volume on multiple clients.
3) Create files and hardlinks, and untar the Linux kernel on the mount.
4) From a client, open two terminal sessions and run rm -rf * in each.
5) While rm -rf is in progress, remove a few bricks and check the status.

Test-2:
1. Have a 2-node cluster and one client.
2. Create a 2 x 2 volume using bricks from both nodes and FUSE mount it.
3. Under the mount point, create two sub-directories, say /mnt/terminal{1..2}.
4. Start a Linux kernel untar from both sub-directories, that is, /mnt/terminal1 and /mnt/terminal2.
5. After 50% of the untar is complete, add a couple of bricks to convert the volume to 2*3.
6. Immediately remove the bricks added in step 5. // this will start the rebalance
7. Wait for the untar to complete in both terminal directories; once it is over, issue rm -rf * from each terminal directory.
8. Once rm -rf is complete, check the remove-brick operation status.

Before the fix, when the above tests were run, "fix layout failed" errors were seen in the rebalance logs and the remove-brick rebalance failed on one of the nodes. After the fix, we no longer see these errors and the remove-brick rebalance completed successfully on all the nodes.

Moving this BZ to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html