Description of problem:
=======================
Remove-brick rebalance failed with the below error messages while rm -rf was in progress:

[2016-12-15 10:05:55.958248] E [dht-rebalance.c:3225:gf_defrag_fix_layout] 0-ecvol-dht: Failed to open dir /tar/linux-4.4.1/drivers/base/power
[2016-12-15 10:05:55.958284] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-ecvol-dht: Fix layout failed for /tar/linux-4.4.1/drivers/base/power
[2016-12-15 10:05:55.959295] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-ecvol-dht: Fix layout failed for /tar/linux-4.4.1/drivers/base
[2016-12-15 10:05:55.960522] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-ecvol-dht: Fix layout failed for /tar/linux-4.4.1/drivers
[2016-12-15 10:05:55.961263] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-ecvol-dht: Fix layout failed for /tar/linux-4.4.1
[2016-12-15 10:05:55.962100] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-ecvol-dht: Fix layout failed for /tar

Version-Release number of selected component (if applicable):
3.8.4-8.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
===================
1) Create a Distributed-Disperse volume and start it.
2) FUSE mount the volume on multiple clients.
3) Create files and hardlinks, and untar the Linux kernel on the mount.
4) From a client, open two terminal sessions and run rm -rf * in each.
5) While rm -rf is in progress, remove a few bricks and check the status (see the hedged CLI sketch after this comment).

At some point, the remove-brick rebalance will fail with the above errors.

Actual results:
===============
Remove-brick rebalance failed while files were being removed on the mount.

Expected results:
=================
Rebalance should complete successfully.
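For reference, steps 1 and 5 map roughly onto the following CLI commands. This is a hedged sketch: the volume name matches the "ecvol" in the logs, but the hosts, brick paths, and disperse/redundancy counts are hypothetical.

    # 2 x (4+2) distributed-disperse volume; each brace expansion forms one disperse subvolume
    gluster volume create ecvol disperse 6 redundancy 2 \
        server{1..6}:/bricks/b1/ecvol server{1..6}:/bricks/b2/ecvol
    gluster volume start ecvol
    mount -t glusterfs server1:/ecvol /mnt/ecvol

    # while rm -rf is running on the mount, remove one disperse subvolume
    gluster volume remove-brick ecvol server{1..6}:/bricks/b2/ecvol start
    gluster volume remove-brick ecvol server{1..6}:/bricks/b2/ecvol status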
RCA: Most likely, the rmdir removed the directory entry before the opendir, and so the opendir failed.

Patch: http://review.gluster.org/#/c/15846/ should fix this test case.
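To make the race concrete, here is a minimal illustrative sketch of the idea, not the actual patch: it uses plain libc opendir rather than the xlator's syncop-based open, and the helper name is hypothetical. A directory that a concurrent rm -rf has already removed should be skipped rather than counted as a rebalance failure.

    #include <dirent.h>
    #include <errno.h>
    #include <stddef.h>

    /* Hypothetical helper, for illustration only. */
    static int
    open_dir_tolerating_removal (const char *path, DIR **out)
    {
            *out = opendir (path);
            if (*out == NULL) {
                    if (errno == ENOENT || errno == ESTALE) {
                            /* rmdir won the race: nothing left to
                             * fix-layout here, so don't flag a failure */
                            return 0;
                    }
                    return -1; /* genuine error */
            }
            return 0;
    }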
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/93573/
This issue got reproduced with the glusterfs-3.8.4-10 build when the same scenario was tried, that is, rm -rf while a remove-brick operation was in progress. Moving to the Assigned state.
We have hit an old bug where "errno" is not constant under a synctask function, since rebalance operates under sync threads (gf_defrag_start_crawl). More info at [1].

The patch written to address the current bug uses errno (the per-thread error context) to detect ENOENT and ESTALE and ignore them for a remove-brick process. But because of [1], at the point where the error is checked, the errno we are reading might belong to a different thread context, because of the way syncenv schedules functions, and we will error out.

[1]: Pasting the details from the patch here: http://review.gluster.org/#/c/6475/

"syncop: Change return value of syncop

Problem:
We found a day-1 bug when the syncop_xxx() infra is used inside a synctask with compilation optimization (CFLAGS -O2).

Detailed explanation of the root cause:
We found the bug in 'gf_defrag_migrate_data' in the rebalance operation. Let's look at the interesting parts of the function:

int
gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc,
                        dict_t *migrate_data)
{
        .....
        /* code section - [ Loop ] */
        while ((ret = syncop_readdirp (this, fd, 131072, offset, NULL,
                                       &entries)) != 0) {
                .....
                /* code section - [ ERRNO-1 ]
                   (errno of readdirp is stored in readdir_operrno by a
                   thread) */

                /* Need to keep track of ENOENT errno, that means, there is
                   no need to send more readdirp() */
                readdir_operrno = errno;
                .....
                /* code section - [ SYNCOP-1 ]
                   (syncop_getxattr is called by a thread) */
                ret = syncop_getxattr (this, &entry_loc, &dict,
                                       GF_XATTR_LINKINFO_KEY);

                /* code section - [ ERRNO-2 ]
                   (checking for failures of syncop_getxattr(). This may not
                   always be executed in the same thread which executed
                   [SYNCOP-1]) */
                if (ret < 0) {
                        if (errno != ENODATA) {
                                loglevel = GF_LOG_ERROR;
                                defrag->total_failures += 1;
        .....
}

The function above could be executed by a thread (t1) till [SYNCOP-1], and the code from [ERRNO-2] can be executed by a different thread (t2), because of the way the syncop infra schedules the tasks.

When the code is compiled with -O2 optimization, this is the assembly code that is generated:

[ERRNO-1]
1165            readdir_operrno = errno;  <<---- errno gets expanded as *(__errno_location())
   0x00007fd149d48b60 <+496>:   callq  0x7fd149d410c0 <address@hidden>
   0x00007fd149d48b72 <+514>:   mov    %rax,0x50(%rsp)  <<------ the address returned by __errno_location() is stored in a special location on the stack for later use
   0x00007fd149d48b77 <+519>:   mov    (%rax),%eax
   0x00007fd149d48b79 <+521>:   mov    %eax,0x78(%rsp)
....
[ERRNO-2]
1281                    if (errno != ENODATA) {
   0x00007fd149d492ae <+2366>:  mov    0x50(%rsp),%rax  <<----- because it already stored the address returned by __errno_location(), it just dereferences that address to get the errno value. BUT THIS CODE NEED NOT BE EXECUTED BY THE SAME THREAD!!!
   0x00007fd149d492b3 <+2371>:  mov    $0x9,%ebp
   0x00007fd149d492b8 <+2376>:  mov    (%rax),%edi
   0x00007fd149d492ba <+2378>:  cmp    $0x3d,%edi

The problem is that the __errno_location() values of t1 and t2 are different. So [ERRNO-2] ends up reading the errno of t1 instead of the errno of t2, even though t2 is executing the [ERRNO-2] code section.

When the code is compiled without any optimization, for [ERRNO-2]:

1281                    if (errno != ENODATA) {
   0x00007fd58e7a326f <+2237>:  callq  0x7fd58e797300 <address@hidden>  <<--- as it calls __errno_location() again, it gets the location from t2, so it works as intended
   0x00007fd58e7a3274 <+2242>:  mov    (%rax),%eax
   0x00007fd58e7a3276 <+2244>:  cmp    $0x3d,%eax
   0x00007fd58e7a3279 <+2247>:  je     0x7fd58e7a32a1 <gf_defrag_migrate_data+2287>

Fix:
Make syncop_xxx() return (-errno) as the return value in case of errors, and all the functions which call syncop_xxx() will need to use (-ret) to figure out the reason for the failure in case of syncop_xxx() failures."

Will be sending out a new patch to address the current bug.
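For clarity, under the return-value convention described in the quoted fix, an error check like [ERRNO-2] stops reading errno altogether. A minimal sketch of the caller side, as a fragment reusing the names quoted above (exact signatures vary between GlusterFS versions):

        ret = syncop_getxattr (this, &entry_loc, &dict,
                               GF_XATTR_LINKINFO_KEY);
        if (ret < 0) {
                /* the error reason travels in the return value itself,
                 * so a thread switch cannot clobber it */
                op_errno = -ret;
                if (op_errno != ENODATA) {
                        loglevel = GF_LOG_ERROR;
                        defrag->total_failures += 1;
                }
        }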
Correction: For the above RCA to be true, there needs to be at least one syncop call between setting the errno and checking it. Since there is no syncop function between the return from gf_defrag_get_entry and the check in gf_defrag_fix_layout, discarding the above RCA.
Byreddy,
Not able to reproduce on my machine as well as on the machine on which it got reproduced.

Would you be able to reproduce it once more?
(In reply to Susant Kumar Palai from comment #12)
> Byreddy,
> Not able to reproduce on my machine as well as on the machine on which it
> got reproduced.
>
> Would you be able to reproduce it once more?

Just want to know: are the sosreports shared from all cluster nodes and the client when the issue was reproduced not helpful for debugging?

Anyway, I will try it once again.
(In reply to Byreddy from comment #14)
> (In reply to Susant Kumar Palai from comment #12)
> > Byreddy,
> > Not able to reproduce on my machine as well as on the machine on which it
> > got reproduced.
> >
> > Would you be able to reproduce it once more?
>
> Just want to know: are the sosreports shared from all cluster nodes and the
> client when the issue was reproduced not helpful for debugging?
>
> Anyway, I will try it once again.

The issue got reproduced once again. I will share the live setup details for debugging over mail.

Exact steps to reproduce this issue every time:
===============================================
1. Have a 2-node cluster and one client.
2. Create a 2 x 2 volume using bricks from both nodes and FUSE mount it.
3. Under the mount point, create two sub-directories, say /mnt/terminal{1..2}.
4. Start a Linux kernel untar from both sub-directories, that is, /mnt/terminal1 and /mnt/terminal2.
5. After 50% of the untar is complete, add a couple of bricks to convert the volume to 2*3 (see the hedged CLI sketch after this list).
6. Immediately remove the bricks added in step 5. // this will start the rebalance
7. Wait for the untar to complete in both terminal directories; once it is over, issue rm -rf * from each terminal directory.
8. Once rm -rf is complete, check the remove-brick operation status. // you will see the failure
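A hedged sketch of steps 5, 6, and 8 with hypothetical volume, host, and brick names (one more replica pair is added to the replica-2 volume and then immediately removed, which starts the remove-brick rebalance):

    # step 5: add one replica pair once the untar is ~50% done
    gluster volume add-brick repvol replica 2 \
        node1:/bricks/b3/repvol node2:/bricks/b3/repvol

    # step 6: immediately remove the same pair; this starts the rebalance
    gluster volume remove-brick repvol \
        node1:/bricks/b3/repvol node2:/bricks/b3/repvol start

    # step 8: after rm -rf finishes, check for failures
    gluster volume remove-brick repvol \
        node1:/bricks/b3/repvol node2:/bricks/b3/repvol status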
(In reply to Byreddy from comment #14)
> (In reply to Susant Kumar Palai from comment #12)
> > Byreddy,
> > Not able to reproduce on my machine as well as on the machine on which it
> > got reproduced.
> >
> > Would you be able to reproduce it once more?
>
> Just want to know: are the sosreports shared from all cluster nodes and the
> client when the issue was reproduced not helpful for debugging?
>
> Anyway, I will try it once again.

From the code reading, the errno should have been ENOENT, but it got changed somehow. Since errno is a per-thread error context, it cannot be changed by any other thread, and there is no other function in between that could modify its value. Hence, I wanted to make sure the bug is hit every time.
Sent the upstream patch here: http://review.gluster.org/#/c/16328/1
Just for sanity, pasting the logs of the test case I ran:
1. Created 10000 dirs.
2. Ran rm -rf * from a client and started remove-brick.

From the rebalance log it can be seen that readdirp hit the ENOENT error and the caller, gf_defrag_fix_layout, ignored the error and moved on; hence there are no failures for remove-brick.

[2017-01-05 08:13:22.594141] W [MSGID: 109021] [dht-rebalance.c:2481:gf_defrag_get_entry] 0-test1-dht: Readdirp failed. Aborting data migration for directory: /dir/9991 [No such file or directory]
[2017-01-05 08:13:22.594166] W [dht-rebalance.c:2940:gf_defrag_process_dir] 0-test1-dht: Found error from gf_defrag_get_entry
[2017-01-05 08:13:22.595168] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/9992
[2017-01-05 08:13:22.597440] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/9992
[2017-01-05 08:13:22.599712] I [dht-rebalance.c:2980:gf_defrag_process_dir] 0-test1-dht: Migration operation on dir /dir/9992 took 0.00 secs
[2017-01-05 08:13:22.605515] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/9993
[2017-01-05 08:13:22.607549] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/9993
[2017-01-05 08:13:22.609101] I [dht-rebalance.c:2980:gf_defrag_process_dir] 0-test1-dht: Migration operation on dir /dir/9993 took 0.00 secs
[2017-01-05 08:13:22.614320] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/9994
[2017-01-05 08:13:22.618288] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/9994
[2017-01-05 08:13:22.619472] W [MSGID: 109021] [dht-rebalance.c:2789:gf_defrag_process_dir] 0-test1-dht: Migrate data failed: Failed to open dir /dir/9994
[2017-01-05 08:13:22.621032] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/9995
[2017-01-05 08:13:22.625783] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/9995
[2017-01-05 08:13:22.626583] W [MSGID: 109021] [dht-rebalance.c:2789:gf_defrag_process_dir] 0-test1-dht: Migrate data failed: Failed to open dir /dir/9995
[2017-01-05 08:13:22.628072] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/9996
[2017-01-05 08:13:22.632024] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/9996
[2017-01-05 08:13:22.635951] I [dht-rebalance.c:2980:gf_defrag_process_dir] 0-test1-dht: Migration operation on dir /dir/9996 took 0.00 secs
[2017-01-05 08:13:22.637782] W [MSGID: 114031] [client-rpc-fops.c:2638:client3_3_readdirp_cbk] 0-test1-client-1: remote operation failed [No such file or directory]
[2017-01-05 08:13:22.640368] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/9997
[2017-01-05 08:13:22.643291] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/9997
[2017-01-05 08:13:22.648241] I [dht-rebalance.c:2980:gf_defrag_process_dir] 0-test1-dht: Migration operation on dir /dir/9997 took 0.00 secs
[2017-01-05 08:13:22.652646] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/9998
[2017-01-05 08:13:22.655472] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/9998
[2017-01-05 08:13:22.657819] I [dht-rebalance.c:2980:gf_defrag_process_dir] 0-test1-dht: Migration operation on dir /dir/9998 took 0.00 secs
[2017-01-05 08:13:22.660269] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/9999
[2017-01-05 08:13:22.672088] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/9999
[2017-01-05 08:13:22.673253] W [MSGID: 114031] [client-rpc-fops.c:2638:client3_3_readdirp_cbk] 0-test1-client-1: remote operation failed [No such file or directory]
[2017-01-05 08:13:22.673331] W [MSGID: 109021] [dht-rebalance.c:2481:gf_defrag_get_entry] 0-test1-dht: Readdirp failed. Aborting data migration for directory: /dir/9999 [No such file or directory]
[2017-01-05 08:13:22.673402] W [dht-rebalance.c:2940:gf_defrag_process_dir] 0-test1-dht: Found error from gf_defrag_get_entry
[2017-01-05 08:13:22.674645] I [MSGID: 109081] [dht-common.c:4195:dht_setxattr] 0-test1-dht: fixing the layout of /dir/10000
[2017-01-05 08:13:22.682212] I [dht-rebalance.c:2766:gf_defrag_process_dir] 0-test1-dht: migrate data called on /dir/10000
[2017-01-05 08:13:22.684272] I [dht-rebalance.c:2980:gf_defrag_process_dir] 0-test1-dht: Migration operation on dir /dir/10000 took 0.00 secs
[2017-01-05 08:13:22.697989] I [dht-rebalance.c:3952:gf_defrag_start_crawl] 0-DHT: crawling file-system completed
[2017-01-05 08:13:22.698037] I [dht-rebalance.c:2360:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 3
[2017-01-05 08:13:22.698094] I [dht-rebalance.c:2360:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 4
[2017-01-05 08:13:22.699254] I [MSGID: 109028] [dht-rebalance.c:4194:gf_defrag_status_get] 0-test1-dht: Rebalance is completed. Time taken is 93.00 secs
[2017-01-05 08:13:22.699994] I [MSGID: 109028] [dht-rebalance.c:4198:gf_defrag_status_get] 0-test1-dht: Files migrated: 0, size: 0, lookups: 2, failures: 0, skipped: 0
[2017-01-05 08:13:22.700998] W [glusterfsd.c:1329:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x75ba) [0x7f964580b5ba] -->/usr/local/sbin/glusterfs(glusterfs_sigwaiter+0xdc) [0x409994] -->/usr/local/sbin/glusterfs(cleanup_and_exit+0x77) [0x407ef4] ) 0-: received signum (15), shutting down
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/94359/
BZ added to erratum https://errata.devel.redhat.com/advisory/24866

Moving to ON_QA.
Verified this BZ on glusterfs version 3.8.4-11.el7rhgs.x86_64.

Test-1:
1) Create a Distributed-Disperse volume and start it.
2) FUSE mount the volume on multiple clients.
3) Create files and hardlinks, and untar the Linux kernel on the mount.
4) From a client, open two terminal sessions and run rm -rf * in each.
5) While rm -rf is in progress, remove a few bricks and check the status.

Test-2:
1. Have a 2-node cluster and one client.
2. Create a 2 x 2 volume using bricks from both nodes and FUSE mount it.
3. Under the mount point, create two sub-directories, say /mnt/terminal{1..2}.
4. Start a Linux kernel untar from both sub-directories, that is, /mnt/terminal1 and /mnt/terminal2.
5. After 50% of the untar is complete, add a couple of bricks to convert the volume to 2*3.
6. Immediately remove the bricks added in step 5. // this will start the rebalance
7. Wait for the untar to complete in both terminal directories; once it is over, issue rm -rf * from each terminal directory.
8. Once rm -rf is complete, check the remove-brick operation status.

Before the fix, when the above tests were run, "fix layout failed" errors were seen in the rebalance logs and the remove-brick rebalance failed on one of the nodes. After the fix, we no longer see these errors and the remove-brick rebalance completed successfully on all the nodes.

Moving this BZ to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html