Description of problem:
Hi, I have a GlusterFS distributed dispersed volume (20 x (4 + 2)) across 9 production servers and tried to remove a subvolume (disperse-12). I used the command

gluster volume remove-brick OMICS $bricks

When I started the rebalance, all bricks of the volume were somewhere between 60 and 70% full. When I returned to the office on Monday, the bricks of the subvolume I wanted to remove were down to 7%. All other subvolumes located on the same servers as disperse-12 had dropped too, some of them down to 33%. The bricks on three other servers were hitting 90%, and the bricks on the last three servers are still at about 65%, give or take a bit.

The status of the remove-brick command showed ~50,000 failures per server, and I found a LOT of log entries like this:

[2020-02-17 10:02:47.971011] I [dht-rebalance.c:1589:dht_migrate_file] 0-OMICS-dht: $FILE: attempting to move from OMICS-disperse-0 to OMICS-disperse-10
[2020-02-17 10:02:47.997915] W [MSGID: 0] [dht-rebalance.c:1026:__dht_check_free_space] 0-OMICS-dht: Write will cross min-free-disk for file - $FILE on subvol - OMICS-disperse-10. Looking for new subvol
[2020-02-17 10:02:47.997970] I [MSGID: 0] [dht-rebalance.c:1082:__dht_check_free_space] 0-OMICS-dht: new target found - OMICS-disperse-1 for file - $FILE
[2020-02-17 10:02:48.192873] I [MSGID: 0] [dht-rebalance.c:1788:dht_migrate_file] 0-OMICS-dht: destination for file - $FILE is changed to - OMICS-disperse-1
[2020-02-17 10:02:48.407606] E [MSGID: 109023] [dht-rebalance.c:2055:dht_migrate_file] 0-OMICS-dht: failed to set xattr on $FILE in OMICS-disperse-10 [Operation not supported]
[2020-02-17 10:02:48.414374] E [MSGID: 109023] [dht-rebalance.c:2874:gf_defrag_migrate_single_file] 0-OMICS-dht: migrate-data failed for $FILE [Operation not supported]

disperse-10 is one of the subvolumes that had hit 90%.

If I look for $FILE on the bricks of the first server, I find copies on both subvolume disperse-0 and subvolume disperse-1, and those on disperse-1 look weird (bricks 0100 and 0101 belong to subvolume disperse-0, bricks 0102 and 0103 are part of subvolume disperse-1):

# ls -lah $BRICKS/$FILE
-rw-r--r-- 2 $USER $GROUP 3.5K Feb 13 07:47 $BRICK0100/$FILE
-rw-r--r-- 2 $USER $GROUP 3.5K Feb 13 07:47 $BRICK0101/$FILE
-rw-r--r-- 2 $USER $GROUP    0 Feb 17 11:02 $BRICK0102/$FILE
-rw-r--r-- 2 $USER $GROUP    0 Feb 17 11:02 $BRICK0103/$FILE

The copies on bricks 0102 and 0103 look broken to me: the files have no size and no content, but also the wrong permissions for a link file. When I look at the files from the client side, some still have content, but some are empty and report a file size of 0.

Version-Release number of selected component (if applicable):
GlusterFS 6.6-1 (installed via the Gluster deb mirror on Debian Stretch)

How reproducible:
Didn't dare to try again.

Actual results:
I suddenly have files that have lost their content and bricks that are hitting 90% while others have plenty of room.

Expected results:
All files should remain the way they were.

Additional info:
More of a question, really: is it possible to restore the file content?
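For what it's worth, a genuine DHT link file is normally mode ---------T and carries a trusted.glusterfs.dht.linkto xattr naming the subvolume that holds the actual data, so comparing the xattrs of a good copy and an empty copy directly on the bricks might show what these stray files really are. This is only a sketch of what I would run; the brick and file paths are the same placeholders as in the listing above:

# dump all extended attributes of the copy that still has data (run as root on the brick host)
getfattr -d -m . -e hex $BRICK0100/$FILE
# and of the suspicious empty copy on the disperse-1 brick
getfattr -d -m . -e hex $BRICK0102/$FILE

On the disperse bricks the healthy copy should presumably also show trusted.ec.* entries (size/version), while a half-migrated file might not.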
I've monitored those files for a while and they actually seem to oscillate: sometimes they have content, sometimes they don't. I suspect it depends on whether the empty copy or the one with content is found first when looking up the file. Could it be sufficient to delete the empty copy on the brick side (including the corresponding GFID file in .glusterfs) to restore my files?
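To make that concrete (purely a sketch of what I have in mind, not something I have tried yet): the GFID hard link of a regular file lives under .glusterfs/<first two hex digits of the GFID>/<next two>/<full GFID> on the same brick, and the GFID itself can be read from the trusted.gfid xattr. The GFID value below is made up for illustration:

# read the GFID of the empty copy directly on the brick
getfattr -n trusted.gfid -e hex $BRICK0102/$FILE
# example (made-up) output:
#   trusted.gfid=0xaabbccdd11223344556677889900aabb
# the matching hard link would then sit at
#   $BRICK0102/.glusterfs/aa/bb/aabbccdd-1122-3344-5566-77889900aabb
# and should be the same inode as the file itself (hence the link count of 2 in the listing above)
ls -li $BRICK0102/$FILE $BRICK0102/.glusterfs/aa/bb/aabbccdd-1122-3344-5566-77889900aabb

So the question boils down to whether removing both the file and that hard link on all bricks of disperse-1 is safe, or whether it would confuse the rebalance/DHT layout even further.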
This bug has been moved to https://github.com/gluster/glusterfs/issues/879 and will be tracked there from now on. Visit the GitHub issue URL for further details.