Description of problem:
Hi, I have a GlusterFS distributed dispersed volume (20 x (4 + 2)) across 9 production servers and tried to remove a subvolume (disperse-12). I used the command

gluster volume remove-brick OMICS $bricks

When I started the rebalance, all bricks of the volume were somewhere between 60 and 70% full. When I returned to the office on Monday, the bricks of the subvolume I wanted to remove were down to 7%. All other subvolumes located on the same servers as disperse-12 had dropped too, some of them down to 33%. The bricks on three other servers were hitting 90%, and the bricks on the last three servers are still at about 65%, give or take a bit.

The status of the remove-brick command showed ~50,000 failures per server, and I found a LOT of log entries like this:

[2020-02-17 10:02:47.971011] I [dht-rebalance.c:1589:dht_migrate_file] 0-OMICS-dht: $FILE: attempting to move from OMICS-disperse-0 to OMICS-disperse-10
[2020-02-17 10:02:47.997915] W [MSGID: 0] [dht-rebalance.c:1026:__dht_check_free_space] 0-OMICS-dht: Write will cross min-free-disk for file - $FILE on subvol - OMICS-disperse-10. Looking for new subvol
[2020-02-17 10:02:47.997970] I [MSGID: 0] [dht-rebalance.c:1082:__dht_check_free_space] 0-OMICS-dht: new target found - OMICS-disperse-1 for file - $FILE
[2020-02-17 10:02:48.192873] I [MSGID: 0] [dht-rebalance.c:1788:dht_migrate_file] 0-OMICS-dht: destination for file - $FILE is changed to - OMICS-disperse-1
[2020-02-17 10:02:48.407606] E [MSGID: 109023] [dht-rebalance.c:2055:dht_migrate_file] 0-OMICS-dht: failed to set xattr on $FILE in OMICS-disperse-10 [Operation not supported]
[2020-02-17 10:02:48.414374] E [MSGID: 109023] [dht-rebalance.c:2874:gf_defrag_migrate_single_file] 0-OMICS-dht: migrate-data failed for $FILE [Operation not supported]

disperse-10 is one of the subvolumes that had hit 90%.

If I look for $FILE on the bricks of the first server, I find copies on both subvolume disperse-0 and subvolume disperse-1, and those on disperse-1 look weird (bricks 0100 and 0101 belong to subvolume disperse-0, bricks 0102 and 0103 are part of subvolume disperse-1):

# ls -lah $BRICKS/$FILE
-rw-r--r-- 2 $USER $GROUP 3.5K Feb 13 07:47 $BRICK0100/$FILE
-rw-r--r-- 2 $USER $GROUP 3.5K Feb 13 07:47 $BRICK0101/$FILE
-rw-r--r-- 2 $USER $GROUP    0 Feb 17 11:02 $BRICK0102/$FILE
-rw-r--r-- 2 $USER $GROUP    0 Feb 17 11:02 $BRICK0103/$FILE

The copies on bricks 0102 and 0103 look broken to me: the files have no size and no content, but also the wrong permissions for a link file. When I look at the files from the client side, some still have content, but some are empty and report a file size of 0.

Version-Release number of selected component (if applicable):
GlusterFS 6.6-1 (installed via the Gluster deb mirror on Debian Stretch)

How reproducible:
Didn't dare to try again.

Actual results:
I suddenly have files that have lost their content and bricks that are hitting 90% while others have plenty of room.

Expected results:
All files should remain the way they were.

Additional info:
More of a question, really: is it possible to restore the file content?
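For what it's worth, a genuine DHT link file is normally mode ---------T and carries a trusted.glusterfs.dht.linkto xattr naming the subvolume that holds the actual data, so comparing the xattrs of a good copy and an empty copy directly on the bricks might show what these stray files really are. This is only a sketch of what I would run; the brick and file paths are the same placeholders as in the listing above:

# dump all extended attributes of the copy that still has data (run as root on the brick host)
getfattr -d -m . -e hex $BRICK0100/$FILE
# and of the suspicious empty copy on the disperse-1 brick
getfattr -d -m . -e hex $BRICK0102/$FILE

On the disperse bricks the healthy copy should presumably also show trusted.ec.* entries (size/version), while a half-migrated file might not.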
I've monitored those files for a while and they actually seem to oscillate: sometimes they have content, sometimes they don't. I suspect it depends on whether the empty copy or the one with content is found first when looking up the file. Could it be sufficient to delete the empty copy on the brick side (including the corresponding GFID file in .glusterfs) to restore my files?
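To make that concrete (purely a sketch of what I have in mind, not something I have tried yet): the GFID hard link of a regular file lives under .glusterfs/<first two hex digits of the GFID>/<next two>/<full GFID> on the same brick, and the GFID itself can be read from the trusted.gfid xattr. The GFID value below is made up for illustration:

# read the GFID of the empty copy directly on the brick
getfattr -n trusted.gfid -e hex $BRICK0102/$FILE
# example (made-up) output:
#   trusted.gfid=0xaabbccdd11223344556677889900aabb
# the matching hard link would then sit at
#   $BRICK0102/.glusterfs/aa/bb/aabbccdd-1122-3344-5566-77889900aabb
# and should be the same inode as the file itself (hence the link count of 2 in the listing above)
ls -li $BRICK0102/$FILE $BRICK0102/.glusterfs/aa/bb/aabbccdd-1122-3344-5566-77889900aabb

So the question boils down to whether removing both the file and that hard link on all bricks of disperse-1 is safe, or whether it would confuse the rebalance/DHT layout even further.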
This bug has been moved to https://github.com/gluster/glusterfs/issues/879 and will be tracked there from now on. Visit the GitHub issue URL for further details.