Bug 1734251 - Files inaccessible if one rebalance process is killed in a multinode volume
Summary: Files inaccessible if one rebalance process is killed in a multinode volume
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: 6
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Barak Sason Rofman
QA Contact:
Depends On: 1711764
Blocks: 1714124
Reported: 2019-07-30 04:56 UTC by Nithya Balachandran
Modified: 2020-03-12 12:56 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1711764
Last Closed: 2020-03-12 12:56:46 UTC
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:


Description Nithya Balachandran 2019-07-30 04:56:58 UTC
+++ This bug was initially created as a clone of Bug #1711764 +++

Description of problem:

This is a consequence of https://review.gluster.org/#/c/glusterfs/+/17239/ and lookup-optimize being enabled.

Rebalance directory processing steps on each node:

1. Set new layout on directory without the commit hash
2. List the files on the local subvol and migrate those that fall into this process's bucket. A lookup is performed on a file only if the process determines that it is the one that should migrate that file.
3. When done, update the layout on the local subvol with the layout containing the commit hash.

When multiple rebalance processes work on the same directory, they finish at different times, so one process can update the layout with the commit hash before the others have finished listing and migrating their files.
Clients then see a complete layout before all files have been looked up according to the new layout, and file access fails.
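
The race can be illustrated with a toy model. This is a sketch, not GlusterFS code: toy_hash, should_i_migrate (a stand-in for gf_defrag_should_i_migrate), Directory, rebalance and client_stat are all hypothetical names, the per-subvol layout is collapsed into a single commit_hash_set flag, and each node is assumed to list every file in the directory.

```python
# Toy model of the race; not GlusterFS code. All names are hypothetical.

def toy_hash(name):
    # Stand-in for DHT's real hash: files are named "f<N>" and hash to N.
    return int(name[1:])

def should_i_migrate(name, node, n_nodes=2):
    # Stand-in for gf_defrag_should_i_migrate: splits files between nodes.
    return (toy_hash(name) // 2) % n_nodes == node

class Directory:
    def __init__(self, names, old_subvols, new_subvols):
        self.new_subvols = new_subvols
        # Subvol currently holding each file (placed under the old layout).
        self.location = {n: toy_hash(n) % old_subvols for n in names}
        # Assumption: one flag stands in for the layout carrying the commit hash.
        self.commit_hash_set = False

    def hashed_subvol(self, name):
        # Subvol the new layout says the file should be on.
        return toy_hash(name) % self.new_subvols

def rebalance(directory, node):
    # Step 2: migrate only the files that fall into this node's bucket.
    for name in directory.location:
        if should_i_migrate(name, node):
            directory.location[name] = directory.hashed_subvol(name)
    # Step 3: publish the layout with the commit hash.
    directory.commit_hash_set = True

def client_stat(directory, name):
    if directory.location[name] == directory.hashed_subvol(name):
        return True
    # With lookup-optimize, a committed layout is trusted, so there is
    # no fallback search on the other subvols and the stat fails.
    return not directory.commit_hash_set

def inaccessible(directory):
    return [n for n in directory.location if not client_stat(directory, n)]

names = [f"f{i}" for i in range(12)]
d = Directory(names, old_subvols=2, new_subvols=3)
rebalance(d, node=1)       # node 1 finishes; node 0's process was killed
print(inaccessible(d))     # files left in node 0's bucket
```

In this model the files that node 0 was supposed to migrate stay on their old subvol, yet the commit hash is already set, so a stat on them fails until node 0's pass runs.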

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create a 2x2 volume spanning 2 nodes. Create some directories and files on it.
2. Add 2 bricks to convert it to a 3x2 volume.
3. Start a rebalance on the volume and break into one rebalance process with gdb before it starts processing the directories.
4. Allow the second rebalance process to complete, then kill the process that is stopped in gdb.
5. Mount the volume and try to stat the files without listing the directories.

Actual results:

The stat fails for several files with the error:

stat: cannot stat ‘<filename>’: No such file or directory

Expected results:

Additional info:

--- Additional comment from Nithya Balachandran on 2019-05-20 05:05:30 UTC ---

The easiest solution is to have each node do the file lookups before the call to gf_defrag_should_i_migrate.

Pros: Simple
Cons: Introduces more lookups, though the total is roughly the same as it was before https://review.gluster.org/#/c/glusterfs/+/17239/
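
A toy sketch of the proposed ordering, to show the trade-off in lookup count. This is not the patch itself: toy_hash, should_i_migrate (a stand-in for gf_defrag_should_i_migrate) and process_directory are hypothetical names, and the model assumes each node lists every file in the directory.

```python
# Toy model of the proposed ordering; not GlusterFS code. All names are hypothetical.

def toy_hash(name):
    # Stand-in for DHT's real hash: files are named "f<N>" and hash to N.
    return int(name[1:])

def should_i_migrate(name, node, n_nodes=2):
    # Stand-in for gf_defrag_should_i_migrate: splits files between nodes.
    return (toy_hash(name) // 2) % n_nodes == node

def process_directory(names, node, lookup_first):
    """Return the names this node looks up while processing a directory."""
    looked_up = []
    for name in names:
        if lookup_first:
            # Proposed: look up every listed file before the bucket check.
            looked_up.append(name)
        if should_i_migrate(name, node):
            if not lookup_first:
                # Current: look up only the files in this node's bucket.
                looked_up.append(name)
            # Migration of `name` would happen here.
    return looked_up

names = [f"f{i}" for i in range(12)]
current = process_directory(names, node=1, lookup_first=False)
proposed = process_directory(names, node=1, lookup_first=True)
print(len(current), len(proposed))
```

Under these assumptions a single surviving process has looked up every file in the directory before it writes the commit hash, at the cost of one lookup per listed file instead of one per migrated file.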

--- Additional comment from Worker Ant on 2019-05-20 10:01:20 UTC ---

REVIEW: https://review.gluster.org/22746 (cluster/dht: Lookup all files when processing directory) posted (#1) for review on master by N Balachandran

Comment 2 Worker Ant 2020-03-12 12:56:46 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/973 and will be tracked there from now on. See the GitHub issue for further details.
