Bug 1811284 - Data loss during rebalance
Summary: Data loss during rebalance
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: 6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-07 07:42 UTC by Pavel Znamensky
Modified: 2020-03-12 12:22 UTC
CC: 2 users

Fixed In Version:
Doc Type: ---
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-12 12:22:29 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
gluster volume get vol1 all (24.21 KB, text/plain)
2020-03-07 07:42 UTC, Pavel Znamensky

Description Pavel Znamensky 2020-03-07 07:42:53 UTC
Created attachment 1668276 [details]
gluster volume get vol1 all

Description of problem:
We have a fairly large GlusterFS cluster with 37 billion files and 75 TB of used space.
All these files were stored on a distributed-replicated volume across six servers running GlusterFS v5.5.
At some point we decided to expand the cluster and ordered another six servers.
After a smoke test of GlusterFS v6.8 on our staging environment (which included expanding a cluster), we upgraded production (op.version still configured to the v5.5 level). After several days we started expanding the cluster, and for a couple of days everything looked fine, but then we found out that some files were missing. We stopped the rebalance immediately; at that point progress was around 10%.
The only application that uses this storage only adds new files and never deletes them, and the only things that changed in the last few days were the rebalance we started and the new GlusterFS version. So the main suspect is GlusterFS.
After checking the files, we found that about 2% of them were lost. We can restore some of them from a backup made before the upgrade and rebalance, but some are lost forever.
Unfortunately, there are no relevant errors in the log files and no core dump files on any storage node.
How can I help to find the reason for this data loss?
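
To quantify the loss, here is a minimal sketch of one way to enumerate the missing paths, assuming a file listing taken before the rebalance is available; the mount point, server name, and manifest path below are illustrative, not taken from this report:

# Hedged sketch: compare a pre-rebalance file listing against the current mount.
# Mount point, server name, and manifest path are placeholders.
mount -t glusterfs srv1:/vol1 /mnt/vol1
find /mnt/vol1 -type f | sort > /tmp/files-now.txt
sort /backup/files-before-rebalance.txt > /tmp/files-before.txt
# paths present before the rebalance but absent now
comm -23 /tmp/files-before.txt /tmp/files-now.txt > /tmp/missing.txt
wc -l /tmp/missing.txt

For a volume with this many files, running the comparison per directory subtree is likely more practical than a single full listing.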

Version-Release number of selected component (if applicable):
6.8

How reproducible:
Unable to reproduce on our staging environment, but it reproduces in production.

Steps to Reproduce:
1. create a distributed-replicated cluster with replication factor 2
2. mount the volume and copy files to it
3. add new servers to the pool: gluster peer probe ...
4. expand the cluster: gluster volume add-brick my-vol srv6:/br srv7:/br
5. invoke rebalance: gluster volume rebalance my-vol start (see the consolidated sketch after this list)
6. check that all files still exist
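
A consolidated, hedged sketch of the same sequence on the command line; volume, server, and brick names are illustrative, and the status/stop commands are how we monitored and then interrupted the rebalance:

# sketch of the reproduction sequence; names are placeholders
gluster peer probe srv6
gluster peer probe srv7
gluster volume add-brick my-vol srv6:/br srv7:/br
gluster volume rebalance my-vol start
gluster volume rebalance my-vol status   # monitor progress
# gluster volume rebalance my-vol stop   # interrupt if files go missing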

Actual results:
some files disappeared

Expected results:
all files exist

Additional info:

~ $ sudo gluster volume info
 
Volume Name: vol1
Type: Distributed-Replicate
Volume ID: fb35a90e-5174-466e-bb66-39391e8e83b9
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: srv1:/vol3/brick
Brick2: srv2:/vol3/brick
Brick3: srv3:/vol3/brick
Brick4: srv4:/vol3/brick
Brick5: srv5:/vol3/brick
Brick6: srv6:/vol3/brick
Brick7: srv7:/var/lib/gluster-bricks/brick
Brick8: srv8:/var/lib/gluster-bricks/brick
Brick9: srv9:/var/lib/gluster-bricks/brick
Brick10: srv10:/var/lib/gluster-bricks/brick
Brick11: srv11:/var/lib/gluster-bricks/brick
Brick12: srv12:/var/lib/gluster-bricks/brick
Options Reconfigured:
cluster.self-heal-daemon: enable
cluster.rebal-throttle: normal
performance.readdir-ahead: off
transport.address-family: inet6
performance.io-thread-count: 64
nfs.disable: on
performance.io-cache: on
performance.quick-read: off
performance.parallel-readdir: on
performance.client-io-threads: off
features.sdfs: enable
performance.read-ahead: off
client.event-threads: 4
server.event-threads: 32
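
Since the description notes that op.version was still configured at the v5.5 level after the upgrade, the cluster op-version can be confirmed (and, once every node runs the new version, raised) with the standard gluster CLI; the target value below is a placeholder:

gluster volume get all cluster.op-version        # show the current cluster op-version
# gluster volume set all cluster.op-version <N>  # raise only after all nodes are upgraded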

Comment 1 Worker Ant 2020-03-12 12:22:29 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/885 and will be tracked there from now on. Visit the GitHub issue for further details.

