Created attachment 1668276 [details]
gluster volume get vol1 all

Description of problem:
We have a fairly large glusterfs cluster with 37 billion files and 75TB of used space. All of these files were stored on a distributed-replicate volume across six servers running glusterfs v5.5. At some point we decided to expand the cluster and ordered another six servers. After a smoke test of glusterfs v6.8 on our stage environment (which included expanding a cluster), we upgraded production (op.version still configured to v5.5). Several days later we started expanding the cluster. For a couple of days everything looked fine, but then we found that some files were absent. We stopped the rebalance immediately; at that moment its progress was around 10%. The only application using this storage only adds new files and never deletes them, and the only things that changed in the preceding days were the newly started rebalance and the new version of glusterfs, so the main suspect is glusterfs. After checking the files, we found that about 2% of them were lost. We can restore some of them from a backup made before the upgrade and rebalance, but the rest are lost forever. Unfortunately, there are no relevant errors in the log files and no coredump files on any storage node. How can I help to find the reason for the data loss?

Version-Release number of selected component (if applicable):
6.8

How reproducible:
Unable to reproduce on our stage environment, but it reproduces on production.

Steps to Reproduce (a consolidated command sketch follows the Additional info section below):
1. create a distributed-replicate cluster with replication factor 2
2. mount the volume and copy files to it
3. add new servers to the pool: gluster peer probe ..
4. expand the cluster: gluster volume add-brick my-vol srv6:/br srv7:/br
5. invoke rebalance: gluster volume rebalance my-vol start
6. check that all files exist

Actual results:
Some files disappeared.

Expected results:
All files exist.

Additional info:
~ $ sudo gluster volume info

Volume Name: vol1
Type: Distributed-Replicate
Volume ID: fb35a90e-5174-466e-bb66-39391e8e83b9
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: srv1:/vol3/brick
Brick2: srv2:/vol3/brick
Brick3: srv3:/vol3/brick
Brick4: srv4:/vol3/brick
Brick5: srv5:/vol3/brick
Brick6: srv6:/vol3/brick
Brick7: srv7:/var/lib/gluster-bricks/brick
Brick8: srv8:/var/lib/gluster-bricks/brick
Brick9: srv9:/var/lib/gluster-bricks/brick
Brick10: srv10:/var/lib/gluster-bricks/brick
Brick11: srv11:/var/lib/gluster-bricks/brick
Brick12: srv12:/var/lib/gluster-bricks/brick
Options Reconfigured:
cluster.self-heal-daemon: enable
cluster.rebal-throttle: normal
performance.readdir-ahead: off
transport.address-family: inet6
performance.io-thread-count: 64
nfs.disable: on
performance.io-cache: on
performance.quick-read: off
performance.parallel-readdir: on
performance.client-io-threads: off
features.sdfs: enable
performance.read-ahead: off
client.event-threads: 4
server.event-threads: 32
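For reference, here is a minimal consolidated sketch of the reproduction steps above as shell commands. Hostnames (srv1..srv6), brick paths (/br), the mount point (/mnt/my-vol), and the manifest files are placeholders and do not match our production layout; the manifest diff in step 6 is just one way to verify that no files were lost.

    # Assumes glusterd is already running on every server and the first four
    # servers are already in the trusted pool.

    # 1. Create a distributed-replicate volume with replication factor 2.
    #    Bricks are paired in the order given: srv1/srv2, srv3/srv4.
    #    Note: glusterfs 6.x warns about split-brain risk for replica 2 and
    #    asks for confirmation.
    gluster volume create my-vol replica 2 srv1:/br srv2:/br srv3:/br srv4:/br
    gluster volume start my-vol

    # 2. Mount the volume, copy files onto it, and record a manifest.
    mount -t glusterfs srv1:/my-vol /mnt/my-vol
    cp -a /data/. /mnt/my-vol/
    (cd /mnt/my-vol && find . -type f | sort) > /tmp/manifest.before

    # 3. Add the new servers to the trusted pool.
    gluster peer probe srv5
    gluster peer probe srv6

    # 4. Expand the volume by one replica pair (bricks must be added in
    #    multiples of the replica count).
    gluster volume add-brick my-vol srv5:/br srv6:/br

    # 5. Start the rebalance; in our case it was stopped at ~10% progress.
    gluster volume rebalance my-vol start
    gluster volume rebalance my-vol status

    # 6. Verify that no files disappeared: any diff line starting with '<'
    #    is a file present before the rebalance but missing afterwards.
    (cd /mnt/my-vol && find . -type f | sort) > /tmp/manifest.after
    diff /tmp/manifest.before /tmp/manifest.after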
This bug has been moved to https://github.com/gluster/glusterfs/issues/885 and will be tracked there from now on. Visit the GitHub issue URL for further details.