Bug 1741899
Summary: | The volume of occupied space in the bricks of a GlusterFS volume (3-node replica) differs across nodes and healing does not fix it | ||
---|---|---|---|
Product: | [Community] GlusterFS | Reporter: | Sergey Pleshkov <s.pleshkov> |
Component: | replicate | Assignee: | bugs <bugs> |
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 5 | CC: | amukherj, bugs, kdhananj, pkarampu, ravishankar, s.pleshkov |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2019-10-25 05:05:21 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Sergey Pleshkov
2019-08-16 11:49:02 UTC
Before the replace-brick I had a split-brain event, but after the nodes came back up it was healed automatically.

> gluster volume replace-brick TST lsy-gl-0(1,2,3):/diskForData/tst lsy-gl-0(1,2,3):/diskForTestData/tst commit force
I assume you ran the replace-brick command thrice, once for each brick. Did you wait for heal count to be zero after each replace-brick? If not, you can end up with incomplete heals.
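Waiting for the heal count to reach zero can be scripted rather than eyeballed. A minimal sketch (the helper name is mine, not a Gluster tool) that sums the pending-entry counts from `gluster volume heal <vol> info` output; the demonstration feeds it captured output instead of calling gluster:

```shell
# Hypothetical helper (not a Gluster tool): sum the "Number of entries"
# counts from `gluster volume heal <vol> info` output read on stdin.
count_heal_entries() {
    awk -F': ' '/^Number of entries:/ { total += $2 } END { print total + 0 }'
}

# On a real cluster one could poll until the queue is empty, e.g.:
#   while [ "$(gluster volume heal TST info | count_heal_entries)" -ne 0 ]; do
#       sleep 60
#   done

# Demonstration on captured heal-info output:
sample='Brick lsy-gl-01:/diskForTestData/tst
Status: Connected
Number of entries: 3
Brick lsy-gl-02:/diskForTestData/tst
Status: Connected
Number of entries: 0'
printf '%s\n' "$sample" | count_heal_entries    # prints 3
```

Only when this total stays at zero (ideally across a few polls, since entries can appear in batches) is it reasonably safe to issue the next replace-brick.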
Hello. The replace-brick commands were executed sequentially on all nodes, with a 12-24 hour pause between them. The heal count was zero every time.

Could you check if there is actual missing data on lsy-gl-03? You might need to compute the checksum of each brick individually. https://github.com/gluster/glusterfs/blob/master/tests/utils/arequal-checksum.c can be used for that.

# gcc tests/utils/arequal-checksum.c -o arequal-checksum

On each brick:

# ./arequal-checksum -p /diskForTestData/tst -i .glusterfs

(See ./arequal-checksum --help for details.)

[root@LSY-GL-03 host]# ./arequal-checksum -p /diskForTestData/tst -i .glusterfs

Entry counts
Regular files   : 359953
Directories     : 13244
Symbolic links  : 511
Other           : 0
Total           : 373708

Metadata checksums
Regular files   : 800d132fc8dbd2d3
Directories     : 2a067038668ee0
Symbolic links  : 9edfcc852
Other           : 3e9

Checksums
Regular files   : 523a264a8cb047533c6d72eee606bf2
Directories     : 4f697d5629707031
Symbolic links  : 173f1e2800747538
Other           : 0
Total           : 9aa921a4bd429a8

[root@LSY-GL-02 host]# ./arequal-checksum -p /diskForTestData/tst -i .glusterfs

Entry counts
Regular files   : 359215
Directories     : 13244
Symbolic links  : 511
Other           : 0
Total           : 372970

Metadata checksums
Regular files   : 8098f54e92802273
Directories     : 2a067038668ee0
Symbolic links  : 9edfcc852
Other           : 3e9

Checksums
Regular files   : d992a16c2b695ebaef21668a320a96ac
Directories     : 52134d6004145c08
Symbolic links  : 173f1e2800747538
Other           : 0
Total           : 739f94ae1d03e126

[root@LSY-GL-01 host]# ./arequal-checksum -p /diskForTestData/tst -i .glusterfs

Entry counts
Regular files   : 359215
Directories     : 13244
Symbolic links  : 511
Other           : 0
Total           : 372970

Metadata checksums
Regular files   : 812d17da8db2d6f3
Directories     : 2a067038668ee0
Symbolic links  : 9edfcc852
Other           : 3e9

Checksums
Regular files   : b980694e409c76a1df19442db9576bc1
Directories     : 26433d161d1e130e
Symbolic links  : 173f1e2800747538
Other           : 0
Total           : 57e50e5de4a17b56

[root@LSY-GL-03 host]# gluster volume heal TST info
Brick lsy-gl-01:/diskForTestData/tst
Status: Connected
Number of entries: 0

Brick lsy-gl-02:/diskForTestData/tst
Status: Connected
Number of entries: 0

Brick lsy-gl-03:/diskForTestData/tst
Status: Connected
Number of entries: 0

I compared the folder contents and found this strange anomaly in the sizes of folders and files on the bricks.

lsy-gl-02:

/diskForTestData/tst/.shard/.remove_me:
total 48K
0    .
48K  ..
0    1b69424e-47ca-44b9-b475-9f073956fd10
0    5459b172-600a-4464-8fcd-8e987a62fb37

/diskForTestData/tst/smb_conf:
total 44K
4.0K .
0    ..
8.0K failover-dns.conf
4.0K ganesha.conf
4.0K krb5.conf
4.0K mnt-gvol.mount
4.0K mnt-prod.mount
4.0K mnt-tst.mount
0    mount-restart-scripts
4.0K resolv.conf
4.0K resolv.dnsmasq
4.0K smb.conf
0    user.map

lsy-gl-03:

/diskForTestData/tst/.shard/.remove_me:
total 32K
0    .
32K  ..
0    1b69424e-47ca-44b9-b475-9f073956fd10
0    5459b172-600a-4464-8fcd-8e987a62fb37

/diskForTestData/tst/smb_conf:
total 80K
4.0K .
0    ..
8.0K failover-dns.conf
8.0K ganesha.conf
8.0K krb5.conf
8.0K mnt-gvol.mount
8.0K mnt-prod.mount
8.0K mnt-tst.mount
0    mount-restart-scripts
8.0K resolv.conf
8.0K resolv.dnsmasq
8.0K smb.conf
4.0K user.map

I also found differences in the .glusterfs folder, like this:

lsy-gl-02:

/diskForTestData/tst/.glusterfs/e5/25:
total 80K
0    .
12K  ..
44K  e5250ec5-b28e-4015-a3b3-8c9287b961ef
8.0K e525238c-3ee1-4581-941f-29b50a2159f9
8.0K e5254136-413b-4008-aa2a-871e22fd0e89
8.0K e5257805-e240-401a-a71b-c39718095b9a

lsy-gl-03:

/diskForTestData/tst/.glusterfs/e5/25:
total 65M
0    .
12K  ..
44K  e5250ec5-b28e-4015-a3b3-8c9287b961ef
8.0K e525238c-3ee1-4581-941f-29b50a2159f9
8.0K e5254136-413b-4008-aa2a-871e22fd0e89
8.0K e5257805-e240-401a-a71b-c39718095b9a
65M  e525b876-7fd1-46ba-93fa-293e27db983c

Looks like the discrepancy is due to the number of files (738, to be specific) amongst the bricks. The directories and symlinks and their checksums match on all 3 bricks. The only fix I can think of is to find out (manually) which are the files that differ in size and forcefully trigger a heal on them.
You could go through the "Hack: How to trigger heal on *any* file/directory" section of my blog post: https://ravispeaks.wordpress.com/2019/05/14/gluster-afr-the-complete-guide-part-3/

Also note that sharding is currently supported only for the single-writer use case, typically as a backing store for oVirt (https://github.com/gluster/glusterfs/issues/290).

(In reply to Ravishankar N from comment #8)
> Looks like the discrepancy is due to the no. of files (738 to be specific)
> amongst the bricks. The directories and symlinks and their checksums match
> on all 3 bricks. The only fix I can think of is to find out (manually) which
> are the files that differ in size and forcefully trigger a heal on them. You
> could go through "Hack: How to trigger heal on *any* file/directory" section
> of my blog-post
> https://ravispeaks.wordpress.com/2019/05/14/gluster-afr-the-complete-guide-part-3/

Hello

Is there any proven way to compare files/folders on two nodes of a Gluster cluster to find the files that differ? I tried using "rsync -rin", but it turned out to be ineffective for comparison (it selects all files indiscriminately).

(In reply to Sergey Pleshkov from comment #10)
> Hello
>
> Is there any proven way to compare files / folders on two nodes of a gluster
> to find different files?

To compare just the directory structure (to find what files are missing), maybe you could run `diff <(ssh root@lsy-gl-01 ls -R /diskForTestData/tst) <(ssh root@lsy-gl-02 ls -R /diskForTestData/tst)` etc. after setting up password-less ssh. You would need to ignore the contents of .glusterfs though.
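The `diff <(ssh …)` idea can be extended to catch same-name files whose sizes differ, not just missing entries, by comparing `find` listings that skip `.glusterfs`. A sketch assuming GNU find; `list_sizes` is a hypothetical helper of mine, demonstrated here on two throwaway local directories rather than real bricks:

```shell
# Hypothetical helper: print every regular file under a brick as
# "relative-path size", skipping the .glusterfs metadata tree, so two
# listings can be diffed to spot size mismatches as well as missing files.
list_sizes() {
    find "$1" -path "$1/.glusterfs" -prune -o -type f -printf '%P %s\n' | sort
}

# On the real nodes (after setting up password-less ssh) this would be
# something like:
#   diff <(list_sizes /diskForTestData/tst) <(ssh root@lsy-gl-03 ...)

# Local demonstration with two toy "bricks":
b1=$(mktemp -d); b2=$(mktemp -d)
printf 'aaaa' > "$b1/f.bin"    # 4 bytes on brick 1
printf 'aa'   > "$b2/f.bin"    # 2 bytes on brick 2
a=$(mktemp); b=$(mktemp)
list_sizes "$b1" > "$a"; list_sizes "$b2" > "$b"
diff "$a" "$b" || true         # shows "f.bin 4" vs "f.bin 2"
rm -rf "$b1" "$b2" "$a" "$b"
```

A line appearing with different sizes on each side is a candidate for a forced heal; a line present on only one side is a missing file.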
Hello

Once again I executed the command to replace the brick on lsy-gl-03 (as a simple way to restore the consistency of the files on lsy-gl-03):

gluster volume replace-brick TST lsy-gl-03:/diskForData/tst lsy-gl-03:/diskForTestData/tst-fix commit force

As far as I know, it must synchronize all files from the live nodes (lsy-gl-01, lsy-gl-02). But as a result, I again got a discrepancy between the actual sizes on disk (df -h):

[root@LSY-GL-02 host]# df -h
Filesystem       Size  Used  Avail  Use%  Mounted on
LSY-GL-02:/TST   500G  115G  385G   23%   /mnt/tst
/dev/sdc1        500G  110G  390G   22%   /diskForTestData

[root@LSY-GL-03 host]# df -h
Filesystem       Size  Used  Avail  Use%  Mounted on
/dev/sdc1        500G  107G  394G   22%   /diskForTestData
LSY-GL-03:/TST   500G  115G  385G   23%   /mnt/tst

I found the differing files by running a diff command (/diskForTestData/tst is a symlink to /diskForTestData/tst-fix):

[root@LSY-GL-02 ~]# diff <(ls -Ra /diskForTestData/tst/lsy-tst/) <(ssh host@lsy-gl-03 sudo ls -Ra /diskForTestData/tst/lsy-tst)
1c1
< /diskForTestData/tst/lsy-tst/:
---
> /diskForTestData/tst/lsy-tst:
357638a357639,357643
> 00b0d046-1e1c-4088-bb67-527513bd432d.1
> 00b0d046-1e1c-4088-bb67-527513bd432d.2
> 00b0d046-1e1c-4088-bb67-527513bd432d.3
> 00b0d046-1e1c-4088-bb67-527513bd432d.4
> 00b0d046-1e1c-4088-bb67-527513bd432d.5
357644a357650,357652
> 0339fa08-fb52-4f9f-bbc1-998a88bad3a9.1
> 0339fa08-fb52-4f9f-bbc1-998a88bad3a9.2
> 0339fa08-fb52-4f9f-bbc1-998a88bad3a9.3
357652a357661,357663
.....

I also found the reason why the arequal-checksum command shows many more regular files on lsy-gl-03: it is the folder /diskForTestData/tst/lsy-tst/.shard and the files in it.
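The extra entries in the diff output above are shard files named `<gfid>.<n>`, so the diff itself can be post-processed to tally how many extra shards each base file has on the larger brick. A small sketch (the helper name is mine), shown on a captured slice of that diff output:

```shell
# Hypothetical helper: from diff output lines like "> <gfid>.<n>",
# count the number of extra shards per base-file GFID.
count_extra_shards() {
    sed -n 's/^> \([0-9a-f-]\{36\}\)\.[0-9]\{1,\}$/\1/p' | sort | uniq -c
}

# Demonstration on a slice of the diff output from this report;
# prints one "count GFID" line per base file:
sample='> 00b0d046-1e1c-4088-bb67-527513bd432d.1
> 00b0d046-1e1c-4088-bb67-527513bd432d.2
> 00b0d046-1e1c-4088-bb67-527513bd432d.3
> 0339fa08-fb52-4f9f-bbc1-998a88bad3a9.1'
printf '%s\n' "$sample" | count_extra_shards
```

The GFIDs with the highest counts point at the base files whose shard sets diverge most between the bricks.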
But on lsy-gl-03 it has a size of about 70 GB, while on lsy-gl-01 and lsy-gl-02 it is 58 GB:

[root@LSY-GL-03 .shard]# du -sh /diskForTestData/tst/lsy-tst/.shard/
70G  /diskForTestData/tst/lsy-tst/.shard/
[root@LSY-GL-02 host]# du -sh /diskForTestData/tst/lsy-tst/.shard/
58G  /diskForTestData/tst/lsy-tst/.shard/

I also have a folder /diskForTestData/tst/.shard with identical files (hardlinks, I think).

What should I do in this situation? Copy the .shard files from lsy-gl-03 to lsy-gl-01 and lsy-gl-02? The heal status counts zero entries:

[root@LSY-GL-03 tst]# gluster volume heal TST info
Brick lsy-gl-01:/diskForTestData/tst
Status: Connected
Number of entries: 0

Brick lsy-gl-02:/diskForTestData/tst
Status: Connected
Number of entries: 0

Brick lsy-gl-03:/diskForTestData/tst-fix
Status: Connected
Number of entries: 0

Adding the sharding maintainer Krutika to the bug for any possible advice on comment #12.

(In reply to Ravishankar N from comment #13)
> Adding the Sharding maintainer Krutika to the bug for any possible advice on
> comment #12.

Copying the shards across bricks from the backend is not a good idea. A parallel operation on the file while the copy is going on can lead to inconsistencies.

Ravi, it seems like the main issue is replication inconsistency after a replace-brick. Any heal-related errors in the logs? I see cluster.favorite-child-policy set in volume-info. Would it be an issue here?

(As an aside, network.ping-timeout is set to 5s and that's really low. @Sergey, you should probably set it to a higher value, say 30s or more.)

-Krutika

cluster.favorite-child-policy should not cause any problems w.r.t. missing files. Perhaps Sergey can check for errors in glustershd.log. FWIW, I did try out a replace-brick with the volume options being the same as this one (and having files > shard size) and the heals were successful. This was on glusterfs-5.5.

Errors for what period?
Since the last brick replacement on lsy-gl-03 (August 28) there have been no errors in glfsheal-TST.log, only informational messages.

Prior to this, all bricks were replaced sequentially (moved to a separate disk on each node); there were likewise no heal errors in glfsheal-TST.log on any node. After that, the problem with the size of the raw data was observed.

In glustershd.log on lsy-gl-03 there are error messages, but not about the TST volume. On lsy-gl-01 and lsy-gl-02, glustershd.log contains many informational messages about self-heal operations from when the replace-brick process was running.

Hello

Are there any other suggestions or tips for this situation? Re-create the TST volume and copy all the data? Find all the file names that are broken into shards and run a forced heal on them?

Also, why do the sizes of the .shard folders differ between nodes?

By the way, how much overhead should Gluster files take in a volume? Right now I see a difference of 5 GB between nodes 1 and 2. Is this normal, or should it be less?
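To answer Ravi's "errors for what period?" question systematically, the relevant lines can be pulled out of glustershd.log by date and severity: Gluster log lines begin with a bracketed timestamp followed by a severity letter (E for error, W for warning, I for informational). A sketch with a made-up helper, demonstrated on fabricated sample lines rather than the real log:

```shell
# Hypothetical helper: keep only warning/error lines from a Gluster log
# for a given date. $1 = log file, $2 = date prefix (e.g. 2019-08-28).
grep_heal_errors() {
    awk -v d="$2" '$0 ~ ("^\\[" d) && (/ E / || / W /)' "$1"
}

# Demonstration on fabricated sample lines (not from the real log):
f=$(mktemp)
cat > "$f" <<'EOF'
[2019-08-28 10:15:02.1] E [MSGID: 108006] 0-TST-replicate-0: sample error
[2019-08-28 10:16:00.2] I [MSGID: 108026] sample informational line
[2019-08-29 09:00:00.3] E [MSGID: 108006] error from a later day
EOF
grep_heal_errors "$f" 2019-08-28   # prints only the first line
rm -f "$f"
```

Running this per node for the dates around each replace-brick would narrow down whether any heal actually failed during the window when the discrepancy appeared.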