1449167 – After selfheal of brick file size of few files differs

Summary: After selfheal of brick file size of few files differs

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	selfheal
Sub Component:
Version:	3.10
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Ashish Pandey
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-05-09 10:56 UTC by amudhan83
Modified:	2018-06-20 18:29 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2018-06-20 18:29:04 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Description amudhan83 2017-05-09 10:56:18 UTC

Description of problem:
On an EC volume, self-heal started and completed after a brick was replaced, but the healed brick's disk usage differs from that of the other bricks in the same set.

Comparing files between a good brick and the healed brick shows that a few files differ in size on the healed brick.

Version-Release number of selected component (if applicable):
3.10.1

File showing a size difference after the brick heal.
On the healed brick, ls -l and du -h also report different sizes for it.

===========================
File info from Healed brick 
===========================

du -h /media/disk11/brick11/file1
2.2G    /media/disk11/brick11/file1

ls -lh /media/disk11/brick11/file1
-rw-r--r-- 2 root root 3.5G Nov 10 00:03 /media/disk11/brick11/file1

stat /media/disk11/brick11/file1
  File: ‘/media/disk11/brick11/file1’
  Size: 3661745152      Blocks: 4565608    IO Block: 4096   regular file
Device: 8c1h/2241d      Inode: 5931163503  Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-11-09 23:58:07.083459000 +0530
Modify: 2016-11-10 00:03:15.955455000 +0530
Change: 2017-04-23 05:56:33.570068918 +0530
 Birth: -


getfattr -m. -e hex -d /media/disk11/brick11/file1
getfattr: Removing leading '/' from absolute path names
# file: media/disk11/brick11/file1
trusted.bit-rot.signature=0x010500000000000000574ef2ff2bba2798a0451de3d9bca857380c1c36a8ca39fc7fd4e8c85dd4e559
trusted.bit-rot.version=0x050000000000000058ef4cad000c2af5
trusted.ec.config=0x0000080a02000200
trusted.ec.size=0x00000006d20e5937
trusted.ec.version=0x00000000000369080000000000036909
trusted.gfid=0xc1fadd2e84c34e5d825d6431cfb17e48

==========================
File info from good brick 
==========================
 ls -lh /media/disk11/brick11/file1
-rw-r--r-- 2 root root 3.5G Nov 10 00:03 /media/disk11/brick11/file1

 du -h /media/disk11/brick11/file1
3.5G    /media/disk11/brick11/file1

getfattr -m. -e hex -d /media/disk11/brick11/file1
getfattr: Removing leading '/' from absolute path names
# file: media/disk11/brick11/file1
trusted.bit-rot.signature=0x010500000000000000b87cccce67fe51c0c2c224459d3987fe6beb2d674264048bf508d793443a6837
trusted.bit-rot.version=0x050000000000000058e10e9d00056438
trusted.ec.config=0x0000080a02000200
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x00000006d20e5937
trusted.ec.version=0x00000000000369080000000000036909
trusted.gfid=0xc1fadd2e84c34e5d825d6431cfb17e48


How reproducible:

This is the first time I have seen this behaviour in a production environment.

Things I did during the heal process:

1. Read a file that was about to be healed while the heal was in progress.
2. Reading the file from the healing brick was slow, so I killed the healing brick's PID so that the user could download the file. This was done twice, a day apart.
3. To speed up the heal, I tried running "getfattr -h -n trusted.ec.heal 'filename'", but that also took a long time to heal the file, so I stopped it.
4. Besides the brick heal, rebalance fix-layout and the bitrot signer process were running in the cluster.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Ashish Pandey 2017-05-09 12:00:38 UTC

Hi Amudhan,

While the heal is going on, you can see a difference between "du -h" and "ls -h".
That is OK.

Reason -

When heal starts, it truncates the file on the brick to size 0. If I/O is going on, a write starts at a specific offset, and the file size becomes (offset + length).
ls -l reports this size, while du -h reports the blocks actually written to disk. It does not count the zero-filled gap left by the truncate.
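
The gap between the two numbers can be checked directly from the stat output in the description. The following is a minimal Python sketch, not part of the original report. It assumes that st_blocks is counted in 512-byte units (as on Linux) and that ls -lh and du -h round up to one decimal place; the helper names are made up for illustration. It also recreates the effect on a scratch file by truncating to 0 and writing at an offset.

```python
import math
import os
import tempfile


def apparent_and_allocated(path):
    """Return (apparent size, allocated bytes) for a file.

    st_size is what `ls -l` reports; st_blocks counts the 512-byte
    units actually allocated, which is what `du` adds up.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def human_ceil(num_bytes):
    """Size in GiB rounded up to one decimal (assumed ls/du -h rounding)."""
    return math.ceil(num_bytes / 2**30 * 10) / 10


# Figures copied from `stat` on the healed brick in this report.
reported_size = 3661745152      # "Size:"
reported_blocks = 4565608       # "Blocks:"
reported_allocated = reported_blocks * 512
hole_bytes = reported_size - reported_allocated

print(human_ceil(reported_size), "GiB apparent (ls -lh)")
print(human_ceil(reported_allocated), "GiB allocated (du -h)")
print(hole_bytes, "bytes not backed by allocated blocks")

# Recreate the effect: truncate to 0, then write 4 KiB at a 64 MiB offset.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.truncate(0)
    f.seek(64 * 1024 * 1024)
    f.write(b"x" * 4096)
    demo_path = f.name

demo_size, demo_allocated = apparent_and_allocated(demo_path)
os.unlink(demo_path)
print(demo_size, "bytes apparent,", demo_allocated, "bytes allocated")
```

On a filesystem that supports sparse files, the scratch file should report far fewer allocated bytes than its apparent size, which is the same pattern as the 3.5G versus 2.2G seen on the healed brick.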

Comment 2 amudhan83 2017-05-09 12:20:07 UTC

Hi Ashish,

The heal has completed, but it still shows the same difference.

Comment 3 Ashish Pandey 2017-05-09 12:32:58 UTC

I think you also mentioned that you killed a process during the heal:

2. Reading the file from the healing brick was slow, so I killed the healing brick's PID so that the user could download the file. This was done twice, a day apart.

That is why the heal did not complete.
However, it should have restarted once all the bricks were UP again.

I would suggest making sure that all the bricks are UP and then starting the heal.

- Check whether this file is listed in heal info. If yes, just run an index heal and it will be healed.
- If not, run a client-side heal using getfattr.
- If in doubt, and the file is still not being healed even with all the bricks UP, try a full heal.

If possible, perform the above steps while no I/O is going on for that file.

If you are still unable to heal the files, please give us the xattrs of the file from all the bricks, the vol info, and the glustershd and mount logs.
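
When comparing xattrs across bricks, it can help to decode the EC values instead of comparing raw hex. The Python sketch below is an illustration only. It assumes trusted.ec.config packs version, algorithm, word size, brick count, redundancy (one byte each) and a 3-byte chunk size, and that each brick stores one chunk per full stripe. That layout is an assumption, not something stated in this report, and the function names are hypothetical.

```python
def decode_ec_config(hex_value):
    """Split a trusted.ec.config value into fields (assumed layout)."""
    raw = int(hex_value, 16)
    return {
        "version": (raw >> 56) & 0xFF,
        "algorithm": (raw >> 48) & 0xFF,
        "gf_word_size": (raw >> 40) & 0xFF,
        "bricks": (raw >> 32) & 0xFF,
        "redundancy": (raw >> 24) & 0xFF,
        "chunk_size": raw & 0xFFFFFF,
    }


def expected_fragment_size(ec_size_hex, config):
    """Per-brick fragment size implied by trusted.ec.size (assumed rule)."""
    file_size = int(ec_size_hex, 16)
    data_bricks = config["bricks"] - config["redundancy"]
    stripe = data_bricks * config["chunk_size"]
    stripes = -(-file_size // stripe)   # ceiling division
    return stripes * config["chunk_size"]


# Values copied from the getfattr output in the description.
config = decode_ec_config("0x0000080a02000200")
fragment = expected_fragment_size("0x00000006d20e5937", config)

print(config)
print("expected fragment size on each brick:", fragment)
```

Under these assumptions the result is 3661745152 bytes, which equals the Size reported by stat on the healed brick. This only shows that the apparent file size agrees with the EC metadata; it says nothing about whether the data in the unallocated ranges matches the good bricks.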

Comment 4 Shyamsundar 2018-06-20 18:29:04 UTC

This bug is reported against a version of Gluster that is no longer maintained
(or has been EOL'd). See https://www.gluster.org/release-schedule/ for the
versions currently maintained.

As a result this bug is being closed.

If the bug persists on a maintained version of gluster or against the mainline
gluster repository, request that it be reopened and the Version field be marked
appropriately.
