Bug 1428673 - possible repeatedly recursive healing of same file with background heal not happening when IO is going on
Summary: possible repeatedly recursive healing of same file with background heal not ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: disperse
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Ashish Pandey
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1427159 1459392
 
Reported: 2017-03-03 06:23 UTC by Pranith Kumar K
Modified: 2017-09-05 17:25 UTC
CC List: 9 users

Fixed In Version: glusterfs-3.12.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1427159
Cloned To: 1459392
Environment:
Last Closed: 2017-08-21 07:58:45 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments: (none)

Comment 1 Pranith Kumar K 2017-03-03 06:25:38 UTC
Description of problem:
=========================
When a file requires heal and continuous IO is happening on it, the heal never seems to finish: the file is healed again and again until all IO, be it read or write, is stopped. The problem is more serious when IO such as appends is going on.
The problems with this are as below:
1) The same file requires heal many times, hence taking a huge amount of time; in an ideal case a 1GB heal gets over in about 2 min, but here it can take hours.
2) Unnecessary CPU cycles are spent on healing the same file again and again.



1) Started an append to a file on a 2x(4+2) volume from a FUSE client, as below:
 dd if=/dev/urandom bs=1MB count=10000 >>ddfile2
2) getfattr of one of the bricks (all are healthy). Due to eager lock, dirty is set to 1, since the lock is not released while there is no other request:
# file: ddfile2
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.bit-rot.version=0x020000000000000058b3f6720001a78c
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000010000000000000001
trusted.ec.size=0x0000000000000000
trusted.ec.version=0x00000000000000000000000000000000
trusted.gfid=0x5637add23aba4f7a9c3b9535dd6639a3
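
For reference, trusted.ec.dirty and trusted.ec.version each pack two 64-bit counters into one xattr. A minimal bash helper to split them (a sketch; it assumes the data counter is stored first and the metadata counter second, which is consistent with the later dumps, where after the brick comes back only the first counter stays non-zero because metadata heals immediately while data stays dirty):

decode_ec_pair() {
    # usage: decode_ec_pair 0x<32 hex digits>
    local hex=${1#0x}
    echo "data=$((16#${hex:0:16})) metadata=$((16#${hex:16:16}))"
}
decode_ec_pair 0x00000000000000010000000000000001   # -> data=1 metadata=1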

3) File size from one of the clients before bringing b2 down:
[root@dhcp35-45 ecvol]# for i in {1..100};do du -sh ddfile2 ;echo "##########";ls -lh ddfile2 ;sleep 30;done
1.0G	ddfile2
##########
-rw-r--r--. 2 root root 1012M Feb 27 18:46 ddfile2

4) Brought down b2.

5) ec attributes get updated periodically on all healthy bricks:

# file: ddfile2
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.bit-rot.version=0x020000000000000058b3f6720001a78c
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x000000000000000c000000000000000c
trusted.ec.size=0x0000000119ded040
trusted.ec.version=0x00000000000093c800000000000093c8
trusted.gfid=0x5637add23aba4f7a9c3b9535dd6639a3



6) Heal info as below:
Every 2.0s: gluster v heal ecvol info                                                         Mon Feb 27 18:48:13 2017

Brick dhcp35-45.lab.eng.blr.redhat.com:/rhs/brick3/ecvol
/ddfile2
Status: Connected
Number of entries: 1

Brick dhcp35-130.lab.eng.blr.redhat.com:/rhs/brick3/ecvol
Status: Transport endpoint is not connected
Number of entries: -

Brick dhcp35-122.lab.eng.blr.redhat.com:/rhs/brick3/ecvol
/ddfile2
Status: Connected
Number of entries: 1

Brick dhcp35-23.lab.eng.blr.redhat.com:/rhs/brick3/ecvol
/ddfile2
Status: Connected
Number of entries: 1

Brick dhcp35-112.lab.eng.blr.redhat.com:/rhs/brick3/ecvol
/ddfile2
Status: Connected
Number of entries: 1

Brick dhcp35-138.lab.eng.blr.redhat.com:/rhs/brick3/ecvol
/ddfile2
Status: Connected
Number of entries: 1




7) Brought the brick back up. Healthy brick xattrs:
Every 2.0s: getfattr -d -m . -e hex ddfile2                                                   Mon Feb 27 18:50:16 2017

# file: ddfile2
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.bit-rot.version=0x020000000000000058b3f6720001a78c
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x000000000000000c0000000000000000
trusted.ec.size=0x000000012a05f200
trusted.ec.version=0x0000000000009c400000000000009c40
trusted.gfid=0x5637add23aba4f7a9c3b9535dd6639a3




Now the IO is still going on; if we check the xattrs of both the healthy brick and the brick requiring heal, the METADATA heal completes immediately, but the data heal keeps happening again and again until all IO to that file is stopped.

If we watch the file size (using du -sh), the file seems to be getting healed again and again (ls -lh shows the size consistently, because the metadata heal is complete).


=====brick requiring heal ====
(Note: I reran the case to show it to dev; in this run the file is testme, but the problem is seen consistently.)
######### brick is down
256M	testme
##########
-rw-r--r--. 2 root root 248M Feb 27 18:56 testme
4.7M	testme
##########
-rw-r--r--. 2 root root 611M Feb 27 18:58 testme
477M	testme
##########
-rw-r--r--. 2 root root 705M Feb 27 18:59 testme
110M	testme
##########
-rw-r--r--. 2 root root 798M Feb 27 18:59 testme
552M	testme
##########
-rw-r--r--. 2 root root 891M Feb 27 19:00 testme
1019M	testme
##########
-rw-r--r--. 2 root root 985M Feb 27 19:00 testme
442M	testme
##########
-rw-r--r--. 2 root root 1.1G Feb 27 19:01 testme
899M	testme
##########
-rw-r--r--. 2 root root 1.2G Feb 27 19:01 testme
1.5G	testme
##########
-rw-r--r--. 2 root root 1.3G Feb 27 19:02 testme
458M	testme
##########
-rw-r--r--. 2 root root 1.4G Feb 27 19:02 testme
935M	testme
##########
-rw-r--r--. 2 root root 1.5G Feb 27 19:03 testme
1.6G	testme
##########
-rw-r--r--. 2 root root 1.6G Feb 27 19:03 testme
124M	testme
##########
-rw-r--r--. 2 root root 1.6G Feb 27 19:04 testme
558M	testme
##########
-rw-r--r--. 2 root root 1.7G Feb 27 19:04 testme
1.1G	testme
##########
-rw-r--r--. 2 root root 1.8G Feb 27 19:05 testme
^C



=====healthy brick b1====


######### brick is down
1.0G	testme
##########
-rw-r--r--. 2 root root 516M Feb 27 18:58 testme
1.0G	testme
##########
-rw-r--r--. 2 root root 611M Feb 27 18:58 testme
1.0G	testme
##########
-rw-r--r--. 2 root root 705M Feb 27 18:59 testme
1.0G	testme
##########
-rw-r--r--. 2 root root 798M Feb 27 18:59 testme
1.0G	testme
##########
-rw-r--r--. 2 root root 891M Feb 27 19:00 testme
1.0G	testme
##########
-rw-r--r--. 2 root root 985M Feb 27 19:00 testme
2.0G	testme
##########
-rw-r--r--. 2 root root 1.1G Feb 27 19:01 testme
2.0G	testme
##########
-rw-r--r--. 2 root root 1.2G Feb 27 19:01 testme
2.0G	testme
##########
-rw-r--r--. 2 root root 1.3G Feb 27 19:02 testme
2.0G	testme
##########
-rw-r--r--. 2 root root 1.4G Feb 27 19:02 testme
2.0G	testme
##########
-rw-r--r--. 2 root root 1.5G Feb 27 19:03 testme
2.0G	testme
##########
-rw-r--r--. 2 root root 1.6G Feb 27 19:03 testme
2.0G	testme
##########

Comment 2 Pranith Kumar K 2017-03-03 10:22:50 UTC
Xavi,
     We missed considering this case when we reviewed this patch :-(. I will try to automate this case and add it to our tests.

commit 2c76c52c337a0a14e19f3385268c543aaa06534d
Author: Ashish Pandey <aspandey>
Date:   Fri Oct 23 13:27:51 2015 +0530

    cluster/ec: update version and size on good bricks
    
    Problem: readdir/readdirp fops calls [f]xattrop with
    fop->good which contain only one brick for these operations.
    That causes xattrop to be failed as it requires at least
    "minimum" number of brick.
    
    Solution: Use lock->good_mask to call xattrop. lock->good_mask
    contains all the good locked bricks on which the previous write
    operation was successful.
    
    Change-Id: If1b500391aa6fca6bd863702e030957b694ab499
    BUG: 1274629
    Signed-off-by: Ashish Pandey <aspandey>
    Reviewed-on: http://review.gluster.org/12419
    Tested-by: NetBSD Build System <jenkins.org>
    Reviewed-by: Xavier Hernandez <xhernandez>
    Tested-by: Xavier Hernandez <xhernandez>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>

Pranith

Comment 3 Worker Ant 2017-04-03 08:19:02 UTC
REVIEW: https://review.gluster.org/16985 (cluster/ec: update xattr on healing bricks) posted (#1) for review on master by Ashish Pandey (aspandey)

Comment 4 Worker Ant 2017-05-04 08:44:45 UTC
REVIEW: https://review.gluster.org/16985 (cluster/ec: update xattr on healing bricks) posted (#2) for review on master by Ashish Pandey (aspandey)

Comment 5 Ashish Pandey 2017-05-08 18:30:04 UTC
After applying https://review.gluster.org/16985 

1 - Steps to see data corruption

- Create 2+1 ec volume and mount it on /mnt/vol (brick{0,1,2})
- cd /mnt/vol
- start writing data on a file using "dd if=/dev/urandom of=file conv=fdatasync"
- kill brick0
- Start brick0 directly from the command line ('gluster v start force' was not recreating this issue very often).
- keep observing output of "gluster v heal volname info"
- As soon as heal info shows 0 entries for all the bricks, stop IO on the mount point (Ctrl+C the dd).

>> Check the md5sum of the file by following these steps -
- kill brick0
- unmount /mnt/vol
- mount volume again
- md5sum file  <<<<

- start brick0 using command line

- kill brick1
- unmount /mnt/vol
- mount volume again
- md5sum file  <<<<

- start brick1 using command line

- kill brick2
- unmount /mnt/vol
- mount volume again
- md5sum file  <<<<

The md5sums in the above outputs will differ; I have observed this roughly 6 out of 10 times. (A consolidated script sketch of this check follows below.)
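
A consolidated sketch of the whole check (assuming a 2+1 volume named "vol" served from host "server" and mounted at /mnt/vol; the brick kill/restart steps are placeholders, since the exact glusterfsd command lines are site-specific and can be found via 'gluster v status vol'):

cd /mnt/vol
dd if=/dev/urandom of=file conv=fdatasync &   # background IO, as above
DD_PID=$!

# ... kill brick0 here, then restart it from its own command line ...

# wait until heal info no longer lists the file, then stop the IO
while gluster v heal vol info | grep -q '^/file$'; do sleep 5; done
kill $DD_PID
cd /

# read the file with each brick down in turn and compare the checksums
for b in 0 1 2; do
    # ... kill brick $b here ...
    umount /mnt/vol
    mount -t glusterfs server:/vol /mnt/vol
    md5sum /mnt/vol/file
    # ... restart brick $b from its command line here ...
done

If the corruption reproduces, the three md5sum lines will not match.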


I have also observed one issue related to heal info -

The dirty bit remains greater than 1 on the bricks even after size and version have been healed completely (with self-heal disabled).

This is visible with my patch, but I think the problem lies somewhere else.
If we disable server-side heal and a heal starts due to client-side access of the file, the heal happens, but the dirty bit does not come down to 1 or 0 while IO is going on on that file. The result is that heal info will keep showing this file as requiring heal.

Note, heal info now also checks the dirty bit: if it is more than one, something is considered wrong with the file even if all versions and sizes are the same.

----------

Steps:

- gluster v heal vol disable
- Try to trigger the data corruption issue by following the steps above.
- getfattr -m . -d -e hex <file on all brick paths>
- After some time, version and size will be the same on all bricks; dirty will also be the same, but greater than 1.
- gluster v heal vol info will show 1 entry per brick.
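
A minimal loop for the getfattr check above (the brick paths are hypothetical; substitute your own):

for b in /rhs/brick{1,2,3}/vol; do
    echo "== $b =="
    getfattr -d -m . -e hex "$b/file" | grep 'trusted\.ec\.'
done
gluster v heal vol info

Once version and size match on all bricks but dirty stays above 1, heal info keeps listing the file, as described.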

Comment 6 Worker Ant 2017-05-09 07:07:53 UTC
REVIEW: https://review.gluster.org/16985 (cluster/ec: update xattr on healing bricks) posted (#3) for review on master by Ashish Pandey (aspandey)

Comment 7 Worker Ant 2017-05-18 07:35:31 UTC
REVIEW: https://review.gluster.org/16985 (cluster/ec: update xattr on healing bricks) posted (#4) for review on master by Ashish Pandey (aspandey)

Comment 8 Worker Ant 2017-06-01 11:28:08 UTC
REVIEW: https://review.gluster.org/16985 (cluster/ec:) posted (#5) for review on master by Ashish Pandey (aspandey)

Comment 9 Worker Ant 2017-06-01 11:29:11 UTC
REVIEW: https://review.gluster.org/16985 (cluster/ec: Update xattr and heal size properly) posted (#6) for review on master by Ashish Pandey (aspandey)

Comment 10 Ashish Pandey 2017-06-01 11:31:44 UTC
RCA for the data corruption, provided by Xavi:

The real problem happens in ec_rebuild_data(). Here we receive the
'size' argument which contains the real file size at the time of
starting self-heal and it's assigned to heal->total_size.

After that, a sequence of calls to ec_sync_heal_block() are done. Each
call ends up calling ec_manager_heal_block(), which does the actual work
of healing a block.

First a lock on the inode is taken in state EC_STATE_INIT using
ec_heal_inodelk(). When the lock is acquired, ec_heal_lock_cbk() is
called. This function calls ec_set_inode_size() to store the real size
of the inode (it uses heal->total_size).

The next step is to read the block to be healed. This is done using a
regular ec_readv(). One of the things this call does is to trim the
returned size if the file is smaller than the requested size.

In our case, when we read the last block of a file whose size was 512
(mod 1024) at the time of starting self-heal, ec_readv() will return only
the first 512 bytes of that block, not the whole 1024 bytes.

This isn't a problem since the following ec_writev() sent from the heal
code only attempts to write the amount of data read, so it shouldn't
modify the remaining 512 bytes.

However ec_writev() also checks the file size. If we are writing the
last block of the file (determined by the size stored on the inode that
we have set to heal->total_size), any data beyond the (imposed) end of
file will be cleared with 0's. This causes the 512 bytes after the
heal->total_size to be cleared. Since the file was written after heal
started, the these bytes contained data, so the block written to the
damaged brick will be incorrect.

I've tried to align heal->total_size to a multiple of the stripe size
and it seems to be working fine now.
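
A quick arithmetic illustration of that alignment (the variable names and the 1024-byte stripe are illustrative, not the literal patch):

stripe=1024
total_size=$(( 5*stripe + 512 ))   # file size at heal start, = 512 mod 1024
aligned=$(( (total_size + stripe - 1) / stripe * stripe ))
echo "$total_size -> $aligned"     # prints: 5632 -> 6144
# With heal->total_size rounded up to the stripe boundary, ec_writev()
# during heal no longer treats byte 5632 as end-of-file, so the 512
# bytes the client wrote after heal started are not zeroed out.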

Comment 11 Worker Ant 2017-06-05 09:33:05 UTC
REVIEW: https://review.gluster.org/16985 (cluster/ec: Update xattr and heal size properly) posted (#7) for review on master by Ashish Pandey (aspandey)

Comment 12 Worker Ant 2017-06-06 14:42:01 UTC
COMMIT: https://review.gluster.org/16985 committed in master by Xavier Hernandez (xhernandez) 
------
commit 88c67b72b1d5843d11ce7cba27dd242bd0c23c6a
Author: Ashish Pandey <aspandey>
Date:   Mon Apr 3 12:46:29 2017 +0530

    cluster/ec: Update xattr and heal size properly
    
    Problem-1 : Recursive healing of same file is happening
    when IO is going on even after data heal completes.
    
    Solution:
    RCA: At the end of the write, when ec_update_size_version
    gets called, we send it only on good bricks and not
    on healing brick. Due to this, xattr on healing brick
    will always remain out of sync and when the background
    heal checks source and sink, it finds this brick to be
    healed and starts healing from scratch. That involves
    ftruncate and writing all of the data again.
    
    To solve this, send xattrop on all the good bricks as
    well as healing bricks.
    
    Problem-2: The above fix exposes the data corruption
    during heal. If the write on a file is going on and
    heal finishes, we find that the file gets corrupted.
    
    RCA:
    The real problem happens in ec_rebuild_data(). Here we receive the
    'size' argument which contains the real file size at the time of
    starting self-heal and it's assigned to heal->total_size.
    
    After that, a sequence of calls to ec_sync_heal_block() are done. Each
    call ends up calling ec_manager_heal_block(), which does the actual work
    of healing a block.
    
    First a lock on the inode is taken in state EC_STATE_INIT using
    ec_heal_inodelk(). When the lock is acquired, ec_heal_lock_cbk() is
    called. This function calls ec_set_inode_size() to store the real size
    of the inode (it uses heal->total_size).
    
    The next step is to read the block to be healed. This is done using a
    regular ec_readv(). One of the things this call does is to trim the
    returned size if the file is smaller than the requested size.
    
    In our case, when we read the last block of a file whose size was 512
    (mod 1024) at the time of starting self-heal, ec_readv() will return only
    the first 512 bytes of that block, not the whole 1024 bytes.
    
    This isn't a problem since the following ec_writev() sent from the heal
    code only attempts to write the amount of data read, so it shouldn't
    modify the remaining 512 bytes.
    
    However ec_writev() also checks the file size. If we are writing the
    last block of the file (determined by the size stored on the inode that
    we have set to heal->total_size), any data beyond the (imposed) end of
    file will be cleared with 0's. This causes the 512 bytes after the
    heal->total_size to be cleared. Since the file was written to after heal
    started, these bytes contained data, so the block written to the
    damaged brick will be incorrect.
    
    Solution:
    Align heal->total_size to a multiple of the stripe size.
    
    Thanks "Xavier Hernandez" <xhernandez>
    to find out the root cause and to fix the issue.
    
    Change-Id: I6c9f37b3ff9dd7f5dc1858ad6f9845c05b4e204e
    BUG: 1428673
    Signed-off-by: Ashish Pandey <aspandey>
    Reviewed-on: https://review.gluster.org/16985
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Reviewed-by: Xavier Hernandez <xhernandez>

Comment 13 Shyamsundar 2017-09-05 17:25:47 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed in glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and on the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/

