+++ This bug was initially created as a clone of Bug #1428673 +++
+++ This bug was initially created as a clone of Bug #1427159 +++

Description of problem:
=========================
When a file requires heal and there is continuous IO happening on it, the heal never seems to finish: the file is healed again and again until all IO to it, be it read or write, is stopped. The problem is more serious when the IO consists of appends.

The impact of this is as below:
1) The same file requires heal many times, so healing takes a very long time; in an ideal case a 1GB heal finishes in about 2 minutes, but here it can take hours.
2) Unnecessary CPU cycles are spent on healing the same file again and again.

1) Started an append to a file on a 2x(4+2) volume from a fuse client as below:
dd if=/dev/urandom bs=1MB count=10000 >>ddfile2

2) getfattr of the file from one of the bricks (all bricks are healthy). Because of eager locking the dirty flag stays set to 1, as the lock is not released while no other request arrives:

# file: ddfile2
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.bit-rot.version=0x020000000000000058b3f6720001a78c
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000010000000000000001
trusted.ec.size=0x0000000000000000
trusted.ec.version=0x00000000000000000000000000000000
trusted.gfid=0x5637add23aba4f7a9c3b9535dd6639a3

3) File size, checked from one of the clients, before bringing b2 down:

[root@dhcp35-45 ecvol]# for i in {1..100};do du -sh ddfile2 ;echo "##########";ls -lh ddfile2 ;sleep 30;done
1.0G ddfile2 ########## -rw-r--r--. 2 root root 1012M Feb 27 18:46 ddfile2

4) Brought down b2.

5) The ec attributes get updated periodically on all healthy bricks:

# file: ddfile2
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.bit-rot.version=0x020000000000000058b3f6720001a78c
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x000000000000000c000000000000000c
trusted.ec.size=0x0000000119ded040
trusted.ec.version=0x00000000000093c800000000000093c8
trusted.gfid=0x5637add23aba4f7a9c3b9535dd6639a3

6) heal info as below:

Every 2.0s: gluster v heal ecvol info                                    Mon Feb 27 18:48:13 2017

Brick dhcp35-45.lab.eng.blr.redhat.com:/rhs/brick3/ecvol
/ddfile2
Status: Connected
Number of entries: 1

Brick dhcp35-130.lab.eng.blr.redhat.com:/rhs/brick3/ecvol
Status: Transport endpoint is not connected
Number of entries: -

Brick dhcp35-122.lab.eng.blr.redhat.com:/rhs/brick3/ecvol
/ddfile2
Status: Connected
Number of entries: 1

Brick dhcp35-23.lab.eng.blr.redhat.com:/rhs/brick3/ecvol
/ddfile2
Status: Connected
Number of entries: 1

Brick dhcp35-112.lab.eng.blr.redhat.com:/rhs/brick3/ecvol
/ddfile2
Status: Connected
Number of entries: 1

Brick dhcp35-138.lab.eng.blr.redhat.com:/rhs/brick3/ecvol
/ddfile2
Status: Connected
Number of entries: 1

Brought the brick back up. Healthy brick xattrs:

Every 2.0s: getfattr -d -m . -e hex ddfile2                              Mon Feb 27 18:50:16 2017

# file: ddfile2
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.bit-rot.version=0x020000000000000058b3f6720001a78c
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x000000000000000c0000000000000000
trusted.ec.size=0x000000012a05f200
trusted.ec.version=0x0000000000009c400000000000009c40
trusted.gfid=0x5637add23aba4f7a9c3b9535dd6639a3

Now the IO is still going on. If we check the xattrs of both the healthy bricks and the brick requiring heal, the METADATA heal completes immediately, but the data heal keeps happening again and again until all IO to that file is stopped. If we watch the file size on the bricks (using du -sh), the file appears to be getting healed again and again (ls -lh shows the size consistently, because the metadata heal has completed).

===== brick requiring heal =====
(Note: I reran the case to show it to dev; in this run the file is testme, but the problem is seen consistently.)

######### brick is down
256M testme ########## -rw-r--r--. 2 root root 248M Feb 27 18:56 testme
4.7M testme ########## -rw-r--r--. 2 root root 611M Feb 27 18:58 testme
477M testme ########## -rw-r--r--. 2 root root 705M Feb 27 18:59 testme
110M testme ########## -rw-r--r--. 2 root root 798M Feb 27 18:59 testme
552M testme ########## -rw-r--r--. 2 root root 891M Feb 27 19:00 testme
1019M testme ########## -rw-r--r--. 2 root root 985M Feb 27 19:00 testme
442M testme ########## -rw-r--r--. 2 root root 1.1G Feb 27 19:01 testme
899M testme ########## -rw-r--r--. 2 root root 1.2G Feb 27 19:01 testme
1.5G testme ########## -rw-r--r--. 2 root root 1.3G Feb 27 19:02 testme
458M testme ########## -rw-r--r--. 2 root root 1.4G Feb 27 19:02 testme
935M testme ########## -rw-r--r--. 2 root root 1.5G Feb 27 19:03 testme
1.6G testme ########## -rw-r--r--. 2 root root 1.6G Feb 27 19:03 testme
124M testme ########## -rw-r--r--. 2 root root 1.6G Feb 27 19:04 testme
558M testme ########## -rw-r--r--. 2 root root 1.7G Feb 27 19:04 testme
1.1G testme ########## -rw-r--r--. 2 root root 1.8G Feb 27 19:05 testme
^C

===== healthy brick b1 =====

######### brick is down
1.0G testme ########## -rw-r--r--. 2 root root 516M Feb 27 18:58 testme
1.0G testme ########## -rw-r--r--. 2 root root 611M Feb 27 18:58 testme
1.0G testme ########## -rw-r--r--. 2 root root 705M Feb 27 18:59 testme
1.0G testme ########## -rw-r--r--. 2 root root 798M Feb 27 18:59 testme
1.0G testme ########## -rw-r--r--. 2 root root 891M Feb 27 19:00 testme
1.0G testme ########## -rw-r--r--. 2 root root 985M Feb 27 19:00 testme
2.0G testme ########## -rw-r--r--. 2 root root 1.1G Feb 27 19:01 testme
2.0G testme ########## -rw-r--r--. 2 root root 1.2G Feb 27 19:01 testme
2.0G testme ########## -rw-r--r--. 2 root root 1.3G Feb 27 19:02 testme
2.0G testme ########## -rw-r--r--. 2 root root 1.4G Feb 27 19:02 testme
2.0G testme ########## -rw-r--r--. 2 root root 1.5G Feb 27 19:03 testme
2.0G testme ########## -rw-r--r--. 2 root root 1.6G Feb 27 19:03 testme
2.0G testme ##########

Now the append is complete; the xattrs show the size as different as below, even though the data/metadata dirty bits are cleaned up.
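For anyone decoding the trusted.ec.* values in the dumps above, here is a small, self-contained sketch. It assumes the usual disperse layout -- trusted.ec.dirty and trusted.ec.version each pack two big-endian 64-bit counters, data first and metadata second, and trusted.ec.size is a single 64-bit byte count; the helper below is hypothetical and not part of the GlusterFS tree.

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Split a 32-hex-digit xattr value (e.g. trusted.ec.dirty) into its two
 * 64-bit halves; by the assumption above, the first half is the data
 * counter and the second the metadata counter. */
static void decode_pair(const char *hex32, uint64_t *data, uint64_t *meta)
{
    char half[17];

    memcpy(half, hex32, 16);
    half[16] = '\0';
    *data = strtoull(half, NULL, 16);

    memcpy(half, hex32 + 16, 16);
    half[16] = '\0';
    *meta = strtoull(half, NULL, 16);
}

int main(void)
{
    uint64_t d, m;

    /* Values taken from step 5 above (healthy bricks while b2 is down). */
    decode_pair("000000000000000c000000000000000c", &d, &m);
    printf("dirty:   data=%" PRIu64 " metadata=%" PRIu64 "\n", d, m);

    decode_pair("00000000000093c800000000000093c8", &d, &m);
    printf("version: data=%" PRIu64 " metadata=%" PRIu64 "\n", d, m);

    /* trusted.ec.size is a single 64-bit byte count. */
    printf("size:    %llu bytes\n", strtoull("0000000119ded040", NULL, 16));
    return 0;
}

With the step-5 values this prints dirty counters of 12/12, versions of 37832/37832, and a size of 4729000000 bytes.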
--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-02-27 08:44:20 EST ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs-3.2.0' to '?'. If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from nchilaka on 2017-02-27 08:49:45 EST ---

Note that I used 'service glusterd restart' to bring the brick online, in order to avoid restarting all the SHDs with 'start force'.

--- Additional comment from nchilaka on 2017-02-27 09:19:12 EST ---

With very high probability this looks like a regression introduced between dev builds of 3.2.
Also, note that this problem can escape notice because of the spurious heal info entries we have in ec:
https://bugzilla.redhat.com/show_bug.cgi?id=1347257#c9
https://bugzilla.redhat.com/show_bug.cgi?id=1347251#c7

--- Additional comment from Ashish Pandey on 2017-03-01 00:50:40 EST ---

I could reproduce this bug with the latest upstream source as well. The only difference I observed is that in my case the data heal completed fully, but the version and size were then not updated on the brick which had been killed, so healing started again from scratch.

Also, I think the metadata version should be healed and updated immediately, and only then should the data heal start. But even the metadata version is not getting healed:

# file: brick/gluster/vol-1/file
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.bit-rot.version=0x020000000000000058b65a69000587e1
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x000000000000a1c70000000000000d4a
trusted.ec.size=0x0000000392320000
trusted.ec.version=0x000000000001c919000000000001c919
trusted.gfid=0xc20fa4c7134f4a0689316c97ba13519d

# file: brick/gluster/vol-2/file
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.bit-rot.version=0x030000000000000058b65cb800085657
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x000000000000a1c70000000000000d4a
trusted.ec.size=0x00000000a4a00000
trusted.ec.version=0x4000000000005250000000000001aba2
trusted.gfid=0xc20fa4c7134f4a0689316c97ba13519d

# file: brick/gluster/vol-3/file
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.bit-rot.version=0x020000000000000058b6599c000b3025
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x000000000000a1c70000000000000d4a
trusted.ec.size=0x0000000392320000
trusted.ec.version=0x000000000001c919000000000001c919
trusted.gfid=0xc20fa4c7134f4a0689316c97ba13519d
--- Additional comment from Pranith Kumar K on 2017-03-03 05:22:50 EST ---

Xavi,
We missed considering this case when we reviewed this patch :-(. I will try to automate this case and add it to our tests.

commit 2c76c52c337a0a14e19f3385268c543aaa06534d
Author: Ashish Pandey <aspandey>
Date: Fri Oct 23 13:27:51 2015 +0530

cluster/ec: update version and size on good bricks

Problem: readdir/readdirp fops call [f]xattrop with fop->good, which contains only one brick for these operations. That causes the xattrop to fail, as it requires at least the "minimum" number of bricks.

Solution: Use lock->good_mask to call xattrop. lock->good_mask contains all the good locked bricks on which the previous write operation was successful.

Change-Id: If1b500391aa6fca6bd863702e030957b694ab499
BUG: 1274629
Signed-off-by: Ashish Pandey <aspandey>
Reviewed-on: http://review.gluster.org/12419
Tested-by: NetBSD Build System <jenkins.org>
Reviewed-by: Xavier Hernandez <xhernandez>
Tested-by: Xavier Hernandez <xhernandez>
Reviewed-by: Pranith Kumar Karampuri <pkarampu>

Pranith
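To see why that patch moved the xattrop from fop->good to lock->good_mask, here is a small illustrative sketch (the variable names and mask values below are invented, not the real ec structures): an xattrop can only be wound if its brick mask still holds at least the "minimum" number of bricks, which is four in a 4+2 configuration.

#include <stdint.h>
#include <stdio.h>

/* Count the bricks present in a subvolume bitmask (one bit per brick). */
static int brick_count(uint64_t mask)
{
    int n = 0;

    while (mask) {
        n += mask & 1;
        mask >>= 1;
    }
    return n;
}

int main(void)
{
    /* Hypothetical masks for one 4+2 disperse set (bricks are bits 0..5). */
    uint64_t fop_good       = 0x01; /* readdirp was answered by one brick  */
    uint64_t lock_good_mask = 0x3f; /* bricks where the last write worked  */
    int      minimum        = 4;    /* data fragments needed for any fop   */

    printf("xattrop with fop->good      : %d bricks -> %s\n",
           brick_count(fop_good),
           brick_count(fop_good) >= minimum ? "possible" : "fails");
    printf("xattrop with lock->good_mask: %d bricks -> %s\n",
           brick_count(lock_good_mask),
           brick_count(lock_good_mask) >= minimum ? "possible" : "fails");
    return 0;
}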
--- Additional comment from Worker Ant on 2017-04-03 04:19:02 EDT ---

REVIEW: https://review.gluster.org/16985 (cluster/ec: update xattr on healing bricks) posted (#1) for review on master by Ashish Pandey (aspandey)

--- Additional comment from Worker Ant on 2017-05-04 04:44:45 EDT ---

REVIEW: https://review.gluster.org/16985 (cluster/ec: update xattr on healing bricks) posted (#2) for review on master by Ashish Pandey (aspandey)

--- Additional comment from Ashish Pandey on 2017-05-08 14:30:04 EDT ---

After applying https://review.gluster.org/16985:

1 - Steps to see the data corruption:
- Create a 2+1 ec volume and mount it on /mnt/vol (brick{0,1,2}).
- cd /mnt/vol
- Start writing data to a file using "dd if=/dev/urandom of=file conv=fdatasync".
- Kill brick0.
- Start brick0 using the command line ('gluster v start force' was not recreating this issue very often).
- Keep observing the output of "gluster v heal volname info".
- As soon as heal info shows 0 entries for all the bricks, stop the IO on the mount point (Ctrl+C the dd).

>> Check the md5sum of the file by following these steps:
- kill brick0
- unmount /mnt/vol
- mount the volume again
- md5sum file   <<<<
- start brick0 using the command line
- kill brick1
- unmount /mnt/vol
- mount the volume again
- md5sum file   <<<<
- start brick1 using the command line
- kill brick2
- unmount /mnt/vol
- mount the volume again
- md5sum file   <<<<

The md5sums in the above outputs would differ. I have observed this almost 6 times out of 10.

I have also observed one issue related to heal info: the dirty bit remains greater than 1 on the bricks even after the size and version have been healed completely (with self-heal disabled). This is visible with my patch, but I think the problem lies somewhere else. If we disable server-side heal and a heal is triggered by client-side access of the file, the heal happens, but the dirty bit does not come back to 1 or 0 if IO is going on on that file. As a result, heal info will keep showing this file as requiring heal. Note that heal info now also checks the dirty bit: if it is more than one, the file is considered to need heal even if all versions and sizes are the same.

Steps:
- gluster v heal vol disable
- Try to trigger the data corruption issue by following the steps above.
- getfattr -d -m . -e hex <all brick paths>
- We can see that after some time the version and size become the same on all bricks; the dirty value is also the same, but more than 1.
- gluster v heal info will show 1 entry per brick.
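The heal-info behaviour described in the previous comment can be pictured with a short sketch (purely illustrative; the structure and function below are invented and are not the actual glfsheal code): an entry is reported as needing heal when the bricks disagree on version or size, and also when any dirty counter is still above 1, which is exactly the stuck state described above.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-brick summary of the trusted.ec.* xattrs. */
struct ec_brick_state {
    uint64_t version_data;
    uint64_t version_meta;
    uint64_t size;
    uint64_t dirty_data;
    uint64_t dirty_meta;
};

/* Sketch of the decision described in the comment above. */
static bool needs_heal(const struct ec_brick_state *b, int nbricks)
{
    for (int i = 0; i < nbricks; i++) {
        if (b[i].version_data != b[0].version_data ||
            b[i].version_meta != b[0].version_meta ||
            b[i].size != b[0].size)
            return true;          /* bricks disagree: heal required        */
        if (b[i].dirty_data > 1 || b[i].dirty_meta > 1)
            return true;          /* pending marks that were never cleared */
    }
    return false;
}

int main(void)
{
    /* Versions and sizes already match, but dirty is stuck above 1: the
     * state seen with self-heal disabled while IO keeps going. */
    struct ec_brick_state bricks[3] = {
        { 0x1c919, 0x1c919, 0x392320000ULL, 5, 5 },
        { 0x1c919, 0x1c919, 0x392320000ULL, 5, 5 },
        { 0x1c919, 0x1c919, 0x392320000ULL, 5, 5 },
    };

    printf("heal info would %slist the file\n", needs_heal(bricks, 3) ? "" : "not ");
    return 0;
}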
--- Additional comment from Worker Ant on 2017-05-09 03:07:53 EDT ---

REVIEW: https://review.gluster.org/16985 (cluster/ec: update xattr on healing bricks) posted (#3) for review on master by Ashish Pandey (aspandey)

--- Additional comment from Worker Ant on 2017-05-18 03:35:31 EDT ---

REVIEW: https://review.gluster.org/16985 (cluster/ec: update xattr on healing bricks) posted (#4) for review on master by Ashish Pandey (aspandey)

--- Additional comment from Worker Ant on 2017-06-01 07:28:08 EDT ---

REVIEW: https://review.gluster.org/16985 (cluster/ec:) posted (#5) for review on master by Ashish Pandey (aspandey)

--- Additional comment from Worker Ant on 2017-06-01 07:29:11 EDT ---

REVIEW: https://review.gluster.org/16985 (cluster/ec: Update xattr and heal size properly) posted (#6) for review on master by Ashish Pandey (aspandey)

--- Additional comment from Ashish Pandey on 2017-06-01 07:31:44 EDT ---

RCA for the data corruption, provided by Xavi:

The real problem happens in ec_rebuild_data(). Here we receive the 'size' argument, which contains the real file size at the time of starting self-heal, and it's assigned to heal->total_size.

After that, a sequence of calls to ec_sync_heal_block() is done. Each call ends up calling ec_manager_heal_block(), which does the actual work of healing a block.

First a lock on the inode is taken in state EC_STATE_INIT using ec_heal_inodelk(). When the lock is acquired, ec_heal_lock_cbk() is called. This function calls ec_set_inode_size() to store the real size of the inode (it uses heal->total_size).

The next step is to read the block to be healed. This is done using a regular ec_readv(). One of the things this call does is to trim the returned size if the file is smaller than the requested size. In our case, when we read the last block of a file whose size was = 512 mod 1024 at the time of starting self-heal, ec_readv() will return only the first 512 bytes, not the whole 1024 bytes.

This isn't a problem, since the following ec_writev() sent from the heal code only attempts to write the amount of data read, so it shouldn't modify the remaining 512 bytes. However, ec_writev() also checks the file size. If we are writing the last block of the file (determined by the size stored on the inode, which we have set to heal->total_size), any data beyond the (imposed) end of file will be cleared with 0's. This causes the 512 bytes after heal->total_size to be cleared. Since the file was written after heal started, these bytes contained data, so the block written to the damaged brick will be incorrect.

I've tried to align heal->total_size to a multiple of the stripe size and it seems to be working fine now.
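A minimal sketch of that alignment, reusing the 1024-byte block size from the example above (the helper name is invented; this is not the actual ec heal code). Rounding heal->total_size up to the next stripe boundary keeps the "imposed end of file" used during heal from cutting into a stripe that may already contain data appended after the heal started:

#include <inttypes.h>
#include <stdio.h>

/* Round a size up to the next multiple of the stripe size. */
static uint64_t align_to_stripe(uint64_t size, uint64_t stripe)
{
    return ((size + stripe - 1) / stripe) * stripe;
}

int main(void)
{
    uint64_t stripe     = 1024;           /* block size used in the example */
    uint64_t total_size = 5 * 1024 + 512; /* file size = 512 mod 1024 when
                                             self-heal starts               */

    /* Unaligned, a heal write of the last block would clear bytes
     * 5632..6143 on the healed brick, which may already hold live data.
     * Aligned, the imposed EOF falls on a stripe boundary and nothing
     * beyond the data actually read gets zeroed. */
    printf("heal->total_size unaligned: %" PRIu64 "\n", total_size);
    printf("heal->total_size aligned  : %" PRIu64 "\n",
           align_to_stripe(total_size, stripe));
    return 0;
}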
--- Additional comment from Worker Ant on 2017-06-05 05:33:05 EDT ---

REVIEW: https://review.gluster.org/16985 (cluster/ec: Update xattr and heal size properly) posted (#7) for review on master by Ashish Pandey (aspandey)

--- Additional comment from Worker Ant on 2017-06-06 10:42:01 EDT ---

COMMIT: https://review.gluster.org/16985 committed in master by Xavier Hernandez (xhernandez)

------

commit 88c67b72b1d5843d11ce7cba27dd242bd0c23c6a
Author: Ashish Pandey <aspandey>
Date: Mon Apr 3 12:46:29 2017 +0530

cluster/ec: Update xattr and heal size properly

Problem-1: Recursive healing of the same file keeps happening when IO is going on, even after the data heal completes.

Solution:
RCA: At the end of the write, when ec_update_size_version gets called, we send it only on the good bricks and not on the healing brick. Due to this, the xattrs on the healing brick always remain out of sync, and when the background heal checks source and sink, it finds this brick needs to be healed and starts healing from scratch. That involves an ftruncate and writing all of the data again. To solve this, send the xattrop on all the good bricks as well as the healing bricks.

Problem-2: The above fix exposes data corruption during heal. If writes to a file are going on while heal finishes, we find that the file gets corrupted.

RCA: The real problem happens in ec_rebuild_data(). Here we receive the 'size' argument, which contains the real file size at the time of starting self-heal, and it's assigned to heal->total_size.

After that, a sequence of calls to ec_sync_heal_block() is done. Each call ends up calling ec_manager_heal_block(), which does the actual work of healing a block.

First a lock on the inode is taken in state EC_STATE_INIT using ec_heal_inodelk(). When the lock is acquired, ec_heal_lock_cbk() is called. This function calls ec_set_inode_size() to store the real size of the inode (it uses heal->total_size).

The next step is to read the block to be healed. This is done using a regular ec_readv(). One of the things this call does is to trim the returned size if the file is smaller than the requested size. In our case, when we read the last block of a file whose size was = 512 mod 1024 at the time of starting self-heal, ec_readv() will return only the first 512 bytes, not the whole 1024 bytes.

This isn't a problem, since the following ec_writev() sent from the heal code only attempts to write the amount of data read, so it shouldn't modify the remaining 512 bytes. However, ec_writev() also checks the file size. If we are writing the last block of the file (determined by the size stored on the inode, which we have set to heal->total_size), any data beyond the (imposed) end of file will be cleared with 0's. This causes the 512 bytes after heal->total_size to be cleared. Since the file was written after heal started, these bytes contained data, so the block written to the damaged brick will be incorrect.

Solution: Align heal->total_size to a multiple of the stripe size.

Thanks to Xavier Hernandez <xhernandez> for finding the root cause and fixing the issue.

Change-Id: I6c9f37b3ff9dd7f5dc1858ad6f9845c05b4e204e
BUG: 1428673
Signed-off-by: Ashish Pandey <aspandey>
Reviewed-on: https://review.gluster.org/16985
Smoke: Gluster Build System <jenkins.org>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
Reviewed-by: Xavier Hernandez <xhernandez>
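The Problem-1 part of the fix can be pictured with brick masks as well (sketch only; the variable names are invented, and the actual change is in ec_update_size_version() and the code it calls): the post-write size/version xattrop has to include the brick being healed, otherwise its xattrs stay permanently behind and the background heal keeps selecting it as a sink and restarting from scratch.

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical masks for one 4+2 disperse set, one bit per brick. */
    uint64_t good_bricks    = 0x2f; /* bricks 0-3 and 5: fully in sync */
    uint64_t healing_bricks = 0x10; /* brick 4: currently being healed */

    /* Before the fix: the size/version update went only to good bricks,
     * so brick 4 never caught up and was healed again and again. */
    uint64_t xattrop_before = good_bricks;

    /* After the fix: the update is sent to the healing bricks as well. */
    uint64_t xattrop_after  = good_bricks | healing_bricks;

    printf("xattrop mask before fix: 0x%02" PRIx64 "\n", xattrop_before);
    printf("xattrop mask after fix : 0x%02" PRIx64 "\n", xattrop_after);
    return 0;
}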
REVIEW: https://review.gluster.org/17482 (cluster/ec: Update xattr and heal size properly) posted (#2) for review on release-3.11 by Ashish Pandey (aspandey)
COMMIT: https://review.gluster.org/17482 committed in release-3.11 by Shyamsundar Ranganathan (srangana)

------

commit 838cf30e3cf4c34f294742a38f0b4ae5cf23bc2a
Author: Ashish Pandey <aspandey>
Date: Mon Apr 3 12:46:29 2017 +0530

cluster/ec: Update xattr and heal size properly

Problem-1: Recursive healing of the same file keeps happening when IO is going on, even after the data heal completes.

Solution:
RCA: At the end of the write, when ec_update_size_version gets called, we send it only on the good bricks and not on the healing brick. Due to this, the xattrs on the healing brick always remain out of sync, and when the background heal checks source and sink, it finds this brick needs to be healed and starts healing from scratch. That involves an ftruncate and writing all of the data again. To solve this, send the xattrop on all the good bricks as well as the healing bricks.

Problem-2: The above fix exposes data corruption during heal. If writes to a file are going on while heal finishes, we find that the file gets corrupted.

RCA: The real problem happens in ec_rebuild_data(). Here we receive the 'size' argument, which contains the real file size at the time of starting self-heal, and it's assigned to heal->total_size.

After that, a sequence of calls to ec_sync_heal_block() is done. Each call ends up calling ec_manager_heal_block(), which does the actual work of healing a block.

First a lock on the inode is taken in state EC_STATE_INIT using ec_heal_inodelk(). When the lock is acquired, ec_heal_lock_cbk() is called. This function calls ec_set_inode_size() to store the real size of the inode (it uses heal->total_size).

The next step is to read the block to be healed. This is done using a regular ec_readv(). One of the things this call does is to trim the returned size if the file is smaller than the requested size. In our case, when we read the last block of a file whose size was = 512 mod 1024 at the time of starting self-heal, ec_readv() will return only the first 512 bytes, not the whole 1024 bytes.

This isn't a problem, since the following ec_writev() sent from the heal code only attempts to write the amount of data read, so it shouldn't modify the remaining 512 bytes. However, ec_writev() also checks the file size. If we are writing the last block of the file (determined by the size stored on the inode, which we have set to heal->total_size), any data beyond the (imposed) end of file will be cleared with 0's. This causes the 512 bytes after heal->total_size to be cleared. Since the file was written after heal started, these bytes contained data, so the block written to the damaged brick will be incorrect.

Solution: Align heal->total_size to a multiple of the stripe size.

Thanks to Xavier Hernandez <xhernandez> for finding the root cause and fixing the issue.
>Change-Id: I6c9f37b3ff9dd7f5dc1858ad6f9845c05b4e204e
>BUG: 1428673
>Signed-off-by: Ashish Pandey <aspandey>
>Reviewed-on: https://review.gluster.org/16985
>Smoke: Gluster Build System <jenkins.org>
>NetBSD-regression: NetBSD Build System <jenkins.org>
>CentOS-regression: Gluster Build System <jenkins.org>
>Reviewed-by: Pranith Kumar Karampuri <pkarampu>
>Reviewed-by: Xavier Hernandez <xhernandez>
>Signed-off-by: Ashish Pandey <aspandey>

Change-Id: I6c9f37b3ff9dd7f5dc1858ad6f9845c05b4e204e
BUG: 1459392
Signed-off-by: Ashish Pandey <aspandey>
Reviewed-on: https://review.gluster.org/17482
Smoke: Gluster Build System <jenkins.org>
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
Tested-by: Pranith Kumar Karampuri <pkarampu>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
This bug is being closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.1, please open a new bug report.

glusterfs-3.11.1 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-June/000074.html
[2] https://www.gluster.org/pipermail/gluster-users/