Bug 1428673
| Summary: | possible repeatedly recursive healing of same file with background heal not happening when IO is going on | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Pranith Kumar K <pkarampu> |
| Component: | disperse | Assignee: | Ashish Pandey <aspandey> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | mainline | CC: | amukherj, aspandey, bugs, nchilaka, pkarampu, rcyriac, rhinduja, rhs-bugs, storage-qa-internal |
| Target Milestone: | --- | Keywords: | Regression, Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.12.0 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1427159 | | |
| | 1459392 (view as bug list) | Environment: | |
| Last Closed: | 2017-08-21 07:58:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1427159, 1459392 | | |
Comment 1
Pranith Kumar K
2017-03-03 06:25:38 UTC
Xavi, we missed considering this case when we reviewed this patch :-(. I will try to automate this case and add it to our tests.

commit 2c76c52c337a0a14e19f3385268c543aaa06534d
Author: Ashish Pandey <aspandey>
Date: Fri Oct 23 13:27:51 2015 +0530

cluster/ec: update version and size on good bricks

Problem: readdir/readdirp fops call [f]xattrop with fop->good, which contains only one brick for these operations. That causes the xattrop to fail, as it requires at least the "minimum" number of bricks.

Solution: Use lock->good_mask to call xattrop. lock->good_mask contains all the good locked bricks on which the previous write operation was successful.

Change-Id: If1b500391aa6fca6bd863702e030957b694ab499
BUG: 1274629
Signed-off-by: Ashish Pandey <aspandey>
Reviewed-on: http://review.gluster.org/12419
Tested-by: NetBSD Build System <jenkins.org>
Reviewed-by: Xavier Hernandez <xhernandez>
Tested-by: Xavier Hernandez <xhernandez>
Reviewed-by: Pranith Kumar Karampuri <pkarampu>

Pranith

REVIEW: https://review.gluster.org/16985 (cluster/ec: update xattr on healing bricks) posted (#1) for review on master by Ashish Pandey (aspandey)

REVIEW: https://review.gluster.org/16985 (cluster/ec: update xattr on healing bricks) posted (#2) for review on master by Ashish Pandey (aspandey)

After applying https://review.gluster.org/16985

Steps to see the data corruption:
- Create a 2+1 ec volume and mount it on /mnt/vol (brick{0,1,2})
- cd /mnt/vol
- Start writing data to a file using "dd if=/dev/urandom of=file conv=fdatasync"
- Kill brick0
- Start brick0 using the command line ("gluster v start force" was not recreating this issue very often)
- Keep observing the output of "gluster v heal volname info"
- As soon as heal info shows 0 entries for all the bricks, stop IO on the mount point (Ctrl+C the dd)

>> Check the md5sum of the file with the following steps:
- Kill brick0
- Unmount /mnt/vol
- Mount the volume again
- md5sum file <<<<
- Start brick0 using the command line
- Kill brick1
- Unmount /mnt/vol
- Mount the volume again
- md5sum file <<<<
- Start brick1 using the command line
- Kill brick2
- Unmount /mnt/vol
- Mount the volume again
- md5sum file <<<<

The md5sums would differ across the above outputs. I have observed this almost 6 out of 10 times.

I have also observed one issue related to heal info: the dirty bit remains greater than 1 on the bricks even after size and version have been healed completely (with self-heal disabled). This is visible with my patch, but I think the problem lies somewhere else. If we disable server-side heal and heal starts due to client-side access of the file, heal happens, but the dirty bit does not come back to 1 or 0 if IO is going on on that file. The result is that heal info keeps showing this file as needing heal. Note that heal info now also checks the dirty bit: if it is greater than one, something is wrong with the file even if all versions and sizes are the same.

Steps:
- gluster v heal vol disable
- Try to recreate the data corruption issue by following the above steps
- getfattr -m . -d -e hex <all brick paths> (see the watcher sketch below)
- We can see that after some time the version and size become the same; dirty is also the same across bricks, but remains greater than 1
- gluster v heal info will show 1 entry per brick
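To make the last two checks easier to repeat, here is a minimal watcher sketch that dumps the EC xattrs (trusted.ec.version, trusted.ec.size, trusted.ec.dirty) of the file from every brick and prints the heal-info summary. The volume name, brick paths and file name below are placeholders for a local 2+1 test setup, not values from this report; adjust them to match your environment.

```bash
#!/bin/bash
# Sketch only: watch the EC xattrs of one file on every brick while IO/heal runs.
# VOLUME, BRICKS and FILE are assumed placeholders; adjust to your setup.
VOLUME=vol
BRICKS=(/bricks/brick0 /bricks/brick1 /bricks/brick2)
FILE=file

while true; do
    for b in "${BRICKS[@]}"; do
        echo "== $b/$FILE =="
        # Dump the EC version/size/dirty xattrs in hex for this brick's copy.
        getfattr -m . -d -e hex "$b/$FILE" 2>/dev/null \
            | grep -E 'trusted\.ec\.(version|size|dirty)'
    done
    # Show how many entries each brick still reports as needing heal.
    gluster volume heal "$VOLUME" info | grep -E '^(Brick|Number of entries)'
    sleep 5
done
```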
REVIEW: https://review.gluster.org/16985 (cluster/ec: update xattr on healing bricks) posted (#3) for review on master by Ashish Pandey (aspandey)

REVIEW: https://review.gluster.org/16985 (cluster/ec: update xattr on healing bricks) posted (#4) for review on master by Ashish Pandey (aspandey)

REVIEW: https://review.gluster.org/16985 (cluster/ec:) posted (#5) for review on master by Ashish Pandey (aspandey)

REVIEW: https://review.gluster.org/16985 (cluster/ec: Update xattr and heal size properly) posted (#6) for review on master by Ashish Pandey (aspandey)

RCA for the data corruption, provided by Xavi:

The real problem happens in ec_rebuild_data(). Here we receive the 'size' argument, which contains the real file size at the time of starting self-heal, and it's assigned to heal->total_size.

After that, a sequence of calls to ec_sync_heal_block() is done. Each call ends up calling ec_manager_heal_block(), which does the actual work of healing a block.

First a lock on the inode is taken in state EC_STATE_INIT using ec_heal_inodelk(). When the lock is acquired, ec_heal_lock_cbk() is called. This function calls ec_set_inode_size() to store the real size of the inode (it uses heal->total_size).

The next step is to read the block to be healed. This is done using a regular ec_readv(). One of the things this call does is to trim the returned size if the file is smaller than the requested size. In our case, when we read the last block of a file whose size was = 512 mod 1024 at the time of starting self-heal, ec_readv() will return only the first 512 bytes, not the whole 1024 bytes. This isn't a problem, since the following ec_writev() sent from the heal code only attempts to write the amount of data read, so it shouldn't modify the remaining 512 bytes.

However, ec_writev() also checks the file size. If we are writing the last block of the file (determined by the size stored on the inode, which we have set to heal->total_size), any data beyond the (imposed) end of file will be cleared with 0's. This causes the 512 bytes after heal->total_size to be cleared. Since the file was written to after heal started, these bytes contained data, so the block written to the damaged brick will be incorrect.

I've tried to align heal->total_size to a multiple of the stripe size and it seems to be working fine now.
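As an illustration of the alignment Xavi describes (not the actual ec-heal.c change), the sketch below rounds a size captured at heal start up to the next stripe boundary, assuming a 2+1 volume with 512-byte fragments so the stripe covers 1024 bytes of user data; rounding up rather than down is an assumption here.

```bash
#!/bin/bash
# Illustration only: align a size captured at heal start to the stripe boundary.
STRIPE=1024                              # assumed stripe: 2 data fragments x 512 bytes
total_size=$((5 * 1024 * 1024 + 512))    # example size ending mid-stripe (= 512 mod 1024)

# Round up to the next multiple of STRIPE so the heal's write of the last block
# is not treated as the end of file (which would zero the trailing 512 bytes
# that the application may have written after heal started). Rounding up is an
# assumption about the direction of the alignment, not taken from the patch.
aligned=$(( (total_size + STRIPE - 1) / STRIPE * STRIPE ))

echo "total_size=$total_size aligned=$aligned"   # prints 5243392 and 5243904
```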
REVIEW: https://review.gluster.org/16985 (cluster/ec: Update xattr and heal size properly) posted (#7) for review on master by Ashish Pandey (aspandey)

COMMIT: https://review.gluster.org/16985 committed in master by Xavier Hernandez (xhernandez)

------

commit 88c67b72b1d5843d11ce7cba27dd242bd0c23c6a
Author: Ashish Pandey <aspandey>
Date: Mon Apr 3 12:46:29 2017 +0530

cluster/ec: Update xattr and heal size properly

Problem-1: Recursive healing of the same file happens when IO is going on, even after data heal completes.

Solution:
RCA: At the end of a write, when ec_update_size_version gets called, we send it only to the good bricks and not to the healing brick. Due to this, the xattrs on the healing brick always remain out of sync, and when the background heal checks source and sink, it finds that this brick needs to be healed and starts healing from scratch. That involves an ftruncate and writing all of the data again.

To solve this, send the xattrop to all the good bricks as well as the healing bricks.

Problem-2: The above fix exposes data corruption during heal. If a write to a file is going on while heal finishes, we find that the file gets corrupted.

RCA: The real problem happens in ec_rebuild_data(). Here we receive the 'size' argument, which contains the real file size at the time of starting self-heal, and it's assigned to heal->total_size.

After that, a sequence of calls to ec_sync_heal_block() is done. Each call ends up calling ec_manager_heal_block(), which does the actual work of healing a block.

First a lock on the inode is taken in state EC_STATE_INIT using ec_heal_inodelk(). When the lock is acquired, ec_heal_lock_cbk() is called. This function calls ec_set_inode_size() to store the real size of the inode (it uses heal->total_size).

The next step is to read the block to be healed. This is done using a regular ec_readv(). One of the things this call does is to trim the returned size if the file is smaller than the requested size. In our case, when we read the last block of a file whose size was = 512 mod 1024 at the time of starting self-heal, ec_readv() will return only the first 512 bytes, not the whole 1024 bytes. This isn't a problem, since the following ec_writev() sent from the heal code only attempts to write the amount of data read, so it shouldn't modify the remaining 512 bytes.

However, ec_writev() also checks the file size. If we are writing the last block of the file (determined by the size stored on the inode, which we have set to heal->total_size), any data beyond the (imposed) end of file will be cleared with 0's. This causes the 512 bytes after heal->total_size to be cleared. Since the file was written to after heal started, these bytes contained data, so the block written to the damaged brick will be incorrect.

Solution: Align heal->total_size to a multiple of the stripe size.

Thanks to "Xavier Hernandez" <xhernandez> for finding the root cause and fixing the issue.

Change-Id: I6c9f37b3ff9dd7f5dc1858ad6f9845c05b4e204e
BUG: 1428673
Signed-off-by: Ashish Pandey <aspandey>
Reviewed-on: https://review.gluster.org/16985
Smoke: Gluster Build System <jenkins.org>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
Reviewed-by: Xavier Hernandez <xhernandez>

This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/