| Summary: | self-heal not happening, as self-heal info lists the same pending shards to be healed | | |
|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | SATHEESARAN <sasundar> |
| Component: | replicate | Assignee: | Krutika Dhananjay <kdhananj> |
| Status: | CLOSED ERRATA | QA Contact: | SATHEESARAN <sasundar> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.2 | CC: | amukherj, jbyers, kdhananj, mchangir, nchilaka, rcyriac, rhinduja, rhs-bugs, sabose, storage-qa-internal |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.2.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.8.4-11 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1402730 (view as bug list) | Environment: | virt RHGS integration |
| Last Closed: | 2017-03-23 05:53:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | 1402730, 1403646, 1403743, 1408426, 1408712, 1408785, 1408786 | | |
| Bug Blocks: | 1277939, 1351528 | | |
Description (SATHEESARAN, 2016-11-30 11:32:17 UTC)
The bug is consistently recreatable with shard and granular entry heal enabled: a simple dd of 1G into a file while bringing bricks offline and online in cyclic order (all of this while heal is disabled). Enabling heal at the end and monitoring heal-info shows that one entry always shows up in heal-info and never gets healed.

I was not able to hit the bug without granular-entry-heal. This is possibly a bug in the granular entry heal code.

-Krutika

First of all, great catch, Sas!
Now for the test done and the RCA.
TEST:
=====
I ran this same test after modifying the shard xlator to accept a 1KB shard-block-size, solely to increase the likelihood of hitting this bug. I started a dd of 10G on the volume and in the meantime brought bricks offline and online in cyclic order (again with shd-side and client-side heal disabled).
After the third brick was brought offline and dd failed with EROFS/EIO, I captured the xattrs of all the different shards created on all the replicas and also the entries under indices/entry-changes/be318638-e8a0-4c6d-977d-7a937aa84806 on all three replicas.
Then I brought the third brick back up and enabled self-heal. As Sas pointed out, after a while heal came to a halt. I captured the heal-info entries and checked for their names in the entry-changes output captured.
For all of these entries, there was NO entry-changes index corresponding to these entry names on the source bricks. As a result, these entries never got healed in the first place. Because of this, even data and metadata heal didn't proceed and so heal-info count ceased to decrease at this point.
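For reference, below is a rough shell sketch of the brick-cycling part of this test. It is not the exact script used: the volume name, hosts, brick paths and sleep intervals are placeholders, and the shd-side/client-side heal options are assumed to have been disabled beforehand.

```
#!/bin/bash
# Rough sketch only: cycle the bricks of a replica 3 volume offline/online
# while a large dd runs from a client mount. VOL, hosts and brick paths
# are placeholders.
VOL=testvol
BRICKS=("host1:/rhgs/brick1/$VOL" "host2:/rhgs/brick2/$VOL" "host3:/rhgs/brick3/$VOL")

# Assumed to be set beforehand (shd-side and client-side heals disabled):
#   gluster volume set $VOL cluster.self-heal-daemon off
#   gluster volume set $VOL cluster.data-self-heal off
#   gluster volume set $VOL cluster.metadata-self-heal off
#   gluster volume set $VOL cluster.entry-self-heal off

for b in "${BRICKS[@]}"; do
    host=${b%%:*}
    path=${b#*:}
    # Kill the brick process; glusterfsd carries the brick path on its
    # command line, so pkill -f on the path is one way to find it.
    ssh "$host" "pkill -f 'glusterfsd.*${path}'"
    sleep 60                              # let the dd keep writing with one brick down
    gluster volume start "$VOL" force     # bring the downed brick back up
    sleep 30                              # give it a moment before the next cycle
done
```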
RCA:
====
So the bug seems to be in AFR's transaction code (not in the granular-entry-heal code itself, because this issue can happen even with non-granular indices in entry txns, although it is easier to hit with granular-esh on), and it is caused by a case of skipped post-op.
In the code there is only one case where this can happen (which is due to a bug of course).
At the start of the MKNOD txn of these affected shards, all three replicas were up. This caused AFR to set local->optimistic_change_log to TRUE (which means AFR can skip the pre-op phase, and in the event there is a failure it will have to execute the post-op phase of the transaction before unlocking, to set the pending markers). But before the locking phase, one of the replicas went down.
<Some-code-detail>
Because of this, local->transaction.pre_op[offline-brick-index] = 0
local->transaction.failed_subvols[offline-brick-index] = 1
</Some-code-detail>
AFR skips setting the dirty xattrs since local->optimistic_change_log is TRUE.
Then the MKNOD happens.
Then when it comes to doing the post-op (afr_changelog_post_op_now()), AFR relies on two parameters to decide whether post-op is necessary or not:
1. whether dirty xattr was set in pre-op, and
2. whether on any of the replicas pre-op was done but OP phase failed.
If 1 is true, post-op will be done to undo the dirty xattr setting on the bricks.
If 2 is true, again post-op will be done.
In all other cases, post-op will be skipped, and it is post-op where granular-entry indices are created.
But in the case described above, where the brick went down just before the locking phase but was up at the time the txn parameters were initialised, neither the pre-op nor the OP was performed on this brick.
Because of this, AFR would skip performing the post-op. This implies that the granular index won't be set and, in the event granular entry heal is disabled, the index pertaining to the parent directory in indices/xattrop will be skipped.
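To make the decision described above concrete, here is a minimal, self-contained C sketch of the post-op decision as laid out in this RCA. It is NOT the actual AFR source; the struct, field names and helper are simplified stand-ins modelled on the local->transaction fields quoted above.

```
/* Simplified illustration of the post-op decision described above.
 * Not the real afr_changelog_post_op_now(); all names are stand-ins. */
#include <stdbool.h>
#include <stdio.h>

#define REPLICA_COUNT 3

struct txn_state {
    bool optimistic_change_log;           /* all bricks were up when the txn began */
    bool pre_op[REPLICA_COUNT];           /* dirty xattr set on this brick in pre-op */
    bool failed_subvols[REPLICA_COUNT];   /* OP failed / brick unreachable */
};

/* Returns true if the post-op phase (where pending markers and granular
 * entry indices get written) would be performed. */
static bool need_post_op(const struct txn_state *t)
{
    bool dirty_was_set = false;            /* rule 1 from the RCA */
    bool preop_done_op_failed = false;     /* rule 2 from the RCA */

    for (int i = 0; i < REPLICA_COUNT; i++) {
        if (t->pre_op[i])
            dirty_was_set = true;
        if (t->pre_op[i] && t->failed_subvols[i])
            preop_done_op_failed = true;
    }
    return dirty_was_set || preop_done_op_failed;
}

int main(void)
{
    /* The buggy scenario: optimistic changelog meant pre-op was skipped on
     * every brick, and brick 1 dropped off before the lock, so neither
     * pre-op nor OP ran there -- yet that brick clearly needs heal. */
    struct txn_state t = {
        .optimistic_change_log = true,
        .pre_op         = { false, false, false },
        .failed_subvols = { false, true,  false },
    };

    printf("post-op performed: %s\n", need_post_op(&t) ? "yes" : "no");
    /* Prints "no": no pending marker and no granular index are recorded,
     * so self-heal never learns that this entry needs healing. */
    return 0;
}
```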
http://review.gluster.org/#/c/16075/ <--- upstream patch

Here's the downstream patch: https://code.engineering.redhat.com/gerrit/#/c/92713/

When I tested glusterfs-3.8.4-9.el7rhgs with a replica 3 volume, it ended up with the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1404982, where the VM ended up in a paused state. But again in this case, I noticed there were some pending heals which were not healed at all, due to the lack of markings on the .shard directory. I will try reproducing this issue with tcpdump as suggested by Krutika.

Tested with glusterfs-3.8.4-10.el7rhgs with the following steps:
1. Created a replica 3 sharded volume with a shard-block-size of 4MB, optimized the volume for the VM store use case, and started the volume.
2. Enabled granular-entry-heal and compound-fops on the volume.
3. Fuse mounted the volume on a RHEL 7.3 server and created a sparse VM image of size 20GB.
4. Started installing RHEL 7.3 on the VM and, while the installation was in progress, killed the second brick of the volume.
5. After the installation on the VM completed, brought the brick back up.
6. Self-heal got triggered, but 75 entries remained unhealed, as they were missing markings.

I have captured tcpdump and provided it to Krutika (as the captures are a few GBs in size, they were provided separately).

Based on this observation, I am moving this bug to ASSIGNED (FailedQA).

RCA:
So this is a new issue, different from the original issue that was fixed, because the optimistic changelog worked as expected in the latest case.

In this case, for all the affected files (the ones that remain in heal-info output forever), there were two MKNOD attempts on the same shard:
1. once at the time of creating the shard for writing, and
2. another time while reading the shard (in this case, too, the shard xlator first sends an MKNOD before reading from it, in case it is found to be a hole).

Before the start of (1) above, the second brick was brought down. (1) behaved as expected in that a pending entry changelog of +1 was set on the parent directory, in this case .shard, and the index entry was created under indices/entry-changes corresponding to the base name of the shard that was created. So far so good. This information would have been sufficient for self-heal to recreate this file on the offline brick once it is back online.

Now operation (2) happens. Here, after the lock and pre-op phases, the OP phase (which is an MKNOD fop) failed on both online replicas with EEXIST. AFR treats a case where the operation fails with the same error on all replicas as a symmetric error and considers the transaction a success, because the replicas are still consistent with each other. As a result, it goes on to "delete" the name index in the post-op (although the name index was created by (1)).

This means that all entries under .shard which go through this kind of a cycle will not undergo entry self-heal. And because they do not undergo entry self-heal in the first place, they will never see data and metadata self-heal. This means they'll remain in heal-info output forever. This is the same end result as the one seen in the original bug.

This seems to be the same issue that Kasturi hit in https://bugzilla.redhat.com/show_bug.cgi?id=1408426
I'll update that bug as to how the gfid mismatch could have happened.

Here's an *extremely* simple test case to hit this issue:
1. Create a 1x3 plain replicate volume and start it.
2. Enable granular-entry-heal.
3. Disable self-heal-daemon.
4. Disable client-side heals.
5. Enable sharding.
6. Mount the volume from two mount points - /mnt/glusterfs/0 and /mnt/glusterfs/1 (say).
7. From /mnt/glusterfs/0, execute dd writing 12MB into a file.
# dd if=/dev/urandom of=file bs=1M count=12
8. Kill a replica (say second brick).
9. From /mnt/glusterfs/0, execute dd appending another 12MB of data to the same file.
# dd if=/dev/urandom bs=1M count=12 >> file
10. Ensure that indices/entry-changes/be318638-e8a0-4c6d-977d-7a937aa84806 is created and that the name indices for shards 3, 4 and 5 are created under it on the two source bricks.
11. From /mnt/glusterfs/1, cat this file.
With this step, the name indices under indices/entry-changes/be318638-e8a0-4c6d-977d-7a937aa84806 will disappear (which is the bug).
12. Now bring the brick back up and enable self-heal-daemon.
13. Now execute
# gluster volume heal <VOLNAME>
14. Watch the output of heal-info. It will never cease to show shards 3, 4 and 5 in its output; also note that .shard will never appear in the same output.
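The steps above can be approximated with a short script like the one below. This is only a sketch under assumptions: the volume, brick and mount paths are placeholders, all bricks are placed on one node for convenience, and the option and command names are those of glusterfs 3.8-era releases (adjust for your environment). The cat from the second mount in step 11 is what drives the second MKNOD down the EEXIST/symmetric-error path described in the RCA above.

```
#!/bin/bash
# Approximate automation of the test case above; paths and names are placeholders.
VOL=repl3
HOST=$(hostname)

gluster volume create $VOL replica 3 $HOST:/rhgs/brick{1,2,3}/$VOL force   # step 1
gluster volume start $VOL
gluster volume heal $VOL granular-entry-heal enable                        # step 2
gluster volume set $VOL cluster.self-heal-daemon off                       # step 3
gluster volume set $VOL cluster.data-self-heal off                         # step 4
gluster volume set $VOL cluster.metadata-self-heal off
gluster volume set $VOL cluster.entry-self-heal off
gluster volume set $VOL features.shard on                                  # step 5

mkdir -p /mnt/glusterfs/{0,1}                                              # step 6
mount -t glusterfs $HOST:/$VOL /mnt/glusterfs/0
mount -t glusterfs $HOST:/$VOL /mnt/glusterfs/1

dd if=/dev/urandom of=/mnt/glusterfs/0/file bs=1M count=12                 # step 7
pkill -f "glusterfsd.*brick2/$VOL"                                         # step 8: kill the second brick
dd if=/dev/urandom bs=1M count=12 >> /mnt/glusterfs/0/file                 # step 9

ls /rhgs/brick{1,3}/$VOL/.glusterfs/indices/entry-changes/*/               # step 10: name indices present
cat /mnt/glusterfs/1/file > /dev/null                                      # step 11: indices disappear (the bug)
ls /rhgs/brick{1,3}/$VOL/.glusterfs/indices/entry-changes/*/

gluster volume start $VOL force                                            # step 12: brick back up
gluster volume set $VOL cluster.self-heal-daemon on
gluster volume heal $VOL                                                   # step 13
gluster volume heal $VOL info                                              # step 14: shards 3-5 stay listed
```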
(In reply to Krutika Dhananjay from comment #11)
> This seems to be the same issue that Kasturi hit in
> https://bugzilla.redhat.com/show_bug.cgi?id=1408426
> I'll update that bug as to how the gfid mismatch could have happened.

Thanks for the simplified test case in comment 12.

Could you please elaborate on how the gfid mismatch could have happened?

(In reply to SATHEESARAN from comment #15)
> (In reply to Krutika Dhananjay from comment #11)
> > This seems to be the same issue that Kasturi hit in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1408426
> > I'll update that bug as to how the gfid mismatch could have happened.
>
> Thanks for the simplified test case in comment 12.
>
> Could you please elaborate on how the gfid mismatch could have happened?

Had already done that here - https://bugzilla.redhat.com/show_bug.cgi?id=1408426#c8

BZ added to erratum https://errata.devel.redhat.com/advisory/24866
Moving to ON_QA

Tested with the RHGS 3.2.0 interim build (glusterfs-3.8.4-11.el7rhgs), using the simplified steps in comment 12 and also the test in comment 10. After heal completed, there were no entries left unhealed in the 'heal info' output.

Note: the above tests were carried out with replica 3 and a shard size of 4MB.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html