Description of problem:
After killing one of the bricks of a replica 3 volume in a cyclic fashion (kill brick1, bring brick1 up, let self-heal progress, repeat with the other 2 bricks), I see that healing is not happening for a few shards.
self-heal info shows the same number of entries for a long time.
Version-Release number of selected component (if applicable):
RHGS 3.2.0 interim build (glusterfs-3.8.4-6.el7rhgs)
Steps to Reproduce:
1. Create a replica 3 volume and optimize it for the VM store use case. Enable granular-entry-heal and also compound-fops
2. Fuse mount the volume and create 2 VM images of size 50G (sparse, thinly allocated)
3. Create 2 app VMs with these images and install RHEL 7.3 on them
4. Keep doing I/O from these VMs using 'dd'
5. Kill brick1 and wait until the number of entries to heal grows to around 1000
6. Start brick1 by force starting the volume
7. Let the self-heal happen, but before it can complete, kill brick2
8. After a few more minutes, start brick2 by force starting the volume
9. Let the self-heal happen, but before it can complete, kill brick3
10. After a few more minutes, start brick3 by force starting the volume
11. Do one more iteration of steps 5-10
Actual results:
self-heal doesn't happen for a few shards

Expected results:
self-heal should heal all the pending entries
The bug is consistently reproducible with shard and granular-entry-heal enabled: do a simple dd of 1G into a file and, during this time, bring bricks offline and online in cyclic order (all of this while heal is disabled). Enabling heal at the end and monitoring heal-info shows that one entry always remains in its output.
I was not able to hit the bug without granular-entry-heal. This is possibly a bug in granular entry heal code.
First of all, great catch, Sas!
Now for the test done and the RCA.
I ran this same test by modifying shard xl to accept 1KB shard-block-size solely for the purpose of increasing the likelihood of hitting this bug and started a dd of 10G on the volume and in the meantime brought bricks offline and online in cyclic order (also with shd-side and client-side heal disabled).
After the third brick was brought offline and dd failed with EROFS/EIO, I captured the xattrs of all the different shards created on all the replicas and also the entries under indices/entry-changes/be318638-e8a0-4c6d-977d-7a937aa84806 on all three replicas.
Then I brought the third brick back up and enabled self-heal. As Sas pointed out, after a while heal came to a halt. I captured the heal-info entries and checked for their names in the entry-changes output captured.
For all of these entries, there was NO entry-changes index corresponding to these entry names on the source bricks. As a result, these entries never got healed in the first place. Because of this, even data and metadata heal didn't proceed and so heal-info count ceased to decrease at this point.
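To make the stall described above concrete, here is a hypothetical sketch (not glusterfs code; names are illustrative) of why heal-info stops draining when the name index under indices/entry-changes/&lt;parent-gfid&gt; is missing: with granular entry heal, entry self-heal recreates only the names recorded in that index, and data/metadata heal of a shard can only follow once entry heal has recreated it on the sink.

```python
# Hypothetical model of granular entry heal's dependence on the
# entry-changes name index. Not the actual glusterfs implementation.

def heal_directory(entry_changes_index, pending):
    """Return the heal-info entries still pending after one heal pass.

    entry_changes_index: names recorded under entry-changes at post-op time
    pending: shards awaiting heal (what heal-info shows)
    """
    # Entry heal only considers names present in the index; a shard with
    # no index entry is never recreated, so its data heal never starts.
    healed = [name for name in pending if name in entry_changes_index]
    return [name for name in pending if name not in healed]

pending = ["shard.3", "shard.4", "shard.5"]

# Post-op was skipped, so no name indices were recorded: the count
# shown by heal-info never decreases, no matter how many passes run.
assert heal_directory([], pending) == pending

# Had post-op run, the indices would exist and heal would drain:
assert heal_directory(pending, pending) == []
```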
So the bug seems to be in AFR's transaction (not related to granular-entry-heal code itself because this issue can happen even with non-granular indices in entry txns although it is easier to hit with granular-esh on) and caused by a case of skipped post-op.
In the code there is only one case where this can happen (which is due to a bug of course).
At the start of the MKNOD txn of these affected shards, all three replicas were up. This caused AFR to set local->optimistic_change_log to TRUE (which means AFR can skip the pre-op phase; in the event of a failure, it will have to execute the post-op phase of the transaction before unlocking, to set the pending markers). But before the locking phase, one of the replicas went down.
Because of this, local->transaction.pre_op[offline-brick-index] = 0
local->transaction.failed_subvols[offline-brick-index] = 1
AFR skips setting the dirty xattrs since local->optimistic_change_log is TRUE.
Then the MKNOD happens.
Then when it comes to doing the post-op (afr_changelog_post_op_now()), AFR relies on two parameters to decide whether post-op is necessary or not:
1. whether dirty xattr was set in pre-op, and
2. whether on any of the replicas pre-op was done but OP phase failed.
If 1 is true, post-op will be done to undo the dirty xattr setting on the bricks.
If 2 is true, again post-op will be done.
In all other cases, post-op will be skipped, and it is post-op where granular-entry indices are created.
But in the case described above where the brick went down just before the locking phase but was up at the time of txn parameters initialisation, neither pre-op nor OP was performed on this brick.
Because of this, AFR would skip performing the post-op. This implies that the granular index won't be set and, in the event granular entry heal is disabled, that the index pertaining to the parent directory in indices/xattrop will be skipped.
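The decision described above can be condensed into a small sketch (illustrative Python, loosely modeled on the two conditions checked in afr_changelog_post_op_now(); field names are simplified, not the actual AFR structures):

```python
# Minimal model of the post-op decision that leads to the skipped
# post-op described above. Illustrative only; not actual AFR code.

def post_op_needed(bricks):
    # Condition 1: the dirty xattr was set during pre-op on some brick.
    dirty_set = any(b["pre_op"] for b in bricks)
    # Condition 2: pre-op was done on a brick but the OP phase failed there.
    preop_done_op_failed = any(b["pre_op"] and b["op_failed"] for b in bricks)
    return dirty_set or preop_done_op_failed

# optimistic_change_log == TRUE, so pre-op was skipped on every brick.
# brick2 went down after txn-parameter initialisation but before the
# locking phase, so the MKNOD never reached it, yet pre_op is 0 there.
bricks = [
    {"pre_op": 0, "op_failed": False},  # brick1: MKNOD succeeded
    {"pre_op": 0, "op_failed": True},   # brick2: offline before lock
    {"pre_op": 0, "op_failed": False},  # brick3: MKNOD succeeded
]

# Neither condition fires, so post-op (where the granular-entry index
# would be created) is skipped, even though brick2 now needs healing:
assert post_op_needed(bricks) is False
```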
Upstream patch: http://review.gluster.org/#/c/16075/
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/92713/
When I tested glusterfs-3.8.4-9.el7rhgs with a replica 3 volume, it ended up with the issue https://bugzilla.redhat.com/show_bug.cgi?id=1404982, where the VM ended up in a paused state.
But again in this case, I noticed there were some pending heals which were not healed at all, due to the lack of markings on the .shard directory.
I will try reproducing this issue with tcpdump as suggested by Krutika.
Tested with glusterfs-3.8.4-10.el7rhgs with the following steps
1. Created a replica 3 sharded volume with shard-block-size of 4MB, optimized the volume for the VM store use case, and then started the volume
2. Enabled granular-entry-heal and compound-fops on the volume
3. Fuse mounted the volume on the RHEL 7.3 server and created a VM sparse image of size 20GB
4. Started installation of RHEL 7.3 on the VM, and while the installation was in progress, killed the second brick of the volume
5. After the installation on the VM completed, brought the brick back up
6. Self-heal got triggered, but 75 entries remained unhealed, as they were missing markings
I have captured a tcpdump and provided it to Krutika (as the captures are a few GBs in size, they were shared with her separately)
Based on this observation, I am moving this bug to ASSIGNED (FailedQA)
So this is a new issue that is different from the original issue that was fixed, because optimistic changelog worked as expected in the latest case.
In this case, for all the affected files (the ones that remain in heal-info output forever), there were two MKNOD attempts of the same shard -
1. once at the time of creating the shard for writing, and
2. another time while reading the shard (in this case too, shard xl first sends an MKNOD before reading from the shard, in case it turns out to be a hole).
Before the start of (1) above, the second brick was brought down.
(1) behaved as expected in that pending entry changelog of +1 was set on the parent directory, in this case .shard; and the index entry was created under indices/entry-changes corresponding to the base name of the shard that was created. So far so good. This information would have been sufficient for self-heal to recreate this file on the offline brick once it is back online.
Now operation (2) happens. Here, after the lock and pre-op phases, the OP phase (an MKNOD fop) failed on both the online replicas with EEXIST. AFR treats a case where the operation fails with the same error on all replicas as a symmetric error, and considers the transaction a success, because the replicas are still consistent and both returned the same error. As a result, AFR goes on to "delete" the name index in the post-op (although the name index was created by (1)). This means that all entries under .shard which go through this kind of cycle will not undergo entry self-heal. And because they never undergo entry self-heal in the first place, they will never see data and metadata self-heal either, so they remain in heal-info output forever. This is the same end result as the one seen in the original bug.
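The symmetric-error logic described above can be illustrated with a small hedged sketch (not actual AFR code; `txn_result` and the index handling are simplified stand-ins): the same errno on all live replicas means the replicas still agree, so the transaction is deemed a success, and the post-op of that "successful" transaction is what deletes the name index.

```python
# Illustrative model of AFR's symmetric-error handling in entry
# transactions, as described above. Not the actual implementation.

EEXIST = 17

def txn_result(errnos):
    """errnos: per-live-replica result of the OP phase (0 == success)."""
    if all(e == 0 for e in errnos):
        return "success"
    if len(set(errnos)) == 1:   # same error everywhere: symmetric error
        return "success"        # replicas still agree, nothing to undo
    return "failure"

name_index = {"shard.3"}        # created by the write transaction (1)

# Read path (2): shard xl re-sends MKNOD, which fails with EEXIST on
# both online replicas. The symmetric "success" wrongly deletes the
# index that (1) had legitimately created for the offline brick:
if txn_result([EEXIST, EEXIST]) == "success":
    name_index.discard("shard.3")

assert name_index == set()      # entry heal will never see shard.3
```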
This seems to be the same issue that Kasturi hit in https://bugzilla.redhat.com/show_bug.cgi?id=1408426
I'll update that bug as to how the gfid mismatch could have happened.
Here's an *extremely* simple test case to hit this issue:
1. Create a 1x3 plain replicate volume and start it.
2. Enable granular-entry-heal.
3. Disable self-heal-daemon.
4. Disable client-side heals.
5. Enable sharding.
6. Mount the volume from two mount points - /mnt/glusterfs/0 and /mnt/glusterfs/1 (say).
7. From /mnt/glusterfs/0, execute dd writing 12MB into a file.
# dd if=/dev/urandom of=file bs=1M count=12
8. Kill a replica (say second brick).
9. From /mnt/glusterfs/0, execute dd appending another 12MB of data to the same file.
# dd if=/dev/urandom bs=1M count=12 >> file
10. Ensure that indices/entry-changes/be318638-e8a0-4c6d-977d-7a937aa84806 is created and that the name indices for shards 3, 4 and 5 are created under it on the two source bricks.
11. From /mnt/glusterfs/1, cat this file.
With this step, the name indices under indices/entry-changes/be318638-e8a0-4c6d-977d-7a937aa84806 will disappear (which is the bug).
12. Now bring the brick back up and enable self-heal-daemon.
13. Now execute
# gluster volume heal <VOLNAME>
14. Watch the output of heal-info. It will never cease to show shards 3,4 and 5 in its output and also note that .shard will never appear in the same output.
(In reply to Krutika Dhananjay from comment #11)
> This seems to be the same issue that Kasturi hit in
> I'll update that bug as to how the gfid mismatch could have happened.
Thanks for the simplified test case in comment 12.
Could you please elaborate on how the gfid mismatch could have happened?
(In reply to SATHEESARAN from comment #15)
> (In reply to Krutika Dhananjay from comment #11)
> > This seems to be the same issue that Kasturi hit in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1408426
> > I'll update that bug as to how the gfid mismatch could have happened.
> Thanks for the simplified test case in comment12
> Could you please elaborate on how the gfid mismatch could have happened ?
Had already done that here - https://bugzilla.redhat.com/show_bug.cgi?id=1408426#c8
BZ added to erratum https://errata.devel.redhat.com/advisory/24866
Moving to ON_QA
Tested with RHGS 3.2.0 interim build (glusterfs-3.8.4-11.el7rhgs).
Tested with the simplified steps mentioned in comment 12 and also performed the test as mentioned in comment 10.
There are no entries left once heal is complete, and 'heal info' shows no entries left unhealed.
Note: the above tests were carried out with replica 3 and with a shard size of 4MB
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.