+++ This bug was initially created as a clone of Bug #1402730 +++
+++ This bug was initially created as a clone of Bug #1400057 +++

Description of problem:
-----------------------
After killing the bricks of a replica 3 volume in a cyclic fashion (kill brick1, bring brick1 up, wait while self-heal is in progress, repeat with the other two bricks), healing does not happen for a few shards. Self-heal info shows the same number of entries for a long time.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.2.0 interim build (glusterfs-3.8.4-6.el7rhgs)
RHEL 7.3
qemu-kvm-rhev-2.6.0-27.el7.x86_64

How reproducible:
-----------------
1/1

Steps to Reproduce:
-------------------
1. Create a replica 3 volume and optimize the volume for VM store. Enable granular-entry-heal and also compound-fops.
2. Fuse mount the volume and create 2 VM images of size 50G (sparse, thinly allocated).
3. Create 2 app VMs with these images and install RHEL 7.3 on them.
4. Keep doing I/O from these VMs using 'dd'.
5. Kill brick1 and wait for the number of entries to heal to grow to around 1000.
6. Start brick1 by force-starting the volume.
7. Let self-heal proceed, but before it can complete, kill brick2.
8. After a few more minutes, start brick2 by force-starting the volume.
9. Let self-heal proceed, but before it can complete, kill brick3.
10. After a few more minutes, start brick3 by force-starting the volume.
11. Do one more iteration of steps 5-10.

Actual results:
---------------
Self-heal doesn't happen for a few shards.

Expected results:
-----------------
Self-heal should heal all the pending entries.

--- Additional comment from SATHEESARAN on 2016-11-30 06:38:08 EST ---

Volume information
------------------
# gluster volume info vmvol

Volume Name: vmvol
Type: Replicate
Volume ID: 0b3267c4-75ac-4660-a602-54a3eb331de5
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: server1:/gluster/brick2/bk
Brick2: server2:/gluster/brick2/bk
Brick3: server3:/gluster/brick2/bk
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
storage.owner-uid: 36
storage.owner-gid: 36
features.shard-block-size: 4MB
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
features.shard: on
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
performance.strict-o-direct: on
network.ping-timeout: 30
user.cifs: off
cluster.use-compound-fops: on
cluster.granular-entry-heal: enable

Pending heals:
--------------
I have a script which looks for unsynced entries as the heal happens. I am pasting the logs from this script, captured while the test was going on.
Wed Nov 30 13:16:50 IST 2016 Unsynced entries on all bricks: 1940 - 2016
Wed Nov 30 13:18:01 IST 2016 Unsynced entries on all bricks: 2211 - 2267
Wed Nov 30 13:19:54 IST 2016 Unsynced entries on all bricks: - 55 1988
Wed Nov 30 13:20:53 IST 2016 Unsynced entries on all bricks: - 179 1868
Wed Nov 30 13:21:49 IST 2016 Unsynced entries on all bricks: - 298 1753
Wed Nov 30 13:22:54 IST 2016 Unsynced entries on all bricks: 2014 493 -
Wed Nov 30 13:23:50 IST 2016 Unsynced entries on all bricks: 1921 675 -
Wed Nov 30 13:25:07 IST 2016 Unsynced entries on all bricks: 1774 828 1694
Wed Nov 30 13:27:11 IST 2016 Unsynced entries on all bricks: 1740 - 1613
Wed Nov 30 13:28:56 IST 2016 Unsynced entries on all bricks: 1643 - 1855
Wed Nov 30 13:29:47 IST 2016 Unsynced entries on all bricks: 1896 - 2017
Wed Nov 30 13:30:41 IST 2016 Unsynced entries on all bricks: 2057 683 1856
Wed Nov 30 13:33:19 IST 2016 Unsynced entries on all bricks: 637 460 1657
Wed Nov 30 13:34:23 IST 2016 Unsynced entries on all bricks: - 393 1565
Wed Nov 30 13:35:22 IST 2016 Unsynced entries on all bricks: - 312 1454
Wed Nov 30 13:36:09 IST 2016 Unsynced entries on all bricks: - 323 1301
Wed Nov 30 13:36:52 IST 2016 Unsynced entries on all bricks: - 369 1159
Wed Nov 30 13:37:31 IST 2016 Unsynced entries on all bricks: - 410 1045
Wed Nov 30 13:38:03 IST 2016 Unsynced entries on all bricks: - 441 976
Wed Nov 30 13:38:36 IST 2016 Unsynced entries on all bricks: - 480 894
Wed Nov 30 13:39:09 IST 2016 Unsynced entries on all bricks: - 512 774
Wed Nov 30 13:39:43 IST 2016 Unsynced entries on all bricks: - 548 658
Wed Nov 30 13:40:14 IST 2016 Unsynced entries on all bricks: - 579 548
Wed Nov 30 13:40:44 IST 2016 Unsynced entries on all bricks: - 609 542
Wed Nov 30 13:41:05 IST 2016 Unsynced entries on all bricks: - 687 616
Wed Nov 30 13:41:29 IST 2016 Unsynced entries on all bricks: - 771 694
Wed Nov 30 13:41:56 IST 2016 Unsynced entries on all bricks: - 859 780
Wed Nov 30 13:42:26 IST 2016 Unsynced entries on all bricks: - 950 902
Wed Nov 30 13:43:00 IST 2016 Unsynced entries on all bricks: - 1051 1007
Wed Nov 30 13:43:35 IST 2016 Unsynced entries on all bricks: - 1149 1097
Wed Nov 30 13:44:10 IST 2016 Unsynced entries on all bricks: - 1240 1199
Wed Nov 30 13:44:48 IST 2016 Unsynced entries on all bricks: - 1378 1317
Wed Nov 30 13:45:29 IST 2016 Unsynced entries on all bricks: - 1473 1434
Wed Nov 30 13:46:11 IST 2016 Unsynced entries on all bricks: - 1587 1533
Wed Nov 30 13:46:54 IST 2016 Unsynced entries on all bricks: - 1689 1652
Wed Nov 30 13:47:43 IST 2016 Unsynced entries on all bricks: - 1810 1785
Wed Nov 30 13:48:36 IST 2016 Unsynced entries on all bricks: - 1968 1902
Wed Nov 30 13:49:28 IST 2016 Unsynced entries on all bricks: - 2093 2069
Wed Nov 30 13:50:27 IST 2016 Unsynced entries on all bricks: - 2229 2214
Wed Nov 30 13:51:29 IST 2016 Unsynced entries on all bricks: - 2384 2373
Wed Nov 30 13:52:32 IST 2016 Unsynced entries on all bricks: - 2528 2498
Wed Nov 30 13:53:38 IST 2016 Unsynced entries on all bricks: - 2593 2434
Wed Nov 30 13:54:54 IST 2016 Unsynced entries on all bricks: 1733 2184 1854
Wed Nov 30 13:56:42 IST 2016 Unsynced entries on all bricks: 1733 1577 1322
Wed Nov 30 13:58:12 IST 2016 Unsynced entries on all bricks: 1734 978 751
Wed Nov 30 14:00:17 IST 2016 Unsynced entries on all bricks: 1498 582 423
Wed Nov 30 14:01:49 IST 2016 Unsynced entries on all bricks: 1292 322 190
Wed Nov 30 14:02:57 IST 2016 Unsynced entries on all bricks: 1140 219 124
Wed Nov 30 14:03:28 IST 2016 Unsynced entries on all bricks: 1020 219 124
Wed Nov 30 14:03:54 IST 2016 Unsynced entries on all bricks: 896 219 124
Wed Nov 30 14:04:17 IST 2016 Unsynced entries on all bricks: 778 219 124
Wed Nov 30 14:04:38 IST 2016 Unsynced entries on all bricks: 680 219 124
Wed Nov 30 14:04:57 IST 2016 Unsynced entries on all bricks: 586 219 124
Wed Nov 30 14:05:16 IST 2016 Unsynced entries on all bricks: 503 219 124
Wed Nov 30 14:05:32 IST 2016 Unsynced entries on all bricks: 424 219 124
Wed Nov 30 14:05:45 IST 2016 Unsynced entries on all bricks: 360 219 124
Wed Nov 30 14:05:58 IST 2016 Unsynced entries on all bricks: 296 219 124
Wed Nov 30 14:06:10 IST 2016 Unsynced entries on all bricks: 240 219 124
Wed Nov 30 14:06:22 IST 2016 Unsynced entries on all bricks: 191 219 124
Wed Nov 30 14:06:32 IST 2016 Unsynced entries on all bricks: 140 219 124
Wed Nov 30 14:06:41 IST 2016 Unsynced entries on all bricks: 111 219 124
Wed Nov 30 14:06:46 IST 2016 Unsynced entries on all bricks: 111 219 124
Wed Nov 30 14:06:53 IST 2016 Unsynced entries on all bricks: 111 219 124
....
Wed Nov 30 17:06:10 IST 2016 Unsynced entries on all bricks: 111 219 124

--- Additional comment from Krutika Dhananjay on 2016-12-01 02:01:21 EST ---

The bug is consistently recreatable with shard and granular entry heal enabled: do a simple dd of 1G into a file and, during this time, bring bricks offline and online in cyclic order (all of this while heal is disabled). Enabling heal at the end and monitoring heal-info shows that one entry always shows up in heal-info and never gets healed. I was not able to hit the bug without granular-entry-heal. This is possibly a bug in the granular entry heal code.

-Krutika

--- Additional comment from Krutika Dhananjay on 2016-12-08 04:28:49 EST ---

First of all, great catch, Sas!

Now for the test done and the RCA.

TEST:
=====
I ran this same test after modifying the shard xl to accept a 1KB shard-block-size, solely for the purpose of increasing the likelihood of hitting this bug. I started a dd of 10G on the volume and in the meantime brought bricks offline and online in cyclic order (also with shd-side and client-side heal disabled). After the third brick was brought offline and dd failed with EROFS/EIO, I captured the xattrs of all the different shards created on all the replicas, and also the entries under indices/entry-changes/be318638-e8a0-4c6d-977d-7a937aa84806 on all three replicas. Then I brought the third brick back up and enabled self-heal.
As Sas pointed out, after a while heal came to a halt. I captured the heal-info entries and checked for their names in the captured entry-changes output. For all of these entries, there was NO entry-changes index corresponding to these entry names on the source bricks. As a result, these entries never got healed in the first place. Because of this, even data and metadata heal didn't proceed, and so the heal-info count ceased to decrease at this point.

RCA:
====
The bug seems to be in AFR's transaction code (not in the granular-entry-heal code itself, because this issue can happen even with non-granular indices in entry txns, although it is easier to hit with granular-esh on) and is caused by a case of skipped post-op. In the code there is only one case where this can happen (which is due to a bug, of course).

At the start of the MKNOD txn of these affected shards, all three replicas were up. This caused AFR to set local->optimistic_change_log to TRUE (which means AFR can skip the pre-op phase; in the event there is a failure, it will have to execute the post-op phase of the transaction before unlocking, to set the pending markers). But before the locking phase, one of the replicas went down.

<Some-code-detail>
Because of this,
local->transaction.pre_op[offline-brick-index] = 0
local->transaction.failed_subvols[offline-brick-index] = 1
</Some-code-detail>

AFR skips setting the dirty xattrs since local->optimistic_change_log is TRUE. Then the MKNOD happens. Then, when it comes to doing the post-op (afr_changelog_post_op_now()), AFR relies on two parameters to decide whether post-op is necessary or not:
1. whether the dirty xattr was set in pre-op, and
2. whether on any of the replicas pre-op was done but the OP phase failed.
If 1 is true, post-op will be done to undo the dirty xattr setting on the bricks. If 2 is true, again post-op will be done. In all other cases, post-op will be skipped, and it is in post-op that the granular-entry indices are created.
But in the case described above, where the brick went down just before the locking phase but was up at the time of txn parameter initialisation, neither pre-op nor OP was performed on this brick. Because of this, AFR would skip performing the post-op. This implies that the granular index won't be set; and in the event granular entry heal is disabled, the index pertaining to the parent directory in indices/xattrop will be skipped.

--- Additional comment from Worker Ant on 2016-12-08 12:33:50 EST ---

REVIEW: http://review.gluster.org/16075 (cluster/afr: Fix per-txn optimistic changelog initialisation) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2016-12-09 01:59:09 EST ---

REVIEW: http://review.gluster.org/16075 (cluster/afr: Fix per-txn optimistic changelog initialisation) posted (#2) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2016-12-12 04:41:08 EST ---

REVIEW: http://review.gluster.org/16075 (cluster/afr: Fix per-txn optimistic changelog initialisation) posted (#3) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/16106 (cluster/afr: Fix per-txn optimistic changelog initialisation) posted (#1) for review on release-3.9 by Krutika Dhananjay (kdhananj)
COMMIT: http://review.gluster.org/16106 committed in release-3.9 by Pranith Kumar Karampuri (pkarampu)
------
commit c4eae872b8988ce41f03dd1e7a146ae7dcb0f963
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Dec 8 22:49:48 2016 +0530

    cluster/afr: Fix per-txn optimistic changelog initialisation

    Backport of: http://review.gluster.org/16075

    Incorrect initialisation of local->optimistic_change_log was leading
    to skipped pre-op and post-op even when a brick didn't participate
    in the txn because it was down. The result - missing granular name
    index resulting in some entries never getting healed.

    FIX:
    Initialise local->optimistic_change_log just before pre-op.

    Also fixed granular entry heal to create the granular name index in
    pre-op as opposed to post-op. This is to prevent loss of granular
    information when, during an entry txn, the good (src) brick goes
    offline before the post-op is done. This would cause self-heal to do
    a conservative merge (since the dirty xattr is the only information
    available), which, when granular-entry-heal is enabled, expects
    granular indices, the lack of which can lead to loss of data in the
    worst case.

    Change-Id: I213d98ca9b3c4604b095478bf427fa69c04a7d64
    BUG: 1403743
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/16106
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.9.1, please open a new bug report.

glusterfs-3.9.1 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-January/029725.html
[2] https://www.gluster.org/pipermail/gluster-users/