+++ This bug was initially created as a clone of Bug #1402730 +++
+++ This bug was initially created as a clone of Bug #1400057 +++

Description of problem:
-----------------------
After killing the bricks of a replica 3 volume in a cyclic fashion (kill brick1, bring brick1 up, wait while self-heal is in progress, repeat with the other two bricks), healing does not happen for a few shards. Self-heal info shows the same number of entries for a long time.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.2.0 interim build (glusterfs-3.8.4-6.el7rhgs)
RHEL 7.3
qemu-kvm-rhev-2.6.0-27.el7.x86_64

How reproducible:
-----------------
1/1

Steps to Reproduce:
-------------------
1. Create a replica 3 volume and optimize the volume for VM store. Enable granular-entry-heal and also compound-fops.
2. Fuse mount the volume and create 2 VM images of size 50G (sparse, thinly allocated).
3. Create 2 app VMs with these images and install RHEL 7.3 on them.
4. Keep doing I/O from these VMs using 'dd'.
5. Kill brick1 and wait for the number of entries to heal to grow to around 1000.
6. Start brick1 by force-starting the volume.
7. Let self-heal proceed, but before it can complete, kill brick2.
8. After a few more minutes, start brick2 by force-starting the volume.
9. Let self-heal proceed, but before it can complete, kill brick3.
10. After a few more minutes, start brick3 by force-starting the volume.
11. Do one more iteration of steps 5-10.

Actual results:
---------------
Self-heal doesn't happen for a few shards.

Expected results:
-----------------
Self-heal should heal all the pending entries.

--- Additional comment from SATHEESARAN on 2016-11-30 06:38:08 EST ---

Volume information
------------------
# gluster volume info vmvol

Volume Name: vmvol
Type: Replicate
Volume ID: 0b3267c4-75ac-4660-a602-54a3eb331de5
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: server1:/gluster/brick2/bk
Brick2: server2:/gluster/brick2/bk
Brick3: server3:/gluster/brick2/bk
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
storage.owner-uid: 36
storage.owner-gid: 36
features.shard-block-size: 4MB
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
features.shard: on
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
performance.strict-o-direct: on
network.ping-timeout: 30
user.cifs: off
cluster.use-compound-fops: on
cluster.granular-entry-heal: enable

Pending heals:
--------------
I have a script which looks for unsynced entries as the heal happens. I am pasting the logs from this script, captured while the test was going on.
Wed Nov 30 13:16:50 IST 2016 Unsynced entries on all bricks: 1940 - 2016
Wed Nov 30 13:18:01 IST 2016 Unsynced entries on all bricks: 2211 - 2267
Wed Nov 30 13:19:54 IST 2016 Unsynced entries on all bricks: - 55 1988
Wed Nov 30 13:20:53 IST 2016 Unsynced entries on all bricks: - 179 1868
Wed Nov 30 13:21:49 IST 2016 Unsynced entries on all bricks: - 298 1753
Wed Nov 30 13:22:54 IST 2016 Unsynced entries on all bricks: 2014 493 -
Wed Nov 30 13:23:50 IST 2016 Unsynced entries on all bricks: 1921 675 -
Wed Nov 30 13:25:07 IST 2016 Unsynced entries on all bricks: 1774 828 1694
Wed Nov 30 13:27:11 IST 2016 Unsynced entries on all bricks: 1740 - 1613
Wed Nov 30 13:28:56 IST 2016 Unsynced entries on all bricks: 1643 - 1855
Wed Nov 30 13:29:47 IST 2016 Unsynced entries on all bricks: 1896 - 2017
Wed Nov 30 13:30:41 IST 2016 Unsynced entries on all bricks: 2057 683 1856
Wed Nov 30 13:33:19 IST 2016 Unsynced entries on all bricks: 637 460 1657
Wed Nov 30 13:34:23 IST 2016 Unsynced entries on all bricks: - 393 1565
Wed Nov 30 13:35:22 IST 2016 Unsynced entries on all bricks: - 312 1454
Wed Nov 30 13:36:09 IST 2016 Unsynced entries on all bricks: - 323 1301
Wed Nov 30 13:36:52 IST 2016 Unsynced entries on all bricks: - 369 1159
Wed Nov 30 13:37:31 IST 2016 Unsynced entries on all bricks: - 410 1045
Wed Nov 30 13:38:03 IST 2016 Unsynced entries on all bricks: - 441 976
Wed Nov 30 13:38:36 IST 2016 Unsynced entries on all bricks: - 480 894
Wed Nov 30 13:39:09 IST 2016 Unsynced entries on all bricks: - 512 774
Wed Nov 30 13:39:43 IST 2016 Unsynced entries on all bricks: - 548 658
Wed Nov 30 13:40:14 IST 2016 Unsynced entries on all bricks: - 579 548
Wed Nov 30 13:40:44 IST 2016 Unsynced entries on all bricks: - 609 542
Wed Nov 30 13:41:05 IST 2016 Unsynced entries on all bricks: - 687 616
Wed Nov 30 13:41:29 IST 2016 Unsynced entries on all bricks: - 771 694
Wed Nov 30 13:41:56 IST 2016 Unsynced entries on all bricks: - 859 780
Wed Nov 30 13:42:26 IST 2016 Unsynced entries on all bricks: - 950 902
Wed Nov 30 13:43:00 IST 2016 Unsynced entries on all bricks: - 1051 1007
Wed Nov 30 13:43:35 IST 2016 Unsynced entries on all bricks: - 1149 1097
Wed Nov 30 13:44:10 IST 2016 Unsynced entries on all bricks: - 1240 1199
Wed Nov 30 13:44:48 IST 2016 Unsynced entries on all bricks: - 1378 1317
Wed Nov 30 13:45:29 IST 2016 Unsynced entries on all bricks: - 1473 1434
Wed Nov 30 13:46:11 IST 2016 Unsynced entries on all bricks: - 1587 1533
Wed Nov 30 13:46:54 IST 2016 Unsynced entries on all bricks: - 1689 1652
Wed Nov 30 13:47:43 IST 2016 Unsynced entries on all bricks: - 1810 1785
Wed Nov 30 13:48:36 IST 2016 Unsynced entries on all bricks: - 1968 1902
Wed Nov 30 13:49:28 IST 2016 Unsynced entries on all bricks: - 2093 2069
Wed Nov 30 13:50:27 IST 2016 Unsynced entries on all bricks: - 2229 2214
Wed Nov 30 13:51:29 IST 2016 Unsynced entries on all bricks: - 2384 2373
Wed Nov 30 13:52:32 IST 2016 Unsynced entries on all bricks: - 2528 2498
Wed Nov 30 13:53:38 IST 2016 Unsynced entries on all bricks: - 2593 2434
Wed Nov 30 13:54:54 IST 2016 Unsynced entries on all bricks: 1733 2184 1854
Wed Nov 30 13:56:42 IST 2016 Unsynced entries on all bricks: 1733 1577 1322
Wed Nov 30 13:58:12 IST 2016 Unsynced entries on all bricks: 1734 978 751
Wed Nov 30 14:00:17 IST 2016 Unsynced entries on all bricks: 1498 582 423
Wed Nov 30 14:01:49 IST 2016 Unsynced entries on all bricks: 1292 322 190
Wed Nov 30 14:02:57 IST 2016 Unsynced entries on all bricks: 1140 219 124
Wed Nov 30 14:03:28 IST 2016 Unsynced entries on all bricks: 1020 219 124
Wed Nov 30 14:03:54 IST 2016 Unsynced entries on all bricks: 896 219 124
Wed Nov 30 14:04:17 IST 2016 Unsynced entries on all bricks: 778 219 124
Wed Nov 30 14:04:38 IST 2016 Unsynced entries on all bricks: 680 219 124
Wed Nov 30 14:04:57 IST 2016 Unsynced entries on all bricks: 586 219 124
Wed Nov 30 14:05:16 IST 2016 Unsynced entries on all bricks: 503 219 124
Wed Nov 30 14:05:32 IST 2016 Unsynced entries on all bricks: 424 219 124
Wed Nov 30 14:05:45 IST 2016 Unsynced entries on all bricks: 360 219 124
Wed Nov 30 14:05:58 IST 2016 Unsynced entries on all bricks: 296 219 124
Wed Nov 30 14:06:10 IST 2016 Unsynced entries on all bricks: 240 219 124
Wed Nov 30 14:06:22 IST 2016 Unsynced entries on all bricks: 191 219 124
Wed Nov 30 14:06:32 IST 2016 Unsynced entries on all bricks: 140 219 124
Wed Nov 30 14:06:41 IST 2016 Unsynced entries on all bricks: 111 219 124
Wed Nov 30 14:06:46 IST 2016 Unsynced entries on all bricks: 111 219 124
Wed Nov 30 14:06:53 IST 2016 Unsynced entries on all bricks: 111 219 124
....
Wed Nov 30 17:06:10 IST 2016 Unsynced entries on all bricks: 111 219 124

--- Additional comment from Krutika Dhananjay on 2016-12-01 02:01:21 EST ---

The bug is consistently recreatable with shard and granular entry heal enabled: do a simple dd of 1G into a file and, during this time, bring bricks offline and online in cyclic order (all of this while heal is disabled). Enabling heal at the end and monitoring heal-info shows that one entry always shows up in heal-info and never gets healed. I was not able to hit the bug without granular-entry-heal. This is possibly a bug in the granular entry heal code.

-Krutika

--- Additional comment from Krutika Dhananjay on 2016-12-08 04:28:49 EST ---

First of all, great catch, Sas!

Now for the test done and the RCA.

TEST:
=====
I ran this same test after modifying the shard xl to accept a 1KB shard-block-size, solely for the purpose of increasing the likelihood of hitting this bug. I started a dd of 10G on the volume and in the meantime brought bricks offline and online in cyclic order (also with shd-side and client-side heal disabled). After the third brick was brought offline and dd failed with EROFS/EIO, I captured the xattrs of all the different shards created on all the replicas, and also the entries under indices/entry-changes/be318638-e8a0-4c6d-977d-7a937aa84806 on all three replicas. Then I brought the third brick back up and enabled self-heal.
As Sas pointed out, after a while heal came to a halt. I captured the heal-info entries and checked for their names in the captured entry-changes output. For all of these entries, there was NO entry-changes index corresponding to these entry names on the source bricks. As a result, these entries never got healed in the first place. Because of this, even data and metadata heal didn't proceed, and so the heal-info count ceased to decrease at this point.

RCA:
====
The bug seems to be in AFR's transaction code (not in the granular-entry-heal code itself, because this issue can happen even with non-granular indices in entry txns, although it is easier to hit with granular-esh on) and is caused by a case of skipped post-op. In the code there is only one case where this can happen (which is due to a bug, of course).

At the start of the MKNOD txn of these affected shards, all three replicas were up. This caused AFR to set local->optimistic_change_log to TRUE (which means AFR can skip the pre-op phase; in the event there is a failure, it will have to execute the post-op phase of the transaction before unlocking, to set the pending markers). But before the locking phase, one of the replicas went down.

<Some-code-detail>
Because of this,
local->transaction.pre_op[offline-brick-index] = 0
local->transaction.failed_subvols[offline-brick-index] = 1
</Some-code-detail>

AFR skips setting the dirty xattrs since local->optimistic_change_log is TRUE. Then the MKNOD happens. Then, when it comes to doing the post-op (afr_changelog_post_op_now()), AFR relies on two parameters to decide whether post-op is necessary or not:
1. whether the dirty xattr was set in pre-op, and
2. whether on any of the replicas pre-op was done but the OP phase failed.
If 1 is true, post-op will be done to undo the dirty xattr setting on the bricks. If 2 is true, again post-op will be done. In all other cases, post-op will be skipped, and it is in post-op that the granular-entry indices are created.
But in the case described above, where the brick went down just before the locking phase but was up at the time of txn parameter initialisation, neither pre-op nor OP was performed on this brick. Because of this, AFR would skip performing the post-op. This implies that the granular index won't be set; and in the event granular entry heal is disabled, the index pertaining to the parent directory in indices/xattrop will be skipped.

--- Additional comment from Worker Ant on 2016-12-08 12:33:50 EST ---

REVIEW: http://review.gluster.org/16075 (cluster/afr: Fix per-txn optimistic changelog initialisation) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2016-12-09 01:59:09 EST ---

REVIEW: http://review.gluster.org/16075 (cluster/afr: Fix per-txn optimistic changelog initialisation) posted (#2) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2016-12-12 04:41:08 EST ---

REVIEW: http://review.gluster.org/16075 (cluster/afr: Fix per-txn optimistic changelog initialisation) posted (#3) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/16106 (cluster/afr: Fix per-txn optimistic changelog initialisation) posted (#1) for review on release-3.9 by Krutika Dhananjay (kdhananj)
COMMIT: http://review.gluster.org/16106 committed in release-3.9 by Pranith Kumar Karampuri (pkarampu)
------
commit c4eae872b8988ce41f03dd1e7a146ae7dcb0f963
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Dec 8 22:49:48 2016 +0530

    cluster/afr: Fix per-txn optimistic changelog initialisation

    Backport of: http://review.gluster.org/16075

    Incorrect initialisation of local->optimistic_change_log was leading
    to skipped pre-op and post-op even when a brick didn't participate
    in the txn because it was down. The result - missing granular name
    index resulting in some entries never getting healed.

    FIX:
    Initialise local->optimistic_change_log just before pre-op.

    Also fixed granular entry heal to create the granular name index in
    pre-op as opposed to post-op. This is to prevent loss of granular
    information when, during an entry txn, the good (src) brick goes
    offline before the post-op is done. This would cause self-heal to do
    a conservative merge (since the dirty xattr is the only information
    available), which, when granular-entry-heal is enabled, expects
    granular indices, the lack of which can lead to loss of data in the
    worst case.

    Change-Id: I213d98ca9b3c4604b095478bf427fa69c04a7d64
    BUG: 1403743
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/16106
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.9.1, please open a new bug report.

glusterfs-3.9.1 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-January/029725.html
[2] https://www.gluster.org/pipermail/gluster-users/