+++ This bug was initially created as a clone of Bug #1622821 +++

Description of problem:

Problem Statement: Whenever access patterns on the mounts need to perform lookups on multiple directories to reach a leaf file/directory, and a brick is replaced or the replica count is increased, there is a possibility of introducing hangs because of name-heal and metadata heal.

Here is a simulation of an access pattern that shows the problem:

#!/bin/bash
MAX_MOUNTS=40
DEPTH=4

glusterd
gluster --mode=script --wignore volume create r2 replica 2 localhost.localdomain:/home/gfs/r2_0 localhost.localdomain:/home/gfs/r2_1 localhost.localdomain:/home/gfs/r2_2 localhost.localdomain:/home/gfs/r2_3
gluster --mode=script volume start r2

for i in $(seq 1 $MAX_MOUNTS); do mkdir /mnt/$i; mount -t glusterfs localhost.localdomain:/r2 /mnt/$i; done

depth_str=""
for i in $(seq 1 $DEPTH); do depth_str="1/$depth_str"; done
mkdir -p /mnt/1/$depth_str
for i in $(seq 1 $MAX_MOUNTS); do touch /mnt/1/$depth_str/$i; done

gluster v add-brick r2 replica 3 localhost.localdomain:/home/gfs/r2_{4,5} force
sleep 5 # for graph switch to complete

gluster volume profile r2 start
gluster volume profile r2 info clear
for i in $(seq 1 $MAX_MOUNTS); do time touch /mnt/$i/$depth_str/${i}_1 & done; wait
sleep 2
gluster volume profile r2 info incremental

This creates 40 clients which all try to access different files in the same directory hierarchy, increasing the chances of metadata-heals and name-heals. I modified the afr code to print log messages about launching metadata and name heals. What I observed is that lookups on the directories from different clients would all trigger metadata and name heals, so all of these lookups get serialized; the last mount to get the necessary lock to perform heal took more than a second to acquire the metadata lock, and similarly more than a second to acquire the lock for name heal (note the ~1.36 s max INODELK and ~1.24 s max ENTRYLK latencies in the profile output below). This was the worst I have seen after running the script above multiple times, sometimes enabling/disabling different heals. The more activity there is on the mount, the more users will perceive the mount as unusable after a point because of the hangs. This also leads to timeouts with applications that expect a response within a bounded time, like web servers.

Outputs:

It takes more than 2 seconds to do 'touch':

real	0m2.042s
user	0m0.001s
sys	0m0.002s

root@localhost - /var/log/glusterfs
14:42:40 :) ⚡ grep "performing metadata selfheal" mnt-* | awk '{print $12}' | sort | uniq -c
      8 805420ef-fff1-4b0a-9430-b145124a756b
     41 ab2c7d71-735d-4341-b6ce-6158ebab210a
     14 f0314b17-0fe4-45df-ba13-ea42066ce063

Profile info:
%-latency  Avg-latency   Min-Latency  Max-Latency    No. of calls  Fop
61.15      106739.19 us  42.00 us     1358447.00 us  278           INODELK
...

14:20:28 :) ⚡ grep "completed name-heal" mnt-* | awk '{print $8}' | sort | uniq -c
     44 00000000-0000-0000-0000-000000000001/1
      5 1a43321e-5060-4185-9e49-de889f4f5071/1
     19 999feb5b-118c-42ff-973b-0f95c6dfb927/1
     11 aaa924bf-2891-4eba-b923-2ea3877061c4/1

Profile info:
%-latency  Avg-latency  Min-Latency  Max-Latency    No. of calls  Fop
54.23      41022.43 us  16.00 us     1239715.00 us  551           ENTRYLK

Potential Solution:

Entry self-heal and data self-heal don't have this problem because lookups don't wait for them to complete. That is not the case with metadata heal and name-heal: lookup waits for these to complete. To address this problem we need to find a way of serving lookup without blocking on heals, except in very rare situations.

Proposed changes to name-heal: For 2-way replication we can't disable name heal, because without doing name-heal we won't know what to respond to the application: when a filename exists on one replica and not on the other, we need to find out which of those two bricks is the source and which is not. But in the case of replica-3 and replica-2 + arbiter, we know that if two bricks have the name with the same gfid but one of the bricks doesn't have the file, we don't need to block lookup until name-heal completes; we can trigger it in the background. Similarly, when lookup finds that the name is not present on multiple bricks, and the readables for the parent inode say that only the bricks with no name are the sources, we can move name-heal to the background. The same idea can be applied to thin-arbiter as well.

Proposed changes to metadata heal: The only case which requires metadata heal to happen in the foreground is when the metadata mismatches without any pending markers on the inode on the bricks; the rest don't need to do metadata heal in the foreground.
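To make the proposed rules concrete, here is a minimal C sketch of the decision logic, assuming simple boolean inputs; all identifiers (heal_mode_t, name_heal_mode, metadata_heal_mode, and the parameters) are hypothetical illustrations, not the actual cluster/afr functions:

typedef enum {
    HEAL_FOREGROUND, /* lookup blocks until the heal completes    */
    HEAL_BACKGROUND, /* heal is triggered without blocking lookup */
} heal_mode_t;

/* Name-heal: with 2-way replication we cannot tell source from sink,
 * so lookup must wait. With replica-3, arbiter, or thin-arbiter, if
 * the bricks that do have the name agree on the gfid, lookup already
 * knows the answer and the heal can run in the background. */
static heal_mode_t
name_heal_mode(int replica_count, int bricks_with_name, int gfids_agree,
               int only_nameless_bricks_are_sources)
{
    if (replica_count < 3)
        return HEAL_FOREGROUND;

    if (bricks_with_name >= 2 && gfids_agree)
        return HEAL_BACKGROUND;

    /* Name missing on multiple bricks, and the parent's readables say
     * only the bricks without the name are sources: the stray name can
     * be cleaned up in the background too. */
    if (only_nameless_bricks_are_sources)
        return HEAL_BACKGROUND;

    return HEAL_FOREGROUND;
}

/* Metadata heal: a mismatch with no pending markers has no designated
 * source, so different clients could serve different metadata; only
 * that case has to block lookup. */
static heal_mode_t
metadata_heal_mode(int metadata_mismatch, int pending_xattrs_set)
{
    if (metadata_mismatch && !pending_xattrs_set)
        return HEAL_FOREGROUND;

    return HEAL_BACKGROUND;
}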
Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Always
2.
3.

Actual results:

Expected results:

Additional info:

--- Additional comment from Worker Ant on 2018-08-28 02:41:40 EDT ---

REVIEW: https://review.gluster.org/21011 (cluster/afr: Delegate metadata heal with pending xattrs to SHD) posted (#3) for review on master by Pranith Kumar Karampuri

--- Additional comment from Worker Ant on 2018-08-28 23:57:26 EDT ---

COMMIT: https://review.gluster.org/21011 committed in master by "Pranith Kumar Karampuri" <pkarampu> with a commit message:

cluster/afr: Delegate metadata heal with pending xattrs to SHD

Problem:
When metadata-self-heal is triggered on the mount, it blocks lookup until metadata-self-heal completes. But that can lead to hangs when a lot of clients are accessing a directory which needs metadata heal and all of them trigger heals, each waiting for the other clients' heals to complete.

Fix:
Trigger a metadata heal that can block lookup only when the heal is needed but the pending xattrs are not set. This is the only case where, without a heal, different clients may serve different metadata, which should be avoided.

Updates bz#1622821
Change-Id: I6089e9fda0770a83fb287941b229c882711f4e66
Signed-off-by: Pranith Kumar K <pkarampu>

--- Additional comment from Worker Ant on 2018-08-30 02:14:30 EDT ---

REVIEW: https://review.gluster.org/21040 (cluster/afr: Delegate name-heal when possible) posted (#1) for review on master by Pranith Kumar Karampuri

--- Additional comment from Worker Ant on 2018-09-03 21:52:39 EDT ---

COMMIT: https://review.gluster.org/21040 committed in master by "Pranith Kumar Karampuri" <pkarampu> with a commit message:

cluster/afr: Delegate name-heal when possible

Problem:
When name-self-heal is triggered on the mount, it blocks lookup until name-self-heal completes. But that can lead to hangs when a lot of clients are accessing a directory which needs name heal and all of them trigger heals, each waiting for the other clients' heals to complete.

Fix:
When a name-heal is needed, but a quorum number of names have the file and pending xattrs exist on the parent, it is better to delegate the heal to SHD, which will complete it as part of entry-heal of the parent directory. We could do the same when a quorum number of names are not present, but we don't have any known use case where that is a frequent occurrence, so that part is left unchanged for the moment.
When there is a gfid mismatch or a missing gfid, it is important to complete the heal in the foreground, so that the next rename doesn't assume everything is fine and perform the rename on top of a bad state.

fixes bz#1622821
Change-Id: I8b002c85dffc6eb6f2833e742684a233daefeb2c
Signed-off-by: Pranith Kumar K <pkarampu>
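As a rough illustration of the condition the commit message above describes, here is a minimal C sketch; the function and parameter names are hypothetical, not the real afr helpers:

/* Hypothetical sketch of the committed name-heal delegation check. */
static int
name_heal_can_delegate_to_shd(int bricks_with_name, int quorum_count,
                              int gfid_mismatch, int gfid_missing,
                              int parent_has_pending_xattrs)
{
    /* A gfid mismatch or a missing gfid must be healed in the
     * foreground, or a later rename could act on a bad state. */
    if (gfid_mismatch || gfid_missing)
        return 0;

    /* Quorum of bricks have the name and the parent carries pending
     * xattrs: SHD's entry-heal of the parent will create the missing
     * name, so lookup need not block. */
    return (bricks_with_name >= quorum_count) && parent_has_pending_xattrs;
}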
REVIEW: https://review.gluster.org/21084 (cluster/afr: Delegate metadata heal with pending xattrs to SHD) posted (#1) for review on release-4.1 by Pranith Kumar Karampuri
REVIEW: https://review.gluster.org/21085 (cluster/afr: Delegate name-heal when possible) posted (#1) for review on release-4.1 by Pranith Kumar Karampuri
COMMIT: https://review.gluster.org/21084 committed in release-4.1 by "Shyamsundar Ranganathan" <srangana> with a commit message:

cluster/afr: Delegate metadata heal with pending xattrs to SHD

Problem:
When metadata-self-heal is triggered on the mount, it blocks lookup until metadata-self-heal completes. But that can lead to hangs when a lot of clients are accessing a directory which needs metadata heal and all of them trigger heals, each waiting for the other clients' heals to complete.

Fix:
Trigger a metadata heal that can block lookup only when the heal is needed but the pending xattrs are not set. This is the only case where, without a heal, different clients may serve different metadata, which should be avoided.

Updates bz#1625575
Change-Id: I6089e9fda0770a83fb287941b229c882711f4e66
Signed-off-by: Pranith Kumar K <pkarampu>
COMMIT: https://review.gluster.org/21085 committed in release-4.1 by "Shyamsundar Ranganathan" <srangana> with a commit message:

cluster/afr: Delegate name-heal when possible

Problem:
When name-self-heal is triggered on the mount, it blocks lookup until name-self-heal completes. But that can lead to hangs when a lot of clients are accessing a directory which needs name heal and all of them trigger heals, each waiting for the other clients' heals to complete.

Fix:
When a name-heal is needed, but a quorum number of names have the file and pending xattrs exist on the parent, it is better to delegate the heal to SHD, which will complete it as part of entry-heal of the parent directory. We could do the same when a quorum number of names are not present, but we don't have any known use case where that is a frequent occurrence, so that part is left unchanged for the moment.

When there is a gfid mismatch or a missing gfid, it is important to complete the heal in the foreground, so that the next rename doesn't assume everything is fine and perform the rename on top of a bad state.

fixes bz#1625575
Change-Id: I8b002c85dffc6eb6f2833e742684a233daefeb2c
Signed-off-by: Pranith Kumar K <pkarampu>
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-4.1.5, please open a new bug report.

glusterfs-4.1.5 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-September/000113.html
[2] https://www.gluster.org/pipermail/gluster-users/