+++ This bug was initially created as a clone of Bug #1388292 +++

# Description of problem:

After updating to v3.7.16, our Replica volume (2 clients and 2 bricks) developed issues where processes (PHP, clamav/maldet, and occasionally even the gluster mount process) on the clients (over FUSE) would enter an "uninterruptible kernel wait state".

Apparently the cause of this was "performance.read-ahead". After turning it to "off" we have not had any other problems for over 1 week.

# Version-Release number of selected component (if applicable):

v3.7.16

# How reproducible:

It was always reproducible, but it took a long time (24+ hours) before anything occurred after updating (from v3.7.13) to mainline.

# Steps to Reproduce:

(Full disclosure: I have not replicated this with a brand-new volume; I don't currently have the hardware to test.)

Notes on reproduction: we have a total of just under 2.8 million files on the 1 volume. I am not sure if you will need to create them all to reproduce the issue we are having.

1. Create a Replica volume with 2 bricks.
2. gluster volume set volume1 performance.read-ahead on
3. Mount the gluster volume via FUSE on a client.
4. Create a WordPress site and set it up.
5. Run clamav or maldet on the files.
5.1 Or, if installed via WordPress, run `php wp-cron.php`.
6. After a while (we saw a range of 2-24 hours), specific files will trigger a process to get stuck, and then all following transactions on those files will fail (such as rm, or moving the file off the volume).

# Actual results:

Processes running on the clients, and sometimes the gluster mount process itself, become stuck. lsof would hang when trying to find the open file descriptors, as would trying to cat /proc/${pid}/cmdline on an uninterruptible process.

# Expected results:

Processes do not become stuck in an iowait/uninterruptible state.

# Additional info:

Source of packages used: http://ppa.launchpad.net/gluster/glusterfs-3.7/ubuntu

I have also anonymised the path data of the attached statedumps.
But I have kept the length the same, e.g. 'welove.site.com' becomes 'aqwedsaqwedsaqw'. I hope that is okay.

## Volume Information:

sudo gluster volume info

Volume Name: gv-www
Type: Replicate
Volume ID: 6163f185-d58b-45f2-8bf0-eeb3a18142d9
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: demo-gluster01:/gluster
Brick2: demo-gluster02:/gluster
Options Reconfigured:
performance.read-ahead: off
nfs.disable: true
cluster.self-heal-daemon: enable
features.scrub: Active
features.bitrot: on
diagnostics.client-log-level: INFO
diagnostics.brick-log-level: INFO
performance.readdir-ahead: on

## Brick Information:

### Brick 1:

sudo getfattr -d -m. -ehex /gluster/
getfattr: Removing leading '/' from absolute path names
# file: gluster/
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.gv-www-client-1=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x6163f185d58b45f28bf0eeb3a18142d9

### Brick 2:

sudo getfattr -d -m. -ehex /gluster/
getfattr: Removing leading '/' from absolute path names
# file: gluster/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x6163f185d58b45f28bf0eeb3a18142d9

## Client Information:

All servers are Ubuntu 14.04.5 LTS.

### Client 1:

df -Th
Filesystem              Type            Size  Used  Avail  Use%  Mounted on
demo-gluster01:/gv-www  fuse.glusterfs  197G  176G  12G    94%   /var/www

Mount in /etc/fstab:
demo-gluster01:/gv-www /var/www/ glusterfs defaults,_netdev 0 0

### Client 2:

df -Th
Filesystem              Type            Size  Used  Avail  Use%  Mounted on
demo-gluster02:/gv-www  fuse.glusterfs  197G  176G  12G    94%   /var/www

Mount in /etc/fstab:
demo-gluster02:/gv-www /var/www/ glusterfs defaults,_netdev 0 0

--- Additional comment from Karthik U S on 2016-11-11 00:02:25 EST ---

It would be great if you can send the log files.
--- Additional comment from on 2016-11-13 19:42:00 EST ---

(In reply to Karthik U S from comment #1)
> It would be great if you can send the log files.

It has actually been too long and they have already been rotated.

I do also have a 2 GB (gzip'd) brick debug log (expands to 32 GB) which is guaranteed to contain some type of errors from when it was happening, but to be honest, I'm not sure what I am specifically looking for, and I'd like to anonymize the data before providing it. (Plus, who wants a 2 GB log file attached to a bug report?)

Let me know if there are some specific grep commands that I can run to minimize the log file so I can attach it.

--- Additional comment from Karthik U S on 2016-11-14 01:34:11 EST ---

I am not sure about the grep command to be used for this issue. Can you upload the log file somewhere else, so that I can access it from there?

--- Additional comment from Karthik U S on 2016-11-21 05:02:25 EST ---

You can grep for page.c and/or read-ahead.c to get the log messages from the read-ahead translator. If you can attach the logs with these names, that would be helpful in debugging the issue.

--- Additional comment from Worker Ant on 2016-11-22 06:44:08 EST ---

REVIEW: http://review.gluster.org/15901 (io-cache: Fix a read hang) posted (#1) for review on master by Poornima G (pgurusid)

--- Additional comment from Poornima G on 2016-11-22 07:19:43 EST ---

(In reply to bugzilla from comment #2)
> (In reply to Karthik U S from comment #1)
> > It would be great if you can send the log files.
>
> It has actually been too long and they have already been rotated.
>
> I do also have a 2GB(gzip'd) Brick Debug log (expands to 32GB) which is
> guaranteed to contain some type of errors when it was happening but to be
> honest, I'm not sure what I am specifically looking for and I'd like to
> anonimize the data before providing it.
> (Plus who wants a 2GB log file
> attached to a bug report)
>
> Let me know if there are some specific grep commands that I can run to
> minimize the log file so I can attach it.

grep io-cache <log file name> might be helpful

--- Additional comment from on 2016-11-22 21:18:39 EST ---

Thanks for the grep commands. Unfortunately, there are 0 logs for any of those grep commands.

As I was doing grep commands anyway, here are the counts of the unique message IDs. (It's probably not useful, but it's better than nothing.) If this doesn't help I shall just anonymize all the logs and upload them to my server.

64793583 [MSGID: 0]
       3 [MSGID: 101055] - server shutdown
       5 [MSGID: 113001] - getxattr failed on /gluster/.glusterfs/....
       3 [MSGID: 115029] - connect
       6 [MSGID: 115034] - auth option skip
       3 [MSGID: 115036] - disconnect
 2675797 [MSGID: 115050] - LOOKUP
     142 [MSGID: 115053] - INODELK & FINODELK (these are some of the files which I was having issues with)
      10 [MSGID: 115054] - ENTRYLK (possibly related to 115053?)
    1089 [MSGID: 115058] - REMOVEXATTR
 1567371 [MSGID: 115059] - GETXATTR (bitrot related?)
       4 [MSGID: 115073] - XATTROP
       3 [MSGID: 121037] - set (changetimerecorder?)

--- Additional comment from Karthik U S on 2016-11-23 01:04:47 EST ---

Hey, sorry for the inconvenience; the grep commands are for the client logs. If you have the client logs, can you try those on that, please?

Yesterday Poornima sent a fix for this issue. If possible, can you also try that patch and let us know whether it fixes the issue?
http://review.gluster.org/15901

--- Additional comment from on 2016-11-23 01:15:01 EST ---

Unfortunately all I have is the brick logs. I'll just anonymize it & upload and let you take a look. Will send a link in the next couple of hours.
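As an aside, the per-message-ID tallies quoted above can be produced with a short `grep | sort | uniq -c` pipeline. A minimal sketch follows; the sample log lines are invented to illustrate the `[MSGID: ...]` format, so point the grep at the real brick log instead of `sample.log`:

```shell
# Create a tiny sample in the brick-log MSGID format (hypothetical lines).
cat > sample.log <<'EOF'
[2016-11-22 00:00:01] E [MSGID: 115050] 0-gv-www-server: LOOKUP ...
[2016-11-22 00:00:02] E [MSGID: 115050] 0-gv-www-server: LOOKUP ...
[2016-11-22 00:00:03] E [MSGID: 115053] 0-gv-www-server: INODELK ...
EOF

# Count occurrences of each message ID, most frequent first.
grep -o 'MSGID: [0-9]*' sample.log | sort | uniq -c | sort -rn
# ->  2 MSGID: 115050
#     1 MSGID: 115053
```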
--- Additional comment from on 2016-11-23 03:27:41 EST ---

Here are the brick logs: https://slowb.ro/static/gluster.log.cleaned.gz

--- Additional comment from Karthik U S on 2016-11-23 03:51:15 EST ---

(In reply to bugzilla from comment #10)
> Here are the brick logs, https://slowb.ro/static/gluster.log.cleaned.gz

Thanks for the logs. Will have a look.

--- Additional comment from Worker Ant on 2016-11-23 08:11:10 EST ---

COMMIT: http://review.gluster.org/15901 committed in master by Jeff Darcy (jdarcy)
------
commit 114c50c1a10d649a8b640627f09fd5872828d4ec
Author: Poornima G <pgurusid>
Date:   Mon Nov 21 19:57:08 2016 +0530

io-cache: Fix a read hang

Issue:
=====
In certain cases, there was no unwind of read from the read-ahead
xlator, thus resulting in a hang.

RCA:
====
In certain cases, ioc_readv() issues STACK_WIND_TAIL() instead of
STACK_WIND(). One such case is when the inode_ctx for that file is not
present (this can happen if readdirp was called, populating md-cache and
serving all the lookups from cache).

Consider the following graph:
...
io-cache (parent)
    |
readdir-ahead
    |
read-ahead
...

Below is the code snippet of ioc_readv calling STACK_WIND_TAIL:

ioc_readv()
{
...
    if (!inode_ctx)
        STACK_WIND_TAIL (frame, FIRST_CHILD (frame->this),
                         FIRST_CHILD (frame->this)->fops->readv,
                         fd, size, offset, flags, xdata);
        /* Ideally, this stack_wind should wind to readdir-ahead:readv()
           but it winds to read-ahead:readv(). See below for the
           explanation. */
...
}

STACK_WIND_TAIL (frame, obj, fn, ...)
{
    frame->this = obj;
    /* for the above mentioned graph, frame->this will be readdir-ahead:
     * frame->this = FIRST_CHILD (frame->this), i.e. readdir-ahead, which
     * is as expected */
...
    THIS = obj;
    /* THIS will be read-ahead instead of readdir-ahead!, as obj expands
     * to "FIRST_CHILD (frame->this)" and frame->this was pointing
     * to readdir-ahead in the previous statement. */
...
    fn (frame, obj, params);
    /* fn will call read-ahead:readv() instead of readdir-ahead:readv()!
     * as fn expands to "FIRST_CHILD (frame->this)->fops->readv" and
     * frame->this was pointing to readdir-ahead in the first statement */
...
}

Thus, the readdir-ahead readv() implementation will be skipped, and
ra_readv() will be called with frame->this = "readdir-ahead" and
this = "read-ahead". This can lead to corruption / hangs / other
problems. But in this particular case, when the 'frame->this' and 'this'
passed to ra_readv() don't match, ra_readv() ends up calling ra_readv()
again! Thus the logic of read-ahead readv() falls apart and leads to a
hang.

Solution:
=========
Ideally, STACK_WIND_TAIL() should be modified as:

STACK_WIND_TAIL (frame, obj, fn, ...)
{
    next_xl = obj /* resolve obj first, as the variables passed in the
                     obj macro argument can be overwritten by the further
                     instructions */
    next_xl_fn = fn /* resolve fn and store it in a tmp variable, before
                       modifying any variables */
    frame->this = next_xl;
    ...
    THIS = next_xl;
    ...
    next_xl_fn (frame, next_xl, params);
    ...
}

But for this solution, knowing the type of the variable 'next_xl_fn' is
a challenge and is not easy. Hence, just modify all the existing callers
to pass "FIRST_CHILD (this)" as obj, instead of
"FIRST_CHILD (frame->this)".
Change-Id: I179ffe3d1f154bc5a1935fd2ee44e912eb0fbb61
BUG: 1388292
Signed-off-by: Poornima G <pgurusid>
Reviewed-on: http://review.gluster.org/15901
Smoke: Gluster Build System <jenkins.org>
Reviewed-by: Raghavendra G <rgowdapp>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>

--- Additional comment from Worker Ant on 2016-11-24 01:01:33 EST ---

REVIEW: http://review.gluster.org/15923 (io-cache: Fix a read hang) posted (#1) for review on master by Poornima G (pgurusid)

--- Additional comment from Worker Ant on 2016-11-24 01:02:56 EST ---

REVIEW: http://review.gluster.org/15923 (libglusterfs: Fix a read hang) posted (#2) for review on master by Poornima G (pgurusid)

--- Additional comment from Worker Ant on 2016-11-24 01:35:02 EST ---

REVIEW: http://review.gluster.org/15922 (Revert "io-cache: Fix a read hang") posted (#2) for review on master by Poornima G (pgurusid)
REVIEW: http://review.gluster.org/15933 (libglusterfs: Fix a read hang) posted (#1) for review on release-3.9 by Poornima G (pgurusid)
REVIEW: http://review.gluster.org/15933 (libglusterfs: Fix a read hang) posted (#2) for review on release-3.9 by Poornima G (pgurusid)
COMMIT: http://review.gluster.org/15933 committed in release-3.9 by Pranith Kumar Karampuri (pkarampu)
------
commit 7e35f4c1a8bc5db145aba54b88daf16611de803b
Author: Poornima G <pgurusid>
Date:   Mon Nov 21 19:57:08 2016 +0530

libglusterfs: Fix a read hang

Backport of http://review.gluster.org/15923

Issue:
=====
In certain cases, there was no unwind of read from the read-ahead
xlator, thus resulting in a hang.

RCA:
====
In certain cases, ioc_readv() issues STACK_WIND_TAIL() instead of
STACK_WIND(). One such case is when the inode_ctx for that file is not
present (this can happen if readdirp was called, populating md-cache and
serving all the lookups from cache).

Consider the following graph:
...
io-cache (parent)
    |
readdir-ahead
    |
read-ahead
...

Below is the code snippet of ioc_readv calling STACK_WIND_TAIL:

ioc_readv()
{
...
    if (!inode_ctx)
        STACK_WIND_TAIL (frame, FIRST_CHILD (frame->this),
                         FIRST_CHILD (frame->this)->fops->readv,
                         fd, size, offset, flags, xdata);
        /* Ideally, this stack_wind should wind to readdir-ahead:readv()
           but it winds to read-ahead:readv(). See below for the
           explanation. */
...
}

STACK_WIND_TAIL (frame, obj, fn, ...)
{
    frame->this = obj;
    /* for the above mentioned graph, frame->this will be readdir-ahead:
     * frame->this = FIRST_CHILD (frame->this), i.e. readdir-ahead, which
     * is as expected */
...
    THIS = obj;
    /* THIS will be read-ahead instead of readdir-ahead!, as obj expands
     * to "FIRST_CHILD (frame->this)" and frame->this was pointing
     * to readdir-ahead in the previous statement. */
...
    fn (frame, obj, params);
    /* fn will call read-ahead:readv() instead of readdir-ahead:readv()!
     * as fn expands to "FIRST_CHILD (frame->this)->fops->readv" and
     * frame->this was pointing to readdir-ahead in the first statement */
...
}

Thus, the readdir-ahead readv() implementation will be skipped, and
ra_readv() will be called with frame->this = "readdir-ahead" and
this = "read-ahead". This can lead to corruption / hangs / other
problems.
But in this particular case, when the 'frame->this' and 'this' passed to
ra_readv() don't match, ra_readv() ends up calling ra_readv() again!
Thus the logic of read-ahead readv() falls apart and leads to a hang.

Solution:
=========
Modify STACK_WIND_TAIL() as:

STACK_WIND_TAIL (frame, obj, fn, ...)
{
    next_xl = obj /* resolve obj first, as the variables passed in the
                     obj macro argument can be overwritten by the further
                     instructions */
    next_xl_fn = fn /* resolve fn and store it in a tmp variable, before
                       modifying any variables */
    frame->this = next_xl;
    ...
    THIS = next_xl;
    ...
    next_xl_fn (frame, next_xl, params);
    ...
}

>Reviewed-on: http://review.gluster.org/15923
>Smoke: Gluster Build System <jenkins.org>
>NetBSD-regression: NetBSD Build System <jenkins.org>
>Reviewed-by: Rajesh Joseph <rjoseph>
>CentOS-regression: Gluster Build System <jenkins.org>
>Reviewed-by: Raghavendra G <rgowdapp>

(Cherry picked from commit 8943c19a2ef51b6e4fa66cb57211d469fe558579)

BUG: 1399015
Change-Id: Ie662ac8f18fa16909376f1e59387bc5b886bd0f9
Signed-off-by: Poornima G <pgurusid>
Reviewed-on: http://review.gluster.org/15933
NetBSD-regression: NetBSD Build System <jenkins.org>
Smoke: Gluster Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.9.1, please open a new bug report.

glusterfs-3.9.1 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-January/029725.html
[2] https://www.gluster.org/pipermail/gluster-users/