Bug 1388292 - performance.read-ahead on results in processes on client stuck in IO wait
Summary: performance.read-ahead on results in processes on client stuck in IO wait
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: read-ahead
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Karthik U S
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1379228 1392299 1399015 1399018 1399023 1399024
 
Reported: 2016-10-25 03:37 UTC by bugzilla
Modified: 2017-03-06 17:30 UTC
CC: 9 users

Fixed In Version: glusterfs-3.10.0
Clone Of:
Clones: 1399015 1399018 1399023 1399024
Environment:
Last Closed: 2017-03-06 17:30:52 UTC
Regression: ---
Mount Type: fuse
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)
Gluster State Dumps 2 minutes apart (17.09 MB, application/octet-stream)
2016-10-25 03:37 UTC, bugzilla

Description bugzilla 2016-10-25 03:37:27 UTC
Created attachment 1213677 [details]
Gluster State Dumps 2 minutes apart

# Description of problem:
After updating to v3.7.16, our Replica volume (2 clients and 2 bricks) developed issues where processes on the clients (PHP, clamav/maldet, and occasionally even the gluster mount process itself), accessing the volume over FUSE, would enter an "un-interruptible kernel wait state".

Apparently the cause of this was "performance.read-ahead".
After turning it off, we have not had any further problems for over a week.
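
For reference, the workaround as a single command (a sketch; the volume name is taken from the volume info below):

sudo gluster volume set gv-www performance.read-ahead off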

# Version-Release number of selected component (if applicable):
v3.7.16

# How reproducible:
It was always reproducible, but it took a long time (24+ hours) after updating (from v3.7.13 to v3.7.16) before anything occurred.

# Steps to Reproduce:
(Full disclosure: I have not replicated this with a brand-new volume; I don't currently have the hardware to test.)

Notes on reproduction: We have a total of just under 2.8 million files on the one volume. I am not sure whether you will need to create them all to reproduce the issue we are having.

1. Create a Replica volume with 2 bricks
2. gluster volume set volume1 performance.read-ahead on
3. Mount gluster volume via FUSE on client.
4. Create a wordpress site, and set it up.
5. Run clamav or maldet on the files.
5.1 Or, if the site was set up via WordPress, run `php wp-cron.php`
6. After a while (we saw a range of 2-24 hours), specific files will trigger a process to get stuck, and then all subsequent operations on those files (such as rm, or moving the file off the volume) will fail.
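
For reference, a shell sketch of steps 1-3 (the hostnames and brick paths here are assumptions; adapt to your environment):

gluster volume create volume1 replica 2 server1:/gluster server2:/gluster
gluster volume start volume1
gluster volume set volume1 performance.read-ahead on
# on the client:
mount -t glusterfs server1:/volume1 /var/www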

# Actual results:
Processes running on the clients become stuck, and eventually so does the gluster mount process.
lsof would hang when trying to list the open file descriptors, as would trying to cat /proc/${pid}/cmdline for an un-interruptible process.

# Expected results:
Processes do not become stuck in an iowait/uninterruptible state. 

# Additional info:

Source of packages used: http://ppa.launchpad.net/gluster/glusterfs-3.7/ubuntu

I have also anonymised the path data in the attached statedumps, but I have kept the lengths the same, e.g. 'welove.site.com' becomes 'aqwedsaqwedsaqw'. I hope that is okay.

## Volume Information:

sudo gluster volume info

Volume Name: gv-www
Type: Replicate
Volume ID: 6163f185-d58b-45f2-8bf0-eeb3a18142d9
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: demo-gluster01:/gluster
Brick2: demo-gluster02:/gluster
Options Reconfigured:
performance.read-ahead: off
nfs.disable: true
cluster.self-heal-daemon: enable
features.scrub: Active
features.bitrot: on
diagnostics.client-log-level: INFO
diagnostics.brick-log-level: INFO
performance.readdir-ahead: on

## Brick Information:

### Brick 1:

sudo getfattr -d -m. -ehex /gluster/
getfattr: Removing leading '/' from absolute path names
# file: gluster/
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.gv-www-client-1=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x6163f185d58b45f28bf0eeb3a18142d9

### Brick 2:
sudo getfattr -d -m. -ehex /gluster/
getfattr: Removing leading '/' from absolute path names
# file: gluster/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x6163f185d58b45f28bf0eeb3a18142d9


## Client Information:
All servers are Ubuntu 14.04.5 LTS

### Client 1:
df -Th
Filesystem             Type            Size  Used Avail Use% Mounted on

demo-gluster01:/gv-www fuse.glusterfs  197G  176G   12G  94% /var/www

Mount in /etc/fstab
demo-gluster01:/gv-www /var/www/ glusterfs defaults,_netdev 0 0 


### Client 2:
df -Th
Filesystem             Type            Size  Used Avail Use% Mounted on
demo-gluster02:/gv-www fuse.glusterfs  197G  176G   12G  94% /var/www

Mount in /etc/fstab
demo-gluster02:/gv-www /var/www/ glusterfs defaults,_netdev 0 0

Comment 1 Karthik U S 2016-11-11 05:02:25 UTC
It would be great if you could send the log files.

Comment 2 bugzilla 2016-11-14 00:42:00 UTC
(In reply to Karthik U S from comment #1)
> It would be great if you could send the log files.

It has actually been too long and they have already been rotated. 

I do also have a 2GB (gzip'd) brick debug log (expands to 32GB) which is guaranteed to contain some of the errors from when it was happening, but to be honest I'm not sure what I am specifically looking for, and I'd like to anonymize the data before providing it. (Plus, who wants a 2GB log file attached to a bug report?)

Let me know if there are some specific grep commands that I can run to minimize the log file so I can attach it.

Comment 3 Karthik U S 2016-11-14 06:34:11 UTC
I am not sure about the grep command to be used for this issue. Can you upload the log file somewhere else, so that I can access it from there?

Comment 4 Karthik U S 2016-11-21 10:02:25 UTC
You can grep for page.c and/or read-ahead.c to get the log messages from the read-ahead translator. If you can attach the log lines matching these names, that would be helpful in debugging the issue.
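
For example, something along these lines should work (a sketch; the client log path is an assumption, based on the default FUSE client log naming for a /var/www mount):

grep -E 'page\.c|read-ahead\.c' /var/log/glusterfs/var-www.log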

Comment 5 Worker Ant 2016-11-22 11:44:08 UTC
REVIEW: http://review.gluster.org/15901 (io-cache: Fix a read hang) posted (#1) for review on master by Poornima G (pgurusid)

Comment 6 Poornima G 2016-11-22 12:19:43 UTC
(In reply to bugzilla from comment #2)
> (In reply to Karthik U S from comment #1)
> > It would be great if you could send the log files.
> 
> It has actually been too long and they have already been rotated. 
> 
> I do also have a 2GB (gzip'd) brick debug log (expands to 32GB) which is
> guaranteed to contain some of the errors from when it was happening, but
> to be honest I'm not sure what I am specifically looking for, and I'd like
> to anonymize the data before providing it. (Plus, who wants a 2GB log file
> attached to a bug report?)
> 
> Let me know if there are some specific grep commands that I can run to
> minimize the log file so I can attach it.

grep io-cache <log file name> might be helpful

Comment 7 bugzilla 2016-11-23 02:18:39 UTC
Thanks for the grep commands.

Unfortunately, there are 0 matching log lines for any of those grep commands.

As I was running grep commands anyway, here are the counts of the unique message IDs. (It's probably not useful, but it's better than nothing.) If this doesn't help I shall just anonymize all the logs and upload them to my server.

64793583 [MSGID: 0]
      3  [MSGID: 101055] - server shutdown
      5  [MSGID: 113001] - getxattr failed on /gluster/.glusterfs/....
      3  [MSGID: 115029] - connect
      6  [MSGID: 115034] - auth option skip
      3  [MSGID: 115036] - disconnect  
2675797  [MSGID: 115050] - LOOKUP
    142  [MSGID: 115053] - INODELK & FINODELK - (These are (some) of the files which i was having issues with)
     10  [MSGID: 115054] - ENTRYLK (possibly related to 115053?)
   1089  [MSGID: 115058] - REMOVEXATTR 
1567371  [MSGID: 115059] - GETXATTR (bitrot related?)
      4  [MSGID: 115073] - XATTROP
      3  [MSGID: 121037] - set (changetimerecorder?)
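
For reference, counts like the above can be produced with something along these lines (a sketch; the exact command used is not shown in the report):

grep -o '\[MSGID: [0-9]*\]' gluster.log | sort | uniq -c | sort -rn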

Comment 8 Karthik U S 2016-11-23 06:04:47 UTC
Hey,

Sorry for the inconvenience; the grep commands are for the client logs. If you have the client logs, can you try those commands on them, please?

Yesterday Poornima sent a fix for this issue. If possible, can you also try that patch and let us know whether it fixes the issue?
http://review.gluster.org/15901
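
For reference, one way to check out a Gerrit change on a source tree (a sketch; the fetch URL is an assumption, and the refspec follows Gerrit's refs/changes/<last two digits>/<change number>/<patchset> convention):

git fetch http://review.gluster.org/glusterfs refs/changes/01/15901/1
git checkout FETCH_HEAD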

Comment 9 bugzilla 2016-11-23 06:15:01 UTC
Unfortunately, all I have is the brick logs. I'll just anonymize them, upload, and let you take a look. I will send a link in the next couple of hours.

Comment 10 bugzilla 2016-11-23 08:27:41 UTC
Here are the brick logs, https://slowb.ro/static/gluster.log.cleaned.gz

Comment 11 Karthik U S 2016-11-23 08:51:15 UTC
(In reply to bugzilla from comment #10)
> Here are the brick logs, https://slowb.ro/static/gluster.log.cleaned.gz

Thanks for the logs. Will have a look.

Comment 12 Worker Ant 2016-11-23 13:11:10 UTC
COMMIT: http://review.gluster.org/15901 committed in master by Jeff Darcy (jdarcy) 
------
commit 114c50c1a10d649a8b640627f09fd5872828d4ec
Author: Poornima G <pgurusid>
Date:   Mon Nov 21 19:57:08 2016 +0530

    io-cache: Fix a read hang
    
    Issue:
    =====
    In certain cases, there was no unwind of read
    from read-ahead xlator, thus resulting in hang.
    
    RCA:
    ====
    In certain cases, ioc_readv() issues STACK_WIND_TAIL() instead
    of STACK_WIND(). One such case is when inode_ctx for that file
    is not present (can happen if readdirp was called, and populates
    md-cache and serves all the lookups from cache).
    
    Consider the following graph:
    ...
    io-cache (parent)
       |
    readdir-ahead
       |
    read-ahead
    ...
    
    Below is the code snippet of ioc_readv calling STACK_WIND_TAIL:
    ioc_readv()
    {
    ...
     if (!inode_ctx)
       STACK_WIND_TAIL (frame, FIRST_CHILD (frame->this),
                        FIRST_CHILD (frame->this)->fops->readv, fd,
                        size, offset, flags, xdata);
       /* Ideally, this stack_wind should wind to readdir-ahead:readv()
          but it winds to read-ahead:readv(). See below for
          explanation.
        */
    ...
    }
    
    STACK_WIND_TAIL (frame, obj, fn, ...)
    {
      frame->this = obj;
      /* for the above mentioned graph, frame->this will be readdir-ahead
       * frame->this = FIRST_CHILD (frame->this) i.e. readdir-ahead, which
       * is as expected
       */
      ...
      THIS = obj;
      /* THIS will be read-ahead instead of readdir-ahead!, as obj expands
       * to "FIRST_CHILD (frame->this)" and frame->this was pointing
       * to readdir-ahead in the previous statement.
       */
      ...
      fn (frame, obj, params);
      /* fn will call read-ahead:readv() instead of readdir-ahead:readv()!
       * as fn expands to "FIRST_CHILD (frame->this)->fops->readv" and
       * frame->this was pointing to readdir-ahead in the first statement
       */
      ...
    }
    
    Thus, the readdir-ahead's readv() implementation will be skipped, and
    ra_readv() will be called with frame->this = "readdir-ahead" and
    this = "read-ahead". This can lead to corruption / hang / other problems.
    But in this particular case, when the 'frame->this' and 'this' passed
    to ra_readv() don't match, it causes ra_readv() to call ra_readv()
    again! Thus the logic of read-ahead's readv() falls apart and leads to
    a hang.
    
    Solution:
    =========
    Ideally, STACK_WIND_TAIL() should be modified as:
    STACK_WIND_TAIL (frame, obj, fn, ...)
    {
      next_xl = obj /* resolve obj as the variables passed in obj macro
                       can be overwritten in the further instructions */
      next_xl_fn = fn /* resolve fn and store in a tmp variable, before
                         modifying any variables */
      frame->this = next_xl;
      ...
      THIS = next_xl;
      ...
      next_xl_fn (frame, next_xl, params);
      ...
    }
    But for this solution, knowing the type of variable 'next_xl_fn' is
    a challenge and is not easy. Hence just modifying all the existing
    callers to pass "FIRST_CHILD (this)" as obj, instead of
    "FIRST_CHILD (frame->this)".
    
    Change-Id: I179ffe3d1f154bc5a1935fd2ee44e912eb0fbb61
    BUG: 1388292
    Signed-off-by: Poornima G <pgurusid>
    Reviewed-on: http://review.gluster.org/15901
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 13 Worker Ant 2016-11-24 06:01:33 UTC
REVIEW: http://review.gluster.org/15923 (io-cache: Fix a read hang) posted (#1) for review on master by Poornima G (pgurusid)

Comment 14 Worker Ant 2016-11-24 06:02:56 UTC
REVIEW: http://review.gluster.org/15923 (libglusterfs: Fix a read hang) posted (#2) for review on master by Poornima G (pgurusid)

Comment 15 Worker Ant 2016-11-24 06:35:02 UTC
REVIEW: http://review.gluster.org/15922 (Revert "io-cache: Fix a read hang") posted (#2) for review on master by Poornima G (pgurusid)

Comment 16 Worker Ant 2016-11-28 05:16:48 UTC
REVIEW: http://review.gluster.org/15923 (libglusterfs: Fix a read hang) posted (#3) for review on master by Poornima G (pgurusid)

Comment 17 Worker Ant 2016-11-28 05:22:26 UTC
REVIEW: http://review.gluster.org/15923 (libglusterfs: Fix a read hang) posted (#4) for review on master by Poornima G (pgurusid)

Comment 18 Worker Ant 2016-11-29 03:59:11 UTC
COMMIT: http://review.gluster.org/15923 committed in master by Raghavendra G (rgowdapp) 
------
commit 8943c19a2ef51b6e4fa66cb57211d469fe558579
Author: Poornima G <pgurusid>
Date:   Mon Nov 21 19:57:08 2016 +0530

    libglusterfs: Fix a read hang
    
    Issue:
    =====
    In certain cases, there was no unwind of read
    from read-ahead xlator, thus resulting in hang.
    
    RCA:
    ====
    In certain cases, ioc_readv() issues STACK_WIND_TAIL() instead
    of STACK_WIND(). One such case is when inode_ctx for that file
    is not present (can happen if readdirp was called, and populates
    md-cache and serves all the lookups from cache).
    
    Consider the following graph:
    ...
    io-cache (parent)
       |
    readdir-ahead
       |
    read-ahead
    ...
    
    Below is the code snippet of ioc_readv calling STACK_WIND_TAIL:
    ioc_readv()
    {
    ...
     if (!inode_ctx)
       STACK_WIND_TAIL (frame, FIRST_CHILD (frame->this),
                        FIRST_CHILD (frame->this)->fops->readv, fd,
                        size, offset, flags, xdata);
       /* Ideally, this stack_wind should wind to readdir-ahead:readv()
          but it winds to read-ahead:readv(). See below for
          explanation.
        */
    ...
    }
    
    STACK_WIND_TAIL (frame, obj, fn, ...)
    {
      frame->this = obj;
      /* for the above mentioned graph, frame->this will be readdir-ahead
       * frame->this = FIRST_CHILD (frame->this) i.e. readdir-ahead, which
       * is as expected
       */
      ...
      THIS = obj;
      /* THIS will be read-ahead instead of readdir-ahead!, as obj expands
       * to "FIRST_CHILD (frame->this)" and frame->this was pointing
       * to readdir-ahead in the previous statement.
       */
      ...
      fn (frame, obj, params);
      /* fn will call read-ahead:readv() instead of readdir-ahead:readv()!
       * as fn expands to "FIRST_CHILD (frame->this)->fops->readv" and
       * frame->this was pointing to readdir-ahead in the first statement
       */
      ...
    }
    
    Thus, the readdir-ahead's readv() implementation will be skipped, and
    ra_readv() will be called with frame->this = "readdir-ahead" and
    this = "read-ahead". This can lead to corruption / hang / other problems.
    But in this particular case, when the 'frame->this' and 'this' passed
    to ra_readv() don't match, it causes ra_readv() to call ra_readv()
    again! Thus the logic of read-ahead's readv() falls apart and leads to
    a hang.
    
    Solution:
    =========
    Modify STACK_WIND_TAIL() as:
    STACK_WIND_TAIL (frame, obj, fn, ...)
    {
      next_xl = obj /* resolve obj as the variables passed in obj macro
                       can be overwritten in the further instructions */
      next_xl_fn = fn /* resolve fn and store in a tmp variable, before
                         modifying any variables */
      frame->this = next_xl;
      ...
      THIS = next_xl;
      ...
      next_xl_fn (frame, next_xl, params);
      ...
    }
    
    As a part of http://review.gluster.org/15901/ the caller io-cache
    was fixed.
    
    BUG: 1388292
    Change-Id: Ie662ac8f18fa16909376f1e59387bc5b886bd0f9
    Signed-off-by: Poornima G <pgurusid>
    Reviewed-on: http://review.gluster.org/15923
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Rajesh Joseph <rjoseph>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>

Comment 19 Shyamsundar 2017-03-06 17:30:52 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/

