+++ This bug was initially created as a clone of Bug #1388292 +++

# Description of problem:

After updating to v3.7.16, our Replica volume (2 clients and 2 bricks) developed issues where processes (PHP, clamav/maldet, and occasionally even the gluster mount process) on the clients (over FUSE) would enter an "uninterruptible kernel wait state".

Apparently the cause of this was "performance.read-ahead". After turning it to "off" we have not had any other problems for over 1 week.

# Version-Release number of selected component (if applicable):

v3.7.16

# How reproducible:

It was always reproducible, but it took a long time (24+ hours) before anything occurred after updating (from v3.7.13) to mainline.

# Steps to Reproduce:

(Full disclosure: I have not replicated this with a brand-new volume; I don't currently have the hardware to test.)

Notes on reproduction: we have a total of just under 2.8 million files on the 1 volume. I am not sure if you will need to create them all to reproduce the issue we are having.

1. Create a Replica volume with 2 bricks.
2. gluster volume set volume1 performance.read-ahead on
3. Mount the gluster volume via FUSE on a client.
4. Create a WordPress site and set it up.
5. Run clamav or maldet on the files.
5.1 Or, if installed via WordPress, run `php wp-cron.php`.
6. After a while (we saw a range of 2-24 hours), specific files will trigger a process to get stuck, and then all following transactions on those files will fail (such as rm, or moving the file off the volume).

# Actual results:

Processes running on the clients, and sometimes the gluster mount process itself, become stuck. lsof would hang when trying to find the open file descriptors, as would trying to cat /proc/${pid}/cmdline on an uninterruptible process.

# Expected results:

Processes do not become stuck in an iowait/uninterruptible state.

# Additional info:

Source of packages used: http://ppa.launchpad.net/gluster/glusterfs-3.7/ubuntu

I have also anonymised the path data of the attached statedumps.
But I have kept the length the same, e.g. 'welove.site.com' becomes 'aqwedsaqwedsaqw'. I hope that is okay.

## Volume Information:

sudo gluster volume info

Volume Name: gv-www
Type: Replicate
Volume ID: 6163f185-d58b-45f2-8bf0-eeb3a18142d9
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: demo-gluster01:/gluster
Brick2: demo-gluster02:/gluster
Options Reconfigured:
performance.read-ahead: off
nfs.disable: true
cluster.self-heal-daemon: enable
features.scrub: Active
features.bitrot: on
diagnostics.client-log-level: INFO
diagnostics.brick-log-level: INFO
performance.readdir-ahead: on

## Brick Information:

### Brick 1:

sudo getfattr -d -m. -ehex /gluster/
getfattr: Removing leading '/' from absolute path names
# file: gluster/
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.gv-www-client-1=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x6163f185d58b45f28bf0eeb3a18142d9

### Brick 2:

sudo getfattr -d -m. -ehex /gluster/
getfattr: Removing leading '/' from absolute path names
# file: gluster/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x6163f185d58b45f28bf0eeb3a18142d9

## Client Information:

All servers are Ubuntu 14.04.5 LTS.

### Client 1:

df -Th
Filesystem              Type            Size  Used  Avail  Use%  Mounted on
demo-gluster01:/gv-www  fuse.glusterfs  197G  176G  12G    94%   /var/www

Mount in /etc/fstab:
demo-gluster01:/gv-www /var/www/ glusterfs defaults,_netdev 0 0

### Client 2:

df -Th
Filesystem              Type            Size  Used  Avail  Use%  Mounted on
demo-gluster02:/gv-www  fuse.glusterfs  197G  176G  12G    94%   /var/www

Mount in /etc/fstab:
demo-gluster02:/gv-www /var/www/ glusterfs defaults,_netdev 0 0

--- Additional comment from Karthik U S on 2016-11-11 00:02:25 EST ---

It would be great if you can send the log files.
--- Additional comment from on 2016-11-13 19:42:00 EST ---

(In reply to Karthik U S from comment #1)
> It would be great if you can send the log files.

It has actually been too long and they have already been rotated.

I do also have a 2 GB (gzip'd) brick debug log (expands to 32 GB) which is guaranteed to contain some type of errors from when it was happening, but to be honest, I'm not sure what I am specifically looking for, and I'd like to anonymize the data before providing it. (Plus, who wants a 2 GB log file attached to a bug report?)

Let me know if there are some specific grep commands that I can run to minimize the log file so I can attach it.

--- Additional comment from Karthik U S on 2016-11-14 01:34:11 EST ---

I am not sure about the grep command to be used for this issue. Can you upload the log file somewhere else, so that I can access it from there?

--- Additional comment from Karthik U S on 2016-11-21 05:02:25 EST ---

You can grep for page.c and/or read-ahead.c to get the log messages from the read-ahead translator. If you can attach the logs with these names, that would be helpful in debugging the issue.

--- Additional comment from Worker Ant on 2016-11-22 06:44:08 EST ---

REVIEW: http://review.gluster.org/15901 (io-cache: Fix a read hang) posted (#1) for review on master by Poornima G (pgurusid)

--- Additional comment from Poornima G on 2016-11-22 07:19:43 EST ---

(In reply to bugzilla from comment #2)
> (In reply to Karthik U S from comment #1)
> > It would be great if you can send the log files.
>
> It has actually been too long and they have already been rotated.
>
> I do also have a 2GB(gzip'd) Brick Debug log (expands to 32GB) which is
> guaranteed to contain some type of errors when it was happening but to be
> honest, I'm not sure what I am specifically looking for and I'd like to
> anonimize the data before providing it.
> (Plus who wants a 2GB log file
> attached to a bug report)
>
> Let me know if there are some specific grep commands that I can run to
> minimize the log file so I can attach it.

grep io-cache <log file name> might be helpful

--- Additional comment from on 2016-11-22 21:18:39 EST ---

Thanks for the grep commands. Unfortunately, there are 0 logs for any of those grep commands.

As I was doing grep commands anyway, here are the counts of the unique message IDs. (It's probably not useful, but it's better than nothing.) If this doesn't help I shall just anonymize all the logs and upload them to my server.

64793583 [MSGID: 0]
       3 [MSGID: 101055] - server shutdown
       5 [MSGID: 113001] - getxattr failed on /gluster/.glusterfs/....
       3 [MSGID: 115029] - connect
       6 [MSGID: 115034] - auth option skip
       3 [MSGID: 115036] - disconnect
 2675797 [MSGID: 115050] - LOOKUP
     142 [MSGID: 115053] - INODELK & FINODELK (these are some of the files which I was having issues with)
      10 [MSGID: 115054] - ENTRYLK (possibly related to 115053?)
    1089 [MSGID: 115058] - REMOVEXATTR
 1567371 [MSGID: 115059] - GETXATTR (bitrot related?)
       4 [MSGID: 115073] - XATTROP
       3 [MSGID: 121037] - set (changetimerecorder?)

--- Additional comment from Karthik U S on 2016-11-23 01:04:47 EST ---

Hey, sorry for the inconvenience; the grep commands are for the client logs. If you have the client logs, can you try those on that, please?

Yesterday Poornima sent a fix for this issue. If possible, can you also try that patch and let us know whether it fixes the issue?
http://review.gluster.org/15901

--- Additional comment from on 2016-11-23 01:15:01 EST ---

Unfortunately all I have is the brick logs. I'll just anonymize it & upload and let you take a look. Will send a link in the next couple of hours.
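As an aside, the per-message-ID tallies quoted above can be produced with a short `grep | sort | uniq -c` pipeline. A minimal sketch follows; the sample log lines are invented to illustrate the `[MSGID: ...]` format, so point the grep at the real brick log instead of `sample.log`:

```shell
# Create a tiny sample in the brick-log MSGID format (hypothetical lines).
cat > sample.log <<'EOF'
[2016-11-22 00:00:01] E [MSGID: 115050] 0-gv-www-server: LOOKUP ...
[2016-11-22 00:00:02] E [MSGID: 115050] 0-gv-www-server: LOOKUP ...
[2016-11-22 00:00:03] E [MSGID: 115053] 0-gv-www-server: INODELK ...
EOF

# Count occurrences of each message ID, most frequent first.
grep -o 'MSGID: [0-9]*' sample.log | sort | uniq -c | sort -rn
# ->  2 MSGID: 115050
#     1 MSGID: 115053
```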
--- Additional comment from on 2016-11-23 03:27:41 EST ---

Here are the brick logs: https://slowb.ro/static/gluster.log.cleaned.gz

--- Additional comment from Karthik U S on 2016-11-23 03:51:15 EST ---

(In reply to bugzilla from comment #10)
> Here are the brick logs, https://slowb.ro/static/gluster.log.cleaned.gz

Thanks for the logs. Will have a look.

--- Additional comment from Worker Ant on 2016-11-23 08:11:10 EST ---

COMMIT: http://review.gluster.org/15901 committed in master by Jeff Darcy (jdarcy)
------
commit 114c50c1a10d649a8b640627f09fd5872828d4ec
Author: Poornima G <pgurusid>
Date:   Mon Nov 21 19:57:08 2016 +0530

io-cache: Fix a read hang

Issue:
=====
In certain cases, there was no unwind of read from the read-ahead
xlator, thus resulting in a hang.

RCA:
====
In certain cases, ioc_readv() issues STACK_WIND_TAIL() instead of
STACK_WIND(). One such case is when the inode_ctx for that file is not
present (this can happen if readdirp was called, populating md-cache and
serving all the lookups from cache).

Consider the following graph:
...
io-cache (parent)
    |
readdir-ahead
    |
read-ahead
...

Below is the code snippet of ioc_readv calling STACK_WIND_TAIL:

ioc_readv()
{
...
    if (!inode_ctx)
        STACK_WIND_TAIL (frame, FIRST_CHILD (frame->this),
                         FIRST_CHILD (frame->this)->fops->readv,
                         fd, size, offset, flags, xdata);
        /* Ideally, this stack_wind should wind to readdir-ahead:readv()
           but it winds to read-ahead:readv(). See below for the
           explanation. */
...
}

STACK_WIND_TAIL (frame, obj, fn, ...)
{
    frame->this = obj;
    /* for the above mentioned graph, frame->this will be readdir-ahead:
     * frame->this = FIRST_CHILD (frame->this), i.e. readdir-ahead, which
     * is as expected */
...
    THIS = obj;
    /* THIS will be read-ahead instead of readdir-ahead!, as obj expands
     * to "FIRST_CHILD (frame->this)" and frame->this was pointing
     * to readdir-ahead in the previous statement. */
...
    fn (frame, obj, params);
    /* fn will call read-ahead:readv() instead of readdir-ahead:readv()!
     * as fn expands to "FIRST_CHILD (frame->this)->fops->readv" and
     * frame->this was pointing to readdir-ahead in the first statement */
...
}

Thus, the readdir-ahead readv() implementation will be skipped, and
ra_readv() will be called with frame->this = "readdir-ahead" and
this = "read-ahead". This can lead to corruption / hangs / other
problems. But in this particular case, when the 'frame->this' and 'this'
passed to ra_readv() don't match, ra_readv() ends up calling ra_readv()
again! Thus the logic of read-ahead readv() falls apart and leads to a
hang.

Solution:
=========
Ideally, STACK_WIND_TAIL() should be modified as:

STACK_WIND_TAIL (frame, obj, fn, ...)
{
    next_xl = obj /* resolve obj first, as the variables passed in the
                     obj macro argument can be overwritten by the further
                     instructions */
    next_xl_fn = fn /* resolve fn and store it in a tmp variable, before
                       modifying any variables */
    frame->this = next_xl;
    ...
    THIS = next_xl;
    ...
    next_xl_fn (frame, next_xl, params);
    ...
}

But for this solution, knowing the type of the variable 'next_xl_fn' is
a challenge and is not easy. Hence, just modify all the existing callers
to pass "FIRST_CHILD (this)" as obj, instead of
"FIRST_CHILD (frame->this)".
Change-Id: I179ffe3d1f154bc5a1935fd2ee44e912eb0fbb61
BUG: 1388292
Signed-off-by: Poornima G <pgurusid>
Reviewed-on: http://review.gluster.org/15901
Smoke: Gluster Build System <jenkins.org>
Reviewed-by: Raghavendra G <rgowdapp>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>

--- Additional comment from Worker Ant on 2016-11-24 01:01:33 EST ---

REVIEW: http://review.gluster.org/15923 (io-cache: Fix a read hang) posted (#1) for review on master by Poornima G (pgurusid)

--- Additional comment from Worker Ant on 2016-11-24 01:02:56 EST ---

REVIEW: http://review.gluster.org/15923 (libglusterfs: Fix a read hang) posted (#2) for review on master by Poornima G (pgurusid)

--- Additional comment from Worker Ant on 2016-11-24 01:35:02 EST ---

REVIEW: http://review.gluster.org/15922 (Revert "io-cache: Fix a read hang") posted (#2) for review on master by Poornima G (pgurusid)
REVIEW: http://review.gluster.org/15933 (libglusterfs: Fix a read hang) posted (#1) for review on release-3.9 by Poornima G (pgurusid)
REVIEW: http://review.gluster.org/15933 (libglusterfs: Fix a read hang) posted (#2) for review on release-3.9 by Poornima G (pgurusid)
COMMIT: http://review.gluster.org/15933 committed in release-3.9 by Pranith Kumar Karampuri (pkarampu)
------
commit 7e35f4c1a8bc5db145aba54b88daf16611de803b
Author: Poornima G <pgurusid>
Date:   Mon Nov 21 19:57:08 2016 +0530

libglusterfs: Fix a read hang

Backport of http://review.gluster.org/15923

Issue:
=====
In certain cases, there was no unwind of read from the read-ahead
xlator, thus resulting in a hang.

RCA:
====
In certain cases, ioc_readv() issues STACK_WIND_TAIL() instead of
STACK_WIND(). One such case is when the inode_ctx for that file is not
present (this can happen if readdirp was called, populating md-cache and
serving all the lookups from cache).

Consider the following graph:
...
io-cache (parent)
    |
readdir-ahead
    |
read-ahead
...

Below is the code snippet of ioc_readv calling STACK_WIND_TAIL:

ioc_readv()
{
...
    if (!inode_ctx)
        STACK_WIND_TAIL (frame, FIRST_CHILD (frame->this),
                         FIRST_CHILD (frame->this)->fops->readv,
                         fd, size, offset, flags, xdata);
        /* Ideally, this stack_wind should wind to readdir-ahead:readv()
           but it winds to read-ahead:readv(). See below for the
           explanation. */
...
}

STACK_WIND_TAIL (frame, obj, fn, ...)
{
    frame->this = obj;
    /* for the above mentioned graph, frame->this will be readdir-ahead:
     * frame->this = FIRST_CHILD (frame->this), i.e. readdir-ahead, which
     * is as expected */
...
    THIS = obj;
    /* THIS will be read-ahead instead of readdir-ahead!, as obj expands
     * to "FIRST_CHILD (frame->this)" and frame->this was pointing
     * to readdir-ahead in the previous statement. */
...
    fn (frame, obj, params);
    /* fn will call read-ahead:readv() instead of readdir-ahead:readv()!
     * as fn expands to "FIRST_CHILD (frame->this)->fops->readv" and
     * frame->this was pointing to readdir-ahead in the first statement */
...
}

Thus, the readdir-ahead readv() implementation will be skipped, and
ra_readv() will be called with frame->this = "readdir-ahead" and
this = "read-ahead". This can lead to corruption / hangs / other
problems.
But in this particular case, when the 'frame->this' and 'this' passed to
ra_readv() don't match, ra_readv() ends up calling ra_readv() again!
Thus the logic of read-ahead readv() falls apart and leads to a hang.

Solution:
=========
Modify STACK_WIND_TAIL() as:

STACK_WIND_TAIL (frame, obj, fn, ...)
{
    next_xl = obj /* resolve obj first, as the variables passed in the
                     obj macro argument can be overwritten by the further
                     instructions */
    next_xl_fn = fn /* resolve fn and store it in a tmp variable, before
                       modifying any variables */
    frame->this = next_xl;
    ...
    THIS = next_xl;
    ...
    next_xl_fn (frame, next_xl, params);
    ...
}

>Reviewed-on: http://review.gluster.org/15923
>Smoke: Gluster Build System <jenkins.org>
>NetBSD-regression: NetBSD Build System <jenkins.org>
>Reviewed-by: Rajesh Joseph <rjoseph>
>CentOS-regression: Gluster Build System <jenkins.org>
>Reviewed-by: Raghavendra G <rgowdapp>

(Cherry picked from commit 8943c19a2ef51b6e4fa66cb57211d469fe558579)

BUG: 1399015
Change-Id: Ie662ac8f18fa16909376f1e59387bc5b886bd0f9
Signed-off-by: Poornima G <pgurusid>
Reviewed-on: http://review.gluster.org/15933
NetBSD-regression: NetBSD Build System <jenkins.org>
Smoke: Gluster Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
Reviewed-by: Pranith Kumar Karampuri <pkarampu>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.9.1, please open a new bug report.

glusterfs-3.9.1 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-January/029725.html
[2] https://www.gluster.org/pipermail/gluster-users/