Bug 764964 (GLUSTER-3232)

Summary: deadlock related to transparent hugepage migration in kernels >= 2.6.32
Product: [Community] GlusterFS
Reporter: Joe Julian <joe>
Component: fuse
Assignee: Anand Avati <aavati>
Status: CLOSED CURRENTRELEASE
Severity: low
Priority: medium
Version: mainline
CC: amarts, chrisw, chyd96, csaba, daniel.ortiz, fharshav, gluster-bugs, jdarcy, pasteur, toracat
Keywords: Triaged
Hardware: All
OS: Linux
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Clones: 848331 (view as bug list)
Last Closed: 2013-07-24 17:22:06 UTC
Mount Type: fuse
Bug Blocks: 848331
Attachments:
  strace (flags: none)
  gdb backtrace (flags: none)

Description Joe Julian 2011-07-22 19:37:22 UTC
Here's a trace of a lockup occurrence (the file is 6913357 bytes):

[2011-07-22 15:10:35.362556] I [trace.c:553:trace_lookup_cbk] 0-home-trace: 414095: gfid=00000000-0000-0000-0000-000000000000 (op_ret=0 *buf {gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 ino=14192555, mode=100600, nlink=1, uid=292, gid=100, size=6913357, blocks=13512, atime=[Jul 21 09:29:46], mtime=[Jul 22 14:51:54], ctime=[Jul 22 14:51:54]}, *postparent {gfid=bdb4141a-0145-4408-be2c-0d5a762b4f5b ino=25191324, mode=40700, nlink=5, uid=292, gid=100, size=16384, blocks=64, atime=[Jul 22 14:20:37], mtime=[Jul 22 03:02:55], ctime=[Jul 22 03:02:55]}
[2011-07-22 15:11:09.476851] I [trace.c:1446:trace_lookup] 0-home-trace: 429072: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 path=/SUZW/.thunderbird/Profiles/fcclfrdy.default/ImapMail/imap-2/INBOX.msf
[2011-07-22 15:11:09.489842] I [trace.c:553:trace_lookup_cbk] 0-home-trace: 429072: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 (op_ret=0 *buf {gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 ino=56770223, mode=100600, nlink=1, uid=292, gid=100, size=6913357, blocks=13512, atime=[Jul 21 09:29:46], mtime=[Jul 22 14:51:54], ctime=[Jul 22 14:51:54]}, *postparent {gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Dec 31 16:00:00], mtime=[Dec 31 16:00:00], ctime=[Dec 31 16:00:00]}
[2011-07-22 15:11:09.489943] I [trace.c:1783:trace_open] 0-home-trace: 429073: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 path=/SUZW/.thunderbird/Profiles/fcclfrdy.default/ImapMail/imap-2/INBOX.msf flags=32768 fd=0x7fddc9f9f024 wbflags=0
[2011-07-22 15:11:09.490601] I [trace.c:149:trace_open_cbk] 0-home-trace: 429073: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 op_ret=0, op_errno=117, *fd=0x7fddc9f9f024
[2011-07-22 15:11:09.490969] I [trace.c:1821:trace_readv] 0-home-trace: 429074: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 fd=0x7fddc9f9f024, size=131072, offset=0)
[2011-07-22 15:11:09.491124] I [trace.c:1821:trace_readv] 0-home-trace: 429075: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 fd=0x7fddc9f9f024, size=131072, offset=131072)
[2011-07-22 15:11:09.510378] I [trace.c:203:trace_readv_cbk] 0-home-trace: 429074: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 op_ret=131072 buf=gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Dec 31 16:00:00], mtime=[Dec 31 16:00:00], ctime=[Dec 31 16:00:00]
[2011-07-22 15:11:09.510752] I [trace.c:1821:trace_readv] 0-home-trace: 429076: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 fd=0x7fddc9f9f024, size=131072, offset=262144)
[2011-07-22 15:11:09.511951] I [trace.c:203:trace_readv_cbk] 0-home-trace: 429075: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 op_ret=131072 buf=gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Dec 31 16:00:00], mtime=[Dec 31 16:00:00], ctime=[Dec 31 16:00:00]
[2011-07-22 15:11:09.512299] I [trace.c:1821:trace_readv] 0-home-trace: 429077: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 fd=0x7fddc9f9f024, size=131072, offset=393216)
[2011-07-22 15:11:09.523417] I [trace.c:203:trace_readv_cbk] 0-home-trace: 429076: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 op_ret=131072 buf=gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Dec 31 16:00:00], mtime=[Dec 31 16:00:00], ctime=[Dec 31 16:00:00]
[2011-07-22 15:11:09.525530] I [trace.c:203:trace_readv_cbk] 0-home-trace: 429077: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 op_ret=131072 buf=gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Dec 31 16:00:00], mtime=[Dec 31 16:00:00], ctime=[Dec 31 16:00:00]
[2011-07-22 15:11:09.526230] I [trace.c:1821:trace_readv] 0-home-trace: 429078: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 fd=0x7fddc9f9f024, size=131072, offset=524288)
[2011-07-22 15:11:09.526386] I [trace.c:1821:trace_readv] 0-home-trace: 429079: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 fd=0x7fddc9f9f024, size=131072, offset=655360)
[2011-07-22 15:11:09.529856] I [trace.c:203:trace_readv_cbk] 0-home-trace: 429078: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 op_ret=131072 buf=gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Dec 31 16:00:00], mtime=[Dec 31 16:00:00], ctime=[Dec 31 16:00:00]
[2011-07-22 15:11:09.531257] I [trace.c:203:trace_readv_cbk] 0-home-trace: 429079: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 op_ret=131072 buf=gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Dec 31 16:00:00], mtime=[Dec 31 16:00:00], ctime=[Dec 31 16:00:00]
[2011-07-22 15:11:09.532697] I [trace.c:1821:trace_readv] 0-home-trace: 429080: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 fd=0x7fddc9f9f024, size=131072, offset=786432)
[2011-07-22 15:11:09.532920] I [trace.c:203:trace_readv_cbk] 0-home-trace: 429080: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 op_ret=131072 buf=gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Dec 31 16:00:00], mtime=[Dec 31 16:00:00], ctime=[Dec 31 16:00:00]
[2011-07-22 15:11:09.533133] I [trace.c:1821:trace_readv] 0-home-trace: 429081: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 fd=0x7fddc9f9f024, size=131072, offset=917504)
[2011-07-22 15:11:09.535794] I [trace.c:1821:trace_readv] 0-home-trace: 429082: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 fd=0x7fddc9f9f024, size=131072, offset=1048576)
[2011-07-22 15:11:09.536397] I [trace.c:203:trace_readv_cbk] 0-home-trace: 429081: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 op_ret=131072 buf=gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Dec 31 16:00:00], mtime=[Dec 31 16:00:00], ctime=[Dec 31 16:00:00]
[2011-07-22 15:11:09.537203] I [trace.c:1821:trace_readv] 0-home-trace: 429083: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 fd=0x7fddc9f9f024, size=131072, offset=1179648)
[2011-07-22 15:11:09.537915] I [trace.c:203:trace_readv_cbk] 0-home-trace: 429082: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 op_ret=131072 buf=gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Dec 31 16:00:00], mtime=[Dec 31 16:00:00], ctime=[Dec 31 16:00:00]
[2011-07-22 15:11:09.542337] I [trace.c:1821:trace_readv] 0-home-trace: 429084: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 fd=0x7fddc9f9f024, size=131072, offset=1310720)
[2011-07-22 15:11:09.542895] I [trace.c:203:trace_readv_cbk] 0-home-trace: 429083: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 op_ret=131072 buf=gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Dec 31 16:00:00], mtime=[Dec 31 16:00:00], ctime=[Dec 31 16:00:00]
[2011-07-22 15:11:09.543441] I [trace.c:1821:trace_readv] 0-home-trace: 429085: gfid=0afec1e0-6af1-44e1-b345-d0f299a3e880 fd=0x7fddc9f9f024, size=131072, offset=1441792)

Comment 1 Joe Julian 2011-07-22 19:51:04 UTC
These are the server dump files. The client won't dump anything on SIGUSR1; it just creates an empty file.

http://joejulian.name/scratch/serverdumps.tar.bz2

Comment 2 Joe Julian 2011-07-22 20:16:56 UTC
full loglevel=TRACE log:

http://joejulian.name/scratch/home-log-level-trace.log.bz2

Comment 3 Joe Julian 2011-07-22 22:32:00 UTC
I'm encountering a lockup problem when reading large numbers of files. I cannot break out of the race in gdb; ps locks up when it tries to read that process's data, and df (of course) locks up too. No kill signals have any effect. The only way out of it is umount -f.

Comment 4 Joe Julian 2011-07-23 14:03:22 UTC
Could this be bug 764743?

Comment 5 Amar Tumballi 2011-07-25 06:06:46 UTC
We need the 'volume info' output and the type of operation being performed on the volume. That would help to re-create the issue.

-Amar

Comment 6 Amar Tumballi 2011-08-08 04:10:28 UTC
Joe, is the issue still happening for you on the latest release-3.1 branch? Should I close the bug?

Comment 7 Joe Julian 2011-08-08 04:15:59 UTC
release-3.1 has not been tested directly. I have cherry-picked patches into 3.1.5 as suggested by you and Anand Avati. We have isolated the lockup to something in io-cache, but not fixed it. I currently have io-cache disabled in production.

Avati has logged on to the system with the lockup to see the problem first hand.

Comment 8 Amar Tumballi 2011-10-02 23:04:16 UTC
Joe, is this still happening? Let me know if the latest release has worked fine for you. I would like to close the bug if this issue is not happening anymore.

Comment 9 Joe Julian 2011-10-02 23:50:18 UTC
I won't be able to test it until Tuesday night. 

I've been running with io-cache off ever since (per Avati), and the only patches to that translator have been to change the license.

Comment 10 Joe Julian 2011-10-05 04:28:22 UTC
Yes, with 3.1.7 I still get the same hard freeze of the process if io-cache is enabled.

Comment 11 Joe Julian 2011-10-11 22:23:13 UTC
Created attachment 704 (strace)

Comment 12 Joe Julian 2011-10-11 22:23:53 UTC
Created attachment 705 (gdb backtrace)

Comment 13 Joe Julian 2011-10-11 22:24:08 UTC
I'm going to go with the assumption that I've found a new way to reproduce the problem. This happens regardless of the translators used. I suspect that disabling io-cache simply changes some unrelated dynamic.

I've upgraded one of my VM hosts to CentOS 6.0. Without any specific sequence of events, I'm finding that the client for my vm volume keeps hanging. Sending USR1 still produces a zero-length state file, and touching the process in any way (ps, cat'ing anything under its pid in /proc, ls of the mountpoint) hangs the shell to the point where it cannot be interrupted. The glusterfs process is, as near as I can tell, in the D state.

Attached is an strace going back as far as my screen buffer reached. The very last line happened when I did an ls of the mountpoint after a vmimage had frozen.

When frozen, the glusterfs process will not respond to any signal but USR1.

The attached backtrace may or may not have anything useful. I hit ctrl-c in gdb when the hang happened, but it wouldn't interrupt; only when I ran umount -f on the volume was I able to break out and get this backtrace.

Volume Name: vmimages
Type: Distributed-Replicate
Status: Started
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: ewcs2:/var/spool/glusterfs/a_vmimages
Brick2: ewcs4:/var/spool/glusterfs/a_vmimages
Brick3: ewcs7:/var/spool/glusterfs/a_vmimages
Brick4: ewcs2:/var/spool/glusterfs/b_vmimages
Brick5: ewcs4:/var/spool/glusterfs/b_vmimages
Brick6: ewcs7:/var/spool/glusterfs/b_vmimages
Brick7: ewcs2:/var/spool/glusterfs/c_vmimages
Brick8: ewcs4:/var/spool/glusterfs/c_vmimages
Brick9: ewcs7:/var/spool/glusterfs/c_vmimages
Brick10: ewcs2:/var/spool/glusterfs/d_vmimages
Brick11: ewcs4:/var/spool/glusterfs/d_vmimages
Brick12: ewcs7:/var/spool/glusterfs/d_vmimages


dmesg had this (the glusterfs task is blocked in the kernel's page migration/compaction path while handling a page fault for a transparent huge page):
INFO: task glusterfs:7811 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfs     D ffffffff8110c330     0  7811   7810 0x00000081
 ffff8802239c39b8 0000000000000086 ffff8802239c3948 ffffffff81447a6c
 ffff8801bdf2f600 ffffea00063a5c00 000000000000000e ffff8800283138c0
 ffff8800bd8705f8 ffff8802239c3fd8 0000000000010518 ffff8800bd8705f8
Call Trace:
 [<ffffffff81447a6c>] ? ip_finish_output+0x13c/0x310
 [<ffffffff8109bba9>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff8110c330>] ? sync_page+0x0/0x50
 [<ffffffff814c9a53>] io_schedule+0x73/0xc0
 [<ffffffff8110c36d>] sync_page+0x3d/0x50
 [<ffffffff814ca17a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff8110c307>] __lock_page+0x67/0x70
 [<ffffffff81091ee0>] ? wake_bit_function+0x0/0x50
 [<ffffffff81122781>] ? lru_cache_add_lru+0x21/0x40
 [<ffffffff8115bf10>] lock_page+0x30/0x40
 [<ffffffff8115c58d>] migrate_pages+0x59d/0x5d0
 [<ffffffff811226d7>] ? ____pagevec_lru_add+0x167/0x180
 [<ffffffff81152b20>] ? compaction_alloc+0x0/0x370
 [<ffffffff811525cc>] compact_zone+0x4cc/0x600
 [<ffffffff8111cffc>] ? get_page_from_freelist+0x15c/0x820
 [<ffffffff8115297e>] compact_zone_order+0x7e/0xb0
 [<ffffffff81152ab9>] try_to_compact_pages+0x109/0x170
 [<ffffffff8111e99d>] __alloc_pages_nodemask+0x5ed/0x850
 [<ffffffff810c6b88>] ? start_callback+0xb8/0xd0
 [<ffffffff810c6a35>] ? finish_callback+0xa5/0x140
 [<ffffffff810c8058>] ? finish_report+0x78/0xe0
 [<ffffffff81150db3>] alloc_pages_vma+0x93/0x150
 [<ffffffff81167f15>] do_huge_pmd_anonymous_page+0x135/0x340
 [<ffffffff810c71dc>] ? utrace_stop+0x12c/0x1e0
 [<ffffffff811367c5>] handle_mm_fault+0x245/0x2b0
 [<ffffffff814ce503>] do_page_fault+0x123/0x3a0
 [<ffffffff814cbf75>] page_fault+0x25/0x30

Comment 14 Joe Julian 2011-10-12 14:13:44 UTC
I've found the problem.

There's an issue with khugepaged and its interaction with userspace filesystems. http://kerneltrap.org/mailarchive/linux-kernel/2010/11/4/4641128/thread

I was able to work around it by disabling THP entirely:
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

Someone else on the IRC channel said he was able to just disable hugepage defrag though:
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
echo no > /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag

Apparently Debian defaults /sys/kernel/mm/transparent_hugepage/enabled to a different setting, madvise (note that the pathname is also different, which matters if you're planning around my next suggestion).

I'm not even sure if this is something you can even program around. If not, maybe add a check for THP and issue a warning?

Comment 16 Amar Tumballi 2011-10-28 09:55:53 UTC
Currently, as the scope of the fix is outside the GlusterFS code, marking this with 'enhancement' severity.

Comment 17 Amar Tumballi 2012-02-23 10:16:32 UTC
Joe, reducing the priority of the bug to 'medium', as this issue is not seen in higher versions of the kernel.

Comment 18 Joe Julian 2012-02-23 13:58:16 UTC
(In reply to comment #17)
> Joe, reducing the priority of the bug to 'medium', as this issue is not seen
> in higher versions of the kernel.

Which kernel? I'm still seeing it with the latest RHEL6.

Comment 19 cheng 2012-05-10 01:39:49 UTC
I'm seeing it in CentOS 6.0 (2.6.32-71.el6.x86_64)

Comment 20 Harshavardhana 2012-06-04 18:15:44 UTC
Is this issue still valid, Joe?

Comment 21 Joe Julian 2012-06-04 18:22:34 UTC
Yes, it still occurs with the latest kernel in CentOS 6.2.

Comment 22 Jeff Darcy 2012-06-04 18:29:43 UTC
Do you mean it still happens without the workaround, or even with the workaround in place?

Comment 23 Joe Julian 2012-06-04 19:23:21 UTC
kernel-2.6.32-220.17.1.el6.x86_64 with hugepages enabled (specifically the transparent ones) still causes lock races with any fuse-based filesystem.

Disabling THP is still a valid workaround.

Comment 24 Harshavardhana 2012-06-08 21:18:36 UTC
Joe, is there a reproducible case for this, like any sort of special application?

Comment 25 Joe Julian 2012-09-18 06:14:27 UTC
Odd, I thought I had posted an update to this quite some time ago. At least as of 2.6.32-220.23.1.el6.x86_64 it is no longer a problem.

Comment 26 Harshavardhana 2013-04-30 21:06:44 UTC
RHEL 6.4 has the following fixes, quite possibly fixing the THP issue seen on RHEL 6.3:

=========
- [mm] compaction: clear PG_migrate_skip based on compaction and reclaim activity (Rik van Riel) [713546 783248]
- [mm] compaction: fix bit ranges in {get,clear,set}_pageblock_skip() (Rik van Riel) [713546 783248]
- [mm] compaction: Restart compaction from near where it left off (Rik van Riel) [713546 783248]
- [mm] compaction: Cache if a pageblock was scanned and no pages were isolated (Rik van Riel) [713546 783248]
- [mm] compaction: Abort compaction loop if lock is contended or run too long (Rik van Riel) [713546 783248]
- [mm] compaction: Abort async compaction if locks are contended or taking too long (Rik van Riel) [713546 783248]
- [mm] compaction: introduce sync-light migration for use by compaction (Rik van Riel) [713546 783248]
- [mm] compaction: allow compaction to isolate dirty pages (Rik van Riel) [713546 783248]
- [mm] compaction: make isolate_lru_page() filter-aware again (Rik van Riel) [713546 783248]
- [mm] compaction: make isolate_lru_page() filter-aware (Rik van Riel) [713546 783248]
- [mm] compaction: determine if dirty pages can be migrated without blocking within ->migratepage (Rik van Riel) [713546 783248]
- [mm] compaction: use synchronous compaction for /proc/sys/vm/compact_memory (Rik van Riel) [713546 783248]
=========

Comment 27 Akemi Yagi 2014-02-22 14:33:39 UTC
There is a KB solution article related to this bug report:

https://access.redhat.com/site/solutions/362804