Bug 832252 - cifs_async_writev blocked by limited kmap on i386 with high-mem
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Assigned To: Sachin Prabhu
QA Contact: Jian Li
Blocks: 881827
Reported: 2012-06-14 22:06 EDT by Jian Li
Modified: 2014-03-03 19:08 EST
CC: 5 users

See Also:
Fixed In Version: kernel-2.6.32-345.el6
Doc Type: Bug Fix
Last Closed: 2013-02-21 01:23:15 EST
Type: Bug


Attachments
patch -- serialize kmaps in async writev marshalling code (3.64 KB, patch)
2012-06-15 07:01 EDT, Jeff Layton
patch -- serialize kmaps in async writev marshalling code and cap wsize at available kmap address space (6.28 KB, patch)
2012-07-02 07:41 EDT, Jeff Layton

Description Jian Li 2012-06-14 22:06:51 EDT
Description of problem:
When testing CIFS I/O using fsx, if CIFS uses a wsize of 16M, the test task gets blocked.

Version-Release number of selected component (if applicable):
kernel-2.6.32-278.el6

How reproducible:
/kernel/filesystems/cifs/multihost/bz789058-wsize

Actual results:
test times out (task blocked).

Expected results:
test passes.

Additional info:
The machine is an i386 system.

Some analysis data from the vmcore:

crash> bt
PID: 2789   TASK: f02edaa0  CPU: 3   COMMAND: "fsx"
 #0 [eed63cbc] schedule at c083c5b3
 #1 [eed63d80] kmap_high at c0500ec8
 #2 [eed63db0] cifs_async_writev at f7fabcd7 [cifs]
 #3 [eed63df0] cifs_writepages at f7fb7f5c [cifs]
 #4 [eed63e50] do_writepages at c04f3e32
 #5 [eed63e54] __filemap_fdatawrite_range at c04e152a
 #6 [eed63ea4] filemap_fdatawrite at c04e1b3e
 #7 [eed63eb4] cifs_file_aio_write at f7fa111a [cifs]
 #8 [eed63ecc] do_sync_write at c052d202
 #9 [eed63f74] vfs_write at c052d4ee
#10 [eed63f94] sys_write at c052df4c
#11 [eed63fb0] ia32_sysenter_target at c0409a98
    EAX: 00000004  EBX: 00000003  ECX: abd73b73  EDX: 012a65c6
    DS:  007b      ESI: 012a65c6  ES:  007b      EDI: 00000000
    SS:  007b      ESP: bf8db178  EBP: bf8db1f8  GS:  0033
    CS:  0073      EIP: 40000424  ERR: 00000004  EFLAGS: 00000246

crash> waitq pkmap_map_wait
PID: 2819   TASK: f0283aa0  CPU: 3   COMMAND: "crond"
PID: 2815   TASK: eec15aa0  CPU: 3   COMMAND: "crond"
PID: 2813   TASK: f0335000  CPU: 3   COMMAND: "crond"
PID: 2811   TASK: f41d4aa0  CPU: 3   COMMAND: "crond"
PID: 2809   TASK: ef7bb550  CPU: 3   COMMAND: "crond"
PID: 2807   TASK: c14e0550  CPU: 3   COMMAND: "crond"
PID: 2805   TASK: c1747aa0  CPU: 3   COMMAND: "crond"
PID: 2803   TASK: f00ac550  CPU: 3   COMMAND: "crond"
PID: 2799   TASK: f4225aa0  CPU: 3   COMMAND: "crond"
PID: 2797   TASK: c1696550  CPU: 3   COMMAND: "crond"
PID: 2795   TASK: f4220000  CPU: 3   COMMAND: "crond"
PID: 2793   TASK: f00acaa0  CPU: 3   COMMAND: "crond"
PID: 2791   TASK: f4a94000  CPU: 3   COMMAND: "crond"
PID: 49     TASK: f716e000  CPU: 5   COMMAND: "bdi-default"
PID: 2789   TASK: f02edaa0  CPU: 3   COMMAND: "fsx"

And cifs_async_writev's cifs_writedata has a page array size of 3687.

crash> struct cifs_writedata eed98000
struct cifs_writedata {
  refcount = {
    refcount = {
      counter = 1
    }
  },
  sync_mode = WB_SYNC_ALL,
  work = {
    owner = 0x0,
    flags = 0,
    ops = 0xf7fca69c,
    link = {
      next = 0xeed98014,
      prev = 0xeed98014
    }
  },
  cfile = 0xf0083740,
  offset = 58712064,
  bytes = 2097152,
  result = 0,
  nr_pages = 3687,
  pages = {0xccd001e0}
}

And page 0xccd001e0 should belong to high-mem.
Comment 2 Jeff Layton 2012-06-15 07:01:38 EDT
Created attachment 592100 [details]
patch -- serialize kmaps in async writev marshalling code

I think this patch will likely fix the problem. It serializes the code that marshals up the kvec array. Jian, can you test this patch and let me know if it helps?

If so, then I'll need to send out an upstream version which is a bit different. This will probably also need to be rebased on top of Sachin's latest set of backports for 6.4.
Comment 3 Jeff Layton 2012-06-15 15:47:01 EDT
Hmm...scratch that...

Even with that patch (which we do want), we still have to limit the rsize and wsize according to the amount of potential kmap space we have. We don't even
have enough to fill a single wsize request at 16M, AFAICT:

    16 * 1024 * 1024 / 4096 = 4096

...and we're going to have (at most) 1024 available -- sometimes we'll just have 512. I think we're also going to have to cap the wsize at lower values when CONFIG_HIGHMEM is set.

A more robust fix would be to teach cifs how to deal with page arrays without kmapping them into large iovecs, but that's a relatively major overhaul.
Comment 4 Jian Li 2012-06-17 21:39:18 EDT
(In reply to comment #2)
> Created attachment 592100 [details]
> patch -- serialize kmaps in async writev marshalling code
> 
> I think this patch will likely fix the problem. It serializes the code that
> marshals up the kvec array. Jian, can you test this patch and let me know if
> it helps?
> 
> If so, then I'll need to send out an upstream version which is a bit
> different. This will probably also need to be rebased on top of Sachin's
> latest set of backports for 6.4.
Fine, I will set out to test it.
Comment 5 Jian Li 2012-06-18 02:15:19 EDT
(In reply to comment #3)
** snip**

Hi Jeff, 
your patch couldn't stop tasks from blocking; debug info is listed below:

crash> waitq pkmap_map_wait
PID: 10428  TASK: c1403000  CPU: 2   COMMAND: "crond"
** snip ** all crond
PID: 10412  TASK: c14d7aa0  CPU: 2   COMMAND: "crond"
PID: 10410  TASK: f4044000  CPU: 2   COMMAND: "rhsmcertd"
PID: 10407  TASK: c16ba000  CPU: 2   COMMAND: "crond"
PID: 10405  TASK: f4015000  CPU: 2   COMMAND: "crond"
PID: 10403  TASK: c1436550  CPU: 3   COMMAND: "fsx"
crash> bt
PID: 10403  TASK: c1436550  CPU: 3   COMMAND: "fsx"
 #0 [e5427cbc] schedule at c083c5b3
 #1 [e5427d80] kmap_high at c0500ec8
 #2 [e5427db0] cifs_async_writev at f7f29cdf [cifs]
 #3 [e5427df0] cifs_writepages at f7f35f6c [cifs]
**snip**
crash> bt
PID: 49     TASK: f716e000  CPU: 2   COMMAND: "bdi-default"
 #0 [f71b9c94] schedule at c083c5b3
 #1 [f71b9d58] __mutex_lock_slowpath at c083d943
 #2 [f71b9d80] mutex_lock at c083d848
 #3 [f71b9d8c] cifs_async_writev at f7f29cbb [cifs]
 #4 [f71b9dcc] cifs_writepages at f7f35f6c [cifs]
crash> mutex cifs_kmap_mutex
struct mutex {
  count = {
    counter = -1
  },   
  wait_lock = {
    raw_lock = {
      slock = 257
    }  
  },   
  wait_list = {
    next = 0xf71b9d64,
    prev = 0xf71b9d64
  },   
**snip**
crash> mutex_waiter 0xf71b9d64
struct mutex_waiter {
  list = {
    next = 0xf7f577e8,
    prev = 0xf7f577e8
  },   
  task = 0xf716e000    ====> bdi_default
crash> cifs_writedata f07d0000
struct cifs_writedata {
  refcount = {
    refcount = {
      counter = 1
    }
  }, 
  sync_mode = WB_SYNC_ALL, 
  work = {
    owner = 0x0, 
    flags = 0, 
    ops = 0xf7f486bc, 
    link = {
      next = 0xf07d0014, 
      prev = 0xf07d0014
    }
  }, 
  cfile = 0xf2290800, 
  offset = 58712064, 
  bytes = 2097152, 
  result = 0, 
  nr_pages = 3687,    ===> again?  coincidence!!
  pages = {0xcb7418a0}
}

lijian
Comment 6 Jeff Layton 2012-06-18 10:54:55 EDT
Yes, I realized that after I posted it and said so in comment #3, sorry if I wasn't clear there...

In any case, for now I think we'll need to cap the rsize/wsize at the available kmap space on these arches. I'll let you know when I have a patch that will implement that cap.
Comment 7 Jeff Layton 2012-07-02 07:41:41 EDT
Created attachment 595703 [details]
patch -- serialize kmaps in async writev marshalling code and cap wsize at available kmap address space

This patch should more or less fix this. It'll cap the wsize at the amount of kmap space you have (LAST_PKMAP * PAGE_CACHE_SIZE). Note that it's still probably a bad idea to go to the max on this, as you won't leave any mappings for other uses.

The real fix for this upstream will be to teach the underlying code how to handle arrays of pages so we never use this much kmap space. That's a bigger, more invasive fix though so something like this patch will still be needed in the interim.
Comment 9 RHEL Product and Program Management 2012-07-09 12:10:47 EDT
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.
Comment 10 Jeff Layton 2012-07-11 09:13:37 EDT
Patches sent upstream:

    http://article.gmane.org/gmane.linux.kernel.cifs/6527
    http://article.gmane.org/gmane.linux.kernel.cifs/6528

I don't expect that they'll see too much resistance, but we should probably wait until they are committed before we backport to RHEL.

I'm also working on a more comprehensive fix for this problem that doesn't rely on limiting these sizes (and should also reduce our usage of kmap space). That won't be ready for a bit though.
Comment 11 Jeff Layton 2012-07-18 14:14:17 EDT
I've also sent a patchset with the "real" fix for this for consideration for 3.6:

    http://article.gmane.org/gmane.linux.kernel.cifs/6627

...once it looks like that's on track for inclusion, I'll plan to do a similar patchset for the async read code.
Comment 12 Jeff Layton 2012-10-05 13:43:01 EDT
Patches to fix this for reads and writes are now merged for 3.7. Reassigning this to Sachin for him to handle integrating this into RHEL6.
Comment 15 Jarod Wilson 2012-12-04 12:31:42 EST
Patch(es)
Comment 18 Jian Li 2013-01-20 23:04:21 EST
This bug is verified on 2.6.32-355.el6. In samba, disable oplocks; in the client, use strictcache or cache=strict.

@@ reproducer@@ 

[root@hp-xw4600-01 ~]# dd if=/mnt/test/test.img of=/dev/zero bs=1M count=1k

^Cdd: reading `/mnt/test/test.img': Host is down
0+0 records in
0+0 records out
0 bytes (0 B) copied, 99.2635 s, 0.0 kB/s
dd: closing input file `/mnt/test/test.img': Bad file descriptor
[root@hp-xw4600-01 ~]# dmesg
** snip **
CIFS VFS: Invalid size SMB length 4 pdu_length 1048639
CIFS VFS: Send error in read = -11
CIFS VFS: Invalid size SMB length 4 pdu_length 1048639
CIFS VFS: Send error in read = -11
[root@hp-xw4600-01 ~]# uname -a 
Linux hp-xw4600-01.rhts.eng.nay.redhat.com 2.6.32-315.el6.x86_64 #1 SMP Fri Sep 28 19:33:39 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@hp-xw4600-01 ~]# grep "/mnt/test" /proc/mounts
//hp-xw4600-01.rhts.eng.nay.redhat.com/test/ /mnt/test cifs rw,relatime,sec=ntlm,unc=\\hp-xw4600-01.rhts.eng.nay.redhat.com\test,username=root,uid=0,noforceuid,gid=0,noforcegid,addr=10.66.86.85,unix,posixpaths,serverino,acl,rsize=16384,wsize=65536,actimeo=1 0 0


@@ verifier @@:
[root@hp-xw4600-01 ~]# mount //`hostname`/test /mnt/test -o password=redhat,strictcache,cache=strict
[root@hp-xw4600-01 ~]# grep "/mnt/test" /proc/mounts
//hp-xw4600-01.rhts.eng.nay.redhat.com/test/ /mnt/test cifs rw,relatime,sec=ntlm,cache=strict,unc=\\hp-xw4600-01.rhts.eng.nay.redhat.com\test,username=root,uid=0,noforceuid,gid=0,noforcegid,addr=10.66.86.85,unix,posixpaths,serverino,acl,rsize=16384,wsize=65536,actimeo=1 0 0
[root@hp-xw4600-01 ~]# uname -a
Linux hp-xw4600-01.rhts.eng.nay.redhat.com 2.6.32-355.el6.x86_64 #1 SMP Tue Jan 15 17:45:38 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@hp-xw4600-01 ~]# dmesg -c > /dev/null
[root@hp-xw4600-01 ~]# dd if=/mnt/test/test.img of=/dev/zero bs=1M count=1k
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 11.0813 s, 96.9 MB/s
[root@hp-xw4600-01 ~]# dmesg
[root@hp-xw4600-01 ~]#
Comment 19 Jian Li 2013-01-20 23:06:05 EST
PLEASE ignore comment 18. FAULTY COMMENT!!
Comment 20 Jian Li 2013-01-21 00:02:13 EST
This bug is verified on 2.6.32-355.el6. When a large wsize is used in the mount command, the system uses a capped wsize instead of the provided one.

#mount.cifs //127.0.0.1/bz789058-wsize /mnt/bz789058-wsize -o wsize=18000000,user=root,password=redhat
#grep "/mnt/bz789058" /proc/mounts
//127.0.0.1/bz789058-wsize/ /mnt/bz789058-wsize cifs rw,relatime,sec=ntlm,cache=loose,unc=\\127.0.0.1\bz789058-wsize,username=root,uid=0,noforceuid,gid=0,noforcegid,addr=127.0.0.1,unix,posixpaths,serverino,acl,rsize=16384,wsize=2097152,actimeo=1 0 0


fsx test is done.
Comment 22 errata-xmlrpc 2013-02-21 01:23:15 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0496.html
