Bug 832252
| Summary: | cifs_async_writev blocked by limited kmap on i386 with high-mem | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Jian Li <jiali> |
| Component: | kernel | Assignee: | Sachin Prabhu <sprabhu> |
| Status: | CLOSED ERRATA | QA Contact: | Jian Li <jiali> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.3 | CC: | cifs-maint, jlayton, nmurray, rwheeler, sprabhu |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-2.6.32-345.el6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-02-21 06:23:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 881827 | | |
| Attachments: | | | |
Description: Jian Li, 2012-06-15 02:06:51 UTC
Created attachment 592100 [details]
patch -- serialize kmaps in async writev marshalling code
I think this patch will likely fix the problem. It serializes the code that marshals up the kvec array. Jian, can you test this patch and let me know if it helps?
If so, then I'll need to send out an upstream version which is a bit different. This will probably also need to be rebased on top of Sachin's latest set of backports for 6.4.
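
For context, a minimal sketch of the serialization approach described above (illustrative only, not the attached patch): a single global mutex is held around the loop that kmaps each page of the write request into the kvec array, so only one request at a time consumes pkmap slots. The mutex name matches the cifs_kmap_mutex visible in the crash output below; the loop body is paraphrased rather than copied from the patch.

```c
#include <linux/mutex.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>

/* Illustrative sketch, not the attached patch. */
static DEFINE_MUTEX(cifs_kmap_mutex);

/* ... inside the async writev marshalling code, while building the kvec
 * array for a wdata (struct cifs_writedata) ... */
	mutex_lock(&cifs_kmap_mutex);
	for (i = 0; i < wdata->nr_pages; i++) {
		/* each kmap() of a highmem page consumes one pkmap slot */
		iov[i + 1].iov_base = kmap(wdata->pages[i]);
		iov[i + 1].iov_len = PAGE_CACHE_SIZE; /* last page may be shorter */
	}
	mutex_unlock(&cifs_kmap_mutex);
	/* the pages are kunmap()ed once the send completes */
```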
Hmm... scratch that. Even with that patch (which we do want), we still have to limit the rsize and wsize according to the amount of potential kmap space we have. We don't even have enough to fill a single wsize request at 16M, AFAICT:

16 * 1024 * 1024 / 4096 = 4096 pages

...and we're going to have (at most) 1024 pkmap slots available -- sometimes we'll just have 512. I think we're also going to have to cap the wsize at lower values when CONFIG_HIGHMEM is set. A more robust fix would be to teach cifs how to deal with page arrays without kmapping them into large iovecs, but that's a relatively major overhaul.

(In reply to comment #2)
> Created attachment 592100 [details]
> patch -- serialize kmaps in async writev marshalling code
>
> I think this patch will likely fix the problem. It serializes the code that
> marshals up the kvec array. Jian, can you test this patch and let me know if
> it helps?
>
> If so, then I'll need to send out an upstream version which is a bit
> different. This will probably also need to be rebased on top of Sachin's
> latest set of backports for 6.4.

Fine, I will set out to test it.

(In reply to comment #3)
** snip **

Hi Jeff, your patch couldn't stop the tasks from blocking; debug info is listed below:

crash> waitq pkmap_map_wait
PID: 10428  TASK: c1403000  CPU: 2  COMMAND: "crond"
** snip ** all crond
PID: 10412  TASK: c14d7aa0  CPU: 2  COMMAND: "crond"
PID: 10410  TASK: f4044000  CPU: 2  COMMAND: "rhsmcertd"
PID: 10407  TASK: c16ba000  CPU: 2  COMMAND: "crond"
PID: 10405  TASK: f4015000  CPU: 2  COMMAND: "crond"
PID: 10403  TASK: c1436550  CPU: 3  COMMAND: "fsx"

crash> bt
PID: 10403  TASK: c1436550  CPU: 3  COMMAND: "fsx"
#0 [e5427cbc] schedule at c083c5b3
#1 [e5427d80] kmap_high at c0500ec8
#2 [e5427db0] cifs_async_writev at f7f29cdf [cifs]
#3 [e5427df0] cifs_writepages at f7f35f6c [cifs]
** snip **

crash> bt
PID: 49  TASK: f716e000  CPU: 2  COMMAND: "bdi-default"
#0 [f71b9c94] schedule at c083c5b3
#1 [f71b9d58] __mutex_lock_slowpath at c083d943
#2 [f71b9d80] mutex_lock at c083d848
#3 [f71b9d8c] cifs_async_writev at f7f29cbb [cifs]
#4 [f71b9dcc] cifs_writepages at f7f35f6c [cifs]

crash> mutex cifs_kmap_mutex
struct mutex {
  count = {
    counter = -1
  },
  wait_lock = {
    raw_lock = {
      slock = 257
    }
  },
  wait_list = {
    next = 0xf71b9d64,
    prev = 0xf71b9d64
  },
** snip **

crash> mutex_waiter 0xf71b9d64
struct mutex_waiter {
  list = {
    next = 0xf7f577e8,
    prev = 0xf7f577e8
  },
  task = 0xf716e000    ====> bdi_default

crash> cifs_writedata f07d0000
struct cifs_writedata {
  refcount = {
    refcount = {
      counter = 1
    }
  },
  sync_mode = WB_SYNC_ALL,
  work = {
    owner = 0x0,
    flags = 0,
    ops = 0xf7f486bc,
    link = {
      next = 0xf07d0014,
      prev = 0xf07d0014
    }
  },
  cfile = 0xf2290800,
  offset = 58712064,
  bytes = 2097152,
  result = 0,
  nr_pages = 3687,    ===> again? coincidence!!
  pages = {0xcb7418a0}
}

lijian

Yes, I realized that after I posted it and said so in comment #3; sorry if I wasn't clear there. In any case, for now I think we'll need to cap the rsize/wsize at the available kmap space on these arches. I'll let you know when I have a patch that implements that cap.

Created attachment 595703 [details]
patch -- serialize kmaps in async writev marshalling code and cap wsize at available kmap address space
This patch should more or less fix this. It'll cap the wsize at the amount of kmap space you have (LAST_PKMAP * PAGE_CACHE_SIZE). Note that it's still probably a bad idea to go to the max on this, as you won't leave any mappings for other uses.
The real fix for this upstream will be to teach the underlying code how to handle arrays of pages so we never use this much kmap space. That's a bigger, more invasive fix though so something like this patch will still be needed in the interim.
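
As a rough illustration of that cap (the helper and macro names below are assumptions for the sketch, not necessarily what the patch uses): on a CONFIG_HIGHMEM kernel the wsize would be clamped to LAST_PKMAP * PAGE_CACHE_SIZE. With 4 KB pages that works out to about 4 MB when LAST_PKMAP is 1024, or 2 MB when it is 512, matching the "at most 1024 available -- sometimes we'll just have 512" figures quoted earlier.

```c
#include <linux/kernel.h>	/* min_t */
#include <linux/pagemap.h>	/* PAGE_CACHE_SIZE */
#include <linux/highmem.h>	/* pulls in LAST_PKMAP on CONFIG_HIGHMEM arches */

#ifdef CONFIG_HIGHMEM
/* total pkmap address space: LAST_PKMAP slots of one page each */
#define CIFS_KMAP_SIZE_LIMIT	((unsigned int)(LAST_PKMAP * PAGE_CACHE_SIZE))
#endif

/* Hypothetical helper for illustration; where the real patch applies the
 * clamp in the rsize/wsize negotiation is not shown here. */
static unsigned int cifs_clamp_wsize(unsigned int wsize)
{
#ifdef CONFIG_HIGHMEM
	/* never let a single write request need more kmap slots than exist */
	wsize = min_t(unsigned int, wsize, CIFS_KMAP_SIZE_LIMIT);
#endif
	return wsize;
}
```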
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.

Patches sent upstream:

http://article.gmane.org/gmane.linux.kernel.cifs/6527
http://article.gmane.org/gmane.linux.kernel.cifs/6528

I don't expect that they'll see too much resistance, but we should probably wait until they are committed before we backport to RHEL.

I'm also working on a more comprehensive fix for this problem that doesn't rely on limiting these sizes (and should also reduce our usage of kmap space). That won't be ready for a bit though.

I've also sent a patchset with the "real" fix for this for consideration for 3.6:

http://article.gmane.org/gmane.linux.kernel.cifs/6627

...once it looks like that's on track for inclusion, I'll plan to do a similar patchset for the async read code.

Patches to fix this for reads and writes are now merged for 3.7. Reassigning this to Sachin for him to handle integrating this into RHEL6.

Patch(es)

This bug is verified on 2.6.32-355.el6. On the Samba server, disable oplocks; on the client, mount with strictcache or cache=strict.

@@ reproducer @@

[root@hp-xw4600-01 ~]# dd if=/mnt/test/test.img of=/dev/zero bs=1M count=1k
^Cdd: reading `/mnt/test/test.img': Host is down
0+0 records in
0+0 records out
0 bytes (0 B) copied, 99.2635 s, 0.0 kB/s
dd: closing input file `/mnt/test/test.img': Bad file descriptor
[root@hp-xw4600-01 ~]# dmesg
** snip **
CIFS VFS: Invalid size SMB length 4 pdu_length 1048639
CIFS VFS: Send error in read = -11
CIFS VFS: Invalid size SMB length 4 pdu_length 1048639
CIFS VFS: Send error in read = -11
[root@hp-xw4600-01 ~]# uname -a
Linux hp-xw4600-01.rhts.eng.nay.redhat.com 2.6.32-315.el6.x86_64 #1 SMP Fri Sep 28 19:33:39 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@hp-xw4600-01 ~]# grep "/mnt/test" /proc/mounts
//hp-xw4600-01.rhts.eng.nay.redhat.com/test/ /mnt/test cifs rw,relatime,sec=ntlm,unc=\\hp-xw4600-01.rhts.eng.nay.redhat.com\test,username=root,uid=0,noforceuid,gid=0,noforcegid,addr=10.66.86.85,unix,posixpaths,serverino,acl,rsize=16384,wsize=65536,actimeo=1 0 0

@@ verifier @@

[root@hp-xw4600-01 ~]# mount //`hostname`/test /mnt/test -o password=redhat,strictcache,cache=strict
[root@hp-xw4600-01 ~]# grep "/mnt/test" /proc/mounts
//hp-xw4600-01.rhts.eng.nay.redhat.com/test/ /mnt/test cifs rw,relatime,sec=ntlm,cache=strict,unc=\\hp-xw4600-01.rhts.eng.nay.redhat.com\test,username=root,uid=0,noforceuid,gid=0,noforcegid,addr=10.66.86.85,unix,posixpaths,serverino,acl,rsize=16384,wsize=65536,actimeo=1 0 0
[root@hp-xw4600-01 ~]# uname -a
Linux hp-xw4600-01.rhts.eng.nay.redhat.com 2.6.32-355.el6.x86_64 #1 SMP Tue Jan 15 17:45:38 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@hp-xw4600-01 ~]# dmesg -c > /dev/null
[root@hp-xw4600-01 ~]# dd if=/mnt/test/test.img of=/dev/zero bs=1M count=1k
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 11.0813 s, 96.9 MB/s
[root@hp-xw4600-01 ~]# dmesg
[root@hp-xw4600-01 ~]#

Please ignore comment 18; it was posted in error.

This bug is verified on 2.6.32-355.el6. When a large wsize is passed on the mount command line, the system now uses a capped wsize instead of the requested value.
#mount.cifs //127.0.0.1/bz789058-wsize /mnt/bz789058-wsize -o wsize=18000000,user=root,password=redhat
#grep "/mnt/bz789054" /proc/mounts
//127.0.0.1/bz789058-wsize/ /mnt/bz789058-wsize cifs rw,relatime,sec=ntlm,cache=loose,unc=\\127.0.0.1\bz789058-wsize,username=root,uid=0,noforceuid,gid=0,noforcegid,addr=127.0.0.1,unix,posixpaths,serverino,acl,rsize=16384,wsize=2097152,actimeo=1 0 0

fsx test is done.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0496.html