612839 – read() and write() truncate buffer length

Bug 612839 - read() and write() truncate buffer length

Summary: read() and write() truncate buffer length

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.6
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Edward Shishkin
QA Contact:	Zhang Kexin
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2010-07-09 08:20 UTC by Lachlan McIlroy
Modified:	2018-11-14 19:25 UTC (History)
CC List:	13 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-06-24 15:18:08 UTC
Target Upstream Version:
Embargoed:
Dependent Products:
Flags:	Chris.Bontempi: needinfo+

Attachments	(Terms of Use)
Test program (640 bytes, text/plain) 2010-07-09 08:20 UTC, Lachlan McIlroy	no flags	Details
Patch to remove excessive write length truncation (840 bytes, patch) 2010-08-02 03:43 UTC, Lachlan McIlroy	no flags	Details \| Diff
Patch to allow >2GB reads/writes. (2.16 KB, patch) 2010-08-04 05:51 UTC, Lachlan McIlroy	no flags	Details \| Diff
Link to changeset e28cc71572da38a5a12c1cfe4d7032017adccf69 (88 bytes, text/plain) 2010-10-04 14:32 UTC, Edward Shishkin	no flags	Details
Show Obsolete (2) View All

Description Lachlan McIlroy 2010-07-09 08:20:28 UTC

Created attachment 430558 [details]
Test program

Description of problem:

Here is the implementation of write() in the 2.6.18 kernel. Notice that it does receive count as a size_t, and that it calls vfs_write().

asmlinkage ssize_t sys_write(unsigned int fd, const char __user * buf, size_t count)
{
     struct file *file;
ssize_t ret = -EBADF;
     int fput_needed;
     file = fget_light(fd, &fput_needed);
     if (file) {
             loff_t pos = file_pos_read(file);
             ret = vfs_write(file, buf, count, &pos);
file_pos_write(file, pos);
             fput_light(file, fput_needed);
     }
     return ret;
}

In vfs_write we call rw_verify_area(), the return is stored in an ssize_t.

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
     ssize_t ret;
     if (!(file->f_mode & FMODE_WRITE))
             return -EBADF;
     if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write))
             return -EINVAL;
     if (unlikely(!access_ok(VERIFY_READ, buf, count)))
             return -EFAULT;
     ret = rw_verify_area(WRITE, file, pos, count);
     if (ret >= 0) {
             count = ret;


Finally the rw_verify_data() function. We notice here that there is a comment describing the truncation of the count...

/*
 * rw_verify_area doesn't like huge counts. We limit
 * them to something that fits in "int" so that others
 * won't have to do range checks all the time.
 */
#define MAX_RW_COUNT (INT_MAX & PAGE_CACHE_MASK)

int rw_verify_area(int read_write, struct file *file, loff_t *ppos, size_t count)
{
     struct inode *inode;
     loff_t pos;
     if (unlikely((ssize_t) count < 0))
     goto Einval;
     pos = *ppos;
     if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
             goto Einval;
     inode = file->f_dentry->d_inode;
     if (unlikely(inode->i_flock && MANDATORY_LOCK(inode))) {
             int retval = locks_mandatory_area(
                     read_write == READ ? FLOCK_VERIFY_READ : FLOCK_VERIFY_WRITE,
                     inode, file, pos, count);
             if (retval < 0)
                     return retval;
     }
     return count > MAX_RW_COUNT ? MAX_RW_COUNT : count;
Einval:
     return -EINVAL;
}

Here we can clearly see "return count > MAX_RW_COUNT ? MAX_RW_COUNT : count;" which returns MAX_RW_COUNT for any count value over MAX_RW_COUNT. Thus we have our limit.

Version-Release number of selected component (if applicable):
Observed with kernel-2.6.18-204.

How reproducible:
Always

Steps to Reproduce:
Run the attached test program.  It creates a file and tries to write a buffer of length INT_MAX (2147483647) bytes but the resultant file only has 2147479552 bytes.

The same problem can be seen in the read() system call with this dd strace:

# strace dd if=/dev/zero of=/dev/null bs=2147483647 count=1
...
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2147483647) = 2147479552
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2147479552) = 2147479552
...

Comment 2 Lachlan McIlroy 2010-08-02 03:43:18 UTC

Created attachment 435949 [details]
Patch to remove excessive write length truncation

This patch fixes the write length truncation to be precisely INT_MAX bytes (rather than rounding down INT_MAX to a page boundary).  This allows writes of INT_MAX bytes to succeed in one call.

Comment 6 Lachlan McIlroy 2010-08-04 05:51:24 UTC

Created attachment 436437 [details]
Patch to allow >2GB reads/writes.

This patch removes the 2GB limitation for a single read() or write() call on 64 bit architectures.

Comment 7 Lachlan McIlroy 2010-08-06 04:24:30 UTC

Comment on attachment 430558 [details]
Test program

This test program was to verify the initial patch that removed the page rounding and doesn't apply to the new patch.

Comment 8 Lachlan McIlroy 2010-08-06 04:39:10 UTC

To test the latest patch, use:

$ dd if=/dev/zero of=/dev/null bs=2G count=1
0+1 records in
0+1 records out
2147483648 bytes (2.1 GB) copied, 1.55933 s, 1.4 GB/s

$ dd if=/dev/zero of=/dev/null bs=4G count=1
0+1 records in
0+1 records out
4294967296 bytes (2.1 GB) copied, 1.56212 s, 1.4 GB/s

The bytes copied should exactly match the buffer size.  Previously it would be truncated to 2147479552 bytes.  Keep increasing the bs value if memory permits.

Comment 13 Eric Sandeen 2010-08-31 23:48:58 UTC

Edward said he'd take a look at this from the engineering side...

Comment 20 Edward Shishkin 2010-10-04 14:32:44 UTC

Created attachment 451414 [details]
Link to changeset e28cc71572da38a5a12c1cfe4d7032017adccf69

Added a link to the changeset e28cc71572da38a5a12c1cfe4d7032017adccf69

Comment 21 Eric Sandeen 2010-10-04 21:19:40 UTC

Hm, just as a note, the limit went in with this commit:

commit e28cc71572da38a5a12c1cfe4d7032017adccf69
Author: Linus Torvalds <torvalds.org>
Date:   Wed Jan 4 16:20:40 2006 -0800

    Relax the rw_verify_area() error checking.
    
    In particular, allow over-large read- or write-requests to be downgraded
    to a more reasonable range, rather than considering them outright errors.
    
    We want to protect lower layers from (the sadly all too common) overflow
    conditions, but prefer to do so by chopping the requests up, rather than
    just refusing them outright.
    
    Cc: Peter Anvin <hpa>
    Cc: Ulrich Drepper <drepper>
    Cc: Andi Kleen <ak>
    Cc: Al Viro <viro.org.uk>
    Signed-off-by: Linus Torvalds <torvalds>

I wonder what errors were present in lower layers, and whether they would be re-exposed by changing this.  (commit is 2.6.15-era)

Comment 22 Eric Sandeen 2010-10-04 21:32:52 UTC

Oh right, which is just what Edward posted, sorry!

Comment 23 Eric Sandeen 2010-10-04 22:12:49 UTC

Just looking at the history of this INT_MAX limit ...

2832e9366a1fcd6f76957a42157be041240f994e took out a check for count > f_maxcount and replaced with INT_MAX since f_maxcount was hardcoded to that anyway.  So that's where the INT_MAX came from.

f_maxcount was added long ago (2005, bkrev 41f71cbbbAqnp67z79i7SSVQGtmQzg):

Add 'f_maxcount' to allow filesystems to set a per-file maximum IO size.

but it was hardcoded to INT_MAX and never changed until it was removed.

See:
http://www.mail-archive.com/bk-commits-head@vger.kernel.org/msg00617.html

and 

http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg92056.html

So I wonder what this old INT_MAX (used to be f_maxcount) was designed to protect?

Edward, I agree that it'd change that old upstream commit to allow larger count, but proposing it upstream may be the right direction.  It may just be an old limit that needs to be revisited.  Or if there's a valid reason to limit it here, someone will speak up :)

(I assume we have the same behavior in RHEL6 anyway?  And if so we shouldn't fix it in RHEL5 without also fixing RHEL6)

Comment 27 Eric Sandeen 2010-10-05 02:40:44 UTC

One more note - this same behavior exists upstream today, so "fixing" it here would require rhel6 and upstream fixes.

Comment 30 Edward Shishkin 2010-10-05 18:06:12 UTC

(In reply to comment #21)
> Hm, just as a note, the limit went in with this commit:
> 
> commit e28cc71572da38a5a12c1cfe4d7032017adccf69
> Author: Linus Torvalds <torvalds.org>
> Date:   Wed Jan 4 16:20:40 2006 -0800
> 
>     Relax the rw_verify_area() error checking.
> 
>     In particular, allow over-large read- or write-requests to be downgraded
>     to a more reasonable range, rather than considering them outright errors.
> 
>     We want to protect lower layers from (the sadly all too common) overflow
>     conditions, but prefer to do so by chopping the requests up, rather than
>     just refusing them outright.
> 
>     Cc: Peter Anvin <hpa>
>     Cc: Ulrich Drepper <drepper>
>     Cc: Andi Kleen <ak>
>     Cc: Al Viro <viro.org.uk>
>     Signed-off-by: Linus Torvalds <torvalds>
> 
> I wonder what errors were present in lower layers, and whether they would be
> re-exposed by changing this.  (commit is 2.6.15-era)

I guess this is the direct_io subsystem who exposes INT_MAX limitation. No?

Comment 34 QiangGao 2010-11-04 05:22:47 UTC

x86_64:
strace dd if=/dev/zero of=/dev/null  bs=2G count=1:
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2147483648) = 2147479552
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2147479552) = 2147479552

Comment 46 Eric Sandeen 2011-04-05 16:58:37 UTC

Chris, what info are you looking for, and from who?  I see that you set a needinfo flag on this bug.

Thanks,
-Eric

Comment 54 Eric Sandeen 2011-05-10 17:24:21 UTC

Edward and I spoke with the customer on the phone today, and it sounds like they will be able to modify their IO library to accept these very large IO requests, and issue smaller direct-IO chunks in a loop to fulfill that request.

We stressed that if they had any problems with this approach, that they are welcome to let us know, and we can help them make it work.

We also pointed out that the read() and write() calls do return the actual number of bytes written, so the calling application can in fact know whether or not the entire request was processed, and proceed accordingly.

Thanks,
-Eric

Comment 55 RHEL Program Management 2011-06-20 21:59:16 UTC

This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.7 and Red Hat does not plan to fix this issue the currently developed update.

Contact your manager or support representative in case you need to escalate this bug.

Comment 56 Eric Sandeen 2011-06-24 15:18:08 UTC

Given the customer course of action in comment #54 and the strong rejection of this change upstream, I think we will not plan to make changes here.

Note You need to log in before you can comment on or make changes to this bug.