Bug 612839
Summary: read() and write() truncate buffer length
Product: Red Hat Enterprise Linux 5
Reporter: Lachlan McIlroy <lmcilroy>
Component: kernel
Assignee: Edward Shishkin <edward>
Status: CLOSED NOTABUG
QA Contact: Zhang Kexin <kzhang>
Severity: high
Priority: high
Version: 5.6
CC: Chris.Bontempi, degts, esandeen, ihands, joseph.h.swartz, kgay, kzhang, rnelson, rring, rwheeler, swells, swhiteho, vgaikwad
Target Milestone: rc
Flags: Chris.Bontempi: needinfo+
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2011-06-24 15:18:08 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Created attachment 435949 [details]
Patch to remove excessive write length truncation
This patch fixes the write length truncation to be precisely INT_MAX bytes (rather than rounding down INT_MAX to a page boundary). This allows writes of INT_MAX bytes to succeed in one call.
Created attachment 436437 [details]
Patch to allow >2GB reads/writes.
This patch removes the 2GB limitation for a single read() or write() call on 64 bit architectures.
Comment on attachment 430558 [details]
Test program
This test program was written to verify the initial patch, which removed the page rounding; it doesn't apply to the new patch.
To test the latest patch, use:

$ dd if=/dev/zero of=/dev/null bs=2G count=1
0+1 records in
0+1 records out
2147483648 bytes (2.1 GB) copied, 1.55933 s, 1.4 GB/s
$ dd if=/dev/zero of=/dev/null bs=4G count=1
0+1 records in
0+1 records out
4294967296 bytes (2.1 GB) copied, 1.56212 s, 1.4 GB/s

The bytes copied should exactly match the buffer size. Previously the count would be truncated to 2147479552 bytes. Keep increasing the bs value if memory permits.

Edward said he'd take a look at this from the engineering side...

Created attachment 451414 [details]
Link to changeset e28cc71572da38a5a12c1cfe4d7032017adccf69
Added a link to the changeset e28cc71572da38a5a12c1cfe4d7032017adccf69
Hm, just as a note, the limit went in with this commit:

commit e28cc71572da38a5a12c1cfe4d7032017adccf69
Author: Linus Torvalds <torvalds.org>
Date:   Wed Jan 4 16:20:40 2006 -0800

    Relax the rw_verify_area() error checking.

    In particular, allow over-large read- or write-requests to be downgraded
    to a more reasonable range, rather than considering them outright errors.

    We want to protect lower layers from (the sadly all too common) overflow
    conditions, but prefer to do so by chopping the requests up, rather than
    just refusing them outright.

    Cc: Peter Anvin <hpa>
    Cc: Ulrich Drepper <drepper>
    Cc: Andi Kleen <ak>
    Cc: Al Viro <viro.org.uk>
    Signed-off-by: Linus Torvalds <torvalds>

I wonder what errors were present in lower layers, and whether they would be re-exposed by changing this. (The commit is 2.6.15-era.)

Oh right, which is just what Edward posted, sorry!

Just looking at the history of this INT_MAX limit: commit 2832e9366a1fcd6f76957a42157be041240f994e took out a check for count > f_maxcount and replaced it with INT_MAX, since f_maxcount was hardcoded to that anyway. So that's where the INT_MAX came from. f_maxcount itself was added long ago (2005, bkrev 41f71cbbbAqnp67z79i7SSVQGtmQzg):

    Add 'f_maxcount' to allow filesystems to set a per-file maximum IO size.

but it was hardcoded to INT_MAX and never changed until it was removed. See:
http://www.mail-archive.com/bk-commits-head@vger.kernel.org/msg00617.html
and
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg92056.html

So I wonder what this old INT_MAX limit (formerly f_maxcount) was designed to protect.

Edward, I agree that allowing a larger count would change that old upstream commit, but proposing it upstream may be the right direction. It may just be an old limit that needs to be revisited. Or, if there's a valid reason to limit it here, someone will speak up :)

(I assume we have the same behavior in RHEL6 anyway? And if so, we shouldn't fix it in RHEL5 without also fixing RHEL6.)

One more note: this same behavior exists upstream today, so "fixing" it here would require RHEL6 and upstream fixes.

(In reply to comment #21)
> Hm, just as a note, the limit went in with this commit:
> [...]
> I wonder what errors were present in lower layers, and whether they would be
> re-exposed by changing this. (commit is 2.6.15-era)

I guess this is the direct_io subsystem that exposes the INT_MAX limitation. No?

x86_64, strace of dd if=/dev/zero of=/dev/null bs=2G count=1:

read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2147483648) = 2147479552
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2147479552) = 2147479552

Chris, what info are you looking for, and from whom? I see that you set a needinfo flag on this bug.

Thanks,
-Eric

Edward and I spoke with the customer on the phone today, and it sounds like they will be able to modify their IO library to accept these very large IO requests and issue smaller direct-IO chunks in a loop to fulfill each request. We stressed that if they have any problems with this approach, they are welcome to let us know, and we can help them make it work.

We also pointed out that the read() and write() calls do return the actual number of bytes written, so the calling application can in fact know whether or not the entire request was processed, and proceed accordingly.

Thanks,
-Eric

This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.7, and Red Hat does not plan to fix this issue in the currently developed update. Contact your manager or support representative in case you need to escalate this bug.

Given the customer's course of action in comment #54 and the strong rejection of this change upstream, I think we will not plan to make changes here.
Created attachment 430558 [details]
Test program

Description of problem:

Here is the implementation of write() in the 2.6.18 kernel. Notice that it receives count as a size_t, and that it calls vfs_write().

asmlinkage ssize_t sys_write(unsigned int fd, const char __user * buf,
                             size_t count)
{
        struct file *file;
        ssize_t ret = -EBADF;
        int fput_needed;

        file = fget_light(fd, &fput_needed);
        if (file) {
                loff_t pos = file_pos_read(file);
                ret = vfs_write(file, buf, count, &pos);
                file_pos_write(file, pos);
                fput_light(file, fput_needed);
        }

        return ret;
}

In vfs_write() we call rw_verify_area(); its return value is stored in an ssize_t and assigned back to count.

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count,
                  loff_t *pos)
{
        ssize_t ret;

        if (!(file->f_mode & FMODE_WRITE))
                return -EBADF;
        if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write))
                return -EINVAL;
        if (unlikely(!access_ok(VERIFY_READ, buf, count)))
                return -EFAULT;

        ret = rw_verify_area(WRITE, file, pos, count);
        if (ret >= 0) {
                count = ret;
        ...

Finally, the rw_verify_area() function. Notice the comment describing the truncation of the count:

/*
 * rw_verify_area doesn't like huge counts. We limit
 * them to something that fits in "int" so that others
 * won't have to do range checks all the time.
 */
#define MAX_RW_COUNT (INT_MAX & PAGE_CACHE_MASK)

int rw_verify_area(int read_write, struct file *file, loff_t *ppos, size_t count)
{
        struct inode *inode;
        loff_t pos;

        if (unlikely((ssize_t) count < 0))
                goto Einval;
        pos = *ppos;
        if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
                goto Einval;

        inode = file->f_dentry->d_inode;
        if (unlikely(inode->i_flock && MANDATORY_LOCK(inode))) {
                int retval = locks_mandatory_area(
                        read_write == READ ? FLOCK_VERIFY_READ : FLOCK_VERIFY_WRITE,
                        inode, file, pos, count);
                if (retval < 0)
                        return retval;
        }
        return count > MAX_RW_COUNT ? MAX_RW_COUNT : count;
Einval:
        return -EINVAL;
}

Here we can clearly see "return count > MAX_RW_COUNT ? MAX_RW_COUNT : count;", which returns MAX_RW_COUNT for any count value over MAX_RW_COUNT. Thus we have our limit.

Version-Release number of selected component (if applicable):
Observed with kernel-2.6.18-204.

How reproducible:
Always

Steps to Reproduce:
Run the attached test program. It creates a file and tries to write a buffer of length INT_MAX (2147483647) bytes, but the resultant file only has 2147479552 bytes. The same problem can be seen in the read() system call with this dd strace:

# strace dd if=/dev/zero of=/dev/null bs=2147483647 count=1
...
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2147483647) = 2147479552
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2147479552) = 2147479552
...