Red Hat Bugzilla – Bug 458880
GFS: O_DIRECT writes fail when mixed with mmap reads
Last modified: 2011-07-25 09:18:05 EDT
Description of problem:
coherency is a new test we've been using on GFS2 to verify cluster coherency between different kinds of system calls with different types of I/O. Running it on GFS, I found that the following cases fail on a file system with a 1k block size.
Each one fails on the write system call with "Input/output error."
The I/O generation starts with empty files and writes up to 128k at a time.
d_iogen -I 23617043 -i 120s -f direct -s write -v mmread -p none -T 128k -F 10g:direct-write-mmread
Version-Release number of selected component (if applicable):
kernel-2.6.18-92.el5 (5.2) and kernel-2.6.18-103.el5 (5.3)
Steps to Reproduce:
1. mkfs -t gfs -O -b 1024 -j 4 -p lock_dlm -t tank-cluster:brawl0 /dev/brawl/brawl0
2. mount -t gfs -o debug /dev/brawl/brawl0 /mnt/brawl
3. coherency -m /mnt/brawl -S REG
The direct-*-mm* coherency tests fail almost immediately and may hang while d_doio tries to connect to d_iogen, which has already exited.
No SCSI errors were detected and the file system is still usable.
Writes should not fail with input/output errors.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
I've verified that I can reproduce this with the -92.1.10.el5 kernel. It still requires the 1k block size. Here is the command I'm using to get all of the failures from all of the coherency log files.
[nstraz@try 4.coherency]$ grep -rh "^Can" .
Can not pwrite() 47104 bytes to 336896 on direct-pwrite-mmindirect: Input/output error
Can not write() 36864 bytes to 580608 on direct-write-mmindirect: Input/output error
Can not writev() 121856 bytes to 1336832 on direct-writev-mmread: Input/output error
Can not writev() 43520 bytes to 1308160 on direct-writev-mmindirect: Input/output error
Can not write() 27136 bytes to 320512 on direct-write-mmread: Input/output error
Can not pwrite() 40448 bytes to 305152 on direct-pwrite-mmread: Input/output error
The file name encodes how the file was opened, which syscall was used to write, and which syscall was used to verify the write (i.e. to read it back). mmindirect is an mmap read that does not go through a userspace buffer.
Since this isn't a regression and the developers have been consumed by other problems, I am deferring this to RHEL 5.4 for consideration.
We don't have a fix yet; retargeting to 5.6.
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.
I'm very tempted to suggest that we shouldn't fix this for GFS. There would be a fair amount of work involved and I can't see any use case which is ever likely to want to run both mmap and direct I/O to the same file at the same time. It doesn't make any sense.
So I'm going to suggest that we document that it will not work and then move on. Please let me know if there are any objections to this.
Steven, we'd like to add something along the following lines to the docs for GFS (note, does not apply to GFS2):
Performing I/O through a memory mapping and also via direct I/O to the same file at the same time may result in the direct I/O failing with an I/O error. This occurs because the page invalidation required for the direct I/O can race with a page fault generated through the mapping. This is only a problem when the memory-mapped I/O and the direct I/O are performed on the same node, against the same file, at the same point in time. A workaround is to use file locking to ensure that memory-mapped I/O (i.e. page faults) and direct I/O do not occur simultaneously on the same file.
The Oracle database, one of the main applications using direct I/O, does not memory-map the files it accesses with direct I/O and is thus unaffected. In addition, writing to a memory-mapped file will succeed, as expected, unless there are page faults in flight at that point in time. The mmap system call on its own is safe while direct I/O is in use.
I have added the information in Comment 13 as a note to the current RHEL 5.7 draft of the GFS manual, in the section on direct I/O. It can be seen here:
Since we're addressing this through documentation (and since I've already updated the draft documentation), I'm changing the component to reflect that.
The new note is visible on the link provided in Comment 14 so I am moving this to ON_QA.