Bug 151219

Summary:	writes using O_SYNC on ext3 are not POSIX compliant
Product:	Red Hat Enterprise Linux 4	Reporter:	craig harmer <craig>
Component:	kernel	Assignee:	Stephen Tweedie <sct>
Status:	CLOSED WONTFIX	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.0	CC:	davej, linux26port, riel
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2005-03-16 15:01:37 UTC	Type:	---

Description craig harmer 2005-03-16 02:51:00 UTC

Description of problem:
i was comparing the performance of VxFS (the veritas file system) and
ext3 when writing to files with the O_SYNC flag set.  ext3 was faster
than i would have expected if it was performing the writes in a manner
compliant with the persistence guarantees of the O_SYNC flag.  i used
VxVM (the veritas volume manager) to trace the i/o's performed by
ext3 and confirmed that it failed to perform the synchronous time
stamp updates required by the flag.  in fact, it was performing the
writes as if the O_DSYNC flag was set.

POSIX defines this behavior and the OpenGroup elaborates on it a bit.
here's the relevant verbiage from the open() specification:

http://www.opengroup.org/onlinepubs/007908799/xsh/open.html

O_SYNC
    Write I/O operations on the file descriptor complete as defined by
synchronised I/O file integrity completion.

O_DSYNC
    Write I/O operations on the file descriptor complete as defined by
synchronised I/O data integrity completion

the terms "synchronised I/O file integrity completion" and "... data
integrity completion" are defined in the glossary:

http://www.opengroup.org/onlinepubs/007908799/xbd/glossary.html#tag_004_000_291

basically, if O_DSYNC is specified then the file data has to be
written synchronously to disk as well as any metadata required to
access the data.  if O_SYNC is specified then there's an additional
requirement that time stamps modified by the write be written
synchronously to disk as well.

ext3 is treating O_SYNC writes like O_DSYNC; it doesn't write the
inode synchronously to disk (or to the log) if the only change to the
metadata is a time stamp update.

most other Linux file systems probably have the same problem.

the open(2) man page documents this (sort of):

POSIX provides for three different variants of synchronised I/O,
corresponding to the flags O_SYNC, O_DSYNC and O_RSYNC.
Currently (2.1.130) these are all synonymous under Linux.

what it doesn't say is that you only get O_DSYNC.

most applications that specify the O_SYNC flag probably don't care
about synchronous time stamp updates.  they simply haven't been
converted to use the newer O_DSYNC flag.  nevertheless, there may be
some applications that do depend on the proper implementation of O_SYNC
semantics as defined by the standard.  and Linux shouldn't claim to conform to the
SVr4, SVID, POSIX, X/OPEN specifications if it doesn't implement this
properly.

Note that vxfs on Linux implements O_SYNC correctly, but has a special
mount option, "-o convosync=dsync" for use with older applications
that specify the O_SYNC flag but care only about persistent writes of
data.

Version-Release number of selected component (if applicable):
kernel-smp-2.6.5-7.97

How reproducible:
Always

Steps to Reproduce:
1. write a C program that opens a file using the O_SYNC flag
2. perform writes to the file that don't change the file size or cause
the allocation of blocks (writes to an existing file)
    

Actual Results:  i/o traces of ext3 showed that it performed
synchronous metadata updates only when the writes changed the size of
the file, and not when the only change was to file data.

Expected Results:  the inode or journal should have been written
synchronously as well to make the time stamp changes persistent.

Additional info:

mark hement investigated the Linux kernel code and came to the
conclusion that this bug exists in both the 2.4 and 2.6 kernel.  here
are his comments:

I've checked the 2.6 kernel; the behaviour of O_SYNC appears the same as
2.4.  In fact, on 2.6, this is the designed behaviour - at least for
file systems which use many of the generic VMM and VOP functions.  As
VxFS avoids the functions involved, it avoids treating O_SYNC as O_DSYNC.

On 2.6, for a write, ext3 calls into generic_file_aio_write_nolock()
(which, despite the name, is called for non-AIO writes as well).  This
calls into inode_update_time() to update {cm}time, which marks an
inode as I_DIRTY_SYNC, and causes a ->dirty_inode() call into ext3,
which prepares a transaction for the {cm}time change.

generic_file_aio_write_nolock() then calls
generic_file_buffered_write(), which calls ->commit_write
(ext3_ordered_commit_write()) to 'push' the write.

ext3_ordered_commit_write() calls generic_commit_write(), which marks
the inode's state as dirty (with I_DIRTY_SYNC | I_DIRTY_DATASYNC |
I_DIRTY_PAGES) *iff* the size has increased.

Back in generic_file_buffered_write(), generic_osync_inode() is called
for an O_SYNC write, with the OSYNC_METADATA flag.  OSYNC_METADATA
causes buffers associated with the inode (e.g. indirect blocks) to be
flushed, but not the inode itself.  But for a
size increase, the inode's state has the I_DIRTY_DATASYNC bit set. 
This causes a call to write_inode_now(), which (after much jumping
around) calls ext3's ->write_inode (ext3_write_inode()).  Finally,
this commits the queued transactions (ext3_force_commit()).

So, on 2.6, to get the correct O_SYNC behaviour for {cm}time updates,
inode_update_time() would need enhancing;
        o Pass in the file structure (for O_SYNC testing).
        o Add code to test for O_SYNC, and sync mount option.
        o Set the I_DIRTY_DATASYNC if O_SYNC required.  Code would
          look something like;
                if (sync_it) {
                        if (IS_SYNC(inode) || (file->f_flags & O_SYNC)) {
                                /* will force a later write_inode() */
                                mark_inode_dirty(inode);
                        } else {
                                /* only sets I_DIRTY_SYNC */
                                mark_inode_dirty_sync(inode);
                        }
                }

Not sure of the code changes for 2.4, but they'd be something similar
(around the mtime updates in mm/filemap.c).  I haven't checked reiserfs,
or other code paths which should be affected by O_SYNC.

Comment 2 Stephen Tweedie 2005-03-16 15:01:37 UTC

Unfortunately, this change would cause a major performance regression for all
O_SYNC users; and right now Linux simply does not have an O_DSYNC option.  glibc
defines it:

/* These are lesser flavors of partial synchronization that are
   implied by our one flag (O_FSYNC).  */
#if defined __USE_POSIX199309 || defined __USE_UNIX98
# define O_DSYNC        O_SYNC  /* Synchronize data.  */
# define O_RSYNC        O_SYNC  /* Synchronize read operations.  */
#endif
                                                                                
So if this were fixed according to strict POSIX synchronised IO guarantees for
timestamps, all O_*SYNC users would see the performance penalty of timestamp
updates --- even users who are requesting O_DSYNC and have no requirement for
synchronised timestamp updates.

Introducing such a regression in a RHEL4 update is not really an option; neither
is changing the ABI within a release to make O_SYNC and O_DSYNC distinct.  As
such, this is really an upstream issue; if glibc and the kernel have proper
O_DSYNC functionality by the time RHEL5 freezes, then inheriting that work would
be possible then.