Description of Problem:
The routine fsync_inode_buffers() should write out the dirty buffers for the specified inode, but there is a scenario in which it can return without having done so. bdflush() runs and picks up a dirty buffer. It locks, cleans, and submits it. Before the I/O completes, commit_write() dirties the buffer again. Now fsync_inode_buffers() runs. It sees the buffer as dirty, so it calls ll_rw_block() on it, but the try-lock fails because the buffer is still locked for the in-flight I/O, so nothing is submitted. fsync_inode_buffers() later waits on the buffer. Then the I/O started by bdflush() finally completes, and fsync_inode_buffers() returns. That I/O was on a previous version of the block, so the latest data have not necessarily reached the disk yet.

Version-Release number of selected component (if applicable):
2.4.9-e.3, e.5 and e.8

How Reproducible:
100%

Steps to Reproduce:
1. This is a race condition. We encountered it while working on a shared/clustered filesystem. Corruption occurred, and the bug was found through inspection.

Actual Results:
Occasionally, not all buffers have been flushed when fsync_inode_buffers() returns. In a clustered filesystem, this can cause data corruption.

Expected Results:
On return from fsync_inode_buffers(), all dirty buffers should have been written at least in the state they were in at the time of the call.

Additional Information:
This causes data corruption for a shared/clustered filesystem that needs to use clusterwide inode locks and expects fsync_inode_buffers() to reliably write all the dirty buffers.
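
The interleaving described in this report can be illustrated with a small, single-threaded userspace model. This is only a sketch: struct model_buffer and all of the model_* functions are hypothetical stand-ins invented for illustration, not kernel code, and the version counters exist only to make the stale write visible.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a buffer head; NOT the kernel's struct. */
struct model_buffer {
    bool dirty;            /* buffer holds data not yet submitted */
    bool locked;           /* set while an I/O is in flight */
    int  mem_version;      /* version of the data currently in memory */
    int  inflight_version; /* version captured when the I/O was submitted */
    int  disk_version;     /* version that has reached the disk */
};

/* Models "lock, clean, submit". If the buffer is already locked,
 * the try-lock fails and nothing is submitted. */
static bool model_try_submit(struct model_buffer *b)
{
    if (b->locked)
        return false;
    b->locked = true;
    b->dirty = false;
    b->inflight_version = b->mem_version;
    return true;
}

/* Models a write that dirties the buffer again with newer data. */
static void model_redirty(struct model_buffer *b)
{
    b->mem_version++;
    b->dirty = true;
}

/* Models I/O completion: the submitted version lands on disk. */
static void model_io_complete(struct model_buffer *b)
{
    assert(b->locked);
    b->disk_version = b->inflight_version;
    b->locked = false;
}

/* Replays the reported sequence. Returns true only if the latest
 * in-memory data is on disk when the modeled fsync returns. */
bool model_race_latest_on_disk(void)
{
    struct model_buffer b = { .dirty = true, .mem_version = 1 };

    model_try_submit(&b);  /* bdflush: lock, clean, submit version 1 */
    model_redirty(&b);     /* commit_write: version 2, I/O still in flight */

    /* fsync: buffer looks dirty, but the try-lock fails, so no new I/O */
    if (b.dirty)
        (void)model_try_submit(&b);

    model_io_complete(&b); /* fsync waits; the version 1 I/O completes */

    return b.disk_version == b.mem_version;
}
```

In this model, model_race_latest_on_disk() returns false: the disk holds version 1 while memory holds version 2, and the buffer is still marked dirty when the modeled fsync returns.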
FYI, the patch for this bug made it into 2.4.19. Regards, Tim