Bug 89869

Summary: race condition in iput()
Product: [Retired] Red Hat Linux Reporter: Ben Woodard <woodard>
Component: kernelAssignee: Arjan van de Ven <arjanv>
Status: CLOSED NOTABUG QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.3   
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2003-04-29 08:47:30 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
patch to fix the problem none

Description Ben Woodard 2003-04-29 02:56:21 UTC
Description of problem:

Subject: 2.4.20 patch to fix race condition in iput()
Date: Mon, 28 Apr 2003 10:48:39 -0700
From: Dave Peterson <dsp>
To: linux-kernel.org
Cc: viro.theplanet.co.uk, dsp
Version-Release number of selected component (if applicable):

I found a race condition in the 2.4.20 kernel that involves
iput() and the syncing of inodes to disk when their reference
counts drop to 0.  The problem involves the temporary releasing
of inode_lock while the inode is being written to disk.  Once
the write is finished, inode_lock is again acquired with the
assumption that no other task has removed the inode from the
inode_unused list and started to destroy it. However, this
assumption may be violated if another task is executing
kill_super() while the write is in progress.  
To fix the problem, I have appended to this message a 2.4.20 kernel
patch for fs/inode.c.  It alters the behavior of iput() as follows:

    Move the "if (!sb || (sb->s_flags & MS_ACTIVE))" so that it is
    evaluated while still holding inode_lock.  Then add the inode to
    the inode_unused list (provided that that the inode is neither
    locked nor dirty), release inode_lock, and return only if the
    test condition evaluates to true.  Otherwise do not add the
    inode to the inode_unused list.  Instead, remove the inode from
    its hash list so no other tasks can obtain references to it,
    call __iget() to increment the reference count back to 1 and
    add the inode to inode_in_use, and then release inode_lock and
    write the inode to disk.  Since we have a reference to the inode,
    no other task can attempt to destroy it while the write is in
    progress.  Once the write finished, reacquire inode_lock, set the
    inode's reference count to 0, remove the inode from the
    inode_in_use list, and destroy the inode.

In addition, the patch fixes a minor problem in which the bookkeeping
for inodes_stat.nr_unused was being done incorrectly.  It also
cleans up the function sync_one() which is shown here in its original
form:

    static inline void sync_one(struct inode *inode, int sync)
    {
            while (inode->i_state & I_LOCK) {
                    __iget(inode);
                    spin_unlock(&inode_lock);
                    __wait_on_inode(inode);
                    iput(inode);
                    spin_lock(&inode_lock);
            }

            __sync_one(inode, sync);
    }

The logic of this function is inherently flawed in the way that it
manipulates the reference count on the inode.  Presumably, __iget() is
called to prevent the inode from being destroyed after inode_lock is
released.  After __wait_on_inode() returns, iput() is called to release
the reference that was just acquired.  Then, inode_lock is acquired
again.  This is problematic because the call to iput() could cause the
reference count on the inode to drop to 0, resulting in a call to
__sync_one() on an inode that has been destroyed.  The assumption
should be that prior to calling sync_one(), the caller has obtained its
own reference to the inode.  After examining all callers of sync_one()
and verifying that this assumption does in fact hold, I have removed
the calls to __iget() and iput() from sync_one().



How reproducible:
Fairly reproducable under the right situations

Steps to Reproduce:
Here is a scenario
that demonstrates the problem:

    1.  Task A is inside kill_super().  It clears the MS_ACTIVE
        flag of the s_flags field of the super_block struct and
        calls invalidate_inodes().  In invalidate_inodes(), it
        attempts to acquire inode_lock and spins because task B,
        executing inside iput(), just decremented the reference
        count of some inode i to 0 and acquired inode_lock.

    2.  Then the "if (!inode->i_nlink)" test condition evaluates
        to false for task B so it executes the following chunk
        of code:

        01 } else {
        02         if (!list_empty(&inode->i_hash)) {
        03                 if (!(inode->i_state & (I_DIRTY|I_LOCK))) {
        04                         list_del(&inode->i_list);
        05                         list_add(&inode->i_list, &inode_unused);
        06                 }
        07                 inodes_stat.nr_unused++;
        08                 spin_unlock(&inode_lock);
        09                 if (!sb || (sb->s_flags & MS_ACTIVE))
        10                         return;
        11                 write_inode_now(inode, 1);
        12                 spin_lock(&inode_lock);
        13                 inodes_stat.nr_unused--;
        14                 list_del_init(&inode->i_hash);
        15         }
        16         list_del_init(&inode->i_list);
        17         inode->i_state|=I_FREEING;
        18         inodes_stat.nr_inodes--;
        19         spin_unlock(&inode_lock);
        20         if (inode->i_data.nrpages)
        21                 truncate_inode_pages(&inode->i_data, 0);
        22         clear_inode(inode);
        23 }
        24 destroy_inode(inode);

        Now the test condition on line 02 evaluates to true, so
        task B adds the inode to the inode_unused list and then
        releases inode_lock on line 08.

    3.  Now A acquires inode_lock and B spins on inode_lock inside
        write_inode_now();

    4.  Task A calls invalidate_list(), transferring inode i from
        the inode_unused list to its own private throw_away list.

    5.  Task A releases inode_lock, allowing B to acquire inode_lock
        and continue executing.

    6.  A attempts to destroy inode i inside dispose_list() while B
        simultaneously attempts to destroy i, potentially causing
        all sorts of bad things to happen.
    
Actual results:


Expected results:


Additional info:

Comment 1 Ben Woodard 2003-04-29 02:57:26 UTC
Created attachment 91377 [details]
patch to fix the problem

Comment 2 Arjan van de Ven 2003-04-29 08:47:30 UTC
as per Al Viro's mail: this is not a bug.