Bug 647824 - general protection fault in due to misaligned pointer access in filp_close()
Summary: general protection fault in due to misaligned pointer access in filp_close()
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: x86_64
OS: All
high
high
Target Milestone: rc
: ---
Assignee: Dave Anderson
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2010-10-29 15:27 UTC by Martin Wilck
Modified: 2018-11-14 16:43 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-04-07 13:44:17 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
serial log of EL5.6 (2.6.18-238.1.1.el5) (51.80 KB, text/plain)
2011-02-14 15:05 UTC, Martin Wilck
no flags Details
proposed debug-patch #1 (1.32 KB, patch)
2011-02-14 15:26 UTC, Dave Anderson
no flags Details | Diff
currently used test patch (1.05 KB, patch)
2011-02-18 17:47 UTC, Martin Wilck
no flags Details | Diff
[may be patch] unshare_files()->copy_files() must not change tsk->files lockless (1.71 KB, patch)
2011-03-17 14:35 UTC, Oleg Nesterov
no flags Details | Diff

Description Martin Wilck 2010-10-29 15:27:43 UTC
Description of problem:

Pid: 4531, comm: modprobe Not tainted 2.6.18-194.17.1.el5 #1
RIP: 0010:[<ffffffff80023bd8>] [<ffffffff80023bd8>] filp_close+0x2b/0x64
RSP: 0018:ffff81047b93fed8 EFLAGS: 00010206
RAX: 01ffffffff8029c9 RBX: ffff81087e9e2a81 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff81047d7abc80 RDI: ffff81087e9e2a81
RBP: ffff81047d7abc80 R08: ffff81048e1b5000 R09: ffff81010fc85d40
R10: 00002ba042143870 R11: ffffffff800da536 R12: 0000000000000002
R13: ffff81047fc8e240 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffffffff803ca000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002ba041e906e0 CR3: 0000000000201000 CR4: 00000000000006e0
Process modprobe (pid: 4531, threadinfo ffff81047b93e000, task ffff81047f9f5820)
Stack: ffff81047d7abc80 0000000000000001 0000000000000002 ffffffff80038eb5
ffff81047d7abc80 ffff81047f9f5820 0000000000000001 00007fffb6e4fb30
0000000000000000 ffffffff80015886 0000000000000000 0000000000000000
Call Trace:
[<ffffffff80038eb5>] put_files_struct+0x63/0xae
[<ffffffff80015886>] do_exit+0x31c/0x911
[<ffffffff8004911d>] cpuset_exit+0x0/0x88
[<ffffffff8005d116>] system_call+0x7e/0x83
Version-Release number of selected component (if applicable):

How reproducible:
sporadic, in reboot test with Fujitsu PRIMERGY BX960S1 blade server. It was observed only on systems with iSCSI root so far, but I can't imagine the relation between iSCSI root and this problem.

Steps to Reproduce:
1. reboot test on BX960S1
  
Actual results:
See above

Expected results:
No panic

Additional info:

The fault occurs because %rax = 01ffffffff8029c9 is used as a pointer. The values is obviously a valid pointer address shifted by one (pointing to null_fops, the open file here is obviously /dev/null). The %rax value comes from the address pointed to by %rdi (ffff81087e9e2a81) which is obviously misaligned for a pointer. The %rdi value comes from the following code in close_files(), the xchg() call in line 471:

457         /*
458          * It is safe to dereference the fd table without RCU or
459          * ->file_lock because this is the last reference to the
460          * files structure.
461          */
462         fdt = files_fdtable(files);
463         for (;;) {
464                 unsigned long set;
465                 i = j * __NFDBITS;
466                 if (i >= fdt->max_fdset || i >= fdt->max_fds)
467                         break;
468                 set = fdt->open_fds->fds_bits[j++];
469                 while (set) {
470                         if (set & 1) {
471                                 struct file * file = xchg(&fdt->fd[i], NULL);
472                                 if (file)
473                                         filp_close(file, files);
474                         }
475                         i++;
476                         set >>= 1;
477                 }
478         }
479 }

The comment on top makes me speculate if some backported patches in RHEL5 may perhaps have invalidated the assumption that locking isn't necessary any more (note that files_fdtable() implies an rcu_dereference()).

Comment 2 Martin Wilck 2010-10-29 15:51:59 UTC
We can't exclude a HW problem here, but we saw the problem on 4 different
machines, always with the same stack (always modprobe, but the module being
loaded was not always the same).

Comment 4 RHEL Program Management 2011-01-07 04:22:49 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 5 Suzanne Logcher 2011-01-07 16:17:31 UTC
This request was erroneously denied for the current release of Red Hat
Enterprise Linux.  The error has been fixed and this request has been
re-proposed for the current release.

Comment 11 Larry Troan 2011-01-11 21:57:33 UTC
Martin,

Engineering notes that this problem occurred with a 5.5.z errata. Have you seen it on the RHEL5.5GA bits? Trying to narrow the window.

Comment 12 Martin Wilck 2011-01-12 13:11:18 UTC
I don't have that information at this time (need to ask a colleague who is sick today). However the problem was seen with the earlier errata kernel 2.6.18-194.8.1.el5 as well. It is hard to give a comprehensive answer on a large set of kernels, because reproducing this problem takes a long time (up to a week of reboot tests).

If you have good reasons to think that it would be helpful to retest with a particular kernel, please tell me. It might also be worth to retry with 5.6. Please advise if 5.5GA, 5.6RC, or something else would be most promising - we need to make a good choice.

Comment 13 Dave Anderson 2011-01-18 13:37:25 UTC
Ideally the issue should be bisected to a specific release, but given the
problem with reproducing it, probably the second best solution would be to
attempt to reproduce it with a RHEL 5.6 release -- especially given that any
fix would have to go there first.

Comment 14 Martin Wilck 2011-01-19 12:45:19 UTC
Because the investigation has been inactive on all parts for some time, we are currently trying to reproduce the problem with the unmodified system (still running 2.6.19-194.8.1.el5), instrumented with kdump. Once we captured a dump, we will upgrade to 5.6.

Question: is KDUMP DUMPLEVEL 31 sufficient here, or do we need a more complete vmcore image?

Comment 15 Dave Anderson 2011-01-19 13:51:33 UTC
(In reply to comment #14)
> Because the investigation has been inactive on all parts for some time, we are
> currently trying to reproduce the problem with the unmodified system (still
> running 2.6.19-194.8.1.el5), instrumented with kdump. Once we captured a dump,
> we will upgrade to 5.6.
> 
> Question: is KDUMP DUMPLEVEL 31 sufficient here, or do we need a more complete
> vmcore image?

That's what I would have suggested first, but unfortunately in all of the
provided logs, the crash occurs well before the kdump service is enabled.

Comment 16 Martin Wilck 2011-01-19 14:02:24 UTC
You are right, of course, I missed that :-/

So we will test with 5.6, and if the issue is reproduced, we'll try the debug kernel and try to make progress with kdb.

Comment 17 Dave Anderson 2011-01-19 14:14:08 UTC
kdb?

Comment 18 Martin Wilck 2011-01-19 14:52:29 UTC
I was hoping there was kdb support in the RHEL debug kernel. If it's not, please advise how else we could make progress in the analysis of this problem.

Comment 19 Dave Anderson 2011-01-19 15:02:37 UTC
Other than building kernels with printk's, I don't know.

It would be interesting first to know whether the original fdtable 
(not the RCU copy) contained the bad file structure pointer in its
array, and then work backwards from there.  But given the timing of
getting it to reproduce, it's probably going to be very slow progress
figuring out where things went wrong.

Comment 20 Dave Anderson 2011-01-19 16:01:49 UTC
The modprobe task's "tsk->files" points to a files_struct:

  struct files_struct {
      atomic_t count;
      struct fdtable *fdt; 
      struct fdtable {
          unsigned int max_fds;
          int max_fdset;
          struct file **fd;
          fd_set *close_on_exec;
          fd_set *open_fds;
          struct rcu_head rcu;
          struct files_struct *free_files;
          struct fdtable *next;
      } fdtab;
      spinlock_t file_lock;
      int next_fd;
      struct embedded_fd_set close_on_exec_init;
      struct embedded_fd_set open_fds_init;
      struct file *fd_array[64];
  }

The "fdt" pointer above either points to the embedded "fdtab"
fdtable above, or points to an external fdtable structure.

If "fdt" points to the embedded fdtable, then the "fd" pointer
in that table points to the "fd_array[64]" list of file
structure pointers.  And one of the fd_array[64] pointers is
the corrupted pointer used by close_files(). 

If "fdt" points to an external fdtable structure, then that
external structure's "fd" pointer points yet another external
array of file structure pointers.  And one of those external 
file structure pointers is the corrupted pointer used by 
close_files().

I'm not sure whether "modprobe" uses the embedded or an 
external fdtable.

Complicating matters slightly is that the fdtable used
by close_files() is an RCU copy of the original fdtable.

Complicating matters even further is that perhaps there's
a corruption associated with the "open_fds" fd_set bitmask 
that causes "set" to be incorrectly set, and that causing
the use of an invalid file structure pointer?:

    fdt = files_fdtable(files);
    for (;;) {
            unsigned long set;
            i = j * __NFDBITS;
            if (i >= fdt->max_fdset || i >= fdt->max_fds)
                    break;
            set = fdt->open_fds->fds_bits[j++];
            while (set) {
                    if (set & 1) {
                            struct file * file = xchg(&fdt->fd[i], NULL);
                            if (file)
                                    filp_close(file, files);
                    }
                    i++;
                    set >>= 1;
            }
    }

So anyway, what I was suggesting is to at least verify what type
of fdtable is used (embedded or external), and to dump the 
file structure array if the calling task is named "modprobe"?

What do you think?

Comment 21 Martin Wilck 2011-02-14 15:05:31 UTC
Created attachment 478646 [details]
serial log of EL5.6 (2.6.18-238.1.1.el5)

The issue has been now reproduced on RHEL5.6. The panic attached occured after 82 reboots. 

On the same system, in the same test scenario, we see an alternating error condition where the NMI watchdog causes a panic in cache_alloc_refill(). More material on this can be find in case 00320375. This was original problem observed in this test scenario - for quite some time it hadn't been reproduced, and the filp_close() problem had been observed instead.

Comment 22 Dave Anderson 2011-02-14 15:26:33 UTC
Created attachment 478649 [details]
proposed debug-patch #1

OK, so at least we can rule out any Z-stream issues -- that's a start.

But we're still rummaging around in the dark.  I had put together the attached
patch, which will at least narrow it down to whether we should be looking
at embedded tables or not.  It also would display the file struct address
being passed to filp_close().

I wonder whether it's worth cycling through and dumping all of the file
struct pointer prior to passing them one-at-a-time to filp_close()?

Comment 23 Martin Wilck 2011-02-14 16:10:39 UTC
Thanks for the patch. I'll try it ASAP. 

I am just a little worried if it might make the problem go away by changing the timing. I had considered to check the "file" pointer after the xchg() call and if it's misaligned (((long) file & 3) != 0), dump the files_struct and fd table. However let's start with your approach.

Comment 24 Dave Anderson 2011-02-14 16:19:49 UTC
(In reply to comment #23)
> Thanks for the patch. I'll try it ASAP. 

Don't do anything yet -- I'm building a test kernel now, I'll test it, and
if it's reasonable, I'll post it on my people.redhat.com page for you to test.
 
> I am just a little worried if it might make the problem go away by changing the
> timing. I had considered to check the "file" pointer after the xchg() call and
> if it's misaligned (((long) file & 3) != 0), dump the files_struct and fd
> table. However let's start with your approach.

That may be very true -- but it's just as likely that the corruption occurred
long before it even made it into the exit() path.  My second patch will at
least narrow that question down.

Comment 25 Dave Anderson 2011-02-14 17:03:38 UTC
Ok, the files are here: http://people.redhat.com/anderson/bz647824

  kernel-2.6.18-243.el5.bz647824.2.src.rpm
  kernel-2.6.18-243.el5.bz647824.2.x86_64.rpm
  kernel-debuginfo-2.6.18-243.el5.bz647824.2.x86_64.rpm
  kernel-debuginfo-common-2.6.18-243.el5.bz647824.2.x86_64.rpm
  linux-kernel-test.patch

The linux-kernel-test.patch is different from the one that I posted
earlier -- this one dumps all open files prior to entering the loop
where each one is issued to filp_close().

FWIW, there are modprobe instances using both "embedded" and 
external data structure usage, for examples:

[733] files: ffff8100338be0c0 files->fdt: ffff810033577cc0 &files->fdtab: ffff8100338be0d0 
[733] fdt->fd: ffff810033718000 &files->fd_array: ffff8100338be158 
[733] fdt->open_fds: ffff81003386b5c0 &files->open_fds_init: ffff8100338be150 
[733] fd[0]: ffff8100360d1180
[733] fd[1]: ffff8100360d1180
[733] fd[2]: ffff8100360d1180
[733] fd[255]: ffff8100369ab6c0

[1500] files: ffff81003457ac80 files->fdt: ffff81003457ac90 &files->fdtab: ffff81003457ac90 (embedded)
[1500] fdt->fd: ffff81003457ad18 &files->fd_array: ffff81003457ad18 (embedded)
[1500] fdt->open_fds: ffff81003457ad10 &files->open_fds_init: ffff81003457ad10 (embedded)
[1500] fd[0]: ffff8100360d37c0
[1500] fd[1]: ffff8100360d37c0
[1500] fd[2]: ffff8100360d37c0

Anyway, if the corruption has already occurred by the time the exit path
is reached, we should see it in the open files list.

Comment 26 Martin Wilck 2011-02-15 10:44:28 UTC
Reboot test is running with patched kernel now.

Comment 27 Martin Wilck 2011-02-16 10:53:07 UTC
I have good and bad news. The good news is that the error occured with the patched kernel after ~80 reboots. The bad news is that in the error case, the debug output is invisible because the console loglevel was too low. Now I changed the config as follows:

/etc/sysconfig/init: LOGLEVEL=8
/etc/sysconfig/syslog: KLOGD_OPTIONS="-x -c 8"

I am seeing much more "fdt" messages now. Restarted the reboot test.

Comment 28 Martin Wilck 2011-02-16 14:21:20 UTC
The test is still running. Let me make a few remarks.

If I read the code correctly, the default for newly created processes is to use the embedded fdtable. Only if the max fd number goes to 64 (actually, BITS_PER_LONG) or higher, expand_fdtable() is called to allocate a larger table. When processes are forked, they inherit the file descriptor table, so a child forked from a parent with external fdtable will also have an external fdtable.

Besides files->fdt, the individual elements fdt->fd[i] are normally also protected by RCU mechanisms. It strikes me that none of these references appear to be protected by rcu_read_lock() / rcu_read_unlock(), I thought that this is mandatory (http://lwn.net/Articles/262464/). It seems the locking is done through files_lock instead, *except in close_files()*. That makes me think (again) that taking files_lock might already solve our problem.

The statistics of my current run show that only 264 of 14223 modprobe instances (<2%) run with the embedded fdtable. It seems weird that modprobe uses up so many fd's all the time - I regard it as more likely that these large fdtables are inherited from parent processes. Actually, I find it awkward that at the early stage during boot there are any processes using fd's > 63 (assuming that only large deamons with many sockets can really use so many). I suspect that the culprit here is bash, which has the habit to always allocate fd 255 (http://lkml.indiana.edu/hypermail/linux/kernel/9803.3/0173.html, http://lkml.indiana.edu/hypermail/linux/kernel/9910.3/0377.html). (Side note: this seems to me like a serious waste of resources on basically every Linux system out there, simply caused by bash's convention).

Here are a few more observations: When we saw the error, the file descriptor causing the problem was always 2 (stderr), and it always pointed to /dev/null (the file->fop address in RAX can be identified as null_fops). The pointer was always off by exactly 1, as if the memory address had been incremented (it certainly doesn't look like a random overwrite). I am not sure what we can infer from that, but to me it looks like a race condition with fdt and/or fdt->fd[i].

Comment 29 Dave Anderson 2011-02-16 14:54:15 UTC
> Besides files->fdt, the individual elements fdt->fd[i] are normally also
> protected by RCU mechanisms. It strikes me that none of these references
> appear to be protected by rcu_read_lock() / rcu_read_unlock(), I thought
> that this is > mandatory (http://lwn.net/Articles/262464/). It seems the
> locking is done > through files_lock instead, *except in close_files()*.
> That makes me think (again) that taking files_lock might already solve
> our problem.

Right, I saw commit d11c563dd20ff35da5652c3e1c989d9e10e1d6d0,
but it sounds like the changes to close_files() and put_files_struct()
applied only for quieting an RCU-lockdep message rather than protecting
anything:

  commit d11c563dd20ff35da5652c3e1c989d9e10e1d6d0
  Author: Paul E. McKenney <paulmck.ibm.com>
  Date:   Mon Feb 22 17:04:50 2010 -0800

    sched: Use lockdep-based checking on rcu_dereference()
    
    Update the rcu_dereference() usages to take advantage of the new
    lockdep-based checking.
    
    Signed-off-by: Paul E. McKenney <paulmck.ibm.com>
    Cc: laijs.com
    Cc: dipankar.com
    Cc: mathieu.desnoyers
    Cc: josh
    Cc: dvhltc.com
    Cc: niv.com
    Cc: peterz
    Cc: rostedt
    Cc: Valdis.Kletnieks
    Cc: dhowells
    LKML-Reference: <1266887105-1528-6-git-send-email-paulmck.ibm.com>
    [ -v2: fix allmodconfig missing symbol export build failure on x86 ]
    Signed-off-by: Ingo Molnar <mingo>

diff --git a/kernel/exit.c b/kernel/exit.c
index 546774a..45ed043 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -85,7 +85,9 @@ static void __exit_signal(struct task_struct *tsk)
 	BUG_ON(!sig);
 	BUG_ON(!atomic_read(&sig->count));
 
-	sighand = rcu_dereference(tsk->sighand);
+	sighand = rcu_dereference_check(tsk->sighand,
+					rcu_read_lock_held() ||
+					lockdep_is_held(&tasklist_lock));
 	spin_lock(&sighand->siglock);
 
 	posix_cpu_timers_exit(tsk);
@@ -170,8 +172,10 @@ void release_task(struct task_struct * p)
 repeat:
 	tracehook_prepare_release_task(p);
 	/* don't need to get the RCU readlock here - the process is dead and
-	 * can't be modifying its own credentials */
+	 * can't be modifying its own credentials. But shut RCU-lockdep up */
+	rcu_read_lock();
 	atomic_dec(&__task_cred(p)->user->processes);
+	rcu_read_unlock();
 
 	proc_flush_task(p);
 
@@ -473,9 +477,11 @@ static void close_files(struct files_struct * files)
 	/*
 	 * It is safe to dereference the fd table without RCU or
 	 * ->file_lock because this is the last reference to the
-	 * files structure.
+	 * files structure.  But use RCU to shut RCU-lockdep up.
 	 */
+	rcu_read_lock();
 	fdt = files_fdtable(files);
+	rcu_read_unlock();
 	for (;;) {
 		unsigned long set;
 		i = j * __NFDBITS;
@@ -521,10 +527,12 @@ void put_files_struct(struct files_struct *files)
 		 * at the end of the RCU grace period. Otherwise,
 		 * you can free files immediately.
 		 */
+		rcu_read_lock();
 		fdt = files_fdtable(files);
 		if (fdt != &files->fdtab)
 			kmem_cache_free(files_cachep, files);
 		free_fdtable(fdt);
+		rcu_read_unlock();
 	}
 }
 
But it certainly would be worth testing with those rcu_read_lock's
in place.

Dave

Comment 30 Oleg Nesterov 2011-02-16 15:49:06 UTC
(In reply to comment #29)
>
> but it sounds like the changes to close_files() and put_files_struct()
> applied only for quieting an RCU-lockdep message rather than protecting
> anything:

Agreed. And given that rhel5 disable CONFIG_PREEMPT, rcu_read_lock() is nop.

Comment 31 Martin Wilck 2011-02-16 17:11:34 UTC
(In reply to comment #30)
> Agreed. And given that rhel5 disable CONFIG_PREEMPT, rcu_read_lock() is nop.

Yes, but the sources should also work with CONFIG_PREEMT enabled.

Anyway, my idea was to try spin_lock(&files->file_lock) in close_files(), and see if it solves our problem. However, let's wait for the current test run first.

Comment 32 Martin Wilck 2011-02-17 08:16:46 UTC
The system has survived 180 reboots with Dave's logging patch and the config changes described in comment #27. That makes sense because timing changes during the critical code path are now (with slow serial console active) much stronger than before.

Comment 33 Martin Wilck 2011-02-18 17:42:55 UTC
After more than 400 reboots with Dave's patch, I decided it makes no sense to continue. The added logging makes the problem go away.

Comment 34 Martin Wilck 2011-02-18 17:47:59 UTC
Created attachment 479580 [details]
currently used test patch

I am now using this patch. It takes the file_lock before touching the files data structures. When a misaligned pointer is detected, it prints some information and calls BUG().

I will be out of office until March 7. Until then we should know if taking the file_lock solves this problem.

Comment 35 Martin Wilck 2011-03-15 12:13:05 UTC
With the kernel with attachment #479580 [details], the has completed 1433 reboots without hitting this problem. That suggests that the patch actually helps. Please review it.

Comment 36 Dave Anderson 2011-03-15 15:04:08 UTC
It may help -- but do you agree that it's masking rather than fixing 
the problem?

Another entity is tinkering with the data after it should have no
further references, and that would seem to be the issue at hand.

I also wonder whether this can be reproduced with a RHEL6 or
upstream kernel?

Oleg, what are your thoughts on the matter?

Comment 37 Martin Wilck 2011-03-15 15:21:20 UTC
(In reply to comment #36)
> It may help -- but do you agree that it's masking rather than fixing 
> the problem?

Not quite. IMO it is the right thing to do this under lock protection, see below. 

My main cause is to get rid of this panic. Finding a working patch has taken 3 months - finding the real cause of the problem (if you're right and there is one) might easily take much longer. I don't have that much time.

> Another entity is tinkering with the data after it should have no
> further references, and that would seem to be the issue at hand.

You are referring to the comment "it is safe to dereference...", right? Why are you certain that comment is correct? All operations on the fd sets are done using RCU operations, I fail to see why there can't be any asynchronous stuff still running while the process is exiting. Perhaps the comment used to be correct some time in the past, but some changes are making it fail now? Maybe those changes are specific to the RHEL5 kernel? I've got no idea. Even if the comment is correct, I don't understand why it would be harmful to take the lock.

> I also wonder whether this can be reproduced with a RHEL6 or
> upstream kernel?

We haven't observed it there yet. That doesn't mean it can't happen. It is sporadic on RHEL5, so some condition that makes it less likely might already cause it not to be observed in typical reboot test scenarios (we usually stop after 1000 successful reboots).

> Oleg, what are your thoughts on the matter?

Comment 38 Dave Anderson 2011-03-15 15:55:27 UTC
> You are referring to the comment "it is safe to dereference...", right?
> Why are you certain that comment is correct?

I'm not certain -- other than the fact that it still says that, and has
no such lock protection, upstream.
 
> All operations on the fd sets are done
> using RCU operations, I fail to see why there can't be any asynchronous stuff
> still running while the process is exiting. Perhaps the comment used to be
> correct some time in the past, but some changes are making it fail now? Maybe
> those changes are specific to the RHEL5 kernel? I've got no idea. 

Maybe so -- that's also why I'm deferring this to Oleg.  

> Even if the comment is correct, I don't understand why it would be 
> harmful to take the lock.

I'm not sure either -- but I'd worry that since the locking seems to have
"fixed" the problem, that after the file_lock is taken in close_files(), 
file-related data may get free'd by __fput(), and so when the lock is
subsequently released, what's to prevent the "other" entity from
immediately touching stuff that's been freed?

It's just that your patch is just a hard-sell at this point -- and not
my call -- for getting it ACK'd internally.  And since Oleg is the expert
in this area, that's why I'm soliciting his opinion.

Comment 39 Oleg Nesterov 2011-03-15 16:04:50 UTC
(In reply to comment #36)
>
> Another entity is tinkering with the data after it should have no
> further references, and that would seem to be the issue at hand.
>
> I also wonder whether this can be reproduced with a RHEL6 or
> upstream kernel?

Yes, agreed.
 
> Oleg, what are your thoughts on the matter?

Certainly I am not the right person to ask ;) I am not expert.
And I can't understand why this patch makes the difference.

I do not even understand why close_files() does xchg(NULL).

Comment 40 Oleg Nesterov 2011-03-15 16:40:22 UTC
(In reply to comment #38)
> And since Oleg is the expert
> in this area, that's why I'm soliciting his opinion.

No, I am not. I'll try to grep more... Perhaps it makes sense
to look into call_usermodehelper() paths, but I doubt I'll find
something interesring.

But there is something I can't understand rom the very beginning.
OK, it seems that the kernel crashes trying to call filp->f_op->flush().
The address is RAX = 01ffffffff8029c9.

From #1:

   %rax = 01ffffffff8029c9 is used as a pointer.
   The values is obviously a valid pointer address shifted by one
   (pointing to null_fops)

could you explain what "shifted by one" actually means? IOW, what
is the numeric value of null_fops?

   The %rax value comes from the address pointed to by %rdi
   (ffff81087e9e2a81)

So, rdi contains the address of ->f_op?

Hmm. I do not have the full asm,
but https://bugzilla.redhat.com/attachment.cgi?id=478646
shows Code: 48 8b 40 68 48 85 c0 74 07 ff d0 41 89 c4 eb 03 45 31 e4 48

This is:

   0:	48 8b 40 68          	mov    0x68(%rax),%rax
   4:	48 85 c0             	test   %rax,%rax
   7:	74 07                	je     0x10
   9:	ff d0                	callq  *%rax
   b:	41 89 c4             	mov    %eax,%r12d
   e:	eb 03                	jmp    0x13
  10:	45 31 e4             	xor    %r12d,%r12d
  13:	48                   	rex.W

but 0x68(%rax) means ->release, not ->flush. Confused.

Comment 41 Oleg Nesterov 2011-03-15 17:04:13 UTC
(In reply to comment #40)
> 
> No, I am not. I'll try to grep more...

Yes, I know very little about this fdt code. And probably I am wrong.
But so far it seems to me that close_file() has all rights to do
BUG_ON(spin_is_locked(files->file_lock)). Who else can play (and change!)
the struct we are going to free.

Strange...

Comment 42 Oleg Nesterov 2011-03-15 17:14:00 UTC
(In reply to comment #39)
> 
> I do not even understand why close_files() does xchg(NULL).

sorry for noise... at least now I see why we need  xchg ;)

Comment 43 Dave Anderson 2011-03-15 17:44:53 UTC
(In reply to comment #42)
> (In reply to comment #39)
> > 
> > I do not even understand why close_files() does xchg(NULL).
> 
> sorry for noise... at least now I see why we need  xchg ;)

Well, I know far less about this code than you do -- so why
does it need to use xchg()?

> but 0x68(%rax) means ->release, not ->flush. Confused.

Actually 0x68 is the offset of the file_operations->flush operation:

crash> file_operations -o
struct file_operations {
   [0x0] struct module *owner;
   [0x8] loff_t (*llseek)(struct file *, loff_t, int);
  [0x10] ssize_t (*read)(struct file *, char *, size_t, loff_t *);
  [0x18] ssize_t (*aio_read)(struct kiocb *, char *, size_t, loff_t);
  [0x20] ssize_t (*write)(struct file *, const char *, size_t, loff_t *);
  [0x28] ssize_t (*aio_write)(struct kiocb *, const char *, size_t, loff_t);
  [0x30] int (*readdir)(struct file *, void *, filldir_t);
  [0x38] unsigned int (*poll)(struct file *, struct poll_table_struct *);
  [0x40] int (*ioctl)(struct inode *, struct file *, unsigned int, long unsigned int);
  [0x48] long int (*unlocked_ioctl)(struct file *, unsigned int, long unsigned int);
  [0x50] long int (*compat_ioctl)(struct file *, unsigned int, long unsigned int);
  [0x58] int (*mmap)(struct file *, struct vm_area_struct *);
  [0x60] int (*open)(struct inode *, struct file *);
  [0x68] int (*flush)(struct file *, fl_owner_t);
  [0x70] int (*release)(struct inode *, struct file *);
  [0x78] int (*fsync)(struct file *, struct dentry *, int);
  [0x80] int (*aio_fsync)(struct kiocb *, int);
  ...

Comment 44 Dave Anderson 2011-03-15 17:58:25 UTC
> could you explain what "shifted by one" actually means? IOW, what
> is the numeric value of null_fops?

If "RAX: 01ffffffff8029c9" is true, then in that kernel, "null_fops"
is that address shifted by 1-byte, or 0xffffffff8029c900.  Here, from
the 2.6.18-194.17.1.el5 vmlinux file:

  $ nm -Bn /tmp/vmlinux | grep null_fops
  ffffffff8029c900 r null_fops
  $

Comment 45 Dave Anderson 2011-03-15 18:06:16 UTC
And from that other kernel oops trace example:

> Pid: 4951, comm: modprobe Not tainted 2.6.18-238.1.1.el5 #1
> RIP: 0010:[<ffffffff80023c09>]  [<ffffffff80023c09>] filp_close+0x2b/0x64
> RSP: 0018:ffff810478bfbed8  EFLAGS: 00010206
> RAX: 01ffffffff802a44

In the 2.6.18-238.1.1.el5 kernel:

  $ nm -Bn /tmp/vmlinux | grep null_fops
  ffffffff802a44e0 r null_fops
  $

Comment 46 Oleg Nesterov 2011-03-15 18:27:14 UTC
(In reply to comment #43)
>
> > sorry for noise... at least now I see why we need  xchg ;)
> 
> Well, I know far less about this code than you do -- so why
> does it need to use xchg()?

Cough. I hate myself. No, I do not see why it is needed. Exactly
because we are the last owner, we are going to free it.

> > but 0x68(%rax) means ->release, not ->flush. Confused.
> 
> Actually 0x68 is the offset of the file_operations->flush operation:

Ah. Indeed, I was looking into upstream sources, thanks. So, at
least this is clear.

Comment 47 Dave Anderson 2011-03-17 13:51:33 UTC
(In reply to comment #35)
> With the kernel with attachment #479580 [details], 
> the has completed 1433 reboots without hitting this 
> problem. That suggests that the patch actually helps.
> Please review it.

(In reply to comment #41)
> Yes, I know very little about this fdt code. And 
> probably I am wrong.  But so far it seems to me 
> that close_file() has all rights to do 
> BUG_ON(spin_is_locked(files->file_lock)). Who else 
> can play (and change!) the struct we are going to free.

Clearly we're stumbling around in the dark with no idea 
why there is another entity that is tinkering around with 
an exiting modprobe's open files data while it's in the 
process of closing them all.

Furthermore, as Oleg mentioned, there should be no situation 
where the files_struct.file_lock should be locked upon entry 
to close_files().  The modprobe task is single-threaded, and
the files_struct and associated fdtable are unique to that
task. 

Martin's last patch put a spin_lock(&files->lock_file) at the
beginning and end of close_files().  Empirically that patch makes
it appear that the spin_lock either kept the offender away until
close_files() was done with its work, or perhaps it allowed the
offender run first?

In either case, let's prove it...

In a first-step attempt to track down the external offender,
I've built a kernel with the following patch, which:

(1) adds a new "debug" field to the files_struct.
(2) upon entry to close_files(), it does a spin_trylock() on
    the files->file_lock, and if the lock is already held, then
    BUG() immediately.
(3) otherwise, if we're a "modprobe", then set the files->debug 
    field, and hold the lock.
(4) or if we're not a "modprobe", just release the lock.
(5) after close_files() is complete, the files_struct is normally
    freed by either put_files_struct(), or by free_fdtable_rcu()
    if the associated fdtable is embedded; in either case, if
    the "files->debug" field is set, the files_struct is not freed.

Therefore, all files_struct's that are used by modprobe tasks 
will remain locked and un-freed forever when the task exits.
Doing this should hopefully yield a clue as to who the external 
offender is.  

Granted, if we see step (2) occur, then we'll have to add additional 
code to try to track down who currently owns the lock.  But it 
still would be very interesting on its own.   

But if the offender tries to take the lock *after* step (2),
then eventually the NMI watchdog should fire, and the context
of the offender would be caught in the act.  

I've rebooted the test kernel in a loop over-night with no
problems, so it should work in the normal case.

The kernel is here: http://people.redhat.com/anderson/bz647824

  kernel-2.6.18-247.el5.bz647824.1.src.rpm
  kernel-2.6.18-247.el5.bz647824.1.x86_64.rpm
  kernel-debuginfo-2.6.18-247.el5.bz647824.1.x86_64.rpm
  kernel-debuginfo-common-2.6.18-247.el5.bz647824.1.x86_64.rpm
  linux-kernel-test.patch

And here's the patch:

--- linux-2.6.18.x86_64/include/linux/file.h.orig
+++ linux-2.6.18.x86_64/include/linux/file.h
@@ -60,6 +60,9 @@ struct files_struct {
 	struct embedded_fd_set close_on_exec_init;
 	struct embedded_fd_set open_fds_init;
 	struct file * fd_array[NR_OPEN_DEFAULT];
+#ifndef __GENKSYMS__
+	unsigned long debug;
+#endif
 };
 
 #define files_fdtable(files) (rcu_dereference((files)->fdt))
--- linux-2.6.18.x86_64/kernel/exit.c.orig
+++ linux-2.6.18.x86_64/kernel/exit.c
@@ -468,6 +468,13 @@ static void close_files(struct files_str
 	 * ->file_lock because this is the last reference to the
 	 * files structure.
 	 */
+	if (!spin_trylock(&files->file_lock))
+		BUG();
+	else if (strcmp(current->comm, "modprobe") == 0)
+		files->debug = 0xdeadbeef;
+	else
+		spin_unlock(&files->file_lock);
+
 	fdt = files_fdtable(files);
 	for (;;) {
 		unsigned long set;
@@ -515,8 +522,12 @@ void fastcall put_files_struct(struct fi
 		fdt = files_fdtable(files);
 		if (fdt == &files->fdtab)
 			fdt->free_files = files;
-		else
-			kmem_cache_free(files_cachep, files);
+		else {
+			if (files->debug)
+				printk("put_files_struct: %p %lx\n", files, files->debug);
+			else
+				kmem_cache_free(files_cachep, files);
+		}
 		free_fdtable(fdt);
 	}
 }
--- linux-2.6.18.x86_64/fs/file.c.orig
+++ linux-2.6.18.x86_64/fs/file.c
@@ -124,7 +124,11 @@ static void free_fdtable_rcu(struct rcu_
 		 * and the files structure itself was getting destroyed.
 		 * It is now safe to free the files structure.
 		 */
-		kmem_cache_free(files_cachep, fdt->free_files);
+		if (fdt->free_files->debug)
+			printk("free_fdtable_rcu: %p %lx\n", 
+				fdt->free_files, fdt->free_files->debug);
+		else
+			kmem_cache_free(files_cachep, fdt->free_files);
 		return;
 	}
 	if (fdt->max_fdset <= EMBEDDED_FD_SET_SIZE &&
--- linux-2.6.18.x86_64/kernel/fork.c.orig
+++ linux-2.6.18.x86_64/kernel/fork.c
@@ -752,6 +752,7 @@ static struct files_struct *alloc_files(
 		goto out;
 
 	atomic_set(&newf->count, 1);
+	newf->debug = 0;
 
 	spin_lock_init(&newf->file_lock);
 	newf->next_fd = 0;

Comment 48 Oleg Nesterov 2011-03-17 14:35:21 UTC
Created attachment 486027 [details]
[may be patch] unshare_files()->copy_files() must not change tsk->files lockless

Well, it seems I found the bug...

The problem is, I doubt it can explain this particular problem,
the race is almost theoretical. But it would be nice to try this
patch anyway, imho.

We do not have another theory so far. And this race _can_ explain
double free/use, at least in theory.

Comment 49 Oleg Nesterov 2011-03-17 14:50:36 UTC
(In reply to comment #48)
>
> Created attachment 486027 [details]
> [may be patch] unshare_files()->copy_files() must not change tsk->files
> lockless
> 
> Well, it seems I found the bug...
> 
> The problem is, I doubt it can explain this particular problem,
> the race is almost theoretical. But it would be nice to try this
> patch anyway, imho.
> 
> We do not have another theory so far. And this race _can_ explain
> double free/use, at least in theory.

Hmm. I attached the patch with the changelog, but it turns out
https://bugzilla.redhat.com/attachment.cgi?id=486027&action=diff
do not want to show it. I am copying it here...

[RHEL5 PATCH] unshare_files()->copy_files() must not change tsk->files lockless

Unless we know task->files == NULL, we should never change ->files
lockless. Otherwise we can race with, say, get_files_struct() doing

	files = task->files;
	// ------ WINDOW ---------
	if (files)
		atomic_inc(&files->count);

If the caller of unshare_files() comes in the window above and does
the necessary put_files_struct(), atomic_inc() can increment ->count
after atomic_dec_and_test() == T.

Note:

	- This is the minimal hack. The real fix should change
	  unshare_files/callers to use unshare_fd()

	- The usage of unshare_files() in exec paths is strange,
	  it is called twice. Say, load_elf_binary() calls it,
	  then it calls flush_old_exec() which does unshare too.
	  I do not see any serious problem, but at least this
	  means that "atomic_read(&files->count) == 1" fastpath
	  never works, we always do copy_files().

	  unshare_files() should be called by do_execve() like
	  upstream does, but this requires many changes.

Comment 50 Dave Anderson 2011-03-17 14:57:24 UTC
Martin,

  Can you first build and test a kernel with just Oleg's patch?

Comment 51 Martin Wilck 2011-03-17 15:06:55 UTC
Thanks to both of you for the analysis and patches.

Meanwhile, I have been pursuing another trace. Before I started the test run
mentioned in comment #35, I also had updated the BIOS of the machine (in the
meantime I had encountered another problem known to be caused by BIOS bug,
which made me realize the BIOS wasn't up-to-date). After I finished the test
run from comment #35 I figured that I should try to reproduce the problem with
the new BIOS. This test is now running - the original RHEL kernel
2.6.18-238.1.1.el5 has completed 350 reboots already on this machine without
failing (it had failed after ~80 reboots  before, see comment #21).

Sooo, it is at least possible that this whole thing wasn't a RHEL kernel problem but a BIOS issue after all. I have not the slightest idea what the relation to the BIOS might be, the stack trace certainly doesn't suggest so. Let's hold on until early next week. By that time we should know whether the problem is still
reproducable (if the kernel reaches ~1000 reboots without failing). Before the BIOS update, the problem was nicely reproducable typically with <100 reboots.

Martin

Comment 52 Oleg Nesterov 2011-03-17 15:14:43 UTC
(In reply to comment #47)
>
> Furthermore, as Oleg mentioned, there should be no situation 
> where the files_struct.file_lock should be locked upon entry 
> to close_files().  The modprobe task is single-threaded, and
> the files_struct and associated fdtable are unique to that
> task. 

Yes... Once again, I never looked at this code before, I can
be wrong. But, according to grep, every caller of
spin_lock(files->file_lock) should have a reference to files.
Either the caller uses current->files, or it does get_files_struct().
And the only caller of close_files() is put_files_struct() which
does  atomic_dec_and_test().

And. In any case, if  fdt != &files->fdtab, we are going to
kmem_cache_free(files) immediately. This means that nobody
else must even try to take this lock. rcu protects fdtable,
not files_struct.

I found the race which can lead to use-after-free, but I doubt
it can explain this particular problem. So probably it makes
sense to try this patch as well.

One note...

> In a first-step attempt to track down the external offender,
> I've built a kernel with the following patch, which:
> 
> (1) adds a new "debug" field to the files_struct.
> (2) upon entry to close_files(), it does a spin_trylock() on
>     the files->file_lock, and if the lock is already held, then
>     BUG() immediately.
> (3) otherwise, if we're a "modprobe", then set the files->debug 
>     field, and hold the lock.
> (4) or if we're not a "modprobe", just release the lock.

(4) doesn't look right. If modprobe races with somebody else, it
should drop the lock. This decreases the chance modprobe can
notice the !spin_trylock(). Not that I think it makes sense to
redo the patch, just a nit.

> (5) after close_files() is complete, the files_struct is normally
>     freed by either put_files_struct(), or by free_fdtable_rcu()
>     if the associated fdtable is embedded; in either case, if
>     the "files->debug" field is set, the files_struct is not freed.

Looks reasonable...

Comment 53 Martin Wilck 2011-03-17 15:37:42 UTC
> And. In any case, if  fdt != &files->fdtab, we are going to
> kmem_cache_free(files) immediately. This means that nobody
> else must even try to take this lock. rcu protects fdtable,
> not files_struct.

AFAICS, rcu doesn't protect anything at all, because none of the files_fdtable() calls I have seen are properly nested with rcu_read_lock() / rcu_read_unlock(). My hypothesis was that, for some reason, the RHEL5 kernel still uses files->file_lock for protecting the fdtable, rather than RCU mechanisms. Therefore it occured to me that taking the lock in close_files might not be the worst idea, after all.

However, please let's just wait for the outcome of the test in comment #51 first.

Comment 54 Dave Anderson 2011-03-17 15:47:00 UTC
> However, please let's just wait for the outcome of the test in comment #51
> first.

No argument there...

But Oleg:

> (4) doesn't look right. If modprobe races with somebody else, it
> should drop the lock. This decreases the chance modprobe can
> notice the !spin_trylock(). Not that I think it makes sense to
> redo the patch, just a nit.

I wasn't sure whether it would be helpful if non-modprobe tasks held
the lock throughout close_files() -- but I thought it was enough that 
they at least try to take the lock.   And I didn't think that
non-modprobe tasks should leak/freeze the files_struct.  

But since the test only consists of a reboot-loop, we could easily:

(1) have *all* tasks hold the lock throughout close_files(), and
(2) leak/freeze *all* files_structs regardless whether they are
    modprobe tasks.

Is that what you mean?

Comment 55 Oleg Nesterov 2011-03-17 16:04:29 UTC
(In reply to comment #53)
>
> > And. In any case, if  fdt != &files->fdtab, we are going to
> > kmem_cache_free(files) immediately. This means that nobody
> > else must even try to take this lock. rcu protects fdtable,
> > not files_struct.
> 
> AFAICS, rcu doesn't protect anything at all, because none of the
> files_fdtable() calls I have seen are properly nested with rcu_read_lock() /
> rcu_read_unlock().

It does, see fget() for example.

> My hypothesis was that, for some reason, the RHEL5 kernel
> still uses files->file_lock for protecting the fdtable, rather than RCU
> mechanisms.

This doesn't matter. If we hold files->file_lock, we should have
the reference to file. Otherwise it can go way under us.

And,

> Therefore it occured to me that taking the lock in close_files
> might not be the worst idea, after all.

This can help if "something else" takes ->file_lock after close_files()
returns.

Comment 56 Oleg Nesterov 2011-03-17 16:09:15 UTC
(In reply to comment #54)
>
> I wasn't sure whether it would be helpful if non-modprobe tasks held
> the lock throughout close_files() -- but I thought it was enough that 
> they at least try to take the lock.

Yes... but I don't really understand what yopu mean ;)

> But since the test only consists of a reboot-loop, we could easily:
> 
> (1) have *all* tasks hold the lock throughout close_files(), and
> (2) leak/freeze *all* files_structs regardless whether they are
>     modprobe tasks.
 
Agreed, looks even better. And mark it "freed" unconditionally.

Comment 57 Martin Wilck 2011-03-18 11:40:15 UTC
(In reply to comment #55)

> > AFAICS, rcu doesn't protect anything at all, because none of the
> > files_fdtable() calls I have seen are properly nested with rcu_read_lock() /
> > rcu_read_unlock().
> 
> It does, see fget() for example.

I see it done for fcheck_files(), but there are a lot of uses of file_fdtable() where the RCU protection is at least not obvious.

Comment 58 Oleg Nesterov 2011-03-18 13:51:38 UTC
(In reply to comment #57)
> (In reply to comment #55)
> 
> > > AFAICS, rcu doesn't protect anything at all, because none of the
> > > files_fdtable() calls I have seen are properly nested with rcu_read_lock() /
> > > rcu_read_unlock().
> > 
> > It does, see fget() for example.
> 
> I see it done for fcheck_files(), but there are a lot of uses of file_fdtable()
> where the RCU protection is at least not obvious.

Of course. But I think you missed the point. Or I misunderstood.
(but let me repeat once again: I do not pretend I am expert ;)

Any user of file_fdtable(files) should have a reference to "files"
to begin with. And files_struct is refcounted, it is not protected
by RCU. If atomic_dec_and_test(files->count) succeeds, nobody else
should ever look at this memory. In particular, nobody else should
take or just dereference files->files_lock.

If spin_lock() really helps, this can only mean it masks the problem.
We should find the reason. And: one of the possible reasons is double
free.

Comment 59 Dave Anderson 2011-03-18 14:28:33 UTC
(In reply to comment #53)
> However, please let's just wait for the outcome of the test in
> comment #51 first.

If you still see the failure after the BIOS update,
I've updated the debug kernel to collaborate Oleg's
patch to copy_files() plus his suggestion to fortify 
my last patch.

It's essentially 3 debug kernels in one, depending
upon the "patch" level, which can only have values
of 1 (the default), 2 or 3, and can be set on the 
kernel boot command line like this:

title Red Hat Enterprise Linux Client (2.6.18-247.el5.bz647824.3)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-247.el5.bz647824.3 ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 rhgb quiet patch=3
        initrd /initrd-2.6.18-247.el5.bz647824.3.img

o patch=1  (default -- does not need to be explicitly set)

  Upon entering close_files(), each files_struct is locked
  and is never freed.  If the files_struct is already locked 
  upon entering close_files(), it BUG()'s.

o patch=2

  This will only invoke Oleg's patch to copy_files().

o patch=3

  This will invoke both my debug code and Oleg's patch.

The kernel is at http://people.redhat.com/anderson/bz647824 :

  kernel-2.6.18-247.el5.bz647824.3.src.rpm
  kernel-2.6.18-247.el5.bz647824.3.x86_64.rpm
  kernel-debuginfo-2.6.18-247.el5.bz647824.3.x86_64.rpm
  kernel-debuginfo-common-2.6.18-247.el5.bz647824.3.x86_64.rpm
  linux-kernel-test.patch

This is the patch:

--- linux-2.6.18.x86_64/arch/x86_64/kernel/setup.c.orig
+++ linux-2.6.18.x86_64/arch/x86_64/kernel/setup.c
@@ -303,6 +303,8 @@ static int fullarg(char *p, char *arg)
 	return !memcmp(p, arg, l) && (p[l] == 0 || isspace(p[l]));
 }
 
+int patchlevel = 1;
+
 static __init void parse_cmdline_early (char ** cmdline_p)
 {
 	char c = ' ', *to = command_line, *from = COMMAND_LINE;
@@ -408,6 +410,11 @@ static __init void parse_cmdline_early (
 		if (fullarg(from,"oops=panic"))
 			panic_on_oops = 1;
 
+		if (fullarg(from, "patch=2"))
+			patchlevel = 2;
+		if (fullarg(from, "patch=3"))
+			patchlevel = 3;
+
 		if (!memcmp(from, "noexec=", 7))
 			nonx_setup(from + 7);
 
--- linux-2.6.18.x86_64/kernel/exit.c.orig
+++ linux-2.6.18.x86_64/kernel/exit.c
@@ -460,6 +460,7 @@ static void close_files(struct files_str
 {
 	int i, j;
 	struct fdtable *fdt;
+	extern int patchlevel;
 
 	j = 0;
 
@@ -468,6 +469,8 @@ static void close_files(struct files_str
 	 * ->file_lock because this is the last reference to the
 	 * files structure.
 	 */
+	if ((patchlevel != 2) && !spin_trylock(&files->file_lock))
+		BUG();
 	fdt = files_fdtable(files);
 	for (;;) {
 		unsigned long set;
@@ -503,6 +506,7 @@ struct files_struct *get_files_struct(st
 void fastcall put_files_struct(struct files_struct *files)
 {
 	struct fdtable *fdt;
+	extern int patchlevel;
 
 	if (atomic_dec_and_test(&files->count)) {
 		close_files(files);
@@ -515,8 +519,12 @@ void fastcall put_files_struct(struct fi
 		fdt = files_fdtable(files);
 		if (fdt == &files->fdtab)
 			fdt->free_files = files;
-		else
-			kmem_cache_free(files_cachep, files);
+		else {
+			if (patchlevel == 2)
+				kmem_cache_free(files_cachep, files);
+			else
+				printk("put_files_struct: %p\n", files);
+		}
 		free_fdtable(fdt);
 	}
 }
--- linux-2.6.18.x86_64/kernel/fork.c.orig
+++ linux-2.6.18.x86_64/kernel/fork.c
@@ -876,6 +876,7 @@ static int copy_files(unsigned long clon
 {
 	struct files_struct *oldf, *newf;
 	int error = 0;
+	extern int patchlevel;
 
 	/*
 	 * A background process may not have any files ...
@@ -894,7 +895,9 @@ static int copy_files(unsigned long clon
 	 * This works because we cache current->files (old) as oldf. Don't
 	 * break this.
 	 */
+	if (patchlevel >= 2) task_lock(tsk);
 	tsk->files = NULL;
+	if (patchlevel >= 2) task_unlock(tsk);
 	newf = dup_fd(oldf, &error);
 	if (!newf)
 		goto out;
--- linux-2.6.18.x86_64/fs/file.c.orig
+++ linux-2.6.18.x86_64/fs/file.c
@@ -113,6 +113,7 @@ static void free_fdtable_rcu(struct rcu_
 	struct fdtable *fdt = container_of(rcu, struct fdtable, rcu);
 	int fdset_size, fdarray_size;
 	struct fdtable_defer *fddef;
+	extern int patchlevel;
 
 	BUG_ON(!fdt);
 	fdset_size = fdt->max_fdset / 8;
@@ -124,7 +125,10 @@ static void free_fdtable_rcu(struct rcu_
 		 * and the files structure itself was getting destroyed.
 		 * It is now safe to free the files structure.
 		 */
-		kmem_cache_free(files_cachep, fdt->free_files);
+		if (patchlevel == 2)
+			kmem_cache_free(files_cachep, fdt->free_files);
+		else
+			printk("free_fdtable_rcu: %p\n", fdt->free_files);
 		return;
 	}
 	if (fdt->max_fdset <= EMBEDDED_FD_SET_SIZE &&

Comment 60 Martin Wilck 2011-03-21 14:49:29 UTC
1100 reboots completed over the weekend. I am ashamed. In order to be really sure, I'll ask QA to do the same test on a different piece of hardware. I will get back to you.

Comment 61 Martin Wilck 2011-04-07 13:44:17 UTC
Confirmed on another system. After the BIOS upgrade, there are no more problems. I apologize for having wasted your time.

Comment 62 Dave Anderson 2011-04-07 14:08:13 UTC
(In reply to comment #61)
> Confirmed on another system. After the BIOS upgrade, there are no more
> problems. I apologize for having wasted your time.

Thanks Martin, that's good news...

Thinking about it, if it were just a simple matter of two entities 
touching the same data structure unlocked and simultaneously,
it's hard to logically explain the "shifting" issue of the corruption
that was seen.

Comment 63 Martin Wilck 2011-04-08 08:30:28 UTC
It is totally impossible for me to explain that shift (or add-one-to-an-otherwise-correct-pointer) issue with the known BIOS problems that were fixed in the new version, either. This problem definitely qualifies as mysterious.


Note You need to log in before you can comment on or make changes to this bug.