Description of problem: Pid: 4531, comm: modprobe Not tainted 2.6.18-194.17.1.el5 #1 RIP: 0010:[<ffffffff80023bd8>] [<ffffffff80023bd8>] filp_close+0x2b/0x64 RSP: 0018:ffff81047b93fed8 EFLAGS: 00010206 RAX: 01ffffffff8029c9 RBX: ffff81087e9e2a81 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff81047d7abc80 RDI: ffff81087e9e2a81 RBP: ffff81047d7abc80 R08: ffff81048e1b5000 R09: ffff81010fc85d40 R10: 00002ba042143870 R11: ffffffff800da536 R12: 0000000000000002 R13: ffff81047fc8e240 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffffffff803ca000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00002ba041e906e0 CR3: 0000000000201000 CR4: 00000000000006e0 Process modprobe (pid: 4531, threadinfo ffff81047b93e000, task ffff81047f9f5820) Stack: ffff81047d7abc80 0000000000000001 0000000000000002 ffffffff80038eb5 ffff81047d7abc80 ffff81047f9f5820 0000000000000001 00007fffb6e4fb30 0000000000000000 ffffffff80015886 0000000000000000 0000000000000000 Call Trace: [<ffffffff80038eb5>] put_files_struct+0x63/0xae [<ffffffff80015886>] do_exit+0x31c/0x911 [<ffffffff8004911d>] cpuset_exit+0x0/0x88 [<ffffffff8005d116>] system_call+0x7e/0x83 Version-Release number of selected component (if applicable): How reproducible: sporadic, in reboot test with Fujitsu PRIMERGY BX960S1 blade server. It was observed only on systems with iSCSI root so far, but I can't imagine the relation between iSCSI root and this problem. Steps to Reproduce: 1. reboot test on BX960S1 Actual results: See above Expected results: No panic Additional info: The fault occurs because %rax = 01ffffffff8029c9 is used as a pointer. The values is obviously a valid pointer address shifted by one (pointing to null_fops, the open file here is obviously /dev/null). The %rax value comes from the address pointed to by %rdi (ffff81087e9e2a81) which is obviously misaligned for a pointer. The %rdi value comes from the following code in close_files(), the xchg() call in line 471: 457 /* 458 * It is safe to dereference the fd table without RCU or 459 * ->file_lock because this is the last reference to the 460 * files structure. 461 */ 462 fdt = files_fdtable(files); 463 for (;;) { 464 unsigned long set; 465 i = j * __NFDBITS; 466 if (i >= fdt->max_fdset || i >= fdt->max_fds) 467 break; 468 set = fdt->open_fds->fds_bits[j++]; 469 while (set) { 470 if (set & 1) { 471 struct file * file = xchg(&fdt->fd[i], NULL); 472 if (file) 473 filp_close(file, files); 474 } 475 i++; 476 set >>= 1; 477 } 478 } 479 } The comment on top makes me speculate if some backported patches in RHEL5 may perhaps have invalidated the assumption that locking isn't necessary any more (note that files_fdtable() implies an rcu_dereference()).
We can't exclude a HW problem here, but we saw the problem on 4 different machines, always with the same stack (always modprobe, but the module being loaded was not always the same).
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative.
This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.
Martin, Engineering notes that this problem occurred with a 5.5.z errata. Have you seen it on the RHEL5.5GA bits? Trying to narrow the window.
I don't have that information at this time (need to ask a colleague who is sick today). However the problem was seen with the earlier errata kernel 2.6.18-194.8.1.el5 as well. It is hard to give a comprehensive answer on a large set of kernels, because reproducing this problem takes a long time (up to a week of reboot tests). If you have good reasons to think that it would be helpful to retest with a particular kernel, please tell me. It might also be worth to retry with 5.6. Please advise if 5.5GA, 5.6RC, or something else would be most promising - we need to make a good choice.
Ideally the issue should be bisected to a specific release, but given the problem with reproducing it, probably the second best solution would be to attempt to reproduce it with a RHEL 5.6 release -- especially given that any fix would have to go there first.
Because the investigation has been inactive on all parts for some time, we are currently trying to reproduce the problem with the unmodified system (still running 2.6.19-194.8.1.el5), instrumented with kdump. Once we captured a dump, we will upgrade to 5.6. Question: is KDUMP DUMPLEVEL 31 sufficient here, or do we need a more complete vmcore image?
(In reply to comment #14) > Because the investigation has been inactive on all parts for some time, we are > currently trying to reproduce the problem with the unmodified system (still > running 2.6.19-194.8.1.el5), instrumented with kdump. Once we captured a dump, > we will upgrade to 5.6. > > Question: is KDUMP DUMPLEVEL 31 sufficient here, or do we need a more complete > vmcore image? That's what I would have suggested first, but unfortunately in all of the provided logs, the crash occurs well before the kdump service is enabled.
You are right, of course, I missed that :-/ So we will test with 5.6, and if the issue is reproduced, we'll try the debug kernel and try to make progress with kdb.
kdb?
I was hoping there was kdb support in the RHEL debug kernel. If it's not, please advise how else we could make progress in the analysis of this problem.
Other than building kernels with printk's, I don't know. It would be interesting first to know whether the original fdtable (not the RCU copy) contained the bad file structure pointer in its array, and then work backwards from there. But given the timing of getting it to reproduce, it's probably going to be very slow progress figuring out where things went wrong.
The modprobe task's "tsk->files" points to a files_struct: struct files_struct { atomic_t count; struct fdtable *fdt; struct fdtable { unsigned int max_fds; int max_fdset; struct file **fd; fd_set *close_on_exec; fd_set *open_fds; struct rcu_head rcu; struct files_struct *free_files; struct fdtable *next; } fdtab; spinlock_t file_lock; int next_fd; struct embedded_fd_set close_on_exec_init; struct embedded_fd_set open_fds_init; struct file *fd_array[64]; } The "fdt" pointer above either points to the embedded "fdtab" fdtable above, or points to an external fdtable structure. If "fdt" points to the embedded fdtable, then the "fd" pointer in that table points to the "fd_array[64]" list of file structure pointers. And one of the fd_array[64] pointers is the corrupted pointer used by close_files(). If "fdt" points to an external fdtable structure, then that external structure's "fd" pointer points yet another external array of file structure pointers. And one of those external file structure pointers is the corrupted pointer used by close_files(). I'm not sure whether "modprobe" uses the embedded or an external fdtable. Complicating matters slightly is that the fdtable used by close_files() is an RCU copy of the original fdtable. Complicating matters even further is that perhaps there's a corruption associated with the "open_fds" fd_set bitmask that causes "set" to be incorrectly set, and that causing the use of an invalid file structure pointer?: fdt = files_fdtable(files); for (;;) { unsigned long set; i = j * __NFDBITS; if (i >= fdt->max_fdset || i >= fdt->max_fds) break; set = fdt->open_fds->fds_bits[j++]; while (set) { if (set & 1) { struct file * file = xchg(&fdt->fd[i], NULL); if (file) filp_close(file, files); } i++; set >>= 1; } } So anyway, what I was suggesting is to at least verify what type of fdtable is used (embedded or external), and to dump the file structure array if the calling task is named "modprobe"? What do you think?
Created attachment 478646 [details] serial log of EL5.6 (2.6.18-238.1.1.el5) The issue has been now reproduced on RHEL5.6. The panic attached occured after 82 reboots. On the same system, in the same test scenario, we see an alternating error condition where the NMI watchdog causes a panic in cache_alloc_refill(). More material on this can be find in case 00320375. This was original problem observed in this test scenario - for quite some time it hadn't been reproduced, and the filp_close() problem had been observed instead.
Created attachment 478649 [details] proposed debug-patch #1 OK, so at least we can rule out any Z-stream issues -- that's a start. But we're still rummaging around in the dark. I had put together the attached patch, which will at least narrow it down to whether we should be looking at embedded tables or not. It also would display the file struct address being passed to filp_close(). I wonder whether it's worth cycling through and dumping all of the file struct pointer prior to passing them one-at-a-time to filp_close()?
Thanks for the patch. I'll try it ASAP. I am just a little worried if it might make the problem go away by changing the timing. I had considered to check the "file" pointer after the xchg() call and if it's misaligned (((long) file & 3) != 0), dump the files_struct and fd table. However let's start with your approach.
(In reply to comment #23) > Thanks for the patch. I'll try it ASAP. Don't do anything yet -- I'm building a test kernel now, I'll test it, and if it's reasonable, I'll post it on my people.redhat.com page for you to test. > I am just a little worried if it might make the problem go away by changing the > timing. I had considered to check the "file" pointer after the xchg() call and > if it's misaligned (((long) file & 3) != 0), dump the files_struct and fd > table. However let's start with your approach. That may be very true -- but it's just as likely that the corruption occurred long before it even made it into the exit() path. My second patch will at least narrow that question down.
Ok, the files are here: http://people.redhat.com/anderson/bz647824 kernel-2.6.18-243.el5.bz647824.2.src.rpm kernel-2.6.18-243.el5.bz647824.2.x86_64.rpm kernel-debuginfo-2.6.18-243.el5.bz647824.2.x86_64.rpm kernel-debuginfo-common-2.6.18-243.el5.bz647824.2.x86_64.rpm linux-kernel-test.patch The linux-kernel-test.patch is different from the one that I posted earlier -- this one dumps all open files prior to entering the loop where each one is issued to filp_close(). FWIW, there are modprobe instances using both "embedded" and external data structure usage, for examples: [733] files: ffff8100338be0c0 files->fdt: ffff810033577cc0 &files->fdtab: ffff8100338be0d0 [733] fdt->fd: ffff810033718000 &files->fd_array: ffff8100338be158 [733] fdt->open_fds: ffff81003386b5c0 &files->open_fds_init: ffff8100338be150 [733] fd[0]: ffff8100360d1180 [733] fd[1]: ffff8100360d1180 [733] fd[2]: ffff8100360d1180 [733] fd[255]: ffff8100369ab6c0 [1500] files: ffff81003457ac80 files->fdt: ffff81003457ac90 &files->fdtab: ffff81003457ac90 (embedded) [1500] fdt->fd: ffff81003457ad18 &files->fd_array: ffff81003457ad18 (embedded) [1500] fdt->open_fds: ffff81003457ad10 &files->open_fds_init: ffff81003457ad10 (embedded) [1500] fd[0]: ffff8100360d37c0 [1500] fd[1]: ffff8100360d37c0 [1500] fd[2]: ffff8100360d37c0 Anyway, if the corruption has already occurred by the time the exit path is reached, we should see it in the open files list.
Reboot test is running with patched kernel now.
I have good and bad news. The good news is that the error occured with the patched kernel after ~80 reboots. The bad news is that in the error case, the debug output is invisible because the console loglevel was too low. Now I changed the config as follows: /etc/sysconfig/init: LOGLEVEL=8 /etc/sysconfig/syslog: KLOGD_OPTIONS="-x -c 8" I am seeing much more "fdt" messages now. Restarted the reboot test.
The test is still running. Let me make a few remarks. If I read the code correctly, the default for newly created processes is to use the embedded fdtable. Only if the max fd number goes to 64 (actually, BITS_PER_LONG) or higher, expand_fdtable() is called to allocate a larger table. When processes are forked, they inherit the file descriptor table, so a child forked from a parent with external fdtable will also have an external fdtable. Besides files->fdt, the individual elements fdt->fd[i] are normally also protected by RCU mechanisms. It strikes me that none of these references appear to be protected by rcu_read_lock() / rcu_read_unlock(), I thought that this is mandatory (http://lwn.net/Articles/262464/). It seems the locking is done through files_lock instead, *except in close_files()*. That makes me think (again) that taking files_lock might already solve our problem. The statistics of my current run show that only 264 of 14223 modprobe instances (<2%) run with the embedded fdtable. It seems weird that modprobe uses up so many fd's all the time - I regard it as more likely that these large fdtables are inherited from parent processes. Actually, I find it awkward that at the early stage during boot there are any processes using fd's > 63 (assuming that only large deamons with many sockets can really use so many). I suspect that the culprit here is bash, which has the habit to always allocate fd 255 (http://lkml.indiana.edu/hypermail/linux/kernel/9803.3/0173.html, http://lkml.indiana.edu/hypermail/linux/kernel/9910.3/0377.html). (Side note: this seems to me like a serious waste of resources on basically every Linux system out there, simply caused by bash's convention). Here are a few more observations: When we saw the error, the file descriptor causing the problem was always 2 (stderr), and it always pointed to /dev/null (the file->fop address in RAX can be identified as null_fops). The pointer was always off by exactly 1, as if the memory address had been incremented (it certainly doesn't look like a random overwrite). I am not sure what we can infer from that, but to me it looks like a race condition with fdt and/or fdt->fd[i].
> Besides files->fdt, the individual elements fdt->fd[i] are normally also > protected by RCU mechanisms. It strikes me that none of these references > appear to be protected by rcu_read_lock() / rcu_read_unlock(), I thought > that this is > mandatory (http://lwn.net/Articles/262464/). It seems the > locking is done > through files_lock instead, *except in close_files()*. > That makes me think (again) that taking files_lock might already solve > our problem. Right, I saw commit d11c563dd20ff35da5652c3e1c989d9e10e1d6d0, but it sounds like the changes to close_files() and put_files_struct() applied only for quieting an RCU-lockdep message rather than protecting anything: commit d11c563dd20ff35da5652c3e1c989d9e10e1d6d0 Author: Paul E. McKenney <paulmck.ibm.com> Date: Mon Feb 22 17:04:50 2010 -0800 sched: Use lockdep-based checking on rcu_dereference() Update the rcu_dereference() usages to take advantage of the new lockdep-based checking. Signed-off-by: Paul E. McKenney <paulmck.ibm.com> Cc: laijs.com Cc: dipankar.com Cc: mathieu.desnoyers Cc: josh Cc: dvhltc.com Cc: niv.com Cc: peterz Cc: rostedt Cc: Valdis.Kletnieks Cc: dhowells LKML-Reference: <1266887105-1528-6-git-send-email-paulmck.ibm.com> [ -v2: fix allmodconfig missing symbol export build failure on x86 ] Signed-off-by: Ingo Molnar <mingo> diff --git a/kernel/exit.c b/kernel/exit.c index 546774a..45ed043 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -85,7 +85,9 @@ static void __exit_signal(struct task_struct *tsk) BUG_ON(!sig); BUG_ON(!atomic_read(&sig->count)); - sighand = rcu_dereference(tsk->sighand); + sighand = rcu_dereference_check(tsk->sighand, + rcu_read_lock_held() || + lockdep_is_held(&tasklist_lock)); spin_lock(&sighand->siglock); posix_cpu_timers_exit(tsk); @@ -170,8 +172,10 @@ void release_task(struct task_struct * p) repeat: tracehook_prepare_release_task(p); /* don't need to get the RCU readlock here - the process is dead and - * can't be modifying its own credentials */ + * can't be modifying its own credentials. But shut RCU-lockdep up */ + rcu_read_lock(); atomic_dec(&__task_cred(p)->user->processes); + rcu_read_unlock(); proc_flush_task(p); @@ -473,9 +477,11 @@ static void close_files(struct files_struct * files) /* * It is safe to dereference the fd table without RCU or * ->file_lock because this is the last reference to the - * files structure. + * files structure. But use RCU to shut RCU-lockdep up. */ + rcu_read_lock(); fdt = files_fdtable(files); + rcu_read_unlock(); for (;;) { unsigned long set; i = j * __NFDBITS; @@ -521,10 +527,12 @@ void put_files_struct(struct files_struct *files) * at the end of the RCU grace period. Otherwise, * you can free files immediately. */ + rcu_read_lock(); fdt = files_fdtable(files); if (fdt != &files->fdtab) kmem_cache_free(files_cachep, files); free_fdtable(fdt); + rcu_read_unlock(); } } But it certainly would be worth testing with those rcu_read_lock's in place. Dave
(In reply to comment #29) > > but it sounds like the changes to close_files() and put_files_struct() > applied only for quieting an RCU-lockdep message rather than protecting > anything: Agreed. And given that rhel5 disable CONFIG_PREEMPT, rcu_read_lock() is nop.
(In reply to comment #30) > Agreed. And given that rhel5 disable CONFIG_PREEMPT, rcu_read_lock() is nop. Yes, but the sources should also work with CONFIG_PREEMT enabled. Anyway, my idea was to try spin_lock(&files->file_lock) in close_files(), and see if it solves our problem. However, let's wait for the current test run first.
The system has survived 180 reboots with Dave's logging patch and the config changes described in comment #27. That makes sense because timing changes during the critical code path are now (with slow serial console active) much stronger than before.
After more than 400 reboots with Dave's patch, I decided it makes no sense to continue. The added logging makes the problem go away.
Created attachment 479580 [details] currently used test patch I am now using this patch. It takes the file_lock before touching the files data structures. When a misaligned pointer is detected, it prints some information and calls BUG(). I will be out of office until March 7. Until then we should know if taking the file_lock solves this problem.
With the kernel with attachment #479580 [details], the has completed 1433 reboots without hitting this problem. That suggests that the patch actually helps. Please review it.
It may help -- but do you agree that it's masking rather than fixing the problem? Another entity is tinkering with the data after it should have no further references, and that would seem to be the issue at hand. I also wonder whether this can be reproduced with a RHEL6 or upstream kernel? Oleg, what are your thoughts on the matter?
(In reply to comment #36) > It may help -- but do you agree that it's masking rather than fixing > the problem? Not quite. IMO it is the right thing to do this under lock protection, see below. My main cause is to get rid of this panic. Finding a working patch has taken 3 months - finding the real cause of the problem (if you're right and there is one) might easily take much longer. I don't have that much time. > Another entity is tinkering with the data after it should have no > further references, and that would seem to be the issue at hand. You are referring to the comment "it is safe to dereference...", right? Why are you certain that comment is correct? All operations on the fd sets are done using RCU operations, I fail to see why there can't be any asynchronous stuff still running while the process is exiting. Perhaps the comment used to be correct some time in the past, but some changes are making it fail now? Maybe those changes are specific to the RHEL5 kernel? I've got no idea. Even if the comment is correct, I don't understand why it would be harmful to take the lock. > I also wonder whether this can be reproduced with a RHEL6 or > upstream kernel? We haven't observed it there yet. That doesn't mean it can't happen. It is sporadic on RHEL5, so some condition that makes it less likely might already cause it not to be observed in typical reboot test scenarios (we usually stop after 1000 successful reboots). > Oleg, what are your thoughts on the matter?
> You are referring to the comment "it is safe to dereference...", right? > Why are you certain that comment is correct? I'm not certain -- other than the fact that it still says that, and has no such lock protection, upstream. > All operations on the fd sets are done > using RCU operations, I fail to see why there can't be any asynchronous stuff > still running while the process is exiting. Perhaps the comment used to be > correct some time in the past, but some changes are making it fail now? Maybe > those changes are specific to the RHEL5 kernel? I've got no idea. Maybe so -- that's also why I'm deferring this to Oleg. > Even if the comment is correct, I don't understand why it would be > harmful to take the lock. I'm not sure either -- but I'd worry that since the locking seems to have "fixed" the problem, that after the file_lock is taken in close_files(), file-related data may get free'd by __fput(), and so when the lock is subsequently released, what's to prevent the "other" entity from immediately touching stuff that's been freed? It's just that your patch is just a hard-sell at this point -- and not my call -- for getting it ACK'd internally. And since Oleg is the expert in this area, that's why I'm soliciting his opinion.
(In reply to comment #36) > > Another entity is tinkering with the data after it should have no > further references, and that would seem to be the issue at hand. > > I also wonder whether this can be reproduced with a RHEL6 or > upstream kernel? Yes, agreed. > Oleg, what are your thoughts on the matter? Certainly I am not the right person to ask ;) I am not expert. And I can't understand why this patch makes the difference. I do not even understand why close_files() does xchg(NULL).
(In reply to comment #38) > And since Oleg is the expert > in this area, that's why I'm soliciting his opinion. No, I am not. I'll try to grep more... Perhaps it makes sense to look into call_usermodehelper() paths, but I doubt I'll find something interesring. But there is something I can't understand rom the very beginning. OK, it seems that the kernel crashes trying to call filp->f_op->flush(). The address is RAX = 01ffffffff8029c9. From #1: %rax = 01ffffffff8029c9 is used as a pointer. The values is obviously a valid pointer address shifted by one (pointing to null_fops) could you explain what "shifted by one" actually means? IOW, what is the numeric value of null_fops? The %rax value comes from the address pointed to by %rdi (ffff81087e9e2a81) So, rdi contains the address of ->f_op? Hmm. I do not have the full asm, but https://bugzilla.redhat.com/attachment.cgi?id=478646 shows Code: 48 8b 40 68 48 85 c0 74 07 ff d0 41 89 c4 eb 03 45 31 e4 48 This is: 0: 48 8b 40 68 mov 0x68(%rax),%rax 4: 48 85 c0 test %rax,%rax 7: 74 07 je 0x10 9: ff d0 callq *%rax b: 41 89 c4 mov %eax,%r12d e: eb 03 jmp 0x13 10: 45 31 e4 xor %r12d,%r12d 13: 48 rex.W but 0x68(%rax) means ->release, not ->flush. Confused.
(In reply to comment #40) > > No, I am not. I'll try to grep more... Yes, I know very little about this fdt code. And probably I am wrong. But so far it seems to me that close_file() has all rights to do BUG_ON(spin_is_locked(files->file_lock)). Who else can play (and change!) the struct we are going to free. Strange...
(In reply to comment #39) > > I do not even understand why close_files() does xchg(NULL). sorry for noise... at least now I see why we need xchg ;)
(In reply to comment #42) > (In reply to comment #39) > > > > I do not even understand why close_files() does xchg(NULL). > > sorry for noise... at least now I see why we need xchg ;) Well, I know far less about this code than you do -- so why does it need to use xchg()? > but 0x68(%rax) means ->release, not ->flush. Confused. Actually 0x68 is the offset of the file_operations->flush operation: crash> file_operations -o struct file_operations { [0x0] struct module *owner; [0x8] loff_t (*llseek)(struct file *, loff_t, int); [0x10] ssize_t (*read)(struct file *, char *, size_t, loff_t *); [0x18] ssize_t (*aio_read)(struct kiocb *, char *, size_t, loff_t); [0x20] ssize_t (*write)(struct file *, const char *, size_t, loff_t *); [0x28] ssize_t (*aio_write)(struct kiocb *, const char *, size_t, loff_t); [0x30] int (*readdir)(struct file *, void *, filldir_t); [0x38] unsigned int (*poll)(struct file *, struct poll_table_struct *); [0x40] int (*ioctl)(struct inode *, struct file *, unsigned int, long unsigned int); [0x48] long int (*unlocked_ioctl)(struct file *, unsigned int, long unsigned int); [0x50] long int (*compat_ioctl)(struct file *, unsigned int, long unsigned int); [0x58] int (*mmap)(struct file *, struct vm_area_struct *); [0x60] int (*open)(struct inode *, struct file *); [0x68] int (*flush)(struct file *, fl_owner_t); [0x70] int (*release)(struct inode *, struct file *); [0x78] int (*fsync)(struct file *, struct dentry *, int); [0x80] int (*aio_fsync)(struct kiocb *, int); ...
> could you explain what "shifted by one" actually means? IOW, what > is the numeric value of null_fops? If "RAX: 01ffffffff8029c9" is true, then in that kernel, "null_fops" is that address shifted by 1-byte, or 0xffffffff8029c900. Here, from the 2.6.18-194.17.1.el5 vmlinux file: $ nm -Bn /tmp/vmlinux | grep null_fops ffffffff8029c900 r null_fops $
And from that other kernel oops trace example: > Pid: 4951, comm: modprobe Not tainted 2.6.18-238.1.1.el5 #1 > RIP: 0010:[<ffffffff80023c09>] [<ffffffff80023c09>] filp_close+0x2b/0x64 > RSP: 0018:ffff810478bfbed8 EFLAGS: 00010206 > RAX: 01ffffffff802a44 In the 2.6.18-238.1.1.el5 kernel: $ nm -Bn /tmp/vmlinux | grep null_fops ffffffff802a44e0 r null_fops $
(In reply to comment #43) > > > sorry for noise... at least now I see why we need xchg ;) > > Well, I know far less about this code than you do -- so why > does it need to use xchg()? Cough. I hate myself. No, I do not see why it is needed. Exactly because we are the last owner, we are going to free it. > > but 0x68(%rax) means ->release, not ->flush. Confused. > > Actually 0x68 is the offset of the file_operations->flush operation: Ah. Indeed, I was looking into upstream sources, thanks. So, at least this is clear.
(In reply to comment #35) > With the kernel with attachment #479580 [details], > the has completed 1433 reboots without hitting this > problem. That suggests that the patch actually helps. > Please review it. (In reply to comment #41) > Yes, I know very little about this fdt code. And > probably I am wrong. But so far it seems to me > that close_file() has all rights to do > BUG_ON(spin_is_locked(files->file_lock)). Who else > can play (and change!) the struct we are going to free. Clearly we're stumbling around in the dark with no idea why there is another entity that is tinkering around with an exiting modprobe's open files data while it's in the process of closing them all. Furthermore, as Oleg mentioned, there should be no situation where the files_struct.file_lock should be locked upon entry to close_files(). The modprobe task is single-threaded, and the files_struct and associated fdtable are unique to that task. Martin's last patch put a spin_lock(&files->lock_file) at the beginning and end of close_files(). Empirically that patch makes it appear that the spin_lock either kept the offender away until close_files() was done with its work, or perhaps it allowed the offender run first? In either case, let's prove it... In a first-step attempt to track down the external offender, I've built a kernel with the following patch, which: (1) adds a new "debug" field to the files_struct. (2) upon entry to close_files(), it does a spin_trylock() on the files->file_lock, and if the lock is already held, then BUG() immediately. (3) otherwise, if we're a "modprobe", then set the files->debug field, and hold the lock. (4) or if we're not a "modprobe", just release the lock. (5) after close_files() is complete, the files_struct is normally freed by either put_files_struct(), or by free_fdtable_rcu() if the associated fdtable is embedded; in either case, if the "files->debug" field is set, the files_struct is not freed. Therefore, all files_struct's that are used by modprobe tasks will remain locked and un-freed forever when the task exits. Doing this should hopefully yield a clue as to who the external offender is. Granted, if we see step (2) occur, then we'll have to add additional code to try to track down who currently owns the lock. But it still would be very interesting on its own. But if the offender tries to take the lock *after* step (2), then eventually the NMI watchdog should fire, and the context of the offender would be caught in the act. I've rebooted the test kernel in a loop over-night with no problems, so it should work in the normal case. The kernel is here: http://people.redhat.com/anderson/bz647824 kernel-2.6.18-247.el5.bz647824.1.src.rpm kernel-2.6.18-247.el5.bz647824.1.x86_64.rpm kernel-debuginfo-2.6.18-247.el5.bz647824.1.x86_64.rpm kernel-debuginfo-common-2.6.18-247.el5.bz647824.1.x86_64.rpm linux-kernel-test.patch And here's the patch: --- linux-2.6.18.x86_64/include/linux/file.h.orig +++ linux-2.6.18.x86_64/include/linux/file.h @@ -60,6 +60,9 @@ struct files_struct { struct embedded_fd_set close_on_exec_init; struct embedded_fd_set open_fds_init; struct file * fd_array[NR_OPEN_DEFAULT]; +#ifndef __GENKSYMS__ + unsigned long debug; +#endif }; #define files_fdtable(files) (rcu_dereference((files)->fdt)) --- linux-2.6.18.x86_64/kernel/exit.c.orig +++ linux-2.6.18.x86_64/kernel/exit.c @@ -468,6 +468,13 @@ static void close_files(struct files_str * ->file_lock because this is the last reference to the * files structure. */ + if (!spin_trylock(&files->file_lock)) + BUG(); + else if (strcmp(current->comm, "modprobe") == 0) + files->debug = 0xdeadbeef; + else + spin_unlock(&files->file_lock); + fdt = files_fdtable(files); for (;;) { unsigned long set; @@ -515,8 +522,12 @@ void fastcall put_files_struct(struct fi fdt = files_fdtable(files); if (fdt == &files->fdtab) fdt->free_files = files; - else - kmem_cache_free(files_cachep, files); + else { + if (files->debug) + printk("put_files_struct: %p %lx\n", files, files->debug); + else + kmem_cache_free(files_cachep, files); + } free_fdtable(fdt); } } --- linux-2.6.18.x86_64/fs/file.c.orig +++ linux-2.6.18.x86_64/fs/file.c @@ -124,7 +124,11 @@ static void free_fdtable_rcu(struct rcu_ * and the files structure itself was getting destroyed. * It is now safe to free the files structure. */ - kmem_cache_free(files_cachep, fdt->free_files); + if (fdt->free_files->debug) + printk("free_fdtable_rcu: %p %lx\n", + fdt->free_files, fdt->free_files->debug); + else + kmem_cache_free(files_cachep, fdt->free_files); return; } if (fdt->max_fdset <= EMBEDDED_FD_SET_SIZE && --- linux-2.6.18.x86_64/kernel/fork.c.orig +++ linux-2.6.18.x86_64/kernel/fork.c @@ -752,6 +752,7 @@ static struct files_struct *alloc_files( goto out; atomic_set(&newf->count, 1); + newf->debug = 0; spin_lock_init(&newf->file_lock); newf->next_fd = 0;
Created attachment 486027 [details] [may be patch] unshare_files()->copy_files() must not change tsk->files lockless Well, it seems I found the bug... The problem is, I doubt it can explain this particular problem, the race is almost theoretical. But it would be nice to try this patch anyway, imho. We do not have another theory so far. And this race _can_ explain double free/use, at least in theory.
(In reply to comment #48) > > Created attachment 486027 [details] > [may be patch] unshare_files()->copy_files() must not change tsk->files > lockless > > Well, it seems I found the bug... > > The problem is, I doubt it can explain this particular problem, > the race is almost theoretical. But it would be nice to try this > patch anyway, imho. > > We do not have another theory so far. And this race _can_ explain > double free/use, at least in theory. Hmm. I attached the patch with the changelog, but it turns out https://bugzilla.redhat.com/attachment.cgi?id=486027&action=diff do not want to show it. I am copying it here... [RHEL5 PATCH] unshare_files()->copy_files() must not change tsk->files lockless Unless we know task->files == NULL, we should never change ->files lockless. Otherwise we can race with, say, get_files_struct() doing files = task->files; // ------ WINDOW --------- if (files) atomic_inc(&files->count); If the caller of unshare_files() comes in the window above and does the necessary put_files_struct(), atomic_inc() can increment ->count after atomic_dec_and_test() == T. Note: - This is the minimal hack. The real fix should change unshare_files/callers to use unshare_fd() - The usage of unshare_files() in exec paths is strange, it is called twice. Say, load_elf_binary() calls it, then it calls flush_old_exec() which does unshare too. I do not see any serious problem, but at least this means that "atomic_read(&files->count) == 1" fastpath never works, we always do copy_files(). unshare_files() should be called by do_execve() like upstream does, but this requires many changes.
Martin, Can you first build and test a kernel with just Oleg's patch?
Thanks to both of you for the analysis and patches. Meanwhile, I have been pursuing another trace. Before I started the test run mentioned in comment #35, I also had updated the BIOS of the machine (in the meantime I had encountered another problem known to be caused by BIOS bug, which made me realize the BIOS wasn't up-to-date). After I finished the test run from comment #35 I figured that I should try to reproduce the problem with the new BIOS. This test is now running - the original RHEL kernel 2.6.18-238.1.1.el5 has completed 350 reboots already on this machine without failing (it had failed after ~80 reboots before, see comment #21). Sooo, it is at least possible that this whole thing wasn't a RHEL kernel problem but a BIOS issue after all. I have not the slightest idea what the relation to the BIOS might be, the stack trace certainly doesn't suggest so. Let's hold on until early next week. By that time we should know whether the problem is still reproducable (if the kernel reaches ~1000 reboots without failing). Before the BIOS update, the problem was nicely reproducable typically with <100 reboots. Martin
(In reply to comment #47) > > Furthermore, as Oleg mentioned, there should be no situation > where the files_struct.file_lock should be locked upon entry > to close_files(). The modprobe task is single-threaded, and > the files_struct and associated fdtable are unique to that > task. Yes... Once again, I never looked at this code before, I can be wrong. But, according to grep, every caller of spin_lock(files->file_lock) should have a reference to files. Either the caller uses current->files, or it does get_files_struct(). And the only caller of close_files() is put_files_struct() which does atomic_dec_and_test(). And. In any case, if fdt != &files->fdtab, we are going to kmem_cache_free(files) immediately. This means that nobody else must even try to take this lock. rcu protects fdtable, not files_struct. I found the race which can lead to use-after-free, but I doubt it can explain this particular problem. So probably it makes sense to try this patch as well. One note... > In a first-step attempt to track down the external offender, > I've built a kernel with the following patch, which: > > (1) adds a new "debug" field to the files_struct. > (2) upon entry to close_files(), it does a spin_trylock() on > the files->file_lock, and if the lock is already held, then > BUG() immediately. > (3) otherwise, if we're a "modprobe", then set the files->debug > field, and hold the lock. > (4) or if we're not a "modprobe", just release the lock. (4) doesn't look right. If modprobe races with somebody else, it should drop the lock. This decreases the chance modprobe can notice the !spin_trylock(). Not that I think it makes sense to redo the patch, just a nit. > (5) after close_files() is complete, the files_struct is normally > freed by either put_files_struct(), or by free_fdtable_rcu() > if the associated fdtable is embedded; in either case, if > the "files->debug" field is set, the files_struct is not freed. Looks reasonable...
> And. In any case, if fdt != &files->fdtab, we are going to > kmem_cache_free(files) immediately. This means that nobody > else must even try to take this lock. rcu protects fdtable, > not files_struct. AFAICS, rcu doesn't protect anything at all, because none of the files_fdtable() calls I have seen are properly nested with rcu_read_lock() / rcu_read_unlock(). My hypothesis was that, for some reason, the RHEL5 kernel still uses files->file_lock for protecting the fdtable, rather than RCU mechanisms. Therefore it occured to me that taking the lock in close_files might not be the worst idea, after all. However, please let's just wait for the outcome of the test in comment #51 first.
> However, please let's just wait for the outcome of the test in comment #51 > first. No argument there... But Oleg: > (4) doesn't look right. If modprobe races with somebody else, it > should drop the lock. This decreases the chance modprobe can > notice the !spin_trylock(). Not that I think it makes sense to > redo the patch, just a nit. I wasn't sure whether it would be helpful if non-modprobe tasks held the lock throughout close_files() -- but I thought it was enough that they at least try to take the lock. And I didn't think that non-modprobe tasks should leak/freeze the files_struct. But since the test only consists of a reboot-loop, we could easily: (1) have *all* tasks hold the lock throughout close_files(), and (2) leak/freeze *all* files_structs regardless whether they are modprobe tasks. Is that what you mean?
(In reply to comment #53) > > > And. In any case, if fdt != &files->fdtab, we are going to > > kmem_cache_free(files) immediately. This means that nobody > > else must even try to take this lock. rcu protects fdtable, > > not files_struct. > > AFAICS, rcu doesn't protect anything at all, because none of the > files_fdtable() calls I have seen are properly nested with rcu_read_lock() / > rcu_read_unlock(). It does, see fget() for example. > My hypothesis was that, for some reason, the RHEL5 kernel > still uses files->file_lock for protecting the fdtable, rather than RCU > mechanisms. This doesn't matter. If we hold files->file_lock, we should have the reference to file. Otherwise it can go way under us. And, > Therefore it occured to me that taking the lock in close_files > might not be the worst idea, after all. This can help if "something else" takes ->file_lock after close_files() returns.
(In reply to comment #54) > > I wasn't sure whether it would be helpful if non-modprobe tasks held > the lock throughout close_files() -- but I thought it was enough that > they at least try to take the lock. Yes... but I don't really understand what yopu mean ;) > But since the test only consists of a reboot-loop, we could easily: > > (1) have *all* tasks hold the lock throughout close_files(), and > (2) leak/freeze *all* files_structs regardless whether they are > modprobe tasks. Agreed, looks even better. And mark it "freed" unconditionally.
(In reply to comment #55) > > AFAICS, rcu doesn't protect anything at all, because none of the > > files_fdtable() calls I have seen are properly nested with rcu_read_lock() / > > rcu_read_unlock(). > > It does, see fget() for example. I see it done for fcheck_files(), but there are a lot of uses of file_fdtable() where the RCU protection is at least not obvious.
(In reply to comment #57) > (In reply to comment #55) > > > > AFAICS, rcu doesn't protect anything at all, because none of the > > > files_fdtable() calls I have seen are properly nested with rcu_read_lock() / > > > rcu_read_unlock(). > > > > It does, see fget() for example. > > I see it done for fcheck_files(), but there are a lot of uses of file_fdtable() > where the RCU protection is at least not obvious. Of course. But I think you missed the point. Or I misunderstood. (but let me repeat once again: I do not pretend I am expert ;) Any user of file_fdtable(files) should have a reference to "files" to begin with. And files_struct is refcounted, it is not protected by RCU. If atomic_dec_and_test(files->count) succeeds, nobody else should ever look at this memory. In particular, nobody else should take or just dereference files->files_lock. If spin_lock() really helps, this can only mean it masks the problem. We should find the reason. And: one of the possible reasons is double free.
(In reply to comment #53) > However, please let's just wait for the outcome of the test in > comment #51 first. If you still see the failure after the BIOS update, I've updated the debug kernel to collaborate Oleg's patch to copy_files() plus his suggestion to fortify my last patch. It's essentially 3 debug kernels in one, depending upon the "patch" level, which can only have values of 1 (the default), 2 or 3, and can be set on the kernel boot command line like this: title Red Hat Enterprise Linux Client (2.6.18-247.el5.bz647824.3) root (hd0,0) kernel /vmlinuz-2.6.18-247.el5.bz647824.3 ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 rhgb quiet patch=3 initrd /initrd-2.6.18-247.el5.bz647824.3.img o patch=1 (default -- does not need to be explicitly set) Upon entering close_files(), each files_struct is locked and is never freed. If the files_struct is already locked upon entering close_files(), it BUG()'s. o patch=2 This will only invoke Oleg's patch to copy_files(). o patch=3 This will invoke both my debug code and Oleg's patch. The kernel is at http://people.redhat.com/anderson/bz647824 : kernel-2.6.18-247.el5.bz647824.3.src.rpm kernel-2.6.18-247.el5.bz647824.3.x86_64.rpm kernel-debuginfo-2.6.18-247.el5.bz647824.3.x86_64.rpm kernel-debuginfo-common-2.6.18-247.el5.bz647824.3.x86_64.rpm linux-kernel-test.patch This is the patch: --- linux-2.6.18.x86_64/arch/x86_64/kernel/setup.c.orig +++ linux-2.6.18.x86_64/arch/x86_64/kernel/setup.c @@ -303,6 +303,8 @@ static int fullarg(char *p, char *arg) return !memcmp(p, arg, l) && (p[l] == 0 || isspace(p[l])); } +int patchlevel = 1; + static __init void parse_cmdline_early (char ** cmdline_p) { char c = ' ', *to = command_line, *from = COMMAND_LINE; @@ -408,6 +410,11 @@ static __init void parse_cmdline_early ( if (fullarg(from,"oops=panic")) panic_on_oops = 1; + if (fullarg(from, "patch=2")) + patchlevel = 2; + if (fullarg(from, "patch=3")) + patchlevel = 3; + if (!memcmp(from, "noexec=", 7)) nonx_setup(from + 7); --- linux-2.6.18.x86_64/kernel/exit.c.orig +++ linux-2.6.18.x86_64/kernel/exit.c @@ -460,6 +460,7 @@ static void close_files(struct files_str { int i, j; struct fdtable *fdt; + extern int patchlevel; j = 0; @@ -468,6 +469,8 @@ static void close_files(struct files_str * ->file_lock because this is the last reference to the * files structure. */ + if ((patchlevel != 2) && !spin_trylock(&files->file_lock)) + BUG(); fdt = files_fdtable(files); for (;;) { unsigned long set; @@ -503,6 +506,7 @@ struct files_struct *get_files_struct(st void fastcall put_files_struct(struct files_struct *files) { struct fdtable *fdt; + extern int patchlevel; if (atomic_dec_and_test(&files->count)) { close_files(files); @@ -515,8 +519,12 @@ void fastcall put_files_struct(struct fi fdt = files_fdtable(files); if (fdt == &files->fdtab) fdt->free_files = files; - else - kmem_cache_free(files_cachep, files); + else { + if (patchlevel == 2) + kmem_cache_free(files_cachep, files); + else + printk("put_files_struct: %p\n", files); + } free_fdtable(fdt); } } --- linux-2.6.18.x86_64/kernel/fork.c.orig +++ linux-2.6.18.x86_64/kernel/fork.c @@ -876,6 +876,7 @@ static int copy_files(unsigned long clon { struct files_struct *oldf, *newf; int error = 0; + extern int patchlevel; /* * A background process may not have any files ... @@ -894,7 +895,9 @@ static int copy_files(unsigned long clon * This works because we cache current->files (old) as oldf. Don't * break this. */ + if (patchlevel >= 2) task_lock(tsk); tsk->files = NULL; + if (patchlevel >= 2) task_unlock(tsk); newf = dup_fd(oldf, &error); if (!newf) goto out; --- linux-2.6.18.x86_64/fs/file.c.orig +++ linux-2.6.18.x86_64/fs/file.c @@ -113,6 +113,7 @@ static void free_fdtable_rcu(struct rcu_ struct fdtable *fdt = container_of(rcu, struct fdtable, rcu); int fdset_size, fdarray_size; struct fdtable_defer *fddef; + extern int patchlevel; BUG_ON(!fdt); fdset_size = fdt->max_fdset / 8; @@ -124,7 +125,10 @@ static void free_fdtable_rcu(struct rcu_ * and the files structure itself was getting destroyed. * It is now safe to free the files structure. */ - kmem_cache_free(files_cachep, fdt->free_files); + if (patchlevel == 2) + kmem_cache_free(files_cachep, fdt->free_files); + else + printk("free_fdtable_rcu: %p\n", fdt->free_files); return; } if (fdt->max_fdset <= EMBEDDED_FD_SET_SIZE &&
1100 reboots completed over the weekend. I am ashamed. In order to be really sure, I'll ask QA to do the same test on a different piece of hardware. I will get back to you.
Confirmed on another system. After the BIOS upgrade, there are no more problems. I apologize for having wasted your time.
(In reply to comment #61) > Confirmed on another system. After the BIOS upgrade, there are no more > problems. I apologize for having wasted your time. Thanks Martin, that's good news... Thinking about it, if it were just a simple matter of two entities touching the same data structure unlocked and simultaneously, it's hard to logically explain the "shifting" issue of the corruption that was seen.
It is totally impossible for me to explain that shift (or add-one-to-an-otherwise-correct-pointer) issue with the known BIOS problems that were fixed in the new version, either. This problem definitely qualifies as mysterious.