Description of problem: There is a kernel OOPS and crash dump in certain processes during I/O operations. There is little in common except the top of the backtrace, which routinely shows errors being thrown in do_page_fault. Version-Release number of selected component (if applicable): 2.6.18-92.1.17.el5 stock Linux kernel (version of crash is 4.0-5.0.3) How reproducible: Not very. It has happened during a backup process and during an rsync, so it appears to happen during times of high I/O usage. There's no apparent regularity and there's no high load otherwise on the machine, and there's no one module which appears to be tickling the bug. Actual results: Kernel OOPS and crash dump. Expected results: No kernel OOPS or crash dump. Additional info: Here's some basic info using the crash dump utility (just the initial crash output and a backtrace) from two such dumps which happened today. (rsync): [root@aardvark scratch]# crash /boot/System.map-2.6.18-92.1.17.el5 /usr/lib/debug/lib/modules/2.6.18-92.1.17.el5/vmlinux /scratch/2008-12-11-13\:17/vmcore ... SYSTEM MAP: /boot/System.map-2.6.18-92.1.17.el5 DEBUG KERNEL: /usr/lib/debug/lib/modules/2.6.18-92.1.17.el5/vmlinux (2.6.18-92.1.17.el5) DUMPFILE: /scratch/2008-12-11-13:17/vmcore CPUS: 2 DATE: Thu Dec 11 13:16:33 2008 UPTIME: 3 days, 22:27:13 LOAD AVERAGE: 0.25, 0.37, 0.29 TASKS: 97 NODENAME: aardvark.cs.wisc.edu RELEASE: 2.6.18-92.1.17.el5 VERSION: #1 SMP Wed Oct 22 04:19:07 EDT 2008 MACHINE: i686 (2992 Mhz) MEMORY: 1 GB PANIC: "Oops: 0000 [#1]" (check log for details) PID: 5358 COMMAND: "rsync" TASK: dd2c4000 [THREAD_INFO: dad60000] CPU: 0 STATE: TASK_RUNNING (PANIC) crash> bt PID: 5358 TASK: dd2c4000 CPU: 0 COMMAND: "rsync" #0 [dad60b80] crash_kexec at c0444202 #1 [dad60bc4] die at c04064ae #2 [dad60bf4] do_page_fault at c060ac56 #3 [dad60c2c] error_code (via page_fault) at c0405a6f EAX: 0000003e EBX: 30363773 ECX: 7776e92c EDX: 00000b12 EBP: 00000007 DS: 007b ESI: dad60cd0 ES: 007b EDI: dad60d2c CS: 0060 EIP: f961b9ec ERR: ffffffff EFLAGS: 00010202 #4 [dad60c60] afs_FindVCache at f961b9ec #5 [dad60c74] afs_GetVCache at f961da12 #6 [dad60ce4] afs_lookup at f962716e #7 [dad60d7c] afs_linux_lookup at f964a677 #8 [dad60dec] do_lookup at c047c33c #9 [dad60e10] __link_path_walk at c047e0f0 #10 [dad60e64] link_path_walk at c047e5f2 #11 [dad60ecc] do_path_lookup at c047e9bf #12 [dad60ee8] __user_walk_fd at c047f103 #13 [dad60f00] vfs_lstat_fd at c0478cba #14 [dad60f64] sys_lstat64 at c0478d26 #15 [dad60fb8] system_call at c0404ef8 EAX: 000000c4 EBX: bfbe5470 ECX: bfbe5390 EDX: 00715ff4 DS: 007b ESI: bfbe5390 ES: 007b EDI: 0000000a SS: 007b ESP: bfbe30f0 EBP: bfbe30f8 CS: 0073 EIP: 00825402 ERR: 000000c4 EFLAGS: 00000212 (sushiremote, our backups process): [root@bombus-vestalis crash]# crash /boot/System.map-2.6.18-92.1.17.el5 /usr/lib/debug/lib/modules/2.6.18-92.1.17.el5/vmlinux 2008-12-11-03\:09/vmcore ... SYSTEM MAP: /boot/System.map-2.6.18-92.1.17.el5 DEBUG KERNEL: /usr/lib/debug/lib/modules/2.6.18-92.1.17.el5/vmlinux (2.6.18-92.1.17.el5) DUMPFILE: 2008-12-11-03:09/vmcore CPUS: 2 DATE: Thu Dec 11 03:08:43 2008 UPTIME: 17:03:33 LOAD AVERAGE: 4.53, 1.47, 0.57 TASKS: 115 NODENAME: bombus-vestalis.cs.wisc.edu RELEASE: 2.6.18-92.1.17.el5 VERSION: #1 SMP Wed Oct 22 04:19:07 EDT 2008 MACHINE: i686 (3400 Mhz) MEMORY: 1 GB PANIC: "Oops: 0000 [#1]" (check log for details) PID: 12881 COMMAND: "sushiremote" TASK: dcd41000 [THREAD_INFO: ec7f0000] CPU: 0 STATE: TASK_RUNNING (PANIC) crash> bt PID: 12881 TASK: dcd41000 CPU: 0 COMMAND: "sushiremote" #0 [ec7f0eb8] crash_kexec at c0444202 #1 [ec7f0efc] die at c04064ae #2 [ec7f0f2c] do_page_fault at c060ac56 #3 [ec7f0f64] error_code (via page_fault) at c0405a6f EAX: 000001b6 EBX: 00008241 ECX: 000001a4 EDX: 000000e8 EBP: 00000004 DS: 007b ESI: ffffff9c ES: 007b EDI: 00000003 CS: 0060 EIP: c046fb67 ERR: ffffffff EFLAGS: 00010207 #4 [ec7f0f98] do_sys_open at c046fb67 #5 [ec7f0fb0] sys_open at c046fbe9 #6 [ec7f0fb8] system_call at c0404ef8 EAX: 00000005 EBX: 0967a0e0 ECX: 00008241 EDX: 000001b6 DS: 007b ESI: 00008241 ES: 007b EDI: 0029dff4 SS: 007b ESP: bfae9a78 EBP: bfae9a98 CS: 0073 EIP: 00b15402 ERR: 00000005 EFLAGS: 00000206
We've had about 10 crash dumps in the past two days. They all show the OOPS error and do_page_fault as #2 on the backtrace. Some of them crashed when running vi and vim. We've also had quite a few kernel panics in this same time. Unfortunately these hosts don't crash dump, but I suspect a similar problem. This might be a bigger problem than I initially thought.
One of my co-workers sent in a bug report to the OpenAFS folks. I can't say for certain, but it might be related to this. http://rt.central.org/rt/Ticket/Display.html?id=123995 A few more have crashed in the past couple days. vmcores are available upon request.
Point of interest, this is still happening on the RHEL 5.3 kernel (2.6.18-128.el5)
Could you please attach the vmcore and the full dmesg log?
Created attachment 357665 [details] The dmesg output from a crashdump'd host
I am unable to upload the 2 GB vmcore file. Please advise.
Created attachment 357837 [details] The vmcore from a recent crashdump (2 GB) The kernel is kernel 2.6.18-128.2.1.el5 on an i686, but this has happened for several kernels.
Because Danny is in China, he won't be seeing this for several more hours, so I took the liberty of looking at the vmcore. There's a bunch of AFS-related soft lockup error messages, which in all probability have something to do with this eventual page fault BUG: BUG: unable to handle kernel NULL pointer dereference at virtual address 0000000b The crash occurred in do_sys_open(): crash> bt PID: 22328 TASK: dbdeb000 CPU: 0 COMMAND: "sushiremote" #0 [eb62beb8] crash_kexec at c0442d4e #1 [eb62befc] die at c04064c6 #2 [eb62bf2c] do_page_fault at c0611537 #3 [eb62bf64] error_code (via page_fault) at c0405a87 EAX: 000001b6 EBX: 00008241 ECX: 000001b6 EDX: 0000007c EBP: 00000006 DS: 007b ESI: ffffff9c ES: 007b EDI: 00000003 CS: 0060 EIP: c04712df ERR: ffffffff EFLAGS: 00010207 #4 [eb62bf98] do_sys_open at c04712df #5 [eb62bfb0] sys_open at c0471361 #6 [eb62bfb8] system_call at c0404f10 EAX: 00000005 EBX: 0a148548 ECX: 00008241 EDX: 000001b6 DS: 007b ESI: 00008241 ES: 007b EDI: 00379ff4 SS: 007b ESP: bf8bc8c8 EBP: bf8bc8e8 CS: 0073 EIP: 00a2d402 ERR: 00000005 EFLAGS: 00000206 crash> at address c04712df: crash> dis -rl c04712df /usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/fs/open.c: 1156 0xc047128b <do_sys_open>: push %ebp 0xc047128c <do_sys_open+1>: push %edi 0xc047128d <do_sys_open+2>: push %esi 0xc047128e <do_sys_open+3>: mov %eax,%esi 0xc0471290 <do_sys_open+5>: push %ebx 0xc0471291 <do_sys_open+6>: mov %edx,%eax 0xc0471293 <do_sys_open+8>: sub $0x4,%esp 0xc0471296 <do_sys_open+11>: mov %ecx,%ebx /usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/fs/open.c: 1157 0xc0471298 <do_sys_open+13>: call 0xc047f47e <getname> /usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/fs/open.c: 1160 0xc047129d <do_sys_open+18>: cmp $0xfffff000,%eax 0xc04712a2 <do_sys_open+23>: mov %eax,%ebp /usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/fs/open.c: 1157 0xc04712a4 <do_sys_open+25>: mov %eax,(%esp) /usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/fs/open.c: 1160 0xc04712a7 <do_sys_open+28>: ja 0xc0471331 <do_sys_open+166> /usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/fs/open.c: 1161 0xc04712ad <do_sys_open+34>: call 0xc0470f87 <get_unused_fd> /usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/fs/open.c: 1162 0xc04712b2 <do_sys_open+39>: test %eax,%eax /usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/fs/open.c: 1161 0xc04712b4 <do_sys_open+41>: mov %eax,%ebp /usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/fs/open.c: 1162 0xc04712b6 <do_sys_open+43>: js 0xc0471329 <do_sys_open+158> /usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/fs/open.c: 1163 0xc04712b8 <do_sys_open+45>: pushl 0x18(%esp) 0xc04712bc <do_sys_open+49>: mov %ebx,%ecx 0xc04712be <do_sys_open+51>: mov 0x4(%esp),%edx 0xc04712c2 <do_sys_open+55>: mov %esi,%eax 0xc04712c4 <do_sys_open+57>: call 0xc047125a <do_filp_open> 0xc04712c9 <do_sys_open+62>: mov %eax,%edi /usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/fs/open.c: 1164 0xc04712cb <do_sys_open+64>: cmp $0xfffff000,%edi 0xc04712d1 <do_sys_open+70>: pop %eax 0xc04712d2 <do_sys_open+71>: jbe 0xc04712df <do_sys_open+84> /usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/fs/open.c: 1165 0xc04712d4 <do_sys_open+73>: mov %ebp,%eax 0xc04712d6 <do_sys_open+75>: mov %edi,%ebp 0xc04712d8 <do_sys_open+77>: call 0xc0470ea5 <put_unused_fd> 0xc04712dd <do_sys_open+82>: jmp 0xc0471329 <do_sys_open+158> /usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/fs/open.c: 1168 0xc04712df <do_sys_open+84>: mov 0x8(%edi),%eax crash> where the call to do_filp_open() returned a value of 00000003, as seen in EDI, and therefore the IS_ERR() macro presumed it was a valid file pointer. do_filp_open() should only return either: (1) a valid kernel virtual address of a file structure, or (2) a negative errno value. Those two types of return values are determined by the IS_ERR() macro. #define MAX_ERRNO 4095 #define IS_ERR_VALUE(x) unlikely((x) >= (unsigned long)-MAX_ERRNO) static inline long IS_ERR(const void *ptr) { return IS_ERR_VALUE((unsigned long)ptr); } If the return value is greater-than-or-equal-to fffff001 (-4095) it's considered an error, otherwise it's accepted as legitimate pointer value. So a bogus return value of 00000003 will be accepted as a kernel virtual address. The page fault occurred when the 00000003 was subsequently used as a file pointer address, when the call to fsnotify_open() tried to dereference "f->f_dentry", which is the bad address of 0000000b, or 00000003 + the f_dentry offset of 8: long do_sys_open(int dfd, const char __user *filename, int flags, int mode) { char *tmp = getname(filename); int fd = PTR_ERR(tmp); if (!IS_ERR(tmp)) { fd = get_unused_fd(); if (fd >= 0) { struct file *f = do_filp_open(dfd, tmp, flags, mode); if (IS_ERR(f)) { put_unused_fd(fd); fd = PTR_ERR(f); } else { fsnotify_open(f->f_dentry); fd_install(fd, f); } } putname(tmp); } return fd; } So the question is: how was it possible that do_filp_open() returned something other than: (1) a negative errno value, or (2) a legitimate file pointer? Looking at the possible return values from do_filp_open(): static struct file *do_filp_open(int dfd, const char *filename, int flags, int mode) { int namei_flags, error; struct nameidata nd; namei_flags = flags; if ((namei_flags+1) & O_ACCMODE) namei_flags++; error = open_namei(dfd, filename, namei_flags, mode, &nd); if (!error) return nameidata_to_filp(&nd, flags); return ERR_PTR(error); } open_namei() can only return 0 or a negative error number. So if it did return an error, then do_filp_open() would just pass it back unchanged, and the IS_ERR() would catch it correctly. So open_namei() must have returned a 0 (ok), and nameidata_to_filp() must have been called -- and in turn must have returned the bogus 00000003: struct file *nameidata_to_filp(struct nameidata *nd, int flags) { struct file *filp; /* Pick up the filp from the open intent */ filp = nd->intent.open.file; /* Has the filesystem initialised the file for us? */ if (filp->f_dentry == NULL) filp = __dentry_open(nd->dentry, nd->mnt, flags, filp, NULL); else path_release(nd); return filp; } So the "filp" pointer above taken from nd->intent.open.file must have been an accessible kernel address because it was read OK, but it must have had a NULL f_dentry pointer. So __dentry_open() must have been called, and *it* must have returned the 00000003 back to filp. Note that the original filp is passed to __dentry_open() above, and if it were to be returned by __dentry_open(), it would get the same address back, i.e. from argument/pointer "f" below: static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt, int flags, struct file *f, int (*open)(struct inode *, struct file *)) { struct inode *inode; int error; f->f_flags = flags; f->f_mode = ((flags+1) & O_ACCMODE) | FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; inode = dentry->d_inode; if (f->f_mode & FMODE_WRITE) { error = get_write_access(inode); if (error) goto cleanup_file; } f->f_mapping = inode->i_mapping; f->f_dentry = dentry; f->f_vfsmnt = mnt; f->f_pos = 0; f->f_op = fops_get(inode->i_fop); file_move(f, &inode->i_sb->s_files); if (!open && f->f_op) open = f->f_op->open; if (open) { error = open(inode, f); if (error) goto cleanup_all; } f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC); file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping); /* NB: we're sure to have correct a_ops only after f_op->open */ if (f->f_flags & O_DIRECT) { if (!f->f_mapping->a_ops || ((!f->f_mapping->a_ops->direct_IO) && (!f->f_mapping->a_ops->get_xip_page))) { fput(f); f = ERR_PTR(-EINVAL); } } return f; cleanup_all: fops_put(f->f_op); if (f->f_mode & FMODE_WRITE) put_write_access(inode); file_kill(f); f->f_dentry = NULL; f->f_vfsmnt = NULL; cleanup_file: put_filp(f); dput(dentry); mntput(mnt); return ERR_PTR(error); } So __dentry_open() must have passed through the "cleanup_all" or "cleanup_file" labels, where ERR_PTR(error) seems to have returned the 00000003. But 00000003 cannot have come from get_write_access(), because that function can only return 0 or -ETXTBSY. So that leaves the filesystem-dependent "open" call, because the calling function nameidata_to_filp() had passed in a NULL "open" argument: if (!open && f->f_op) open = f->f_op->open; if (open) { error = open(inode, f); if (error) goto cleanup_all; } And here's where the AFS filesystem comes into play. The nameidata structure was a stack variable and it got overwritten. But we can probably deduce that the file being opened was in the afs filesystem: crash> bt PID: 22328 TASK: dbdeb000 CPU: 0 COMMAND: "sushiremote" #0 [eb62beb8] crash_kexec at c0442d4e #1 [eb62befc] die at c04064c6 #2 [eb62bf2c] do_page_fault at c0611537 #3 [eb62bf64] error_code (via page_fault) at c0405a87 EAX: 000001b6 EBX: 00008241 ECX: 000001b6 EDX: 0000007c EBP: 00000006 DS: 007b ESI: ffffff9c ES: 007b EDI: 00000003 CS: 0060 EIP: c04712df ERR: ffffffff EFLAGS: 00010207 #4 [eb62bf98] do_sys_open at c04712df #5 [eb62bfb0] sys_open at c0471361 #6 [eb62bfb8] system_call at c0404f10 EAX: 00000005 EBX: 0a148548 ECX: 00008241 EDX: 000001b6 DS: 007b ESI: 00008241 ES: 007b EDI: 00379ff4 SS: 007b ESP: bf8bc8c8 EBP: bf8bc8e8 CS: 0073 EIP: 00a2d402 ERR: 00000005 EFLAGS: 00000206 crash> The filename being opened is in the task's userspace, and was passed into the kernel in the EBX register: crash> rd -u 0a148548 50 a148548: 732f732f 69687375 756f6d2f 6f70746e /s/sushi/mountpo a148558: 73746e69 762e622f 312e6c6f 32303532 ints/b.vol.12502 a148568: 35353932 30322f33 302e3930 35312e38 29553/2009.08.15 a148578: 7366413a 632e703a 6f646e6f 77732e72 :Afs:p.condor.sw a148588: 77656e2d 3030322f 31383039 38312e35 -new/20090815.18 a148598: 39323232 6474732f 00727265 007a672e 2229/stderr..gz. a1485a8: 00230000 00000071 00000000 00000000 ..#.q........... a1485b8: 00000000 00000000 00000000 00000000 ................ a1485c8: 00000000 00000000 093f6164 0937643c ........da?.<d7. a1485d8: 0937643c 00000000 00000000 00000000 <d7............. a1485e8: 00000000 00000000 0000003c 00000000 ........<....... a1485f8: 00000000 00000000 00000000 00000000 ................ a148608: 00000000 00000000 ........ crash> I'm guessing that "/s" is perhaps linked to "/afs/cs.wisc.edu/s", given that the task has another file opened in similarly-named directory tree: crash> files PID: 22328 TASK: dbdeb000 CPU: 0 COMMAND: "sushiremote" ROOT: / CWD: /afs/cs.wisc.edu/u/s/u/sushi FD FILE DENTRY INODE TYPE PATH 0 f787b0c0 d893e2b0 de8cb6e8 SOCK socket:/[1448774] 1 f787b0c0 d893e2b0 de8cb6e8 SOCK socket:/[1448774] 2 ddbc22c0 d893e118 cd45e0e8 SOCK socket:/[1448776] 3 cb522a80 d9fe4d50 e96bbb28 SOCK socket:/[1449031] 4 f7080980 e4a14668 d8111040 REG /afs/cs.wisc.edu/s/sushi-0.3.16/mountpoints/b.vol.1250229553/2009.08.15:Afs:p.condor.sw-new/20090815.182229/data0.gz 5 d3212680 e1a48bb8 e12c8e50 PIPE 7 e9f3f200 e444d1a0 ca137370 PIPE crash> And taking the "data0.gz" file structure's file struct at f7080980, its f_op.f_open leads to afs_linux_open(): crash> struct file.f_op f7080980 f_op = 0xf95e0460, crash> sym 0xf95e0460 f95e0460 (d) afs_file_fops crash> file_operations.open f95e0460 open = 0xf95bbca7, crash> sym 0xf95bbca7 f95bbca7 (t) afs_linux_open crash> So it appears that afs_linux_open() returned the 00000003, but there's no way we can tell given that it's a third-party module. Danny, can you verify my logic above?
Thanks Dave for the excellent analysis, I totally agree with you. This is a known security issue in OpenAFS, please refer http://www.openafs.org/security/OPENAFS-SA-2009-002.txt and take the patch for openAFS Since this is an openAFS issue, not kernel bugs, close the bug for rhel.