Bug 644726
| Summary: | panic in find_ge_pid() due to race between lseek() and readdir() on /proc | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Lachlan McIlroy <lmcilroy> | ||||||||
| Component: | kernel | Assignee: | Lachlan McIlroy <lmcilroy> | ||||||||
| Status: | CLOSED ERRATA | QA Contact: | Boris Ranto <branto> | ||||||||
| Severity: | high | Docs Contact: | |||||||||
| Priority: | high | ||||||||||
| Version: | 5.5 | CC: | branto, kzhang, vgaikwad | ||||||||
| Target Milestone: | rc | ||||||||||
| Target Release: | --- | ||||||||||
| Hardware: | All | ||||||||||
| OS: | Linux | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | |||||||||||
| : | 646711 (view as bug list) | Environment: | |||||||||
| Last Closed: | 2011-01-13 21:57:41 UTC | Type: | --- | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Bug Depends On: | |||||||||||
| Bug Blocks: | 646711 | ||||||||||
| Attachments: |
|
||||||||||
Created attachment 454973 [details]
Patch to avoid race condition
This patch avoids the race condition by having default_llseek() acquire the inode mutex so that it serialises with vfs_readdir(). I'm not sure if this is the best approach but it does work for this test case. It might be better to acquire the mutex in vfs_llseek().
This bug has been fixed upstream via this commit:
commit ab91261f5c43f196ec7ff1d113847b87b7606b26
Author: Arnd Bergmann <arnd>
Date: Wed Jul 7 22:55:17 2010 +0200
vfs: don't use BKL in default_llseek
There are currently 191 users of default_llseek.
Nine of these are in device drivers that use the
big kernel lock. None of these ever touch
file->f_pos outside of llseek or file_pos_write.
Consequently, we never rely on the BKL
in the default_llseek function and can
replace that with i_mutex, which is also
used in generic_file_llseek.
Signed-off-by: Arnd Bergmann <arnd>
diff --git a/fs/read_write.c b/fs/read_write.c
index 74e3658..fd09f61 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -124,7 +124,7 @@ loff_t default_llseek(struct file *file, loff_t offset, int origin)
{
loff_t retval;
- lock_kernel();
+ mutex_lock(&file->f_dentry->d_inode->i_mutex);
switch (origin) {
case SEEK_END:
offset += i_size_read(file->f_path.dentry->d_inode);
@@ -145,7 +145,7 @@ loff_t default_llseek(struct file *file, loff_t offset, int origin)
retval = offset;
}
out:
- unlock_kernel();
+ mutex_unlock(&file->f_dentry->d_inode->i_mutex);
return retval;
}
EXPORT_SYMBOL(default_llseek);
Created attachment 460473 [details]
Patch to add an llseek operation with mutex locking to procfs
This patch moves the inode mutex lock into procfs by creating a llseek file operation specific to procfs. This eliminates the risk that acquiring the inode mutex in default_llseek() may cause problems for other users of that function.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release. in kernel-2.6.18-233.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html |
Created attachment 454505 [details] Reproducer test case. Description of problem: System panics with the following when doing a readdir() on /proc: stack segment: 0000 [1] SMP last sysfs file: /devices/pci0000:00/0000:00:08.0/0000:0a:00.0/0000:0b:04.0/0000:0d:00.1/irq CPU 0 Modules linked in: mptctl mptbase ipmi_devintf ipmi_si ipmi_msghandler nfs lockd fscache nfs_acl sunrpc bonding ipv6 xfrm_nalgo crypto_api loop dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sr_mod cdrom sg e1000e(U) 8021q hpilo serio_raw pcspkr bnx2 dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ata_piix libata shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 9020, comm: Agent Tainted: G 2.6.18-194.el5 #1 RIP: 0010:[<ffffffff8009f4cf>] [<ffffffff8009f4cf>] find_ge_pid+0x3d/0x86 RSP: 0018:ffff8100bca53e78 EFLAGS: 00010207 RAX: 001ffffffffffff0 RBX: 001ffffffffffff0 RCX: 0000000000000000 RDX: 0000000000007f2c RSI: f958000000000000 RDI: 5600000000000000 RBP: 001fffff80315c10 R08: 00000000ffffff2b R09: 000000000000000a R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 R13: ffff810220020120 R14: ffffffff80025855 R15: ffff8100bca53f38 FS: 0000000045605940(0063) GS:ffffffff803cb000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00002aaaac079000 CR3: 00000000a8bcb000 CR4: 00000000000006e0 Process Agent (pid: 9020, threadinfo ffff8100bca52000, task ffff8100b6bd5040) Stack: 00000000ffffff2b 00000000fffffffe ffff81021e91a1c0 ffffffff80106799 ffff81021e91a1c0 ffffffff80048172 ffff8100bca53f38 ffffffff80025855 ffff8100bca53f38 ffff81021e91a1c0 00000000fffffffe ffff810220020068 Call Trace: [<ffffffff80106799>] next_tgid+0xa/0x41 [<ffffffff80048172>] proc_pid_readdir+0xe5/0x10b [<ffffffff80025855>] filldir+0x0/0xb7 [<ffffffff80025855>] filldir+0x0/0xb7 [<ffffffff80035556>] vfs_readdir+0x77/0xa9 [<ffffffff80038e0b>] sys_getdents+0x75/0xbd [<ffffffff8005e229>] tracesys+0x71/0xe0 [<ffffffff8005e28d>] tracesys+0xd5/0xe0 Code: 48 8b 7d 08 48 85 ff 74 25 48 63 d2 be 00 80 00 00 e8 c7 2b RIP [<ffffffff8009f4cf>] find_ge_pid+0x3d/0x86 RSP <ffff8100bca53e78> The cause of the problem is a race between lseek() (triggered by a rewinddir()) and readdir() where the lseek() resets the file pointer while proc_pid_readdir() is using it. Version-Release number of selected component (if applicable): Reported on 2.6.18-194.el5 Reproduced on 2.6.18-227.el5 How reproducible: Everytime test case is run Steps to Reproduce: Run the test program, "readdir", and the system should panic within a few seconds. Additional info: vmcore available on megatron: crash> sys KERNEL: /cores/20101018220110/work/vmlinux DUMPFILE: /cores/20101018220110/work/vmcore CPUS: 16 DATE: Mon Oct 18 09:48:42 2010 UPTIME: 10 days, 23:04:37 LOAD AVERAGE: 9.15, 5.71, 3.59 TASKS: 903 NODENAME: <hostname> RELEASE: 2.6.18-194.el5 VERSION: #1 SMP Tue Mar 16 21:52:39 EDT 2010 MACHINE: x86_64 (2400 Mhz) MEMORY: 15.8 GB PANIC: "" crash> bt PID: 9020 TASK: ffff8100b6bd5040 CPU: 0 COMMAND: "Agent" #0 [ffffffff80449de0] crash_kexec at ffffffff800aeb6b #1 [ffffffff80449ea0] __die at ffffffff80066157 #2 [ffffffff80449ee0] die at ffffffff8006cce5 #3 [ffffffff80449f10] do_stack_segment at ffffffff8006d632 #4 [ffffffff80449f50] stack_segment at ffffffff8005f221 [exception RIP: find_ge_pid+61] RIP: ffffffff8009f4cf RSP: ffff8100bca53e78 RFLAGS: 00010207 RAX: 001ffffffffffff0 RBX: 001ffffffffffff0 RCX: 0000000000000000 RDX: 0000000000007f2c RSI: f958000000000000 RDI: 5600000000000000 RBP: 001fffff80315c10 R8: 00000000ffffff2b R9: 000000000000000a R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 R13: ffff810220020120 R14: ffffffff80025855 R15: ffff8100bca53f38 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- <STACKFAULT exception stack> --- #5 [ffff8100bca53e78] find_ge_pid at ffffffff8009f4cf #6 [ffff8100bca53e90] next_tgid at ffffffff80106799 #7 [ffff8100bca53ea0] proc_pid_readdir at ffffffff80048172 #8 [ffff8100bca53ef0] vfs_readdir at ffffffff80035556 #9 [ffff8100bca53f30] sys_getdents at ffffffff80038e0b #10 [ffff8100bca53f80] tracesys at ffffffff8005e28d (via system_call) RIP: 00000037f5a9680b RSP: 0000000045604ea0 RFLAGS: 00000206 RAX: ffffffffffffffda RBX: ffffffff8005e28d RCX: ffffffffffffffff RDX: 0000000000008000 RSI: 00002aaaac037f98 RDI: 000000000000000b RBP: 0000000000000003 R8: 00002aaaac037f98 R9: 0000000000080000 R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 R13: ffffffffffffffa0 R14: 00002aaaac037f60 R15: 0000000000001000 ORIG_RAX: 000000000000004e CS: 0033 SS: 002b crash> file ffff81021e91a1c0 struct file { f_u = { fu_list = { next = 0xffff81041a8073c0, prev = 0xffff81041ffeccf8 }, fu_rcuhead = { next = 0xffff81041a8073c0, func = 0xffff81041ffeccf8 } }, f_dentry = 0xffff8102270f1300, f_vfsmnt = 0xffff810107796780, f_op = 0xffffffff80321880, f_count = { counter = 2 }, f_flags = 100352, f_mode = 13, f_pos = 44, ... f_pos is 44 but proc_pid_readdir() expects it to be >= 257. In proc_pid_readdir(); tgid = filp->f_pos - TGID_OFFSET; for (task = next_tgid(tgid); and that results in find_ge_pid() accessing an array with a negative index. So why is f_pos 44? crash> dentry 0xffff8102270f1300 struct dentry { d_count = { counter = 373 }, d_flags = 16, d_lock = { raw_lock = { slock = 1 } }, d_inode = 0xffff810220020068, ... crash> proc_inode -o struct proc_inode { [0] struct pid *pid; [8] int fd; union { int (*proc_get_link)(struct inode *, struct dentry **, struct vfsmount **); int (*proc_read)(struct task_struct *, char *); [16] } op; [24] struct proc_dir_entry *pde; [32] struct inode vfs_inode; } SIZE: 592 crash> px 0xffff810220020068-32 $3 = 0xffff810220020048 crash> proc_inode 0xffff810220020048 struct proc_inode { pid = 0x0, fd = 0, op = { proc_get_link = 0, proc_read = 0 }, pde = 0xffffffff803216e0, ... crash> proc_dir_entry 0xffffffff803216e0 struct proc_dir_entry { low_ino = 1, namelen = 5, name = 0xffffffff802b4917 "/proc", mode = 16749, nlink = 15, uid = 0, gid = 0, size = 0, proc_iops = 0xffffffff803217c0, proc_fops = 0xffffffff80321880, get_info = 0, owner = 0x0, next = 0x0, parent = 0xffffffff803216e0, subdir = 0xffff81041e6de4c0, data = 0x0, read_proc = 0, write_proc = 0, count = { counter = 1 }, deleted = 0, set = 0x0 } crash> list proc_dir_entry.next -s proc_dir_entry.name 0xffff81041e6de4c0 ffff81041e6de4c0 name = 0xffff81041e6de550 "mpt", ffff8102273ed880 name = 0xffff8102273ed910 "ipmi", ffff81041ff6e480 name = 0xffff81041ff6e510 "scsi", ffff81021fa34d80 name = 0xffff81021fa34e10 "mdstat", ffff81041f8f8ec0 name = 0xffff81041f8f8f50 "ide", ffff81041fb3c080 name = 0xffff81041fb3c110 "crypto", ffff81021f022180 name = 0xffff81021f022210 "key-users", ffff81041fb3c180 name = 0xffff81041fb3c210 "keys", ffff81021f7216c0 name = 0xffff81021f721750 "swaps", ffff81041f58c280 name = 0xffff81041f58c310 "kallsyms", ffff81041f58c380 name = 0xffff81041f58c410 "dma", ffff81021fe048c0 name = 0xffff81021fe04950 "iomem", ffff81041f58c580 name = 0xffff81041f58c610 "ioports", ffff81021f413180 name = 0xffff81021f413210 "misc", ffff81041f9c0c80 name = 0xffff81041f9c0d10 "acpi", ffff81041f9c0880 name = 0xffff81041f9c0910 "fb", ffff81021f80c9c0 name = 0xffff81021f80ca50 "mtrr", ffff81021f80a0c0 name = 0xffff81021f80a150 "irq", ffff8102200235c0 name = 0xffff810220023650 "bus", ffff81010779b5c0 name = 0xffff81010779b650 "tty", ffff81010779b6c0 name = 0xffff81010779b750 "driver", ffff8102200231c0 name = 0xffff810220023250 "fs", ffff81010779b8c0 name = 0xffff81010779b950 "sys", ffff810220022e80 name = 0xffff810220022f10 "sysvipc", ffff810220022d80 name = 0xffff810220022e10 "net", ffff81010779bac0 name = 0xffff81010779bb50 "sysrq-trigger", ffff810220022c80 name = 0xffff810220022d10 "vmcore", ffff81010779bbc0 name = 0xffff81010779bc50 "kcore", ffff810220022b80 name = 0xffff810220022c10 "schedstat", ffff81010779bcc0 name = 0xffff81010779bd50 "modules", ffff810220022a80 name = 0xffff810220022b10 "diskstats", ffff81010779bdc0 name = 0xffff81010779be50 "zoneinfo", ffff810220022980 name = 0xffff810220022a10 "vmstat", ffff81010779bec0 name = 0xffff81010779bf50 "buddyinfo", ffff810220022880 name = 0xffff810220022910 "slabinfo", ffff810107769480 name = 0xffff810107769510 "interrupts", ffff810220022780 name = 0xffff810220022810 "stat", ffff810107769580 name = 0xffff810107769610 "partitions", ffff810220022680 name = 0xffff810220022710 "cpuinfo", ffff810107769680 name = 0xffff810107769710 "devices", ffff810220022580 name = 0xffff810220022610 "kmsg", ffff810220022480 name = 0xffff810220022510 "mounts", ffff810107769780 name = 0xffff810107769810 "execdomains", ffff810220022380 name = 0xffff810220022410 "locks", ffff810107769880 name = 0xffff810107769910 "cmdline", ffff810220022280 name = 0xffff810220022310 "filesystems", ffff810107769980 name = 0xffff810107769a10 "version", ffff810220022180 name = 0xffff810220022210 "meminfo", ffff810220022080 name = 0xffff810220022110 "uptime", ffff810107769a80 name = 0xffff810107769b10 "loadavg", crash> list proc_dir_entry.next 0xffff81041e6de4c0 | wc 50 50 850 50 entries but f_pos is only 44. It appears it didn't scan all the entries. Maybe the file cursor was reset while it was scanning. But proc_root_readdir() should still have set it to 256. The stack trace does not indicate that proc_root_readdir() was called - it should be between these two functions: #7 [ffff8100bca53ea0] proc_pid_readdir at ffffffff80048172 #8 [ffff8100bca53ef0] vfs_readdir at ffffffff80035556 crash> file ffff81021e91a1c0 struct file { ... f_op = 0xffffffff80321880, ... crash> sym 0xffffffff80321880 ffffffff80321880 (d) proc_root_operations crash> file_operations 0xffffffff80321880 struct file_operations { owner = 0x0, llseek = 0, read = 0xffffffff800f2dc2 <generic_read_dir>, aio_read = 0, write = 0, aio_write = 0, readdir = 0xffffffff8004df01 <proc_root_readdir>, ... So vfs_readdir should have called proc_root_readdir(). proc_root_readdir() acquires the BKL while scanning the initial entries (but releases it before calling proc_pid_readdir()) and vfs_readdir() acquires the inode mutex around the call to proc_root_readdir(). Will these locks prevent a race against a rewinddir() (lseek) operation? Since the llseek operation is NULL in the file operations structure above, it will use default_llseek(). That acquires the BKL but not the inode mutex. This creates a window between when proc_root_readdir() drops the BKL and when proc_pid_readdir() uses f_pos. The two obvious options to fixing this are: 1. Move the BKL in proc_root_readdir() to cover the call to proc_pid_readdir() 2. Add inode mutex locking in default_llseek() or vfs_llseek()