Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
For bugs related to Red Hat Enterprise Linux 5 product line. The current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745 to report new issues.

Bug 644726

Summary: panic in find_ge_pid() due to race between lseek() and readdir() on /proc
Product: Red Hat Enterprise Linux 5 Reporter: Lachlan McIlroy <lmcilroy>
Component: kernelAssignee: Lachlan McIlroy <lmcilroy>
Status: CLOSED ERRATA QA Contact: Boris Ranto <branto>
Severity: high Docs Contact:
Priority: high    
Version: 5.5CC: branto, kzhang, vgaikwad
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 646711 (view as bug list) Environment:
Last Closed: 2011-01-13 21:57:41 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 646711    
Attachments:
Description Flags
Reproducer test case.
none
Patch to avoid race condition
none
Patch to add an llseek operation with mutex locking to procfs none

Description Lachlan McIlroy 2010-10-20 04:28:02 UTC
Created attachment 454505 [details]
Reproducer test case.

Description of problem:
System panics with the following when doing a readdir() on /proc:

stack segment: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:08.0/0000:0a:00.0/0000:0b:04.0/0000:0d:00.1/irq
CPU 0 
Modules linked in: mptctl mptbase ipmi_devintf ipmi_si ipmi_msghandler nfs lockd fscache nfs_acl sunrpc bonding ipv6 xfrm_nalgo crypto_api loop dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sr_mod cdrom sg e1000e(U) 8021q hpilo serio_raw pcspkr bnx2 dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ata_piix libata shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 9020, comm: Agent Tainted: G      2.6.18-194.el5 #1
RIP: 0010:[<ffffffff8009f4cf>]  [<ffffffff8009f4cf>] find_ge_pid+0x3d/0x86
RSP: 0018:ffff8100bca53e78  EFLAGS: 00010207
RAX: 001ffffffffffff0 RBX: 001ffffffffffff0 RCX: 0000000000000000
RDX: 0000000000007f2c RSI: f958000000000000 RDI: 5600000000000000
RBP: 001fffff80315c10 R08: 00000000ffffff2b R09: 000000000000000a
R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
R13: ffff810220020120 R14: ffffffff80025855 R15: ffff8100bca53f38
FS:  0000000045605940(0063) GS:ffffffff803cb000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002aaaac079000 CR3: 00000000a8bcb000 CR4: 00000000000006e0
Process Agent (pid: 9020, threadinfo ffff8100bca52000, task ffff8100b6bd5040)
Stack:  00000000ffffff2b 00000000fffffffe ffff81021e91a1c0 ffffffff80106799
 ffff81021e91a1c0 ffffffff80048172 ffff8100bca53f38 ffffffff80025855
 ffff8100bca53f38 ffff81021e91a1c0 00000000fffffffe ffff810220020068
Call Trace:
 [<ffffffff80106799>] next_tgid+0xa/0x41
 [<ffffffff80048172>] proc_pid_readdir+0xe5/0x10b
 [<ffffffff80025855>] filldir+0x0/0xb7
 [<ffffffff80025855>] filldir+0x0/0xb7
 [<ffffffff80035556>] vfs_readdir+0x77/0xa9
 [<ffffffff80038e0b>] sys_getdents+0x75/0xbd
 [<ffffffff8005e229>] tracesys+0x71/0xe0
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0


Code: 48 8b 7d 08 48 85 ff 74 25 48 63 d2 be 00 80 00 00 e8 c7 2b 
RIP  [<ffffffff8009f4cf>] find_ge_pid+0x3d/0x86
 RSP <ffff8100bca53e78>

The cause of the problem is a race between lseek() (triggered by a rewinddir()) and readdir() where the lseek() resets the file pointer while proc_pid_readdir() is using it.

Version-Release number of selected component (if applicable):
Reported on 2.6.18-194.el5
Reproduced on 2.6.18-227.el5

How reproducible:
Everytime test case is run

Steps to Reproduce:
Run the test program, "readdir", and the system should panic within a few seconds.
  
Additional info:
vmcore available on megatron:

crash> sys
      KERNEL: /cores/20101018220110/work/vmlinux
    DUMPFILE: /cores/20101018220110/work/vmcore
        CPUS: 16
        DATE: Mon Oct 18 09:48:42 2010
      UPTIME: 10 days, 23:04:37
LOAD AVERAGE: 9.15, 5.71, 3.59
       TASKS: 903
    NODENAME: <hostname>
     RELEASE: 2.6.18-194.el5
     VERSION: #1 SMP Tue Mar 16 21:52:39 EDT 2010
     MACHINE: x86_64  (2400 Mhz)
      MEMORY: 15.8 GB
       PANIC: ""

crash> bt
PID: 9020   TASK: ffff8100b6bd5040  CPU: 0   COMMAND: "Agent"
 #0 [ffffffff80449de0] crash_kexec at ffffffff800aeb6b
 #1 [ffffffff80449ea0] __die at ffffffff80066157
 #2 [ffffffff80449ee0] die at ffffffff8006cce5
 #3 [ffffffff80449f10] do_stack_segment at ffffffff8006d632
 #4 [ffffffff80449f50] stack_segment at ffffffff8005f221
    [exception RIP: find_ge_pid+61]
    RIP: ffffffff8009f4cf  RSP: ffff8100bca53e78  RFLAGS: 00010207
    RAX: 001ffffffffffff0  RBX: 001ffffffffffff0  RCX: 0000000000000000
    RDX: 0000000000007f2c  RSI: f958000000000000  RDI: 5600000000000000
    RBP: 001fffff80315c10   R8: 00000000ffffff2b   R9: 000000000000000a
    R10: 0000000000000000  R11: 0000000000000206  R12: 0000000000000000
    R13: ffff810220020120  R14: ffffffff80025855  R15: ffff8100bca53f38
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <STACKFAULT exception stack> ---
 #5 [ffff8100bca53e78] find_ge_pid at ffffffff8009f4cf
 #6 [ffff8100bca53e90] next_tgid at ffffffff80106799
 #7 [ffff8100bca53ea0] proc_pid_readdir at ffffffff80048172
 #8 [ffff8100bca53ef0] vfs_readdir at ffffffff80035556
 #9 [ffff8100bca53f30] sys_getdents at ffffffff80038e0b
#10 [ffff8100bca53f80] tracesys at ffffffff8005e28d (via system_call)
    RIP: 00000037f5a9680b  RSP: 0000000045604ea0  RFLAGS: 00000206
    RAX: ffffffffffffffda  RBX: ffffffff8005e28d  RCX: ffffffffffffffff
    RDX: 0000000000008000  RSI: 00002aaaac037f98  RDI: 000000000000000b
    RBP: 0000000000000003   R8: 00002aaaac037f98   R9: 0000000000080000
    R10: 0000000000000000  R11: 0000000000000206  R12: 0000000000000000
    R13: ffffffffffffffa0  R14: 00002aaaac037f60  R15: 0000000000001000
    ORIG_RAX: 000000000000004e  CS: 0033  SS: 002b

crash> file ffff81021e91a1c0
struct file {
  f_u = {
    fu_list = {
      next = 0xffff81041a8073c0, 
      prev = 0xffff81041ffeccf8
    }, 
    fu_rcuhead = {
      next = 0xffff81041a8073c0, 
      func = 0xffff81041ffeccf8
    }
  }, 
  f_dentry = 0xffff8102270f1300, 
  f_vfsmnt = 0xffff810107796780, 
  f_op = 0xffffffff80321880, 
  f_count = {
    counter = 2
  }, 
  f_flags = 100352, 
  f_mode = 13, 
  f_pos = 44, 
...

f_pos is 44 but proc_pid_readdir() expects it to be >= 257.

In proc_pid_readdir();

tgid = filp->f_pos - TGID_OFFSET;
	for (task = next_tgid(tgid);

and that results in find_ge_pid() accessing an array with a negative index.

So why is f_pos 44?

crash> dentry 0xffff8102270f1300
struct dentry {
  d_count = {
    counter = 373
  }, 
  d_flags = 16, 
  d_lock = {
    raw_lock = {
      slock = 1
    }
  }, 
  d_inode = 0xffff810220020068, 
...

crash> proc_inode -o
struct proc_inode {
    [0] struct pid *pid;
    [8] int fd;
        union {
            int (*proc_get_link)(struct inode *, struct dentry **, struct vfsmount **);
            int (*proc_read)(struct task_struct *, char *);
   [16] } op;
   [24] struct proc_dir_entry *pde;
   [32] struct inode vfs_inode;
}
SIZE: 592

crash> px 0xffff810220020068-32
$3 = 0xffff810220020048

crash> proc_inode 0xffff810220020048
struct proc_inode {
  pid = 0x0, 
  fd = 0, 
  op = {
    proc_get_link = 0, 
    proc_read = 0
  }, 
  pde = 0xffffffff803216e0, 
...

crash> proc_dir_entry 0xffffffff803216e0
struct proc_dir_entry {
  low_ino = 1, 
  namelen = 5, 
  name = 0xffffffff802b4917 "/proc", 
  mode = 16749, 
  nlink = 15, 
  uid = 0, 
  gid = 0, 
  size = 0, 
  proc_iops = 0xffffffff803217c0, 
  proc_fops = 0xffffffff80321880, 
  get_info = 0, 
  owner = 0x0, 
  next = 0x0, 
  parent = 0xffffffff803216e0, 
  subdir = 0xffff81041e6de4c0, 
  data = 0x0, 
  read_proc = 0, 
  write_proc = 0, 
  count = {
    counter = 1
  }, 
  deleted = 0, 
  set = 0x0
}

crash> list proc_dir_entry.next -s proc_dir_entry.name 0xffff81041e6de4c0
ffff81041e6de4c0
  name = 0xffff81041e6de550 "mpt", 
ffff8102273ed880
  name = 0xffff8102273ed910 "ipmi", 
ffff81041ff6e480
  name = 0xffff81041ff6e510 "scsi", 
ffff81021fa34d80
  name = 0xffff81021fa34e10 "mdstat", 
ffff81041f8f8ec0
  name = 0xffff81041f8f8f50 "ide", 
ffff81041fb3c080
  name = 0xffff81041fb3c110 "crypto", 
ffff81021f022180
  name = 0xffff81021f022210 "key-users", 
ffff81041fb3c180
  name = 0xffff81041fb3c210 "keys", 
ffff81021f7216c0
  name = 0xffff81021f721750 "swaps", 
ffff81041f58c280
  name = 0xffff81041f58c310 "kallsyms", 
ffff81041f58c380
  name = 0xffff81041f58c410 "dma", 
ffff81021fe048c0
  name = 0xffff81021fe04950 "iomem", 
ffff81041f58c580
  name = 0xffff81041f58c610 "ioports", 
ffff81021f413180
  name = 0xffff81021f413210 "misc", 
ffff81041f9c0c80
  name = 0xffff81041f9c0d10 "acpi", 
ffff81041f9c0880
  name = 0xffff81041f9c0910 "fb", 
ffff81021f80c9c0
  name = 0xffff81021f80ca50 "mtrr", 
ffff81021f80a0c0
  name = 0xffff81021f80a150 "irq", 
ffff8102200235c0
  name = 0xffff810220023650 "bus", 
ffff81010779b5c0
  name = 0xffff81010779b650 "tty", 
ffff81010779b6c0
  name = 0xffff81010779b750 "driver", 
ffff8102200231c0
  name = 0xffff810220023250 "fs", 
ffff81010779b8c0
  name = 0xffff81010779b950 "sys", 
ffff810220022e80
  name = 0xffff810220022f10 "sysvipc", 
ffff810220022d80
  name = 0xffff810220022e10 "net", 
ffff81010779bac0
  name = 0xffff81010779bb50 "sysrq-trigger", 
ffff810220022c80
  name = 0xffff810220022d10 "vmcore", 
ffff81010779bbc0
  name = 0xffff81010779bc50 "kcore", 
ffff810220022b80
  name = 0xffff810220022c10 "schedstat", 
ffff81010779bcc0
  name = 0xffff81010779bd50 "modules", 
ffff810220022a80
  name = 0xffff810220022b10 "diskstats", 
ffff81010779bdc0
  name = 0xffff81010779be50 "zoneinfo", 
ffff810220022980
  name = 0xffff810220022a10 "vmstat", 
ffff81010779bec0
  name = 0xffff81010779bf50 "buddyinfo", 
ffff810220022880
  name = 0xffff810220022910 "slabinfo", 
ffff810107769480
  name = 0xffff810107769510 "interrupts", 
ffff810220022780
  name = 0xffff810220022810 "stat", 
ffff810107769580
  name = 0xffff810107769610 "partitions", 
ffff810220022680
  name = 0xffff810220022710 "cpuinfo", 
ffff810107769680
  name = 0xffff810107769710 "devices", 
ffff810220022580
  name = 0xffff810220022610 "kmsg", 
ffff810220022480
  name = 0xffff810220022510 "mounts", 
ffff810107769780
  name = 0xffff810107769810 "execdomains", 
ffff810220022380
  name = 0xffff810220022410 "locks", 
ffff810107769880
  name = 0xffff810107769910 "cmdline", 
ffff810220022280
  name = 0xffff810220022310 "filesystems", 
ffff810107769980
  name = 0xffff810107769a10 "version", 
ffff810220022180
  name = 0xffff810220022210 "meminfo", 
ffff810220022080
  name = 0xffff810220022110 "uptime", 
ffff810107769a80
  name = 0xffff810107769b10 "loadavg", 

crash> list proc_dir_entry.next 0xffff81041e6de4c0 | wc
     50      50     850

50 entries but f_pos is only 44.  It appears it didn't scan all the entries.  Maybe the file cursor was reset while it was scanning.  But proc_root_readdir() should still have set it to 256.

The stack trace does not indicate that proc_root_readdir() was called - it should be between these two functions:

 #7 [ffff8100bca53ea0] proc_pid_readdir at ffffffff80048172
 #8 [ffff8100bca53ef0] vfs_readdir at ffffffff80035556

crash> file ffff81021e91a1c0
struct file {
...
  f_op = 0xffffffff80321880, 
...

crash> sym 0xffffffff80321880
ffffffff80321880 (d) proc_root_operations  

crash> file_operations 0xffffffff80321880
struct file_operations {
  owner = 0x0, 
  llseek = 0, 
  read = 0xffffffff800f2dc2 <generic_read_dir>, 
  aio_read = 0, 
  write = 0, 
  aio_write = 0, 
  readdir = 0xffffffff8004df01 <proc_root_readdir>, 
...

So vfs_readdir should have called proc_root_readdir().

proc_root_readdir() acquires the BKL while scanning the initial entries (but releases it before calling proc_pid_readdir()) and vfs_readdir() acquires the inode mutex around the call to proc_root_readdir().  Will these locks prevent a race against a rewinddir() (lseek) operation?

Since the llseek operation is NULL in the file operations structure above, it will use default_llseek().  That acquires the BKL but not the inode mutex.  This creates a window between when proc_root_readdir() drops the BKL and when proc_pid_readdir() uses f_pos.

The two obvious options to fixing this are:
1. Move the BKL in proc_root_readdir() to cover the call to proc_pid_readdir()
2. Add inode mutex locking in default_llseek() or vfs_llseek()

Comment 2 Lachlan McIlroy 2010-10-22 02:16:35 UTC
Created attachment 454973 [details]
Patch to avoid race condition

This patch avoids the race condition by having default_llseek() acquire the inode mutex so that it serialises with vfs_readdir().  I'm not sure if this is the best approach but it does work for this test case.  It might be better to acquire the mutex in vfs_llseek().

Comment 3 Lachlan McIlroy 2010-11-09 06:49:05 UTC
This bug has been fixed upstream via this commit:

commit ab91261f5c43f196ec7ff1d113847b87b7606b26
Author: Arnd Bergmann <arnd>
Date:   Wed Jul 7 22:55:17 2010 +0200

    vfs: don't use BKL in default_llseek
    
    There are currently 191 users of default_llseek.
    Nine of these are in device drivers that use the
    big kernel lock. None of these ever touch
    file->f_pos outside of llseek or file_pos_write.
    
    Consequently, we never rely on the BKL
    in the default_llseek function and can
    replace that with i_mutex, which is also
    used in generic_file_llseek.
    
    Signed-off-by: Arnd Bergmann <arnd>

diff --git a/fs/read_write.c b/fs/read_write.c
index 74e3658..fd09f61 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -124,7 +124,7 @@ loff_t default_llseek(struct file *file, loff_t offset, int origin)
 {
        loff_t retval;
 
-       lock_kernel();
+       mutex_lock(&file->f_dentry->d_inode->i_mutex);
        switch (origin) {
                case SEEK_END:
                        offset += i_size_read(file->f_path.dentry->d_inode);
@@ -145,7 +145,7 @@ loff_t default_llseek(struct file *file, loff_t offset, int origin)
                retval = offset;
        }
 out:
-       unlock_kernel();
+       mutex_unlock(&file->f_dentry->d_inode->i_mutex);
        return retval;
 }
 EXPORT_SYMBOL(default_llseek);

Comment 5 Lachlan McIlroy 2010-11-15 05:27:54 UTC
Created attachment 460473 [details]
Patch to add an llseek operation with mutex locking to procfs

This patch moves the inode mutex lock into procfs by creating a llseek file operation specific to procfs.  This eliminates the risk that acquiring the inode mutex in default_llseek() may cause problems for other users of that function.

Comment 8 RHEL Program Management 2010-11-23 06:19:46 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Jarod Wilson 2010-11-23 17:05:27 UTC
in kernel-2.6.18-233.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 13 errata-xmlrpc 2011-01-13 21:57:41 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html