Description of problem: When repeatedly dropping caches while a lot of file system activity is taking place the system will eventually start producing softlockup warnings and everything starts to contend for the inode_lock. This was reported on lkml (while running XFS stress tests): http://lkml.org/lkml/2008/3/18/150 This can also be reproduced, e.g. running a kernel build over NFS and dropping the page cache in a shell loop. Version-Release number of selected component (if applicable): kernel-2.6.18-89.el5.x86_64.rpm How reproducible: 100% Steps to Reproduce: 1. Generate some file system load, e.g. make -j8 on a kernel tree located on an NFS mounted file system 2. Run: # while true;do echo 1 > /proc/sys/vm/drop_caches;sleep 2;done Actual results: Within a few minutes the system will start logging softlockup warnings for various threads trying to acquire the inode_lock, e.g: BUG: soft lockup - CPU#3 stuck for 10s! [id:21024] CPU 3: Modules linked in: ib_ipoib rdma_ucm rdma_cm iw_cm ib_addr ib_ucm ib_uverbs ib_umad ib_mthca ib_cm ib_sa ib_mad ib_core ipv6 xfrm_nalgo crypto_api sd_mod sg dm_mirror dm_multipa th dm_mod video sbs backlight i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev usb_storage sata_nv floppy i2c_nforce2 libata ohci_hcd i2c_core pcspkr scsi_mod k8temp k8_edac hwmon edac_mc shpchp serio_raw nfs nfs_acl lockd fscache sunrpc e1000 Pid: 21024, comm: id Not tainted 2.6.18-89.el5 #1 RIP: 0010:[<ffffffff80064b57>] [<ffffffff80064b57>] .text.lock.spinlock+0x5/0x30 RSP: 0018:ffff8100ccf39b80 EFLAGS: 00000282 RAX: 0000000000006621 RBX: ffff810103871280 RCX: 0000000000000014 RDX: 00000000f0000008 RSI: ffff81000c54fe08 RDI: ffffffff802f40b0 RBP: 0000000000000246 R08: 0000000000000000 R09: 0000000000000000 R10: ffff8100c228a6c0 R11: ffffffff8002c525 R12: ffffffff803053c0 R13: 0000000300000000 R14: ffffffff80089214 R15: ffff8100ccf39b18 FS: 00002ae2ffbb7710(0000) GS:ffff810103967640(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00002ae2ff5310d0 CR3: 00000001fc6d0000 CR4: 00000000000006e0 Call Trace: [<ffffffff8003d716>] ifind_fast+0x1a/0x83 [<ffffffff800230fa>] iget_locked+0x59/0x149 [<ffffffff800475f5>] proc_get_inode+0x89/0x161 [<ffffffff8002767a>] proc_lookup+0xb3/0xcc [<ffffffff80055c87>] proc_root_lookup+0x12/0x30 [<ffffffff8000cbb4>] do_lookup+0xd3/0x1d4 [<ffffffff80009f73>] __link_path_walk+0xa01/0xf42 [<ffffffff8000e782>] link_path_walk+0x5c/0xe5 [<ffffffff8000c965>] do_path_lookup+0x270/0x2e8 [<ffffffff800234db>] __path_lookup_intent_open+0x56/0x97 [<ffffffff8001a954>] open_namei+0x73/0x6d5 [<ffffffff8006683f>] do_page_fault+0x4eb/0x81d [<ffffffff8002732a>] do_filp_open+0x1c/0x38 [<ffffffff80019720>] do_sys_open+0x44/0xbe [<ffffffff8005d116>] system_call+0x7e/0x83 BUG: soft lockup - CPU#0 stuck for 10s! [dpkg-logger:21022] CPU 0: Modules linked in: ib_ipoib rdma_ucm rdma_cm iw_cm ib_addr ib_ucm ib_uverbs ib_umad ib_mthca ib_cm ib_sa ib_mad ib_core ipv6 xfrm_nalgo crypto_api sd_mod sg dm_mirror dm_multipa th dm_mod video sbs backlight i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev usb_storage sata_nv floppy i2c_nforce2 libata ohci_hcd i2c_core pcspkr scsi_mod k8temp k8_edac hwmon edac_mc shpchp serio_raw nfs nfs_acl lockd fscache sunrpc e1000 Pid: 21022, comm: dpkg-logger Not tainted 2.6.18-89.el5 #1 RIP: 0010:[<ffffffff80064b57>] [<ffffffff80064b57>] .text.lock.spinlock+0x5/0x30 RSP: 0018:ffff8101faf17ec0 EFLAGS: 00000286 RAX: 0000000000008000 RBX: ffffffff8005604d RCX: 00000000c0000100 RDX: ffff8101febac040 RSI: 0000000000000018 RDI: ffffffff802f40b0 RBP: ffff8100c939f078 R08: ffff8101faf16000 R09: 0000000000000000 R10: ffff8100c939f078 R11: ffff8102fff4dc48 R12: ffff8100cefa1480 R13: ffff8100c8a56a68 R14: ffff8100c8a56a68 R15: ffff81030385d180 FS: 00002b1625da8e10(0000) GS:ffffffff8039e000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00002b1625b69760 CR3: 00000002f9324000 CR4: 00000000000006e0 Call Trace: [<ffffffff8002f34c>] generic_delete_inode+0xea/0x143 [<ffffffff8000d11c>] dput+0xf6/0x114 [<ffffffff800125f3>] __fput+0x16c/0x198 [<ffffffff800239d6>] filp_close+0x5c/0x64 [<ffffffff8001d9e5>] sys_close+0x88/0xa2 [<ffffffff8005d116>] system_call+0x7e/0x83 Expected results: Performance should suck but no softlockup warnings. Additional info: From the linked LKML thread: > Looks like everything is backed up on the inode_lock. Why? Looks > like drop_pagecache_sb() is doing something ..... suboptimal. > > static void drop_pagecache_sb(struct super_block *sb) > { > struct inode *inode; > > spin_lock(&inode_lock); > list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { > if (inode->i_state & (I_FREEING|I_WILL_FREE)) > continue; > __invalidate_mapping_pages(inode->i_mapping, 0, -1, true); > } > spin_unlock(&inode_lock); > } > > It holds the inode_lock for an amazingly long time, and calls a > function that ends up in ->release_page which can issue > transactions. > > Given that transactions can then mark an inode dirty or the > kjournald might need to mark an inode dirty while holding > transaction locks, the implementation of drop_pagecache_sb seems to > be just a little dangerous.... > > Anyone know the reason why drop_pagecache_sb() uses such a brute-force > mechanism to free up clean page cache pages?
Created attachment 304374 [details] Proposed upstream patch to drop inode_lock across __invalidate_mapping_pages
The patch in comment #1 has been added to 2.6.25-mm1 as vfs-fix-lock-inversion-in-drop_pagecache_sb.patch: From: Jan Kara <jack> Fix longstanding lock inversion in drop_pagecache_sb by dropping inode_lock before calling __invalidate_mapping_pages(). We just have to make sure inode won't go away from under us by keeping reference to it and putting the reference only after we have safely resumed the scan of the inode list. A bit tricky but not too bad... Signed-off-by: Jan Kara <jack> Cc: Fengguang Wu <wfg.edu.cn> Cc: David Chinner <dgc> Signed-off-by: Andrew Morton <akpm>
Went into mainline back in April: commit eccb95cee4f0d56faa46ef22fb94dd4a3578d3eb Author: Jan Kara <jack> Date: Tue Apr 29 00:59:37 2008 -0700 vfs: fix lock inversion in drop_pagecache_sb() Fix longstanding lock inversion in drop_pagecache_sb by dropping inode_lock before calling __invalidate_mapping_pages(). We just have to make sure inode won't go away from under us by keeping reference to it and putting the reference only after we have safely resumed the scan of the inode list. A bit tricky but not too bad... Signed-off-by: Jan Kara <jack> Cc: Fengguang Wu <wfg.edu.cn> Cc: David Chinner <dgc> Signed-off-by: Andrew Morton <akpm> Signed-off-by: Linus Torvalds <torvalds>
in kernel-2.6.18-103.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html