Description of problem:
Eliminate a hang when using /proc/sys/vm/drop_caches under heavy load on a large system.

Version-Release number of selected component (if applicable):
RHEL4-U6

How reproducible:
Frequent, but requires a large system (~64GB) with multiple CPUs (~8) running several file system exercisers (more than the CPU count).

Steps to Reproduce:
1. Start several file system exercisers that create and/or read files large enough to exhaust memory in the pagecache.
2. Run "echo 3 > /proc/sys/vm/drop_caches" repeatedly until the system hangs.
3. Capture AltSysrq-W and verify that all CPUs are stuck on the inode_lock.

Actual results:
System hang.

Expected results:
Pagecache memory is freed without the system hanging.

Additional info:
Back in RHEL4-U6 we backported the /proc/sys/vm/drop_caches functionality from upstream to RHEL4. Recently I encountered a hang in this code while creating 256GB files on a 64GB 4-core system and dropping the pagecache at the same time. The cause of the hang is that invalidate_list() calls invalidate_inode_pages(), which calls invalidate_mapping_pages() with the inode_lock held. Since invalidate_mapping_pages() calls cond_resched(), every CPU can end up trying to acquire the inode_lock if the time quantum of the process writing to /proc/sys/vm/drop_caches expires while it holds the lock. So far I have only been able to reproduce this problem when writing multiple huge files on every CPU and running "echo 3 > /proc/sys/vm/drop_caches" from a shell, but it could happen at random.

The attached patch fixes this problem by creating and calling a new function, invalidate_all_mapping_pages(), which does not reschedule. I could not backport the upstream solution to RHEL4 because invalidate_mapping_pages() is exported and the fix would break the kABI, but the fix is basically the same logic that is upstream. The original BZ is 205722.
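For anyone trying to reproduce this, the steps above could be scripted roughly as follows. This is only a sketch, not the exerciser used for this report: dd stands in for the file system exercisers, and WORKDIR, WRITERS, FILE_MB, DRY_RUN and the two function names are all hypothetical. It defaults to a dry run that only prints what it would do.

```shell
#!/bin/sh
# Sketch of the reproduction steps. Every name here (WORKDIR, WRITERS,
# FILE_MB, DRY_RUN, start_writers, drop_caches) is an assumption made for
# illustration; none of them comes from the original report.

WORKDIR=${WORKDIR:-/var/tmp/drop_caches_repro}  # scratch directory (assumed)
WRITERS=${WRITERS:-16}     # should exceed the CPU count
FILE_MB=${FILE_MB:-65536}  # per-file size in MB; the total should exceed RAM
DRY_RUN=${DRY_RUN:-1}      # 1 = only print the commands, 0 = really run them

# Step 1: start the writers; each one fills the pagecache with a large file.
# dd is used here as a stand-in for a real file system exerciser.
start_writers() {
    i=0
    while [ "$i" -lt "$WRITERS" ]; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "would run: dd if=/dev/zero of=$WORKDIR/file.$i bs=1M count=$FILE_MB &"
        else
            dd if=/dev/zero of="$WORKDIR/file.$i" bs=1M count="$FILE_MB" &
        fi
        i=$((i + 1))
    done
}

# Step 2: ask the kernel to drop the pagecache plus dentries and inodes.
# Needs root when actually executed.
drop_caches() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: echo 3 > /proc/sys/vm/drop_caches"
    else
        echo 3 > /proc/sys/vm/drop_caches
    fi
}
```

To run it for real on a disposable test machine, set DRY_RUN=0, create $WORKDIR, call start_writers once, and then call drop_caches in a loop until the system hangs. Step 3 (AltSysrq-W) still has to be done from the console.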
Created attachment 307376
Patch that fixes this problem.
How big is the impact on the customer base without the patch?
Committed in 73.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0665.html