Bug 502266 - shrink_zone holds lru_lock too long on systems with hundreds of gigabytes of ram
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Rik van Riel
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-05-22 20:59 UTC by Casey Dahlin
Modified: 2018-10-20 00:11 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-06-30 17:46:28 UTC
Target Upstream Version:
Embargoed:



Description Casey Dahlin 2009-05-22 20:59:57 UTC
We've seen a number of cases of cores from systems with large amounts of RAM and many CPUs (250GB and 16 cores in the most recent case) where all but one CPU was spinning on the normal zone's lru_lock, the one CPU holding it was simply executing normally, and the system was killed by NMI. It appears that on systems with very large amounts of RAM this critical section takes long enough that the other CPUs waiting to enter it appear to hang.

Comment 1 Rik van Riel 2009-05-22 21:43:28 UTC
Yes, they do.

However, no single code path holds the zone lru lock for a long period of time. Rather, the problem is caused by contention from many places and frequent re-acquisition of the lock.

The RHEL 5.4 kernel has some VM fixes that make it less eager to enter the page reclaim code paths where the zone lru_lock is taken; these may alleviate the issue somewhat.

A true fix is probably impossible in a RHEL 5 update, but I am interested in knowing whether RHEL 5.4 alleviates the problem enough that it is no longer a major issue.

