Bug 502266 - shrink_zone holds lru_lock too long on systems with hundreds of gigabytes of ram
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Rik van Riel
QA Contact: Red Hat Kernel QE team
Depends On:
Reported: 2009-05-22 16:59 EDT by Casey Dahlin
Modified: 2014-06-18 04:46 EDT
CC List: 5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2009-06-30 13:46:28 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments: None
Description Casey Dahlin 2009-05-22 16:59:57 EDT
We've seen a number of cases of cores from systems with large amounts of RAM and many CPUs (250GB and 16 cores in the most recent case) in which all but one CPU was spinning on the normal zone's lru_lock, while the one CPU that held the lock was simply executing normally, and the system was killed by NMI. It seems that systems with very large amounts of RAM spend too long in this critical section, causing the other CPUs waiting to enter it to appear to hang.
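
To show the shape of the problem, here is a minimal standalone userspace sketch (illustrative only -- this is not the RHEL kernel code, and the names lru_lock and reclaimer are just chosen to mirror it). With enough threads contending on a single spinlock, each waiter can spin for a long time even though the current holder keeps the lock only briefly, which is exactly what makes the waiters look hung while the holder appears to execute normally.

/*
 * Standalone model of the reported situation: one spinlock, many
 * "CPUs".  Build with: gcc -O2 -pthread contention.c -o contention
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 16        /* threads standing in for CPUs */
#define ITERS    100000    /* lock acquisitions per thread */

static pthread_spinlock_t lru_lock;   /* stands in for zone->lru_lock */
static volatile unsigned long work;   /* stands in for reclaim work */

static void *reclaimer(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		pthread_spin_lock(&lru_lock);
		/* Short critical section: the holder "executes normally". */
		for (int j = 0; j < 100; j++)
			work++;
		pthread_spin_unlock(&lru_lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	pthread_spin_init(&lru_lock, PTHREAD_PROCESS_PRIVATE);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, reclaimer, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	printf("total work: %lu\n", work);
	return 0;
}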
Comment 1 Rik van Riel 2009-05-22 17:43:28 EDT
Yes, they do.

However, no single code path holds the zone lru lock for a long period of time. Rather, the problem is caused by contention from many places and frequent re-acquisition of the lock.
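
To illustrate the re-acquisition pattern (a simplified standalone model, not the actual shrink_zone code; SWAP_CLUSTER_MAX mirrors the kernel's usual batch size of 32): reclaim holds the lock only long enough to isolate a small batch of pages, drops it to process them, then takes it again. Freeing a few million pages therefore means tens of thousands of acquire/release cycles, and each one is a chance for the lock to bounce between CPUs.

/*
 * Simplified model of the batched locking pattern in page reclaim.
 * Each pass takes the lock, isolates at most SWAP_CLUSTER_MAX pages,
 * and drops the lock again.  No single hold is long, but the lock
 * is taken and released constantly.
 * Build with: gcc -O2 -pthread batching.c -o batching
 */
#include <pthread.h>
#include <stdio.h>

#define NR_PAGES         1000000  /* pages on the modeled LRU list */
#define SWAP_CLUSTER_MAX 32       /* per-pass batch size, as in the kernel */

static pthread_spinlock_t lru_lock;  /* models zone->lru_lock */
static long inactive = NR_PAGES;     /* models the inactive list length */
static long acquisitions;

static void shrink_list(void)
{
	long batch;

	do {
		batch = 0;
		pthread_spin_lock(&lru_lock);
		acquisitions++;
		while (inactive > 0 && batch < SWAP_CLUSTER_MAX) {
			inactive--;   /* "isolate" one page off the list */
			batch++;
		}
		pthread_spin_unlock(&lru_lock);
		/* the isolated batch would be reclaimed here, unlocked */
	} while (batch > 0);
}

int main(void)
{
	pthread_spin_init(&lru_lock, PTHREAD_PROCESS_PRIVATE);
	shrink_list();
	printf("%d pages drained in %ld lock acquisitions\n",
	       NR_PAGES, acquisitions);
	return 0;
}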

The RHEL 5.4 kernel has some VM fixes that make it less eager to enter the page reclaim code paths where the zone lru_lock is taken; these may alleviate the issue somewhat.

A true fix is probably impossible in a RHEL 5 update, but I am interested in knowing whether RHEL 5.4 alleviates the problem enough that it is no longer a major issue.
