We've seen a number of core dumps from systems with large amounts of RAM and many CPUs (250GB and 16 cores in the most recent case) in which all but one CPU were waiting on the normal zone lru lock. The one CPU that held the lock was simply executing normally, and the system was then killed by an NMI. It seems that systems with very large amounts of RAM spend too long in this critical section, so the other CPUs waiting to enter it appear to hang.
Yes, they do. However, no single code path holds the zone lru lock for a long period of time. Rather, the problem is caused by contention from many places and frequent re-acquisition of the lock. The RHEL 5.4 kernel has some fixes to the VM that make it less eager to go into the page reclaim code paths where the zone lru lock is grabbed, which may alleviate the issue somewhat. A true fix is probably impossible in a RHEL 5 update, but I am interested in knowing whether RHEL 5.4 alleviates the problem enough that it is no longer a major issue.
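
To illustrate the difference between one long hold and many short ones, here is a rough back-of-envelope sketch, not a measurement from any real system. It estimates how much of each second a single lock would need to be held if every CPU takes it briefly but often. The function name `lock_utilization` and all the numbers are hypothetical, chosen only to show the shape of the problem.

```python
def lock_utilization(cpus, acquisitions_per_sec, hold_time_us):
    """Return the fraction of each second the lock would need to be held.

    Every argument is a hypothetical input, not a measured value.
    A result at or above 1.0 means demand exceeds what one lock can serve,
    so waiters queue up even though each individual hold is short.
    """
    total_hold_us = cpus * acquisitions_per_sec * hold_time_us
    return total_hold_us / 1_000_000


# Illustrative numbers only: 16 CPUs, 5 microseconds per hold.
moderate = lock_utilization(cpus=16, acquisitions_per_sec=10_000, hold_time_us=5)
heavy = lock_utilization(cpus=16, acquisitions_per_sec=20_000, hold_time_us=5)

print(f"moderate reclaim pressure: {moderate:.2f}")
print(f"heavy reclaim pressure:    {heavy:.2f}")
```

In this simplified model, halving how often each CPU enters the path that takes the lock halves the demand on it, which is the kind of relief that making the VM less eager to enter page reclaim would be expected to give. It does not shorten any individual hold.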