Bug 85397 - System hang with heavy memory using apps
Summary: System hang with heavy memory using apps
Alias: None
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: i386
OS: Linux
Target Milestone: ---
Assignee: Larry Woodman
QA Contact: Brian Brock
Depends On:
Reported: 2003-03-01 02:18 UTC by Venkatesh Pallipadi
Modified: 2007-11-30 22:06 UTC (History)
3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2003-04-14 13:27:11 UTC
Target Upstream Version:

Attachments (Terms of Use)
oom_patch1.patch (428 bytes, patch)
2003-03-28 18:51 UTC, Venkatesh Pallipadi
no flags Details | Diff
oom_patch2.patch (1.08 KB, patch)
2003-03-28 18:51 UTC, Venkatesh Pallipadi
no flags Details | Diff
oom_patch3.patch (600 bytes, patch)
2003-03-28 18:52 UTC, Venkatesh Pallipadi
no flags Details | Diff

System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2003:147 0 normal SHIPPED_LIVE Important: kernel security update 2003-05-29 04:00:00 UTC

Description Venkatesh Pallipadi 2003-03-01 02:18:32 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

Description of problem:

The AS 2.1 system (with the latest e.12 kernel), running gpg-encryption tests, 
hangs within a couple of hours after the test starts. The system has 4G of memory 
and 2G of swap. The test makes heavy use of memory and swap space.

The same test runs fine on an RH 8.1 based system. 

At the point of the hang, the system has zero free swap space and very little 
memory (4M) available, so low that the OS has to kill some process(es) to make 
progress. On an RH 8.0 based system, some processes do get killed along the way. I 
don't think that is actually happening with the AS 2.1 kernel.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Configure the gpg-encryption test program
2. Start the tests, with the number of threads depending on the number of 
processors in the system
3. The system will hang in 1-2 hours

Additional info:

Comment 1 Venkatesh Pallipadi 2003-03-28 18:51:21 UTC
Created attachment 90767 [details]

Comment 2 Venkatesh Pallipadi 2003-03-28 18:51:54 UTC
Created attachment 90768 [details]

Comment 3 Venkatesh Pallipadi 2003-03-28 18:52:20 UTC
Created attachment 90769 [details]

Comment 4 Venkatesh Pallipadi 2003-03-28 18:54:07 UTC
Did more analysis on this one. At the point of the hang, oom_kill has in fact 
sent a kill signal to one of the processes, but the corresponding process 
never gets to handle the signal and exit: that process is waiting on 
memory, and it never gets woken up because we are in a memory shortage condition.

The following are the issues that, when fixed, will resolve this hang:
1) wakeup_memwaiters() in kswapd should wake up all processes waiting on 
memory every time, not only on the (!VM_SHOULD_SLEEP) condition.
This prevents the situation where oom_kill sends a kill signal to a 
process and that process never gets to handle the signal because it is waiting on 
memory. 2.4.18+ kernels use this sort of mechanism in kswapd to wake up all 
waiting processes, irrespective of available memory.
Attached oom_patch1.patch does this.
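As a rough illustration of the policy change (the function and constant names below are invented for the sketch, not the actual 2.4.9 kernel symbols), the patched wakeup decision reduces to:

```c
/* Minimal sketch of the wakeup policy change described above; this is
 * illustrative C, not kernel source.  Before the patch, kswapd only
 * woke sleeping allocators once memory pressure had eased
 * (!VM_SHOULD_SLEEP), so a process that oom_kill had just signalled
 * could sleep forever, since pressure never eases in this state.
 * The patched policy wakes all waiters on every kswapd iteration,
 * as 2.4.18+ kernels do. */
#include <assert.h>

enum { UNPATCHED = 0, PATCHED = 1 };

/* Returns 1 if kswapd should wake the processes waiting on memory. */
int wakeup_memwaiters_should_wake(int patched, int vm_should_sleep)
{
    if (patched)
        return 1;                /* wake everyone, every iteration */
    return !vm_should_sleep;     /* old behaviour: only when pressure eases */
}
```

Under the unpatched policy, a sustained shortage (vm_should_sleep stuck at 1) means the signalled process is never woken and never exits, which is exactly the hang seen here.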

2) The second issue is what should happen if the process that got the first kill 
signal is sleeping. The current code keeps sending the kill signal to the 
same process, thus making no forward progress.
A better approach, as Rik suggested on lkml, is to mark processes once a kill 
signal has been sent to them, and to try killing some other process the next 
time we reach the low memory condition. The attached oom_patch2.patch was 
originally posted by Rik and rebased to AS 2.1.
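The selection policy can be sketched as follows (the struct and function names are hypothetical; the real patch works on the kernel's task list, not an array):

```c
/* Illustrative sketch of the "mark and move on" victim selection from
 * oom_patch2 (per Rik's lkml suggestion); not the kernel's code.
 * A process that has already been sent SIGKILL is marked, and the
 * next OOM pass picks a different victim instead of signalling the
 * same, possibly stuck, process again. */
#include <assert.h>

struct task {
    int oom_marked;   /* already sent a kill signal on a previous pass */
    int badness;      /* higher value = better OOM victim */
};

/* Returns the index of the chosen victim, or -1 if every task has
 * already been marked on an earlier pass. */
int select_oom_victim(struct task *tasks, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (tasks[i].oom_marked)
            continue;            /* skip: this one was already signalled */
        if (best < 0 || tasks[i].badness > tasks[best].badness)
            best = i;
    }
    if (best >= 0)
        tasks[best].oom_marked = 1;   /* remember for the next pass */
    return best;
}
```

Each low-memory pass therefore targets a fresh process, avoiding the loop where the same sleeping process is signalled forever.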

3) This looks like a real bug in page_alloc.c. In _wrapped_alloc_pages(), in a 
check to retry alloc_pages, there is a condition where we should check 
(!free_shortage()) in place of (free_shortage()). Only when there is no 
shortage do we need to keep retrying; if there is a shortage and we are 
looking for order > 0, then we should return failure rather than waiting 
indefinitely.
At the hang we also noticed that if the process getting the kill signal 
happens to be in this state, it keeps retrying indefinitely to get 
memory, doing non-interruptible sleeps in between. This again can result in a 
hang, as no process actually gets killed even after oom_kill sends a kill 
signal.
Another change that can help the low memory condition is to retry indefinitely 
only on zero-order pages, and not for page orders up to 3.
Attached oom_patch3.patch does the above changes.
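The two conditions can be contrasted in a small illustrative sketch (plain C with hypothetical function names, not the kernel source):

```c
/* Sketch contrasting the retry condition in __alloc_pages-style code.
 * With the original test (!order || free_shortage()), a higher-order
 * request keeps retrying precisely when memory is short -- the
 * indefinite, uninterruptible retry loop described above.  The
 * proposed fix retries higher-order requests only when there is NO
 * shortage, failing the request instead of looping under pressure. */
#include <assert.h>

/* Returns 1 to keep retrying the allocation, 0 to fail the request. */
int should_retry_original(unsigned order, int free_shortage)
{
    return !order || free_shortage;
}

int should_retry_proposed(unsigned order, int free_shortage)
{
    return !order || !free_shortage;
}
```

Order-0 requests retry in both versions; the difference only shows for order > 0 under a shortage, where the original loops and the proposed version returns failure.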

Comment 5 Susan Denham 2003-03-28 22:47:27 UTC
Per Larry Woodman:

Problem 1) above: fixed and in the latest kernel build for AS 2.1 x86 (sent to Intel).
Problem 2) above: under review, but the patch would require a great deal of testing.
Problem 3) above: fixed and in the latest kernel build for AS 2.1 x86 (sent to Intel).

Please test the latest AS 2.1 x86 kernel build (pointer sent to
Paul.Gutierrez@intel.com in email) and report back.

Comment 6 Tim Burke 2003-03-31 13:18:32 UTC
Adding more detail, also based on input from Larry.

Overall, we do not consider this problem to be of the highest criticality,
because it refers to an edge condition in which all of memory and swap have
been consumed. We cannot compromise the integrity of the normal operating
condition for an edge condition.

Problem 1) above: fixed and in the latest kernel build for AS 2.1 x86 (sent to Intel).
Problem 2) above: we do not agree with the proposed patch.
Problem 3) above: improved in the prior Q1 errata. If we blindly accepted this
patch, the end result would be substantially more process killing than
necessary. That is definitely too heavy-handed, as our primary goal under these
circumstances is to keep the system alive.

We feel that the majority of the problems highlighted here have been
addressed under the best tradeoff policies. There may still be cases in which
processes don't get killed during complete depletion, but that balance is deemed
acceptable.

Comment 7 Venkatesh Pallipadi 2003-03-31 17:10:57 UTC
This was part of the changes in patch 3 above. Somehow I still believe that 
this is a bug in the code. mm/page_alloc.c has the condition
if (!order || free_shortage()) {
and I feel it should be
if (!order || !free_shortage()) {

As per the existing check, we retry a page request when we are low 
on memory, and we _do_not_ retry a page request (with order > 0) when 
we have no shortage of free memory. 

Am I missing something here?

Comment 8 Venkatesh Pallipadi 2003-03-31 17:46:44 UTC
Status update with kernel-2.4.9-e.16.3:
We ran the tests on two systems that used to exhibit this problem before.
The issue still persists with the new kernel.
Dell 870 based platform:
            3/28/03 16:35 - gpgstress test started
            3/31/03 03:30 - system hung
            swap space left: 2007840K
Another 4-way system:
            3/28/03 22:23 - gpgstress test started        
            3/28/03 23:13 - system hung
            swap space left: 4K

I will try to do some analysis on these hanging systems and will update with 
the details later.

Comment 9 Venkatesh Pallipadi 2003-03-31 17:55:37 UTC
While we tend to agree that this test case is an edge condition in which all of 
memory and swap have been consumed, we don't feel comfortable with the system 
going into a hang state. Even going into a panic() with this kind of workload 
would be a much better outcome; at least then the system can gracefully report 
the error and restart. If the system goes into a hang state, there is very 
little option for the system administrator in terms of identifying the 
failure, especially with a remote system.
Another point of concern is that this can be reproduced, on some systems, 
within a couple of hours of a gpgstress test run.

Another fact observed from our tests: RH 8.0 survives this test on both of the 
platforms above for more than 72 hours.

Comment 10 Larry Woodman 2003-03-31 21:11:39 UTC
First of all, I don't think that the "if (!order || free_shortage())" test
is wrong.  By the time you get to this test, earlier in the __alloc_pages()
routine you have woken up kswapd, yield()'d, and drained any 
pages on the inactive clean list onto the free list without satisfying the 
order>0 allocation.  Looping back up and around again won't do anything for an 
order>0 allocation unless there is a free shortage.  Don't forget that order>0
allocations must come from the buddy allocator free lists; the inactive free
pages can't be used for order>0 allocations. 

I already applied the equivalent of your patch 1 in the 2.4.9-e.14.1 kernel,
although I don't instantly wake up every process waiting inside wakeup_kswapd;
I don't let any process sleep longer than 30 iterations of kswapd.  This
is less aggressive but provides the same functionality.

I have included the equivalent of patch 3 in the latest kernel, with the 
exception of the free_shortage() logic described above.  The current logic is:

    if (order == 0) goto try_again;
    if (order <= 3 && process is not being oom killed) goto try_again;
    if (order > 3 && haven't tried 3 times yet) goto try_again;
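That decision logic can be restated as a small runnable sketch (the function and parameter names are hypothetical, not the kernel's):

```c
/* Illustrative restatement of the retry policy described for the
 * e.16-era AS 2.1 kernel; this is not the actual kernel source.
 * Order-0 requests always retry; small higher-order requests retry
 * unless the caller is the current OOM-kill victim; large-order
 * requests get a bounded number of retries. */
#include <assert.h>

/* Returns 1 for "goto try_again", 0 for "give up". */
int alloc_should_retry(unsigned order, int being_oom_killed, int tries)
{
    if (order == 0)
        return 1;                      /* order-0: always retry */
    if (order <= 3 && !being_oom_killed)
        return 1;                      /* small orders: retry unless OOM victim */
    if (order > 3 && tries < 3)
        return 1;                      /* large orders: bounded retries */
    return 0;
}
```

Note how the OOM-victim check breaks the hang: a process that has been sent the kill signal stops retrying small-order allocations and can proceed to handle the signal.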

I have not included patch 2 because of its potential undesirable side
effects.  With this patch, it is possible to oom-kill many processes
that are not currently in a killable state.  Once one process is killable
and it gets killed, the system will kill all the other processes that were
previously marked to be killed.  This is very undesirable, especially for
an edge condition that you really have to try forcefully to get the system
into.

Larry Woodman

Comment 11 Venkatesh Pallipadi 2003-03-31 22:48:29 UTC
Thanks for the detailed explanation.

The kernel used for the weekend tests (e.16.3) does have a 
variant of patch 1, but I didn't see the "order <= 3 && process is not being 
oom killed" check that you mentioned above. Can we get the latest kernel, 
so that we can rerun the tests and check the progress?

As you say, patch 3 is quite aggressive. But the good thing about it is 
that it won't hang with this workload: eventually it would exhaust all killable 
processes and call panic(). As you have mentioned, though, that may not help 
much with a normal workload.

Another thing we found during our analysis here:
The problem does not necessarily come from one single process waiting 
forever in alloc_pages. That process, unfortunately, is 
also holding some filesystem locks (inode, page), and a bunch of 
other processes are in uninterruptible sleeps on those locks. This way, none of 
the processes in that group can ever get killed either.

Thanks again.

Comment 12 Venkatesh Pallipadi 2003-04-04 22:52:52 UTC
Did the test with the latest update kernel. 

We successfully completed a 72-hour test run on the Dell 870 based platform. No 
failures were seen on the other platforms either.

We can go ahead and close this bug now.
Thanks for all the support.

Comment 13 Larry Troan 2003-04-14 13:27:11 UTC
PER ABOVE COMMENT BY Venkatesh Pallipadi of Intel on 2003-04-04 17:52, CLOSING
