Red Hat Bugzilla – Bug 147456
Kernel doesn't handle memory exhaustion caused by processes in I/O block
Last modified: 2007-11-30 17:07:06 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5)
Description of problem:
We have observed processes with memory leaks (an internal
issue, which has since been fixed) driving the server into an I/O block.
The kernel tries to kill (SIGTERM) the process that has exhausted
memory, but the kill is unsuccessful because the process is waiting on I/O.
The process in this case was logging to an NFS file and was doing a
lot of message handling and interactions with an Oracle install.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
I've written a small program to exhaust the memory (malloc loop), and
that case is handled correctly. I haven't yet added I/O activity to
produce an I/O block. The primary servers where this is seen are
production development systems.
1. Exhaust the memory (RAM) with a process that has network I/O
Actual Results: Kernel loops trying to kill (SIGTERM) the process
that has exhausted the memory.
Expected Results: The process is killed and the system returns to
normal operation.
Is it possible to patch the kernel to send a SIGKILL if the
application does not respond to a SIGTERM after a period of time? This
will generally be for applications that are in an I/O block.
Hello, David. In general, processes waiting for I/O completion cannot be
interrupted by signal delivery (because locks or other critical resources
need to be held during the operation). This is appropriate/desirable
behavior, and the kernel is not looping during the I/O wait even if a
signal has been posted (i.e., queued for delivery). These types of waits
are generally very short-term.
Waiting on I/O completion from a remote machine (e.g., for an NFS request)
is a bit of a special case. In certain situations, it might be helpful to
use the "intr" and/or "soft" options for NFS mounts, but this is generally
not advised because of the potential for data loss (from interrupted NFS
operations).
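For illustration, such a mount might look like the /etc/fstab fragment below (server name and paths are hypothetical; the "intr"/"soft" options carry the data-loss risk described above):

```
# "intr" lets signals interrupt NFS waits; "soft" makes requests fail
# after the retry limit instead of blocking indefinitely.
nfsserver:/export  /mnt/logs  nfs  rw,intr,soft,timeo=100,retrans=3  0 0
```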
Applications that consume large amounts of virtual memory will likely have
their physical pages "stolen" (reallocated to other processes) if they were
to become stuck in a long-term I/O wait, so this is not a problem.
We've had scenarios where a host runs an application with a memory leak
that consequently consumes all available memory; the OS then attempts to
kill the process, but is unable to because the process is blocked on I/O.
The host and process remain in this state for hours (I've observed it lasting
at least 8 hours), and a physical power-cycle is required to regain access to
the host.