Bug 111656
Summary: | In 2.4.20-20.7 memory module, rebalance_laundry_zone() does not respect gfp_mask GFP_NOFS | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Mahesh Patil <mpatil> |
Component: | kernel | Assignee: | Arjan van de Ven <arjanv> |
Status: | CLOSED WONTFIX | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 7.3 | CC: | alexchadwick20, amatthews, holland, ltuikov, riel |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2004-01-05 19:32:03 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Mahesh Patil
2003-12-08 01:28:10 UTC
Why would this be correct? The kernel starts I/O pretty early, and waiting here is a really rare case. If you're this low on memory you surely want to wait. With which filesystem did you measure this? (Filesystem writeout needs to be very careful with allocating memory; it sounds like the fs in use may need some fixes instead.)

You mean kmalloc(.., GFP_NOFS) could eventually wait on a filesystem page? That's odd; it sounds like a bug to me. GFP_NOFS means "don't recurse by submitting IO", which is not the same as waiting for previously submitted IO to complete.

2.4.23 from kernel.org checks for GFP_NOFS before waiting on a page. The fact that it does the GFP_NOFS check on wait_on_page() before calling writepage() indicates that "previously submitted IO" is also covered by GFP_NOFS. Do we know of any other Linux kernel that does not check for GFP_NOFS before waiting?

If possible I'd like to open a dialog with Red Hat about the interpretation of the GFP_NOFS constant as allowing the caller to block on I/O. To the best of our understanding this interpretation is at odds with the rest of the Linux community and is fundamentally less robust than a no-block interpretation, because it opens the door to a deadlock that the Panasas file system trips very regularly. At its core our problem is:

- A thread T1 does an allocation which causes a write_page on page P.
- write_page initiates an async RPC to flush out that page.
- T1 blocks waiting for page P.
- A thread T2 picks up the async response to the RPC.
- T2 finds itself in a situation where it needs to allocate, so it uses GFP_NOFS.
- This allocation blocks on page P, causing a deadlock.

To the best of our knowledge Alexander Chadwick is correct in saying that the Red Hat memory subsystem is the only Linux implementation that allows a GFP_NOFS allocation to block waiting on I/O.
The same code that deadlocks under Red Hat kernels runs flawlessly on kernel.org kernels, and applying the patch described in the bug report by Mahesh Patil eliminates the problem. We have not encountered this interpretation of GFP_NOFS anywhere else.

Panasas is very open to suggestions about how to work around the deadlock described above, but simple answers like "eliminate allocations from write_page and the function that receives responses" or "make write_page fully synchronous" cannot be made workable. The former is precluded by the complex interactions necessary to write to a secure object-based file system, the latter by the extremely bad performance implications of performing I/O in such small block sizes, and by the fact that this situation arises with some regularity.

Here is a more complete description of the problem, mostly the same as above but with some additional detail:

1. A thread T1 (any thread, not just ours) calls kmalloc with __GFP_FS set and enters rebalance_laundry_zone due to low memory.
2. This causes a page P1 to move to the inactive_laundry list and a subsequent call to the DirectFlow write_page routine.
3. write_page issues some async network operations (RPCs and/or iSCSI commands) to begin flushing out the page.
4. Thread T1 returns to Linux and subsequently blocks waiting for page P1 to become unlocked.
5. A different thread T2 associated with DirectFlow receives the response to the network operation initiated by write_page.
6. Thread T2 needs to perform an allocation in order to receive and process the response.
7. Since memory is low, T2 also enters rebalance_laundry_zone.
8. Since page P1 is still on the inactive_laundry list and is still locked, thread T2 blocks waiting for it to become unlocked, even though T2 specified GFP_NOFS for the allocation.
9. Both T1 and T2 are blocked at this point, so DirectFlow cannot complete the write_page operation.
10. The machine appears to be hung at this point until the 30-second timeout has expired.
11. We observe the machine to remain hung for an arbitrary period of time (as little as a few minutes, as much as an hour), after which it will typically emerge from the hang. From this we speculate that this code path repeats until some unrelated memory is freed.

Mark Holland, Ph.D.
Software Architect for File Systems, Panasas Inc.

Please provide a URL with the (obviously GPL) source code to your code so that we can see what is happening. However, what you point out as a "deadlock" is not a deadlock; the kernel will not wait indefinitely on any page in the VM subsystem.

Agreed completely that the deadlock described above is not a permanent hang. But we observe that it does render a client machine unusable for minutes at a time.

At this time the source is not under the GPL. We believe that our file system module is not a derived work of Linux for the following reasons:

- The code was originally developed for FreeBSD and Solaris.
- It was later ported to Linux, and the Linux porting layer is relatively thin.
- It continues to run in active and supported use under FreeBSD. We run FreeBSD on all internal storage blades within our system.

We have spoken (very informally) to Linus Torvalds and Andrew Tridgell, and they both agree that we have a good case as to why our code is not derived. The same discussion has applied to AFS from Transarc Corporation in the past: the code base was not developed under Linux and is therefore not derived. That said, we do have plans to release the code under an open source license as we complete some key feature development and clean-up.

We're also working with a wide range of other companies, universities and organizations to standardize these new storage and file system protocols. Additional open source implementations of the protocols are under development at U of Mich, Lustre, Intel and others.
We discussed these standards with Stephen Tweedie from Red Hat and invited him to participate as well.