Bug 111656
Summary: | In 2.4.20-20.7 memory module, rebalance_laundry_zone() does not respect gfp_mask GFP_NOFS | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Mahesh Patil <mpatil> |
Component: | kernel | Assignee: | Arjan van de Ven <arjanv> |
Status: | CLOSED WONTFIX | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 7.3 | CC: | alexchadwick20, amatthews, holland, ltuikov, riel |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2004-01-05 19:32:03 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Mahesh Patil
2003-12-08 01:28:10 UTC
Why would this be correct? The kernel starts I/O pretty early, and waiting here is a really rare case. If you're this low on memory you surely want to wait. With which filesystem did you measure this? (Filesystem writeout needs to be very careful with allocating memory; it sounds like the fs in use may need some fixes instead.)

You mean kmalloc(.., GFP_NOFS) could eventually wait on a filesystem page? That's odd; it sounds like a bug to me. GFP_NOFS means "don't recurse by submitting IO", which is not the same as waiting for previously submitted IO to complete.

2.4.23 from kernel.org checks for GFP_NOFS before waiting on a page. The fact that it does the GFP_NOFS check on wait_on_page() before calling writepage() indicates that "previously submitted IO" is also covered by GFP_NOFS. Do we know of any other Linux kernel that does not check for GFP_NOFS before waiting?

If possible I'd like to open a dialog with Red Hat about the interpretation of the GFP_NOFS constant as allowing the caller to block on I/O. To the best of our understanding this interpretation is at odds with the rest of the Linux community and is fundamentally less robust than a no-block interpretation, because it opens the door to a deadlock that the Panasas file system trips very regularly. At its core our problem is:

- A thread T1 does an allocation which causes a write_page on page P.
- write_page initiates an async RPC to flush out that page.
- T1 blocks waiting for page P.
- A thread T2 picks up the async response to the RPC.
- T2 finds itself in a situation where it needs to allocate, so it uses GFP_NOFS.
- This allocation blocks on page P, causing a deadlock.

To the best of our knowledge Alexander Chadwick is correct in saying that the Red Hat memory subsystem is the only Linux implementation that allows a GFP_NOFS allocation to block waiting on I/O.
The same code that deadlocks under Red Hat kernels runs flawlessly on kernel.org kernels, and applying the patch described in the bug report by Mahesh Patil eliminates the problem. We have not encountered this interpretation of GFP_NOFS anywhere else.

Panasas is very open to suggestions about how to work around the deadlock described above, but simple answers like "eliminate allocations from write_page and the function that receives responses" or "make write_page fully synchronous" cannot be made workable. The former is precluded by the complex interactions necessary to write to a secure object-based file system, the latter by the extremely bad performance implications of performing I/O in such small block sizes, and by the fact that this situation arises with some regularity.

Here is a more complete description of the problem, mostly the same as above but with some additional detail:

1. A thread T1 (any thread, not just ours) calls kmalloc with __GFP_FS set and enters rebalance_laundry_zone due to low memory.
2. This causes a page P1 to move to the inactive_laundry list and a subsequent call to the DirectFlow write_page routine.
3. write_page issues some async network operations (RPCs and/or iSCSI commands) to begin flushing out the page.
4. Thread T1 returns to Linux and subsequently blocks waiting for page P1 to become unlocked.
5. A different thread T2 associated with DirectFlow receives the response to the network operation initiated by write_page.
6. Thread T2 needs to perform an allocation in order to receive and process the response.
7. Since memory is low, T2 also enters rebalance_laundry_zone.
8. Since page P1 is still on the inactive_laundry list and is still locked, thread T2 blocks waiting for it to become unlocked, even though T2 specified GFP_NOFS for the allocation.
9. Both T1 and T2 are blocked at this point, so DirectFlow cannot complete the write_page operation.
10. The machine appears to be hung at this point until the 30-second timeout has expired.
11. We observe the machine to remain hung for an arbitrary period of time (as little as a few minutes, as much as an hour), after which it will typically emerge from the hang. From this we speculate that this code path repeats until some unrelated memory is freed.

Mark Holland, Ph.D.
Software Architect for File Systems, Panasas Inc.

Please provide a URL with the (obviously GPL) source code to your code so that we can see what is happening. However, what you point out as a "deadlock" is not a deadlock; the kernel will not wait indefinitely on any page in the VM subsystem.

Agreed completely that the deadlock described above is not a permanent hang. But we observe that it does render a client machine unusable for minutes at a time.

At this time the source is not under the GPL. We believe that our file system module is not a derived work of Linux for the following reasons:

- The code was originally developed for FreeBSD and Solaris.
- It was later ported to Linux, and the Linux porting layer is relatively thin.
- It continues to run in active and supported use under FreeBSD. We run FreeBSD on all internal storage blades within our system.

We have spoken (very informally) to Linus Torvalds and Andrew Tridgell, and they both agree that we have a good case as to why our code is not derived. The same discussion has applied to AFS from Transarc Corporation in the past: the code base was not developed under Linux and is therefore not derived. That said, we do have plans to release the code under an open source license as we complete some key feature development and clean-up.

We're also working with a wide range of other companies, universities and organizations to standardize these new storage and file system protocols. Additional open source implementations of the protocols are under development at U of Mich, Lustre, Intel and others.
We discussed these standards with Stephen Tweedie from Red Hat and invited him to participate as well.