140424 – Kernel hang with __getblk_slow()

Summary: Kernel hang with __getblk_slow()

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Alexander Viro
QA Contact:
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	140423
Depends On:
Blocks:	146015

Reported:	2004-11-22 20:51 UTC by sheryl sage
Modified:	2007-11-30 22:07 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-01-14 12:05:21 UTC
Target Upstream Version:
Embargoed:

Description sheryl sage 2004-11-22 20:51:24 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)

Description of problem:
We are seeing a kernel hang that we believe is caused by a bug in
__getblk_slow(). The hang makes the TCs stall immediately, while
/sbin/sfdisk is trying to read the geometry (that is, reading block 0).
The "sfdisk" command does not always fail, but while the test cases
are running it triggers the hang immediately.

The implementation of "__getblk_slow()" seems to be the cause.
This function has to return the "buffer_head" structure for the given
block. To do so, it tries to get the buffer_head by calling
"__find_get_block()" and "grow_buffers()".

 


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
will attach

Additional info:

Comment 1 sheryl sage 2004-11-22 20:52:56 UTC

The implementation of "__getblk_slow()" seems to be causing this.
This function has to return the "buffer_head" structure for the given
block. To do so, it tries to get the buffer_head by calling
"__find_get_block()" and "grow_buffers()", as shown below:
 
__getblk_slow(struct block_device *bdev, sector_t block, int size) {
         for (;;) {
                 struct buffer_head * bh;

                 bh = __find_get_block(bdev, block, size);
                 if (bh)
                         return bh;

                 if (!grow_buffers(bdev, block, size))
                         free_more_memory();
         }
}
 
__find_get_block() calls the "__find_get_block_slow()" function.

__find_get_block_slow() looks up the corresponding page and, if the
page has buffers, returns the matching buffer. If there are no
buffers, it returns failure. That failure causes "grow_buffers()" to
be called in "__getblk_slow()", which eventually allocates and maps
the buffer_heads. The next call to "__find_get_block()" then succeeds.

The problem arises when the page already has buffers allocated, but
they are not mapped (the b_bdev and b_blocknr fields are not set).
__find_get_block_slow() returns failure because b_blocknr does not
match, and "grow_buffers()" (or grow_dev_page()) does nothing because
the buffers are already allocated. As a result, the thread never
leaves __getblk_slow(), which causes the hang.

From reading the code, it looks clear that a page can have buffers
allocated that are not yet mapped.

I have a temporary fix, and it does the following:

1) __find_get_block_slow() fails immediately if the buffers are not
   mapped.
2) grow_dev_page() releases the buffer_heads if they are not mapped,
   then allocates new buffer_heads and maps them.
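
The retry loop described above can be modeled in user space. The
following is only a sketch under stated assumptions, not kernel code:
every toy_* name and TOY_* constant is hypothetical, the structures
keep just the two properties the description relies on (whether the
page has buffers, and each buffer's block number), and the loop is
capped at max_tries so that the demonstration terminates where the
real loop would spin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NBUF  4
#define TOY_UNSET ((unsigned long)-1)   /* stands in for "b_blocknr not set" */

/* Hypothetical stand-ins for buffer_head and a page with buffers. */
struct toy_bh {
    bool mapped;
    unsigned long blocknr;
};

struct toy_page {
    bool has_buffers;
    struct toy_bh bh[TOY_NBUF];
};

/* Reset a page; with has_buffers=true the buffers exist but are unmapped. */
static void toy_page_init(struct toy_page *page, bool has_buffers)
{
    page->has_buffers = has_buffers;
    for (size_t i = 0; i < TOY_NBUF; i++) {
        page->bh[i].mapped = false;
        page->bh[i].blocknr = TOY_UNSET;
    }
}

/* Lookup as described: succeeds only if some buffer has the block number. */
static struct toy_bh *toy_find_get_block(struct toy_page *page,
                                         unsigned long block)
{
    if (!page->has_buffers)
        return NULL;
    for (size_t i = 0; i < TOY_NBUF; i++)
        if (page->bh[i].blocknr == block)
            return &page->bh[i];
    return NULL;
}

/* Grow step as described: does nothing if buffers are already allocated. */
static bool toy_grow_buffers(struct toy_page *page, unsigned long block)
{
    if (page->has_buffers)
        return true;                    /* no mapping work is done here */
    page->has_buffers = true;
    for (size_t i = 0; i < TOY_NBUF; i++) {
        page->bh[i].mapped = true;
        page->bh[i].blocknr = block - block % TOY_NBUF + i;
    }
    return true;
}

/* Capped retry loop: returns the pass that succeeded, or -1 at the cap. */
static int toy_getblk_slow(struct toy_page *page, unsigned long block,
                           int max_tries)
{
    assert(max_tries > 0);
    for (int tries = 1; tries <= max_tries; tries++) {
        if (toy_find_get_block(page, block))
            return tries;
        toy_grow_buffers(page, block);
    }
    return -1;
}
```

In this model, after toy_page_init(&page, false) a call to
toy_getblk_slow(&page, 0, 8) returns 2 (one failed lookup, one grow,
one successful lookup). After toy_page_init(&page, true) the same call
returns -1, because neither step ever changes the page.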

Comment 2 Dave Jones 2004-11-22 21:30:12 UTC

*** Bug 140423 has been marked as a duplicate of this bug. ***

Comment 3 Jay Turner 2004-11-23 09:20:27 UTC

Fixing the summary.

Comment 5 Arjan van de Ven 2004-12-07 12:35:08 UTC

Can you show where this situation actually happens in the kernel?
It might well be a bug in the kernel that this happens in the first place.

Comment 6 sheryl sage 2004-12-09 23:12:59 UTC

The workaround we are using:

--- /tmp/buffer.c       2004-11-29 11:49:42.203838536 -0800 << old file.
+++ ./fs/buffer.c       2004-08-04 09:29:13.000000000 -0700 << Modified file.
@@ -508,6 +508,9 @@
   head = page_buffers(page);
   bh = head;
   do {
+   if (!buffer_mapped(bh)) {
+     goto out_unlock;
+   }
    if (bh->b_blocknr == block) {
      ret = bh;
      get_bh(bh);
@@ -1217,7 +1220,7 @@
 
   if (page_has_buffers(page)) {
    bh = page_buffers(page);
-   if (bh->b_size == size)
+   if ((buffer_mapped(bh)) && (bh->b_size == size))
      return page;
    if (!try_to_free_buffers(page))
      goto failed;

Comment 7 Alexander Viro 2004-12-10 10:52:50 UTC

What leaves such pages around indefinitely?  AFAICS, to get a deadlock
you would need that - otherwise you'll get out of that loop as soon
as they get mapped.  And that should happen very soon on all paths
I'm seeing in case of block device pages.

What other processes are involved in that deadlock and what do
stack traces show?  (Alt-SysRq-T)

Comment 8 Alexander Viro 2004-12-10 23:22:17 UTC

OK...  Yes, it can happen.  However, the patch above breaks
unmap_underlying_metadata().  If buffers are partially mapped, we will
bail out upon finding an unmapped one and miss the properly mapped one
we were looking for.
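
The partial-mapping problem can be shown with a small user-space
sketch. It is not kernel code and makes several assumptions: toy_bh,
TOY_NBUF, lookup_bail() and lookup_skip() are hypothetical names, and
a flat array stands in for the buffers attached to a page.
lookup_bail() mimics the early exit added by the workaround;
lookup_skip() is included only for contrast and is not the patch
referred to in this bug.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NBUF 4

/* Hypothetical stand-in for a buffer_head. */
struct toy_bh {
    bool mapped;
    unsigned long blocknr;
};

/* Mimics the workaround: give up at the first unmapped buffer. */
static const struct toy_bh *lookup_bail(const struct toy_bh *bh,
                                        unsigned long block)
{
    for (size_t i = 0; i < TOY_NBUF; i++) {
        if (!bh[i].mapped)
            return NULL;            /* later mapped buffers are never seen */
        if (bh[i].blocknr == block)
            return &bh[i];
    }
    return NULL;
}

/* Contrast only: ignore unmapped buffers and keep scanning. */
static const struct toy_bh *lookup_skip(const struct toy_bh *bh,
                                        unsigned long block)
{
    for (size_t i = 0; i < TOY_NBUF; i++) {
        if (!bh[i].mapped)
            continue;
        if (bh[i].blocknr == block)
            return &bh[i];
    }
    return NULL;
}
```

With the first buffer unmapped and the second one mapped to block 5,
lookup_bail() returns NULL for block 5 while lookup_skip() returns the
second buffer.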

AFAICS, Mason's patch and analysis posted on l-k are correct.
Forwarded...

Comment 10 Tim Burke 2004-12-13 01:45:05 UTC

setting state to modified as fix has been checked in

Comment 11 Jay Turner 2005-01-14 12:05:21 UTC

Patch confirmed.  Closing this issue out.  Please reopen if problems still occur
with more recent code.