From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)

Description of problem:
Kernel hang that we think is caused by a bug in __getblk_slow(). The test cases (TCs) hang immediately when /sbin/sfdisk tries to read the disk geometry (it reads block 0). The "sfdisk" command does not always fail, but while the test cases are running it triggers the hang immediately.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
will attach

Additional info:
The implementation of "__getblk_slow()" seems to be causing this. This function has to return the "buffer_head" structure for the given block. To do so, it tries to get the buffer_head by calling "__find_get_block()" and "grow_buffers()", as shown below:

    __getblk_slow(struct block_device *bdev, sector_t block, int size)
    {
            for (;;) {
                    struct buffer_head * bh;

                    bh = __find_get_block(bdev, block, size);
                    if (bh)
                            return bh;

                    if (!grow_buffers(bdev, block, size))
                            free_more_memory();
            }
    }

__find_get_block() calls __find_get_block_slow(). __find_get_block_slow() looks up the corresponding page and, if the page has buffers, returns the buffer whose block number matches. If the page has no buffers, it returns failure. That failure causes grow_buffers() to be called in __getblk_slow(), which eventually allocates and maps the buffer_heads. The next call to __find_get_block() then succeeds.

The problem occurs when the page already has buffers allocated but they are not mapped (the b_bdev and b_blocknr fields are not set). __find_get_block_slow() returns failure because b_blocknr does not match. grow_buffers() (or rather grow_dev_page()) does nothing, because the buffers are already allocated. As a result the thread never leaves __getblk_slow(), which causes the hang.

From browsing the code, it looks clear that buffers can be allocated for a page and yet not be mapped.

I have a temporary fix, which does the following:

1) __find_get_block_slow() fails immediately if the buffers are not mapped.
2) grow_dev_page() releases the buffer_heads if they are not mapped, then allocates new buffer_heads and maps them.
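The loop described above can be modelled in user space to see why it never terminates. This is only a sketch under stated assumptions, not kernel code: struct sim_bh, struct sim_page, SIM_UNSET and the sim_* functions are hypothetical stand-ins, a page carries at most one buffer, and the loop is capped so that the hang shows up as a return value of -1 instead of an actual hang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical marker for "b_blocknr was never set". */
#define SIM_UNSET ((unsigned long)-1)

/* Hypothetical stand-in for struct buffer_head. */
struct sim_bh {
    bool mapped;            /* models buffer_mapped(bh) */
    unsigned long blocknr;  /* models bh->b_blocknr */
};

/* Hypothetical stand-in for a page with at most one attached buffer. */
struct sim_page {
    struct sim_bh *bh;      /* NULL models !page_has_buffers(page) */
    struct sim_bh storage;  /* backing storage for the attached buffer */
};

/* Models the lookup: succeeds only when the block number matches. */
static struct sim_bh *sim_find(struct sim_page *pg, unsigned long block)
{
    if (pg->bh != NULL && pg->bh->blocknr == block)
        return pg->bh;
    return NULL;
}

/* Models the grow step as described in the report: a page that
 * already has buffers is left alone, so nothing gets mapped. */
static bool sim_grow(struct sim_page *pg, unsigned long block)
{
    if (pg->bh != NULL)
        return true;
    pg->storage.mapped = true;
    pg->storage.blocknr = block;
    pg->bh = &pg->storage;
    return true;
}

/* Capped version of the retry loop. Returns the number of
 * iterations needed, or -1 if the cap was reached. */
static int sim_getblk(struct sim_page *pg, unsigned long block, int max_iters)
{
    assert(max_iters > 0);
    for (int i = 1; i <= max_iters; i++) {
        if (sim_find(pg, block) != NULL)
            return i;
        sim_grow(pg, block);
    }
    return -1;
}
```

In this model a page with no buffers is resolved on the second iteration, while a page whose buffer is attached but unmapped (blocknr left at SIM_UNSET) reaches the cap no matter how large it is, which corresponds to the reported hang.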
*** Bug 140423 has been marked as a duplicate of this bug. ***
Fixing the summary.
Can you show where this situation actually happens in the kernel? It might well be a bug in the kernel that this happens in the first place.
The workaround we are using:

--- /tmp/buffer.c	2004-11-29 11:49:42.203838536 -0800	<< old file.
+++ ./fs/buffer.c	2004-08-04 09:29:13.000000000 -0700	<< Modified file.
@@ -508,6 +508,9 @@
 	head = page_buffers(page);
 	bh = head;
 	do {
+		if (!buffer_mapped(bh)) {
+			goto out_unlock;
+		}
 		if (bh->b_blocknr == block) {
 			ret = bh;
 			get_bh(bh);
@@ -1217,7 +1220,7 @@
 	if (page_has_buffers(page)) {
 		bh = page_buffers(page);
-		if (bh->b_size == size)
+		if ((buffer_mapped(bh)) && (bh->b_size == size))
 			return page;
 		if (!try_to_free_buffers(page))
 			goto failed;
What leaves such pages around indefinitely? AFAICS, to get a deadlock you would need that - otherwise you'll get out of that loop as soon as they get mapped. And that should happen very soon on all paths I'm seeing in case of block device pages. What other processes are involved in that deadlock and what do stack traces show? (Alt-SysRq-T)
OK... Yes, it can happen. However, the patch above breaks unmap_underlying_metadata(). If buffers are partially mapped, we will bail out upon finding an unmapped one and miss the properly mapped one we were looking for. AFAICS, Mason's patch and analysis posted on l-k are correct. Forwarded...
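The objection about partially mapped buffers can be illustrated with a small user-space model. This is a sketch only: struct ring_bh and both lookup functions are hypothetical, the ring stands in for the circular list of buffers on a page, and lookup_skip() is merely one conceivable alternative for comparison, not the patch that was forwarded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer_head on a page's circular list. */
struct ring_bh {
    bool mapped;            /* models buffer_mapped(bh) */
    unsigned long blocknr;  /* models bh->b_blocknr */
    struct ring_bh *next;   /* models the circular per-page link */
};

/* Models the workaround: give up at the first unmapped buffer. */
static struct ring_bh *lookup_bail(struct ring_bh *head, unsigned long block)
{
    struct ring_bh *bh = head;

    assert(head != NULL);
    do {
        if (!bh->mapped)
            return NULL;    /* models "goto out_unlock" */
        if (bh->blocknr == block)
            return bh;
        bh = bh->next;
    } while (bh != head);
    return NULL;
}

/* Hypothetical alternative: ignore unmapped buffers, keep scanning. */
static struct ring_bh *lookup_skip(struct ring_bh *head, unsigned long block)
{
    struct ring_bh *bh = head;

    assert(head != NULL);
    do {
        if (bh->mapped && bh->blocknr == block)
            return bh;
        bh = bh->next;
    } while (bh != head);
    return NULL;
}
```

With a two-buffer ring in which the first buffer is unmapped and the second is mapped to the wanted block, lookup_bail() returns NULL even though a valid buffer exists, whereas lookup_skip() finds it.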
Setting state to modified, as the fix has been checked in.
Patch confirmed. Closing this issue out. Please reopen if problems still occur with more recent code.