From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.7) Gecko/20020104

Description of problem:
Back in June 2001 there were some threads on linux-kernel about adding a BH_Async flag to the kernel. The code went into 2.4.10, and I got Alan to pick it up in 2.4.10-ac4. It needs to be backported to your 2.4.9 kernels.

The problem is that in 2.4.9 and earlier, end_buffer_io_async() (fs/buffer.c) decides whether a page is still in use partly based on whether the b_end_io function of the page's buffer heads (or only the first bh in the page?) is end_buffer_io_async(). This assumption does not hold for various md-type drivers, which modify the bh's b_end_io function. After a period of async IO, the BUG() in UnlockPage is triggered because page usage counts have become corrupted.

This patch:
http://mirror.csit.fsu.edu/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.6pre5aa1/00_bh-async-1
fixes this problem.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Load an md driver that, for the buffer heads it processes, nests the system b_end_io function pointer to end_buffer_io_async() behind its own end IO handler.
2. Run an IO load that triggers page (async) IO.
3. Given enough time, the kernel will eventually corrupt itself; in my testing it was always caught in end_buffer_io_async()'s call to UnlockPage().

Actual Results:
BUG() in UnlockPage(), as called from end_buffer_io_async(), triggers and the kernel panics.

Expected Results:
IO should succeed and the kernel should not panic.

Additional info:
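For readers who want the failure mode in concrete terms, here is a rough user-space model of the check described above. It is only a sketch, not the kernel code: struct buf, async_end_io(), md_end_io() and both page_busy_* helpers are hypothetical stand-ins for struct buffer_head, end_buffer_io_async(), a stacking driver's handler, and the sibling scan done on completion. It assumes the pre-fix scan identifies async buffers by comparing b_end_io, and that the BH_Async approach tests a per-buffer flag instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct buf;
typedef void (*end_io_fn)(struct buf *);

/* Hypothetical, heavily reduced stand-in for struct buffer_head. */
struct buf {
    struct buf *next_in_page; /* circular ring, models b_this_page */
    end_io_fn   end_io;       /* models b_end_io */
    bool        locked;       /* models BH_Lock: IO still in flight */
    bool        async;        /* models the BH_Async flag */
};

/* Models the system async completion handler: unlock the buffer. */
static void async_end_io(struct buf *b) { b->locked = false; }

/* Models a stacking (md-type) driver's own completion handler. */
static void md_end_io(struct buf *b) { async_end_io(b); }

/* Pre-fix style: a sibling counts as pending async IO only if its
 * completion pointer is still the system handler. */
static bool page_busy_by_pointer(const struct buf *done)
{
    const struct buf *t;
    for (t = done->next_in_page; t != done; t = t->next_in_page)
        if (t->end_io == async_end_io && t->locked)
            return true;
    return false;
}

/* Flag style: the async marker survives a replaced end_io pointer. */
static bool page_busy_by_flag(const struct buf *done)
{
    const struct buf *t;
    for (t = done->next_in_page; t != done; t = t->next_in_page)
        if (t->async && t->locked)
            return true;
    return false;
}

/* Two buffers share a page; a driver hooks b, then a completes first.
 * Returns whether the page is still considered busy at that point. */
bool page_looks_busy_after_first_completion(bool use_flag)
{
    struct buf a, b;
    a = (struct buf){ &b, async_end_io, true, true };
    b = (struct buf){ &a, async_end_io, true, true };

    b.end_io = md_end_io;          /* stacking driver replaces the pointer */

    a.end_io(&a);                  /* a's IO finishes; b is still in flight */
    assert(!a.locked && b.locked);

    return use_flag ? page_busy_by_flag(&a) : page_busy_by_pointer(&a);
}
```

In this model, the pointer comparison reports the page as idle while b is still locked, which corresponds to the page being unlocked too early; the flag test still sees b as pending. Calling page_looks_busy_after_first_completion(false) and then (true) shows the two outcomes side by side.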
The attached patch had a slight bug. I've checked in a fixed version to our local tree. Do you want a test build to try it out with?
I can try a test build for an i686/SMP.
This appears to have shipped in 2.4.9-31's linux-2.4.9-assorted-bits.patch (if not earlier?). I'll test that kernel for a while, but the patch should have fixed the issue, as it did in the official kernel tree.
FYI: The 2.4.9-31 kernel survives my testing now. Looks like the bug is closable.