109632 – kernel oops detaching dasd, tq_disk corrupt

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	s390
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Pete Zaitcev
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2003-11-10 14:44 UTC by Richard Hirst
Modified:	2007-11-30 22:06 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-04-20 19:40:17 UTC
Target Upstream Version:
Embargoed:

Description Richard Hirst 2003-11-10 14:44:40 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.0)
Gecko/20020623 Debian/1.0.0-0.woody.1

Description of problem:
My system boots and is running from a r/o dasd, after having booted
from initrd.  I make a CMS formatted dasd visible to the kernel with

echo "add device range=2000" > /proc/dasd/devices

I use a userland tool, cmsfscp, to read a file from the DASD (similar
to mtools for reading DOS filesystems under linux).

I detach the dasd with

echo "set device range=2000 off" > /proc/dasd/devices

and the kernel crashes:

06:10:44 CPU:    0    Not tainted
06:10:44 Process bootfs.sh (pid: 84, task: 07c3a000, ksp: 07c3bf10)
06:10:44 Krnl PSW : 07080000 8002c748
06:10:44            __run_task_queue [kernel] 0x90 (2.4.21-4.EL)
06:10:44 Krnl GPRS: 00000000 00000000 00000000 00000000
06:10:44            00b8c15c 07c3ba00 00000000 00000000 
06:10:44            00001000 00020dfc 07c3a000 07c3ba00
06:10:44            00080000 8002c6c0 07c3ba10 07c3b998 
06:10:44 Krnl ACRS: 4001f860 00000000 00000000 00000000 
06:10:44            00000000 00000000 00000000 00000000 
06:10:44            00000000 00000000 00000000 00000000
06:10:44            00000000 00000000 00000000 00000000
06:10:44 Krnl Code: d2 03 10 08 d0 04 a7 84 00 03 0d e3 19 cb a7 74 ff f0 58 40

06:10:44 [<00067258>] block_sync_page [kernel] 0x48 (0x7c3ba48)
06:10:44 [<00043a02>] ___wait_on_page [kernel] 0xe6 (0x7c3baa8)
06:10:44 [<00044cde>] do_generic_file_read [kernel] 0x4c2 (0x7c3bb20)
06:10:44 [<0004557e>] generic_file_new_read [kernel] 0x92 (0x7c3bba8)
06:10:44 [<000456e0>] generic_file_read [kernel] 0x20 (0x7c3bc28)
06:10:44 [<0006cc80>] kernel_read [kernel] 0x74 (0x7c3bc88)
06:10:44 [<0006d128>] prepare_binprm [kernel] 0xfc (0x7c3bcf0)
06:10:44 [<0006d746>] do_execve [kernel] 0xd6 (0x7c3bd50)
06:10:44 [<00017888>] sys_execve [kernel] 0x74 (0x7c3bee8)
06:10:44 [<00014f92>] sys_execve_glue [kernel] 0xc (0x7c3bf48)


The problem is as follows: the dasd driver has called
blk_cleanup_queue() while the request_queue struct is still queued,
via its plug_tq member, to the disk task queue, tq_disk.  Some time
later run_task_queue() tries to follow the tq_disk linked list and
gets a null pointer as blk_cleanup_queue() zeroed out the
request_queue struct.
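
The failure mode described above can be reproduced in a small user-space model. This is only a sketch, not kernel code: the struct layouts are simplified stand-ins for the 2.4 list_head and tq_struct, and every name in it (list_node, task_entry, model_queue, model_run_task_queue, demo_teardown) is hypothetical. Unlike the real walker, the model checks for a NULL link and reports it instead of dereferencing it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for the 2.4 list_head and tq_struct (assumed layout). */
struct list_node { struct list_node *next, *prev; };

struct task_entry {
    struct list_node list;
    unsigned long sync;          /* non-zero while the entry is queued */
    void (*routine)(void *);
    void *data;
};

/* Hypothetical model of a request queue that embeds its plug task. */
struct model_queue {
    int plugged;
    struct task_entry plug_tq;
};

static void list_init(struct list_node *h) { h->next = h->prev = h; }

/* Queue an entry at the tail unless it is already queued. */
static void model_queue_task(struct task_entry *t, struct list_node *head)
{
    assert(t != NULL && head != NULL);
    if (!t->sync) {
        t->sync = 1;
        t->list.prev = head->prev;
        t->list.next = head;
        head->prev->next = &t->list;
        head->prev = &t->list;
    }
}

/* Walk the list and run each entry. Returns the number of entries run,
 * or -1 if a NULL link is found (where the real kernel would oops). */
static int model_run_task_queue(struct list_node *head)
{
    int ran = 0;
    struct list_node *next = head->next;

    while (next != head) {
        struct task_entry *t;

        if (next == NULL)
            return -1;           /* corrupted list detected */
        t = (struct task_entry *)((char *)next - offsetof(struct task_entry, list));
        next = next->next;       /* NULL if the entry was zeroed while queued */
        t->sync = 0;
        if (t->routine)
            t->routine(t->data);
        ran++;
    }
    list_init(head);             /* list is now empty */
    return ran;
}

static void model_unplug(void *data)
{
    struct model_queue *q = data;
    q->plugged = 0;
}

/* Queue the plug task, optionally drain the list, then zero the queue
 * (standing in for blk_cleanup_queue()) and walk the list again. */
int demo_teardown(int drain_first)
{
    struct list_node tq_disk_model;
    struct model_queue q;

    list_init(&tq_disk_model);
    memset(&q, 0, sizeof q);
    q.plug_tq.routine = model_unplug;
    q.plug_tq.data = &q;
    q.plugged = 1;
    model_queue_task(&q.plug_tq, &tq_disk_model);

    if (drain_first)
        model_run_task_queue(&tq_disk_model);

    memset(&q, 0, sizeof q);     /* entry may still be linked on the list */
    return model_run_task_queue(&tq_disk_model);
}
```

In this model, demo_teardown(0) reports the corrupted list, while demo_teardown(1), which drains the list before zeroing the struct, walks an empty list.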

I don't see this problem with a standard kernel.org 2.4.21 kernel.

I applied the following patch to drivers/s390/block/dasd.c, which
reports that the device is queued on tq_disk, calls run_task_queue(),
and the kernel no longer crashes.

--- drivers/s390/block/dasd.c.ori       2003-11-10 11:55:18.000000000 +0000
+++ drivers/s390/block/dasd.c   2003-11-10 11:55:51.000000000 +0000
@@ -4292,6 +4292,12 @@
                 max_sectors[major][minor + i] = 0;
         }
         if (device->request_queue) {
+           if (device->request_queue->plug_tq.sync) {
+               printk("dasd_disable_blkdev(): Device %d:%d on tq_disk (entry %p), running queue\n", major, minor, &device->request_queue->plug_tq);
+               run_task_queue(&tq_disk);
+               if (device->request_queue->plug_tq.sync)
+                   printk("dasd.c: Ugh, still on tq_disk.  Bye!!\n");
+           }
             blk_cleanup_queue (device->request_queue);
             kfree(device->request_queue);
             device->request_queue = NULL;


The following paragraph is just my theory as to what might be the cause:

I see Red Hat has changes at the end of ll_rw_blk.c:__make_request(),
which set q->plugged=0 and effectively duplicate what would normally
happen when tq_disk is processed.  __make_request() has already called
q->plug_device_fn(), so the request_queue is queued on tq_disk.  If
your changes mean we can now get out of __make_request() with no work
left queued for my dasd device, then there is nothing to stop me
detaching it before anyone calls run_task_queue(&tq_disk), resulting
in the above oops.
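
The theory above can be written down as a tiny state model. This is only an illustration of the reasoning, not the actual Red Hat change or any kernel code: plug_state, model_inline_unplug and the other names are hypothetical, and the "inline unplug" step is an assumption about what the modified __make_request() effectively does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device state: is the queue plugged, is its plug task
 * still linked on tq_disk, and how many requests are pending. */
struct plug_state {
    bool plugged;
    bool on_tq_disk;
    int pending;
};

/* Plugging the device also queues its plug task on tq_disk. */
void model_plug(struct plug_state *s)
{
    assert(s != NULL);
    if (!s->plugged) {
        s->plugged = true;
        s->on_tq_disk = true;
    }
}

void model_submit(struct plug_state *s)
{
    model_plug(s);
    s->pending++;
}

/* Normal path: running tq_disk unlinks the entry and unplugs the device. */
void model_run_tq_disk(struct plug_state *s)
{
    if (s->on_tq_disk) {
        s->on_tq_disk = false;
        s->plugged = false;
        s->pending = 0;
    }
}

/* Theorized path: the work is done inline and plugged is cleared,
 * but the entry is left linked on tq_disk. */
void model_inline_unplug(struct plug_state *s)
{
    s->plugged = false;
    s->pending = 0;
}

/* What a detach path might check if it only looks at outstanding work. */
bool looks_idle(const struct plug_state *s)
{
    return !s->plugged && s->pending == 0;
}

/* What actually has to hold before the queue struct can be zeroed. */
bool safe_to_free(const struct plug_state *s)
{
    return looks_idle(s) && !s->on_tq_disk;
}
```

After model_submit() followed by model_inline_unplug(), looks_idle() is true while safe_to_free() is still false, which is the window in which detaching the device would leave a stale entry on tq_disk.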



Version-Release number of selected component (if applicable):
kernel-2.4.21-4.EL

How reproducible:
Always

Steps to Reproduce:
1. See description section.


Additional info:

Comment 1 Arjan van de Ven 2003-11-10 14:51:26 UTC

This code is rather broken.
run_task_queue() is NO guarantee that all I/O is finished, etc.
Sounds like "set device ... off" isn't supportable.

Comment 2 Richard Hirst 2003-11-10 20:11:46 UTC

Well, in principle set...off should be no harder than scsi
remove-single-device, should it?  run_task_queue() may well not be the
right way to handle this, but the dasd driver has gone through the
motions of canceling outstanding requests first.  I wasn't proposing
my patch as a proper fix, just as evidence of what was causing the
crash.  Maybe the dasd driver didn't cancel outstanding requests properly.

Comment 3 Pete Zaitcev 2003-12-24 00:45:32 UTC

So, where are those outstanding requests coming from?
I suspect it might be one of those cases when it's better not
to do something that hurts.

Comment 5 Suzanne Hillman 2004-04-20 19:40:17 UTC

OK, this has been open and not touched forever. Closing. If this needs
to be fixed still, please reopen with additional information.
