Bug 109632

Summary:	kernel oops detaching dasd, tq_disk corrupt
Product:	Red Hat Enterprise Linux 3	Reporter:	Richard Hirst <rhirst>
Component:	kernel	Assignee:	Pete Zaitcev <zaitcev>
Status:	CLOSED WONTFIX	QA Contact:	Brian Brock <bbrock>
Severity:	high	Docs Contact:
Priority:	medium
Version:	3.0	CC:	rhirst, riel
Target Milestone:	---
Target Release:	---
Hardware:	s390
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2004-04-20 19:40:17 UTC	Type:	---

Description Richard Hirst 2003-11-10 14:44:40 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.0)
Gecko/20020623 Debian/1.0.0-0.woody.1

Description of problem:
My system boots and is running from a r/o dasd, after having booted
from initrd.  I make a CMS formatted dasd visible to the kernel with

echo "add device range=2000" > /proc/dasd/devices

I use a userland tool, cmsfscp, to read a file from the DASD (similar
to mtools for reading DOS filesystems under linux).

I detach the dasd with

echo "set device range=2000 off" > /proc/dasd/devices

and the kernel crashes:

06:10:44 CPU:    0    Not tainted
06:10:44 Process bootfs.sh (pid: 84, task: 07c3a000, ksp: 07c3bf10)
06:10:44 Krnl PSW : 07080000 8002c748
06:10:44            __run_task_queue [kernel] 0x90 (2.4.21-4.EL)
06:10:44 Krnl GPRS: 00000000 00000000 00000000 00000000
06:10:44            00b8c15c 07c3ba00 00000000 00000000 
06:10:44            00001000 00020dfc 07c3a000 07c3ba00
06:10:44            00080000 8002c6c0 07c3ba10 07c3b998 
06:10:44 Krnl ACRS: 4001f860 00000000 00000000 00000000 
06:10:44            00000000 00000000 00000000 00000000 
06:10:44            00000000 00000000 00000000 00000000
06:10:44            00000000 00000000 00000000 00000000
06:10:44 Krnl Code: d2 03 10 08 d0 04 a7 84 00 03 0d e3 19 cb a7 74 ff f0 58 40
 
06:10:44 [<00067258>] block_sync_page [kernel] 0x48 (0x7c3ba48)
06:10:44 [<00043a02>] ___wait_on_page [kernel] 0xe6 (0x7c3baa8)
06:10:44 [<00044cde>] do_generic_file_read [kernel] 0x4c2 (0x7c3bb20)
06:10:44 [<0004557e>] generic_file_new_read [kernel] 0x92 (0x7c3bba8)
06:10:44 [<000456e0>] generic_file_read [kernel] 0x20 (0x7c3bc28)
06:10:44 [<0006cc80>] kernel_read [kernel] 0x74 (0x7c3bc88)
06:10:44 [<0006d128>] prepare_binprm [kernel] 0xfc (0x7c3bcf0)
06:10:44 [<0006d746>] do_execve [kernel] 0xd6 (0x7c3bd50)
06:10:44 [<00017888>] sys_execve [kernel] 0x74 (0x7c3bee8)
06:10:44 [<00014f92>] sys_execve_glue [kernel] 0xc (0x7c3bf48)


The problem is as follows: the dasd driver has called
blk_cleanup_queue() while the request_queue struct is still queued,
via its plug_tq member, on the disk task queue, tq_disk.  Some time
later run_task_queue() tries to follow the tq_disk linked list and
hits a null pointer, because blk_cleanup_queue() zeroed out the
request_queue struct.
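
A minimal userspace sketch of that failure mode, for illustration only.
It is not kernel code: list_node, tq_entry, tq_queue(), tq_run() and
model_cleanup() are hypothetical stand-ins for the 2.4 task queue and
for blk_cleanup_queue(), and the walker reports a NULL link (-1)
where the real kernel would dereference it and oops.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Circular doubly linked list node, loosely modelled on list_head. */
struct list_node { struct list_node *next, *prev; };

/* Simplified stand-in for a task queue entry such as plug_tq. */
struct tq_entry {
    struct list_node list;      /* must stay first: tq_run() casts back */
    unsigned long sync;         /* nonzero while the entry is queued */
    void (*routine)(void *);    /* callback run when the queue is processed */
    void *data;
};

static void tq_init(struct list_node *head) { head->next = head->prev = head; }

/* Queue an entry at the tail unless it is already queued. */
static int tq_queue(struct tq_entry *e, struct list_node *head)
{
    assert(e != NULL && head != NULL);
    if (e->sync)
        return 0;
    e->sync = 1;
    e->list.next = head;
    e->list.prev = head->prev;
    head->prev->next = &e->list;
    head->prev = &e->list;
    return 1;
}

/* Run all queued entries. Returns the number of routines called,
 * or -1 if the walk meets a NULL link (the corruption case). */
static int tq_run(struct list_node *head)
{
    struct list_node *pos = head->next;
    int ran = 0;

    tq_init(head);                      /* detach the pending entries */
    while (pos != head) {
        if (pos == NULL)
            return -1;                  /* the kernel would oops here */
        struct tq_entry *e = (struct tq_entry *)pos;
        pos = pos->next;                /* read the link before the callback */
        e->sync = 0;
        if (e->routine) {
            e->routine(e->data);
            ran++;
        }
    }
    return ran;
}

static void count_call(void *data) { ++*(int *)data; }

/* Models the cleanup zeroing the structure, links included. */
static void model_cleanup(struct tq_entry *e) { memset(e, 0, sizeof *e); }

/* Entry is zeroed while still linked: the walk finds a NULL link. */
int demo_zero_while_queued(void)
{
    struct list_node head;
    int calls = 0;
    struct tq_entry e = { .routine = count_call, .data = &calls };

    tq_init(&head);
    tq_queue(&e, &head);
    model_cleanup(&e);
    return tq_run(&head);
}

/* Queue is run before the cleanup: nothing is left to trip over. */
int demo_run_then_zero(void)
{
    struct list_node head;
    int calls = 0;
    struct tq_entry e = { .routine = count_call, .data = &calls };

    tq_init(&head);
    tq_queue(&e, &head);
    int first = tq_run(&head);
    model_cleanup(&e);
    int second = tq_run(&head);
    return first == 1 && second == 0 && calls == 1;
}
```

In this model demo_zero_while_queued() ends on the NULL link, while
demo_run_then_zero() completes normally because the entry was already
off the list when it was zeroed.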

I don't see this problem with a standard kernel.org 2.4.21 kernel.

I applied the following patch to drivers/s390/block/dasd.c, which
reports that the device is queued on tq_disk, calls run_task_queue(),
and the kernel no longer crashes.

--- drivers/s390/block/dasd.c.ori       2003-11-10 11:55:18.000000000 +0000
+++ drivers/s390/block/dasd.c   2003-11-10 11:55:51.000000000 +0000
@@ -4292,6 +4292,12 @@
                 max_sectors[major][minor + i] = 0;
         }
         if (device->request_queue) {
+           if (device->request_queue->plug_tq.sync) {
+               printk("dasd_disable_blkdev(): Device %d:%d on tq_disk (entry %p), running queue\n", major, minor, &device->request_queue->plug_tq);
+               run_task_queue(&tq_disk);
+               if (device->request_queue->plug_tq.sync)
+                   printk("dasd.c: Ugh, still on tq_disk.  Bye!!\n");
+           }
             blk_cleanup_queue (device->request_queue);
             kfree(device->request_queue);
             device->request_queue = NULL;


The following paragraph is just my theory as to what might be the cause:

I see Red Hat has changes at the end of ll_rw_blk.c:__make_request(),
which set q->plugged=0 and effectively duplicate what would normally
happen when tq_disk is processed.  __make_request() has already called
q->plug_device_fn(), so the request_queue is queued on tq_disk.  If
your changes mean we can now get out of __make_request() with no work
left queued for my dasd device, then there is nothing to stop me
detaching it before anyone calls run_task_queue(&tq_disk), resulting
in the above oops.
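
The theory above can be sketched as a tiny state model. This is an
assumption-laden illustration, not the actual Red Hat or kernel.org
code: model_queue and every function below are hypothetical, and
"pending" simply counts requests that have not yet been handed to the
driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the request queue state relevant to the theory. */
struct model_queue {
    int plugged;      /* queue is plugged, requests are being held back */
    int pending;      /* requests not yet handed to the driver */
    int on_tq_disk;   /* plug_tq is linked on tq_disk */
};

/* Models q->plug_device_fn(): plugging also links plug_tq on tq_disk. */
static void model_plug(struct model_queue *q)
{
    assert(q != NULL);
    if (!q->plugged) {
        q->plugged = 1;
        q->on_tq_disk = 1;
    }
}

/* Models processing tq_disk: unlink the entry, unplug, drain requests. */
static void model_run_tq_disk(struct model_queue *q)
{
    if (q->on_tq_disk) {
        q->on_tq_disk = 0;
        if (q->plugged) {
            q->plugged = 0;
            q->pending = 0;
        }
    }
}

/* Assumed stock behaviour: the request waits until tq_disk is run. */
static void model_make_request_stock(struct model_queue *q)
{
    model_plug(q);
    q->pending++;
}

/* Assumed changed behaviour: unplug and drain at once, but leave
 * plug_tq linked on tq_disk. */
static void model_make_request_early_unplug(struct model_queue *q)
{
    model_plug(q);
    q->pending++;
    q->plugged = 0;
    q->pending = 0;
}

/* The dangerous state: no work left, yet still linked on tq_disk. */
static int idle_but_linked(const struct model_queue *q)
{
    return q->pending == 0 && q->on_tq_disk;
}

/* Returns 1 if the stock path ever reaches the dangerous state. */
int demo_stock(void)
{
    struct model_queue q = { 0, 0, 0 };

    model_make_request_stock(&q);
    int before = idle_but_linked(&q);
    model_run_tq_disk(&q);
    int after = idle_but_linked(&q);
    return before || after;
}

/* Returns 1 if the early-unplug path reaches the dangerous state. */
int demo_early_unplug(void)
{
    struct model_queue q = { 0, 0, 0 };

    model_make_request_early_unplug(&q);
    return idle_but_linked(&q);
}
```

In the model only the early-unplug path leaves an idle queue linked on
tq_disk, which is the window in which a detach would zero a structure
that is still on the list.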



Version-Release number of selected component (if applicable):
kernel-2.4.21-4.EL

How reproducible:
Always

Steps to Reproduce:
1. See the description section.

Additional info:

Comment 1 Arjan van de Ven 2003-11-10 14:51:26 UTC

this code is rather broken.
run_task_queue() is NO guarantee all IO is finished etc etc.
Sounds like set device ... off isn't supportable.

Comment 2 Richard Hirst 2003-11-10 20:11:46 UTC

Well, in principle set...off should be no harder than scsi
remove-single-device, should it?  run_task_queue() may well not be the
right way to handle this, but the dasd driver has gone through the
motions of canceling outstanding requests first.  I wasn't proposing
my patch as a proper fix, just as evidence of what was causing the
crash.  Maybe the dasd driver didn't cancel outstanding requests properly.

Comment 3 Pete Zaitcev 2003-12-24 00:45:32 UTC

So, where are those outstanding requests coming from?
I suspect it might be one of those cases when it's better not
to do something that hurts.

Comment 5 Suzanne Hillman 2004-04-20 19:40:17 UTC

OK, this has been open and not touched forever. Closing. If this needs
to be fixed still, please reopen with additional information.