Bug 595417

Summary:	[RHEL6] bug in DMA handling on ibmvscsi driver
Product:	Red Hat Enterprise Linux 6	Reporter:	Aristeu Rozanski <arozansk>
Component:	kernel	Assignee:	Steve Best <sbest>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Storage QE <storage-qe>
Severity:	urgent	Docs Contact:
Priority:	low
Version:	6.0	CC:	anton, arozansk, bugproxy, dougsland, gansalmon, itamar, jburke, jonathan, kernel-maint, mgahagan, peterm
Target Milestone:	rc
Target Release:	---
Hardware:	ppc64
OS:	All
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:	579454	Environment:
Last Closed:	2010-11-11 16:16:05 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	579454
Bug Blocks:

Description Aristeu Rozanski 2010-05-24 14:59:45 UTC

Also seen on kernel-debug version 2.6.32-29.el6:

+++ This bug was initially created as a clone of Bug #579454 +++

SACHIN P. SANT <sachinp.com>-
 After the first stage of the text-mode installation completes, the system
(p55, POWER5) segfaults on the first reboot. Just before the segfault, the
following badness message is displayed on the console:

returning from prom_init
Phyp-dump not supported on this hardware
ibmvscsi 3000000b: fast_fail not supported in server
------------[ cut here ]------------
Badness at lib/dma-debug.c:820
sd 0:0:1:0: [sda] Assuming drive cache: write through
sd 0:0:1:0: [sda] Assuming drive cache: write through
sd 0:0:1:0: [sda] Assuming drive cache: write through

The same badness message can be reproduced with the latest upstream kernels.
Here is the complete backtrace (against a vanilla kernel):

ibmvscsi 30000003: Client reserve enabled
ibmvscsi 30000003: sent SRP login
ibmvscsi 30000003: SRP_LOGIN succeeded
ibmvscsi 30000003: DMA-API: device driver frees DMA memory with wrong function [device address=0x0000000000011520] [size=36 bytes] [mapped as scather-gather] [unmapped as single]
------------[ cut here ]------------
Badness at lib/dma-debug.c:820
NIP: c00000000039bd24 LR: c00000000039bd20 CTR: c0000000000704a4
REGS: c00000000f69f6f0 TRAP: 0700   Tainted: G        W   (2.6.34-rc3)
MSR: 8000000000029032 <EE,ME,CE,IR,DR>  CR: 48000082  XER: 00000004
TASK = c00000000125cc70[0] 'swapper' THREAD: c000000001324000 CPU: 0
GPR00: c00000000039bd20 c00000000f69f970 c000000001322e38 00000000000000b6
GPR04: 0000000000000001 c0000000000c1ea8 0000000000000000 0000000000000002
GPR08: 0000000000000000 c00000000125cc70 0000000000000f66 0000000000000001
GPR12: 0000000000000002 c00000000f669000 0000000000d47940 0000000001c00000
GPR16: ffffffffffffffff 0000000002673148 00000000018ff984 0000000000000006
GPR20: 0000000000000000 c000000000c7bb80 0000000000000000 0000000000000000
GPR24: 0000000000000001 c000000001de6b00 0000000000000001 c000000001de6f80
GPR28: c000000109c696b0 c00000000f69fa90 c0000000012acab0 c00000000f69f970
NIP [c00000000039bd24] .check_unmap+0x3e0/0x784
LR [c00000000039bd20] .check_unmap+0x3dc/0x784
Call Trace:
[c00000000f69f970] [c00000000039bd20] .check_unmap+0x3dc/0x784 (unreliable)
[c00000000f69fa20] [c00000000039c3dc] .debug_dma_unmap_page+0x98/0xc8
[c00000000f69fb60] [d000000001dd63f4] .unmap_cmd_data+0xd0/0x11c [ibmvscsic]
[c00000000f69fc00] [d000000001dd8878] .handle_cmd_rsp+0xe0/0x154 [ibmvscsic]
[c00000000f69fca0] [d000000001dd7694] .ibmvscsi_handle_crq+0x44c/0x500 [ibmvscsic]
[c00000000f69fd40] [d000000001ddaca4] .rpavscsi_task+0x50/0xd8 [ibmvscsic]
[c00000000f69fdf0] [c0000000000c9e84] .tasklet_action+0x108/0x1d4
[c00000000f69fea0] [c0000000000cb778] .__do_softirq+0x168/0x2b8
[c00000000f69ff90] [c0000000000337b0] .call_do_softirq+0x14/0x24
[c000000001327840] [c000000000010664] .do_softirq+0xa0/0x104
[c0000000013278e0] [c0000000000cb0e4] .irq_exit+0x70/0xd0
[c000000001327960] [c00000000000fee4] .do_IRQ+0x214/0x2d8
[c000000001327a20] [c000000000004d28] hardware_interrupt_entry+0x28/0x2c
--- Exception: 501 at .raw_local_irq_restore+0xc0/0xdc
   LR = .cpu_idle+0x12c/0x1d0
[c000000001327d10] [c000000001290a28] mv88e6131_switch_driver+0x8da0/0x35588 (unreliable)
[c000000001327db0] [c000000000017e14] .cpu_idle+0x12c/0x1d0
[c000000001327e50] [c00000000000a71c] .rest_init+0xe8/0x10c
[c000000001327ee0] [c000000000a12e38] .start_kernel+0x4ec/0x510
[c000000001327f90] [c000000000008c64] .start_here_common+0x2c/0x48
Instruction dump:
e81c001a e93d001a e97e8030 78001f24 79291f24 e87e80c0 e8dd0028 e8fd0030
7d0b002a 7d2b482a 48393a49 60000000 <0fe00000> 480000b8 2f800003 409e00f4
Mapped at:
[<c00000000039c76c>] .debug_dma_map_sg+0xa0/0x220
[<c0000000005085c4>] .scsi_dma_map+0x120/0x164
[<d000000001dd8a6c>] .ibmvscsi_queuecommand+0x180/0x5d0 [ibmvscsic]
[<c0000000004fd9b4>] .scsi_dispatch_cmd+0x21c/0x2cc
[<c000000000506058>] .scsi_request_fn+0x3cc/0x57c
scsi 0:0:1:0: Direct-Access     AIX      VDASD            0001 PQ: 0 ANSI: 3
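
For readers unfamiliar with the warning above: the DMA debug facility remembers how each buffer was mapped and complains when the matching unmap call is of a different kind. The following is only a user-space toy model of that bookkeeping, not the kernel code. The names `MapType`, `DmaDebugModel`, `map` and `unmap` are made up for illustration; the type strings and the message text are copied from the log.

```python
from enum import Enum


class MapType(Enum):
    # Strings exactly as they appear in the console log above.
    SINGLE = "single"
    SG = "scather-gather"


class DmaDebugModel:
    """Toy model (hypothetical): tracks how each device address was mapped."""

    def __init__(self):
        self.entries = {}    # device address -> (size, MapType)
        self.warnings = []   # messages a real checker would print

    def map(self, dev_addr, size, map_type):
        self.entries[dev_addr] = (size, map_type)

    def unmap(self, dev_addr, size, unmap_type):
        """Return True if the unmap matches the recorded mapping type."""
        entry = self.entries.pop(dev_addr, None)
        if entry is None:
            self.warnings.append("unmap of an address that was never mapped")
            return False
        _mapped_size, mapped_type = entry
        if mapped_type is not unmap_type:
            self.warnings.append(
                "DMA-API: device driver frees DMA memory with wrong function "
                f"[device address=0x{dev_addr:016x}] [size={size} bytes] "
                f"[mapped as {mapped_type.value}] "
                f"[unmapped as {unmap_type.value}]"
            )
            return False
        return True


# Same sequence as in the trace: mapped as scatter-gather, unmapped as single.
model = DmaDebugModel()
model.map(0x11520, 36, MapType.SG)
ok = model.unmap(0x11520, 36, MapType.SINGLE)
print(ok)
print(model.warnings[-1])
```

In this model the warning goes away once the unmap call uses the same type as the map call, which is the general shape of fix one would expect for this kind of report.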

This problem has been reported to the community. Here is the link:
http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-April/081541.html

The following patch should fix this issue:

http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-April/081545.html

Please include the above patch in F13.

--- Additional comment from bugproxy.com on 2010-05-03 14:52:13 EDT ---

------- Comment From Subrata Modak subrata.ibm.com 2010-05-03 14:43 EDT-------
The issue is still reproducible with . The following error still occurs. The proposed patch has probably not made it into the Fedora packages:

Segmentation fault
No root device found
Segmentation fault
No root device found
Boot has failed, sleeping forever.

Red Hat,

Any information on this?

Regards--
Subrata

--- Additional comment from bugproxy.com on 2010-05-04 08:14:02 EDT ---

------- Comment From Subrata Modak subrata.ibm.com 2010-05-04 07:53 EDT-------
Red Hat,

Any news about this patch going into the Fedora kernel?

Regards--
Subrata

Comment 2 Steve Best 2010-06-03 17:49:03 UTC

Posted to the rh-kernel mailing list:
http://post-office.corp.redhat.com/archives/rhkernel-list/2010-June/msg00205.html

Comment 3 RHEL Program Management 2010-06-07 15:53:29 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 4 Mike Gahagan 2010-06-08 14:58:34 UTC

I'm still seeing this with the -33.debug kernel.

vio_register_driver: driver ibmvscsi registering
ibmvscsi 30000002: SRP_VERSION: 16.a
scsi0 : IBM POWER Virtual SCSI Adapter 1.5.8
ibmvscsi 30000002: partner initialization complete
ibmvscsi 30000002: host srp version: 16.a, host partition 06-54F4A (1), OS 3, max io 262144
ibmvscsi 30000002: Client reserve enabled
ibmvscsi 30000002: sent SRP login
ibmvscsi 30000002: SRP_LOGIN succeeded
ibmvscsi 30000002: DMA-API: device driver frees DMA memory with wrong function [device address=0x0000000000004a10] [size=36 bytes] [mapped as scather-gather] [unmapped as single]
------------[ cut here ]------------
Badness at lib/dma-debug.c:820
NIP: c00000000030d094 LR: c00000000030d090 CTR: 0000000000000001
REGS: c000000002faf720 TRAP: 0700   Not tainted  (2.6.32-33.el6.ppc64.debug)
MSR: 8000000000029032 <EE,ME,CE,IR,DR>  CR: 28000042  XER: 20000001
TASK = c000000001056b20[0] 'swapper' THREAD: c000000001104000 CPU: 0
GPR00: c00000000030d090 c000000002faf9a0 c000000001103c48 00000000000000b6 
GPR04: 0000000000000001 c00000000008d634 0000000000000000 0000000000000002 
GPR08: 0000000000000000 c000000001056b20 c0000000000bf318 0000000000000001 
GPR12: 0000000028000028 c0000000011e2500 0000000001b1fa78 0000000000d48800 
GPR16: 0000000004300000 000000000021dcff 0000000000c85ba8 c000000001046dc0 
GPR20: 0000000000000006 0000000000000000 c000000001181180 000000000000000a 
GPR24: 0000000000000001 c000000001f91e90 c000000001026e98 c0000000011c8778 
GPR28: c000000001fb1d00 c0000001fdbd9bd8 c0000000010a39d8 c000000002fafac0 
NIP [c00000000030d094] .check_unmap+0x794/0x830
LR [c00000000030d090] .check_unmap+0x790/0x830
Call Trace:
[c000000002faf9a0] [c00000000030d090] .check_unmap+0x790/0x830 (unreliable)
[c000000002fafa50] [c00000000030d4a4] .debug_dma_unmap_page+0x94/0xe0
[c000000002fafb90] [d000000001860364] .unmap_cmd_data+0xf4/0x180 [ibmvscsic]
[c000000002fafc20] [d000000001862944] .handle_cmd_rsp+0x74/0x150 [ibmvscsic]
[c000000002fafcb0] [d0000000018618c4] .ibmvscsi_handle_crq+0x434/0x5b0 [ibmvscsic]
[c000000002fafd40] [d000000001864b44] .rpavscsi_task+0x44/0xe0 [ibmvscsic]
[c000000002fafdf0] [c000000000095284] .tasklet_action+0x1b4/0x1e0
[c000000002fafea0] [c000000000096cec] .__do_softirq+0x13c/0x2d0
[c000000002faff90] [c000000000032948] .call_do_softirq+0x14/0x24
[c000000001107900] [c00000000000e9b0] .do_softirq+0x140/0x160
[c0000000011079a0] [c0000000000966a4] .irq_exit+0xc4/0xd0
[c000000001107a20] [c00000000000ec14] .do_IRQ+0x144/0x230
[c000000001107ad0] [c000000000004804] hardware_interrupt_entry+0x1c/0x98
--- Exception: 501 at .cpu_idle+0x15c/0x1e0
    LR = .cpu_idle+0x15c/0x1e0
[c000000001107dc0] [c000000000016340] .cpu_idle+0x150/0x1e0 (unreliable)
[c000000001107e70] [c000000000009d74] .rest_init+0x84/0xa0
[c000000001107ef0] [c000000000840dc8] .start_kernel+0x598/0x5b8
[c000000001107f90] [c0000000000083d4] .start_here_common+0x1c/0x48
Instruction dump:
e97d001a e81f001a e93e8050 e87e80e0 e8df0028 e8ff0030 796b1f24 78001f24 
7d09582a 7d29002a 482a4fc9 60000000 <0fe00000> 4bfffca0 e89e8058 7c852378 
Mapped at:
 [<c0000000003d157c>] .scsi_dma_map+0x10c/0x150
 [<d000000001862bc0>] .ibmvscsi_queuecommand+0x1a0/0x660 [ibmvscsic]
 [<c0000000003c5910>] .scsi_dispatch_cmd+0x220/0x3e0
 [<c0000000003cf0a4>] .scsi_request_fn+0x484/0x580
 [<c0000000002cdcb8>] .__generic_unplug_device+0x58/0x70
scsi 0:0:3:0: Direct-Access     AIX      VDASD            0001 PQ: 0 ANSI: 3
scsi 0:0:4:0: Direct-Access     AIX      VDASD            0001 PQ: 0 ANSI: 3
scsi: waiting for bus probes to complete ...
scsi_scan_0 used greatest stack depth: 9008 bytes left
sd 0:0:3:0: [sda] 251658240 512-byte logical blocks: (128 GB/120 GiB)
sd 0:0:3:0: [sda] Write Protect is off
sd 0:0:3:0: [sda] Mode Sense: 2f 00 00 08
sd 0:0:4:0: [sdb] 251658240 512-byte logical blocks: (128 GB/120 GiB)
sd 0:0:3:0: [sda] Cache data unavailable
sd 0:0:3:0: [sda] Assuming drive cache: write through
sd 0:0:4:0: [sdb] Write Protect is off
sd 0:0:4:0: [sdb] Mode Sense: 2f 00 00 08
sd 0:0:4:0: [sdb] Cache data unavailable
sd 0:0:4:0: [sdb] Assuming drive cache: write through
sd 0:0:3:0: [sda] Cache data unavailable
sd 0:0:3:0: [sda] Assuming drive cache: write through
 sda:
sd 0:0:4:0: [sdb] Cache data unavailable
sd 0:0:4:0: [sdb] Assuming drive cache: write through
 sdb: sda1 sda2 sda3
sd 0:0:3:0: [sda] Cache data unavailable
sd 0:0:3:0: [sda] Assuming drive cache: write through
sd 0:0:3:0: [sda] Attached SCSI disk
 sdb1
sd 0:0:4:0: [sdb] Cache data unavailable
sd 0:0:4:0: [sdb] Assuming drive cache: write through
sd 0:0:4:0: [sdb] Attached SCSI disk
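
When triaging console logs like the one above, the DMA-API line can be picked out mechanically. Below is a minimal sketch that assumes the bracketed layout seen in this report (`[device address=...] [size=... bytes] [mapped as ...] [unmapped as ...]`); the function name `parse_dma_api_warning` and the returned field names are hypothetical, and other dma-debug messages may use different fields.

```python
import re

# Layout taken from the warning lines quoted in this report.
DMA_API_RE = re.compile(
    r"DMA-API: (?P<message>.*?)\s*"
    r"\[device address=(?P<addr>0x[0-9a-fA-F]+)\]\s*"
    r"\[size=(?P<size>\d+) bytes\]\s*"
    r"\[mapped as (?P<mapped>[^\]]+)\]\s*"
    r"\[unmapped as (?P<unmapped>[^\]]+)\]"
)


def parse_dma_api_warning(text):
    """Return the fields of the first matching DMA-API warning, or None."""
    match = DMA_API_RE.search(text)
    if match is None:
        return None
    return {
        "message": match.group("message"),
        "device_address": int(match.group("addr"), 16),
        "size": int(match.group("size")),
        "mapped_as": match.group("mapped"),
        "unmapped_as": match.group("unmapped"),
    }


line = (
    "ibmvscsi 30000002: DMA-API: device driver frees DMA memory with wrong "
    "function [device address=0x0000000000004a10] [size=36 bytes] "
    "[mapped as scather-gather] [unmapped as single]"
)
info = parse_dma_api_warning(line)
print(info["mapped_as"], "!=", info["unmapped_as"])
```

Because the separators are matched with `\s*`, the same pattern also accepts the warning when a mail client has wrapped it across several lines.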

Comment 5 Aristeu Rozanski 2010-07-01 16:21:09 UTC

Patch(es) available on kernel-2.6.32-42.el6
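
As a rough aid for checking whether a given RHEL 6 kernel string is at or past the build named above, here is a small sketch. It assumes the `2.6.32-<build>.el6` naming seen in this report and that builds after -42 keep the patch; the helper names `el6_build_number` and `has_fix` are hypothetical.

```python
import re

FIXED_BUILD = 42  # from "Patch(es) available on kernel-2.6.32-42.el6"

# Matches e.g. "2.6.32-33.el6.ppc64.debug" or "kernel-2.6.32-42.el6".
EL6_RE = re.compile(r"(?:kernel-)?2\.6\.32-(\d+)(?:\.\d+)*\.el6")


def el6_build_number(release):
    """Return the build number of an el6 kernel string, or None if not el6."""
    match = EL6_RE.match(release)
    return int(match.group(1)) if match else None


def has_fix(release):
    """True/False for el6 kernels; None when the string is not recognised."""
    build = el6_build_number(release)
    if build is None:
        return None
    return build >= FIXED_BUILD


for release in ("2.6.32-33.el6.ppc64.debug", "kernel-2.6.32-42.el6"):
    print(release, has_fix(release))
```

A result of `None` means only that the string is not in the el6 form, for example an upstream kernel such as `2.6.34-rc3`; it says nothing about whether that kernel carries the patch.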

Comment 8 Mike Gahagan 2010-08-19 21:38:48 UTC

No longer seeing this on an IBM Power6 with vscsi with the -66 kernel (snapshot 12)

Comment 9 releng-rhel@redhat.com 2010-11-11 16:16:05 UTC

Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.