675864 – [RHEL6.1] PPC64 - Oops: Kernel access of bad area, sig: 11 [#1]

Summary: [RHEL6.1] PPC64 - Oops: Kernel access of bad area, sig: 11 [#1]

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	6.1
Hardware:	ppc64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Don Zickus
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	6.1KnownIssues

Reported:	2011-02-08 01:06 UTC by Jeff Burke
Modified:	2011-08-05 21:08 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-03-24 13:50:01 UTC
Target Upstream Version:
Embargoed:

Attachments
console log for failed system (215.65 KB, text/plain) 2011-02-08 01:08 UTC, Jeff Burke

Description Jeff Burke 2011-02-08 01:06:31 UTC

Description of problem:
 While running the scrashme test, the system Oops'd.

Version-Release number of selected component (if applicable):
 2.6.32-114.el6.ppc64.debug

How reproducible:
 Unknown
 
Actual results:

Unable to handle kernel paging request for data at address 0x00000040
Faulting instruction address: 0xc00000000043ad74
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=1024 NUMA pSeries
Modules linked in: sr_mod cdrom ums_cypress usb_storage aes_generic ts_kmp nls_koi8_u nls_cp932 sunrpc ipv6 dm_mirror dm_region_hash dm_log ses enclosure sg ehea ext4 jbd2 mbcache sd_mod crc_t10dif ipr radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core power_supply dm_mod [last unloaded: rmd128]
NIP: c00000000043ad74 LR: c00000000043ad64 CTR: c00000000045cf20
REGS: c0000000e31eb4d0 TRAP: 0300   Not tainted  (2.6.32-114.el6.ppc64.debug)
MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 24004024  XER: 00000008
DAR: 0000000000000040, DSISR: 0000000040000000
TASK = c0000000e31da3d0[44] 'khubd' THREAD: c0000000e31e8000 CPU: 2
GPR00: c00000000043ad64 c0000000e31eb750 c0000000014b8c30 0000000000000001 
GPR04: c0000000de44fb00 fffffffffffffffe 0000000000000000 0000000000000002 
GPR08: 0000000000000001 0000000000000000 c00000000043ad38 0000000000000000 
GPR12: 0000000024004022 c000000001592a00 c00000000414f398 c0000000013e5a48 
GPR16: c00000000405dbc0 c0000000e50fd080 c00000000414f3a0 0000000000000002 
GPR20: 0000000000000001 c00000000414f3a8 c00000000414f3a0 0000000000000000 
GPR24: c0000000e50fd080 c0000000de442fa0 c0000000de44fb14 0000000000000001 
GPR28: fffffffffffffffe c0000000de44fb00 c00000000145f658 c0000000013e5ca8 
NIP [c00000000043ad74] .usb_hcd_unlink_urb+0x74/0x170
LR [c00000000043ad64] .usb_hcd_unlink_urb+0x64/0x170
Call Trace:
[c0000000e31eb750] [c00000000043ad64] .usb_hcd_unlink_urb+0x64/0x170 (unreliable)
[c0000000e31eb7f0] [c00000000043cdec] .usb_kill_urb+0x8c/0x140
[c0000000e31eb8c0] [c00000000043abe0] .usb_hcd_flush_endpoint+0x120/0x240
[c0000000e31eb970] [c00000000043dfc8] .usb_disable_endpoint+0x68/0xc0
[c0000000e31eba00] [c00000000043e0a0] .usb_disable_device+0x80/0x290
[c0000000e31ebab0] [c000000000435660] .usb_disconnect+0x110/0x250
[c0000000e31ebb60] [c000000000435630] .usb_disconnect+0xe0/0x250
[c0000000e31ebc10] [c000000000435630] .usb_disconnect+0xe0/0x250
[c0000000e31ebcc0] [c000000000437524] .hub_thread+0x6b4/0x1850
[c0000000e31ebea0] [c0000000000bcfcc] .kthread+0xbc/0xd0
[c0000000e31ebf90] [c000000000033174] .kernel_thread+0x54/0x70
Instruction dump:
2f800000 409d0078 e87d0048 4bff52d1 60000000 7fe3fb78 7f64db78 48189521 
60000000 e93d0048 7f85e378 7fa4eb78 <e8690040> 4bfffbb9 7c7f1b78 e87d0048 
Kernel panic - not syncing: Fatal exception
Call Trace:
[c0000000e31eb1f0] [c000000000013844] .show_stack+0x74/0x1c0 (unreliable)
[c0000000e31eb2a0] [c0000000005ca4ac] .panic+0x80/0x1c0
[c0000000e31eb330] [c000000000030c1c] .die+0x21c/0x2a0
[c0000000e31eb3e0] [c000000000043328] .bad_page_fault+0x98/0xe0
[c0000000e31eb460] [c00000000000525c] handle_page_fault+0x3c/0x74
--- Exception: 300 at .usb_hcd_unlink_urb+0x74/0x170
    LR = .usb_hcd_unlink_urb+0x64/0x170
[c0000000e31eb7f0] [c00000000043cdec] .usb_kill_urb+0x8c/0x140
[c0000000e31eb8c0] [c00000000043abe0] .usb_hcd_flush_endpoint+0x120/0x240
[c0000000e31eb970] [c00000000043dfc8] .usb_disable_endpoint+0x68/0xc0
[c0000000e31eba00] [c00000000043e0a0] .usb_disable_device+0x80/0x290
[c0000000e31ebab0] [c000000000435660] .usb_disconnect+0x110/0x250
[c0000000e31ebb60] [c000000000435630] .usb_disconnect+0xe0/0x250
[c0000000e31ebc10] [c000000000435630] .usb_disconnect+0xe0/0x250
[c0000000e31ebcc0] [c000000000437524] .hub_thread+0x6b4/0x1850
[c0000000e31ebea0] [c0000000000bcfcc] .kthread+0xbc/0xd0
[c0000000e31ebf90] [c000000000033174] .kernel_thread+0x54/0x70
panic occurred, switching back to text console
Rebooting in 180 seconds..[-- MARK -- Mon Feb  7 13:30:00 2011]
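
As a rough cross-check of the dump above: the word in angle brackets in the instruction dump (`e8690040`) is the faulting instruction, and decoding it together with the GPR values shows where the bad address 0x40 came from. The sketch below is only an illustration. The helper names `decode_ds_form` and `parse_gprs` are hypothetical, and the two GPR rows are copied from the oops. It handles only PowerPC DS-form loads (primary opcode 58), which is all that is needed here.

```python
import re

# Two GPR rows copied verbatim from the oops above (r8-r11 and r28-r31).
GPR_DUMP = """
GPR08: 0000000000000001 0000000000000000 c00000000043ad38 0000000000000000
GPR28: fffffffffffffffe c0000000de44fb00 c00000000145f658 c0000000013e5ca8
"""

DS_FORM_LOADS = {0: "ld", 1: "ldu", 2: "lwa"}


def decode_ds_form(word):
    """Decode a PowerPC DS-form load (primary opcode 58).

    Returns (mnemonic, rt, ra, displacement).
    """
    if word >> 26 != 58:
        raise ValueError("not a DS-form load (primary opcode 58)")
    xo = word & 0x3
    if xo not in DS_FORM_LOADS:
        raise ValueError("unknown extended opcode %d" % xo)
    rt = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    disp = word & 0xFFFC          # low two bits are the extended opcode
    if disp & 0x8000:             # sign-extend the 16-bit displacement
        disp -= 0x10000
    return DS_FORM_LOADS[xo], rt, ra, disp


def parse_gprs(text):
    """Map register number -> value from 'GPRnn: v v v v' rows."""
    regs = {}
    for line in text.splitlines():
        m = re.match(r"\s*GPR(\d+):\s+(.*)", line)
        if m:
            first = int(m.group(1))
            for i, value in enumerate(m.group(2).split()):
                regs[first + i] = int(value, 16)
    return regs


regs = parse_gprs(GPR_DUMP)

# Faulting instruction, the word marked <...> in the instruction dump.
mnemonic, rt, ra, disp = decode_ds_form(0xE8690040)
# For loads, RA == 0 means a literal zero base rather than r0.
base = regs[ra] if ra != 0 else 0
ea = (base + disp) & 0xFFFFFFFFFFFFFFFF
print(f"{mnemonic} r{rt},{disp}(r{ra})  -> effective address {ea:#x}")
# prints: ld r3,64(r9)  -> effective address 0x40

# The load three words earlier in the dump is what set r9.
prev = decode_ds_form(0xE93D0048)
print("{} r{},{}(r{})".format(*prev))
# prints: ld r9,72(r29)
```

The computed effective address matches the DAR (0x40), and r9 is zero in the register dump, so the fault looks like a load through a NULL pointer at offset 0x40. That pointer was itself read from offset 0x48 of the object r29 points to. Which structure fields those offsets correspond to depends on the struct layout of this particular debug build and is not established by anything in this report.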

Expected results:


Additional info:

Comment 1 Jeff Burke 2011-02-08 01:08:14 UTC

Created attachment 477542
console log for failed system

Comment 2 Don Zickus 2011-02-18 20:39:31 UTC

Moving the target to RC.

I had trouble reproducing this issue and the console log looked strange because a USB device seemed to have been added during scrashme.

What Jeff and I discovered is that on the IBM blades, each blade has a button that tells the console controller to switch the USB CDROM to that particular blade.  It seems that while the machine was running a test, someone accidentally pressed the button, temporarily giving the CDROM to this blade.  Minutes later they realized their error and pushed the button on the correct blade, thus disconnecting the CDROM from the blade in question.

Jeff and I tried multiple times to reproduce the panic using this scenario.  While we were able to duplicate the error messages exactly as seen in the console log, we could not get the panic to happen.  We tried various timings, from 5 to 30 seconds between button presses.  Nothing.

Therefore, I don't find this to be a beta blocker at all, as it seems to be a strange race condition.

I'll continue investigating to find where the race can happen and whether upstream has already fixed it.

Cheers,
Don

Comment 3 Don Zickus 2011-03-24 13:50:01 UTC

I discussed this briefly with upstream; we couldn't figure out which race condition could cause this.  I haven't been able to reproduce it, so it is very difficult to debug.

Closing for now unless someone sees it again.
