165453 – Panic after ENXIO with usb-uhci

Summary: Panic after ENXIO with usb-uhci

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Pete Zaitcev
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	168424

Reported:	2005-08-09 15:33 UTC by Bastien Nocera
Modified:	2007-11-30 22:07 UTC
CC List:	1 user
Fixed In Version:	RHSA-2006-0144
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-03-15 16:22:21 UTC
Target Upstream Version:
Embargoed:

Attachments
oops-multitech1.txt (33.36 KB, text/plain) 2005-08-09 15:33 UTC, Bastien Nocera
oops-multitech2.txt (35.07 KB, text/plain) 2005-08-09 15:34 UTC, Bastien Nocera
Candidate #1 - backport from 2.6 (2.17 KB, patch) 2005-08-11 08:09 UTC, Pete Zaitcev

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2006:0144	0	qe-ready	SHIPPED_LIVE	Moderate: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 7	2006-03-15 05:00:00 UTC

Description Bastien Nocera 2005-08-09 15:33:13 UTC

The kernel panics while using a Multitech MT5634ZBA V92 modem, after some
ENXIO errors appear in the log files.
Panic traces are attached below.

Comment 1 Bastien Nocera 2005-08-09 15:33:13 UTC

Created attachment 117576
oops-multitech1.txt

Comment 2 Bastien Nocera 2005-08-09 15:34:13 UTC

Created attachment 117577
oops-multitech2.txt

Comment 3 Pete Zaitcev 2005-08-11 00:57:38 UTC

Hmm. This is something that my fixes in 2.4.21-31.EL.usbserial.4 are not
likely to fix.

The ENXIO is a good clue. It happens upon disconnect, before the disconnect
method had a chance to run (either real disconnect, or just the device
giving up the ghost).

What did you actually do before getting the oops? I need to recreate this
situation.

Comment 4 Pete Zaitcev 2005-08-11 01:47:01 UTC

I happen to have a Multitech, and it actually does have endpoint 0x86,
believe it or not:

T:  Bus=03 Lev=01 Prnt=01 Port=00 Cnt=01 Dev#=  2 Spd=12  MxCh= 0
D:  Ver= 1.00 Cls=02(comm.) Sub=00 Prot=00 MxPS= 8 #Cfgs=  2
P:  Vendor=06e0 ProdID=f107 Rev= 1.00
S:  Manufacturer=Multi-Tech Systems, Inc.
S:  Product=MultiModemUSB
C:  #Ifs= 2 Cfg#= 1 Atr=a0 MxPwr=400mA
I:  If#= 0 Alt= 0 #EPs= 0 Cls=ff(vend.) Sub=ff Prot=ff Driver=
I:  If#= 1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=
E:  Ad=02(O) Atr=02(Bulk) MxPS=  16 Ivl=0ms
E:  Ad=84(I) Atr=03(Int.) MxPS=  63 Ivl=2ms
C:* #Ifs= 2 Cfg#= 2 Atr=a0 MxPwr=400mA
I:  If#= 0 Alt= 0 #EPs= 1 Cls=02(comm.) Sub=02 Prot=01 Driver=cdc_acm
E:  Ad=84(I) Atr=03(Int.) MxPS=  32 Ivl=128ms
I:  If#= 1 Alt= 0 #EPs= 2 Cls=0a(data ) Sub=00 Prot=00 Driver=cdc_acm
E:  Ad=02(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=86(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
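
For readers decoding the Ad= and Atr= fields in the listing above: they follow the USB endpoint descriptor layout, where bit 7 of the address is the direction (set means IN), the low four bits are the endpoint number, and the low two bits of the attributes give the transfer type. A minimal sketch, assuming nothing beyond that layout; the helper names (ep_number, ep_is_in, ep_xfer_name, ep_describe) are hypothetical and are not kernel functions:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bit layout of bEndpointAddress and bmAttributes in a USB endpoint descriptor. */
#define EP_DIR_IN    0x80u  /* bit 7 set: IN (device to host) */
#define EP_NUM_MASK  0x0Fu  /* bits 0..3: endpoint number */
#define EP_XFER_MASK 0x03u  /* bits 0..1: transfer type */

/* Hypothetical helper: endpoint number from an address such as 0x86. */
static unsigned ep_number(uint8_t addr)
{
    return addr & EP_NUM_MASK;
}

/* Hypothetical helper: nonzero when the endpoint is an IN endpoint. */
static int ep_is_in(uint8_t addr)
{
    return (addr & EP_DIR_IN) != 0;
}

/* Hypothetical helper: transfer type name from the attributes byte. */
static const char *ep_xfer_name(uint8_t attr)
{
    static const char *const names[] = {
        "Control", "Isochronous", "Bulk", "Interrupt"
    };
    return names[attr & EP_XFER_MASK];
}

/* Hypothetical helper: format a one-line description into buf.
 * Returns the value of snprintf (length of the full description). */
static int ep_describe(uint8_t addr, uint8_t attr, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "endpoint %u %s %s",
                    ep_number(addr),
                    ep_is_in(addr) ? "IN" : "OUT",
                    ep_xfer_name(attr));
}
```

With these helpers, Ad=86 with Atr=02 decodes as endpoint 6, IN, Bulk, which matches the last line of the listing.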

I was wrong about ENXIO, by the way. The printout is misleading.
It is trying to tell us that a URB was submitted for an endpoint
which already has a URB submitted.

Comment 5 Pete Zaitcev 2005-08-11 07:11:34 UTC

I think I know what is happening here. We have open and close racing,
and as a result, open attempts to submit acm->ctrlurb and acm->readurb
which were not unlinked yet. The double submission of a bulk URB is
checked by usb-uhci and is refused with the "ENXIO" message. The double
submission of the control-interrupt URB "succeeds" quietly, and corrupts
something. Double termination results in oops (with urb->dev == NULL).

Unfortunately, lock_kernel is not enough to keep opens and closes separated,
because some of the operations the close path performs are blocking. I expect
we'll need a semaphore here somewhere.

Comment 6 Pete Zaitcev 2005-08-11 08:09:56 UTC

Created attachment 117636
Candidate #1 - backport from 2.6

As it happens, Oliver already implemented the semaphore in 2.6.
Great minds think alike. Also, RHEL 4 is not affected.
I am using the same code conventions for similarity.

Comment 7 Pete Zaitcev 2005-08-11 08:12:27 UTC

I'm de-needinfoing this bug, but I still need a precise scenario for surety.
The fix is only based on analysis of oops captures.

Comment 8 Bastien Nocera 2005-08-11 08:47:14 UTC

From what I know the current usage is "normal" usage as a fax server, using
Hylafax. I'll see whether I can get something more precise.

Comment 9 Bastien Nocera 2005-08-12 09:23:45 UTC

When a fax can't be sent, the send is retried at a later time. Every now and
then, the retry will trigger the panic. When the panic occurs, the lock file
from Hylafax usually contains "LOCKWAIT".

Would you be able to provide a test kernel for testing purposes?

Comment 10 Pete Zaitcev 2005-08-31 09:10:33 UTC

Please find the kernel to test at ftp://people.redhat.com/zaitcev/165453/
Let me know how it went, and assuming success I'll post for acks.

Comment 12 Ernie Petrides 2005-09-15 04:17:30 UTC

A fix for this problem has just been committed to the RHEL3 U7
patch pool this evening (in kernel version 2.4.21-37.2.EL).

Comment 15 Red Hat Bugzilla 2006-03-15 16:22:21 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0144.html