Bug 243067

Summary: Kernel panic using USB serial I/O
Product: Red Hat Enterprise Linux 4 Reporter: Greg Bailey <gbailey>
Component: kernel Assignee: Pete Zaitcev <zaitcev>
Status: CLOSED ERRATA QA Contact: Martin Jenner <mjenner>
Severity: high Docs Contact:
Priority: high    
Version: 4.5 CC: jbaron, jtluka, peterm, riek, syeghiay
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-05-18 19:26:38 UTC Type: ---
Bug Blocks: 430698, 461297    
Attachments:
Description (Flags)
/var/log/messages excerpt from "proxy7000" server (none)
/var/log/messages excerpt from "communicator1" server (none)
Test patch 1 - backport (none)
cfa631.c program which opens/closes USB serial (none)
Test patch 3 - smarter backport (none)

Description Greg Bailey 2007-06-07 05:48:07 UTC
Description of problem:

Servers which encountered the panic were running a "jbaron test kernel"
2.6.9-42.28.ELsmp (i686) in order to standardize on a kernel with updated sky2
drivers (vs. 2.6.9-42.0.8.ELsmp which is what was available at the time the
machines were installed).

Kernel panic occurred on 2 servers with identical hardware at the same time
while performing the same software update mechanism.  Servers are Intel
SE7221BK1E server board, and have a CrystalFontz CFA-631 LCD display connected
via USB serial.

The upgrade procedure starts and then restarts a driver program ("cfa631") which
controls the CFA-631 LCD display.  The panic message written to
/var/log/messages includes:

Jun  6 20:50:12 proxy7000 kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000070
Jun  6 20:50:12 proxy7000 kernel:  printing eip:
Jun  6 20:50:12 proxy7000 kernel: f89147fd
Jun  6 20:50:12 proxy7000 kernel: *pde = 13d8b001
Jun  6 20:50:12 proxy7000 kernel: Oops: 0000 [#1]
Jun  6 20:50:12 proxy7000 kernel: SMP
Jun  6 20:50:12 proxy7000 kernel: Modules linked in: loop ppp_deflate
zlib_deflate ppp_async crc_ccitt ppp_generic slhc ipt_LOG ipt_state ip_conntrack
iptable_filter ip_tables parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc
dm_mirror dm_mod button battery ac ftdi_sio usbserial uhci_hcd ehci_hcd
hw_random e1000 ext3 jbd ata_piix libata ipw2100 ieee80211 ieee80211_crypt
sd_mod scsi_mod
Jun  6 20:50:12 proxy7000 kernel: CPU:    1
Jun  6 20:50:12 proxy7000 kernel: EIP:    0060:[<f89147fd>]    Not tainted VLI
Jun  6 20:50:12 proxy7000 kernel: EFLAGS: 00010246   (2.6.9-42.28.ELsmp)
Jun  6 20:50:12 proxy7000 kernel: EIP is at ftdi_set_termios+0x1d/0x460 [ftdi_sio]
Jun  6 20:50:12 proxy7000 kernel: eax: 00000000   ebx: e3985000   ecx: f7fff080
  edx: c2a01600
Jun  6 20:50:12 proxy7000 kernel: esi: f7f43e00   edi: f7db6740   ebp: f7371c00
  esp: cbec5e54
Jun  6 20:50:12 proxy7000 kernel: ds: 007b   es: 007b   ss: 0068
Jun  6 20:50:12 proxy7000 kernel: Process cfa631 (pid: 12972,
threadinfo=cbec5000 task=f325ec30)
Jun  6 20:50:12 proxy7000 kernel: Stack: f7371c00 00007573 00000000 e3985000
c2a01600 f7db6740 f7f43e00 f8913ac6
Jun  6 20:50:12 proxy7000 kernel:        00000040 00000000 00000000 cbec5e93
00000000 00001388 f7371c00 c016f2fc
Jun  6 20:50:12 proxy7000 kernel:        cbec5f50 00000000 d0cf7000 c01676bf
c2b0cb94 f7f00e00 00001000 00000000
Jun  6 20:50:12 proxy7000 kernel: Call Trace:
Jun  6 20:50:12 proxy7000 kernel:  [<f8913ac6>] ftdi_open+0x91/0x16b [ftdi_sio]
Jun  6 20:50:12 proxy7000 kernel:  [<c016f2fc>] dput+0x34/0x1a7
Jun  6 20:50:12 proxy7000 kernel:  [<c01676bf>] link_path_walk+0x94/0xbe
Jun  6 20:50:12 proxy7000 kernel:  [<f890a2e5>] serial_open+0xa5/0xf9 [usbserial]
Jun  6 20:50:12 proxy7000 kernel:  [<c01fe811>] tty_open+0x19a/0x2be
Jun  6 20:50:12 proxy7000 kernel:  [<c0162ad7>] chrdev_open+0x14c/0x187
Jun  6 20:50:12 proxy7000 kernel:  [<c015a6d9>] __dentry_open+0xb7/0x18f
Jun  6 20:50:12 proxy7000 kernel:  [<c015a5c0>] filp_open+0x5c/0x70
Jun  6 20:50:12 proxy7000 kernel:  [<c01c381e>] direct_strncpy_from_user+0x46/0x5d
Jun  6 20:50:12 proxy7000 kernel:  [<c015a905>] sys_open+0x31/0x7d
Jun  6 20:50:12 proxy7000 kernel:  [<c02d523f>] syscall_call+0x7/0xb
Jun  6 20:50:12 proxy7000 kernel: Code: e8 07 e1 80 c7 83 c4 14 59 5b 5e 5f 5d
c3 55 57 56 89 c6 53 83 ec 0c 8b 00 83 3d 80 b2 91 f8 00 8b 96 f8 00 00 00 8b 28
8b 46 04 <8b> 40 70 89 54 24 04 8b 00 89 04 24 74 17 68 ac 52 91 f8 68 b5
Jun  6 20:50:12 proxy7000 kernel:  <0>Fatal exception: panic in 5 seconds

A Google search turned up a very similar panic reported at:

http://lists.centos.org/pipermail/centos/2005-December/057755.html

Version-Release number of selected component (if applicable):

kernel 2.6.9-42.28.ELsmp (i686)

How reproducible:

TBD - will attempt to reproduce at will by running torture tests but haven't
done so yet.

Steps to Reproduce:
1.  Unknown - possible race condition involving USB serial I/O
  
Actual results:

kernel panic and messages to /var/log/messages

Expected results:

no panic

Additional info:

Comment 1 Greg Bailey 2007-06-07 05:52:29 UTC
Created attachment 156429 [details]
/var/log/messages excerpt from "proxy7000" server

Comment 2 Greg Bailey 2007-06-07 05:53:04 UTC
Created attachment 156430 [details]
/var/log/messages excerpt from "communicator1" server

Comment 3 Greg Bailey 2007-06-07 21:10:35 UTC
I'm able to reproduce this on kernel-smp-2.6.9-55.EL.
The panic occurs within a minute or two of stressing USB serial activity.

Jun  7 14:05:58 geb-test0 kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000070
Jun  7 14:05:58 geb-test0 kernel:  printing eip:
Jun  7 14:05:58 geb-test0 kernel: c02004b1
Jun  7 14:05:58 geb-test0 kernel: *pde = 3572b001
Jun  7 14:05:58 geb-test0 kernel: Oops: 0000 [#1]
Jun  7 14:05:58 geb-test0 kernel: SMP
Jun  7 14:05:58 geb-test0 kernel: Modules linked in: parport_pc lp parport
autofs4 i2c_dev i2c_core sunrpc ipt_LOG ipt_state ip_conntrack iptable_filter
ip_tables dm_mirror dm_mod button ftdi_sio usbserial battery ac uhci_hcd
ehci_hcd hw_random e1000 sky2 ext3 jbd ata_piix libata sd_mod scsi_mod
Jun  7 14:05:58 geb-test0 kernel: CPU:    0
Jun  7 14:05:58 geb-test0 kernel: EIP:    0060:[<c02004b1>]    Not tainted VLI
Jun  7 14:05:58 geb-test0 kernel: EFLAGS: 00010246   (2.6.9-55.ELsmp)
Jun  7 14:05:58 geb-test0 kernel: EIP is at tty_get_baud_rate+0x3/0x3e
Jun  7 14:05:58 geb-test0 kernel: eax: 00000000   ebx: 00000000   ecx: 00000000
  edx: f7f10000
Jun  7 14:05:58 geb-test0 kernel: esi: 00000000   edi: f7e86f00   ebp: f89303b4
  esp: f6c0ce24
Jun  7 14:05:58 geb-test0 kernel: ds: 007b   es: 007b   ss: 0068
Jun  7 14:05:58 geb-test0 kernel: Process cfa631 (pid: 6397, threadinfo=f6c0c000
task=f700f9b0)
Jun  7 14:05:58 geb-test0 kernel: Stack: f7f29e00 f892e267 f7f29e00 f6a68200
f7e86f00 f73c2800 f892e1fc 00000008
Jun  7 14:05:58 geb-test0 kernel:        f7f29e00 00001cb2 f73c2800 f892fa76
00000000 f7e86f00 00000000 f374b000
Jun  7 14:05:58 geb-test0 kernel:        f7e86f00 c2a8f940 f7f29e00 f892eac6
00000040 00000000 00000000 f6c0ce93
Jun  7 14:05:58 geb-test0 kernel: Call Trace:
Jun  7 14:05:58 geb-test0 kernel:  [<f892e267>] get_ftdi_divisor+0x19/0x25a
[ftdi_sio]
Jun  7 14:05:58 geb-test0 kernel:  [<f892e1fc>] change_speed+0x2d/0x7f [ftdi_sio]
Jun  7 14:05:58 geb-test0 kernel:  [<f892fa76>] ftdi_set_termios+0x296/0x460
[ftdi_sio]
Jun  7 14:05:58 geb-test0 kernel:  [<f892eac6>] ftdi_open+0x91/0x16b [ftdi_sio]
Jun  7 14:05:58 geb-test0 kernel:  [<c016f730>] dput+0x34/0x1a7
Jun  7 14:05:58 geb-test0 kernel:  [<c01678ea>] link_path_walk+0x94/0xbe
Jun  7 14:05:58 geb-test0 kernel:  [<f88cf2e5>] serial_open+0xa5/0xf9 [usbserial]
Jun  7 14:05:58 geb-test0 kernel:  [<c01ff270>] tty_open+0x189/0x2ad
Jun  7 14:05:58 geb-test0 kernel:  [<c0162aa9>] chrdev_open+0x14c/0x187
Jun  7 14:05:58 geb-test0 kernel:  [<c015a6c9>] __dentry_open+0xb7/0x18f
Jun  7 14:05:58 geb-test0 kernel:  [<c015a5b0>] filp_open+0x5c/0x70
Jun  7 14:05:58 geb-test0 kernel:  [<c01c3e92>] direct_strncpy_from_user+0x46/0x5d
Jun  7 14:05:58 geb-test0 kernel:  [<c015a8f5>] sys_open+0x31/0x7d
Jun  7 14:05:58 geb-test0 kernel:  [<c02d5ee3>] syscall_call+0x7/0xb
Jun  7 14:05:58 geb-test0 kernel: Code: 10 74 1c 89 d1 83 e1 0f 74 0b 8d 41 0f
3b 05 7c d2 33 c0 76 08 80 e6 ef 89 53 08 eb 02 89 c1 5b 8b 04 8d 00 d2 33 c0 c3
53 89 c3 <8b> 40 70 e8 bd ff ff ff 3d 00 96 00 00 75 2a 83 bb 48 09 00 00
Jun  7 14:05:58 geb-test0 kernel:  <0>Fatal exception: panic in 5 seconds

Comment 4 Pete Zaitcev 2007-06-14 04:56:32 UTC
Created attachment 156952 [details]
Test patch 1 - backport

Comment 5 Pete Zaitcev 2007-06-14 05:05:18 UTC
The first OOPS seems like the usual problem with USB serial used as a console,
without a tty corresponding to it. I cannot explain how this happens. Surely
the control program cfa631 should open a normal tty... Regardless, it's easy
to fix with a backport of a few NULL checks.

Putting "stress" on USB serial is ill-advised. It's somewhat fragile. Worse,
fixing it may be difficult. We'll see what I can do, but at least rapid
and overlapping open/close must be 100% supported.

Greg, please test the kernel at this URL:
 http://people.redhat.com/zaitcev/ftp/243067/


Comment 6 Greg Bailey 2007-06-14 15:53:37 UTC
Created attachment 157014 [details]
cfa631.c program which opens/closes USB serial

Comment 7 Greg Bailey 2007-06-14 15:56:01 UTC
I ran my tests using the supplied 2.6.9-55.7.EL.bz243067.1smp kernel, and was
unable to reproduce the crash(es) after running over 10 hours.  Previously, I
could crash the kernel after 5-10 minutes.  This patch fixes the issue from my
perspective.

Comment 8 Pete Zaitcev 2007-06-14 16:41:52 UTC
Thanks for testing.

Unfortunately we just missed the 4.6 submission deadline by 8 days.
And in any case PM has to approve.


Comment 10 RHEL Program Management 2008-03-25 20:10:41 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 11 Pete Zaitcev 2008-05-18 20:52:16 UTC
My candidate patch #1 was rejected by the engineering review. It's still
full of bugs, just plugs bigger holes. I'm working on a better quality fix.

Comment 14 RHEL Program Management 2008-09-03 13:05:03 UTC
Updating PM score.

Comment 16 Pete Zaitcev 2008-10-08 01:00:45 UTC
Created attachment 319713 [details]
Test patch 3 - smarter backport

This version dispenses with all the unnecessary tty!=NULL checking,
including those in ftdi. This is because in reality poisoning is not
used by upstream, and single-stage disconnect with refcounts is employed.
The whole mechanism is covered with table_lock.  In the patch, port_lock
is only an implementation detail: it avoids changing usb-serial's
structures, to permit module mix-and-match.

Comment 17 Pete Zaitcev 2008-10-08 01:05:23 UTC
Greg, please test the kernel 2.6.9-78.13.EL.bz243067.2, same location:
 http://people.redhat.com/zaitcev/ftp/243067/

Comment 18 Greg Bailey 2008-10-08 21:45:40 UTC
Sorry for the delay; had to recreate my torture test environment and it's been a while...

I reconfirmed the bug exists in the latest 4.7 kernel (2.6.9-78.0.5 SMP), and I can trigger a panic immediately after starting my test.

After booting 2.6.9-78.13.EL.bz243067.2smp, my torture test runs fine--no panics occur.  This patch fixes the issue from my perspective.

Comment 19 Pete Zaitcev 2008-10-09 02:32:59 UTC
Thanks, Greg. Posted now.

Comment 20 Ludek Smid 2008-10-21 12:36:00 UTC
Does this mean that the Conditional NAK is no longer valid for this bug? If so, please remove it from the Developer Whiteboard.

Comment 21 RHEL Program Management 2008-12-17 20:20:25 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 23 Vivek Goyal 2009-01-05 14:17:25 UTC
Committed in 78.23.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 25 Jan Tluka 2009-05-05 15:46:52 UTC
Patch is in -89.EL kernel.

Comment 27 errata-xmlrpc 2009-05-18 19:26:38 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html