Bug 250482

Summary: memory leak caused by USBDEVFS_BULK transfers immediately before process exit
Product: Red Hat Enterprise Linux 4 Reporter: Sam Bishop <sbishop>
Component: kernelAssignee: Pete Zaitcev <zaitcev>
Status: CLOSED ERRATA QA Contact: Martin Jenner <mjenner>
Severity: high Docs Contact:
Priority: medium    
Version: 4.4CC: cw, pep, tao
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-01-16 19:04:48 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
strace_dump.txt
none
test program
none
Test patch 1
none
four strace logs, xxx.mon, xxx.dmesg, xxx.devices none

Description Sam Bishop 2007-08-01 20:33:21 UTC
Description of problem:
I maintain a simple USB kernel driver for a tester built and used in-house.  I'm
trying to move to a userspace driver using libusb.  It appears, however, that
the kernel has trouble with a USBDEVFS_BULK ioctl right before a process terminates.

Version-Release number of selected component (if applicable):
uname -r => 2.6.9-42.ELsmp

How reproducible:
The problem appears to be timing related, though I can make the problem happen
or not on demand.

Steps to Reproduce:
1. Execute a program which claims an interface, sends a packet to a bulk-out
endpoint, releases the endpoint, and immediately exits.  (The strace log of a
program which does that is attached.)  In my case, the device receives the
packet but stops responding after the trouble-causing program is run a few times.
2. dmesg | tail -n 1 will show you something like this:
ehci_hcd 0000:00:1d.7: qh cec8e700 (#2) state 1
  
Actual results:
The error message comes from line 996 of drivers/usb/host/ehci-hcd.c, in the
function ehci_endpoint_disable().  The comment before the ehci_err() call
indicates that the kernel is leaking memory.  I added a call to dump_stack()
right before the ehci_err() call, and got this:
 [<f8daace5>] ehci_endpoint_disable+0x98/0x109 [ehci_hcd]
 [<f8daac4d>] ehci_endpoint_disable+0x0/0x109 [ehci_hcd]
 [<c025a508>] hcd_endpoint_disable+0x177/0x17f
 [<c025b77f>] usb_disable_endpoint+0x1e/0x3d
 [<c025b7c4>] usb_disable_interface+0x26/0x33
 [<c0255b12>] usb_unbind_interface+0x1d/0x50
 [<c021f8e4>] device_release_driver+0x3c/0x46
 [<c0255d27>] usb_driver_release_interface+0x2c/0x40
 [<c025e2fe>] releaseintf+0x7c/0x95
 [<c025fec1>] proc_releaseinterface+0x1f/0x35
 [<c026079f>] usbdev_ioctl+0x5c0/0x62a
 [<c016ab9d>] file_ioctl+0x19d/0x1af
 [<c016adb7>] sys_ioctl+0x208/0x269
 [<c02d4703>] syscall_call+0x7/0xb
(I get a similar stack dump if I remove the USBDEVFS_RELEASEINTERFACE ioctl.  In
that case, this error shows up as a result of the automatic cleanup after
process termination.)

Expected results:
I think that's obvious. :)

Additional info:
If I put a one-second sleep between the bulk transfer and the implicit or
explicit release of the interface, I do not see the error message.

Comment 1 Sam Bishop 2007-08-01 20:33:21 UTC
Created attachment 160461 [details]
strace_dump.txt

Comment 2 Sam Bishop 2007-08-30 20:42:51 UTC
Created attachment 181961 [details]
test program

Comment 3 Sam Bishop 2007-08-30 20:46:08 UTC
Pete,

I got an out-of-band request for some more information, asking about the
hardware and code involved.  The full hardware involved is a little rare, of
course, but the USB controller portion is simply a Cypress SX2.

I have attached a simple test program which I can use to reproduce the "ehci_hcd
... state 1" error message just like the original code.  This time, however, I
have not seen the tester quit responding.  I will keep looking into this and
will let you know if I figure out why.

Comment 5 Pete Zaitcev 2007-08-30 21:30:52 UTC
Hmm, I'm pretty sure I have just fixed (ok, a couple of months ago) a very
similar issue with IBM where a submission error triggered a memory leak.
It happened on RHEL 5, bug 236922, fixed in kernel-2.6.18-26.el5.
Upstream thread:
 http://marc.info/?t=118068129800003&r=1&w=2

I'll look into this bug, maybe it's the same one, maybe not.

Comment 7 Sam Bishop 2007-08-31 15:37:26 UTC
I'm just curious...  Have any of you been able to reproduce this?

Comment 8 Pete Zaitcev 2007-09-01 00:59:54 UTC
I reproduced this on 2.6.9-52.EL in RHEL. Fedora 7 seems to be unaffected,
so perhaps we can backport something small.

BTW, this issue does not have anything to do with the bug I mentioned in
comment #5.

Comment 9 Pete Zaitcev 2007-09-01 05:23:20 UTC
Created attachment 184321 [details]
Test patch 1

Something like this may be the solution. I tested it to work with the
provied reproducer. The problem is, ehci_hcd is a rather complex driver,
so I cannot be sure that I covered all bases.

Comment 10 Pete Zaitcev 2007-09-01 05:28:03 UTC
I would like Support to ask the customer to migrate the system to RHEL 5.
If they use the box as a single-task system with a custom software as
the initial report indicates, it should be the most practical solution.
Also, RHEL 5 is available immediately, but if I touch EHCI in RHEL 4 it
will need extensive beta testing before coming deployed with an update.
It can be done, just seems like too much trouble for the same result.

Comment 11 Sam Bishop 2007-09-05 19:29:33 UTC
Hmm...  I very much wish that was an option.  One of the reasons I am trying to
move to a userspace driver, though, is to make the software easier to deploy--we
are finding more uses for our little tester all the time--and Micron has only
just moved from RHEL3 to RHEL4 as its standard Linux install.  Requiring RHEL5
would be a non-starter, I'm afraid.

Comment 12 Pete Zaitcev 2007-09-06 01:47:48 UTC
Please try a test kernel 2.6.9-58.EL.bz250482.1 from this location:
 http://people.redhat.com/zaitcev/ftp/250482/


Comment 13 Sam Bishop 2007-09-06 16:20:51 UTC
This is better; I am not seeing anything in dmesg now.  But one of my
"regression tests" isn't working.

I put together some shell scripts using the test program I have already sent
you.  Here is the one with the most interesting behavior:

# Write an increasing number of unique words to the LCD screen.  The output
# (on stdout) looks like this:
#
# 1 zyzzyvas
# 2 zyzzyva zythum
# 3 zythem zymurgy zymurgies
# 4 zymotoxic zymotize zymotically zymotic
tac /usr/share/dict/words | \
{
    for i in `seq 1 4`
    do
        unset words
        for j in `seq 1 $i`
        do
            read words[${#words[@]}]
        done

        echo $i ${words[@]}
        sleep 1
        ./a.out ${words[@]}
    done
}

I also enabled the 1 second delay inside the test program.

If I power-cycle the tester and run the above shell script, I see this on the
tester's LCD screen: 

(<...> = missing word)

zyzzyvas
<zyzzyva> zythum
zythem zymurgy zymurgies
<zymotoxic> zymotize zymotically zymotic

Let me know if there is anything else you'd like me to try.

Comment 14 Pete Zaitcev 2007-09-18 22:22:16 UTC
Unfortunately, I cannot test for the follow-up problem per comment #13.
I am working with the customer out of band to gain access to the equipment.

Comment 15 Pete Zaitcev 2007-09-20 02:07:18 UTC
Sam informed me that the hardware is not prohibitively large, but expensive.
He did not say, but I deduce that they do not have spares for me to play with.
I'm switching to instrumented kernels.

Comment 16 Sam Bishop 2007-09-24 22:13:00 UTC
I just got a hold of another PC with RHEL 4, update 4 installed.  With a
standard kernel image, I can reproduce the test and behavior described in
comment #13.  I do not see any error messages in the kernel log.

I removed the ehci_hcd module and got the same results.

I will try RHEL 5 and Fedora 7 on that same PC ASAP.

Comment 17 Pete Zaitcev 2007-09-25 04:54:05 UTC
So, the "another" PC is useless for testing fixes for the original issue,
because it fails to reproduce it. However, it can test for the follow-up
issue.

I still haven't got the specific instrumentation ready, but if it's HCD
independent, let's try something simpler, like usbmon. I uploaded kernel
2.6.9-58.EL.bz250482.2 to http://people.redhat.com/zaitcev/ftp/250482/ .
Sam, please use it to capture a usbmon trace for me.

This is how:
 1. Add strace -o xxx.strace  in front of ./a.out  (It helps me separate
issues with libusb from issues with kernel. It cannot be the same trace
as in comment #1, because of several bulk transfers).
 2. modprobe debugfs   (this can throw an error)
 3. modprobe usbmon    (this too)
 4. mkdir /dbg
 5. mount -t debugfs d /dbg
 6. Look at /proc/bus/usb/devices and find the bus number (it's Bus=XX)
 7. cat /dbg/usbmon/XXt > xxx.mon &
 8. Run the script, look at LCD and verify that it fails
 9. kill %1   (should kill the cat)
10. dmesg > xxx.dmesg
11. cat /proc/bus/usb/devices > xxx.devices

At this point you should have 4 pieces: devices, dmesg, strace output,
usbmon output. They all should be consistent. Attach them to the bug.

I cannot promise that the usbmon captures enough to figure this out,
but if it's not a EHCI-only bug, it usually helps a lot.

Comment 18 Sam Bishop 2007-09-25 15:46:50 UTC
Sorry, I should have explained myself...  I just tried, and I am able to
reproduce the original issue on my second PC.

The test program of comment #2 has a compile-time switch for a one-second delay
after sending a bulk packet.  The behavior I described in comment #13 and #16
was with the delay enabled.  If I remove the delay I can reproduce the original
issue on the second PC.

This tells me that the "missing word" problem isn't caused by your EHCI patch. 
(But we know that now anyway because I was able to reproduce it with the--just
checked this--UHCI HCD.)

I just popped a Fedora 7 live CD into the second PC, and performed the same
experiment as described in comment #16, with and without the one-second sleep in
the ./a.out binary.  Neither caused an error message to show up in dmesg, but I
did see the "missing word" behavior in both cases.

Comment 19 Pete Zaitcev 2007-09-25 16:33:56 UTC
Thanks for the explanation. I'm waiting for the traces then. It would be better
to use Fedora for it, because its usbmon is somewhat more modern. Please follow
instructions here (if the LiveCD contains it):
 /usr/share/doc/kernel-doc-2.6.22/Documentation/usb/usbmon.txt
If not, please proceed as planned with the test kernel I built for you.

Comment 20 Sam Bishop 2007-09-25 16:38:23 UTC
Created attachment 205651 [details]
four strace logs, xxx.mon, xxx.dmesg, xxx.devices

I modified the script by putting "strace -o strace_log.$i" in front of
"./a.out".

Comment 21 Pete Zaitcev 2007-09-25 18:17:50 UTC
The bad news is, there's absolutely nothing of note in the log. I hoped to
find any overlaps, but they aren't there. I think that something in the
tester's firmware gets disrupted by the SET_INTERFACE. Unfortunately,
bypassing that is not easy, because the call is embedded deeply into
the unbind sequence.

devio.c:releaseintf
  usb_driver_release_interface
    device_release_driver
      drv->remove ---> usb_unbind_interface
        driver->disconnect // does nothing bad
        message.c:usb_set_interface(dev, intf, altconfig=0)
          checks
          usb_control_msg(USB_REQ_SET_INTERFACE)   // this is the problem
          usb_disable_interface
          iface->cur_altsetting = alt  // by pointer
          clear halts, toggles set to zero
          usb_enable_interface

I and Sam will have to consider our options now.

Comment 23 Pete Zaitcev 2008-06-11 04:41:04 UTC
Sam, not sure if this is available to you, but could you please test
the 2.6.9-69.EL or later? It incorporates the fix for the original leak.

Comment 24 Pete Zaitcev 2009-01-16 19:04:48 UTC
Closing, requestor not responding.