Description of problem: I maintain a simple USB kernel driver for a tester built and used in-house. I'm trying to move to a userspace driver using libusb. It appears, however, that the kernel has trouble with a USBDEVFS_BULK ioctl right before a process terminates. Version-Release number of selected component (if applicable): uname -r => 2.6.9-42.ELsmp How reproducible: The problem appears to be timing related, though I can make the problem happen or not on demand. Steps to Reproduce: 1. Execute a program which claims an interface, sends a packet to a bulk-out endpoint, releases the endpoint, and immediately exits. (The strace log of a program which does that is attached.) In my case, the device receives the packet but stops responding after the trouble-causing program is run a few times. 2. dmesg | tail -n 1 will show you something like this: ehci_hcd 0000:00:1d.7: qh cec8e700 (#2) state 1 Actual results: The error message comes from line 996 of drivers/usb/host/ehci-hcd.c, in the function ehci_endpoint_disable(). The comment before the ehci_err() call indicates that the kernel is leaking memory. I added a call to dump_stack() right before the ehci_err() call, and got this: [<f8daace5>] ehci_endpoint_disable+0x98/0x109 [ehci_hcd] [<f8daac4d>] ehci_endpoint_disable+0x0/0x109 [ehci_hcd] [<c025a508>] hcd_endpoint_disable+0x177/0x17f [<c025b77f>] usb_disable_endpoint+0x1e/0x3d [<c025b7c4>] usb_disable_interface+0x26/0x33 [<c0255b12>] usb_unbind_interface+0x1d/0x50 [<c021f8e4>] device_release_driver+0x3c/0x46 [<c0255d27>] usb_driver_release_interface+0x2c/0x40 [<c025e2fe>] releaseintf+0x7c/0x95 [<c025fec1>] proc_releaseinterface+0x1f/0x35 [<c026079f>] usbdev_ioctl+0x5c0/0x62a [<c016ab9d>] file_ioctl+0x19d/0x1af [<c016adb7>] sys_ioctl+0x208/0x269 [<c02d4703>] syscall_call+0x7/0xb (I get a similar stack dump if I remove the USBDEVFS_RELEASEINTERFACE ioctl. In that case, this error shows up as a result of the automatic cleanup after process termination.) Expected results: I think that's obvious. :) Additional info: If I put a one-second sleep between the bulk transfer and the implicit or explicit release of the interface, I do not see the error message.
Created attachment 160461 [details] strace_dump.txt
Created attachment 181961 [details] test program
Pete, I got an out-of-band request for some more information, asking about the hardware and code involved. The full hardware involved is a little rare, of course, but the USB controller portion is simply a Cypress SX2. I have attached a simple test program which I can use to reproduce the "ehci_hcd ... state 1" error message just like the original code. This time, however, I have not seen the tester quit responding. I will keep looking into this and will let you know if I figure out why.
Hmm, I'm pretty sure I have just fixed (ok, a couple of months ago) a very similar issue with IBM where a submission error triggered a memory leak. It happened on RHEL 5, bug 236922, fixed in kernel-2.6.18-26.el5. Upstream thread: http://marc.info/?t=118068129800003&r=1&w=2 I'll look into this bug, maybe it's the same one, maybe not.
I'm just curious... Have any of you been able to reproduce this?
I reproduced this on 2.6.9-52.EL in RHEL. Fedora 7 seems to be unaffected, so perhaps we can backport something small. BTW, this issue does not have anything to do with the bug I mentioned in comment #5.
Created attachment 184321 [details] Test patch 1 Something like this may be the solution. I tested it to work with the provied reproducer. The problem is, ehci_hcd is a rather complex driver, so I cannot be sure that I covered all bases.
I would like Support to ask the customer to migrate the system to RHEL 5. If they use the box as a single-task system with a custom software as the initial report indicates, it should be the most practical solution. Also, RHEL 5 is available immediately, but if I touch EHCI in RHEL 4 it will need extensive beta testing before coming deployed with an update. It can be done, just seems like too much trouble for the same result.
Hmm... I very much wish that was an option. One of the reasons I am trying to move to a userspace driver, though, is to make the software easier to deploy--we are finding more uses for our little tester all the time--and Micron has only just moved from RHEL3 to RHEL4 as its standard Linux install. Requiring RHEL5 would be a non-starter, I'm afraid.
Please try a test kernel 2.6.9-58.EL.bz250482.1 from this location: http://people.redhat.com/zaitcev/ftp/250482/
This is better; I am not seeing anything in dmesg now. But one of my "regression tests" isn't working. I put together some shell scripts using the test program I have already sent you. Here is the one with the most interesting behavior: # Write an increasing number of unique words to the LCD screen. The output # (on stdout) looks like this: # # 1 zyzzyvas # 2 zyzzyva zythum # 3 zythem zymurgy zymurgies # 4 zymotoxic zymotize zymotically zymotic tac /usr/share/dict/words | \ { for i in `seq 1 4` do unset words for j in `seq 1 $i` do read words[${#words[@]}] done echo $i ${words[@]} sleep 1 ./a.out ${words[@]} done } I also enabled the 1 second delay inside the test program. If I power-cycle the tester and run the above shell script, I see this on the tester's LCD screen: (<...> = missing word) zyzzyvas <zyzzyva> zythum zythem zymurgy zymurgies <zymotoxic> zymotize zymotically zymotic Let me know if there is anything else you'd like me to try.
Unfortunately, I cannot test for the follow-up problem per comment #13. I am working with the customer out of band to gain access to the equipment.
Sam informed me that the hardware is not prohibitively large, but expensive. He did not say, but I deduce that they do not have spares for me to play with. I'm switching to instrumented kernels.
I just got a hold of another PC with RHEL 4, update 4 installed. With a standard kernel image, I can reproduce the test and behavior described in comment #13. I do not see any error messages in the kernel log. I removed the ehci_hcd module and got the same results. I will try RHEL 5 and Fedora 7 on that same PC ASAP.
So, the "another" PC is useless for testing fixes for the original issue, because it fails to reproduce it. However, it can test for the follow-up issue. I still haven't got the specific instrumentation ready, but if it's HCD independent, let's try something simpler, like usbmon. I uploaded kernel 2.6.9-58.EL.bz250482.2 to http://people.redhat.com/zaitcev/ftp/250482/ . Sam, please use it to capture a usbmon trace for me. This is how: 1. Add strace -o xxx.strace in front of ./a.out (It helps me separate issues with libusb from issues with kernel. It cannot be the same trace as in comment #1, because of several bulk transfers). 2. modprobe debugfs (this can throw an error) 3. modprobe usbmon (this too) 4. mkdir /dbg 5. mount -t debugfs d /dbg 6. Look at /proc/bus/usb/devices and find the bus number (it's Bus=XX) 7. cat /dbg/usbmon/XXt > xxx.mon & 8. Run the script, look at LCD and verify that it fails 9. kill %1 (should kill the cat) 10. dmesg > xxx.dmesg 11. cat /proc/bus/usb/devices > xxx.devices At this point you should have 4 pieces: devices, dmesg, strace output, usbmon output. They all should be consistent. Attach them to the bug. I cannot promise that the usbmon captures enough to figure this out, but if it's not a EHCI-only bug, it usually helps a lot.
Sorry, I should have explained myself... I just tried, and I am able to reproduce the original issue on my second PC. The test program of comment #2 has a compile-time switch for a one-second delay after sending a bulk packet. The behavior I described in comment #13 and #16 was with the delay enabled. If I remove the delay I can reproduce the original issue on the second PC. This tells me that the "missing word" problem isn't caused by your EHCI patch. (But we know that now anyway because I was able to reproduce it with the--just checked this--UHCI HCD.) I just popped a Fedora 7 live CD into the second PC, and performed the same experiment as described in comment #16, with and without the one-second sleep in the ./a.out binary. Neither caused an error message to show up in dmesg, but I did see the "missing word" behavior in both cases.
Thanks for the explanation. I'm waiting for the traces then. It would be better to use Fedora for it, because its usbmon is somewhat more modern. Please follow instructions here (if the LiveCD contains it): /usr/share/doc/kernel-doc-2.6.22/Documentation/usb/usbmon.txt If not, please proceed as planned with the test kernel I built for you.
Created attachment 205651 [details] four strace logs, xxx.mon, xxx.dmesg, xxx.devices I modified the script by putting "strace -o strace_log.$i" in front of "./a.out".
The bad news is, there's absolutely nothing of note in the log. I hoped to find any overlaps, but they aren't there. I think that something in the tester's firmware gets disrupted by the SET_INTERFACE. Unfortunately, bypassing that is not easy, because the call is embedded deeply into the unbind sequence. devio.c:releaseintf usb_driver_release_interface device_release_driver drv->remove ---> usb_unbind_interface driver->disconnect // does nothing bad message.c:usb_set_interface(dev, intf, altconfig=0) checks usb_control_msg(USB_REQ_SET_INTERFACE) // this is the problem usb_disable_interface iface->cur_altsetting = alt // by pointer clear halts, toggles set to zero usb_enable_interface I and Sam will have to consider our options now.
Sam, not sure if this is available to you, but could you please test the 2.6.9-69.EL or later? It incorporates the fix for the original leak.
Closing, requestor not responding.