Description of problem:
Kernel halts with an Oops while initializing system. Problem is in usb-uhci module, and is triggered when cups initializes an HP printer through ptal_mlcd, which is part of the hpoj package.

Version-Release number of selected component (if applicable):
Kernel 2.4.20-20.9smp
cups-1.1.17-13.3.i386
hpoj-0.90-14

How reproducible:
About 1/3 of the time during boot.

Steps to Reproduce:
1. Install an HP Photosmart 7350 printer, and configure HPOJ to recognize it. Configure cups to recognize it. Configure cups to start during boot.
2. Reboot.
3. Oops will occur during cups start-up. If it doesn't happen, sorry, try rebooting again.

Actual results:
Kernel Oops during cups initialization.

Expected results:
Smooth sailing. Happens on occasion.

Additional info:
Sometimes HPOJ fails to initialize, but an oops does not result. The message in the log is:

ptal-mlcd: ERROR at ExMgr.cpp:2744, dev=<mlc:usb: photosmart_7350@/dev/usb/lp0>, pid=920, e=19
llioService: llioRead returns -1, expected=6!
ptal-mlcd: ERROR at ExMgr.cpp:902, dev=<mlc:usb: photosmart_7350@/dev/usb/lp0>, pid=920, e=19
exClose(reason=0x0010)

It's not clear to me that there is a bug in ptal-mlcd. The error message could have resulted from flaky kernel behavior.

Anyway, I know you are wondering whether I am just going to leave you with this information, or whether I investigated the oops. I did. Here's what I found.

Stack trace:
(EIP) uhci_submit_bulk_urb [usb-uhci] 0x16
do_select [kernel] 0x153
uhci_submit_urb [usb-uhci] 0x319
usb_submit_urb_Rsmp_93abab4d [usbcore] 0x3d
usblp_read [printer] 0x12e
sys_read [kernel] 0x97
system_call [kernel] 0x33

The code at EIP is:

usb-uhci.c:
820: _static int uhci_submit_bulk_urb (struct urb *urb, struct urb *bulk_urb)
821: {
822:     uhci_t *s = (uhci_t*) urb->dev->bus->hcpriv;

where offset 0x16 is:

    mov 0xcc(%ebx),%eax

where %ebx is "urb->dev", and 0xcc is the offset to "bus". The kernel stops here because %ebx is zero.
So in the urb structure, the "dev" field is null. That doesn't seem right. I have USB 2.0 on the motherboard that has had plenty of time to start up, and the printer has been on for weeks. I suspect the initialization of the dev field is flaky, and this could also explain the soft failures reported by the ptal-mlcd process and why it works sometimes. By the way, when ptal-mlcd fails during start up and the kernel does not oops, I sometimes have to rmmod usb-uhci and then reload it with modprobe.
A null dev means that the URB was completed. All HC drivers zap ->dev before they decrement device usage. I'll look into this, although I do not have a printer. Probably someone used a (urb->status==-EINPROGRESS) test again, or something simple like that.
BTW, Craig, can you try a Fedora kernel?
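To illustrate the failure mode discussed above, here is a minimal user-space sketch in C. It is not the kernel code and not the eventual fix: the mock_* types and functions are hypothetical stand-ins that only mimic the two facts stated in this thread, namely that completion clears ->dev and that the oops comes from following urb->dev->bus->hcpriv when dev is null.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; NOT the real kernel structures. */
struct mock_bus { void *hcpriv; };
struct mock_dev { struct mock_bus *bus; };
struct mock_urb {
    struct mock_dev *dev;
    int status;
};

/* Mimics the behavior described above: on completion the
 * host controller driver records the status and zaps ->dev. */
void mock_complete(struct mock_urb *urb, int status)
{
    urb->status = status;
    urb->dev = NULL;
}

/* Guarded version of the urb->dev->bus->hcpriv chain.
 * Stores hcpriv in *out and returns 0, or returns -ENODEV
 * if any pointer on the way is NULL. */
int mock_get_hcpriv(const struct mock_urb *urb, void **out)
{
    assert(out != NULL);
    if (urb == NULL || urb->dev == NULL || urb->dev->bus == NULL)
        return -ENODEV;
    *out = urb->dev->bus->hcpriv;
    return 0;
}
```

In this mock, calling mock_get_hcpriv() on an URB that has already been passed to mock_complete() returns -ENODEV instead of dereferencing a null pointer. Whether the real driver should guard in this way, or whether the caller should never resubmit such an URB, is a separate question that the sketch does not answer.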
Pete, I don't think so. According to your Fedora Project pages, I have to download 3 ISOs and configure a dual boot system. Sorry, but I don't have the time to do that right now. If I have misunderstood the situation, and can simply install another kernel and add it to my grub.conf, then please tell me where the fedora kernel RPM is, and I'll do it. But I'm guessing that upgrading only the kernel without Fedora's glibc & other user space friends may not work too well - true?
Craig, one more thing - please attach the actual dmesg capture with the oops, if possible. Re. the Fedora kernel, it can be downloaded separately from the ISOs and installed on top of RHL 9 userland. Both RHL 9 and FC 1 are NPTL based, so it matches. But let's concentrate on dmesg.
Created attachment 96065
Kernel oops detail

I copied the oops data manually from the screen. The system log did not have it.
Awwww, I did not mean to make all this extra work, especially when I wanted to see if any other messages were present before the oops. I continue to suspect (urb->status==-EINPROGRESS) at this point.
Created attachment 96067
dmesg preceeding oops

Thanks for your concern, Pete, but it really was not a problem. The oops detail was handy because I copied it into a file to run ksymoops with (and then realized that modern oops reports pretty much obviate ksymoops). I did not understand that you wanted to see the messages preceding the oops. Here they are.
Created attachment 96078
Take One
I installed and booted with Fedora kernel 2.4.22-1.2115.nptlsmp. Rebooted 6 times. Based on previous behavior, that should have elicited either the oops or the ptal complaint at least once. It didn't. So it appears the problem is cured with the kernel.

Yet I have lingering suspicions that this bug results from a timing problem in a multiprocessor environment, and do not recommend closing this bug just yet. Presumably, this problem has been in the kernel for a while. Yet it did not show up until I upgraded my P-3 processor to a P-4, installed an SMP kernel, and enabled both processors.

Although I am now using Fedora's SMP kernel, it appears to be using only one processor! Both top and gkrellm show plenty of activity on CPU0 and no activity whatsoever on CPU1, and /proc/cpuinfo shows two processors in the system. Maybe the Fedora folks broke that part temporarily.
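One way to cross-check what top and gkrellm show is to sample the per-CPU lines in /proc/stat twice and compare the counters. The helper below is a rough sketch under assumptions: parse_cpu_line is a hypothetical name, and it assumes each per-CPU line starts with "cpuN" followed by the user, nice, system and idle tick counts in that order. Any further fields on the line are ignored.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse one per-CPU line of /proc/stat.
 * Assumed layout: "cpuN user nice system idle [more fields...]".
 * Returns 1 on success, 0 if the line is not a per-CPU line
 * (for example the aggregate "cpu" line or "intr"). */
int parse_cpu_line(const char *line, int *cpu,
                   unsigned long *busy, unsigned long *idle)
{
    unsigned long user, nice, sys, idl;

    /* Require a digit right after "cpu" so the aggregate line is skipped. */
    if (strncmp(line, "cpu", 3) != 0 || !isdigit((unsigned char)line[3]))
        return 0;
    if (sscanf(line, "cpu%d %lu %lu %lu %lu",
               cpu, &user, &nice, &sys, &idl) != 5)
        return 0;

    *busy = user + nice + sys;  /* ticks spent doing work */
    *idle = idl;                /* ticks spent idle */
    return 1;
}
```

If the busy count for cpu1 does not move between two samples taken under load while cpu0's does, that would be consistent with the second processor being enabled but not used.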
Are you running the Fedora kernel on top of RHL 9 userland, or did you upgrade the whole distro with yum? In any case, please try this: ftp://people.redhat.com/zaitcev/tmp/kernel-smp-2.4.22-1.2121.2.1.nptl.i686.rpm Please capture the trace for me with a serial console, digicam, or some other method, if it blows up. If it refuses to sit on top of RHL 9 userland with rpm -i, --force it. It should work with old glibc just fine.
I ran the Fedora kernel on top of RH-9, and am using the latest packages from RHN. Seemed to work fine. I tried your 2121 build. It didn't blow up, nor did it require a forced install. It did not oops on me, either. I rebooted 3 times before I got bored with sublime reboot behavior. In fact, 2121 seemed indistinguishable from the Fedora 2115 build. And this is a problem, because both Fedora kernels are labeled as "SMP", but they are not. They enabled the second processor, but did not utilize it. To reiterate my concern, I never saw this problem with my single processor system, and fear that it could be due to the SMP environment; Fedora's broken SMP could be masking the bug. I suggest retesting the fix when the SMP is working again.
Craig, did you file a bug against the SMP utilization? The printer backport was committed to 2.4.22-1.2136, but I cannot do anything to this bug except close->worksforme, unless your claims about SMP are resolved, and this bug is not a ticket to track those.
Sorry, I thought this one was so obvious that I never checked. I just now filed Bug 112597 for SMP utilization. I also just now tried 2.4.22-1.2135, and it has the same SMP problem. 2136 has not yet been posted to the Fedora download site.
Craig, can I close this? Is the problem resolved?
Sorry, I cannot provide much more info. I switched away from Red Hat. All I can tell you is that hpijs-1.4.1 works fine on kernel 2.6.6 (Gentoo), and hpoj appears to now be unnecessary.