110307 – Oops when hpoj reads from printer

Summary: Oops when hpoj reads from printer

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	9
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Pete Zaitcev
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2003-11-18 07:16 UTC by Craig Lawson
Modified:	2007-04-18 16:59 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-08-19 23:06:25 UTC
Embargoed:

Attachments
Kernel oops detail (1.11 KB, text/plain) 2003-11-19 19:08 UTC, Craig Lawson
dmesg preceeding oops (27.95 KB, text/plain) 2003-11-19 19:25 UTC, Craig Lawson
Take One (17.48 KB, patch) 2003-11-20 01:19 UTC, Pete Zaitcev

Description Craig Lawson 2003-11-18 07:16:02 UTC

Description of problem:
Kernel halts with an Oops while initializing system. Problem is in 
usb-uhci module, and is triggered when cups initializes an HP printer 
through ptal_mlcd, which is part of the hpoj package.

Version-Release number of selected component (if applicable):
Kernel 2.4.20-20.9smp
cups-1.1.17-13.3.i386
hpoj-0.90-14

How reproducible:
About 1/3 of the time during boot.

Steps to Reproduce:
1. Install an HP Photosmart 7350 printer, and configure HPOJ to 
recognize it. Configure cups to recognize it. Configure cups to start 
during boot.
2. Reboot.
3. Oops will occur during cups start-up. If it doesn't happen, sorry, 
try rebooting again.
  
Actual results:
Kernel Oops during cups initialization.

Expected results:
Smooth sailing. Happens on occasion.

Additional info:

Sometimes HPOJ fails to initialize, but an oops does not result. The 
message in the log is: 

ptal-mlcd: ERROR at ExMgr.cpp:2744, dev=<mlc:usb:photosmart_7350@/dev/usb/lp0>, pid=920, e=19         llioService: llioRead returns -1, expected=6!

ptal-mlcd: ERROR at ExMgr.cpp:902, dev=<mlc:usb:photosmart_7350@/dev/usb/lp0>, pid=920, e=19         exClose(reason=0x0010)

It's not clear to me that there is a bug in ptal-mlcd. The error 
message could have resulted from flaky kernel behavior.

Anyway, I know you are wondering whether I am just going to leave you 
with this information, or whether I investigated the oops. I did. 
Here's what I found.

Stack trace:
(EIP) uhci_submit_bulk_urb  [usb-uhci] 0x16
      do_select  [kernel] 0x153
      uhci_submit_urb  [usb-uhci] 0x319
      usb_submit_urb_Rsmp_93abab4d  [usbcore] 0x3d
      usblp_read  [printer]  0x12e
      sys_read  [kernel] 0x97
      system_call  [kernel]  0x33

The code at EIP is:
usb-uhci.c:
820: _static int uhci_submit_bulk_urb (struct urb *urb,
                                       struct urb *bulk_urb)
821: {
822:         uhci_t *s = (uhci_t*) urb->dev->bus->hcpriv;

where offset 0x16 is:
    	mov    0xcc(%ebx),%eax
where %ebx is "urb->dev", and 0xcc is the offset to "bus".

The kernel stops here because %ebx is zero.

So in the urb structure, the "dev" field is null. That doesn't seem 
right. I have USB 2.0 on the motherboard that has had plenty of time 
to start up, and the printer has been on for weeks. I suspect the 
initialization of the dev field is flaky, and this could also explain 
the soft failures reported by the ptal-mlcd process and why it works 
sometimes.

By the way, when ptal-mlcd fails during start up and the kernel does 
not oops, I sometimes have to rmmod usb-uhci and then reload it with 
modprobe.

Comment 1 Pete Zaitcev 2003-11-19 01:06:26 UTC

A null dev means that the URB was completed.
All HC drivers zap ->dev before they decrement device usage.

I'll look into this, although I do not have a printer.
Probably someone used (urb->status==-EINPROGRESS) test again,
or something simple like that.

Comment 2 Pete Zaitcev 2003-11-19 01:07:32 UTC

BTW, Craig, can you try a Fedora kernel?

Comment 3 Craig Lawson 2003-11-19 07:02:56 UTC

Pete,
  I don't think so. According to your Fedora Project pages, I have to 
download 3 ISOs and configure a dual boot system. Sorry, but I don't 
have the time to do that right now.
  If I have misunderstood the situation, and can simply install 
another kernel and add it to my grub.conf, then please tell me where 
the fedora kernel RPM is, and I'll do it. But I'm guessing that 
upgrading only the kernel without Fedora's glibc & other user space 
friends may not work too well - true?

Comment 4 Pete Zaitcev 2003-11-19 07:16:13 UTC

Craig, one more thing - please attach the actual dmesg capture
with the oops, if possible.

Re. the Fedora kernel, it can be downloaded separately from isos
and installed on top of RHL 9 userland. Both RHL 9 and FC 1
are NPTL based, so it matches. But let's concentrate on dmesg.

Comment 5 Craig Lawson 2003-11-19 19:08:40 UTC

Created attachment 96065
Kernel oops detail

I copied the oops data manually from the screen. The system log did not have
it.

Comment 6 Pete Zaitcev 2003-11-19 19:13:13 UTC

Awwww, I did not mean to make all this extra work, especially
when I wanted to see if any other messages were present before
the oops.

I continue to suspect (urb->status==-EINPROGRESS) at this point.

Comment 7 Craig Lawson 2003-11-19 19:25:23 UTC

Created attachment 96067
dmesg preceeding oops

Thanks for your concern, Pete, but it really was not a problem. The oops detail
was handy because I copied it into a file to run ksymoops with (and then
realized that modern oops reports pretty much obviate ksymoops).

I did not understand that you wanted to see the messages preceding the oops.
Here they are.

Comment 8 Pete Zaitcev 2003-11-20 01:19:28 UTC

Created attachment 96078
Take One

Comment 9 Craig Lawson 2003-11-20 04:15:27 UTC

I installed and booted with Fedora kernel 2.4.22-1.2115.nptlsmp.
Rebooted 6 times. Based on previous behavior, that should have 
elicited either the oops or the ptal complaint at least once. Didn't.

So it appears the problem is cured with the kernel. Yet I have 
lingering suspicions that this bug results from a timing problem in a 
multiprocessor environment, and do not recommend closing this bug 
just yet.

Presumably, this problem has been in the kernel for a while. Yet it 
did not show up until I upgraded my P-3 processor to a P-4, installed 
an SMP kernel, and enabled both processors. Although I am now using 
Fedora's SMP kernel, it appears to be using only one processor! Both 
top and gkrellm show plenty of activity on CPU0 and no activity 
whatsoever on CPU1, and /proc/cpuinfo shows two processors in the 
system. Maybe the Fedora folks broke that part temporarily.

Comment 10 Pete Zaitcev 2003-11-20 23:13:26 UTC

Are you running the Fedora kernel on top of RHL 9 userland, or have
you yum-ed the whole distro?

In any case, please try this:
 ftp://people.redhat.com/zaitcev/tmp/kernel-smp-2.4.22-1.2121.2.1.nptl.i686.rpm

Please capture the trace for me with a serial console, digicam, or
some other method, if it blows up.

If it refuses to sit on top of RHL 9 userland with rpm -i, --force it.
It should work with old glibc just fine.

Comment 11 Craig Lawson 2003-11-21 06:33:47 UTC

I ran the Fedora kernel on top of RH-9, and am using the latest 
packages from RHN. Seemed to work fine.

I tried your 2121 build. Didn't blow up, nor require forced install. 
Did not oops on me, either. I rebooted 3 times before I got bored 
with sublime reboot behavior. In fact, 2121 seemed indistinguishable 
from the Fedora 2115 build.

And this is a problem, because both Fedora kernels are labeled as 
"SMP", but they are not. They enabled the second processor, but did 
not utilize it. To reiterate my concern, I never saw this problem 
with my single processor system, and fear that it could be due 
to the SMP environment; Fedora's broken SMP could be masking the bug.

I suggest retesting the fix when the SMP is working again.

Comment 12 Pete Zaitcev 2003-12-23 21:25:35 UTC

Craig, did you file a bug against the SMP utilization?

The printer backport was committed to 2.4.22-1.2136, but I cannot
do anything to this bug except close->worksforme, unless your
claims about SMP are resolved, and this bug is not a ticket
to track those.

Comment 13 Craig Lawson 2003-12-24 04:15:35 UTC

Sorry, I thought this one was so obvious that I never checked. I just
now filed Bug 112597 for SMP utilization.

I also just now tried 2.4.22-1.2135, and it has the same SMP problem.
2136 has not yet been posted to the Fedora download site.

Comment 14 Pete Zaitcev 2004-04-22 03:22:32 UTC

Craig, can I close this? Is the problem resolved?

Comment 15 Craig Lawson 2004-04-22 05:00:02 UTC

Sorry, I cannot provide much more info. I switched away from Red Hat.
All I can tell you is that hpijs-1.4.1 works fine on kernel 2.6.6
(Gentoo), and hpoj appears to now be unnecessary.
