From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040510

Description of problem: We have a central print server with about 130 printers. This system prints several thousand documents a day and usually works well. However, we have discovered that if a printer becomes unreachable for some reason (WAN circuit down, printer powered off, etc.) and someone sends a job to one of these printers, the print server will continually attempt to resend the job, but eventually the "socket" or "lpd" backend will hang trying to print the jobs, and this somehow hangs the entire cups daemon. After the cups daemon hangs, any client that tries to print to the server will hang indefinitely. For example, a simple command such as "lp -d printer1 < /etc/fstab" will hang on the client forever (well, maybe not forever, but I have left systems in this state overnight and they are still hung there the next day).

The only way to recover at this point is to restart the cups daemon; however, if you don't go in and quickly either repair the broken printer, stop the broken printer's queue, or remove the jobs for the broken printer, the cupsd process will hang again in only a few minutes. Whenever this symptom arises there are usually child processes such as socket, pstops, or cupsomatic in zombie/defunct state. An strace on the cupsd process when it is hung in this state shows it stuck at FUTEX_WAIT.

I am trying to get a complete strace and debug log; however, the problem seems to be more difficult to reproduce when full debug2 logging or strace is in effect. We can reproduce almost at will without this extra overhead. This makes me suspicious that it might be some subtle timing issue. It also seems to be much more of a problem if many printers are unavailable simultaneously. For example, during a power outage at one of our remote facilities the problem became so extreme that the cupsd process would hang within seconds of starting.
I'll continue to try to gather debug2 and strace output from a failing process. We have reproduced the problem on another, smaller server with only 6 printers configured (one configured with a non-existent host name), but it is difficult. We are attempting to come up with an easy-to-reproduce scenario.

Version-Release number of selected component (if applicable): cups-1.1.17-13.3.6

How reproducible: Sometimes

Steps to Reproduce:
1. Configure several printers as shared CUPS printers using the socket (HP Jetdirect) or LPD backend (these should not be raw printers, as that seems to make the problem even harder to reproduce).
2. Send jobs to each of these printers, preferably something that requires a filter, such as a PDF file.
3. Make one printer unavailable (turning the printer off is a good way to do this).
4. Make sure there are jobs in the queue for the unavailable printer.
5. Continue to send jobs to the other printers while the cupsd process continues to attempt to resend jobs to the offline printer.
6. Usually within 15 to 30 minutes the CUPS server will hang completely.

Actual Results: The cupsd process hangs. You cannot access the web administration interface or in any way communicate with the cupsd process. A telnet to port 631 will produce an open, but perpetually hung, connection. All printing stops.

Expected Results: Printing to all printers that remain online should continue as normal.

Additional info: I have also tried cups-1.1.17-13.3.8, which was posted in another CUPS bugzilla report, but there was no change in behaviour. We replicated our print server setup on a Fedora Core 2 system (cups-1.1.20-6) but so far cannot reproduce the issue on that platform. I'm marking this as a high-severity bug because, for us, it is critical. It stops all printing from our Oracle ERP/CRM system, which means we cannot print checks, POs, shipping and shop floor paperwork, etc. It's quite high profile for us.
Because the cupsd process hangs rather than crashing completely, the client systems do not fail over to the backup print server. Also, I have logged a ticket with Red Hat support on this issue, but wanted to open a bug as well in the hope of perhaps getting a better response.
OK, I now have an error_log from cupsd at the "debug2" log level that shows a hang in a very simple case. To reproduce we did the following:
1. Set up a second server with printers identical to the main print server.
2. Added a printer called "ob-training-lj2200" which does not actually exist.
3. Sent a print job to this non-existent printer at 3:37PM (according to the access_log).
4. Waited.
5. At 11:38PM the cupsd process hung, as can be seen from the attached log.

This print server had no other activity; it just sat there from 3:37PM to 11:38PM trying to print this single job to the non-existent printer, then hung hard. What other information can I provide to help with this issue? Thanks, Tom
Created attachment 100582 [details] Last few minutes before the cupsd process hung
Please fetch and install this package:
ftp://people.redhat.com/twaugh/tmp/cups-debuginfo/cups-debuginfo-1.1.17-13.3.6.i386.rpm
and then use gdb to attach to the cupsd process. Look up its PID using 'ps ax | grep [c]upsd', and then run gdb as root like this (substituting the actual PID for 534):
gdb /usr/sbin/cupsd 534
You'll see a prompt saying "(gdb)" -- type 'bt' and press enter. What does it say? (If possible, could you avoid ending the gdb session until I've had a look?) Thanks.
I am still trying to get the above information. I have installed the cups-debuginfo package on our production and test servers. Unfortunately neither system has hung in the last two days and I had already cleared the hung test system before I got this response (yes, I know, that wasn't smart). However, I'm confident I will be able to reproduce a hang in the next few days. Since this problem first cropped up it's never gone more than a week without hanging. As soon as I get something I will post it. Thanks, Tom
Here is the gdb output from a hung cups process.

(gdb) bt
#0  0xb73ee241 in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
#1  0xb73996bd in _L_mutex_lock_1602 () from /lib/tls/libc.so.6
#2  0x00000000 in ?? ()

Unfortunately this was on our production print server so I had to reset it. It's proving to be a pain to reproduce again on the test server. Later, Tom
Oh, I know the problem: cups is built as a position-independent executable and gdb only recently got support for reading the symbols correctly for those. I'll build a new cups package (with debuginfo) for you to run, and let you know where to get it when it's ready.
Hmm, PIE wasn't turned on until after RHEL3 by the looks of it, so it isn't that. Wonder why you couldn't get symbols out of it. :-( If you try attaching gdb to cupsd when it is running normally does 'bt' give any more useful information? (In other words, 'is gdb functioning correctly or will we need to try a newer one?')
Here's output from a "normal" idle cupsd.

(gdb) bt
#0  0xb73da308 in ___newselect_nocancel () from /lib/tls/libc.so.6
#1  0x0805bec5 in main (argc=1, argv=0xbfffc2a4) at main.c:449
#2  0xb731a768 in __libc_start_main () from /lib/tls/libc.so.6
#3  0x0804c401 in _start ()

Later, Tom
Okay, so gdb is at least working. I'm having trouble seeing why the backend can have this sort of effect on cupsd -- it's a separate process after all. Since the last log message is about browsing, is it at all possible to disable browsing and see if the problem still occurs?
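For anyone following along, a minimal sketch of what disabling browsing looks like in /etc/cups/cupsd.conf (directive name as documented for CUPS 1.1; cupsd needs a restart afterwards for it to take effect):

```
# /etc/cups/cupsd.conf -- turn off CUPS printer browsing
Browsing Off
```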
I found this trouble report that might be relevant: http://www.cups.org/str.php?L242 I have built CUPS packages incorporating the patch from that report, and put them here: ftp://people.redhat.com/twaugh/tmp/cups-str242/ Please try to reproduce the problem with these packages.
I don't know if the backend is "causing" this issue, only that almost every time it happens there are "zombie" backend processes. This is probably normal, since they are children of the cupsd process, and because that process is hung it cannot properly reap them when they exit. The production print server was hung yet again this morning, with the same gdb output as I posted yesterday. Interestingly, today there were no hung child processes, just the main cupsd process, sitting there locked tight as a drum. Since the cups process seems to be stuck waiting on a lock, I'm wondering if it could be some sort of race or signal handling problem (there are a number of these reported against CUPS on Solaris). Anyway, I've installed the new packages and disabled browsing for the time being (well, set Browsing Off anyway; I still seem to get messages about browsing in the logs -- I would have thought they would go away completely). Let me know if there is anything else I can do. Thanks, Tom
Well, that didn't take long. Here's another gdb output from a hang with the new packages:

(gdb) bt
#0  0xb73ee241 in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
#1  0xb73996bd in _L_mutex_lock_1602 () from /lib/tls/libc.so.6
#2  0x4f280a6d in ?? ()
#3  0x68732029 in ?? ()
#4  0x330a776f in ?? ()
#5  0xb7363510 in ftrylockfile () from /lib/tls/libc.so.6
#6  0x39312e36 in ?? ()
#7  0x280a6d20 in ?? ()
#8  0xb739918b in __tz_convert () from /lib/tls/libc.so.6
Previous frame inner to this frame (corrupt stack?)

Thanks, Tom
Aha. Can you please confirm that you have cups-1.1.17-13.3.10.0.1? The __tz_convert is very interesting indeed -- the patch is meant to avoid problems in signal handlers that use localtime().
Yep, version is cups-1.1.17-13.3.10.0.1 which I downloaded from the URL you posted earlier. Thanks, Tom
Okay, I'm going to try a more daring change, since the STR #242 patch is not sufficient to fix the problem. This should move all signal-unsafe functions out of the SIGCHLD signal handling path. I'm afraid it's only been lightly tested, but it seems to work fine for me. Could you please try 1.1.17-13.3.10.0.2 out, if possible? ftp://people.redhat.com/twaugh/tmp/cups-str242/
I installed the new packages on our production print server Friday morning after it locked up for the fourth time in less than 24 hours (I figured it couldn't get much worse than that). Both the test and production systems have survived Friday and Saturday without a lockup, so things look promising, but it's still too early to tell for sure; I just wanted to give you an update with the current status. Thanks, Tom
OK, we've had 1.1.17-13.3.10.0.2 in production now since 06-04-2004 and we haven't had a single hang on either the production or the standby print server, and we've definitely had circumstances that previously would have hung it solid. I'm very tempted to say this problem is fixed. Is there any additional testing I can perform or information I can provide? Thanks, Tom
That's great. No, I don't think we need anything else. Thanks for testing!
Closing out based on feedback from original reporter.
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-228.html