From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040510

Description of problem: We have a central print server with about 130 printers. This system prints several thousand documents a day and usually works well. However, we have discovered that if a printer becomes unreachable for some reason (WAN circuit down, printer powered off, etc.) and someone sends a job to one of these printers, the print server will continually attempt to resend the job, but eventually the "socket" or "lpd" backend will hang trying to print the jobs, and this somehow hangs the entire cups daemon. After the cups daemon hangs, any client that tries to print to the server will hang indefinitely. For example, a simple command such as "lp -d printer1 < /etc/fstab" will hang on the client forever (well, maybe not forever, but I have left systems in this state overnight and they are still hung there the next day).

The only way to recover at this point is to restart the cups daemon; however, if you don't go in and quickly either repair the broken printer, stop the broken printer's queue, or remove the jobs for the broken printer, the cupsd process will hang again in only a few minutes. Whenever this symptom arises there are usually child processes such as socket, pstops, or cupsomatic in zombie/defunct state. An strace on the cupsd process when it is hung in this state shows it stuck at FUTEX_WAIT.

I am trying to get a complete strace and debug log; however, the problem seems to be more difficult to reproduce when full debug2 logging or strace is in effect. We can reproduce almost at will without this extra overhead. This makes me suspicious that it might be some subtle timing issue. It also seems to be much more of a problem if many printers are unavailable simultaneously. For example, during a power outage at one of our remote facilities the problem became so extreme that the cupsd process would hang within seconds of starting.
I'll continue to try to gather debug2 and strace output from a failing process. We have reproduced the problem on another, smaller server with only 6 printers configured (one configured with a non-existent host name), but it is difficult. We are attempting to come up with an easy-to-reproduce scenario.

Version-Release number of selected component (if applicable): cups-1.1.17-13.3.6

How reproducible: Sometimes

Steps to Reproduce:
1. Configure several printers as shared CUPS printers using the socket (HP Jetdirect) or LPD backend (these should not be raw printers, as that seems to make the problem even harder to reproduce).
2. Send jobs to each of these printers, preferably something that requires a filter, such as a PDF file.
3. Make one printer unavailable (turning the printer off is a good way to do this).
4. Make sure there are jobs in the queue for the unavailable printer.
5. Continue to send jobs to the other printers while the cupsd process continues to attempt to resend jobs to the offline printer.
6. Usually within 15 to 30 minutes the CUPS server will hang completely.

Actual Results: The cupsd process hangs. You cannot access the web administration interface or in any way communicate with the cupsd process. A telnet to port 631 will produce an open, but perpetually hung, connection. All printing stops.

Expected Results: Printing to all printers that remain online should continue as normal.

Additional info: I have also tried cups-1.1.17-13.3.8, which was posted in another CUPS bugzilla report, but there was no change in behaviour. We replicated our print server setup on a Fedora Core 2 system (cups-1.1.20-6) but so far cannot reproduce the issue on that platform. I'm marking this as a high-severity bug because, for us, it is critical. It stops all printing from our Oracle ERP/CRM system, which means we cannot print checks, POs, shipping and shop floor paperwork, etc. It's quite high profile for us.
Because the cupsd process hangs rather than crashing completely, the client systems do not fail over to the backup print server. Also, I have logged a ticket with Red Hat support on this issue, but wanted to open a bug as well in the hope of perhaps getting a better response.
OK, I now have an error_log from cupsd at the "debug2" log level that shows a hang in a very simple case. To reproduce we did the following:
1. Set up a second server with printers identical to the main print server.
2. Added a printer called "ob-training-lj2200" which does not actually exist.
3. Sent a print job to this non-existent printer at 3:37PM (according to the access_log).
4. Waited.
5. At 11:38PM the cupsd process hung, as can be seen from the attached log.

This print server had no other activity; it just sat there from 3:37PM to 11:38PM trying to print this single job to the non-existent printer, then hung hard. What other information can I provide to help with this issue? Thanks, Tom
Created attachment 100582 [details] Last few minutes before the cupsd process hung
Please fetch and install this package:
ftp://people.redhat.com/twaugh/tmp/cups-debuginfo/cups-debuginfo-1.1.17-13.3.6.i386.rpm
and then use gdb to attach to the cupsd process. Look up its PID using 'ps ax | grep [c]upsd', and then run gdb as root like this (substituting the actual PID for 534):
gdb /usr/sbin/cupsd 534
You'll see a prompt saying "(gdb)" -- type 'bt' and press enter. What does it say? (If possible, could you avoid ending the gdb session until I've had a look?) Thanks.
I am still trying to get the above information. I have installed the cups-debuginfo package on our production and test servers. Unfortunately neither system has hung in the last two days and I had already cleared the hung test system before I got this response (yes, I know, that wasn't smart). However, I'm confident I will be able to reproduce a hang in the next few days. Since this problem first cropped up it's never gone more than a week without hanging. As soon as I get something I will post it. Thanks, Tom
Here is the gdb output from a hung cups process.

(gdb) bt
#0  0xb73ee241 in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
#1  0xb73996bd in _L_mutex_lock_1602 () from /lib/tls/libc.so.6
#2  0x00000000 in ?? ()

Unfortunately this was on our production print server so I had to reset it. It's proving to be a pain to reproduce again on the test server. Later, Tom
Oh, I know the problem: cups is built as a position-independent executable and gdb only recently got support for reading the symbols correctly for those. I'll build a new cups package (with debuginfo) for you to run, and let you know where to get it when it's ready.
Hmm, PIE wasn't turned on until after RHEL3 by the looks of it, so it isn't that. Wonder why you couldn't get symbols out of it. :-( If you try attaching gdb to cupsd when it is running normally does 'bt' give any more useful information? (In other words, 'is gdb functioning correctly or will we need to try a newer one?')
Here's output from a "normal" idle cupsd.

(gdb) bt
#0  0xb73da308 in ___newselect_nocancel () from /lib/tls/libc.so.6
#1  0x0805bec5 in main (argc=1, argv=0xbfffc2a4) at main.c:449
#2  0xb731a768 in __libc_start_main () from /lib/tls/libc.so.6
#3  0x0804c401 in _start ()

Later, Tom
Okay, so gdb is at least working. I'm having trouble seeing why the backend can have this sort of effect on cupsd -- it's a separate process after all. Since the last log message is about browsing, is it at all possible to disable browsing and see if the problem still occurs?
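For anyone following along, a minimal sketch of what disabling browsing looks like in /etc/cups/cupsd.conf (directive name as documented for CUPS 1.1; cupsd needs a restart afterwards for it to take effect):

```
# /etc/cups/cupsd.conf -- turn off CUPS printer browsing
Browsing Off
```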
I found this trouble report that might be relevant: http://www.cups.org/str.php?L242 I have built CUPS packages incorporating the patch from that report, and put them here: ftp://people.redhat.com/twaugh/tmp/cups-str242/ Please try to reproduce the problem with these packages.
I don't know if the backend is "causing" this issue, only that almost every time it happens there are "zombie" backend processes. This is probably normal, since they are children of the cupsd process, and because that process is hung it cannot properly reap them when they exit. The production print server was hung yet again this morning, with the same gdb output as I posted yesterday. Interestingly, today there were no hung child processes, just the main cupsd process, sitting there locked tight as a drum. Since the cups process seems to be stuck waiting on a lock, I'm wondering if it could be some sort of race or signal handling problem (there are a number of these reported against CUPS on Solaris). Anyway, I've installed the new packages and disabled browsing for the time being (well, set Browsing Off anyway; I still seem to get messages about browsing in the logs -- I would have thought they would go away completely). Let me know if there is anything else I can do. Thanks, Tom
Well, that didn't take long. Here's another gdb output from a hang with the new packages:

(gdb) bt
#0  0xb73ee241 in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
#1  0xb73996bd in _L_mutex_lock_1602 () from /lib/tls/libc.so.6
#2  0x4f280a6d in ?? ()
#3  0x68732029 in ?? ()
#4  0x330a776f in ?? ()
#5  0xb7363510 in ftrylockfile () from /lib/tls/libc.so.6
#6  0x39312e36 in ?? ()
#7  0x280a6d20 in ?? ()
#8  0xb739918b in __tz_convert () from /lib/tls/libc.so.6
Previous frame inner to this frame (corrupt stack?)

Thanks, Tom
Aha. Can you please confirm that you have cups-1.1.17-13.3.10.0.1? The __tz_convert is very interesting indeed -- the patch is meant to avoid problems in signal handlers that use localtime().
Yep, version is cups-1.1.17-13.3.10.0.1 which I downloaded from the URL you posted earlier. Thanks, Tom
Okay, I'm going to try a more daring change, since the STR #242 patch is not sufficient to fix the problem. This should move all signal-unsafe functions out of the SIGCHLD signal handling path. I'm afraid it's only been lightly tested, but it seems to work fine for me. Could you please try 1.1.17-13.3.10.0.2 out, if possible? ftp://people.redhat.com/twaugh/tmp/cups-str242/
I installed the new packages on our production print server Friday morning after it locked up for the fourth time in less than 24 hours (I figured it couldn't get much worse than that). Both the test and production systems have survived Friday and Saturday without a lockup, so things look promising, but it's still too early to tell for sure; I just wanted to give you an update with the current status. Thanks, Tom
OK, we've had 1.1.17-13.3.10.0.2 in production now since 06-04-2004 and we haven't had a single hang on either the production or the standby print server, and we've definitely had circumstances that previously would have hung it solid. I'm very tempted to say this problem is fixed. Is there any additional testing I can perform or information I can provide? Thanks, Tom
That's great. No, I don't think we need anything else. Thanks for testing!
Closing out based on feedback from original reporter.
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-228.html