Bug 660604

Summary: Cupsd segfault without warnings
Product: [Fedora] Fedora Reporter: Pierre-YvesChibon <pingou>
Component: cupsAssignee: Tim Waugh <twaugh>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: low    
Version: 14CC: jmoskovc, jpopelka, twaugh
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: cups-1.4.6-1.fc13 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 855431 (view as bug list) Environment:
Last Closed: 2011-01-18 21:33:18 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
gdb output asked
none
gdb output asked (version 2)
none
gdb output of the select-frame 6 & info locals
none
error_log from cups running debug2
none
cups.conf none

Description Pierre-YvesChibon 2010-12-07 10:01:24 UTC
Description of problem:
My computer is connected to the printed of my work via samba.
Once in a while cups just dies without warnings or even notice. I see it by seing all printers being absent.

dmesg returns at the bottom:
[1185471.785553] SELinux: 2048 avtab hash slots, 210653 rules.
[1185471.841767] SELinux: 2048 avtab hash slots, 210653 rules.
[1185471.982002] SELinux:  9 users, 14 roles, 3473 types, 181 bools, 1 sens, 1024 cats
[1185471.982006] SELinux:  77 classes, 210653 rules
[1211561.415362] cupsd[1386]: segfault at 8 ip 00007fc3ba58dd81 sp 00007fffa1791b18 error 4 in libc-2.12.90.so[7fc3ba50b000+19a000]
[1273199.184587] cupsd[10509]: segfault at 8 ip 00007f5d8b46ad81 sp 00007fff58142a58 error 4 in libc-2.12.90.so[7f5d8b3e8000+19a000]

Version-Release number of selected component (if applicable):
# rpm -q cups
cups-1.4.4-11.fc14.x86_64


How reproducible:
often

Steps to Reproduce:
1. I don't know how to reproduce it since I do not know when it happens
2.
3.
  
Actual results:
# service cups status
cupsd dead but pid file exists


Expected results:
Having cups running

Additional info:

Comment 1 Tim Waugh 2010-12-07 10:19:36 UTC
Please run this command as root:

debuginfo-install cups

and then attach gdb to cupsd and leave it running until it crashes.  Do this by finding the PID of cupsd (ps ax | grep [c]upsd) and running gdb like this:

gdb /usr/sbin/cupsd 1711
(gdb) continue

When it crashes again, please run these gdb commands:

bt
info locals
up
info locals
up
info locals
up
info locals

and paste the results here.  Thanks!

Comment 2 Pierre-YvesChibon 2010-12-07 15:06:28 UTC
Created attachment 465248 [details]
gdb output asked

Comment 3 Pierre-YvesChibon 2010-12-08 12:20:10 UTC
Btw, I attached the asked gdb output

Comment 4 Jiri Popelka 2010-12-08 13:08:08 UTC
Thank you, but the output unfortunately doesn't contain the info we need.

I think (maybe I'm wrong) that you pressed Enter when you saw that cupsd had crashed which evoked the previous command, i.e 'continue' which caused that cupsd was already terminated in the time when you tried to obtain info with 'bt' and 'info locals'. Can you try it again, please ? Thanks.

Comment 5 Pierre-YvesChibon 2010-12-10 09:06:25 UTC
Attached in the output from gdb, I ran the given command after seeing that gdb announced the segfault but I still have:

# service cups status
cupsd (pid  32204) is running...

(on the contrary to previous time)

Comment 6 Pierre-YvesChibon 2010-12-10 09:07:15 UTC
Created attachment 467928 [details]
gdb output asked (version 2)

Comment 7 Pierre-YvesChibon 2010-12-14 10:32:44 UTC
Does it contains useful information this time ?

Comment 8 Tim Waugh 2010-12-14 13:25:43 UTC
It tells us how it crashed, yes, but I haven't had time to fully investigate it yet I'm afraid.  It was trying to send a D-Bus message at the time, but it isn't clear from the stack trace what's wrong.

Comment 9 Pierre-YvesChibon 2010-12-14 13:58:28 UTC
Well, thanks for looking at this. At the moment I have no problem reproducing it ([1]) so let me know if I can help.


[1] 
# service cups status
cupsd dead but pid file exists

Comment 10 Tim Waugh 2010-12-14 14:51:16 UTC
OK, if you can repeat it that makes it a lot easier to debug.

Can you get to the same point in gdb again, and then run this?:

select-frame 6
info locals
p *dest

Thanks.

Comment 11 Pierre-YvesChibon 2010-12-14 15:01:01 UTC
Should I run  

bt
info locals
up
info locals
up
info locals
up
info locals

before 
 
select-frame 6
info locals
p *dest

?

Comment 12 Tim Waugh 2010-12-14 15:57:38 UTC
Just:

bt
select-frame 6
info locals
p *dest

In the "bt" output, I'm hoping that it says "#6 ... cupsd_send_dbus ..." -- if not, take a look to see which frame number cupsd_send_dbus is in and use 'select-frame' for that number instead.

Thanks.

Comment 13 Pierre-YvesChibon 2010-12-14 17:02:33 UTC
And there is the output (told you I can reproduce it :-)).

Comment 14 Pierre-YvesChibon 2010-12-14 17:03:10 UTC
Created attachment 468646 [details]
gdb output of the select-frame 6 & info locals

Comment 15 Tim Waugh 2010-12-15 15:50:40 UTC
Thanks.

How are you reproducing the problem?

At the moment I'm thinking that it's some sort of issue with a job not getting stopped when the printer it is queued for reaches its browse timeout.  We expect the printer to disappear from the CUPS list of printers, and for any jobs currently processing to be stopped, but it looks like that might not be happening for some reason.

It would be a great help if you can still reproduce the problem after doing this:

cupsctl LogLevel=debug2
service cups restartlog

If so, please attach /var/log/cups/error_log (you might have to compress it with e.g. xz first as it might be quite large).

Comment 16 Pierre-YvesChibon 2010-12-15 16:05:27 UTC
To reproduce the problem I just let cups running. I don't even try to print or anything.

The surprising bits is that:
# service cups status
cupsd (pid  20806) is running...

While I have not touched it since I posted the log yesterday (not even restarted it).

I will run the command and keep you informed of the output.

Comment 17 Pierre-YvesChibon 2010-12-15 16:38:44 UTC
Created attachment 468886 [details]
error_log from cups running debug2

Comment 18 Tim Waugh 2010-12-15 17:16:35 UTC
Ha!  It times out the printer, stops the job, and then starts the job again.  No wonder...

Still analysing the cause.

Comment 19 Tim Waugh 2010-12-15 17:55:26 UTC
Could you please attach your /etc/cups/cupsd.conf file?  Thanks.

Comment 20 Pierre-YvesChibon 2010-12-15 18:04:56 UTC
Created attachment 468923 [details]
cups.conf

Comment 21 Tim Waugh 2010-12-16 13:01:03 UTC
OK, I've managed to reproduce the problem locally.  Here's what I did:

1. Stop CUPS.
2. Edit /var/cache/cups/remote.cache and add these lines to the end:

<Printer Nemo>
BrowseTime 1292500224
Info Not there
Location Nowhere
MakeModel Mysteriously absent
DeviceURI ipp://192.168.20.10:631/printers/Nemo
State Idle
Type 16814094
Accepting Yes
JobSheets none none
Option job-sheets none,none
Option lease-duration 300
</Printer>

3. Start CUPS
4. Turn on debugging: cupsctl LogLevel=debug2
5. Submit a job to Nemo e.g. "echo . | lp -d Nemo"
6. Now restart CUPS again and clear the log: service cups restartlog
7. Wait for 5 minutes

Without performing step 6 I couldn't reproduce the problem, which leads me to wonder if it's a timing issue of some sort.  The backend is retrying every 30 seconds, and the browse timeout is 300 seconds, so the backend is logging messages at exactly the moment we time out its printer.

Here's the log at that point:

d [16/Dec/2010:12:54:32 +0000] select_timeout(0): 8 seconds to browse timeout a printer
d [16/Dec/2010:12:54:40 +0000] cupsdAddEvent(event=printer-deleted, dest=0x7f6d86a6f4e0(Nemo), job=(nil)(0), text="%s '%s' deleted by directory services (timeout).", ...)
d [16/Dec/2010:12:54:40 +0000] cupsd_send_notification(sub=0x7f6d86aa7590(281), event=0x7f6d86ac6aa0(printer-deleted))
d [16/Dec/2010:12:54:40 +0000] cupsd_send_notification(sub=0x7f6d86aa76e0(282), event=0x7f6d86ac46f0(printer-deleted))
D [16/Dec/2010:12:54:40 +0000] cupsdMarkDirty(-----S)
D [16/Dec/2010:12:54:40 +0000] cupsdSetBusyState: Printing jobs and dirty files
D [16/Dec/2010:12:54:40 +0000] Remote destination "Nemo" has timed out; deleting it...
d [16/Dec/2010:12:54:40 +0000] cupsdDeletePrinter(p=0x7f6d86a6f4e0(Nemo), update=1)
d [16/Dec/2010:12:54:40 +0000] cupsdSetJobState(job=0x7f6d86a749b0(47), state=5, newstate=3, action=1, message="Job stopped due to printer being deleted.")
d [16/Dec/2010:12:54:40 +0000] stop_job(job=0x7f6d86a749b0(47), action=1)
d [16/Dec/2010:12:54:40 +0000] cupsdEndProcess(pid=1058, force=1)
d [16/Dec/2010:12:54:40 +0000] cupsdAddEvent(event=job-state-changed, dest=0x7f6d86a6f4e0(Nemo), job=0x7f6d86a749b0(47), text="%s", ...)
d [16/Dec/2010:12:54:40 +0000] cupsd_send_notification(sub=0x7f6d86aa7590(281), event=0x7f6d86ac4cb0(job-state-changed))
D [16/Dec/2010:12:54:40 +0000] cupsdMarkDirty(-----S)
I [16/Dec/2010:12:54:40 +0000] [Job 47] Job stopped due to printer being deleted.
D [16/Dec/2010:12:54:40 +0000] cupsdMarkDirty(----J-)
d [16/Dec/2010:12:54:40 +0000] finalize_job(job=0x7f6d86a749b0(47))
d [16/Dec/2010:12:54:40 +0000] cupsdSetPrinterReasons(p=0x7f6d86a6f4e0(Nemo),s="-connecting-to-device"
d [16/Dec/2010:12:54:40 +0000] cupsdSetPrinterReasons(p=0x7f6d86a6f4e0(Nemo),s="-com.apple.print.recoverable-warning"
D [16/Dec/2010:12:54:40 +0000] cupsdMarkDirty(P-----)
d [16/Dec/2010:12:54:40 +0000] cupsdSetPrinterReasons(p=0x7f6d86a6f4e0(Nemo),s="-offline-report"
d [16/Dec/2010:12:54:40 +0000] cupsdDeleteProfile(profile="(null)")
d [16/Dec/2010:12:54:40 +0000] cupsdRemoveSelect(fd=13)
d [16/Dec/2010:12:54:40 +0000] cupsdCheckJobs: 1 active jobs, sleeping=0, reload=0
d [16/Dec/2010:12:54:40 +0000] start_job(job=0x7f6d86a749b0(47), printer=0x7f6d86a6f4e0(Nemo))

...and it starts the job again.

Comment 22 Tim Waugh 2010-12-16 15:27:26 UTC
Reported upstream.  The reason is that cupsdDeletePrinter() assumes that this is sufficient to prevent jobs being started in the queue:

787      cupsdSetPrinterState(p, IPP_PRINTER_STOPPED, update);

In fact, it isn't:

2952    /*
2953     * Can't set status of remote printers...
2954     */
2955   
2956     if (p->type & CUPS_PRINTER_DISCOVERED)
2957       return;

When cupsdCheckJobs() is called later (from finalize_job()), it re-starts the job.

Comment 23 Pierre-YvesChibon 2010-12-17 13:40:46 UTC
I am glad you could reproduce it.

Let me know if there is anything else I can do.

Comment 24 Tim Waugh 2010-12-22 11:46:40 UTC
This build contains the fix:
  http://koji.fedoraproject.org/koji/buildinfo?buildID=211214

Comment 25 Tim Waugh 2011-01-07 10:59:01 UTC
*** Bug 667920 has been marked as a duplicate of this bug. ***

Comment 26 Fedora Update System 2011-01-07 12:31:06 UTC
cups-1.4.6-1.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/cups-1.4.6-1.fc14

Comment 27 Fedora Update System 2011-01-07 12:32:03 UTC
cups-1.4.6-1.fc13 has been submitted as an update for Fedora 13.
https://admin.fedoraproject.org/updates/cups-1.4.6-1.fc13

Comment 28 Fedora Update System 2011-01-07 19:57:51 UTC
cups-1.4.6-1.fc13 has been pushed to the Fedora 13 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update cups'.  You can provide feedback for this update here: https://admin.fedoraproject.org/updates/cups-1.4.6-1.fc13

Comment 29 Fedora Update System 2011-01-18 21:32:57 UTC
cups-1.4.6-1.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 30 Fedora Update System 2011-04-17 21:25:24 UTC
cups-1.4.6-1.fc13 has been pushed to the Fedora 13 stable repository.  If problems still persist, please make note of it in this bug report.