Bug 11740

Summary: Race condition in locking for LPD
Product: [Retired] Red Hat Linux
Reporter: jwilliam
Component: lpr
Assignee: Crutcher Dunnavant <crutcher>
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Version: 6.2
CC: chris.ryan
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2000-10-11 15:27:26 UTC

Description jwilliam 2000-05-30 05:36:55 UTC
Package:
lpr-50-4

Caveat:
For some reason, this was quite easy to reproduce when the printer was
hooked up to the serial port; I don't believe this is essential, though.

Description:
'lpr'ing two small files would frequently cause the second file to get
caught in the print queue.  'lpq' would report 'warning: no daemon
present'.  Restarting the daemon would cause the second job to be
printed properly.

Detective work:
The lpd daemon would spawn off a child to take care of the first job.
When you lpr the second job, a second daemon gets spawned off.  During this
time, the first daemon would enter printjob.c, line 261 (the 'done'
section).  The first daemon believes it is done processing, but
it doesn't release the lock until the exit on line 271.  During this
time, the daemon which was spawned to handle the second job hits lpd.c
line 142 and sees that the first daemon has the lock, so the second
daemon exits, assuming the first daemon will take care of printing
the second job.  Of course, the first daemon thinks it's done
processing, so it doesn't bother printing the second job either, leaving
the job queued up until the daemon is restarted (either explicitly,
or through submission of another print job).
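
The lock check described above can be sketched in isolation. This is a
minimal illustration, not the code from lpd.c: the helper name
claim_queue() and the lock path passed to it are hypothetical, and it
assumes Linux flock() semantics, where the lock belongs to the open file
description and a non-blocking request fails with EWOULDBLOCK while
another holder exists.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from lpd.c): try to become the daemon
 * that owns a print queue by taking an exclusive, non-blocking flock()
 * on the queue's lock file.
 *
 * Returns the locked descriptor on success. Returns -1 if the file
 * cannot be opened or if someone else holds the lock; in the latter
 * case errno is EWOULDBLOCK.
 */
int claim_queue(const char *lockpath)
{
    assert(lockpath != NULL);

    int fd = open(lockpath, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        int saved = errno;      /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

A caller that simply exits when claim_queue() returns -1 behaves like the
second daemon in the report: if the current holder has already decided it
has no more work, nobody is left to print the newly queued job.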

Fix:
Get rid of the deadlock.  This was causing our software to break pretty
seriously; right now, as a stop-gap, I've added a flock() to printjob.c:L262
to unlock the file.  I *believe* this resolves the problem we were having,
but I haven't had a chance to scrutinize the code very closely to ensure
that unlocking the file at that point is OK, or to see whether this removes
the deadlock completely or just makes the timing harder to hit.

Comment 1 Bernhard Rosenkraenzer 2000-06-17 16:02:05 UTC
This is fixed in Rawhide - we're using LPRng from now on.

Comment 2 Crutcher Dunnavant 2000-10-02 18:19:59 UTC
This will have to be fixed, garh.

Comment 3 Crutcher Dunnavant 2000-10-02 19:06:49 UTC
This is what I ended up doing, based upon your comments. I reordered the code to
reduce the time during which a deadlock might happen (though not by much).

        /*
         * search the spool directory for more work.
         */
        nitems = getq(&queue);
        if (nitems == 0) {              /* no more work to do */
        done:
                flock(lfd, LOCK_UN);    /* Unlock the lock now, to avoid deadlocks */
                if (count > 0) {        /* Files actually printed */
                        if (!SF && !tof)
                                (void) write(ofd, FF, strlen(FF));
                        if (TR != NULL)         /* output trailer */
                                (void) write(ofd, TR, strlen(TR));
                }
                (void) close(ofd);
                (void) wait(NULL);
                (void) unlink(tempfile);
                exit(0);
        } else if (nitems < 0) {
                syslog(LOG_ERR, "%s: can't scan %s", printer, SD);
                exit(1);
        }
        goto again;

I will close this when I push the Errata (which means QA and Docs have to sign
off on it).

Comment 4 Crutcher Dunnavant 2000-10-03 16:45:39 UTC
Okay, if a more guaranteed fix is needed, I could add a spinlock file lock, but
I don't really want to do that, for obvious reasons. So if you need this
reopened, I guess we add the spin lock.
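
The comment does not say what the spin lock would look like. One possible
reading, sketched here purely as an assumption, is that a newly spawned
daemon retries the queue lock a bounded number of times instead of exiting
on the first refusal. The function name claim_queue_retry() and its
parameters are hypothetical and do not come from the lpr sources.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical sketch of a bounded "spin" on the queue lock file.
 * Tries flock() up to `attempts` times, sleeping `delay_us`
 * microseconds between tries.
 *
 * Returns the locked descriptor, or -1 if the lock stayed busy
 * (errno == EWOULDBLOCK) or another error occurred.
 */
int claim_queue_retry(const char *lockpath, int attempts,
                      unsigned int delay_us)
{
    assert(lockpath != NULL && attempts > 0);

    int fd = open(lockpath, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    for (int i = 0; i < attempts; i++) {
        if (flock(fd, LOCK_EX | LOCK_NB) == 0)
            return fd;          /* previous holder has let go */
        if (errno != EWOULDBLOCK)
            break;              /* real error, stop retrying */
        if (i + 1 < attempts)
            usleep(delay_us);   /* wait before the next attempt */
    }

    int saved = errno;          /* close() may overwrite errno */
    close(fd);
    errno = saved;
    return -1;
}
```

This only helps if the total retry time outlasts the cleanup done by the
exiting daemon, and it keeps extra processes alive while they wait, which
may be among the reasons the approach was unappealing.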

Comment 5 Crutcher Dunnavant 2000-10-11 15:14:51 UTC
Okay, we have a child bug: Bug #18853

Its contents:
Having a problem printing to network printers.  The jobs hang in the queue.
When running 'lpc stat <printername>' jobs show as waiting.  When I
run 'ps -ef |grep lpd' - 'lpd' shows as running more than once (usually 
twice).  If I kill the second instance, the reports will occasionally 
print.  Often it is a matter of a combination of killing the process, and 
trying 'lpc restart <printername>' or 'lpc down/up <printername>'.  I have 
not found consistency in getting the reports to print.
I had a file sent to me - lpr-0.50-7.i386.rpm - installed that.  It seems 
to help a little, but we still have a problem.
There are only 3 of 18 printers that get hung.  They are all at the bottom 
in printtool, but were not to start.  I deleted and re-added 2 of them.  I 
have also swapped hardware on one of them.  The 3 printers are HP 
laserjets - a model 4, 5, & 4050N.  The 4050N has a built in network card, 
the other 2 are using NetGear PS104 print servers.  I have them set up on 
linux as 'Remote Unix (lpd) queue', Remote Host is IP Address, Remote 
Queue is Raw.  These printers also exist on the NT side.  I actually set 
them up there first, to use the software with the printservers/network 
cards to assign the IP address to the devices.  Then I create them on 
Linux.

Comment 6 Crutcher Dunnavant 2000-10-11 15:15:30 UTC
*** Bug 18853 has been marked as a duplicate of this bug. ***

Comment 7 Crutcher Dunnavant 2000-10-11 15:27:19 UTC
I will examine how to do more complete locking on lpr, but it won't be easy.
The lpr-0.50-7 package is about the most that can easily be done to fix this,
and anything more complete will take some serious thought to avoid creating
NEW race conditions.

Comment 8 Crutcher Dunnavant 2001-03-14 17:38:50 UTC
I'm not going to get to this.