134090 – Postgres's init script does not remove stale PID file

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	rh-postgresql
Sub Component:
Version:	3.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Tom Lane

Reported:	2004-09-29 15:51 UTC by Milan Kerslager
Modified:	2013-07-03 03:02 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-12-20 17:54:14 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2004:489	0	normal	SHIPPED_LIVE	Low: rh-postgresql security update	2004-12-20 05:00:00 UTC

Description Milan Kerslager 2004-09-29 15:51:59 UTC

After a postgres server crash, the init script removes the socket but
not the PID file, so the PostgreSQL server doesn't start and the
startup script just hangs.

When starting it manually, I see:

pg_ctl: Another postmaster may be running.  Trying to start postmaster
anyway.
Lock file "/var/lib/pgsql/data/postmaster.pid" already exists.
Is another postmaster (pid 1497) running in "/var/lib/pgsql/data"?
pg_ctl: cannot start postmaster
Examine the log output.

Please change line 178 of /etc/rc.d/init.d/rhdb to:

rm -f /tmp/.s.PGSQL.${PGPORT} /var/run/postmaster.${PGPORT}.pid

There are already tests to make sure the postmaster is not running, so
deleting the PID file is safe.

Comment 1 Tom Lane 2004-09-29 19:49:24 UTC

Well, that particular fix would fix nothing, because the /var/run file
is not the one that creates this issue.

The reason that there is no data directory lockfile removal in the
script is that I consider it too dangerous to put in a script that may
be manually invoked after system startup.  The pidof test is not
entirely trustworthy (particularly in the more recent script versions
that support multiple postmasters).  What you'd be doing is a
tradeoff: create a risk of data corruption to remove the risk of no
automatic postmaster start.  Do you really think that's a good
tradeoff?

The correct place to do the removal is in a script that will certainly
be run only during boot, such as /etc/rc.sysinit which cleans out
/var/run.  But hacking random data directories there seems frowned on
:-(

I'm sitting and thinking about ways to leverage the automatic cleanout
of /var/run, that is, to check whether
/var/run/postmaster.${PGPORT}.pid is still there, but this also seems
not completely trustworthy (particularly if the DBA ever starts the
postmaster directly rather than through the init script).
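
The /var/run idea above could be sketched roughly as follows. This is only an illustration, not code from the rhdb init script: the function name clear_lock_if_rebooted is hypothetical, and both paths are passed in by the caller. It has exactly the weakness noted above: if the postmaster was started by hand, no marker exists in /var/run and the lock file would be removed wrongly.

```shell
#!/bin/sh
# Hypothetical helper (not from the real init script).
# $1: marker file in /var/run, e.g. /var/run/postmaster.${PGPORT}.pid
# $2: data directory lock file, e.g. /var/lib/pgsql/data/postmaster.pid
clear_lock_if_rebooted() {
    marker=$1
    lockfile=$2
    if [ -e "$marker" ]; then
        # The marker survived, so /var/run has not been cleaned out since
        # the last init-script start. Leave the lock file alone.
        return 1
    fi
    # No marker: assume (unsafely, if the postmaster was started by hand)
    # that the system has rebooted and the lock file is stale.
    rm -f "$lockfile"
}
```

A caller would invoke it as clear_lock_if_rebooted "/var/run/postmaster.${PGPORT}.pid" "$PGDATA/postmaster.pid" before attempting the start.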

Comment 2 Milan Kerslager 2004-09-29 21:58:16 UTC

As I wrote, a command that removes the socket file after the regular
check (whether a postmaster is running) is already in place. IMHO there
is no reason for the init script to support a hand-started postmaster:
if somebody wants to run the postmaster by hand, he should know what
he is doing.

There is no chance of writing a more foolproof script, but there is a
REAL chance of PostgreSQL staying stopped after a machine crash and
reboot because of a stale PID file.

I think you are worried about an already solved issue (if a postmaster
were running, the socket file would already be removed, so odd things
could happen already). But I don't believe this case is possible
(because of the double pg_ctl and pidof test).

Comment 3 Tom Lane 2004-09-29 22:31:55 UTC

Well, in fact the removal of the socket lockfile is quite broken too,
but I've left it alone because it does not pose the same sort of data
integrity threat that incorrectly zapping the data directory lockfile
does.  Having two postmasters running in the same data directory is a
disaster of the first magnitude --- it *will* destroy your data,
probably irretrievably --- and I'm not willing to take that risk.

Comment 4 Milan Kerslager 2004-09-29 22:39:15 UTC

I'm not sure how another postmaster could be running when there is a
pidof test and a pg_ctl test in place. Maybe under another (program)
name, but this seems to me like very odd paranoia on your part.

Comment 5 Tom Lane 2004-09-30 01:03:49 UTC

The pidof test could fail because it is testing for a specific executable, not just any 
postmaster.  The pg_ctl test is basically useless since what it really checks for is existence 
of the same lockfile that we are talking about removing; it doesn't try to check whether 
there is a live postmaster associated with the lockfile or not (and if it did, it would 
probably use the same fallible test that the postmaster itself is using, i.e. does that PID
exist and belong to a postgres-owned process?).
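
As a rough illustration of the test described above ("does that PID exist and belong to a postgres-owned process?"), here is a shell sketch. It is not the postmaster's actual code: the function name lockfile_looks_live is hypothetical, it assumes a Linux /proc filesystem and a stat that accepts -c, and it assumes the PID is on the first line of the lock file.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not the postmaster's real logic).
# $1: lock file path, $2: numeric UID expected to own the process.
# Returns 0 if the recorded PID looks live, 1 otherwise.
lockfile_looks_live() {
    lockfile=$1
    owner_uid=$2
    [ -r "$lockfile" ] || return 1
    # Assumption: the first line of the lock file holds the PID.
    pid=$(head -n 1 "$lockfile" | tr -d '[:space:]')
    case $pid in
        ''|*[!0-9]*) return 1 ;;   # empty or not a number
    esac
    # Linux-specific: a process exists if it has a /proc entry.
    [ -d "/proc/$pid" ] || return 1
    # The owner of /proc/PID is the user the process runs as.
    actual_uid=$(stat -c %u "/proc/$pid" 2>/dev/null) || return 1
    [ "$actual_uid" = "$owner_uid" ]
}
```

The fallibility is visible here: after a reboot, any process that happens to reuse the recorded PID and runs as the same user makes the stale lock file look live.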

Admittedly, you are not likely to get into trouble if you have only one Postgres installation 
on the machine.  But I'm hesitant to put in a deliberate defeat of a safety mechanism.

I had another idea though ... the case where you get into trouble involves a small increase 
in the number of processes launched-so-far, so that the PID that belonged to the 
postmaster in the previous system cycle now belongs to the shell or pg_ctl process 
launching it.  Maybe if we didn't use pg_ctl but launched the postmaster directly from "su", 
we could arrange that there are no postgres-owned processes except the postmaster 
itself, that is its direct parent is root-owned.  Then the postmaster's internal test wouldn't 
get confused, and we'd not need to remove the lockfiles at all.

I'm not certain that su can be made to work like that, but it seems worth looking into...
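
A minimal sketch of what launching the postmaster from su instead of pg_ctl might look like. This is not the committed init script: the function names and the log path are placeholders, and the only flags used are the postmaster's -p (port) and -D (data directory). Note that su -c hands the command to the target user's shell, so in this sketch one postgres-owned shell still sits between su and the postmaster.

```shell
#!/bin/sh
# Placeholder log path for illustration only.
PGLOG=${PGLOG:-/var/lib/pgsql/pgstartup.log}

# Build the command line that the postgres-owned shell would run.
# $1: port, $2: data directory.
postmaster_cmd() {
    printf "postmaster -p '%s' -D '%s' &" "$1" "$2"
}

# Start the postmaster through su, with no pg_ctl in between.
# Must be run as root; output is appended to the log file.
start_postmaster() {
    su -l postgres -c "$(postmaster_cmd "$1" "$2")" >>"$PGLOG" 2>&1 </dev/null
}
```

An init script would then call something like start_postmaster "$PGPORT" /var/lib/pgsql/data.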

Comment 6 Tom Lane 2004-10-05 19:26:22 UTC

Turns out that can be made to work with one extra hack: there has to
be at least one layer of postgres-owned shell, but it's safe for the
postmaster to ignore its immediate parent process PID (which it knows
from getppid).  I've committed a fix along these lines in
rh-postgresql-7.3.7-2.
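
The parent-PID exception described above can be illustrated with a small shell sketch. It is not the committed fix, which lives in the postmaster itself: the function name pid_conflicts is hypothetical, the process-exists check assumes a Linux /proc filesystem, the ownership check is left out for brevity, and skipping the process's own PID is an assumption of this sketch rather than something stated in this report.

```shell
#!/bin/sh
# Hypothetical helper (illustration only).
# $1: PID recorded in the lock file
# $2: our own PID (what getpid would return)
# $3: our parent's PID (what getppid would return)
# Returns 0 if the recorded PID should be treated as a conflicting
# postmaster, 1 if it can be ignored.
pid_conflicts() {
    recorded=$1
    self=$2
    parent=$3
    # Assumption of this sketch: a PID equal to our own is PID reuse.
    [ "$recorded" = "$self" ] && return 1
    # The immediate parent is the launching shell, so ignore it too.
    [ "$recorded" = "$parent" ] && return 1
    # Otherwise the PID conflicts only if such a process exists.
    [ -d "/proc/$recorded" ]
}
```

With this rule, a stale lock file whose PID was reused by the launching shell no longer blocks startup.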

Comment 7 John Flanagan 2004-12-20 17:54:14 UTC

An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-489.html
