After a postgres server crash, the init script removes the socket but not the PID file, so the PostgreSQL server doesn't start and the start script just hangs. When starting it manually I see:
    pg_ctl: Another postmaster may be running. Trying to start postmaster anyway.
    Lock file "/var/lib/pgsql/data/postmaster.pid" already exists.
    Is another postmaster (pid 1497) running in "/var/lib/pgsql/data"?
    pg_ctl: cannot start postmaster
    Examine the log output.
Please change line 178 of /etc/rc.d/init.d/rhdb to:
    rm -f /tmp/.s.PGSQL.${PGPORT} /var/run/postmaster.${PGPORT}.pid
There are already tests to make sure the postmaster is not running, so deleting the PID file is safe.
Well, that particular fix would fix nothing, because the /var/run file is not the one that creates this issue. The reason that there is no data directory lockfile removal in the script is that I consider it too dangerous to put in a script that may be manually invoked after system startup. The pidof test is not entirely trustworthy (particularly in the more recent script versions that support multiple postmasters). What you'd be doing is a tradeoff: create a risk of data corruption to remove the risk of no automatic postmaster start. Do you really think that's a good tradeoff? The correct place to do the removal is in a script that will certainly be run only during boot, such as /etc/rc.sysinit which cleans out /var/run. But hacking random data directories there seems frowned on :-( I'm sitting and thinking about ways to leverage the automatic cleanout of /var/run, that is, look to see if /var/run/postmaster.${PGPORT}.pid is still there, but this also seems not completely trustworthy (particularly if the DBA ever starts the postmaster directly rather than through the init script).
As I wrote, there is already a command in place that removes the socket file after the regular check (whether a postmaster is running). IMHO there is no reason for the init script to support a 'hand-run postmaster': if somebody wants to run the postmaster by hand, he should know what he is doing. There is no chance of writing a more foolproof script, but there is a REAL chance of PostgreSQL staying stopped after a machine crash and reboot because of a stale PID file. I think you are worried about an issue that is already solved (if there is a running postmaster, the socket file has already been removed, so odd things may happen already). But I don't believe this case is possible (because of the double test, pg_ctl and pidof).
Well, in fact the removal of the socket lockfile is quite broken too, but I've left it alone because it does not pose the same sort of data integrity threat that incorrectly zapping the data directory lockfile does. Having two postmasters running in the same data directory is a disaster of the first magnitude --- it *will* destroy your data, probably irretrievably --- and I'm not willing to take that risk.
I'm not sure how another postmaster could be running when there is a pidof test and a pg_ctl test in place. Maybe under another (program) name, but this seems to me like very odd paranoia on your part.
The pidof test could fail because it is testing for a specific executable, not just any postmaster. The pg_ctl test is basically useless since what it really checks for is existence of the same lockfile that we are talking about removing; it doesn't try to check whether there is a live postmaster associated with the lockfile or not (and if it did, it would probably use the same fallible test that the postmaster itself is using, ie does that PID exist and belong to a postgres-owned process?). Admittedly, you are not likely to get into trouble if you have only one Postgres installation on the machine. But I'm hesitant to put in a deliberate defeat of a safety mechanism. I had another idea though ... the case where you get into trouble involves a small increase in the number of processes launched-so-far, so that the PID that belonged to the postmaster in the previous system cycle now belongs to the shell or pg_ctl process launching it. Maybe if we didn't use pg_ctl but launched the postmaster directly from "su", we could arrange that there are no postgres-owned processes except the postmaster itself, that is its direct parent is root-owned. Then the postmaster's internal test wouldn't get confused, and we'd not need to remove the lockfiles at all. I'm not certain that su can be made to work like that, but it seems worth looking into...
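To make the failure mode concrete, here is a minimal Python sketch of the kind of liveness test described above (does the PID recorded in the lock file exist and belong to a process we own?). It is an illustration only, not the postmaster's actual C code; the function name `lockfile_looks_live` is hypothetical, and a POSIX system is assumed.

```python
import os


def lockfile_looks_live(lock_pid: int) -> bool:
    """Hypothetical helper: guess whether the PID from a lock file is in use.

    Returns True only if a process with that PID exists and we are allowed
    to signal it (i.e. it runs under our own user, or we are root).
    """
    if lock_pid <= 0:
        # kill() gives PIDs <= 0 a special meaning (process groups),
        # so they can never be a valid lock file entry.
        return False
    try:
        os.kill(lock_pid, 0)  # signal 0 only probes; nothing is delivered
    except ProcessLookupError:
        return False  # no such process: the lock file is stale
    except PermissionError:
        return False  # process exists but belongs to another user
    return True


if __name__ == "__main__":
    # After a reboot, a stale PID may have been recycled by an unrelated
    # process of the same user, such as the shell that launches the server.
    # The probe cannot tell that apart from a live postmaster.
    print("parent shell looks live:", lockfile_looks_live(os.getppid()))
```

The last lines show the weakness under discussion: the probe only sees "some process of ours has this PID", so a recycled PID owned by the launching shell or pg_ctl would look like a running postmaster.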
Turns out that can be made to work with one extra hack: there has to be at least one layer of postgres-owned shell, but it's safe for the postmaster to ignore its immediate parent process PID (which it knows from getppid). I've committed a fix along these lines in rh-postgresql-7.3.7-2.
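The idea of ignoring the immediate parent can be sketched as follows. This is a hedged Python illustration, not the code committed in rh-postgresql-7.3.7-2; the names `pid_is_ours_and_alive` and `lock_blocks_startup` are hypothetical, and also skipping the process's own PID is an assumption added here for completeness.

```python
import os


def pid_is_ours_and_alive(pid: int) -> bool:
    """Hypothetical probe: True if the PID exists and we may signal it."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permission
    except (ProcessLookupError, PermissionError):
        return False
    return True


def lock_blocks_startup(lock_pid, my_pid=None, parent_pid=None,
                        is_live=pid_is_ours_and_alive) -> bool:
    """Decide whether the PID found in a lock file should block startup.

    A PID equal to our own or to our immediate parent (the launching
    shell) cannot be another server, so it is treated as stale.
    """
    my_pid = os.getpid() if my_pid is None else my_pid
    parent_pid = os.getppid() if parent_pid is None else parent_pid
    if lock_pid in (my_pid, parent_pid):
        return False  # recycled PID now owned by us or our launcher
    return is_live(lock_pid)
```

With only one same-user shell between the root-owned caller and the server, the parent is the only other same-user process that could have inherited the old PID, which is why exempting it is enough in that setup.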
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2004-489.html