Description of problem: Around 3 out of 5 times, a 'service postgresql start' reports [ FAILED ] when postmaster has started successfuly - ps shows all the postgresql processes running. Before enabling verbose logging in /etc/sysconfig/pgsql/postgresql with: PGOPTS='-c log_connections=yes -c log_destination=syslog \ -c log_error_verbosity=verbose' and creating a database, the problem occurred more infrequently, around 1 out of 5 times, but still occurred. It seems the postgresql initscripts' "sleep 2" is not sufficient for postmaster startup. A patch is attached that fixes the problem . Version-Release number of selected component (if applicable): postgresql-8.0.3-1 from rawhide-20050808 on an i686 2.8Ghz P4 How reproducible: 80% Steps to Reproduce: # service postgresql start Actual results: script reports [ FAILED ], even though all postmaster processes are running OK. Expected results: script should have reported [ OK ] . Additional info:
Created attachment 117815 [details] patch against /etc/init.d/postgresql
Created attachment 117816 [details] bash -x debug output log of unpatched script when it reported [ FAILURE ] after it started postmaster OK
I can confirm this on my home machine, not on Rawhide but on FC4. Up to now it has always failed during the normal system boot, but When I logged in after that, it always succeeded. It seems that the other stuff happening at boot time braked postmaster startup suifficiently to let it take longer than two seconds. I'll attach a slightly different version of the init script patch which only waits 1 second instead of 2 between checking. This might shave off up to 1 second of boot time (which other people are trying to get down). Tom: How probable is it that postmaster needs 20 seconds or longer to get started? If this were sufficiently probable, we should make the timeout value configurable.
Created attachment 118240 [details] Changed postgresql init script patch
It's really pretty durn hard to believe that it'd take two seconds to get to the point where the postmaster creates the postmaster.pid file (which is all that the initscript is looking for). The total start time might be much more, particularly if any WAL replay is needed ... but I'm loath to apply a patch like this because I think it is a band-aid masking a deeper problem. [ thinks... ] Is it possible that the "su" is the part taking a long time?
I agree the patch is a bandaid and that we should fix the underlying problem of postmaster not creating its pid file fast enough. I don't think the $SU (/sbin/runuser) command is the problem: $ time /sbin/runuser -l postgres -c /bin/true real 0m0.285s user 0m0.104s sys 0m0.080s When I modified the initscript to print out the time, as follows: --- echo "postgresql START "`date +%s.%N` $SU -l postgres -c "$PGENGINE/postmaster -p '$PGPORT' -D '$PGDATA' ${PGOPTS} &" >> "$PGLOG" 2>&1 < /dev/null script_result=$? if [ "$script_result" -eq 0 ]; then let t=0 while [ "$t" -lt 20 ]; do sleep 2 let t=t+2 pid=`pidof -s "$PGENGINE/postmaster"` if [ $pid ] && [ -f "$PGDATA/postmaster.pid" ]; then script_result=0; echo "posgresql STARTED "`date +%s.%N` break; --- I get these times : $ service postgresql start postgresql START 1125411959.992102000 posgresql STARTED 1125411964.318344000 [ OK ] (stop) $ service postgresql start postgresql START 1125412258.027964000 posgresql STARTED 1125412260.365444000 [ OK ] (stop) $ service postgresql start postgresql START 1125412296.352903000 posgresql STARTED 1125412300.704747000 [ OK ] (stop) $ service postgresql start postgresql START 1125412340.264724000 posgresql STARTED 1125412342.569291000 [ OK ] (stop) $ service postgresql start postgresql START 1125412375.399402000 posgresql STARTED 1125412379.731528000 [ OK ] (stop) $ service postgresql start postgresql START 1125412463.141094000 posgresql STARTED 1125412467.495207000 [ OK ] So around 3/5 times, it takes longer than 2 seconds to create the pid file. This is on a Rawhide machine, and the ~postgres directory is on an XFS partition that is not the / partition.
The problem as I see it is that postgresql is not alone one the system, i.e. it could be that it takes more than 2 seconds before the postmaster process itself is loaded into memory and started and the start script must take this into account as well.
Jason, in your comment #6, are those times measured during system reboots, or were you just stopping and starting the service by hand on an otherwise idle machine? It's really hard to believe postmaster start would take so long unless there's a lot else going on. If you can replicate this during a manual start, it'd be interesting to run the "su" command that's starting the postmaster under "strace -f -tt" so that we can get some indication of why it's so slow.
*** Bug 152931 has been marked as a duplicate of this bug. ***
*** Bug 165824 has been marked as a duplicate of this bug. ***
I believe we've finally identified the reason for this behavior: http://archives.postgresql.org/pgsql-hackers/2005-10/msg00874.php The slow routine is new code in PG 8.0, which explains why there were no reports of trouble before. A fix is now in upstream CVS and should make its way here in due time (whenever PG 8.0.5 is released). In the meantime, a workaround to avoid the slow startup is to explicitly set the TIMEZONE parameter in postgresql.conf.