166117 – postgresql initscript started reports [ FAILED ], but postmaster started

Bug 166117 - postgresql initscript started reports [ FAILED ], but postmaster started

Summary: postgresql initscript started reports [ FAILED ], but postmaster started

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	postgresql
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Tom Lane
QA Contact:	David Lawrence
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	152931 165824 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2005-08-17 00:18 UTC by Jason Vas Dias
Modified:	2013-07-03 03:06 UTC (History)
CC List:	4 users (show)
Fixed In Version:	8.0.5
Clone Of:
Environment:
Last Closed:	2005-10-20 22:06:11 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
patch against /etc/init.d/postgresql (577 bytes, patch) 2005-08-17 00:18 UTC, Jason Vas Dias	no flags	Details \| Diff
bash -x debug output log of unpatched script when it reported [ FAILURE ] after it started postmaster OK (2.98 KB, text/plain) 2005-08-17 00:20 UTC, Jason Vas Dias	no flags	Details
Changed postgresql init script patch (577 bytes, patch) 2005-08-30 07:37 UTC, Nils Philippsen	no flags	Details \| Diff
View All

Description Jason Vas Dias 2005-08-17 00:18:02 UTC

Description of problem:

Around 3 out of 5 times, a 'service postgresql start' reports [ FAILED ]
when postmaster has started successfuly - ps shows all the postgresql
processes running.
 
Before enabling verbose logging in /etc/sysconfig/pgsql/postgresql with:
PGOPTS='-c log_connections=yes -c log_destination=syslog \
        -c log_error_verbosity=verbose'
and creating a database, the problem occurred more infrequently, around 1 
out of 5 times, but still occurred.

It seems the postgresql initscripts' "sleep 2" is not sufficient for 
postmaster startup.

A patch is attached that fixes the problem .

Version-Release number of selected component (if applicable):
postgresql-8.0.3-1 from rawhide-20050808 on an i686 2.8Ghz P4

How reproducible:
80%

Steps to Reproduce:
# service postgresql start
  
Actual results:
script reports [ FAILED ], even though all postmaster processes are running OK.

Expected results:
script should have reported [ OK ] .

Additional info:

Comment 1 Jason Vas Dias 2005-08-17 00:18:02 UTC

Created attachment 117815 [details]
patch against /etc/init.d/postgresql

Comment 2 Jason Vas Dias 2005-08-17 00:20:50 UTC

Created attachment 117816 [details]
bash -x debug output log of unpatched script when it reported [ FAILURE ] after it started postmaster OK

Comment 3 Nils Philippsen 2005-08-30 07:33:30 UTC

I can confirm this on my home machine, not on Rawhide but on FC4. Up to now it
has always failed during the normal system boot, but When I logged in after
that, it always succeeded. It seems that the other stuff happening at boot time
braked postmaster startup suifficiently to let it take longer than two seconds.

I'll attach a slightly different version of the init script patch which only
waits 1 second instead of 2 between checking. This might shave off up to 1
second of boot time (which other people are trying to get down).

Tom: How probable is it that postmaster needs 20 seconds or longer to get
started? If this were sufficiently probable, we should make the timeout value
configurable.

Comment 4 Nils Philippsen 2005-08-30 07:37:51 UTC

Created attachment 118240 [details]
Changed postgresql init script patch

Comment 5 Tom Lane 2005-08-30 14:11:39 UTC

It's really pretty durn hard to believe that it'd take two seconds to get to the
point where the postmaster creates the postmaster.pid file (which is all that
the initscript is looking for).  The total start time might be much more,
particularly if any WAL replay is needed ... but I'm loath to apply a patch like
this because I think it is a band-aid masking a deeper problem.

[ thinks... ]  Is it possible that the "su" is the part taking a long time?

Comment 6 Jason Vas Dias 2005-08-30 14:37:34 UTC

I agree the patch is a bandaid and that we should fix the underlying problem
of postmaster not creating its pid file fast enough.

I don't think the $SU (/sbin/runuser) command is the problem:

$ time /sbin/runuser -l postgres -c /bin/true

real    0m0.285s
user    0m0.104s
sys     0m0.080s


When I modified the initscript to print out the time, as follows:
---
	echo "postgresql START "`date +%s.%N` 
	$SU -l postgres -c "$PGENGINE/postmaster -p '$PGPORT' -D '$PGDATA' ${PGOPTS} &"
>> "$PGLOG" 2>&1 < /dev/null
	script_result=$?
	if [ "$script_result" -eq 0 ]; then
	    let t=0
	    while [ "$t" -lt 20 ]; do
		sleep 2
		let t=t+2
		pid=`pidof -s "$PGENGINE/postmaster"`		
		if [ $pid ] && [ -f "$PGDATA/postmaster.pid" ]; then
		    script_result=0;
		    echo "posgresql STARTED "`date +%s.%N`
		    break;
---

I get these times :
$ service postgresql start
postgresql START 1125411959.992102000
posgresql STARTED 1125411964.318344000
                                                           [  OK  ]
(stop)
$ service postgresql start
postgresql START 1125412258.027964000
posgresql STARTED 1125412260.365444000
                                                           [  OK  ]
(stop)
$ service postgresql start
postgresql START 1125412296.352903000
posgresql STARTED 1125412300.704747000
                                                           [  OK  ]
(stop)
$ service postgresql start
postgresql START 1125412340.264724000
posgresql STARTED 1125412342.569291000
                                                           [  OK  ]
(stop)
$ service postgresql start
postgresql START 1125412375.399402000
posgresql STARTED 1125412379.731528000
                                                           [  OK  ]
(stop)
$ service postgresql start
postgresql START 1125412463.141094000
posgresql STARTED 1125412467.495207000
                                                           [  OK  ]

So around 3/5 times, it takes longer than 2 seconds to create the pid file.
This is on a Rawhide machine, and the ~postgres directory is on an XFS 
partition that is not the / partition.

Comment 7 Nils Philippsen 2005-08-31 09:38:24 UTC

The problem as I see it is that postgresql is not alone one the system, i.e. it
could be that it takes more than 2 seconds before the postmaster process itself
is loaded into memory and started and the start script must take this into
account as well.

Comment 8 Tom Lane 2005-09-30 15:34:37 UTC

Jason, in your comment #6, are those times measured during system reboots, or
were you just stopping and starting the service by hand on an otherwise idle
machine?  It's really hard to believe postmaster start would take so long unless
there's a lot else going on.

If you can replicate this during a manual start, it'd be interesting to run the
"su" command that's starting the postmaster under "strace -f -tt" so that we can
get some indication of why it's so slow.

Comment 9 Tom Lane 2005-09-30 15:39:13 UTC

*** Bug 152931 has been marked as a duplicate of this bug. ***

Comment 10 Tom Lane 2005-09-30 18:54:01 UTC

*** Bug 165824 has been marked as a duplicate of this bug. ***

Comment 11 Tom Lane 2005-10-20 22:06:11 UTC

I believe we've finally identified the reason for this behavior:
http://archives.postgresql.org/pgsql-hackers/2005-10/msg00874.php
The slow routine is new code in PG 8.0, which explains why there were no reports
of trouble before.

A fix is now in upstream CVS and should make its way here in due time (whenever
PG 8.0.5 is released).  In the meantime, a workaround to avoid the slow startup
is to explicitly set the TIMEZONE parameter in postgresql.conf.

Note You need to log in before you can comment on or make changes to this bug.