Bug 1257334

Summary:	Better detection of staled postmaster.pid (e.g. after power outage)
Product:	[Fedora] Fedora	Reporter:	Radek Hladik <rhladik>
Component:	postgresql	Assignee:	Filip Januš <fjanus>
Status:	CLOSED WONTFIX	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	low	Docs Contact:
Priority:	unspecified
Version:	rawhide	CC:	anezbeda, devrim, hhorak, jmlich83, pkubat, praiskup, tgl
Target Milestone:	---	Keywords:	FutureFeature
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Enhancement
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2024-05-30 17:38:28 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Radek Hladik 2015-08-26 20:27:32 UTC

Description of problem:
I am running two PostgreSQL server instances on a Fedora 20 box. The second one is controlled by second systemd unit file that is a copy of default one with small modifications (see additional info for the unit file) - should be only PGPORT and PGDATA.

This works fine but if the server crashes and reboots, second instance sometimes refuses to start because of a stale PID file - depends if the first instance will get conflicting PID.

Version-Release number of selected component (if applicable):
systemd-libs-208-20.fc20.x86_64
systemd-208-20.fc20.x86_64
postgresql-server-9.3.4-1.fc20.x86_64

Server is basically out-of-the-box Fedora installation with minimum modifications.


How reproducible:
Install two postgresql instances on one server with two data directories

Steps to Reproduce:
1. Reboot server
2. Wait for both postgresql instances to start up
3. Perform unclean shuttdown of the server
4. Boot server


Actual results:
As systemd does parallel service startup both postgresql instances will get cca the same PID every boot but in undefined order. For example in step 2 instance A got 445 and instance B got 446.
In step 4 first instance (i.e. A) may get PID 446. When instance B starts, it sees a stale PID file with PID=446 and it also sees running postmaster with PID 446 so it decides not to start:

(Unfortunately I do have only shortened log file at the moment)
Aug 17 10:30:46 xxxxx pg_ctl[1500]: pg_ctl: another server might be running; trying to start server any
Aug 17 10:30:46 xxxxx pg_ctl[1500]: FATAL:  lock file "postmaster.pid" already exists                  
Aug 17 10:30:46 xxxxx pg_ctl[1500]: HINT:  Is another postmaster (PID 446) running in data directory "/
Aug 17 10:30:51 xxxxx pg_ctl[1500]: pg_ctl: this data directory appears to be running a pre-existing po
Aug 17 10:30:51 xxxxx pg_ctl[1500]: pg_ctl: could not start server
Aug 17 10:30:51 xxxxx pg_ctl[1500]: Examine the log output.
Aug 17 10:30:51 xxxxx systemd[1]: postgresql.service: control process exited, code=exited status=1
Aug 17 10:30:51 xxxxx systemd[1]: Failed to start PostgreSQL database server.
Aug 17 10:30:51 xxxxx systemd[1]: Unit postgresql.service entered failed state.



Expected results:
Pg_ctl or systemd unit file will delete the stale lock file as it belongs to other postgresql instance and the instance will start up.

Additional info:

[Unit]
Description=PostgreSQL database server
After=network.target
[Service]
Type=forking
User=postgres
Group=postgres
Environment=PGPORT=5430
Environment=PGDATA=/data/pgsql-test
OOMScoreAdjust=-1000
ExecStartPre=/usr/bin/postgresql-check-db-dir ${PGDATA}
ExecStart=/usr/bin/pg_ctl start -D ${PGDATA} -s -o "-p ${PGPORT}" -w -t 300
ExecStop=/usr/bin/pg_ctl stop -D ${PGDATA} -s -m fast
ExecReload=/usr/bin/pg_ctl reload -D ${PGDATA} -s
TimeoutSec=300
[Install]
WantedBy=multi-user.target

Additional info 2:
I know that it would be better for the server to not crash but sometimes there is a issue with power that is out of my reach to fix. And the server is used only for read-only access to a local copy of remote data.

Comment 1 Pavel Raiskup 2015-08-31 14:21:56 UTC

Hmm, this sounds like tough problem to solve.  IMO, this situation might
happen also if there was _just one_ PostgreSQL instance running on Fedora
box..  due to parallel nature of systemd -- the PID of old postmaster could
already be used by any other service.

I don't think there is much we can do (sanely) here, but I tried to ping
upstream with this:
http://www.postgresql.org/message-id/1711927.hbtYs8Lf7C@nb.usersys.redhat.com

Comment 2 Tom Lane 2015-08-31 15:21:32 UTC

If the PID's been reused by a different service, it's not a problem.

The postmaster will only think that a postmaster.pid file represents a live conflict if the PID mentioned in it is live, has the same userid as the new postmaster, and is not the postmaster's immediate parent process (typically pg_ctl).  I think we also had a kluge in there once upon a time to detect the grandparent process, in case that was a shell also executing as the postgres user.  I'm not sure if that kluge survived the transition to systemd, or whether it'd still be of use in systemd-land.

In the case described, much the safest solution would be to run the two postmasters under two different userids.

If there's some step in systemd bootup that is responsible for blowing away pidfiles of other services, *and it is iron clad guaranteed to run only at system boot and never again*, you could think about having that step also delete Postgres pidfiles.  However, I've seen way too many people lose data due to implementing such a manual removal incorrectly --- if there is any way for it to happen after system start, you risk removing a pidfile that corresponds to a live postmaster, and if you do that you will end up with a corrupt database eventually.  Do not even think of just adding that to the "service start" sequence.

Comment 3 Tom Lane 2015-08-31 15:31:09 UTC

Ah, wait, the kluge I mentioned is in the postgres sources, it wasn't added by Red Hat. pg_ctl tells the postmaster the PID of its parent process. It's probably worth quoting the comment in miscinit.c:

* If the PID in the lockfile is our own PID or our parent's or
* grandparent's PID, then the file must be stale (probably left over from
* a previous system boot cycle). We need to check this because of the
* likelihood that a reboot will assign exactly the same PID as we had in
* the previous reboot, or one that's only one or two counts larger and
* hence the lockfile's PID now refers to an ancestor shell process. We
* allow pg_ctl to pass down its parent shell PID (our grandparent PID)
* via the environment variable PG_GRANDPARENT_PID; this is so that
* launching the postmaster via pg_ctl can be just as reliable as
* launching it directly. There is no provision for detecting
* further-removed ancestor processes, but if the init script is written
* carefully then all but the immediate parent shell will be root-owned
* processes and so the kill test will fail with EPERM. Note that we
* cannot get a false negative this way, because an existing postmaster
* would surely never launch a competing postmaster or pg_ctl process
* directly.

It'd be worth studying the systemd script with this in mind to make sure there's not more than one level of postgres-owned process above pg_ctl. However, the whole thing is risky anyway if you are launching multiple postmasters concurrently with identical userids; there's no way to tell that some other postmaster isn't actually interested in your data directory.

Comment 4 Radek Hladik 2015-08-31 19:19:54 UTC

It is no problem for me to delete the stalled PID file manually. And I will probably use the workaround with different userid and change uid of one of the instance. Or I have been suggested to use "external_pid_file" option to move PID files to /var/run (where they should get deleted on system start) - however I am not sure if this will work.
But one thing I do not understand. I am not a skilled unix programmer but I do not understand why pg_ctl can not simply query the "candidate" postmaster from PID file if it is using the data directory in question. Or instead of using PID file just use some socket in data directory to "ping" the postmaster...

Comment 5 Pavel Raiskup 2015-10-14 14:05:21 UTC

(In reply to Radek Hladik from comment #4)
> It is no problem for me to delete the stalled PID file manually. And I will
> probably use the workaround with different userid and change uid of one of
> the instance.

I wouldn't call this workaround, we should however somehow document this --
README.rpm-dist seems to be proper place for this..?

> Or I have been suggested to use "external_pid_file" option to
> move PID files to /var/run (where they should get deleted on system start) -
> however I am not sure if this will work.

You can also use 'After=' construct in one of your service specifications, if
you don't mind you'll loose a bit of parallelism (for 'postgresql.service' and
'postgresql'):

  # mkdir /etc/systemd/system/postgresql.d
  # cd !$
  # cat >> 90-order.conf <<EOF
  [Unit]
  After=postgresql.service
  EOF

(might be worth documenting too)

> But one thing I do not understand. I am not a skilled unix programmer but I
> do not understand why pg_ctl can not simply query the "candidate" postmaster
> from PID file if it is using the data directory in question. Or instead of
> using PID file just use some socket in data directory to "ping" the
> postmaster...

We have only 'PID' of the concurrent process.  We know that it is some process
under UID 26 (postgres), otherwise we would succeed.  But we have no obvious
place where to ping for additional info (i.e. we do not even know what is the
datadir of that postmaster).

Pavel

Comment 6 Radek Hladik 2015-10-14 21:11:24 UTC

(In reply to Pavel Raiskup from comment #5)
> (In reply to Radek Hladik from comment #4)
> > It is no problem for me to delete the stalled PID file manually. And I will
> > probably use the workaround with different userid and change uid of one of
> > the instance.
> 
> I wouldn't call this workaround, we should however somehow document this --
> README.rpm-dist seems to be proper place for this..?
Good place would be default systemd postgres unit file. Now there is a example how to change a port, pgdata ,etc... So note about this would be very useful
> 
> > Or I have been suggested to use "external_pid_file" option to
> > move PID files to /var/run (where they should get deleted on system start) -
> > however I am not sure if this will work.
> 
> You can also use 'After=' construct in one of your service specifications, if
> you don't mind you'll loose a bit of parallelism (for 'postgresql.service'
> and
> 'postgresql'):
> 
>   # mkdir /etc/systemd/system/postgresql.d
>   # cd !$
>   # cat >> 90-order.conf <<EOF
>   [Unit]
>   After=postgresql.service
>   EOF
> 
> (might be worth documenting too)
I tried to do this the simpliest yet the most standard Fedora way. And to be honest I am still quite unsure about various sideeffects of ordering unit files (like removing circular unit files, etc...). And I am not sure if this would solve the problem in all cases. If something before postgres would launch one child process more/less, we could run into matching PIDs too.
> 
> > But one thing I do not understand. I am not a skilled unix programmer but I
> > do not understand why pg_ctl can not simply query the "candidate" postmaster
> > from PID file if it is using the data directory in question. Or instead of
> > using PID file just use some socket in data directory to "ping" the
> > postmaster...
> 
> We have only 'PID' of the concurrent process.  We know that it is some
> process
> under UID 26 (postgres), otherwise we would succeed.  But we have no obvious
> place where to ping for additional info (i.e. we do not even know what is the
> datadir of that postmaster).
Its interesting that being root and having PID of someone, you can send signals to the process, you can see, what files it has open, read its memory, kill it and much much more but you can not send it simple ping "hello, anybody out there?". But why can't you just have some additional info in the pid file. I.e. the datadir or path to some sort of ping-socket? 
> 
> Pavel

Comment 7 Pavel Raiskup 2015-10-15 09:39:31 UTC

(In reply to Radek Hladik from comment #6)
> If something before postgres would launch one child process more/less, we
> could run into matching PIDs too.

Right, sorry.  The problem are not processes started _before_ postgres (and its
sub-processes) but rather the processes started concurrently at the same time.
This means that there is no _constant_ difference between PID_A and PID_B
value among different reboots -- even with 'After=' construct.

> > We have only 'PID' of the concurrent process.  We know that it is some
> > process under UID 26 (postgres), otherwise we would succeed.  But we have
> > no obvious place where to ping for additional info (i.e. we do not even
> > know what is the datadir of that postmaster).
>
> Its interesting that being root and having PID of someone

The postgres-owned process knows the PID of concurrent process (we do not
parse the postmaster.pid under root).  Well -- we could parse the
postmaster.pid somehow within PreExec (or other root action), but it does not
look like the best idea (systemd has had no idea about postmaster.pid so far
and its fine because postgres/postmaster is the right process to take care of
it).

> but you can not send it simple ping "hello, anybody out there?".  But why
> can't you just have some additional info in the pid file. I.e. the datadir
> or path to some sort of ping-socket?

That all sounds like it could work but I'm not aware of some obvious and
_portable_ way to do that.  And taking the overall risk into account, it is
probably better to let the admin do some manual steps in situations like
power-outage-recovery.

Comment 8 Fedora Admin user for bugzilla script actions 2021-07-01 15:37:46 UTC

This package has changed maintainer in Fedora. Reassigning to the new maintainer of this component.

Comment 9 Ales Nezbeda 2024-05-30 17:38:28 UTC

Based on what was said here and in the mailing thread the safe choice would be to use different users. Manipulation with PIDs is dangerous and should be done only manually when we know that the PID is not attached to a running process.

It also sounds like a very specific scenario in which this conflict would/could happen and safest choice is using different configuration based on feedback here and in the aforementioned mailing list.