Description of problem: I am running two PostgreSQL server instances on a Fedora 20 box. The second one is controlled by second systemd unit file that is a copy of default one with small modifications (see additional info for the unit file) - should be only PGPORT and PGDATA. This works fine but if the server crashes and reboots, second instance sometimes refuses to start because of a stale PID file - depends if the first instance will get conflicting PID. Version-Release number of selected component (if applicable): systemd-libs-208-20.fc20.x86_64 systemd-208-20.fc20.x86_64 postgresql-server-9.3.4-1.fc20.x86_64 Server is basically out-of-the-box Fedora installation with minimum modifications. How reproducible: Install two postgresql instances on one server with two data directories Steps to Reproduce: 1. Reboot server 2. Wait for both postgresql instances to start up 3. Perform unclean shuttdown of the server 4. Boot server Actual results: As systemd does parallel service startup both postgresql instances will get cca the same PID every boot but in undefined order. For example in step 2 instance A got 445 and instance B got 446. In step 4 first instance (i.e. A) may get PID 446. When instance B starts, it sees a stale PID file with PID=446 and it also sees running postmaster with PID 446 so it decides not to start: (Unfortunately I do have only shortened log file at the moment) Aug 17 10:30:46 xxxxx pg_ctl[1500]: pg_ctl: another server might be running; trying to start server any Aug 17 10:30:46 xxxxx pg_ctl[1500]: FATAL: lock file "postmaster.pid" already exists Aug 17 10:30:46 xxxxx pg_ctl[1500]: HINT: Is another postmaster (PID 446) running in data directory "/ Aug 17 10:30:51 xxxxx pg_ctl[1500]: pg_ctl: this data directory appears to be running a pre-existing po Aug 17 10:30:51 xxxxx pg_ctl[1500]: pg_ctl: could not start server Aug 17 10:30:51 xxxxx pg_ctl[1500]: Examine the log output. Aug 17 10:30:51 xxxxx systemd[1]: postgresql.service: control process exited, code=exited status=1 Aug 17 10:30:51 xxxxx systemd[1]: Failed to start PostgreSQL database server. Aug 17 10:30:51 xxxxx systemd[1]: Unit postgresql.service entered failed state. Expected results: Pg_ctl or systemd unit file will delete the stale lock file as it belongs to other postgresql instance and the instance will start up. Additional info: [Unit] Description=PostgreSQL database server After=network.target [Service] Type=forking User=postgres Group=postgres Environment=PGPORT=5430 Environment=PGDATA=/data/pgsql-test OOMScoreAdjust=-1000 ExecStartPre=/usr/bin/postgresql-check-db-dir ${PGDATA} ExecStart=/usr/bin/pg_ctl start -D ${PGDATA} -s -o "-p ${PGPORT}" -w -t 300 ExecStop=/usr/bin/pg_ctl stop -D ${PGDATA} -s -m fast ExecReload=/usr/bin/pg_ctl reload -D ${PGDATA} -s TimeoutSec=300 [Install] WantedBy=multi-user.target Additional info 2: I know that it would be better for the server to not crash but sometimes there is a issue with power that is out of my reach to fix. And the server is used only for read-only access to a local copy of remote data.
Hmm, this sounds like tough problem to solve. IMO, this situation might happen also if there was _just one_ PostgreSQL instance running on Fedora box.. due to parallel nature of systemd -- the PID of old postmaster could already be used by any other service. I don't think there is much we can do (sanely) here, but I tried to ping upstream with this: http://www.postgresql.org/message-id/1711927.hbtYs8Lf7C@nb.usersys.redhat.com
If the PID's been reused by a different service, it's not a problem. The postmaster will only think that a postmaster.pid file represents a live conflict if the PID mentioned in it is live, has the same userid as the new postmaster, and is not the postmaster's immediate parent process (typically pg_ctl). I think we also had a kluge in there once upon a time to detect the grandparent process, in case that was a shell also executing as the postgres user. I'm not sure if that kluge survived the transition to systemd, or whether it'd still be of use in systemd-land. In the case described, much the safest solution would be to run the two postmasters under two different userids. If there's some step in systemd bootup that is responsible for blowing away pidfiles of other services, *and it is iron clad guaranteed to run only at system boot and never again*, you could think about having that step also delete Postgres pidfiles. However, I've seen way too many people lose data due to implementing such a manual removal incorrectly --- if there is any way for it to happen after system start, you risk removing a pidfile that corresponds to a live postmaster, and if you do that you will end up with a corrupt database eventually. Do not even think of just adding that to the "service start" sequence.
Ah, wait, the kluge I mentioned is in the postgres sources, it wasn't added by Red Hat. pg_ctl tells the postmaster the PID of its parent process. It's probably worth quoting the comment in miscinit.c: * If the PID in the lockfile is our own PID or our parent's or * grandparent's PID, then the file must be stale (probably left over from * a previous system boot cycle). We need to check this because of the * likelihood that a reboot will assign exactly the same PID as we had in * the previous reboot, or one that's only one or two counts larger and * hence the lockfile's PID now refers to an ancestor shell process. We * allow pg_ctl to pass down its parent shell PID (our grandparent PID) * via the environment variable PG_GRANDPARENT_PID; this is so that * launching the postmaster via pg_ctl can be just as reliable as * launching it directly. There is no provision for detecting * further-removed ancestor processes, but if the init script is written * carefully then all but the immediate parent shell will be root-owned * processes and so the kill test will fail with EPERM. Note that we * cannot get a false negative this way, because an existing postmaster * would surely never launch a competing postmaster or pg_ctl process * directly. It'd be worth studying the systemd script with this in mind to make sure there's not more than one level of postgres-owned process above pg_ctl. However, the whole thing is risky anyway if you are launching multiple postmasters concurrently with identical userids; there's no way to tell that some other postmaster isn't actually interested in your data directory.
It is no problem for me to delete the stalled PID file manually. And I will probably use the workaround with different userid and change uid of one of the instance. Or I have been suggested to use "external_pid_file" option to move PID files to /var/run (where they should get deleted on system start) - however I am not sure if this will work. But one thing I do not understand. I am not a skilled unix programmer but I do not understand why pg_ctl can not simply query the "candidate" postmaster from PID file if it is using the data directory in question. Or instead of using PID file just use some socket in data directory to "ping" the postmaster...
(In reply to Radek Hladik from comment #4) > It is no problem for me to delete the stalled PID file manually. And I will > probably use the workaround with different userid and change uid of one of > the instance. I wouldn't call this workaround, we should however somehow document this -- README.rpm-dist seems to be proper place for this..? > Or I have been suggested to use "external_pid_file" option to > move PID files to /var/run (where they should get deleted on system start) - > however I am not sure if this will work. You can also use 'After=' construct in one of your service specifications, if you don't mind you'll loose a bit of parallelism (for 'postgresql.service' and 'postgresql'): # mkdir /etc/systemd/system/postgresql.d # cd !$ # cat >> 90-order.conf <<EOF [Unit] After=postgresql.service EOF (might be worth documenting too) > But one thing I do not understand. I am not a skilled unix programmer but I > do not understand why pg_ctl can not simply query the "candidate" postmaster > from PID file if it is using the data directory in question. Or instead of > using PID file just use some socket in data directory to "ping" the > postmaster... We have only 'PID' of the concurrent process. We know that it is some process under UID 26 (postgres), otherwise we would succeed. But we have no obvious place where to ping for additional info (i.e. we do not even know what is the datadir of that postmaster). Pavel
(In reply to Pavel Raiskup from comment #5) > (In reply to Radek Hladik from comment #4) > > It is no problem for me to delete the stalled PID file manually. And I will > > probably use the workaround with different userid and change uid of one of > > the instance. > > I wouldn't call this workaround, we should however somehow document this -- > README.rpm-dist seems to be proper place for this..? Good place would be default systemd postgres unit file. Now there is a example how to change a port, pgdata ,etc... So note about this would be very useful > > > Or I have been suggested to use "external_pid_file" option to > > move PID files to /var/run (where they should get deleted on system start) - > > however I am not sure if this will work. > > You can also use 'After=' construct in one of your service specifications, if > you don't mind you'll loose a bit of parallelism (for 'postgresql.service' > and > 'postgresql'): > > # mkdir /etc/systemd/system/postgresql.d > # cd !$ > # cat >> 90-order.conf <<EOF > [Unit] > After=postgresql.service > EOF > > (might be worth documenting too) I tried to do this the simpliest yet the most standard Fedora way. And to be honest I am still quite unsure about various sideeffects of ordering unit files (like removing circular unit files, etc...). And I am not sure if this would solve the problem in all cases. If something before postgres would launch one child process more/less, we could run into matching PIDs too. > > > But one thing I do not understand. I am not a skilled unix programmer but I > > do not understand why pg_ctl can not simply query the "candidate" postmaster > > from PID file if it is using the data directory in question. Or instead of > > using PID file just use some socket in data directory to "ping" the > > postmaster... > > We have only 'PID' of the concurrent process. We know that it is some > process > under UID 26 (postgres), otherwise we would succeed. But we have no obvious > place where to ping for additional info (i.e. we do not even know what is the > datadir of that postmaster). Its interesting that being root and having PID of someone, you can send signals to the process, you can see, what files it has open, read its memory, kill it and much much more but you can not send it simple ping "hello, anybody out there?". But why can't you just have some additional info in the pid file. I.e. the datadir or path to some sort of ping-socket? > > Pavel
(In reply to Radek Hladik from comment #6) > If something before postgres would launch one child process more/less, we > could run into matching PIDs too. Right, sorry. The problem are not processes started _before_ postgres (and its sub-processes) but rather the processes started concurrently at the same time. This means that there is no _constant_ difference between PID_A and PID_B value among different reboots -- even with 'After=' construct. > > We have only 'PID' of the concurrent process. We know that it is some > > process under UID 26 (postgres), otherwise we would succeed. But we have > > no obvious place where to ping for additional info (i.e. we do not even > > know what is the datadir of that postmaster). > > Its interesting that being root and having PID of someone The postgres-owned process knows the PID of concurrent process (we do not parse the postmaster.pid under root). Well -- we could parse the postmaster.pid somehow within PreExec (or other root action), but it does not look like the best idea (systemd has had no idea about postmaster.pid so far and its fine because postgres/postmaster is the right process to take care of it). > but you can not send it simple ping "hello, anybody out there?". But why > can't you just have some additional info in the pid file. I.e. the datadir > or path to some sort of ping-socket? That all sounds like it could work but I'm not aware of some obvious and _portable_ way to do that. And taking the overall risk into account, it is probably better to let the admin do some manual steps in situations like power-outage-recovery.
This package has changed maintainer in Fedora. Reassigning to the new maintainer of this component.