Bug 800534 - During postgresql start() there is a sleep 2, this sleep is not long enough on slower systems
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: postgresql
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Pavel Raiskup
QA Contact: Jakub Prokes
Keywords: Patch, Reopened
Depends On:
Blocks: 1070830 1159824 1359256
Reported: 2012-03-06 11:44 EST by Eric Sammons
Modified: 2017-03-21 05:29 EDT

See Also:
Fixed In Version: postgresql-8.4.20-7.el6
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
RHEL 6.2 virtual KVM system; 2 vCPUs; 4G of RAM; RAW HDD storage defined as IDE; postgresql-8.4.9-1.el6_1.1.x86_64
Last Closed: 2017-03-21 05:29:15 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
proposed patch (973 bytes, patch)
2013-11-27 11:56 EST, Jozef Mlich

Description Eric Sammons 2012-03-06 11:44:22 EST
Description of problem:
When installing Katello or Headpin, postgresql restart is called. The 2-second sleep in the postgresql init script is not long enough on slower systems, and as a result Katello and Headpin fail to install.

How reproducible:
Every time

Steps to Reproduce:
1. yum install -y katello-headpin-all 
2. katello-configure --deployment=headpin

  
Actual results:
# katello-configure --deployment=headpin
Starting Katello configuration
The top-level log file is
[/var/log/katello/katello-configure-20120120-105724/main.log]
err: /Stage[main]/Postgres::Service/Service[postgresql]: Failed to call
refresh: Could not restart Service[postgresql]: Execution of '/sbin/service
postgresql restart' returned 1:  at
/usr/share/katello/install/puppet/modules/postgres/manifests/service.pp:6
err: /Stage[main]/Certs::Config/Exec[create-nss-db]/returns: change from notrun
to 0 failed: /bin/rm -f /etc/pki/katello/nssdb//*; certutil -N -d
'/etc/pki/katello/nssdb/' -f '/etc/katello/nss_db_password-file'; certutil -A
-d '/etc/pki/katello/nssdb/' -n 'ca' -t 'TCu,Cu,Tuw' -a -i
'/usr/share/katello/KATELLO-TRUSTED-SSL-CERT'; certutil -A -d
'/etc/pki/katello/nssdb/' -n 'broker' -t ',,' -a -i
'/etc/pki/tls/certs/qpid-broker.crt'; certutil -A -d '/etc/pki/katello/nssdb/'
-n 'tomcat' -t ',,' -a -i '/etc/pki/tls/certs/httpd-ssl.crt' returned 255
instead of one of [0] at
/usr/share/katello/install/puppet/modules/certs/manifests/config.pp:184
Creating Candlepin database user
############################################################ ... OK
Creating Candlepin database
############################################################ ... OK
Candlepin setup
###########################################################
  Failed, please check [/var/log/katello/katello-configure/cpsetup.log]
Creating Katello database user
############################################################ ... OK
Creating Katello database
############################################################ ... OK
err: /Stage[main]/Apache2/Exec[reload-apache2]: Failed to call refresh:
/etc/init.d/httpd reload returned 7 instead of one of [0] at
/usr/share/katello/install/puppet/modules/apache2/manifests/init.pp:14

Expected results:
Successful configuration

Additional info:
guest.xml

[snip]

<domain type='kvm'>
  <name>pinky</name>
  <uuid>82691438-ccfc-e3e0-3d3d-c1c98280edfa</uuid>
  <memory>4096000</memory>
  <currentMemory>4096000</currentMemory>
  <vcpu>2</vcpu>
  <os>
    <type arch='x86_64' machine='pc-0.14'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-kvm</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw'/>
      <source dev='/dev/vm.vg/pinky'/>
      <target dev='hda' bus='ide'/>
      <address type='drive' controller='0' bus='0' unit='0'/>
    </disk>
    <controller type='ide' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
[/snip]
Comment 1 Lukas Zapletal 2012-03-06 11:48:07 EST
Quick note - Eric meant the "sleep 2" line in the postgresql init script. For some reason, on a slow system postgres needs more than 2 seconds to start up. The start then fails and breaks our whole installation.

Upstream is already using systemd, but we might need this fixed in RHEL 6. Waiting for the pid file, or something similar, would be much nicer than a sleep. The quick workaround of setting this to 10 seconds does not look right and could cause more issues for others, I guess.
Comment 2 Tom Lane 2012-03-06 12:29:12 EST
It would take an awfully slow machine for 2 seconds to not be enough, which leads me to think there is some other problem here.
Comment 3 Lukas Zapletal 2012-03-07 09:15:36 EST
Well, postmaster has a flag (-w) that tells postmaster not to return until the db is initialized. We should be using that. We confirmed the original bug; changing the delay to 4 seconds resolved the problem.
Comment 4 Tom Lane 2012-03-07 09:36:31 EST
(In reply to comment #3)
> Well, postmaster has a flag (-w) that tells postmaster not to return until the
> db is initialized. We should be using that.

No, we shouldn't.  That would cause startup to block until the database server is actually ready to accept connections, which could be a very long time (minutes).

On the other hand, the time until the postmaster creates its PID file should be measured in milliseconds.  So I remain of the opinion that there's some unexplained problem here, and that changing the delay in the init script is only papering it over, not fixing it.
Comment 5 Lukas Zapletal 2012-03-07 09:47:31 EST
Is it feasible to change the sleep to a loop that would wait for the pid for 10 seconds? Something like (bash pseudo-code):

while [ ! -f xxx.pid ]; do sleep 1; done

So it would also work on I/O-overloaded guests?

Do you have any advice on how to implement a wait process that would return after postmaster accepts connections? We have a problem then - we need to start postgresql, wait until it's ready and then immediately seed its database.

Thanks
Comment 6 Tom Lane 2012-03-07 10:36:24 EST
(In reply to comment #5)
> Is it feasible to change the sleep to a loop that would wait for the pid for
> 10 seconds? Something like (bash pseudo-code):
> 
> while [ ! -f xxx.pid ]; do sleep 1; done

This is ignoring the question: what is really causing the problem?  In a SysV initscript world, there is no reason for an initscript operation to suddenly take orders of magnitude more time than usual, because the scripts are all serialized.  Without understanding the real problem there is no way to know how much time is appropriate to wait.

> Do you have any advice on how to implement a wait process that would return
> after postmaster accepts connections? We have a problem then - we need to start
> postgresql, wait until it's ready and then immediately seed its database.

pg_ctl -w does not do anything particularly magic, it just tries to connect to the database server and waits some more if that doesn't succeed.  I'd suggest the same in whatever else you're doing (although doing it in an initscript seems like a pretty bad idea).

ISTM that your life would be a whole lot easier with systemd, btw, where these constraints don't exist.
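
A minimal sketch of the connect-and-retry approach described above, assuming bash. The function name wait_for_ready and its timeout/interval arguments are hypothetical; this is not how pg_ctl itself is implemented, only an illustration of "try to connect, wait, try again" with an upper bound so the caller cannot block forever.

```shell
#!/bin/bash
# wait_for_ready TIMEOUT INTERVAL COMMAND [ARGS...]
# (hypothetical helper, not part of PostgreSQL or the init script)
# Re-runs COMMAND until it succeeds or TIMEOUT seconds have elapsed.
# Returns 0 as soon as COMMAND succeeds, 1 on timeout.
wait_for_ready() {
    local timeout=$1 interval=$2
    shift 2
    # SECONDS is a bash builtin counting seconds since the shell started.
    local deadline=$(( SECONDS + timeout ))
    while true; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        if (( SECONDS >= deadline )); then
            return 1
        fi
        sleep "$interval"
    done
}
```

A caller that needs the server to accept connections could then run something like `wait_for_ready 60 1 psql -U katellouser -c 'select 1' katelloschema` after starting the service; the user and database names are the ones mentioned later in this thread, and the 60-second limit is an arbitrary choice.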
Comment 7 Suzanne Yeghiayan 2012-03-07 16:28:54 EST
This request was evaluated by Red Hat Product Management for inclusion in the
current release of Red Hat Enterprise Linux. Because the affected component is
not scheduled to be updated in the current release, Red Hat is unfortunately
unable to address this request at this time.  It has been proposed for the next
release. If you would like it considered as an exception in the current
release, please ask your support representative.
Comment 8 Lukas Zapletal 2012-03-08 03:29:25 EST
Regarding your question - it is the virtualization that is causing the problem. Without it, even the slowest disk is usually fast enough to write a pid file, but on an overloaded hypervisor with the guest set to an IDE HDD controller it can take more than 2 seconds to write the file.

It's a pain for us because we are using Puppet for our installer, and if a single step in the whole dependency chain fails, everything is gone. I guess the only possible way for us is to add this error to the release notes. Changing the init script is not the best way; maybe we could change the Puppet part to use pg_ctl for postgres instead of the SysV init script.
Comment 11 Tom Lane 2012-03-08 11:55:16 EST
(In reply to comment #8)
> Regarding your question - it is the virtualization that is causing the
> problem. Without it, even the slowest disk is usually fast enough to write a
> pid file, but on an overloaded hypervisor with the guest set to an IDE HDD
> controller it can take more than 2 seconds to write the file.

Well, if we're talking about arbitrarily overloaded machines then it's difficult to believe that 4 seconds, or 10, or any reasonable-for-normal-startup number will be sufficient.

After re-reading the thread I realize that what we're talking about here is not normal system boot, though, but a scripted sequence of operations.  Would it be workable from your end to have a separate initscript command, say "service postgresql start-wait", that is willing to wait indefinitely for the server to come ready?  That would solve both of your issues while not affecting the bootup behavior.

This will still need to be revisited when you migrate to the systemd world, which won't support nonstandard service commands; but a more direct solution is possible there.
Comment 12 Lukas Zapletal 2012-03-08 15:53:05 EST
A command like that would be outstanding, but after I tried the pg_ctl -w command I found out that it does not support this option with the status command. Because I still want to start the server using the SysV init script, I need something that would just tell me "now it is running and you can connect". Even the status command only checks whether the pid is present. I guess I will need to add a special check to our installation process - something like:

  echo "show port" | psql -U katellouser katelloschema

in a loop or something. I just want to be sure it is a) running; b) accepting connections.

So even a new initscript command would not help me. I plan to close this bug with NOTABUG if there are no objections.
Comment 14 Lukas Zapletal 2012-03-12 06:10:45 EDT
Implementing explicit wait:

for i in {1..5}; do echo "select count(*) from pg_tables" | PGCONNECT_TIMEOUT=10 psql -U katellouser katelloschema -h localhost >/dev/null 2>&1 && break || sleep 5; done
Comment 15 Lukas Zapletal 2012-03-23 04:22:03 EDT
Tom,

after some research it turns out my wait code won't help, and we have to remove the failure call from the sysvinit script before installation - the init script still returns a non-zero exit code, which causes big issues. Our installer is written in Puppet, so the whole dependency chain fails.

I think this could bite other teams, and implementing the special option you described above would be nice. I'd also be happy with a start-fast option that would just remove the delay and return 1 when the pid does not exist. That would help too, because we are able to wait for the PostgreSQL socket in our code.
Comment 17 Tom Lavigne 2012-09-18 11:26:46 EDT
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unable to address this
request at this time.
    
Red Hat invites you to ask your support representative to
propose this request, if appropriate, in the next release of
Red Hat Enterprise Linux.
Comment 18 RHEL Product and Program Management 2013-10-13 20:46:40 EDT
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unable to address this
request at this time.

Red Hat invites you to ask your support representative to
propose this request, if appropriate, in the next release of
Red Hat Enterprise Linux.
Comment 19 Jozef Mlich 2013-11-27 11:56:25 EST
Created attachment 829823
proposed patch

The PID will be checked for 30 seconds. It is configurable via PG_START_WAIT_TIME variable.
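
For illustration only, a bounded wait of this kind could look roughly like the sketch below. This is not the content of the attached patch; the function name wait_for_pidfile is hypothetical, and only the PG_START_WAIT_TIME variable name and its 30-second default are taken from the description above.

```shell
#!/bin/bash
# wait_for_pidfile PIDFILE
# (hypothetical helper, not the code from the attached patch)
# Polls once per second until PIDFILE exists and is non-empty, giving up
# after PG_START_WAIT_TIME seconds (30 if the variable is unset or empty).
# Returns 0 if the pid file appeared in time, non-zero otherwise.
wait_for_pidfile() {
    local pidfile=$1
    local remaining=${PG_START_WAIT_TIME:-30}
    while (( remaining > 0 )); do
        # -s is true only for a file that exists and has a size above zero
        [ -s "$pidfile" ] && return 0
        sleep 1
        remaining=$(( remaining - 1 ))
    done
    # One last check after the final sleep.
    [ -s "$pidfile" ]
}
```

Compared with a fixed "sleep 2", a loop like this returns as soon as the pid file shows up on a fast machine and only spends the full timeout on a slow or overloaded one.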
Comment 21 Pavel Raiskup 2015-09-02 07:01:53 EDT
FTR, in 'rh-postgresql94' RHSCL-2.0 (RHEL6!) collection, there is
configuration env variable PGSTARTTIMEOUT that holds integer value in
seconds (defaults to 30).  Init script by default waits up to 30s for the
pidfile.  If you want to make the init script wait until PostgreSQL is
accepting connections, use the PGSTARTWAIT=1 environment variable - the
initscript will then wait $PGSTARTTIMEOUT seconds for full DB start.
Comment 38 errata-xmlrpc 2017-03-21 05:29:15 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0603.html
