Bug 800534
Summary: During postgresql start() there is a sleep 2, this sleep is not long enough on slower systems
Product: Red Hat Enterprise Linux 6
Reporter: Eric Sammons <esammons>
Component: postgresql
Assignee: Pavel Raiskup <praiskup>
Status: CLOSED ERRATA
QA Contact: Jakub Prokes <jprokes>
Severity: high
Priority: high
Version: 6.4
CC: databases-maint, hhorak, inecas, jprokes, lzap, pkubat, praiskup, psklenar, socketpair
Target Milestone: rc
Keywords: Patch, Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: postgresql-8.4.20-7.el6
Doc Type: No Doc Update
Story Points: ---
Environment:
    RHEL 6.2
    Virt KVM system
    2 vCPUs
    4G of RAM
    Defined RAW HDD storage as IDE
    postgresql-8.4.9-1.el6_1.1.x86_64
Last Closed: 2017-03-21 09:29:15 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Blocks: 1070830, 1159824, 1359256

Description
Eric Sammons
2012-03-06 16:44:22 UTC
Quick note - Eric meant the "sleep 2" line in the postgresql init script. For some reason (on a slow system) postgres needs more than 2 seconds to start up. Then it fails and breaks our whole installation. Upstream is already using systemd, but we might need this to be fixed in RHEL6. Waiting for the PID file or something similar would be much nicer than a sleep. A quick workaround of setting this to 10 seconds does not look right and could cause more issues for others, I guess.

It would take an awfully slow machine for 2 seconds to not be enough, which leads me to think there is some other problem here.

Well, postmaster has a flag (-w) that tells postmaster not to return until the db is initialized. We should be using that. We confirmed the original bug; changing the delay to 4 seconds helped to resolve the problem.

(In reply to comment #3)
> Well, postmaster has a flag (-w) that tells postmaster not to return until the
> db is initialized. We should be using that.

No, we shouldn't. That would cause startup to block until the database server is actually ready to accept connections, which could be a very long time (minutes). On the other hand, the time until the postmaster creates its PID file should be measured in milliseconds. So I remain of the opinion that there's some unexplained problem here, and that changing the delay in the init script is only papering it over, not fixing it.

Is it feasible to change the sleep to a loop that would wait for the PID file for up to 10 seconds? Something like (bash pseudo-code):

    while [ ! -f xxx.pid ]; do sleep 1; done

So it would also work on IO-overloaded guests? Do you have any advice on how to implement a wait process that would return after postmaster accepts connections? We have a problem then - we need to start postgresql, wait until it's ready and then immediately seed its database. Thanks

(In reply to comment #5)
> Is it feasible to change the sleep to a loop that would wait for the PID file
> for up to 10 seconds? Something like (bash pseudo-code):
>
> while [ ! -f xxx.pid ]; do sleep 1; done

This is ignoring the question: what is really causing the problem? In a SysV initscript world, there is no reason for an initscript operation to suddenly take orders of magnitude more time than usual, because the scripts are all serialized. Without understanding the real problem there is no way to know how much time is appropriate to wait.

> Do you have any advice on how to implement a wait process that would return
> after postmaster accepts connections? We have a problem then - we need to start
> postgresql, wait until it's ready and then immediately seed its database.

pg_ctl -w does not do anything particularly magic, it just tries to connect to the database server and waits some more if that doesn't succeed. I'd suggest the same in whatever else you're doing (although doing it in an initscript seems like a pretty bad idea). ISTM that your life would be a whole lot easier with systemd, btw, where these constraints don't exist.

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. It has been proposed for the next release. If you would like it considered as an exception in the current release, please ask your support representative.

Regarding your question - it's the virtualization that is causing the problem. Without it, even the slowest disks are usually fast enough to write a PID file, but in this case, with overloaded hypervisors and a guest set to an IDE HDD controller, it can take more than 2 seconds to write the file. It's a pain for us because we are using puppet for our installer and if a single step in the whole dependency chain fails, everything is gone. I guess the only possible way for us is to add this error to the release notes.
Changing the init script is not the best way; maybe we should change the puppet part to use pg_ctl for postgres instead of the sysv init script.

(In reply to comment #8)
> Regarding your question - it's the virtualization that is causing the problem.
> Without it, even the slowest disks are usually fast enough to write a PID
> file, but in this case, with overloaded hypervisors and a guest set to an IDE
> HDD controller, it can take more than 2 seconds to write the file.

Well, if we're talking about arbitrarily overloaded machines then it's difficult to believe that 4 seconds, or 10, or any reasonable-for-normal-startup number will be sufficient. After re-reading the thread I realize that what we're talking about here is not normal system boot, though, but a scripted sequence of operations. Would it be workable from your end to have a separate initscript command, say "service postgresql start-wait", that is willing to wait indefinitely for the server to come ready? That would solve both of your issues while not affecting the bootup behavior. This will still need to be revisited when you migrate to the systemd world, which won't support nonstandard service commands; but a more direct solution is possible there.

A command like that would be outstanding, but after I tried the pg_ctl -w command I found out it does not support this option with the status command. Because I still want to start the server using the sysv init script, I need something that would just tell me "now it is running and you can connect". Even the status command only reports whether the PID is present. I guess I will need to add a special check to our installation process - something like: echo "show port" | psql -U katellouser katelloschema in a loop or something. I just want to be sure it is a) running; b) accepting connections. So even a new initscript command would not help me. I plan to close this bug with NOTABUG if there are no objections.

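
The "check in a loop until it is running and accepting connections" idea from the comment above can be sketched roughly as follows. This is only an illustration, not code from the installer or from the init script: the function name retry_until_ready and its argument order are made up for this example, and it assumes bash (it relies on the SECONDS builtin).

```shell
#!/bin/bash
# Hypothetical helper (name and interface are assumptions for this sketch).
# Usage: retry_until_ready TIMEOUT INTERVAL COMMAND [ARGS...]
# Re-runs COMMAND until it exits 0 or TIMEOUT seconds have elapsed.
# Returns 0 as soon as COMMAND succeeds, 1 if the deadline passes first.
retry_until_ready() {
    local timeout=$1 interval=$2
    shift 2
    # SECONDS is a bash builtin counting seconds since the shell started.
    local deadline=$((SECONDS + timeout))
    until "$@" >/dev/null 2>&1; do
        [ "$SECONDS" -ge "$deadline" ] && return 1
        sleep "$interval"
    done
    return 0
}

# Possible use with the probe from this thread (user and database names are
# the ones quoted above; PGCONNECT_TIMEOUT is the libpq connect timeout):
#   retry_until_ready 60 5 env PGCONNECT_TIMEOUT=10 \
#       psql -U katellouser -h localhost -c 'select 1' katelloschema
```

Because the helper returns non-zero on timeout, a caller such as a Puppet exec can tell "server never came up" apart from "server is ready" without parsing any output.
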
Implementing explicit wait:

    for i in {1..5}; do
        echo "select count(*) from pg_tables" | PGCONNECT_TIMEOUT=10 psql -U katellouser katelloschema -h localhost >/dev/null 2>&1 && break || sleep 5
    done

Tom, after some research it turns out my wait code won't help and we have to remove the failure call from the sysvinit script before installation - the init script is still returning a non-zero exit code, which is causing big issues. Our installer is written in Puppet, so the whole dependency chain fails. I think this could bite other teams, and implementing the special option you have described above would be nice. I'd also be happy with a start-fast option that would just remove the delay and return 1 when the PID does not exist. That would help too, because we are able to wait for the PostgreSQL socket in our code.

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate, in the next release of Red Hat Enterprise Linux.

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate, in the next release of Red Hat Enterprise Linux.

Created attachment 829823 [details]
proposed patch
The PID will be checked for 30 seconds. It is configurable via PG_START_WAIT_TIME variable.
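
For readers without access to the attachment, a bounded wait for the PID file could look something like the sketch below. It is not the attached patch: the function name wait_for_pidfile is hypothetical, and the only detail taken from this report is that the wait defaults to 30 seconds and can be overridden through PG_START_WAIT_TIME.

```shell
#!/bin/bash
# Hypothetical helper, NOT the code from attachment 829823.
# Usage: wait_for_pidfile PIDFILE [TIMEOUT_SECONDS]
# Polls once per second until PIDFILE exists and its first line names a
# live process. TIMEOUT defaults to $PG_START_WAIT_TIME, or 30 if unset.
# Returns 0 when the process is found, 1 when the timeout expires.
wait_for_pidfile() {
    local pidfile=$1
    local timeout=${2:-${PG_START_WAIT_TIME:-30}}
    local waited=0 pid

    while :; do
        if [ -s "$pidfile" ]; then
            # The first line of the file is expected to hold the PID.
            read -r pid < "$pidfile"
            # kill -0 only tests for existence; it sends no signal.
            # Assumes the caller may signal the process (e.g. runs as root).
            if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
                return 0
            fi
        fi
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}

# Possible use (the path is an assumption about the data directory layout):
#   wait_for_pidfile /var/lib/pgsql/data/postmaster.pid || echo "start failed"
```

Compared with a fixed "sleep 2", this returns as soon as the PID file shows up on a fast machine and only spends the full timeout on a host that is genuinely stuck.
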

FTR, in the 'rh-postgresql94' RHSCL-2.0 (RHEL6!) collection, there is a configuration environment variable PGSTARTTIMEOUT that holds an integer value in seconds (defaults to 30). By default the init script waits up to 30s for the pidfile. If you want to make the init script wait until PostgreSQL is accepting connections, set the PGSTARTWAIT=1 environment variable - the initscript will then wait up to $PGSTARTTIMEOUT seconds for full DB start.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0603.html