Created attachment 469235 [details]
postgres-8.sh patch

Description of problem:
The postgres-8 resource agent does not detect if postgres fails to start.

Version-Release number of selected component (if applicable):
rgmanager-2.0.52-6.el5_5.8

How reproducible:
100% of the time.

If postgresql.conf contains invalid configuration, or something else prevents postgres from starting (for example, corrupt or missing binaries), postgres-8 will always think postgres was started successfully. The reason is that the success or failure of the start is evaluated by checking whether "command &" returned 0. "command &" will always return 0.

From postgres-8.sh:
---snipp---
su - "$OCF_RESKEY_postmaster_user" -c "$PSQL_POSTMASTER -c config_file=\"$PSQL_gen_config_file\" \
        $OCF_RESKEY_postmaster_options" &> /dev/null &

if [ $? -ne 0 ]; then
        clog_service_start $CLOG_FAILED
        return $OCF_ERR_GENERIC
fi
---end snipp---

Steps to Reproduce (shell behaviour):
1. run: will_never_work &
2. run: echo $?
3. observe that the return code is always 0.

Steps to Reproduce (resource agent):
1. mv /usr/bin/postgres /usr/bin/postgres.renamed
2. start postgres using the postgres-8 resource agent (clusvcadm -e mypostgres)
3. run: clustat -l / check: /var/log/messages and see that the postgres service is reported as started successfully.

Actual results:
The postgres-8 resource agent reports that the postgres server started successfully when it did not.

Expected results:
The postgres-8 resource agent detects the service failure and the cluster then tries to do something about it, like failing over to another node.

Additional info:
A possible patch, written and contributed by Michel Sijmons (@nibble-it.nl), is attached (tested and works for postgresql-server 8 and 9). By using pg_ctl to start the postgres server, a correct exit code can be fetched. The patch also includes a "more correct" way of shutting postgres down (SIGQUIT), which is also handled by Bug 587735. The whole script is also attached (files: postgres-9.sh/metadata).
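To illustrate the point about "command &" outside of the resource agent, here is a minimal shell sketch. It is not taken from postgres-8.sh or from the attached patch; the variable names are only illustrative. It shows that $? right after a backgrounded command reflects the launch, while the real exit status only becomes available through wait.

```shell
#!/bin/sh
# Launch a command that does not exist, in the background.
will_never_work >/dev/null 2>&1 &
launch_status=$?   # status of putting the job in the background, not of the command
bg_pid=$!

# wait returns the actual exit status of the background job.
wait "$bg_pid"
real_status=$?

echo "launch_status=$launch_status real_status=$real_status"
```

Note that wait only helps for commands that terminate. A server that stays in the foreground would block wait for as long as it runs, which is presumably why the patch relies on pg_ctl instead.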
Created attachment 469236 [details] full version of patched postgres-8.sh script
Created attachment 469237 [details] renamed metadata file for full version script
A note for anyone evaluating this: the patch adds 1 second of sleep to allow pg_ctl to detect a started server. It needs to be evaluated whether this is enough in most scenarios, or whether it would be better to use pg_ctl to start up the server as well. Also, pg_ctl currently needs to fetch "-D /path/to/pgsql/data" from $OCF_RESKEY_postmaster_options to work.
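One alternative to a fixed 1 second sleep would be to poll a status check until it succeeds or a timeout expires. The following is only a sketch of that idea: wait_for_start is a hypothetical helper, not part of the patch or of rgmanager, and the check command is passed in by the caller.

```shell
#!/bin/sh
# Hypothetical helper: run a check command once per second until it
# succeeds or the timeout (in seconds) is used up.
# $1 = timeout in seconds, remaining arguments = check command
wait_for_start() {
    timeout=$1
    shift
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0    # check succeeded, server considered started
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1            # check never succeeded within the timeout
}
```

In the agent, the check could for example be "pg_ctl status -D <datadir>" run as the postmaster user; that usage is an assumption here and would need to be verified against the pg_ctl versions shipped with postgresql-server 8 and 9.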
Changed severity to high, as this bug renders postgres-8.sh dangerous to use: an end user may not notice the issue unless a postgres config/binary corruption scenario is explicitly tested.
Example cluster.conf extract:

postmaster_options="-D /path/to/pgsql/data" needs to be defined for the patch to work.

<resources>
        <postgres-9 config_file="/etc/cluster/postgresql.conf" name="database" postmaster_options="-D /var/lib/pgsql/9.0/data" postmaster_user="postgres" shutdown_wait="5"/>
</resources>
<service autostart="1" name="testservice">
        <postgres-9 ref="database"/>
</service>
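Since pg_ctl needs the data directory from postmaster_options, the extraction could look roughly like the sketch below. get_datadir is a hypothetical function written for illustration; it is not the code from the attached patch, and it does not handle paths that contain spaces.

```shell
#!/bin/sh
# Hypothetical helper: print the value of -D from an option string.
# Accepts both "-D /path" and "-D/path". Returns 1 if no -D is found.
get_datadir() {
    # Intentionally unquoted: split the option string into words.
    set -- $1
    while [ $# -gt 0 ]; do
        case "$1" in
            -D)
                if [ $# -ge 2 ]; then
                    printf '%s\n' "$2"
                    return 0
                fi
                return 1
                ;;
            -D?*)
                printf '%s\n' "${1#-D}"
                return 0
                ;;
        esac
        shift
    done
    return 1
}

get_datadir "-D /var/lib/pgsql/9.0/data"
```

With the postmaster_options value from the extract above, this prints /var/lib/pgsql/9.0/data.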
@Magnus: Thanks for the generous amount of information and the proposed patch.

If there is a problem with the configuration, we are not able to detect it at start time, but we will find the problem at the first status check.

After reviewing the patch I found three possible problems:

1) stop_postgres()
I don't like the code duplication. If using stop_generic_sigkill (kill -TERM, wait, kill -QUIT) is not a good solution, then I would prefer to change the existing function so that it does not run 'kill -TERM' when stop_timeout = 0.

2) The part with ccs_fd=$(ccs_connect) ... get_service_ip_keys "$ccs_fd" $OCF_RESKEY_service_name
This is redundant, because if ccs is not working properly we are not able to obtain the IP address. The correct service configuration should be:

<service autostart="1" name="testservice">
        <postgres-9 ref="database"/>
        <ip addr="1.1.1.1" monitor_link="yes" />
</service>

so that postgres can bind to the proper IP address(es).

3) Changing postmaster to pgctl
This is the most important change. No real objections, as it makes sense to change it. The timeout will have to be configurable (1 second looks like a good default value). pgctl is ignoring the generated configuration file (that's why it is working for you even without an IP address); is this intentional?

I can make the modifications for 1) and 3) after you agree on them. Mainly 1), as I'm not a postgres expert.

Thanks
Marek,

1) So, if we have "stop_timeout = 0", then "kill -QUIT" will be issued? If so, that's fine.

3) A 1 second timeout is a good default value. I am not sure I understand "pgctl is ignoring generated configuration file". Can you please elaborate on this?
@Magnus:

3) All resource agents should be able to run several instances of an application on the same server (e.g. 2x apache + 3x postgres on 1 node). As applications usually do not accept such configuration options (configuration file, locking file, ...) directly, we have to 'patch' the existing configuration and create a new one, which is then used for the application.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1000.html