663827 – postgres-8 resource agent does not detect a failed start of postgres server

Summary: postgres-8 resource agent does not detect a failed start of postgres server

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	rgmanager
Sub Component:
Version:	5.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Marek Grac
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	694816 917781

Reported:	2010-12-17 00:09 UTC by Magnus Glantz
Modified:	2013-03-04 18:33 UTC
CC List:	3 users
Fixed In Version:	rgmanager-2.0.52-18.el5
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	694816
Environment:
Last Closed:	2011-07-21 10:48:12 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
postgres-8.sh patch (12.05 KB, patch) 2010-12-17 00:09 UTC, Magnus Glantz	no flags
full version of patched postgres-8.sh script (8.32 KB, application/x-sh) 2010-12-17 00:17 UTC, Magnus Glantz	no flags
renamed metadata file for full version script (3.24 KB, application/octet-stream) 2010-12-17 00:19 UTC, Magnus Glantz	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:1000	0	normal	SHIPPED_LIVE	Low: rgmanager security, bug fix, and enhancement update	2011-07-21 10:43:18 UTC

Description Magnus Glantz 2010-12-17 00:09:24 UTC

Created attachment 469235 [details]
postgres-8.sh patch

Description of problem:
The postgres-8 resource agent does not detect if postgres fails to start.

Version-Release number of selected component (if applicable):
rgmanager-2.0.52-6.el5_5.8

How reproducible:
100% of the time. If postgresql.conf contains invalid configuration, or anything else prevents postgres from starting (for example, corrupt or missing binaries), postgres-8 will always think postgres was started successfully.

The reason is that the success or failure of the start is evaluated from whether "command &" returned 0 or not. "command &" always returns 0, because $? only reflects that the background job was launched, not how the command itself exited.

From postgres-8.sh:
---snipp---
su - "$OCF_RESKEY_postmaster_user" -c "$PSQL_POSTMASTER -c config_file=\"$PSQL_gen_config_file\" \
$OCF_RESKEY_postmaster_options" &> /dev/null &

if [ $? -ne 0 ]; then
    clog_service_start $CLOG_FAILED
    return $OCF_ERR_GENERIC
fi
---end snipp---

Steps to Reproduce:
1. run: will_never_work &
2. run: echo $?
3. observe that the return code is always 0.
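
The two steps above can be combined into one small script. This is only a minimal sketch for illustration, not part of the attached patch, and the variable names are made up. It also shows that `wait` does return the real exit status of a background job, but that only helps for commands that terminate, so it is not a fix for a long-running postmaster.

```shell
#!/bin/sh
# Start a command that does not exist, in the background.
will_never_work >/dev/null 2>&1 &
launch_status=$?   # status of launching the background job, not of the command
bg_pid=$!

# wait reports the exit status of the background job itself.
real_status=0
wait "$bg_pid" || real_status=$?

echo "launch_status=$launch_status real_status=$real_status"
```

Here launch_status is 0 even though the command could not be found, while real_status is non-zero (the shell uses 127 for "command not found").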

Steps to Reproduce:
1. mv /usr/bin/postgres /usr/bin/postgres.renamed
2. start postgres using the postgres-8 resource agent (clusvcadm -e mypostgres)
3. run: clustat -l / check: /var/log/messages and see that the postgres service is reported as started successfully.
  
Actual results:
The postgres-8 resource agent reports the postgres server started successfully when it did not.

Expected results:
postgres-8 resource agent detects the service failure and the cluster then tries to do something about it, like failing over to another node.

Additional info:
A possible patch, written and contributed by Michel Sijmons (@nibble-it.nl), is attached (tested and works for postgresql-server 8 and 9).
When pg_ctl is used to start the postgres server, a correct exit code can be fetched.
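
The idea of starting through pg_ctl could look roughly like the sketch below. It is not taken from the attached patch: the function name start_postgres_checked and the PG_CTL override are hypothetical, the real agent runs the server through su as $OCF_RESKEY_postmaster_user, and it is assumed that the installed pg_ctl supports "start -w -D".

```shell
#!/bin/sh
# Hypothetical sketch: start the server via pg_ctl so the exit status means something.
: "${PG_CTL:=pg_ctl}"   # assumed to be on PATH; can be overridden for testing
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

start_postgres_checked() {
    data_dir=$1
    # -D names the data directory; -w asks pg_ctl to wait for startup to finish.
    # Unlike "postmaster ... &", this runs in the foreground, so its exit
    # status can be checked directly.
    if "$PG_CTL" start -w -D "$data_dir" >/dev/null 2>&1; then
        return $OCF_SUCCESS
    fi
    return $OCF_ERR_GENERIC
}
```

A caller would then do something like "start_postgres_checked /var/lib/pgsql/data || return $OCF_ERR_GENERIC" instead of testing $? after a backgrounded command.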

The patch also includes a "more correct" way of shutting postgres down (SIGQUIT), which is also handled by Bug 587735.

The whole script is also attached (files: postgres-9.sh/metadata).

Comment 1 Magnus Glantz 2010-12-17 00:17:29 UTC

Created attachment 469236 [details]
full version of patched postgres-8.sh script

Comment 2 Magnus Glantz 2010-12-17 00:19:11 UTC

Created attachment 469237 [details]
renamed metadata file for full version script

Comment 3 Magnus Glantz 2010-12-17 00:19:42 UTC

A note for anyone evaluating this: the patch adds 1 second of sleep to allow pg_ctl to detect a started server. It needs to be evaluated whether this is enough in most scenarios, or whether it would be better to use pg_ctl to start up the server as well.

Also, pg_ctl currently needs to fetch "-D /path/to/pgsql/data" from $OCF_RESKEY_postmaster_options to work.
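
Fetching the data directory from the options string could be done along these lines. This is a sketch under assumptions, not the code from the patch: extract_data_dir is a hypothetical helper, and it deliberately relies on plain word splitting, so a data directory containing spaces would not be handled.

```shell
#!/bin/sh
# Hypothetical helper: print the argument of -D found in an options string.
extract_data_dir() {
    # Split the options string into words on purpose (no quoting here).
    set -- $1
    while [ $# -gt 0 ]; do
        case $1 in
            -D)
                # "-D /path": the directory is the next word, if there is one.
                if [ $# -ge 2 ]; then
                    printf '%s\n' "$2"
                    return 0
                fi
                return 1
                ;;
            -D?*)
                # "-D/path": the directory is attached to the flag.
                printf '%s\n' "${1#-D}"
                return 0
                ;;
        esac
        shift
    done
    return 1   # no -D option present
}
```

With this, something like data_dir=$(extract_data_dir "$OCF_RESKEY_postmaster_options") would yield the path, and a non-zero return would tell the agent that -D is missing from postmaster_options.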

Comment 6 Magnus Glantz 2010-12-17 18:51:02 UTC

Changed severity to high, as this bug renders postgres-8.sh dangerous to use: an end user may not notice the issue unless a postgres config/binary corruption scenario is tested.

Comment 7 Magnus Glantz 2011-01-03 14:16:02 UTC

Example cluster.conf extract:
postmaster_options="-D /path/to/pgsql/data" needs to be defined for the patch to work.

<resources>
<postgres-9 config_file="/etc/cluster/postgresql.conf" name="database" postmaster_options="-D /var/lib/pgsql/9.0/data" postmaster_user="postgres" shutdown_wait="5"/>
</resources>

<service autostart="1" name="testservice">
<postgres-9 ref="database"/>
</service>

Comment 8 Marek Grac 2011-03-07 15:24:36 UTC

@Magnus: 

Thanks for the generous amount of information and the proposed patch.

If there is a problem with the configuration, we are not able to detect it at start time, but we will find the problem at the first status check.

--
After patch review I found three possible problems:

1) stop_postgres(): I don't like that code duplication. If using stop_generic_sigkill (kill -TERM, wait, kill -QUIT) is not a good solution, then I would prefer to change the existing function so that it does not run 'kill -TERM' when stop_timeout = 0.

2) The part with ccs_fd=$(ccs_connect) ... get_service_ip_keys "$ccs_fd" $OCF_RESKEY_service_name is redundant, because if ccs is not working properly we are not able to obtain the IP address. A correct service configuration should be:

<service autostart="1" name="testservice">
<postgres-9 ref="database"/>
<ip addr="1.1.1.1" monitor_link="yes" />
</service>

so postgres can bind to the proper IP address(es).

3) Changing postmaster to pg_ctl is the most important change. No real objections, as it makes sense to change it. The timeout will have to be configured (1 second looks like a good default value). pg_ctl is ignoring the generated configuration file (that's why it is working for you even without an IP address); is this intentional?
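
The change proposed in point 1) could be sketched as follows. This is only an illustration of the idea, not the existing stop_generic_sigkill code: the function name stop_term_then_quit and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: skip 'kill -TERM' entirely when stop_timeout is 0.
stop_term_then_quit() {
    pid=$1
    stop_timeout=$2

    if [ "$stop_timeout" -gt 0 ]; then
        # Polite request first, then give the process time to exit.
        kill -TERM "$pid" 2>/dev/null || true
        waited=0
        while [ "$waited" -lt "$stop_timeout" ] && kill -0 "$pid" 2>/dev/null; do
            sleep 1
            waited=$((waited + 1))
        done
    fi

    # Escalate only if the process still exists; with stop_timeout = 0
    # this is the only signal that is sent.
    if kill -0 "$pid" 2>/dev/null; then
        kill -QUIT "$pid" 2>/dev/null || true
    fi
    return 0
}
```

For postgres, SIGTERM requests a smart shutdown and SIGQUIT an immediate one, so a stop_timeout of 0 would go straight to the immediate shutdown.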

I can make the modifications for 1) and 3) once you agree on them. Mainly 1), as I'm not a postgres expert.

Thanks

Comment 9 Magnus Glantz 2011-03-07 19:56:11 UTC

Marek,

1) So, if we have "stop_timeout = 0" then "kill -QUIT" will be issued? If so, that's fine.

3) A 1 second timeout is a good default value. I'm not sure I understand the point about pg_ctl ignoring the generated configuration file. Can you please elaborate on this?

Comment 10 Marek Grac 2011-03-14 10:27:52 UTC

@Magnus:

3) All resource agents should be able to run several instances of an application on the same server (e.g. 2x apache + 3x postgres on 1 node). As applications usually do not accept configuration options (configuration file, lock file, ...) directly, we have to 'patch' the existing configuration and create a new one, which is then used for the application.

Comment 13 errata-xmlrpc 2011-07-21 10:48:12 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1000.html
