1295830 – Increase timeout defaults for pacemaker

Summary: Increase timeout defaults for pacemaker

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-heat-templates
Sub Component:
Version:	8.0 (Liberty)
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	ga
Target Release:	8.0 (Liberty)
Assignee:	Marios Andreou
QA Contact:	Giulio Fidente
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1295835

Reported:	2016-01-05 15:03 UTC by Michele Baldessari
Modified:	2016-04-07 21:44 UTC
CC List:	7 users
Fixed In Version:	openstack-tripleo-heat-templates-0.8.8-1.el7ost
Doc Type:	Bug Fix
Doc Text:	Pacemaker used a 100s timeout for service resources. However, systemd can need up to two full timeout periods to stop a service: one before it sends a SIGTERM and another before it sends a SIGKILL. This fix increases the Pacemaker timeout to 200s to cover two full systemd timeout periods. The timeout is now long enough for systemd to perform a SIGTERM and then a SIGKILL.
Clone Of:
Clones:	1295835
Environment:
Last Closed:	2016-04-07 21:44:39 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1531204	None	None	None	2016-01-05 16:39:18 UTC
OpenStack gerrit	263751	None	None	None	2016-01-05 16:39:47 UTC
OpenStack gerrit	272026	None	None	None	2016-01-25 11:46:54 UTC
Red Hat Product Errata	RHEA-2016:0604	normal	SHIPPED_LIVE	Red Hat OpenStack Platform 8 director Enhancement Advisory	2016-04-08 01:03:56 UTC

Description Michele Baldessari 2016-01-05 15:03:47 UTC

Via BZ https://bugzilla.redhat.com/show_bug.cgi?id=1275324 we increased the 
stop timeout to 100 seconds.

Initially, the 100s recommendation came from the DefaultTimeoutStopSec=90s
setting in /etc/systemd/system.conf. I believe the 120s recommendation
(https://bugzilla.redhat.com/show_bug.cgi?id=1275324#c15) came from
anecdotal evidence observed during test runs, but it was lost in the noise of the above BZ.
                                                        
So I took a look at the RHEL 7.2 systemd source and noticed that the correct formula is actually:
DefaultTimeoutStopSec * 2 + <scheduling-delta*>

* I assume we need a bit of time to make sure that systemd is scheduled,
that it sends a SIGKILL, and that everything (mainly process structures) is
gone. Five seconds seems quite reasonable (if systemd does not get to run
within 5 seconds, you likely have other issues anyway).
                                                       
This is because of the following code in src/core/service.c:

static int service_dispatch_timer(sd_event_source *source, usec_t usec, void *userdata) {
...
case SERVICE_STOP_SIGTERM:
        if (s->kill_context.send_sigkill) {
                log_unit_warning(UNIT(s)->id, "%s stop-sigterm timed out. Killing.", UNIT(s)->id);
                service_enter_signal(s, SERVICE_STOP_SIGKILL, SERVICE_FAILURE_TIMEOUT);
        } else {
                log_unit_warning(UNIT(s)->id, "%s stop-sigterm timed out.  Skipping SIGKILL.", UNIT(s)->id);
                service_enter_stop_post(s, SERVICE_FAILURE_TIMEOUT);
        }
        break;
...

The man page seems to confirm that systemd will wait one timeout span
for the initial stop request. It will then send a SIGTERM and wait for another
timeout span and, if the service is still around, send a SIGKILL.
"""
TimeoutStopSec=                                                             
   Configures the time to wait for stop. If a service is asked to stop      
   but does not terminate in the specified time, it will be terminated      
   forcibly via SIGTERM, and after another delay of this time with          
   SIGKILL (See KillMode= in systemd.kill(5)). Takes a unit-less value      
   in seconds, or a time span value such as "5min 20s". Pass 0 to           
   disable the timeout logic. Defaults to TimeoutStartSec= in manager       
   configuration file.                                                      
"""
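
The sequence described above can be sketched as a small timeline. This is only an illustration of the reading given in this comment, for the worst case where the service never exits on its own; the function name stop_timeline and the event labels are made up for the example and do not come from systemd.

```python
def stop_timeline(timeout_stop_sec=90):
    """Return (seconds since stop request, event) pairs for a service
    that never exits, following the reading of TimeoutStopSec= above.

    Assumption: the default of 90 mirrors DefaultTimeoutStopSec=90s.
    """
    return [
        (0, "stop requested"),
        (timeout_stop_sec, "SIGTERM sent"),      # first timeout span elapsed
        (2 * timeout_stop_sec, "SIGKILL sent"),  # second timeout span elapsed
    ]


for offset, event in stop_timeline():
    print(f"t+{offset}s: {event}")
```

With the default of 90 seconds, the SIGKILL is only sent 180 seconds after the stop request, well past a 100 second Pacemaker timeout.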

This also matches what we have seen: services still around after more than 100 seconds.

So we need to change the pcs default timeout according to the following formula:
DefaultTimeoutStopSec * 2 + X = 180 + X ~= 185s

This is under the assumption that DefaultTimeoutStopSec in system.conf is
left at the RHEL default of 90 seconds.
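
The arithmetic above can be written as a small helper. This is a minimal sketch under the assumptions stated in this report (DefaultTimeoutStopSec of 90 seconds, a scheduling delta of 5 seconds); the function name min_pacemaker_stop_timeout is hypothetical and is not part of pcs or pacemaker.

```python
def min_pacemaker_stop_timeout(default_timeout_stop_sec=90, scheduling_delta=5):
    """Smallest stop timeout (in seconds) that outlasts systemd's
    SIGTERM and SIGKILL timeout spans plus a scheduling margin.

    Both defaults are assumptions taken from this report:
    90s is the RHEL DefaultTimeoutStopSec, 5s is the suggested delta.
    """
    if default_timeout_stop_sec < 0 or scheduling_delta < 0:
        raise ValueError("timeouts must be non-negative")
    return 2 * default_timeout_stop_sec + scheduling_delta


required = min_pacemaker_stop_timeout()
print(required)         # 185
print(100 >= required)  # False: the old 100s timeout is too short
print(200 >= required)  # True: a 200s timeout covers both spans
```

If DefaultTimeoutStopSec is changed in system.conf, the required value changes with it, so the helper takes it as a parameter rather than hard-coding 185.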

Since pacemaker will fence a node when a service fails to stop within the
configured timeout, this change should avoid most of the spurious fencing
events that occurred when a service was still around after the old 100-second timeout.

Comment 1 Hugh Brock 2016-01-05 15:24:45 UTC

Reassigning to Marios since he's already working on it.

Comment 4 Giulio Fidente 2016-03-29 13:32:42 UTC

In openstack-tripleo-heat-templates-0.8.12-2.el7ost.noarch.rpm all the systemd resources have a start and stop timeout set to 200s (except mongodb which uses an even higher start timeout, 370s).

Note that non-systemd resources (like galera, redis, rabbitmq or IPs) will use shorter timeouts, as defined in the respective resource agents.

Comment 6 errata-xmlrpc 2016-04-07 21:44:39 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0604.html
