Bug 1378391
| Summary: | [RFE] Increase 'stop timeout' values of pacemaker resources on overcloud | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Chaitanya Shastri <cshastri> |
| Component: | puppet-tripleo | Assignee: | Michele Baldessari <michele> |
| Status: | CLOSED ERRATA | QA Contact: | Asaf Hirshberg <ahirshbe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 7.0 (Kilo) | CC: | cshastri, dmacpher, emacchi, fdinitto, jcoufal, jjoyce, jschluet, mburns, michele, oblaut, pkomarov, rcernin, rhel-osp-director-maint, royoung, slinaber, tvignaud |
| Target Milestone: | rc | Keywords: | FutureFeature, Triaged |
| Target Release: | 10.0 (Newton) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | puppet-tripleo-5.3.0-1.el7ost | Doc Type: | Bug Fix |
| Doc Text: | Both Redis and RabbitMQ had start and stop timeouts of 120s in Pacemaker. In some environments this was not enough and caused restarts to fail. This fix increases the timeout to 200s, which is the same value used for the other, systemd-based resources. Redis and RabbitMQ should now have enough time to restart in the majority of environments. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-12-14 16:03:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
I've got a workaround with Heat templates:

[stack@undercloud ~]$ cat ~/heat/templates/post_config_env.yaml
resource_registry:
  OS::TripleO::NodeExtraConfigPost: /home/stack/heat/templates/post_deploy/rabbitmq-op-params.yaml

[stack@undercloud ~]$ cat ~/heat/templates/post_deploy/rabbitmq-op-params.yaml
heat_template_version: 2014-10-16

description: >
  Example extra config for post-deployment

parameters:
  servers:
    type: json

resources:
  ExtraConfig:
    type: OS::Heat::SoftwareConfig
    properties:
      group: script
      config: |
        #!/bin/sh
        pcs cluster status
        retval=$?
        if [ $retval -eq 0 ]; then
          pcs resource update rabbitmq op stop timeout=300
        fi

  ExtraDeployments:
    type: OS::Heat::SoftwareDeployments
    properties:
      servers: {get_param: servers}
      config: {get_resource: ExtraConfig}
      actions: ['CREATE','UPDATE'] # Do this on CREATE, UPDATE

openstack overcloud deploy --templates ... -e $HOME/heat/templates/post_config_env.yaml
Resulted in:
[heat-admin@overcloud-controller-0 ~]$ sudo pcs config | grep rabbitmq -A7
Clone: rabbitmq-clone
Meta Attrs: ordered=true interleave=true
Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all"}"
Meta Attrs: notify=true
Operations: start interval=0s timeout=100 (rabbitmq-start-interval-0s)
monitor interval=10 timeout=40 (rabbitmq-monitor-interval-10)
stop interval=0s timeout=300 (rabbitmq-stop-interval-0s)
Clone: openstack-core-clone
Meta Attrs: interleave=true
Resource: openstack-core (class=ocf provider=heartbeat type=Dummy)
Operations: start interval=0s timeout=20 (openstack-core-start-interval-0s)
stop interval=0s timeout=20 (openstack-core-stop-interval-0s)
monitor interval=10 timeout=20 (openstack-core-monitor-interval-10)
Resource: ip-172.16.21.10 (class=ocf provider=heartbeat type=IPaddr2)
--
start rabbitmq-clone then start openstack-core-clone (kind:Mandatory) (id:order-rabbitmq-clone-openstack-core-clone-mandatory)
promote galera-master then start openstack-core-clone (kind:Mandatory) (id:order-galera-master-openstack-core-clone-mandatory)
start openstack-core-clone then start openstack-gnocchi-metricd-clone (kind:Mandatory) (id:order-openstack-core-clone-openstack-gnocchi-metricd-clone-mandatory)
start neutron-l3-agent-clone then start neutron-metadata-agent-clone (kind:Mandatory) (id:order-neutron-l3-agent-clone-neutron-metadata-agent-clone-mandatory)
start openstack-core-clone then start openstack-nova-consoleauth-clone (kind:Mandatory) (id:order-openstack-core-clone-openstack-nova-consoleauth-clone-mandatory)
start haproxy-clone then start openstack-core-clone (kind:Mandatory) (id:order-haproxy-clone-openstack-core-clone-mandatory)
start neutron-ovs-cleanup-clone then start neutron-netns-cleanup-clone (kind:Mandatory) (id:order-neutron-ovs-cleanup-clone-neutron-netns-cleanup-clone-mandatory)
start openstack-ceilometer-notification-clone then start openstack-heat-api-clone (kind:Mandatory) (id:order-openstack-ceilometer-notification-clone-openstack-heat-api-clone-mandatory)
--
overcloud-controller-0: rmq-node-attr-last-known-rabbitmq=rabbit@overcloud-controller-0
overcloud-controller-1: rmq-node-attr-last-known-rabbitmq=rabbit@overcloud-controller-1
overcloud-controller-2: rmq-node-attr-last-known-rabbitmq=rabbit@overcloud-controller-2
[heat-admin@overcloud-controller-0 ~]$
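The same approach could presumably be extended to raise the redis stop timeout as well (reported elsewhere in this bug as 120s by default). A hedged variation on the script config in the template above, not part of the verified workaround:

#!/bin/sh
# Variation (assumption): also bump the redis stop timeout when the
# cluster is reachable from this node.
pcs cluster status
retval=$?
if [ $retval -eq 0 ]; then
  pcs resource update rabbitmq op stop timeout=300
  pcs resource update redis op stop timeout=300
fi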
Thanks Robin. This works for me.

There are actually a couple of issues at hand here when we talk about "The overcloud deployment sometimes fails due to this with the error". Namely:

1) A normal deployment should not even call the following code (but this code predates me, so I need to double-check the intentions here with Jiri and Marios):

03:19:48 + grep haproxy-clone
03:19:48 + pcs resource restart haproxy-clone
03:19:48 + pcs resource restart redis-master
03:19:48 + pcs resource restart mongod-clone
03:19:48 + pcs resource restart rabbitmq-clone

The reason we call this code is, I believe, that we call /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/pacemaker_resource_restart.sh on deployment, while we should only invoke it when doing an "overcloud stack update". Let's track issue 1) here: https://bugzilla.redhat.com/show_bug.cgi?id=1384068

2) The second issue, which we can track in this BZ, is the timeouts for rabbitmq (and potentially any other service that does not use the systemd OCF provider).

The reason for the 200s value is that the default systemd timeout is 90s: pacemaker asks systemd to stop the service, waits 90s and sends SIGTERM, waits another 90s and sends SIGKILL. So with 90s*2 + ~20s we are basically guaranteed that the service is really stopped (unless the process is stuck in kernel space) and pacemaker won't fail any stop actions. With the non-systemd resources it is more a matter of trial and error to find a reasonable timeout. The current status is the following:

* Liberty/Mitaka/Newton are all the same

- redis
  start interval=0s timeout=120 (redis-start-interval-0s)
  stop interval=0s timeout=120 (redis-stop-interval-0s)
  monitor interval=45 timeout=60 (redis-monitor-interval-45)
  monitor interval=20 role=Master timeout=60 (redis-monitor-interval-20)
  monitor interval=60 role=Slave timeout=60 (redis-monitor-interval-60)
  promote interval=0s timeout=120 (redis-promote-interval-0s)
  demote interval=0s timeout=120 (redis-demote-interval-0s)

- galera
  start interval=0s timeout=120 (galera-start-interval-0s)
  stop interval=0s timeout=120 (galera-stop-interval-0s)
  monitor interval=20 timeout=30 (galera-monitor-interval-20)
  monitor interval=10 role=Master timeout=30 (galera-monitor-interval-10)
  monitor interval=30 role=Slave timeout=30 (galera-monitor-interval-30)
  demote interval=0s timeout=120 (galera-demote-interval-0s)
  promote interval=0s timeout=300s on-fail=block (galera-promote-interval-0s)

- rabbitmq
  start interval=0s timeout=100 (rabbitmq-start-interval-0s)
  stop interval=0s timeout=90 (rabbitmq-stop-interval-0s)
  monitor interval=10 timeout=40 (rabbitmq-monitor-interval-10)

So I am okay with increasing both redis and rabbitmq to roughly double their current values for the start/stop actions (I will comment on the associated review). I discussed with Mike Bayer whether we should increase galera as well; he has never had a single report of galera timing out, so let's leave that alone.

This has been merged for Newton via: https://review.openstack.org/#/c/386618/
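For reference, the merged change raises the start and stop timeouts for redis and rabbitmq to 200s. On an already-deployed overcloud, an equivalent adjustment could presumably be made by hand with pcs from one controller; a sketch of manual commands mirroring the fixed values, not the actual puppet-tripleo implementation:

# Assumption: run as root on any one controller; the CIB change applies cluster-wide.
pcs resource update rabbitmq op start timeout=200s
pcs resource update rabbitmq op stop timeout=200s
pcs resource update redis op start timeout=200s
pcs resource update redis op stop timeout=200s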
Verified (based on comment #6). The timeouts have been doubled for redis and rabbitmq.

[stack@serverx tmp]$ rpm -qa|grep puppet-tripleo
puppet-tripleo-5.3.0-1.el7ost.noarch

[root@overcloud-controller-0 ~]# pcs resource --full | grep stop -C 1
...
            start interval=0s timeout=200s (rabbitmq-start-interval-0s)
            stop interval=0s timeout=200s (rabbitmq-stop-interval-0s)
 Master: redis-master
--
            start interval=0s timeout=200s (redis-start-interval-0s)
            stop interval=0s timeout=200s (redis-stop-interval-0s)
 Resource: ip-192.0.2.15 (class=ocf provider=heartbeat type=IPaddr2)
--
...

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html
Description of problem:

The 'stop timeout' values of the pacemaker resources rabbitmq and redis are set to 90s and 120s respectively, while for most other services the value is 200s. The overcloud deployment sometimes fails because of this, with the error:

03:19:48 + grep haproxy-clone
03:19:48 + pcs resource restart haproxy-clone
03:19:48 + pcs resource restart redis-master
03:19:48 + pcs resource restart mongod-clone
03:19:48 + pcs resource restart rabbitmq-clone
03:19:48 Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
03:19:48 Error performing operation: Timer expired
03:19:48
03:19:48 Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped
03:19:48 Waiting for 1 resources to stop:
03:19:48 * rabbitmq-clone

We also get the following error messages in /var/log/messages on the overcloud controllers:

localhost os-collect-config: Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
localhost os-collect-config: Error performing operation: Timer expired

Increasing the rabbitmq timeouts manually solves the issue.

Version-Release number of selected component (if applicable):
OSP 7

How reproducible:
Always

Steps to Reproduce:
1. Deploy the overcloud in HA.
2. On one of the controllers, run:

# pcs resource --full | grep stop -C 1
Operations: start interval=0s timeout=20s (ip-192.0.2.6-start-interval-0s)
            stop interval=0s timeout=20s (ip-192.0.2.6-stop-interval-0s)
            monitor interval=10s timeout=20s (ip-192.0.2.6-monitor-interval-10s)
--
Operations: start interval=0s timeout=200s (haproxy-start-interval-0s)
            stop interval=0s timeout=200s (haproxy-stop-interval-0s)
            monitor interval=60s (haproxy-monitor-interval-60s)
--
Operations: start interval=0s timeout=20s (ip-192.0.2.7-start-interval-0s)
            stop interval=0s timeout=20s (ip-192.0.2.7-stop-interval-0s)
            monitor interval=10s timeout=20s (ip-192.0.2.7-monitor-interval-10s)
--
Operations: start interval=0s timeout=120 (galera-start-interval-0s)
            stop interval=0s timeout=120 (galera-stop-interval-0s)
            monitor interval=20 timeout=30 (galera-monitor-interval-20)
--
Operations: start interval=0s timeout=120 (redis-start-interval-0s)
            stop interval=0s timeout=120 (redis-stop-interval-0s)
            monitor interval=45 timeout=60 (redis-monitor-interval-45)
--
Operations: start interval=0s timeout=370s (mongod-start-interval-0s)
            stop interval=0s timeout=200s (mongod-stop-interval-0s)
            monitor interval=60s (mongod-monitor-interval-60s)
--
            monitor interval=60s timeout=300 (rabbitmq-monitor-interval-60s)
            stop interval=0s timeout=90 (rabbitmq-stop-interval-0s)
 Clone: memcached-clone
--
Operations: start interval=0s timeout=200s (memcached-start-interval-0s)
            stop interval=0s timeout=200s (memcached-stop-interval-0s)
            monitor interval=60s (memcached-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-nova-scheduler-start-interval-0s)
            stop interval=0s timeout=200s (openstack-nova-scheduler-stop-interval-0s)
            monitor interval=60s start-delay=10s (openstack-nova-scheduler-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (neutron-l3-agent-start-interval-0s)
            stop interval=0s timeout=200s (neutron-l3-agent-stop-interval-0s)
            monitor interval=60s (neutron-l3-agent-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-ceilometer-alarm-notifier-start-interval-0s)
            stop interval=0s timeout=200s (openstack-ceilometer-alarm-notifier-stop-interval-0s)
            monitor interval=60s (openstack-ceilometer-alarm-notifier-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-heat-engine-start-interval-0s)
            stop interval=0s timeout=200s (openstack-heat-engine-stop-interval-0s)
            monitor interval=60s (openstack-heat-engine-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-ceilometer-api-start-interval-0s)
            stop interval=0s timeout=200s (openstack-ceilometer-api-stop-interval-0s)
            monitor interval=60s (openstack-ceilometer-api-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (neutron-metadata-agent-start-interval-0s)
            stop interval=0s timeout=200s (neutron-metadata-agent-stop-interval-0s)
            monitor interval=60s (neutron-metadata-agent-monitor-interval-60s)
--
Operations: start interval=0s timeout=40 (neutron-ovs-cleanup-start-interval-0s)
            stop interval=0s timeout=300 (neutron-ovs-cleanup-stop-interval-0s)
            monitor interval=10 timeout=20 (neutron-ovs-cleanup-monitor-interval-10)
--
Operations: start interval=0s timeout=40 (neutron-netns-cleanup-start-interval-0s)
            stop interval=0s timeout=300 (neutron-netns-cleanup-stop-interval-0s)
            monitor interval=10 timeout=20 (neutron-netns-cleanup-monitor-interval-10)
--
Operations: start interval=0s timeout=200s (openstack-heat-api-start-interval-0s)
            stop interval=0s timeout=200s (openstack-heat-api-stop-interval-0s)
            monitor interval=60s (openstack-heat-api-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-cinder-scheduler-start-interval-0s)
            stop interval=0s timeout=200s (openstack-cinder-scheduler-stop-interval-0s)
            monitor interval=60s (openstack-cinder-scheduler-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-nova-api-start-interval-0s)
            stop interval=0s timeout=200s (openstack-nova-api-stop-interval-0s)
            monitor interval=60s start-delay=10s (openstack-nova-api-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-heat-api-cloudwatch-start-interval-0s)
            stop interval=0s timeout=200s (openstack-heat-api-cloudwatch-stop-interval-0s)
            monitor interval=60s (openstack-heat-api-cloudwatch-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-ceilometer-collector-start-interval-0s)
            stop interval=0s timeout=200s (openstack-ceilometer-collector-stop-interval-0s)
            monitor interval=60s (openstack-ceilometer-collector-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-keystone-start-interval-0s)
            stop interval=0s timeout=200s (openstack-keystone-stop-interval-0s)
            monitor interval=60s (openstack-keystone-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-nova-consoleauth-start-interval-0s)
            stop interval=0s timeout=200s (openstack-nova-consoleauth-stop-interval-0s)
            monitor interval=60s start-delay=10s (openstack-nova-consoleauth-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-glance-registry-start-interval-0s)
            stop interval=0s timeout=200s (openstack-glance-registry-stop-interval-0s)
            monitor interval=60s (openstack-glance-registry-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-ceilometer-notification-start-interval-0s)
            stop interval=0s timeout=200s (openstack-ceilometer-notification-stop-interval-0s)
            monitor interval=60s (openstack-ceilometer-notification-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-cinder-api-start-interval-0s)
            stop interval=0s timeout=200s (openstack-cinder-api-stop-interval-0s)
            monitor interval=60s (openstack-cinder-api-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (neutron-dhcp-agent-start-interval-0s)
            stop interval=0s timeout=200s (neutron-dhcp-agent-stop-interval-0s)
            monitor interval=60s (neutron-dhcp-agent-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-glance-api-start-interval-0s)
            stop interval=0s timeout=200s (openstack-glance-api-stop-interval-0s)
            monitor interval=60s (openstack-glance-api-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (neutron-openvswitch-agent-start-interval-0s)
            stop interval=0s timeout=200s (neutron-openvswitch-agent-stop-interval-0s)
            monitor interval=60s (neutron-openvswitch-agent-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-nova-novncproxy-start-interval-0s)
            stop interval=0s timeout=200s (openstack-nova-novncproxy-stop-interval-0s)
            monitor interval=60s start-delay=10s (openstack-nova-novncproxy-monitor-interval-60s)
--
Operations: start interval=0s timeout=30 (delay-start-interval-0s)
            stop interval=0s timeout=30 (delay-stop-interval-0s)
            monitor interval=10 timeout=30 (delay-monitor-interval-10)
--
Operations: start interval=0s timeout=200s (neutron-server-start-interval-0s)
            stop interval=0s timeout=200s (neutron-server-stop-interval-0s)
            monitor interval=60s (neutron-server-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (httpd-start-interval-0s)
            stop interval=0s timeout=200s (httpd-stop-interval-0s)
            monitor interval=60s (httpd-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-ceilometer-central-start-interval-0s)
            stop interval=0s timeout=200s (openstack-ceilometer-central-stop-interval-0s)
            monitor interval=60s (openstack-ceilometer-central-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-ceilometer-alarm-evaluator-start-interval-0s)
            stop interval=0s timeout=200s (openstack-ceilometer-alarm-evaluator-stop-interval-0s)
            monitor interval=60s (openstack-ceilometer-alarm-evaluator-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-heat-api-cfn-start-interval-0s)
            stop interval=0s timeout=200s (openstack-heat-api-cfn-stop-interval-0s)
            monitor interval=60s (openstack-heat-api-cfn-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-cinder-volume-start-interval-0s)
            stop interval=0s timeout=200s (openstack-cinder-volume-stop-interval-0s)
            monitor interval=60s (openstack-cinder-volume-monitor-interval-60s)
--
Operations: start interval=0s timeout=200s (openstack-nova-conductor-start-interval-0s)
            stop interval=0s timeout=200s (openstack-nova-conductor-stop-interval-0s)
            monitor interval=60s start-delay=10s (openstack-nova-conductor-monitor-interval-60s)

Actual results:
The default rabbitmq stop timeout is 90s and the redis stop timeout is 120s.

Expected results:
The rabbitmq and redis stop timeouts should be set to 300.

Additional info:

Workaround to set the timeouts during overcloud deployment (unsupported):
- Open the /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/pacemaker_resource_restart.sh file on the undercloud node.
- Search for the line 'pcs resource restart rabbitmq-clone'.
- Put these two lines before 'pcs resource restart rabbitmq-clone':
  pcs resource op remove rabbitmq stop interval=0s timeout=90
  pcs resource op add rabbitmq stop interval=0s timeout=300
- Save the file and exit.
- Start the overcloud deployment as usual.

After the overcloud is deployed, check the 'stop timeout' value for the rabbitmq resource on the overcloud and verify that it is set to 300:

# pcs resource show rabbitmq-clone
 Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
  Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all"}"
  Meta Attrs: notify=true
  Operations: start interval=0s timeout=100 (rabbitmq-start-interval-0s)
              monitor interval=60s timeout=300 (rabbitmq-monitor-interval-60s)
              stop interval=0s timeout=300 (rabbitmq-stop-interval-0s) <=======
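For clarity, a sketch of how the edited region of pacemaker_resource_restart.sh might look after applying the workaround above. The restart lines are taken from the log excerpt in this report; the exact surrounding script contents are an assumption:

pcs resource restart haproxy-clone
pcs resource restart redis-master
pcs resource restart mongod-clone
# Workaround: raise the rabbitmq stop timeout before restarting the clone.
pcs resource op remove rabbitmq stop interval=0s timeout=90
pcs resource op add rabbitmq stop interval=0s timeout=300
pcs resource restart rabbitmq-clone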