Bug 1378391
| Summary: | [RFE] Increase 'stop timeout' values of pacemaker resources on overcloud | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Chaitanya Shastri <cshastri> |
| Component: | puppet-tripleo | Assignee: | Michele Baldessari <michele> |
| Status: | CLOSED ERRATA | QA Contact: | Asaf Hirshberg <ahirshbe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 7.0 (Kilo) | CC: | cshastri, dmacpher, emacchi, fdinitto, jcoufal, jjoyce, jschluet, mburns, michele, oblaut, pkomarov, rcernin, rhel-osp-director-maint, royoung, slinaber, tvignaud |
| Target Milestone: | rc | Keywords: | FutureFeature, Triaged |
| Target Release: | 10.0 (Newton) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | puppet-tripleo-5.3.0-1.el7ost | Doc Type: | Bug Fix |
| Doc Text: | Both Redis and RabbitMQ had start and stop timeouts of 120s in Pacemaker. In some environments this was not enough and caused restarts to fail. This fix increases the timeouts to 200s, the same value used for the other systemd resources. Redis and RabbitMQ should now have enough time to restart in the majority of environments. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-12-14 16:03:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Chaitanya Shastri
2016-09-22 10:31:28 UTC
I've got a workaround with Heat templates:

```
[stack@undercloud ~]$ cat ~/heat/templates/post_config_env.yaml
resource_registry:
  OS::TripleO::NodeExtraConfigPost: /home/stack/heat/templates/post_deploy/rabbitmq-op-params.yaml

[stack@undercloud ~]$ cat ~/heat/templates/post_deploy/rabbitmq-op-params.yaml
heat_template_version: 2014-10-16

description: >
  Example extra config for post-deployment

parameters:
  servers:
    type: json

resources:
  ExtraConfig:
    type: OS::Heat::SoftwareConfig
    properties:
      group: script
      config: |
        #!/bin/sh
        pcs cluster status
        retval=$?
        if [ $retval -eq 0 ]; then
            pcs resource update rabbitmq op stop timeout=300
        fi

  ExtraDeployments:
    type: OS::Heat::SoftwareDeployments
    properties:
      servers: {get_param: servers}
      config: {get_resource: ExtraConfig}
      actions: ['CREATE','UPDATE']  # Do this on CREATE, UPDATE
```

Deploy with:

```
openstack overcloud deploy --templates ... -e $HOME/heat/templates/post_config_env.yaml
```

Resulted in:

```
[heat-admin@overcloud-controller-0 ~]$ sudo pcs config | grep rabbitmq -A7
Clone: rabbitmq-clone
  Meta Attrs: ordered=true interleave=true
  Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
    Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all"}"
    Meta Attrs: notify=true
    Operations: start interval=0s timeout=100 (rabbitmq-start-interval-0s)
                monitor interval=10 timeout=40 (rabbitmq-monitor-interval-10)
                stop interval=0s timeout=300 (rabbitmq-stop-interval-0s)
Clone: openstack-core-clone
  Meta Attrs: interleave=true
  Resource: openstack-core (class=ocf provider=heartbeat type=Dummy)
    Operations: start interval=0s timeout=20 (openstack-core-start-interval-0s)
                stop interval=0s timeout=20 (openstack-core-stop-interval-0s)
                monitor interval=10 timeout=20 (openstack-core-monitor-interval-10)
Resource: ip-172.16.21.10 (class=ocf provider=heartbeat type=IPaddr2)
--
start rabbitmq-clone then start openstack-core-clone (kind:Mandatory) (id:order-rabbitmq-clone-openstack-core-clone-mandatory)
promote galera-master then start openstack-core-clone (kind:Mandatory) (id:order-galera-master-openstack-core-clone-mandatory)
start openstack-core-clone then start openstack-gnocchi-metricd-clone (kind:Mandatory) (id:order-openstack-core-clone-openstack-gnocchi-metricd-clone-mandatory)
start neutron-l3-agent-clone then start neutron-metadata-agent-clone (kind:Mandatory) (id:order-neutron-l3-agent-clone-neutron-metadata-agent-clone-mandatory)
start openstack-core-clone then start openstack-nova-consoleauth-clone (kind:Mandatory) (id:order-openstack-core-clone-openstack-nova-consoleauth-clone-mandatory)
start haproxy-clone then start openstack-core-clone (kind:Mandatory) (id:order-haproxy-clone-openstack-core-clone-mandatory)
start neutron-ovs-cleanup-clone then start neutron-netns-cleanup-clone (kind:Mandatory) (id:order-neutron-ovs-cleanup-clone-neutron-netns-cleanup-clone-mandatory)
start openstack-ceilometer-notification-clone then start openstack-heat-api-clone (kind:Mandatory) (id:order-openstack-ceilometer-notification-clone-openstack-heat-api-clone-mandatory)
--
overcloud-controller-0: rmq-node-attr-last-known-rabbitmq=rabbit@overcloud-controller-0
overcloud-controller-1: rmq-node-attr-last-known-rabbitmq=rabbit@overcloud-controller-1
overcloud-controller-2: rmq-node-attr-last-known-rabbitmq=rabbit@overcloud-controller-2
[heat-admin@overcloud-controller-0 ~]$
```

Thanks Robin. This works for me.

So there are actually a couple of issues at hand here when we talk about "The overcloud deployment sometimes fails due to this with the error".
Namely:

1) A normal deployment should not even call the following code (but this code predates me, so I need to double-check the intentions here with Jiri and Marios):

```
03:19:48 + grep haproxy-clone
03:19:48 + pcs resource restart haproxy-clone
03:19:48 + pcs resource restart redis-master
03:19:48 + pcs resource restart mongod-clone
03:19:48 + pcs resource restart rabbitmq-clone
```

The reason we call this code is, I believe, that we run /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/pacemaker_resource_restart.sh on deployment, whereas we should only invoke it when doing an "overcloud stack update". Let's track issue 1) here: https://bugzilla.redhat.com/show_bug.cgi?id=1384068

2) The second issue, which we can track in this BZ, is the timeouts for rabbitmq (and potentially any other service that does not use the systemd ocf provider). The reason for the 200s is that the default systemd timeout is 90s: on stop, it waits 90s after SIGTERM, then another 90s after SIGKILL. So with 90s*2 + ~20s we are basically guaranteed that the service is really stopped (unless the process is stuck in kernel space) and pacemaker won't fail any stop actions. With the non-systemd resources it is more a matter of trial and error to get a reasonable timeout.
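The 200s figure can be derived directly from that systemd default; a minimal sketch of the arithmetic (variable names are illustrative):

```shell
#!/bin/sh
# systemd's default stop timeout (TimeoutStopSec) is 90s. As described
# above, a stop can take up to two such windows (the SIGTERM wait, then
# the SIGKILL wait) plus roughly 20s of overhead before pacemaker can
# be certain the service is down.
SYSTEMD_STOP=90   # seconds; systemd default
MARGIN=20         # approximate overhead, per the comment above
echo $(( SYSTEMD_STOP * 2 + MARGIN ))   # prints 200
```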
The current status should be the following. Liberty/Mitaka/Newton are all the same:

- redis

```
start interval=0s timeout=120 (redis-start-interval-0s)
stop interval=0s timeout=120 (redis-stop-interval-0s)
monitor interval=45 timeout=60 (redis-monitor-interval-45)
monitor interval=20 role=Master timeout=60 (redis-monitor-interval-20)
monitor interval=60 role=Slave timeout=60 (redis-monitor-interval-60)
promote interval=0s timeout=120 (redis-promote-interval-0s)
demote interval=0s timeout=120 (redis-demote-interval-0s)
```

- galera

```
start interval=0s timeout=120 (galera-start-interval-0s)
stop interval=0s timeout=120 (galera-stop-interval-0s)
monitor interval=20 timeout=30 (galera-monitor-interval-20)
monitor interval=10 role=Master timeout=30 (galera-monitor-interval-10)
monitor interval=30 role=Slave timeout=30 (galera-monitor-interval-30)
demote interval=0s timeout=120 (galera-demote-interval-0s)
promote interval=0s timeout=300s on-fail=block (galera-promote-interval-0s)
```

- rabbitmq

```
start interval=0s timeout=100 (rabbitmq-start-interval-0s)
stop interval=0s timeout=90 (rabbitmq-stop-interval-0s)
monitor interval=10 timeout=40 (rabbitmq-monitor-interval-10)
```

So I am okay with increasing both redis and rabbitmq to roughly double their values in their start/stop actions (I will comment on the associated review). I discussed with Mike Bayer whether we should increase galera as well; he has never had a single report of that timing out, so let's leave it alone.

This has been merged for Newton via: https://review.openstack.org/#/c/386618/

Verified (based on comment #6). The timeouts have been doubled for redis & rabbitmq.

```
[stack@serverx tmp]$ rpm -qa | grep puppet-tripleo
puppet-tripleo-5.3.0-1.el7ost.noarch
[root@overcloud-controller-0 ~]# pcs resource --full | grep stop -C 1
...
start interval=0s timeout=200s (rabbitmq-start-interval-0s)
stop interval=0s timeout=200s (rabbitmq-stop-interval-0s)
Master: redis-master
--
start interval=0s timeout=200s (redis-start-interval-0s)
stop interval=0s timeout=200s (redis-stop-interval-0s)
Resource: ip-192.0.2.15 (class=ocf provider=heartbeat type=IPaddr2)
--
...
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html
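For anyone re-verifying on their own environment, the QA check above can be wrapped in a small script; this is a sketch only, assuming root access on a controller with the cluster running (the grep pattern is illustrative, not part of the official verification steps):

```shell
#!/bin/sh
# Hypothetical re-check of the fix: list the start/stop operations for
# rabbitmq and redis and confirm both timeouts read 200s.
# Deliberately a no-op on machines where pcs is not installed.
if command -v pcs >/dev/null 2>&1; then
    pcs resource --full | grep -E '(rabbitmq|redis)-(start|stop)-interval'
fi
```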