rhel-osp-director: Overcloud update fails with "httpd has stopped: ERROR: cluster remained unstable for more than 1800 seconds, exiting"

Environment:
instack-undercloud-2.2.7-7.el7ost.noarch
openstack-tripleo-heat-templates-0.8.14-18.el7ost.noarch
openstack-puppet-modules-7.1.3-1.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-18.el7ost.noarch

Steps to reproduce:

1. Deploy the overcloud with version lock enabled using this command:

openstack overcloud deploy --log-file ~/pilot/overcloud_deployment.log -t 400 --stack overcloud \
  --templates ~/pilot/templates/overcloud \
  -e ~/pilot/templates/overcloud/environments/network-isolation.yaml \
  -e ~/pilot/templates/network-environment.yaml \
  -e ~/pilot/templates/overcloud/environments/storage-environment.yaml \
  -e ~/pilot/templates/dell-environment.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
  --control-flavor control --compute-flavor compute --ceph-storage-flavor ceph-storage \
  --swift-storage-flavor swift-storage --block-storage-flavor block-storage \
  --neutron-public-interface bond1 --neutron-network-type vlan --neutron-disable-tunneling \
  --os-auth-url http://192.168.120.101:5000/v2.0 --os-project-name admin --os-user-id admin \
  --os-password c658b5e1e0e4434faa685a5b2f36d5436ff4f2bf \
  --control-scale 3 --compute-scale 3 --ceph-storage-scale 3 \
  --ntp-server 0.centos.pool.ntp.org \
  --neutron-network-vlan-ranges physint:201:220,physext \
  --neutron-bridge-mappings physint:br-tenant,physext:br-ex

2. Update the undercloud and reboot the node.

3. Attempt to update the overcloud with:

yes "" | openstack overcloud update stack overcloud -i \
  --templates ~/pilot/templates/overcloud \
  -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e ~/pilot/templates/overcloud/environments/network-isolation.yaml \
  -e ~/pilot/templates/network-environment.yaml \
  -e ~/pilot/templates/overcloud/environments/storage-environment.yaml \
  -e ~/pilot/templates/dell-environment.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml

Result:

IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
FAILED
update finished with status FAILED

[stack@director ~]$ heat resource-list -n5 overcloud | grep -v COMPLE
| resource_name | physical_resource_id | resource_type | resource_status | updated_time | stack_name |
| ControllerNodesPostDeployment | 141bb713-aa62-42da-a558-8b034005b43d | OS::TripleO::ControllerPostDeployment | UPDATE_FAILED | 2016-10-11T16:58:21 | overcloud |
| ControllerPostPuppet | 3f925b70-88b9-479b-bdc3-28da3c855710 | OS::TripleO::Tasks::ControllerPostPuppet | UPDATE_FAILED | 2016-10-11T17:14:28 | overcloud-ControllerNodesPostDeployment-2q6gufczyxgf |
| ControllerPostPuppetRestartDeployment | fe59717a-d256-459a-b078-eba60d145f97 | OS::Heat::SoftwareDeployments | UPDATE_FAILED | 2016-10-11T17:15:32 | overcloud-ControllerNodesPostDeployment-2q6gufczyxgf-ControllerPostPuppet-2zl5nadvshsd |
| 0 | 358c2c31-a445-47f7-8458-8aaeb4df47b1 | OS::Heat::SoftwareDeployment | UPDATE_FAILED | 2016-10-11T17:15:33 | overcloud-ControllerNodesPostDeployment-2q6gufczyxgf-ControllerPostPuppet-2zl5nadvshsd-ControllerPostPuppetRestartDeployment-pjeelb7mewyp |

[stack@director ~]$ echo -e `heat deployment-show 358c2c31-a445-47f7-8458-8aaeb4df47b1`
{
  "status": "FAILED",
  "server_id": "12f49962-64d7-4b0f-b9e2-b5f981009456",
  "config_id": "341220b9-5bc5-4f42-a9a0-1853e53fa195",
  "output_values": {
    "deploy_stdout": "httpd has stopped
ERROR: cluster remained unstable for more than 1800 seconds, exiting.
",
    "deploy_stderr": "++ systemctl is-active pacemaker
+ pacemaker_status=active
++ hiera bootstrap_nodeid
++ facter hostname
++ hiera update_identifier
+ '[' active = active -a overcloud-controller-0 = overcloud-controller-0 -a 1476202761 '!=' nil ']'
+ pcs constraint order show
+ grep 'start neutron-server-clone then start neutron-ovs-cleanup-clone'
+ pcs resource disable httpd
+ check_resource httpd stopped 300
+ '[' 3 -ne 3 ']'
+ service=httpd
+ state=stopped
+ timeout=300
+ '[' stopped = stopped ']'
+ match_for_incomplete=Started
+ timeout -k 10 300 crm_resource --wait
++ pcs status --full
++ grep httpd
++ grep -v Clone
+ node_states=' httpd (systemd:httpd): (target-role:Stopped) Stopped
 httpd (systemd:httpd): (target-role:Stopped) Stopped
 httpd (systemd:httpd): (target-role:Stopped) Stopped
 httpd (systemd:httpd): (target-role:Stopped) Stopped
 httpd (systemd:httpd): (target-role:Stopped) Stopped
 httpd (systemd:httpd): (target-role:Stopped) Stopped'
+ echo ' httpd (systemd:httpd): (target-role:Stopped) Stopped
 httpd (systemd:httpd): (target-role:Stopped) Stopped
 httpd (systemd:httpd): (target-role:Stopped) Stopped
 httpd (systemd:httpd): (target-role:Stopped) Stopped
 httpd (systemd:httpd): (target-role:Stopped) Stopped
 httpd (systemd:httpd): (target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'httpd has stopped'
+ pcs resource disable openstack-keystone
+ check_resource openstack-keystone stopped 1800
+ '[' 3 -ne 3 ']'
+ service=openstack-keystone
+ state=stopped
+ timeout=1800
+ '[' stopped = stopped ']'
+ match_for_incomplete=Started
+ timeout -k 10 1800 crm_resource --wait
+ echo_error 'ERROR: cluster remained unstable for more than 1800 seconds, exiting.'
+ echo 'ERROR: cluster remained unstable for more than 1800 seconds, exiting.'
+ tee /dev/fd2
+ exit 1
",
    "deploy_status_code": 1
  },
  "creation_time": "2016-10-08T04:01:53",
  "updated_time": "2016-10-11T17:46:41",
  "input_values": {},
  "action": "UPDATE",
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 1",
  "id": "358c2c31-a445-47f7-8458-8aaeb4df47b1"
}
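The deploy_stderr above shows the update hook giving up after timeout -k 10 1800 crm_resource --wait never returns. A quick way to see what Pacemaker is still waiting on is to inspect the cluster directly on a controller; this is a minimal sketch, and the heat-admin user and overcloud-controller-0 hostname are assumptions rather than values taken from the log:

# Log on to one of the controllers (user/hostname are assumptions)
ssh heat-admin@overcloud-controller-0

# Full cluster view, including stopped clones and per-node state
sudo pcs status --full

# One-shot monitor showing inactive resources and fail counts
sudo crm_mon -1 -r -f

# Reproduce the wait the update hook performs, with a short timeout,
# to see whether the cluster ever settles
sudo timeout -k 10 60 crm_resource --wait; echo "crm_resource exit code: $?"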
Created attachment 1209283 [details] versionlock list
Anything in the journal on the controllers?
So, without sosreports, I will try to put out some thoughts here in the meantime. We fail in the following snippet of code:

+ pcs resource disable openstack-keystone
+ check_resource openstack-keystone stopped 1800
+ '[' 3 -ne 3 ']'
+ service=openstack-keystone
+ state=stopped
+ timeout=1800
+ '[' stopped = stopped ']'
+ match_for_incomplete=Started
+ timeout -k 10 1800 crm_resource --wait

Now, when we call the disable for openstack-keystone in Liberty, we are basically asking to stop all the child services of the resource: http://acksyn.org/files/tripleo/liberty-new-install.pdf

A few possibilities come to mind:
- In OSP 8 we do not have the correct stop timeout for systemd resources (200s), so one of the child services failed to stop and this broke the process. We will need a sosreport to double-check this.
- We actually hit a known pacemaker bug that makes crm_resource --wait never terminate: https://bugzilla.redhat.com/show_bug.cgi?id=1349493
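For context, the check_resource helper that fails in the trace has roughly the following shape. This is a simplified reconstruction based on the xtrace output in the description, not the exact script shipped in openstack-tripleo-heat-templates:

#!/bin/bash
# Reconstruction of the update hook's check_resource logic (simplified).
echo_error() {
    # The real script tees the message to stderr; plain redirection suffices here.
    echo "$@" >&2
}

check_resource() {
    if [ "$#" -ne 3 ]; then
        echo_error "ERROR: check_resource requires <service> <state> <timeout>"
        exit 1
    fi
    local service=$1
    local state=$2       # "started" or "stopped"
    local timeout=$3
    local match_for_incomplete

    # If the resource should be stopped, any node still reporting "Started"
    # means the transition is incomplete (and vice versa).
    if [ "$state" = "stopped" ]; then
        match_for_incomplete='Started'
    else
        match_for_incomplete='Stopped'
    fi

    # This is the call that times out in the failed update above.
    if ! timeout -k 10 "$timeout" crm_resource --wait; then
        echo_error "ERROR: cluster remained unstable for more than ${timeout} seconds, exiting."
        exit 1
    fi

    local node_states
    node_states=$(pcs status --full | grep "$service" | grep -v Clone)
    if echo "$node_states" | grep -q "$match_for_incomplete"; then
        echo_error "ERROR: $service failed to reach state $state"
        exit 1
    else
        echo "$service has $state"
    fi
}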
(In reply to Michele Baldessari from comment #5)
> A few possibilities come to mind:
> - In OSP 8 we do not have the correct stop timeout for systemd resources
>   (200s), so one of the child services failed to stop and this broke the
>   process.

This is what happened. The new nova-compute clone has the default timeouts instead of 200s or 300s. Those operations timed out, and without fencing enabled the cluster was unable to do anything to continue recovery. This prevented openstack-nova-conductor-clone, libvirtd-compute-clone, and their dependencies from being stopped, which caused the update to break.

You want to run the following and re-test:

for RESOURCE in neutron-openvswitch-agent-compute-clone libvirtd-compute-clone ceilometer-compute-clone nova-compute-clone; do
  sudo pcs resource update $RESOURCE op start timeout=200s op stop timeout=200s
done

*** This bug has been marked as a duplicate of bug 1386186 ***
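If the loop in the previous comment applies cleanly, the new values can be read back per resource to confirm; a minimal sketch, assuming the same pcs 0.9 syntax used elsewhere in this bug:

# Confirm the start/stop operation timeouts on each updated resource
for RESOURCE in neutron-openvswitch-agent-compute-clone libvirtd-compute-clone \
                ceilometer-compute-clone nova-compute-clone; do
    echo "=== $RESOURCE ==="
    sudo pcs resource show "$RESOURCE" | grep -E 'start |stop '
done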
for RESOURCE in neutron-openvswitch-agent-compute-clone libvirtd-compute-clone ceilometer-compute-clone nova-compute-clone; do
  sudo pcs resource update $RESOURCE op start timeout=200s op stop timeout=200s
done

These resources do not exist. Can you define the correct resources?
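One way to find out what the equivalent resources are called in a given deployment is to list what is actually configured in the cluster; a sketch, with the grep patterns only as examples:

# Print the full resource configuration and keep just the names
sudo pcs resource show --full | grep -E '^ *(Clone|Resource):'

# Or check directly whether any compute-related clones exist at all
sudo pcs resource | grep -i compute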
If you have done this successfully, can you post your commands?
Created attachment 1212867 [details] Timeouts from Updated and Upgraded Install

These are the timeouts from my install. As you can see, they are all set to 200s or higher where asked.
Created attachment 1212871 [details] resource timeouts from stock JS-5.0 install (OSP8)

Timeout values for the resources in my stock JS-5.0 (OSP8) install, FYI. Generated by:

sudo pcs resource | grep -v r8 | awk '{print $3}' | while read sedon ; do sudo pcs resource show $sedon; done > timeouts.dat

Are these OK? Seems like it, AFAIK.
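For comparing installs it may be easier to pull only the resource names and operation timeouts in a single pass; a sketch using the same pcs 0.9 syntax as the command above:

# Keep only resource names and the operation lines that carry a timeout
sudo pcs resource show --full | grep -E 'Resource:|timeout=' > timeouts.dat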
(In reply to Wayne Allen from comment #11)
> Created attachment 1212871 [details]
> resource timeouts from stock JS-5.0 install (OSP8)
>
> Timeout values for the resources in my stock JS-5.0 (OSP8) install, FYI.
> Generated by:
>
> sudo pcs resource | grep -v r8 | awk '{print $3}' | while read sedon ; do
> sudo pcs resource show $sedon; done > timeouts.dat
>
> Are these OK? Seems like it, AFAIK.

Yes, the start and stop timeouts are all set to 200s or higher.
(In reply to Randy Perryman from comment #8)
> for RESOURCE in neutron-openvswitch-agent-compute-clone
> libvirtd-compute-clone ceilometer-compute-clone nova-compute-clone; do
>   sudo pcs resource update $RESOURCE op start timeout=200s op stop
>   timeout=200s
> done
>
> These resources do not exist. Can you define the correct resources?

Hi Randy, sorry for the delay. Those resources are all created as part of the instance HA overlay feature, which won't be part of a basic TripleO installation. Somehow I missed that this was an overcloud update. We do not currently expect updates or upgrades to work when the instance HA feature has been configured (because it is not integrated with puppet and confuses the update logic), although we are working on addressing that. There was a thread on this in mid October, which details the current process for updates, that I will bounce to you again.
Reopening this BZ. We need the fix backported to OSP 8 and OSP 9, not just OSP 10. Use this BZ for the OSP 8/Liberty fix. Will dup 1386186 for OSP 9.
We haven't seen this in 6.0.1 updates since we started patching the templates to set the timeout to 300s. I'll try to remember to check what the current rabbitmq timeout default is in a fresh 6.0.1 install to determine whether this should be closed...
In a fresh OSP 10 deployment the rabbitmq stop timeout is still set to 200s by default; we had issues with anything under 300s.

 Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
  Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all"}"
  Meta Attrs: notify=true
  Operations: monitor interval=10 timeout=40 (rabbitmq-monitor-interval-10)
              start interval=0s timeout=200s (rabbitmq-start-interval-0s)
              stop interval=0s timeout=200s (rabbitmq-stop-interval-0s)
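Until the default changes, the rabbitmq operation timeouts can be raised in place the same way the compute clones were updated earlier in this bug; a minimal sketch (300s simply mirrors the value reported to work here, not an officially recommended default):

# Raise the rabbitmq start/stop operation timeouts from 200s to 300s
sudo pcs resource update rabbitmq op start timeout=300s op stop timeout=300s

# Read the operations back to confirm
sudo pcs resource show rabbitmq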
Setting needinfo to Andrew
200s is already quite a long time. Can we get some updated logs that I can pass on to our rabbit engineers to ensure there isn't some deeper issue?
Can we get the logs asked for in https://bugzilla.redhat.com/show_bug.cgi?id=1383780#c19 ? We need to understand why 200s is not enough for rabbitmq to stop.
Sorry, but those logs are no longer available. The stamp in question has been rebuilt.
Thanks David, I'll close this one out for now and we can revisit if needed.