Bug 1383780 - rhel-osp-director: Overcloud update fails with "httpd has stopped: ERROR: cluster remained unstable for more than 1800 seconds, exiting"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: async
Target Release: 8.0 (Liberty)
Assignee: Andrew Beekhof
QA Contact: Omri Hochman
URL:
Whiteboard:
Depends On:
Blocks: 1305654 1335596
 
Reported: 2016-10-11 17:55 UTC by Alexander Chuzhoy
Modified: 2018-08-02 07:59 UTC
17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-02 07:59:21 UTC
Target Upstream Version:
Embargoed:


Attachments
versionlock list (29.51 KB, text/plain)
2016-10-11 18:23 UTC, Alexander Chuzhoy
Timeouts from Updated and Upgraded Install (15.78 KB, text/plain)
2016-10-21 14:02 UTC, Randy Perryman
resource timeouts from stock JS-5.0 install (OSP8) (14.43 KB, text/plain)
2016-10-21 14:31 UTC, Wayne Allen

Description Alexander Chuzhoy 2016-10-11 17:55:53 UTC
rhel-osp-director:   Overcloud update fails with "httpd has stopped: ERROR: cluster remained unstable for more than 1800 seconds, exiting"


Environment:
instack-undercloud-2.2.7-7.el7ost.noarch
openstack-tripleo-heat-templates-0.8.14-18.el7ost.noarch
openstack-puppet-modules-7.1.3-1.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-18.el7ost.noarch

Steps to reproduce:

1. Deploy overcloud with version lock enabled using this command:
openstack overcloud deploy --log-file ~/pilot/overcloud_deployment.log -t 400 --stack overcloud \
--templates ~/pilot/templates/overcloud \
-e ~/pilot/templates/overcloud/environments/network-isolation.yaml \
-e ~/pilot/templates/network-environment.yaml \
-e ~/pilot/templates/overcloud/environments/storage-environment.yaml \
-e ~/pilot/templates/dell-environment.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
--control-flavor control --compute-flavor compute --ceph-storage-flavor ceph-storage \
--swift-storage-flavor swift-storage --block-storage-flavor block-storage \
--neutron-public-interface bond1 --neutron-network-type vlan --neutron-disable-tunneling \
--os-auth-url http://192.168.120.101:5000/v2.0 --os-project-name admin --os-user-id admin \
--os-password c658b5e1e0e4434faa685a5b2f36d5436ff4f2bf \
--control-scale 3 --compute-scale 3 --ceph-storage-scale 3 \
--ntp-server 0.centos.pool.ntp.org \
--neutron-network-vlan-ranges physint:201:220,physext \
--neutron-bridge-mappings physint:br-tenant,physext:br-ex
2. Update undercloud + reboot the node.

3. Attempt to update overcloud with:
yes ""|openstack overcloud update stack overcloud -i --templates ~/pilot/templates/overcloud -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e ~/pilot/templates/overcloud/environments/network-isolation.yaml -e ~/pilot/templates/network-environment.yaml -e ~/pilot/templates/overcloud/environments/storage-environment.yaml -e ~/pilot/templates/dell-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml


Result:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
FAILED
update finished with status FAILED




[stack@director ~]$ heat resource-list -n5 overcloud|grep -v COMPLE
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| resource_name                                 | physical_resource_id                          | resource_type                                                                   | resource_status | updated_time        | stack_name                                                                                                                                        |
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ControllerNodesPostDeployment                 | 141bb713-aa62-42da-a558-8b034005b43d          | OS::TripleO::ControllerPostDeployment                                           | UPDATE_FAILED   | 2016-10-11T16:58:21 | overcloud                                                                                                                                         |
| ControllerPostPuppet                          | 3f925b70-88b9-479b-bdc3-28da3c855710          | OS::TripleO::Tasks::ControllerPostPuppet                                        | UPDATE_FAILED   | 2016-10-11T17:14:28 | overcloud-ControllerNodesPostDeployment-2q6gufczyxgf                                                                                              |
| ControllerPostPuppetRestartDeployment         | fe59717a-d256-459a-b078-eba60d145f97          | OS::Heat::SoftwareDeployments                                                   | UPDATE_FAILED   | 2016-10-11T17:15:32 | overcloud-ControllerNodesPostDeployment-2q6gufczyxgf-ControllerPostPuppet-2zl5nadvshsd                                                            |
| 0                                             | 358c2c31-a445-47f7-8458-8aaeb4df47b1          | OS::Heat::SoftwareDeployment                                                    | UPDATE_FAILED   | 2016-10-11T17:15:33 | overcloud-ControllerNodesPostDeployment-2q6gufczyxgf-ControllerPostPuppet-2zl5nadvshsd-ControllerPostPuppetRestartDeployment-pjeelb7mewyp         |
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
[stack@director ~]$









[stack@director ~]$ echo -e `heat deployment-show 358c2c31-a445-47f7-8458-8aaeb4df47b1`
{ "status": "FAILED", "server_id": "12f49962-64d7-4b0f-b9e2-b5f981009456", "config_id": "341220b9-5bc5-4f42-a9a0-1853e53fa195", "output_values": { "deploy_stdout": "httpd has stopped
ERROR: cluster remained unstable for more than 1800 seconds, exiting.
", "deploy_stderr": "++ systemctl is-active pacemaker
+ pacemaker_status=active
++ hiera bootstrap_nodeid
++ facter hostname
++ hiera update_identifier
+ '[' active = active -a overcloud-controller-0 = overcloud-controller-0 -a 1476202761 '!=' nil ']'
+ pcs constraint order show
+ grep 'start neutron-server-clone then start neutron-ovs-cleanup-clone'
+ pcs resource disable httpd
+ check_resource httpd stopped 300
+ '[' 3 -ne 3 ']'
+ service=httpd
+ state=stopped
+ timeout=300
+ '[' stopped = stopped ']'
+ match_for_incomplete=Started
+ timeout -k 10 300 crm_resource --wait
++ pcs status --full
++ grep httpd
++ grep -v Clone
+ node_states=' httpd   (systemd:httpd):        (target-role:Stopped) Stopped
 httpd  (systemd:httpd):        (target-role:Stopped) Stopped
 httpd  (systemd:httpd):        (target-role:Stopped) Stopped
 httpd  (systemd:httpd):        (target-role:Stopped) Stopped
 httpd  (systemd:httpd):        (target-role:Stopped) Stopped
 httpd  (systemd:httpd):        (target-role:Stopped) Stopped'
+ echo ' httpd  (systemd:httpd):        (target-role:Stopped) Stopped
 httpd  (systemd:httpd):        (target-role:Stopped) Stopped
 httpd  (systemd:httpd):        (target-role:Stopped) Stopped
 httpd  (systemd:httpd):        (target-role:Stopped) Stopped
 httpd  (systemd:httpd):        (target-role:Stopped) Stopped
 httpd  (systemd:httpd):        (target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'httpd has stopped'
+ pcs resource disable openstack-keystone
+ check_resource openstack-keystone stopped 1800
+ '[' 3 -ne 3 ']'
+ service=openstack-keystone
+ state=stopped
+ timeout=1800
+ '[' stopped = stopped ']'
+ match_for_incomplete=Started
+ timeout -k 10 1800 crm_resource --wait
+ echo_error 'ERROR: cluster remained unstable for more than 1800 seconds, exiting.'
+ echo 'ERROR: cluster remained unstable for more than 1800 seconds, exiting.'
+ tee /dev/fd2
+ exit 1
", "deploy_status_code": 1 }, "creation_time": "2016-10-08T04:01:53", "updated_time": "2016-10-11T17:46:41", "input_values": {}, "action": "UPDATE", "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 1", "id": "358c2c31-a445-47f7-8458-8aaeb4df47b1" }

Comment 1 Alexander Chuzhoy 2016-10-11 18:23:36 UTC
Created attachment 1209283 [details]
versionlock list

Comment 3 Gael Rehault 2016-10-12 20:49:58 UTC
anything in the journal on the controllers ?

Comment 5 Michele Baldessari 2016-10-12 22:33:42 UTC
Without sosreports, I will put down some thoughts here in the meantime.
We fail in the following snippet of code:
+ pcs resource disable openstack-keystone
+ check_resource openstack-keystone stopped 1800
+ '[' 3 -ne 3 ']'
+ service=openstack-keystone
+ state=stopped
+ timeout=1800
+ '[' stopped = stopped ']'
+ match_for_incomplete=Started
+ timeout -k 10 1800 crm_resource --wait


Now, when we disable openstack-keystone in Liberty, we are effectively asking to stop all of the resource's child services: http://acksyn.org/files/tripleo/liberty-new-install.pdf

A few possibilities come to mind:
- In OSP 8 we do not have the correct stop timeout for systemd resources (200s), so one of the child services failed to stop and this broke the process. I will need a sosreport to double-check this.
- We actually hit a known pacemaker bug that makes crm_resource --wait never terminate: https://bugzilla.redhat.com/show_bug.cgi?id=1349493
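
For context, the check_resource helper that produced the trace above behaves roughly like this. This is a simplified sketch reconstructed from the xtrace output in the description, not the exact tripleo-heat-templates script:

```shell
#!/bin/bash
# Simplified sketch of check_resource, reconstructed from the xtrace
# output above; not the exact upstream update script.
check_resource() {
    local service=$1 state=$2 timeout=$3
    local match_for_incomplete
    if [ "$state" = "stopped" ]; then
        match_for_incomplete='Started'
    else
        match_for_incomplete='Stopped'
    fi

    # This is the call that blocked for the full 1800 seconds in this bug:
    # crm_resource --wait returns only once no cluster actions are pending.
    if ! timeout -k 10 "$timeout" crm_resource --wait; then
        echo "ERROR: cluster remained unstable for more than $timeout seconds, exiting."
        exit 1
    fi

    # Check whether any copy of the resource is still in the "wrong" state.
    local node_states
    node_states=$(pcs status --full | grep "$service" | grep -v Clone)
    if echo "$node_states" | grep -q "$match_for_incomplete"; then
        echo "ERROR: $service did not reach state $state, exiting."
        exit 1
    fi
    echo "$service has $state"
}
```

Note that a stop-operation timeout on a child resource does not fail crm_resource --wait directly; without fencing, the cluster simply cannot make progress, so the wait runs until its own timeout expires, which is the failure mode seen here.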

Comment 7 Andrew Beekhof 2016-10-19 01:27:15 UTC
(In reply to Michele Baldessari from comment #5)
> A few possibilities come to mind:
> - In OSP 8 we do not have the correct stop timeout for systemd resources
> (200s), so one of the child services failed to stop and this broke the
> process.

This is what happened.

The new nova-compute clone has the default timeouts instead of 200s or 300s.
These operations timed out, and without fencing enabled the cluster was unable to do anything to continue recovery.

This prevented openstack-nova-conductor-clone, libvirtd-compute-clone, and their dependencies from being stopped, and caused the update to fail.


You want to run the following and re-test:

for RESOURCE in neutron-openvswitch-agent-compute-clone  libvirtd-compute-clone  ceilometer-compute-clone  nova-compute-clone; do
    sudo pcs resource update $RESOURCE op start timeout=200s op stop timeout=200s
done
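
To confirm the new values took effect, the operation timeouts can be inspected with pcs resource show. A small hypothetical helper (the resource names are the instance-HA clones from the loop above and exist only when that overlay is configured):

```shell
# Hypothetical helper: print the start/stop operation lines for the
# instance-HA clone resources named in the loop above. Run with root
# privileges on a controller; these resources exist only when the
# instance HA overlay is configured.
show_timeouts() {
    local resource
    for resource in neutron-openvswitch-agent-compute-clone libvirtd-compute-clone \
                    ceilometer-compute-clone nova-compute-clone; do
        pcs resource show "$resource" | grep -E 'start|stop'
    done
}
```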

*** This bug has been marked as a duplicate of bug 1386186 ***

Comment 8 Randy Perryman 2016-10-21 13:48:02 UTC
for RESOURCE in neutron-openvswitch-agent-compute-clone  libvirtd-compute-clone  ceilometer-compute-clone  nova-compute-clone; do
    sudo pcs resource update $RESOURCE op start timeout=200s op stop timeout=200s
done


These resources do not exist. Can you specify the correct resource names?

Comment 9 Randy Perryman 2016-10-21 13:48:53 UTC
If you have done this successfully can you post your commands?

Comment 10 Randy Perryman 2016-10-21 14:02:10 UTC
Created attachment 1212867 [details]
Timeouts from Updated and Upgraded Install

These are the timeouts from my install; as you can see, they are all set to 200s or higher where requested.

Comment 11 Wayne Allen 2016-10-21 14:31:05 UTC
Created attachment 1212871 [details]
resource timeouts from stock JS-5.0 install (OSP8)

Timeout values for resources in my stock JS-5.0 (OSP8) install, FYI. Generated by:

sudo pcs resource | grep -v r8 | awk '{print $3}' | while read sedon ; do sudo pcs resource show $sedon; done > timeouts.dat

Are these OK? They seem fine to me.

Comment 12 Andrew Beekhof 2016-11-08 02:24:08 UTC
(In reply to Wayne Allen from comment #11)
> Created attachment 1212871 [details]
> resource timeouts from stock JS-5.0 install (OSP8)
> 
> timeout values for resources in my stock JS-5.0 (OSP8) install, fyi.
> Generated by:
> 
> sudo pcs resource | grep -v r8 | awk '{print $3}' | while read sedon ; do
> sudo pcs resource show $sedon; done > timeouts.dat
> 
> Are these ok?  Seems like it AFAIK

Yes, the starts and stops are all set to 200 or higher.

Comment 13 Andrew Beekhof 2016-11-08 02:30:31 UTC
(In reply to Randy Perryman from comment #8)
> for RESOURCE in neutron-openvswitch-agent-compute-clone 
> libvirtd-compute-clone  ceilometer-compute-clone  nova-compute-clone; do
>     sudo pcs resource update $RESOURCE op start timeout=200s op stop
> timeout=200s
> done
> 
> 
> These Resource do not exist.  Can you define the correct Resources?

Hi Randy, sorry for the delay.

Those are all created as part of the instance HA overlay feature, which won't be part of a basic TripleO installation.

Somehow I missed that this was an overcloud update. We do not currently expect updates or upgrades to work when the IHA feature has been configured (it is not integrated with Puppet and confuses the update logic), although we are working on addressing that.

There was a thread on this in mid-October, detailing the current process for updates, which I will bounce to you again.

Comment 14 arkady kanevsky 2016-11-30 04:03:27 UTC
Reopening this BZ. We need the fix backported to OSP8 and OSP9, not just OSP10.
Use this BZ for the OSP8/Liberty fix.

Will dup 1386186 for OSP9.

Comment 15 Wayne Allen 2017-07-14 21:56:15 UTC
We haven't seen this in 6.0.1 updates since we started patching the timeout to 300s. I'll try to remember to check the current rabbitmq timeout default in a fresh 6.0.1 install to see whether this should be closed...

Comment 17 David Paterson 2018-04-03 17:03:35 UTC
In a fresh OSP 10 deployment, the rabbitmq stop timeout is still set to 200s by default; we had issues with anything under 300s.
 Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
  Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all"}"
  Meta Attrs: notify=true
  Operations: monitor interval=10 timeout=40 (rabbitmq-monitor-interval-10)
              start interval=0s timeout=200s (rabbitmq-start-interval-0s)
              stop interval=0s timeout=200s (rabbitmq-stop-interval-0s)
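
A possible mitigation, in line with the 300s value comment 15 reports patching in, would be raising the operation timeouts with the same pcs resource update syntax used in comment 7. This is a suggestion, not a verified fix for this bug:

```shell
# Raise the rabbitmq start/stop timeouts from the 200s default to 300s.
# The resource name "rabbitmq" is taken from the pcs output above;
# adjust it if your deployment differs.
sudo pcs resource update rabbitmq op start timeout=300s op stop timeout=300s
```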

Comment 18 Sean Merrow 2018-05-21 13:49:31 UTC
Setting needinfo to Andrew

Comment 19 Andrew Beekhof 2018-05-22 01:19:19 UTC
200s is already quite a long time.
Can we get some updated logs that I can pass on to our rabbit engineers to ensure there isn't some deeper issue?

Comment 20 Michele Baldessari 2018-07-20 08:39:36 UTC
Can we get the logs asked for in https://bugzilla.redhat.com/show_bug.cgi?id=1383780#c19? We need to understand why 200s is not enough for rabbitmq to stop.

Comment 21 David Paterson 2018-07-30 13:21:47 UTC
Sorry, but those logs are no longer available. The stamp in question has been rebuilt.

Comment 22 Michele Baldessari 2018-08-02 07:59:21 UTC
Thanks David, I'll close this one out for now and we can revisit if needed.

