Bug 1258192

Summary: heat non-functional in pacemaker cluster deployed by OSP 7 director
Product: Red Hat OpenStack
Reporter: jliberma <jliberma>
Component: rhosp-director
Assignee: James Slagle <jslagle>
Status: CLOSED ERRATA
QA Contact: Amit Ugol <augol>
Severity: unspecified
Docs Contact:
Priority: urgent
Version: 7.0 (Kilo)
CC: ebagdasa, hbrock, jcoufal, mburns, rhel-osp-director-maint
Target Milestone: ga
Keywords: TestOnly, Triaged
Target Release: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-04-07 21:39:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description jliberma@redhat.com 2015-08-30 06:37:28 UTC
Created attachment 1068384
heat templates used for deployment

Description of problem:

heat commands issued from overcloud servers (from the overcloud to deploy instances, not from the undercloud to deploy the overcloud) fail until openstack-heat-api-clone is restarted on a controller.

Version-Release number of selected component (if applicable):
Latest shipping: 08-07.3 OSP-d 08-14.1 OSP 7.

python-rdomanager-oscplugin-0.0.8-44.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
openstack-heat-api-2015.1.0-4.el7ost.noarch

How reproducible:
Every time for me

Steps to Reproduce:
1. Deploy non-SSL undercloud
2. Deploy HA overcloud with at least 3 controller nodes
3. Launch a nested heat stack that contains multiple instances and networking components
4. During heat stack creation, issue heat commands such as heat stack-list or heat resource-list <stack_name>.

Actual results:

The command returns an error:

[stack@rhos0 ~(demo_member)]$ heat stack-list
ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

Stack creation then fails. In some cases the stack shows create complete but the cloud-init actions fail.
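A minimal sketch of how the failure above could be detected from a script, assuming the heat client is on PATH and credentials are already sourced. The function names (`is_gateway_timeout`, `run_heat_stack_list`, `poll_for_timeout`) are hypothetical helpers for illustration, not part of any heat tooling; the only thing taken from this report is the "504 Gateway Time-out" text in the error page.

```python
import subprocess
import time

# Marker copied from the error page shown in this report.
GATEWAY_TIMEOUT_MARKER = "504 Gateway Time-out"


def is_gateway_timeout(cli_output):
    """Return True if the CLI output contains the proxy's 504 error page."""
    return GATEWAY_TIMEOUT_MARKER in cli_output


def run_heat_stack_list():
    """Run `heat stack-list` and return stdout and stderr as one string.

    Assumption: the heat client is installed and credentials are sourced.
    """
    proc = subprocess.run(
        ["heat", "stack-list"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.stdout


def poll_for_timeout(run_command, attempts=10, delay=5.0):
    """Call run_command up to `attempts` times.

    Returns the 1-based attempt number of the first 504 response,
    or None if no 504 was seen.
    """
    for attempt in range(1, attempts + 1):
        if is_gateway_timeout(run_command()):
            return attempt
        if attempt < attempts:
            time.sleep(delay)
    return None
```

Calling `poll_for_timeout(run_heat_stack_list)` while the nested stack is being created would report on which attempt the API stopped answering, which may help correlate the failure with the heat and haproxy logs.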

Expected results:

heat should continue to work.

Additional info:
1. In single-node HA (OSP director deployed with control-scale=1), the same heat templates that fail with 3 control nodes work EVERY TIME.
2. 'pcs show' shows all heat services running on all controllers, as does openstack-status
3. Setting 'verbose = true' in heat.conf and restarting heat with pcs resource restart openstack-heat-api-clone showed this error after reproducing the problem:

2015-08-30 01:32:07.174 575 DEBUG heat.common.serializers [req-5a2495c0-0b39-46c3-9f7b-f6e89922172d - demo-tenant] JSON response : {"explanation": "The server has either erred or is incapable of performing the requested operation.", "code": 500, "error": {"message": "Timed out waiting for a reply to message ID a5a874de59c4410db68d5f22bc067e8f", "traceback": "Traceback (most recent call last):\n
error: [Errno 104] Connection reset by peer
2015-08-30 01:33:58.103 11574 DEBUG heat-api [-] error_wait_time                = 240 log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2191
2015-08-30 01:33:58.119 11574 DEBUG heat-api [-] publish_errors                 = False log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2191

4. Restarting heat makes the API responsive again until heat commands such as 'heat stack-delete <stack>' are issued; then the problem returns
5. Restarting the neutron-server-clone resource group also corrects the problem
6. Sometimes the deploy makes it further into the stack than others before the services stop responding

7. Deployment command:
openstack overcloud deploy -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /home/stack/network-environment.yaml --control-flavor control --compute-flavor compute --ceph-storage-flavor ceph --ntp-server 10.16.255.2 --control-scale 3 --compute-scale 4 --ceph-storage-scale 4 --block-storage-scale 0 --swift-storage-scale 0 -t 90 --templates /home/stack/templates/openstack-tripleo-heat-templates/ -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml --rhel-reg --reg-method satellite --reg-sat-url http://se-sat6.syseng.bos.redhat.com --reg-org syseng --reg-activation-key OSP7-Overcloud
Deploying templates in the directory /home/stack/templates/openstack-tripleo-heat-templates

Comment 3 jliberma@redhat.com 2015-08-31 12:51:11 UTC
More investigation:

Deployed, then set the following on the controllers and restarted heat-{engine,api}:

openstack-config --set /etc/heat/heat.conf DEFAULT engine_life_check_timeout 30
openstack-config --set /etc/heat/heat.conf DEFAULT rpc_response_timeout 600
openstack-config --set /etc/heat/heat.conf DEFAULT debug true
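A rough sketch for confirming that the three settings above actually landed in the [DEFAULT] section on each controller. It parses the file contents with Python's standard configparser; the helper name `check_defaults` and the `EXPECTED` table are hypothetical, and the sketch assumes the file parses as plain INI.

```python
import configparser

# Values taken from the openstack-config commands in this comment.
EXPECTED = {
    "engine_life_check_timeout": "30",
    "rpc_response_timeout": "600",
    "debug": "true",
}


def check_defaults(conf_text, expected=EXPECTED):
    """Compare [DEFAULT] options in conf_text against expected values.

    Returns {option: (expected, actual)} for every mismatch; actual is
    None when the option is absent. Comparison ignores case and
    surrounding whitespace.
    """
    # interpolation=None keeps '%' characters literal; strict=False
    # tolerates duplicate options instead of raising.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    defaults = parser.defaults()
    mismatches = {}
    for option, want in expected.items():
        got = defaults.get(option)
        if got is None or got.strip().lower() != want.lower():
            mismatches[option] = (want, got)
    return mismatches
```

For example, `check_defaults(open("/etc/heat/heat.conf").read())` returning an empty dict would suggest the settings are in place on that node.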

heat stack-create succeeds, but 2 of 6 instances do not execute cloud-init (no ssh key injection). After the create completes, heat stack-list returns:

ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

Error in /var/log/messages on compute nodes:
Aug 31 00:52:40 localhost journal: internal error: missing storage backend for network files using rbd protocol
Aug 31 00:52:40 localhost ceilometer-agent-compute: libvirt: Storage Driver error : internal error: missing storage backend for network files using rbd protocol

The problem may be related to ephemeral storage on ceph.

There are also numerous AMQP errors in /var/log/nova/nova-compute.log on compute nodes:
2015-08-31 00:07:01.718 17670 ERROR oslo_messaging._drivers.impl_rabbit [req-47f2e2fc-ab09-4324-b4cd-5f5a8c5743fa - - - - -] AMQP server on 172.16.1.18:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2015-08-31 00:07:02.741 17670 ERROR oslo_messaging._drivers.impl_rabbit [req-47f2e2fc-ab09-4324-b4cd-5f5a8c5743fa - - - - -] AMQP server on 172.16.1.17:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 2 seconds.
2015-08-31 00:07:04.760 17670 ERROR oslo_messaging._drivers.impl_rabbit [req-47f2e2fc-ab09-4324-b4cd-5f5a8c5743fa - - - - -] AMQP server on 172.16.1.17:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2015-08-31 00:07:05.778 17670 ERROR oslo_messaging._drivers.impl_rabbit [req-47f2e2fc-ab09-4324-b4cd-5f5a8c5743fa - - - - -] AMQP server on 172.16.1.17:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.

However, rabbitmq seems to be running and other services are not affected.
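To see whether the AMQP errors above are concentrated on one broker, the log lines can be tallied per host. This is a small sketch, assuming the message format shown in the excerpt; the regular expression and the `count_unreachable` helper are illustrative and not part of oslo.messaging.

```python
import re
from collections import Counter

# Matches the "is unreachable" messages as they appear in the excerpt.
UNREACHABLE = re.compile(
    r"AMQP server on (?P<host>[\d.]+):(?P<port>\d+) is unreachable: "
    r"\[Errno (?P<errno>\d+)\] (?P<reason>\w+)"
)


def count_unreachable(log_lines):
    """Count 'AMQP server ... is unreachable' errors per (host, reason)."""
    counts = Counter()
    for line in log_lines:
        match = UNREACHABLE.search(line)
        if match:
            counts[(match.group("host"), match.group("reason"))] += 1
    return counts
```

Feeding it the lines of /var/log/nova/nova-compute.log from each compute node gives a per-broker count that can be compared against the rabbitmq status on the controllers.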

Redeploying without ceph to test. My previous deployments with a single pacemaker controller and an LVM backend were 100% successful.

Comment 4 jliberma@redhat.com 2015-08-31 13:01:13 UTC
Asked slagle to test in the scale lab with my EAP 6 nested heat templates.

Comment 6 jliberma@redhat.com 2015-09-01 19:56:54 UTC
zaneb asked me to try increasing the HAProxy connection timeout values along with the heat parameters.

1. Deployed the overcloud
2. Configured the following on all controller nodes:
sed -i "/heat/a  \  timeout  connect 30s" /etc/haproxy/haproxy.cfg 
openstack-config --set /etc/heat/heat.conf DEFAULT engine_life_check_timeout 30
openstack-config --set /etc/heat/heat.conf DEFAULT rpc_response_timeout 600
openstack-config --set /etc/heat/heat.conf DEFAULT verbose true
openstack-config --get /etc/heat/heat.conf DEFAULT engine_life_check_timeout
openstack-config --get /etc/heat/heat.conf DEFAULT rpc_response_timeout
openstack-config --get /etc/heat/heat.conf DEFAULT verbose
pcs resource restart haproxy-clone
pcs resource restart openstack-heat-api-clone
pcs resource restart openstack-heat-engine-clone
3. Deployed the EAP6 stack; it failed with the same unreachable errors

[stack@rhos0 ~(demo_member)]$ source demorc

[stack@rhos0 ~(demo_member)]$ heat stack-list
ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
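One detail of the sed command in step 2 is worth checking: the `/heat/a` address appends the timeout line after every line that matches "heat", not only after the section headers. The sketch below mimics that behaviour in Python so the effect can be previewed on a copy of the config. The helper name `append_after_matches` is hypothetical, and the exact whitespace sed emits for the appended text is an assumption here.

```python
import re


def append_after_matches(lines, pattern, text):
    """Mimic sed's '/pattern/a text': emit `text` after each matching line."""
    regex = re.compile(pattern)
    result = []
    for line in lines:
        result.append(line)
        if regex.search(line):
            # Every matching line gets its own appended copy of `text`.
            result.append(text)
    return result
```

Running it over the lines of /etc/haproxy/haproxy.cfg with the pattern "heat" and comparing the result with the edited file shows how many timeout lines were added and where they ended up.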

Comment 8 Amit Ugol 2016-03-22 06:35:05 UTC
Unable to reproduce.

Comment 10 errata-xmlrpc 2016-04-07 21:39:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0604.html