Created attachment 1068384
heat templates used for deployment

Description of problem:
Heat commands issued from overcloud servers (not from the undercloud to deploy the overcloud, but from the overcloud to deploy instances) fail until openstack-heat-api-clone is restarted on a controller.

Version-Release number of selected component (if applicable):
Latest shipping: 08-07.3 OSP-d 08-14.1 OSP 7.
python-rdomanager-oscplugin-0.0.8-44.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
openstack-heat-api-2015.1.0-4.el7ost.noarch

How reproducible:
Every time for me

Steps to Reproduce:
1. Deploy non-SSL undercloud
2. Deploy HA overcloud with at least 3 controller nodes
3. Launch a nested heat stack that contains multiple instances and networking components
4. During heat stack creation, issue heat commands such as heat stack-list or heat resource-list <stack_name>.

Actual results:
It will return error:

[stack@rhos0 ~(demo_member)]$ heat stack-list
ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

And stack creation will fail. In some cases the stack shows create complete but the cloud-init actions fail.

Expected results:
heat should continue to work.

Additional info:
1. In single-node HA (OSP director deployed control-scale=1) the same heat templates that fail with 3 control nodes work EVERY TIME.
2. 'pcs show' shows all heat services running on all controllers, as does openstack-status
3. Setting 'verbose = true' in heat.conf and restarting heat with pcs resource restart openstack-heat-api-clone showed this error after reproducing the problem:

2015-08-30 01:32:07.174 575 DEBUG heat.common.serializers [req-5a2495c0-0b39-46c3-9f7b-f6e89922172d - demo-tenant] JSON response : {"explanation": "The server has either erred or is incapable of performing the requested operation.", "code": 500, "error": {"message": "Timed out waiting for a reply to message ID a5a874de59c4410db68d5f22bc067e8f", "traceback": "Traceback (most recent call last):\n error: [Errno 104] Connection reset by peer
2015-08-30 01:33:58.103 11574 DEBUG heat-api [-] error_wait_time = 240 log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2191
2015-08-30 01:33:58.119 11574 DEBUG heat-api [-] publish_errors = False log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2191

4. Restarting heat gets the API responsive again until heat commands such as 'heat stack-delete <stack>' are issued, then the problem returns
5. Restarting the neutron-server-clone resource group also corrects the problem
6. Sometimes the deploy makes it further into the stack than others before the services stop responding
7. Deployment command:

openstack overcloud deploy -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /home/stack/network-environment.yaml --control-flavor control --compute-flavor compute --ceph-storage-flavor ceph --ntp-server 10.16.255.2 --control-scale 3 --compute-scale 4 --ceph-storage-scale 4 --block-storage-scale 0 --swift-storage-scale 0 -t 90 --templates /home/stack/templates/openstack-tripleo-heat-templates/ -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml --rhel-reg --reg-method satellite --reg-sat-url http://se-sat6.syseng.bos.redhat.com --reg-org syseng --reg-activation-key OSP7-Overcloud
Deploying templates in the directory /home/stack/templates/openstack-tripleo-heat-templates
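To pin down when the API stops answering during stack creation (step 4 of the reproducer), something like the following could be left running from the overcloud client while the stack is created. This is only a sketch: the function names, the attempt count, and the delay are my own choices, and it assumes the heat client is on the PATH with overcloud credentials already sourced.

```python
import subprocess
import time

# Marker text taken from the HAProxy error page quoted in this report.
GATEWAY_TIMEOUT_MARKER = "504 Gateway Time-out"


def run_heat_stack_list():
    """Run `heat stack-list` and return its combined stdout/stderr as text."""
    proc = subprocess.run(
        ["heat", "stack-list"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.stdout


def first_gateway_timeout(run=run_heat_stack_list, attempts=10, delay=5.0):
    """Return the 1-based attempt on which the 504 page first appears, or None.

    `run` is any callable returning the command output as a string; it is a
    parameter (hypothetical, for illustration) so the loop can be exercised
    without a live cloud.
    """
    for attempt in range(1, attempts + 1):
        output = run()
        if GATEWAY_TIMEOUT_MARKER in output:
            return attempt
        if attempt < attempts:
            time.sleep(delay)
    return None
```

Calling first_gateway_timeout(attempts=60, delay=10) right after heat stack-create would give a rough idea of how far into the deployment the API gets before the 504 responses start.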
More investigation:

Deployed and then changed the following on controllers:

openstack-config --set /etc/heat/heat.conf DEFAULT engine_life_check_timeout 30
openstack-config --set /etc/heat/heat.conf DEFAULT rpc_response_timeout 600
openstack-config --set /etc/heat/heat.conf DEFAULT debug true

and restarted heat-{engine,api}.

Heat stack-create is successful, but 2 of 6 instances do not execute cloud-init and get no ssh key injection. After create completes, heat stack-list returns:

ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

Error in /var/log/messages on compute nodes:

Aug 31 00:52:40 localhost journal: internal error: missing storage backend for network files using rbd protocol
Aug 31 00:52:40 localhost ceilometer-agent-compute: libvirt: Storage Driver error : internal error: missing storage backend for network files using rbd protocol

Problem may be related to ephemeral storage on ceph.

Also numerous AMQP errors in /var/log/nova/nova-compute.log on compute nodes:

2015-08-31 00:07:01.718 17670 ERROR oslo_messaging._drivers.impl_rabbit [req-47f2e2fc-ab09-4324-b4cd-5f5a8c5743fa - - - - -] AMQP server on 172.16.1.18:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2015-08-31 00:07:02.741 17670 ERROR oslo_messaging._drivers.impl_rabbit [req-47f2e2fc-ab09-4324-b4cd-5f5a8c5743fa - - - - -] AMQP server on 172.16.1.17:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 2 seconds.
2015-08-31 00:07:04.760 17670 ERROR oslo_messaging._drivers.impl_rabbit [req-47f2e2fc-ab09-4324-b4cd-5f5a8c5743fa - - - - -] AMQP server on 172.16.1.17:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2015-08-31 00:07:05.778 17670 ERROR oslo_messaging._drivers.impl_rabbit [req-47f2e2fc-ab09-4324-b4cd-5f5a8c5743fa - - - - -] AMQP server on 172.16.1.17:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.

However, rabbitmq seems to be running and other services are not affected.
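To see which brokers the compute nodes are failing to reach and how often, the nova-compute.log lines can be tallied per broker. A rough sketch follows; the regular expression is written against the four lines quoted in this comment only, and the function name is made up for illustration, so other oslo.messaging versions may need a different pattern.

```python
import re
from collections import Counter

# Pattern derived from the impl_rabbit lines quoted in this report (assumption:
# other releases may word the message differently).
AMQP_UNREACHABLE = re.compile(
    r"AMQP server on (?P<host>[\d.]+):(?P<port>\d+) is unreachable: "
    r"\[Errno (?P<errno>\d+)\] (?P<reason>\w+)"
)

# The four log lines from this comment, verbatim.
SAMPLE_LINES = [
    "2015-08-31 00:07:01.718 17670 ERROR oslo_messaging._drivers.impl_rabbit [req-47f2e2fc-ab09-4324-b4cd-5f5a8c5743fa - - - - -] AMQP server on 172.16.1.18:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.",
    "2015-08-31 00:07:02.741 17670 ERROR oslo_messaging._drivers.impl_rabbit [req-47f2e2fc-ab09-4324-b4cd-5f5a8c5743fa - - - - -] AMQP server on 172.16.1.17:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 2 seconds.",
    "2015-08-31 00:07:04.760 17670 ERROR oslo_messaging._drivers.impl_rabbit [req-47f2e2fc-ab09-4324-b4cd-5f5a8c5743fa - - - - -] AMQP server on 172.16.1.17:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.",
    "2015-08-31 00:07:05.778 17670 ERROR oslo_messaging._drivers.impl_rabbit [req-47f2e2fc-ab09-4324-b4cd-5f5a8c5743fa - - - - -] AMQP server on 172.16.1.17:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.",
]


def count_unreachable_brokers(lines):
    """Count 'AMQP server ... is unreachable' lines per (host, port, reason)."""
    counts = Counter()
    for line in lines:
        match = AMQP_UNREACHABLE.search(line)
        if match:
            key = (match.group("host"), int(match.group("port")), match.group("reason"))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    for (host, port, reason), n in count_unreachable_brokers(SAMPLE_LINES).items():
        print(host, port, reason, n)
```

Run against the whole log (for example by passing an open file object for /var/log/nova/nova-compute.log), this would show whether the refusals are spread across all controllers or concentrated on one.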
Redeploying without ceph to test. My previous deployments with single pacemaker controller and LVM backend are 100% successful.
Asked slagle to test in the scale lab with my EAP 6 nested heat templates.
zaneb asked me to try increasing the HAProxy connection timeout values along with the heat parameters.

1. Deployed overcloud
2. Configured the following on all controller nodes:

sed -i "/heat/a \ timeout connect 30s" /etc/haproxy/haproxy.cfg
openstack-config --set /etc/heat/heat.conf DEFAULT engine_life_check_timeout 30
openstack-config --set /etc/heat/heat.conf DEFAULT rpc_response_timeout 600
openstack-config --set /etc/heat/heat.conf DEFAULT verbose true
openstack-config --get /etc/heat/heat.conf DEFAULT engine_life_check_timeout
openstack-config --get /etc/heat/heat.conf DEFAULT rpc_response_timeout
openstack-config --get /etc/heat/heat.conf DEFAULT verbose
pcs resource restart haproxy-clone
pcs resource restart openstack-heat-api-clone
pcs resource restart openstack-heat-engine-clone

3. Deployed EAP6 stack, failed with same unreachable errors

[stack@rhos0 ~(demo_member)]$ source demorc
[stack@rhos0 ~(demo_member)]$ heat stack-list
ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
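For anyone who wants to check the heat.conf part of step 2 without touching a controller, the three --set calls amount to overriding keys in the [DEFAULT] section of an INI file. Below is a minimal sketch of that with Python's configparser. It is not how openstack-config is implemented, the function name is hypothetical, and configparser drops comments and rejects duplicate keys, so it is meant for illustration on a copy rather than for editing a live heat.conf.

```python
import configparser
import io

# Values taken from the openstack-config --set commands in this comment.
HEAT_OVERRIDES = {
    "engine_life_check_timeout": "30",
    "rpc_response_timeout": "600",
    "verbose": "true",
}


def apply_default_overrides(conf_text, overrides):
    """Return conf_text with the given keys set in the [DEFAULT] section."""
    # interpolation=None keeps literal '%' characters in values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    for key, value in overrides.items():
        parser["DEFAULT"][key] = value
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical two-section config, not taken from a real controller.
    sample = "[DEFAULT]\nverbose = false\n\n[database]\nconnection = sqlite://\n"
    print(apply_default_overrides(sample, HEAT_OVERRIDES))
```

Reading the result back with configparser plays the same role as the --get calls above: it confirms the values before the pcs restarts.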
unable to reproduce
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-0604.html