Description of problem:

I am doing a 3 ctrl + 1 compute node deployment with an external SSL-enabled load balancer on IPv4. After the deployment is done, all the pacemaker Heat resources are stopped, thus the Heat service is not accessible.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-119.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:

openstack overcloud deploy --templates ~/templates/my-overcloud \
-e ~/templates/my-overcloud/environments/network-isolation.yaml \
-e ~/templates/network-environment.yaml \
-e ~/templates/enable-tls-external-lb.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e ~/templates/my-overcloud/environments/external-loadbalancer-vip.yaml \
-e ~/templates/external-lb.yaml \
-e /home/stack/templates/firstboot-environment.yaml \
--control-scale 3 \
--compute-scale 1 \
--ntp-server 10.5.26.10 \
--libvirt-type qemu

stack@instack:~>>> cat ~/templates/network-environment.yaml
resource_registry:
  OS::TripleO::BlockStorage::Net::SoftwareConfig: /home/stack/templates/nic-configs/cinder-storage.yaml
  OS::TripleO::Compute::Net::SoftwareConfig: /home/stack/templates/nic-configs/compute.yaml
  OS::TripleO::Controller::Net::SoftwareConfig: /home/stack/templates/nic-configs/controller.yaml
  OS::TripleO::ObjectStorage::Net::SoftwareConfig: /home/stack/templates/nic-configs/swift-storage.yaml
  OS::TripleO::CephStorage::Net::SoftwareConfig: /home/stack/templates/nic-configs/ceph-storage.yaml

parameter_defaults:
  InternalApiNetCidr: 172.16.20.0/24
  StorageNetCidr: 172.16.21.0/24
  StorageMgmtNetCidr: 172.16.19.0/24
  TenantNetCidr: 172.16.22.0/24
  ExternalNetCidr: 172.16.23.0/24
  InternalApiAllocationPools: [{'start': '172.16.20.10', 'end': '172.16.20.100'}]
  StorageAllocationPools: [{'start': '172.16.21.10', 'end': '172.16.21.100'}]
  StorageMgmtAllocationPools: [{'start': '172.16.19.10', 'end': '172.16.19.100'}]
  TenantAllocationPools: [{'start': '172.16.22.10', 'end': '172.16.22.100'}]
  ExternalAllocationPools: [{'start': '172.16.23.10', 'end': '172.16.23.100'}]
  ExternalInterfaceDefaultRoute: 172.16.23.251
  NeutronExternalNetworkBridge: "''"
  ControlPlaneSubnetCidr: "24"
  ControlPlaneDefaultRoute: 192.0.2.1
  EC2MetadataIp: 192.0.2.1
  DnsServers: ["10.16.36.29","10.11.5.19"]
  CloudName: rxtx.ro

stack@instack:~>>> cat ~/templates/external-lb.yaml
parameters:
  ServiceNetMap:
    NeutronTenantNetwork: tenant
    CeilometerApiNetwork: internal_api
    MongoDbNetwork: internal_api
    CinderApiNetwork: internal_api
    CinderIscsiNetwork: storage
    GlanceApiNetwork: storage
    GlanceRegistryNetwork: internal_api
    KeystoneAdminApiNetwork: internal_api
    KeystonePublicApiNetwork: internal_api
    NeutronApiNetwork: internal_api
    HeatApiNetwork: internal_api
    NovaApiNetwork: internal_api
    NovaMetadataNetwork: internal_api
    NovaVncProxyNetwork: internal_api
    SwiftMgmtNetwork: storage_mgmt
    SwiftProxyNetwork: storage
    HorizonNetwork: internal_api
    MemcachedNetwork: internal_api
    RabbitMqNetwork: internal_api
    RedisNetwork: internal_api
    MysqlNetwork: internal_api
    CephClusterNetwork: storage_mgmt
    CephPublicNetwork: storage
    ControllerHostnameResolveNetwork: internal_api
    ComputeHostnameResolveNetwork: internal_api
    BlockStorageHostnameResolveNetwork: internal_api
    ObjectStorageHostnameResolveNetwork: internal_api
    CephStorageHostnameResolveNetwork: storage

parameter_defaults:
  ControlPlaneIP: 192.0.2.250
  ExternalNetworkVip: 172.16.23.250
  InternalApiNetworkVip: 172.16.20.250
  StorageNetworkVip: 172.16.21.250
  StorageMgmtNetworkVip: 172.16.19.250
  ServiceVips:
    redis: 172.16.20.249
  ControllerIPs:
    external_cidr: "24"
    internal_api_cidr: "24"
    storage_cidr: "24"
    storage_mgmt_cidr: "24"
    tenant_cidr: "24"
    external:
    - 172.16.23.150
    - 172.16.23.151
    - 172.16.23.152
    internal_api:
    - 172.16.20.150
    - 172.16.20.151
    - 172.16.20.152
    storage:
    - 172.16.21.150
    - 172.16.21.151
    - 172.16.21.152
    storage_mgmt:
    - 172.16.19.150
    - 172.16.19.151
    - 172.16.19.152
    tenant:
    - 172.16.22.150
    - 172.16.22.151
    - 172.16.22.152

Actual results:

[root@overcloud-controller-0 ~]# pcs status | grep -A1 heat
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Expected results:
The heat services are running.
I think this is related to BZ#1306623, as ceilometer fails to start and there is the following constraint:

start openstack-ceilometer-notification-clone then start openstack-heat-api-clone (kind:Mandatory)
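(That ordering constraint can be confirmed on a controller with something like the following — generic pcs usage, not commands quoted from this report:

pcs constraint order show | grep heat
pcs constraint show --full | grep -i heat
)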
I'm going to close this as a duplicate of bz1306623 then; feel free to reopen it if it turns out that is not the cause.

*** This bug has been marked as a duplicate of bug 1306623 ***
Reopening this one - after the deployment finished, heat-engine failed to start:

[root@overcloud-controller-2 ~]# pcs status | grep -A1 heat-engine
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
* openstack-heat-engine_start_0 on overcloud-controller-1 'not running' (7): call=392, status=complete, exitreason='none', last-rc-change='Fri Feb 12 18:35:14 2016', queued=0ms, exec=2086ms
* openstack-heat-engine_start_0 on overcloud-controller-2 'not running' (7): call=381, status=complete, exitreason='none', last-rc-change='Fri Feb 12 18:35:14 2016', queued=0ms, exec=2058ms
* openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7): call=397, status=complete, exitreason='none', last-rc-change='Fri Feb 12 18:35:14 2016', queued=0ms, exec=2022ms

I uploaded the sosreports to the location mentioned in comment#1.
Workaround to get it started:

pcs resource cleanup openstack-heat-engine
pcs resource restart openstack-heat-engine
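(A side note, my addition rather than something stated in the thread: the cleanup step clears the resource's failcount and failed-operation history, which is what allows pacemaker to attempt the start again. The before/after state can be checked with:

pcs resource failcount show openstack-heat-engine
)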
Been poking at the sos reports for this. It isn't, afaics, related to BZ#1306623 as per comment 3, since ceilometer is running here and only heat-engine isn't. There is another similar 'heat-engine crashed' bz at 1307125 which I thought might be related, but heat-engine fails for a different reason here (not because of a DBConnectionError). Also, to clarify: the description says "all the pacemaker Heat resources are stopped", but afaics only heat-engine is stopped. I *think* there may be an issue with the ssl certificates; more info below.

From both controller-0 and controller-1 I see a problem with cloud-init:

Feb 12 17:59:20 overcloud-controller-1.localdomain cloud-init[2926]: [CLOUDINIT] stages.py[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_ssh_authkey_fingerprints.pyc'>) with frequency once-per-instance
Feb 12 17:59:20 overcloud-controller-1.localdomain cloud-init[2926]: [CLOUDINIT] util.py[DEBUG]: Writing to /var/lib/cloud/instances/4515569f-12ad-498e-bb0a-853fff2c8c0b/sem/config_ssh_authkey_fingerprints - wb: [420] 20 bytes
Feb 12 17:59:20 overcloud-controller-1.localdomain cloud-init[2926]: [CLOUDINIT] util.py[DEBUG]: Restoring selinux mode for /var/lib/cloud/instances/4515569f-12ad-498e-bb0a-853fff2c8c0b/sem/config_ssh_authkey_fingerprints (recursive=False)
Feb 12 17:59:20 overcloud-controller-1.localdomain cloud-init[2926]: [CLOUDINIT] util.py[DEBUG]: Restoring selinux mode for /var/lib/cloud/instances/4515569f-12ad-498e-bb0a-853fff2c8c0b/sem/config_ssh_authkey_fingerprints (recursive=False)
Feb 12 17:59:20 overcloud-controller-1.localdomain cloud-init[2926]: [CLOUDINIT] helpers.py[DEBUG]: Running config-ssh-authkey-fingerprints using lock (<FileLock using file '/var/lib/cloud/instances/4515569f-12ad-498e-bb0a-853fff2c8c0b/sem/config_ssh_authkey_fingerprints'>)
Feb 12 17:59:20 overcloud-controller-1.localdomain cloud-init[2926]: [CLOUDINIT] util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
Feb 12 17:59:20 overcloud-controller-1.localdomain cloud-init[2926]: [CLOUDINIT] util.py[DEBUG]: Read 4359 bytes from /etc/ssh/sshd_config
Feb 12 17:59:20 overcloud-controller-1.localdomain cloud-init[2926]: [CLOUDINIT] util.py[DEBUG]: Restoring selinux mode for /home/heat-admin/.ssh (recursive=True)
Feb 12 17:59:20 overcloud-controller-1.localdomain cloud-init[2926]: [CLOUDINIT] util.py[DEBUG]: Reading from /home/heat-admin/.ssh/authorized_keys (quiet=False)
Feb 12 17:59:20 overcloud-controller-1.localdomain cloud-init[2926]: [CLOUDINIT] util.py[DEBUG]: Read 407 bytes from /home/heat-admin/.ssh/authorized_keys
Feb 12 17:59:20 overcloud-controller-1.localdomain cloud-init[2926]: 2016-02-12 12:59:20,144 - util.py[WARNING]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_ssh_authkey_fingerprints.pyc'>) failed
Feb 12 17:59:20 overcloud-controller-1.localdomain cloud-init[2926]: [CLOUDINIT] util.py[WARNING]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_ssh_authkey_fingerprints.pyc'>) failed
Feb 12 17:59:20 overcloud-controller-1.localdomain cloud-init[2926]: [CLOUDINIT] util.py[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_ssh_authkey_fingerprints.pyc'>) failed

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cloudinit/stages.py", line 660, in _run_modules
    cc.run(run_name, mod.handle, func_args, freq=freq)
  File "/usr/lib/python2.7/site-packages/cloudinit/cloud.py", line 63, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python2.7/site-packages/cloudinit/helpers.py", line 197, in run
    results = functor(*args)
  File "/usr/lib/python2.7/site-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py", line 105, in handle
    key_entries, hash_meth)
  File "/usr/lib/python2.7/site-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py", line 91, in _pprint_key_entries
    stderr=False, console=True)
  File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 346, in multi_log
    wfh.flush()
IOError: [Errno 5] Input/output error

==============================================

From pcs status, *only* heat-engine is down (Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]). There is nothing interesting in heat-engine.log itself - its entirety looks like:

2016-02-12 18:18:36.947 1244 WARNING heat.common.config [-] stack_user_domain_id or stack_user_domain_name not set in heat.conf falling back to using default
2016-02-12 18:18:38.506 1244 WARNING oslo_config.cfg [req-88da6c7f-af50-48e6-b4ef-51e0628b94af - -] Option "db_backend" from group "DEFAULT" is deprecated. Use option "backend" from group "database".
2016-02-12 18:24:29.207 19065 WARNING oslo_config.cfg [req-7656b9c2-49ea-4f8d-95c4-360911b4be0e - -] Option "db_backend" from group "DEFAULT" is deprecated. Use option "backend" from group "database".
2016-02-12 18:35:20.780 20580 WARNING oslo_config.cfg [req-429ee551-ea41-4ff4-aefa-7c99a8347d9b - -] Option "db_backend" from group "DEFAULT" is deprecated. Use option "backend" from group "database".
But in heat-api.log there is an error like:

2016-02-12 18:38:59.951 20289 ERROR heat.common.wsgi [req-14f06bed-09fb-49ac-a44a-f8443152c4a2 c57fcbb893e2455bb9192861a8785083 1dea66563aba4c51be0618285f906fe8] Unexpected error occurred serving API: Timed out waiting for a reply to message ID 96fbcd0f8f9a4206bac74f2881f358ab

==============================================

From os-collect-config, heat-config fails with 'unable to load certificate...':

Feb 12 18:01:44 overcloud-controller-0.localdomain os-collect-config[8315]: [2016-02-12 13:01:44,327] (heat-config) [INFO] {"key_modulus": "d41d8cd98f00b204e9800998ecf8427e\n", "deploy_stdout": "", "deploy_stderr": "unable to load certificate\n139749505574816:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c:703:Expecting: TRUSTED CERTIFICATE\nunable to load Private Key\n140472523294624:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c:703:Expecting: ANY PRIVATE KEY\n", "chain_md5sum": "68b329da9893e34099c7d8ad5cb9c940 /etc/pki/tls/private/overcloud_endpoint.pem\n", "cert_modulus": "d41d8cd98f00b204e9800998ecf8427e\n", "deploy_status_code": 0}
Feb 12 18:01:44 overcloud-controller-0.localdomain os-collect-config[8315]: 139749505574816:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c:703:Expecting: TRUSTED CERTIFICATE
Feb 12 18:01:44 overcloud-controller-0.localdomain os-collect-config[8315]: unable to load Private Key
Feb 12 18:01:44 overcloud-controller-0.localdomain os-collect-config[8315]: 140472523294624:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c:703:Expecting: ANY PRIVATE KEY
Feb 12 18:01:44 overcloud-controller-0.localdomain os-collect-config[8315]: [2016-02-12 13:01:44,32

@mcornea, is the TLS cert being set correctly with

-e ~/templates/enable-tls-external-lb.yaml \
-e ~/templates/inject-trust-anchor.yaml \

(i.e. can you sanity-check the certificate data being passed)?
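(For what it's worth, those checksums already tell the story: d41d8cd98f00b204e9800998ecf8427e is the md5sum of an empty string, and 68b329da9893e34099c7d8ad5cb9c940 is the md5sum of a single newline, so overcloud_endpoint.pem effectively contains no certificate or key. A quick manual check on a controller would be something along these lines — generic openssl usage, not commands from the sosreport:

openssl x509 -noout -modulus -in /etc/pki/tls/private/overcloud_endpoint.pem
openssl rsa -noout -modulus -in /etc/pki/tls/private/overcloud_endpoint.pem
md5sum /etc/pki/tls/private/overcloud_endpoint.pem

On a correctly deployed node the two moduli should match each other and be non-empty.)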
Indeed, I am passing an empty string for the certificates and key in enable-tls-external-lb.yaml:

parameter_defaults:
  SSLCertificate: ""
  SSLIntermediateCertificate: ""
  SSLKey: ""

This is intended though, as the certificates get configured on the external load balancer, and communication between the external load balancer and the controllers is unencrypted. I can try a deployment passing all the certs and key data in enable-tls-external-lb.yaml to rule out whether it is caused by this. Will update asap.
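(One way to double-check that split — TLS terminated on the LB, plain HTTP behind it — is to hit the public and internal keystone endpoints directly. This is an illustrative sketch: the hostname and controller IP are taken from the environment files above, and the ports from the EndpointMap posted later in this bug:

curl -vk https://rxtx.ro:13000/        # public endpoint, terminated on the external LB
curl -v http://172.16.20.150:5000/     # keystone on a controller internal_api IP, plain HTTP
)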
I am assuming enable-tls-external-lb.yaml is based on the enable-tls.yaml environment in tree, and that probably won't work. The _only_ thing that needs to be set for external load balancer SSL is the EndpointMap. The rest of enable-tls.yaml is either meaningless or downright harmful. The environment file should look something like this:

parameter_defaults:
  EndpointMap:
    [endpoint map entries from enable-tls.yaml]

Nothing else should be needed there. The only other thing that might need to be passed is inject-trust-anchor.yaml, if the certificate is self-signed. In that case the included environment file can be used as-is (with the appropriate value filled in, of course).
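(To make the shape concrete — a sketch only, with two example entries borrowed from the full map Marius posts in a later comment, not a complete file:

parameter_defaults:
  EndpointMap:
    KeystoneAdmin: {protocol: 'http', port: '35357', host: 'IP_ADDRESS'}
    KeystonePublic: {protocol: 'https', port: '13000', host: 'CLOUDNAME'}
    # ...remaining entries as in enable-tls.yaml...

Note there is no SSLCertificate/SSLKey and no resource_registry override in this sketch.)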
Added release note doctext explaining how this should work.
Trying again without the collision due to not refreshing the page...
Passing the enable-tls.yaml below results in:

Deploying templates in the directory /home/stack/templates/my-overcloud
Stack failed with status: Resource CREATE failed: resources[1]: resources.Controller.Property error: resources.NodeTLSData.properties: Property SSLKey not assigned
ERROR: openstack Heat Stack create failed.

parameter_defaults:
  EndpointMap:
    CeilometerAdmin: {protocol: 'http', port: '8777', host: 'IP_ADDRESS'}
    CeilometerInternal: {protocol: 'http', port: '8777', host: 'IP_ADDRESS'}
    CeilometerPublic: {protocol: 'https', port: '13777', host: 'CLOUDNAME'}
    CinderAdmin: {protocol: 'http', port: '8776', host: 'IP_ADDRESS'}
    CinderInternal: {protocol: 'http', port: '8776', host: 'IP_ADDRESS'}
    CinderPublic: {protocol: 'https', port: '13776', host: 'CLOUDNAME'}
    GlanceAdmin: {protocol: 'http', port: '9292', host: 'IP_ADDRESS'}
    GlanceInternal: {protocol: 'http', port: '9292', host: 'IP_ADDRESS'}
    GlancePublic: {protocol: 'https', port: '13292', host: 'CLOUDNAME'}
    GlanceRegistryAdmin: {protocol: 'http', port: '9191', host: 'IP_ADDRESS'}
    GlanceRegistryInternal: {protocol: 'http', port: '9191', host: 'IP_ADDRESS'}
    GlanceRegistryPublic: {protocol: 'https', port: '9191', host: 'IP_ADDRESS'} # Not set on the loadbalancer yet.
    HeatAdmin: {protocol: 'http', port: '8004', host: 'IP_ADDRESS'}
    HeatInternal: {protocol: 'http', port: '8004', host: 'IP_ADDRESS'}
    HeatPublic: {protocol: 'https', port: '13004', host: 'CLOUDNAME'}
    HorizonPublic: {protocol: 'https', port: '443', host: 'CLOUDNAME'}
    KeystoneAdmin: {protocol: 'http', port: '35357', host: 'IP_ADDRESS'}
    KeystoneInternal: {protocol: 'http', port: '5000', host: 'IP_ADDRESS'}
    KeystonePublic: {protocol: 'https', port: '13000', host: 'CLOUDNAME'}
    NeutronAdmin: {protocol: 'http', port: '9696', host: 'IP_ADDRESS'}
    NeutronInternal: {protocol: 'http', port: '9696', host: 'IP_ADDRESS'}
    NeutronPublic: {protocol: 'https', port: '13696', host: 'CLOUDNAME'}
    NovaAdmin: {protocol: 'http', port: '8774', host: 'IP_ADDRESS'}
    NovaInternal: {protocol: 'http', port: '8774', host: 'IP_ADDRESS'}
    NovaPublic: {protocol: 'https', port: '13774', host: 'CLOUDNAME'}
    NovaEC2Admin: {protocol: 'http', port: '8773', host: 'IP_ADDRESS'}
    NovaEC2Internal: {protocol: 'http', port: '8773', host: 'IP_ADDRESS'}
    NovaEC2Public: {protocol: 'https', port: '13773', host: 'CLOUDNAME'}
    NovaVNCProxyAdmin: {protocol: 'http', port: '6080', host: 'IP_ADDRESS'}
    NovaVNCProxyInternal: {protocol: 'http', port: '6080', host: 'IP_ADDRESS'}
    NovaVNCProxyPublic: {protocol: 'https', port: '13080', host: 'CLOUDNAME'}
    SwiftAdmin: {protocol: 'http', port: '8080', host: 'IP_ADDRESS'}
    SwiftInternal: {protocol: 'http', port: '8080', host: 'IP_ADDRESS'}
    SwiftPublic: {protocol: 'https', port: '13808', host: 'CLOUDNAME'}

resource_registry:
  OS::TripleO::NodeTLSData: /home/stack/templates/my-overcloud/puppet/extraconfig/tls/tls-cert-inject.yaml
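(A hedged reading of that failure, inferred from the error message rather than stated anywhere in this bug: the resource_registry section above still maps OS::TripleO::NodeTLSData to tls-cert-inject.yaml, and that template is what demands the SSLKey property. Keeping the EndpointMap but dropping the mapping — so OS::TripleO::NodeTLSData keeps its in-tree default — should avoid the 'Property SSLKey not assigned' error:

parameter_defaults:
  EndpointMap:
    # ...same entries as above...
# no resource_registry override for OS::TripleO::NodeTLSData; the default
# is kept, so no SSLKey is required
)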
Hi Marius,

Does the workaround from Comment 5 still work?

Angus
(In reply to Angus Thomas from comment #17)
> Hi Marius,
>
> Does the workaround from Comment 5 still work?
>
> Angus

Not when using an enable-tls.yaml, according to comment#16. It fails early, so the overcloud doesn't get deployed. I'm currently trying to pass all the certificates and key in the enable-tls.yaml environment and will get back with the result.
I did some more testing and reached the conclusion that the issue was caused by the undersized virtual host the environment was running on. I ran the same tests on beefier hardware and wasn't able to reproduce it, so I'm closing this as not a bug. The discussion about the right contents of enable-tls.yaml remains open, but let's move it to the docs BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1307045
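(If anyone hits this on similarly small hosts: one quick way to corroborate the undersized-host theory — my suggestion, not something checked in this thread — is to look for OOM-killer activity and memory pressure on the controllers around the time of the failed start:

grep -i 'out of memory' /var/log/messages
dmesg | grep -i oom
free -m
)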
Since the customer case this was reopened for is now closed, I'm going to re-close the bug.
The issue is happening on a controller with 32GB of memory and an SSD. I guess pcs tries to start openstack-heat-engine at the wrong time, when not everything is ready yet.

The pcs status (from CI, not from a live system):

+ pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Apr 13 10:12:43 2018
Last change: Fri Apr 13 09:40:52 2018 by root via cibadmin on controller-0

1 node configured
42 resources configured

Online: [ controller-0 ]

Full list of resources:

 ip-172.17.1.10 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.4.10 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-192.168.24.6 (ocf::heartbeat:IPaddr2): Started controller-0
 Clone Set: haproxy-clone [haproxy]
     Started: [ controller-0 ]
 ip-172.17.3.10 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-10.0.0.101 (ocf::heartbeat:IPaddr2): Started controller-0
 Master/Slave Set: galera-master [galera]
     Masters: [ controller-0 ]
 ip-172.17.1.11 (ocf::heartbeat:IPaddr2): Started controller-0
 Master/Slave Set: redis-master [redis]
     Masters: [ controller-0 ]
...
...
 Clone Set: openstack-ceilometer-alarm-notifier-clone [openstack-ceilometer-alarm-notifier]
     Started: [ controller-0 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Stopped: [ controller-0 ]
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
     Started: [ controller-0 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Started: [ controller-0 ]
...
...
     Started: [ controller-0 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ controller-0 ]
 openstack-cinder-volume (systemd:openstack-cinder-volume): Started controller-0
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Started: [ controller-0 ]

Failed Actions:
* openstack-heat-engine_start_0 on controller-0 'not running' (7): call=181, status=complete, exitreason='none', last-rc-change='Fri Apr 13 09:33:45 2018', queued=0ms, exec=2115ms

heat-engine does not have a log file, although the package is installed.
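(With no heat-engine.log to go on, the systemd journal for the unit and a foreground start attempt are probably the next places to look — generic suggestions, not output from this CI run:

journalctl -u openstack-heat-engine --no-pager | tail -50
pcs resource debug-start openstack-heat-engine --full
)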
Since there is already a BZ tracking this documentation issue, I'm closing this one as a duplicate in the meantime.

*** This bug has been marked as a duplicate of bug 1568037 ***