Description of problem:
After upgrading the overcloud, overcloud glance is unavailable with a 500 Internal Server Error.

Version-Release number of selected component (if applicable):

How reproducible:
Every time

Steps to Reproduce:

1. Deploy a 7.3 poodle HA overcloud with net-iso, like:

openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server "0.fedora.pool.ntp.org"

2. Then upgrade. First run the upgrade init and deliver the upgrade scripts to the non-controllers:

openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml -e rhos-release-8.yaml

3. Then upgrade the controllers (no swift nodes, so skip that step):

openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml

4. Then upgrade the compute node (no ceph nodes):

upgrade-non-controller.sh --upgrade overcloud-novacompute-0

5. Then run the final upgrade convergence step, like:

openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml

This last step ^^^ is the only place, apart from the original deploy command above, where we run the puppet/pacemaker config, including the pacemaker resource restarts at https://github.com/openstack/tripleo-heat-templates/blob/1dd6de571c79625ccf5520895b764bb9c2dd75d3/extraconfig/tasks/pacemaker_resource_restart.sh (i.e. the entire ControllerPostDeployment is a no-op in the upgrade environment files, e.g. https://github.com/openstack/tripleo-heat-templates/blob/1dd6de571c79625ccf5520895b764bb9c2dd75d3/environments/major-upgrade-pacemaker-init.yaml#L7 ).

When upgrading to latest stable/liberty, after this last convergence step, overcloud glance is down. From the undercloud, for example:

source overcloudrc
glance image-list
500 Internal Server Error: The server has either erred or is incapable of performing the requested operation. (HTTP 500)

The glance logs from controller-0 look like:

[root@overcloud-controller-0 ~]# tail /var/log/glance/registry.log
2016-03-23 12:23:54.945 25180 WARNING keystonemiddleware.auth_token [-] Use of the auth_admin_prefix, auth_host, auth_port, auth_protocol, identity_uri, admin_token, admin_user, admin_password, and admin_tenant_name configuration options is deprecated in favor of auth_plugin and related options and may be removed in a future release.
2016-03-23 12:51:16.518 31067 WARNING keystonemiddleware.auth_token [-] Use of the auth_admin_prefix, auth_host, auth_port, auth_protocol, identity_uri, admin_token, admin_user, admin_password, and admin_tenant_name configuration options is deprecated in favor of auth_plugin and related options and may be removed in a future release.

[root@overcloud-controller-0 ~]# tail /var/log/glance/api.log
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi   File "/usr/lib/python2.7/site-packages/glance/common/client.py", line 71, in wrapped
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi     return func(self, *args, **kwargs)
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi   File "/usr/lib/python2.7/site-packages/glance/common/client.py", line 375, in do_request
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi     headers=copy.deepcopy(headers))
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi   File "/usr/lib/python2.7/site-packages/glance/common/client.py", line 88, in wrapped
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi     return func(self, method, url, body, headers)
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi   File "/usr/lib/python2.7/site-packages/glance/common/client.py", line 540, in _do_request
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi     raise exception.ClientConnectionError(e)
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi ClientConnectionError: [Errno 111] ECONNREFUSED
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi

Actual results:
As above.

Expected results:
Overcloud glance is working OK.

Additional info:
The fix for me was to add a reload of haproxy to the pacemaker resource restart which is run by post_puppet_pacemaker.yaml - review incoming (testing)
Some debug info via jistr:

"Yeah indeed, haproxy wasn't listening on the glance-registry port (9191) until manually reloaded. The root cause isn't known at the moment. I didn't have to edit the config file or anything; just a haproxy reload did the trick.

[root@overcloud-controller-0 ~]# netstat -tlpn | grep 9191
tcp        0      0 192.0.2.14:9191    0.0.0.0:*    LISTEN    20951/python2
[root@overcloud-controller-0 ~]# systemctl reload haproxy
[root@overcloud-controller-0 ~]# netstat -tlpn | grep 9191
tcp        0      0 192.0.2.6:9191     0.0.0.0:*    LISTEN    12135/haproxy
tcp        0      0 192.0.2.14:9191    0.0.0.0:*    LISTEN    20951/python2
"
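The broken state is recognizable from the netstat output: only the glance-registry python process holds a 9191 listener, with no haproxy frontend alongside it. A minimal sketch of that check (the `haproxy_listens_on` helper is hypothetical, not part of any tripleo script; it just encodes the diagnosis above):

```shell
#!/bin/sh
# Hypothetical helper: given `netstat -tlpn` output on stdin, report whether
# haproxy holds a listener on the given port. In the broken state only the
# glance-registry python process was on 9191; after `systemctl reload haproxy`
# the haproxy frontend appeared as well.
haproxy_listens_on() {
  port=$1
  # Keep only listeners on the requested port, then look for haproxy as owner.
  grep ":$port " | grep -q haproxy && echo yes || echo no
}
```

Feeding it the two captures above would print "no" before the reload and "yes" after.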
Update: just verified that the review at https://review.openstack.org/#/c/296491/ does fix this for now. However, I also noticed that the update_identifier isn't set during this post-puppet-pacemaker run (since it isn't an update...), meaning that the haproxy restart (which would normally happen) at https://github.com/openstack/tripleo-heat-templates/blob/1dd6de571c79625ccf5520895b764bb9c2dd75d3/extraconfig/tasks/pacemaker_resource_restart.sh does not happen:

Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,186] (heat-config) [INFO] deploy_signal_verb=POST
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,187] (heat-config) [DEBUG] Running /var/lib/heat-config/heat-config-script/31e75b2e-efc5-433c-8e32-2d8ff7a6e0ae
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,525] (heat-config) [INFO]
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,525] (heat-config) [DEBUG] ++ systemctl is-active pacemaker
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: + pacemaker_status=active
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: ++ hiera bootstrap_nodeid
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: ++ facter hostname
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: ++ hiera update_identifier
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: + '[' active = active -a overcloud-controller-0 = overcloud-controller-0 -a nil '!=' nil ']'
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: + '[' active = active ']'
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: + systemctl reload haproxy
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,525] (heat-config) [INFO] Completed /var/lib/heat-config/heat-config-script/31e75b2e-efc5-433c-8e32-2d8ff7a6e0ae
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,528] (heat-config) [INFO] Completed /var/lib/heat-config/hooks/script
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,528] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/31e75b2e-efc5-433c-8e32-2d8ff7a6e0ae.json < /var/lib/heat-config/deployed/31e75b2e-efc5-433c-8e32-2d8ff7a6e0ae.notify.json
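The key part of the trace above is the failing test `nil '!=' nil`: the full resource restart only runs when pacemaker is active, the node is the bootstrap node, AND update_identifier is set. A minimal sketch of that guard (the wrapper function and its echoed result are mine, for illustration; only the three-way condition comes from pacemaker_resource_restart.sh):

```shell
#!/bin/sh
# Sketch of the guard in pacemaker_resource_restart.sh. In the real script the
# inputs come from `systemctl is-active pacemaker`, `hiera bootstrap_nodeid`,
# `facter hostname` and `hiera update_identifier` (hiera prints "nil" for an
# unset key). During upgrade converge update_identifier was unset, so the
# restart branch was skipped.
should_restart_resources() {
  pacemaker_status=$1
  bootstrap_nodeid=$2
  node_hostname=$3
  update_identifier=$4
  if [ "$pacemaker_status" = "active" ] \
     && [ "$bootstrap_nodeid" = "$node_hostname" ] \
     && [ "$update_identifier" != "nil" ]; then
    echo restart
  else
    echo skip
  fi
}
```

With the values from the trace (active, matching hostname, but update_identifier = nil) this yields "skip", which is exactly why the restarts never happened during converge.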
This change breaks external load balancer deployments because haproxy is not configured on the controller nodes:

[2016-03-31 11:29:04,061] (heat-config) [DEBUG] ++ systemctl is-active pacemaker
+ pacemaker_status=active
++ hiera bootstrap_nodeid
++ facter hostname
++ hiera update_identifier
+ '[' active = active -a overcloud-controller-0 = overcloud-controller-0 -a nil '!=' nil ']'
+ '[' active = active ']'
+ systemctl reload haproxy
Job for haproxy.service invalid.

[root@overcloud-controller-0 heat-admin]# systemctl status haproxy
● haproxy.service - HAProxy Load Balancer
   Loaded: loaded (/usr/lib/systemd/system/haproxy.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Mar 31 11:29:04 overcloud-controller-0.localdomain systemd[1]: Unit haproxy.service cannot be reloaded because it is inactive.
(In reply to Marius Cornea from comment #5)
> This change breaks external load balancer deployments because haproxy is not
> configured on the controller nodes:
>
> [2016-03-31 11:29:04,061] (heat-config) [DEBUG] ++ systemctl is-active
> pacemaker
> + pacemaker_status=active
> ++ hiera bootstrap_nodeid
> ++ facter hostname
> ++ hiera update_identifier
> + '[' active = active -a overcloud-controller-0 = overcloud-controller-0 -a
> nil '!=' nil ']'
> + '[' active = active ']'
> + systemctl reload haproxy
> Job for haproxy.service invalid.
>
> [root@overcloud-controller-0 heat-admin]# systemctl status haproxy
> ● haproxy.service - HAProxy Load Balancer
> Loaded: loaded (/usr/lib/systemd/system/haproxy.service; disabled; vendor
> preset: disabled)
> Active: inactive (dead)
>
> Mar 31 11:29:04 overcloud-controller-0.localdomain systemd[1]: Unit
> haproxy.service cannot be reloaded because it is inactive.

hey mcornea - I guess you were using the original fix here, which was https://review.openstack.org/#/c/297288/ "Add systemctl reload haproxy to the pacemaker_resource_restart.sh". Later we landed https://review.openstack.org/#/c/297175/ "Set UpdateIdentifier for upgrade converge, to prevent services down". That change removes the original "systemctl reload haproxy" (which causes the issue in your external load balancer env, as you described) and replaces it with the usual service restart at https://github.com/openstack/tripleo-heat-templates/blob/a12087715f0fe4251a95ab67120023d553c24a45/extraconfig/tasks/pacemaker_resource_restart.sh. I am adding /#/c/297175/ to the external links here too, as ultimately it was the fix.
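The external load balancer failure above comes down to systemd refusing to reload an inactive unit ("Job for haproxy.service invalid"). Any unconditional reload would have needed gating on the unit state first; a rough sketch of that gating (this helper is hypothetical and not what actually landed, which was setting UpdateIdentifier instead; in the real case `state` would come from `systemctl is-active "$unit"`):

```shell
#!/bin/sh
# Hypothetical guard illustrating the failure mode: `systemctl reload` on an
# inactive unit is an error, so a reload must be conditional on unit state.
# For testability this takes the state as a parameter and echoes the action
# rather than executing systemctl.
reload_if_active() {
  unit=$1
  state=$2   # real case: state=$(systemctl is-active "$unit")
  if [ "$state" = "active" ]; then
    echo "systemctl reload $unit"
  else
    echo "skip: $unit is $state"
  fi
}
```

On an external-LB controller, where haproxy is inactive (dead), this would skip the reload instead of failing the deploy step.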
Unable to reproduce with:
openstack-tripleo-heat-templates-0.8.14-7.el7ost.noarch

[stack@undercloud72 ~]$ source overcloudrc
[stack@undercloud72 ~]$ glance image-list
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:251: SecurityWarning: Certificate has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SecurityWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:251: SecurityWarning: Certificate has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SecurityWarning
+--------------------------------------+----------------------------------+-------------+------------------+----------+--------+
| ID                                   | Name                             | Disk Format | Container Format | Size     | Status |
+--------------------------------------+----------------------------------+-------------+------------------+----------+--------+
| 7a3c64b0-eb2c-42a7-83ab-531b3dd488c5 | cirros-0.3.4-x86_64-disk.img     | qcow2       | bare             | 13287936 | active |
| 7cb1e97b-4a29-4b33-b34e-51c3c933a012 | cirros-0.3.4-x86_64-disk.img_alt | qcow2       | bare             | 13287936 | active |
+--------------------------------------+----------------------------------+-------------+------------------+----------+--------+

[root@overcloud-controller-0 ~]# netstat -tlpn | grep 9191
tcp        0      0 10.19.104.14:9191    0.0.0.0:*    LISTEN    15079/python2
tcp        0      0 10.19.104.10:9191    0.0.0.0:*    LISTEN    5044/haproxy