Bug 1320567 - After upgrading the overcloud, overcloud glance is unavailable with a 500 Internal Server Error.
Summary: After upgrading the overcloud, overcloud glance is unavailable with a 500 Internal Server Error.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ga
Target Release: 8.0 (Liberty)
Assignee: Marios Andreou
QA Contact: Omri Hochman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-03-23 14:04 UTC by Marios Andreou
Modified: 2016-04-20 11:22 UTC

Fixed In Version: openstack-tripleo-heat-templates-0.8.14-1.el7ost
Doc Type: Bug Fix
Doc Text:
Cause: The final stage of the upgrades process wasn't ensuring that pacemaker cluster resources were restarted properly, haproxy in particular.
Consequence: The haproxy service wasn't restarted after the new 'upgrade' configuration was applied to all nodes. As a result, the glance service was unavailable (not listening on its configured port) and returned "500 Internal Server Error: The server has either erred or is incapable of performing the requested operation. (HTTP 500)" for CLI operations such as 'glance image-list'.
Fix: Setting the UpdateIdentifier in the environment file used for the final stage of the upgrades process causes the pacemaker resources to be restarted after the new configuration has been applied, at https://github.com/openstack/tripleo-heat-templates/blob/a12087715f0fe4251a95ab67120023d553c24a45/extraconfig/tasks/pacemaker_resource_restart.sh#L11
Result: The pacemaker resources are restarted during the final part of the upgrade and the upgrade completes successfully. Overcloud glance and all other services are available as expected.
Clone Of:
Environment:
Last Closed: 2016-04-20 11:22:02 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1561012 0 None None None 2016-03-23 17:24:44 UTC
OpenStack gerrit 296491 0 'None' 'MERGED' 'Add systemctl reload haproxy to the pacemaker_resource_restart.sh' 2019-11-29 15:57:42 UTC
OpenStack gerrit 297175 0 'None' 'MERGED' 'Set UpdateIdentifier for upgrade converge, to prevent services down' 2019-11-29 15:57:42 UTC
OpenStack gerrit 297288 0 'None' 'MERGED' 'Add systemctl reload haproxy to the pacemaker_resource_restart.sh' 2019-11-29 15:57:42 UTC

Description Marios Andreou 2016-03-23 14:04:46 UTC
Description of problem:

After upgrading the overcloud, overcloud glance is unavailable with a 500 Internal Server Error.


Version-Release number of selected component (if applicable):


How reproducible:
Every time

Steps to Reproduce:
Deploy a 7.3 poodle HA overcloud with net-iso like:

openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server "0.fedora.pool.ntp.org"

Then upgrade; first do the upgrade init and deliver the upgrade scripts to non-controllers:

openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e  /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml -e rhos-release-8.yaml

Then upgrade the controllers (no swift nodes so skip that step):

openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e  /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml

Then upgrade compute node (no ceph nodes):

upgrade-non-controller.sh --upgrade overcloud-novacompute-0

Then run the final upgrade convergence step like:

openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e  /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml

This last step ^^^ is the only place, apart from the original deploy command above, where we are running the puppet/pacemaker config including the pacemaker resource restarts at  https://github.com/openstack/tripleo-heat-templates/blob/1dd6de571c79625ccf5520895b764bb9c2dd75d3/extraconfig/tasks/pacemaker_resource_restart.sh (i.e. the entire ControllerPostDeployment is noop in the upgrade environment files, like https://github.com/openstack/tripleo-heat-templates/blob/1dd6de571c79625ccf5520895b764bb9c2dd75d3/environments/major-upgrade-pacemaker-init.yaml#L7 ). 
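A minimal sketch of the guard condition inside pacemaker_resource_restart.sh, as reconstructed from the shell trace in comment 3 below; the variable assignments are stand-ins for the real `systemctl is-active`, `hiera`, and `facter` calls, so this is an illustration rather than the actual script:

```shell
#!/bin/bash
# Stand-in values simulating the calls traced in the os-collect-config log:
pacemaker_status="active"                   # systemctl is-active pacemaker
bootstrap_nodeid="overcloud-controller-0"   # hiera bootstrap_nodeid
this_host="overcloud-controller-0"          # facter hostname
update_identifier="nil"                     # hiera update_identifier ("nil" when unset)

# The resource restart only fires when pacemaker is active, we are on the
# bootstrap node, AND an update_identifier has been set.
if [ "$pacemaker_status" = "active" -a \
     "$bootstrap_nodeid" = "$this_host" -a \
     "$update_identifier" != "nil" ]; then
    action="restart"
else
    action="skip"
fi
echo "pacemaker resource restart: $action"
```

With update_identifier unset (the situation described in this bug), the third test fails and the whole restart block is skipped, even on the bootstrap node with pacemaker active.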

When upgrading to latest stable/liberty and after this last convergence step, overcloud glance is down. From the undercloud for example:

source overcloudrc
glance image-list

500 Internal Server Error: The server has either erred or is incapable of performing the requested operation. (HTTP 500)

The glance logs from controller-0 look like this:

[root@overcloud-controller-0 ~]# tail /var/log/glance/registry.log 
2016-03-23 12:23:54.945 25180 WARNING keystonemiddleware.auth_token [-] Use of the auth_admin_prefix, auth_host, auth_port, auth_protocol, identity_uri, admin_token, admin_user, admin_password, and admin_tenant_name configuration options is deprecated in favor of auth_plugin and related options and may be removed in a future release.
2016-03-23 12:51:16.518 31067 WARNING keystonemiddleware.auth_token [-] Use of the auth_admin_prefix, auth_host, auth_port, auth_protocol, identity_uri, admin_token, admin_user, admin_password, and admin_tenant_name configuration options is deprecated in favor of auth_plugin and related options and may be removed in a future release.
[root@overcloud-controller-0 ~]# tail /var/log/glance/api.log 
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi   File "/usr/lib/python2.7/site-packages/glance/common/client.py", line 71, in wrapped
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi     return func(self, *args, **kwargs)
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi   File "/usr/lib/python2.7/site-packages/glance/common/client.py", line 375, in do_request
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi     headers=copy.deepcopy(headers))
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi   File "/usr/lib/python2.7/site-packages/glance/common/client.py", line 88, in wrapped
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi     return func(self, method, url, body, headers)
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi   File "/usr/lib/python2.7/site-packages/glance/common/client.py", line 540, in _do_request
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi     raise exception.ClientConnectionError(e)
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi ClientConnectionError: [Errno 111] ECONNREFUSED
2016-03-23 13:04:33.203 31114 ERROR glance.common.wsgi 

Actual results:
as above

Expected results:
Overcloud glance should be available and working after the upgrade.

Additional info:

The fix for me was to add a reload of haproxy to the pacemaker resource restart, which is run by post_puppet_pacemaker.yaml. Review incoming (currently testing).

Comment 2 Marios Andreou 2016-03-23 14:40:05 UTC
some debug info via jistr:
"
Yeah indeed, haproxy wasn't listening on glance-registry port (9191) until manually reloaded. The root cause isn't known at the moment. I didn't have to edit the config file or anything, just haproxy reload did the trick.

[root@overcloud-controller-0 ~]# netstat -tlpn | grep 9191
tcp 0 0 192.0.2.14:9191 0.0.0.0:* LISTEN 20951/python2
[root@overcloud-controller-0 ~]# systemctl reload haproxy
[root@overcloud-controller-0 ~]# netstat -tlpn | grep 9191
tcp 0 0 192.0.2.6:9191 0.0.0.0:* LISTEN 12135/haproxy
tcp 0 0 192.0.2.14:9191 0.0.0.0:* LISTEN 20951/python2
"

Comment 3 Marios Andreou 2016-03-23 16:54:31 UTC
Update: just verified that the review at https://review.openstack.org/#/c/296491/ does fix this for now. However, I also noticed that the update_identifier isn't set during this post-puppet-pacemaker step (since it isn't an update...), meaning that the haproxy restart (which normally would happen) at
https://github.com/openstack/tripleo-heat-templates/blob/1dd6de571c79625ccf5520895b764bb9c2dd75d3/extraconfig/tasks/pacemaker_resource_restart.sh does not happen:

Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,186] (heat-config) [INFO] deploy_signal_verb=POST
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,187] (heat-config) [DEBUG] Running /var/lib/heat-config/heat-config-script/31e75b2e-efc5-433c-8e32-2d8ff7a6e0ae
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,525] (heat-config) [INFO]
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,525] (heat-config) [DEBUG] ++ systemctl is-active pacemaker
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: + pacemaker_status=active
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: ++ hiera bootstrap_nodeid
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: ++ facter hostname
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: ++ hiera update_identifier
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: + '[' active = active -a overcloud-controller-0 = overcloud-controller-0 -a nil '!=' nil ']'
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: + '[' active = active ']'
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: + systemctl reload haproxy
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,525] (heat-config) [INFO] Completed /var/lib/heat-config/heat-config-script/31e75b2e-efc5-433c-8e32-2d8ff7a6e0ae
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,528] (heat-config) [INFO] Completed /var/lib/heat-config/hooks/script
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,528] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/31e75b2e-efc5-433c-8e32-2d8ff7a6e0ae.json < /var/lib/heat-config/deployed/31e75b2e-efc5-433c-8e32-2d8ff7a6e0ae.notify.json

Comment 5 Marius Cornea 2016-03-31 12:22:12 UTC
This change breaks external load balancer deployments because haproxy is not configured on the controller nodes:

[2016-03-31 11:29:04,061] (heat-config) [DEBUG] ++ systemctl is-active pacemaker
+ pacemaker_status=active
++ hiera bootstrap_nodeid
++ facter hostname
++ hiera update_identifier
+ '[' active = active -a overcloud-controller-0 = overcloud-controller-0 -a nil '!=' nil ']'
+ '[' active = active ']'
+ systemctl reload haproxy
Job for haproxy.service invalid.

[root@overcloud-controller-0 heat-admin]# systemctl status haproxy
● haproxy.service - HAProxy Load Balancer
   Loaded: loaded (/usr/lib/systemd/system/haproxy.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Mar 31 11:29:04 overcloud-controller-0.localdomain systemd[1]: Unit haproxy.service cannot be reloaded because it is inactive.
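The failure mode here is that `systemctl reload` is invalid for an inactive unit. A hypothetical guard that would have avoided it checks the unit state first; `haproxy_state` is a stand-in for a real `systemctl is-active haproxy` call, so this is a sketch of the idea rather than the shipped fix (which instead dropped the blind reload, see comment 8):

```shell
#!/bin/bash
# Stand-in for: systemctl is-active haproxy
# On an external-load-balancer deployment haproxy is disabled/inactive
# on the controllers, as shown in the systemctl status output above.
haproxy_state="inactive"

# Only attempt a reload when the unit is actually running; reloading an
# inactive unit fails with "Job for haproxy.service invalid."
if [ "$haproxy_state" = "active" ]; then
    decision="reload"
else
    decision="skip"
fi
echo "haproxy reload decision: $decision"
```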

Comment 8 Marios Andreou 2016-04-13 13:58:03 UTC
(In reply to Marius Cornea from comment #5)
> This change breaks external load balancer deployments because haproxy is not
> configured on the controller nodes:
> 
> [2016-03-31 11:29:04,061] (heat-config) [DEBUG] ++ systemctl is-active
> pacemaker
> + pacemaker_status=active
> ++ hiera bootstrap_nodeid
> ++ facter hostname
> ++ hiera update_identifier
> + '[' active = active -a overcloud-controller-0 = overcloud-controller-0 -a
> nil '!=' nil ']'
> + '[' active = active ']'
> + systemctl reload haproxy
> Job for haproxy.service invalid.
> 
> [root@overcloud-controller-0 heat-admin]# systemctl status haproxy
> ● haproxy.service - HAProxy Load Balancer
>    Loaded: loaded (/usr/lib/systemd/system/haproxy.service; disabled; vendor
> preset: disabled)
>    Active: inactive (dead)
> 
> Mar 31 11:29:04 overcloud-controller-0.localdomain systemd[1]: Unit
> haproxy.service cannot be reloaded because it is inactive.

hey mcornea - I guess you were using the original fix here, which was https://review.openstack.org/#/c/297288/ "Add systemctl reload haproxy to the pacemaker_resource_restart.sh". Later we landed https://review.openstack.org/#/c/297175/ "Set UpdateIdentifier for upgrade converge, to prevent services down". That removes the original "systemctl reload haproxy" (which causes the issues in your external load balancer env that you described) and replaces it with the usual service restart at https://github.com/openstack/tripleo-heat-templates/blob/a12087715f0fe4251a95ab67120023d553c24a45/extraconfig/tasks/pacemaker_resource_restart.sh

I am adding /#/c/297175/ to the external links here too, as ultimately it was the fix.

Comment 11 Omri Hochman 2016-04-18 18:37:07 UTC
Unable to reproduce with: 
openstack-tripleo-heat-templates-0.8.14-7.el7ost.noarch


[stack@undercloud72 ~]$ source overcloudrc

[stack@undercloud72 ~]$ glance image-list
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:251: SecurityWarning: Certificate has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SecurityWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:251: SecurityWarning: Certificate has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SecurityWarning
+--------------------------------------+----------------------------------+-------------+------------------+----------+--------+
| ID                                   | Name                             | Disk Format | Container Format | Size     | Status |
+--------------------------------------+----------------------------------+-------------+------------------+----------+--------+
| 7a3c64b0-eb2c-42a7-83ab-531b3dd488c5 | cirros-0.3.4-x86_64-disk.img     | qcow2       | bare             | 13287936 | active |
| 7cb1e97b-4a29-4b33-b34e-51c3c933a012 | cirros-0.3.4-x86_64-disk.img_alt | qcow2       | bare             | 13287936 | active |
+--------------------------------------+----------------------------------+-------------+------------------+----------+--------+



[root@overcloud-controller-0 ~]# netstat -tlpn | grep 9191
tcp        0      0 10.19.104.14:9191       0.0.0.0:*               LISTEN      15079/python2       
tcp        0      0 10.19.104.10:9191       0.0.0.0:*               LISTEN      5044/haproxy

