| Summary: | After upgrading the overcloud, overcloud glance is unavailable with a 500 Internal Server Error. | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Marios Andreou <mandreou> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Marios Andreou <mandreou> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Omri Hochman <ohochman> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 8.0 (Liberty) | CC: | jcoufal, jslagle, mburns, mcornea, rhel-osp-director-maint |
| Target Milestone: | ga | Keywords: | TestOnly |
| Target Release: | 8.0 (Liberty) | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | openstack-tripleo-heat-templates-0.8.14-1.el7ost | Doc Type: | Bug Fix |
| Doc Text: |
Cause: The final stage of the upgrade process did not ensure that pacemaker cluster resources were restarted properly, haproxy in particular.
Consequence: The haproxy service was not restarted after the new 'upgrade' configuration was applied to all nodes. As a result, the glance service was unavailable (haproxy was not listening on glance's configured port), and CLI operations such as 'glance image-list' returned "500 Internal Server Error: The server has either erred or is incapable of performing the requested operation. (HTTP 500)".
Fix: Setting the UpdateIdentifier in the environment file used for the final stage of the upgrade process causes the pacemaker resources to be restarted after the new configuration has been applied; see https://github.com/openstack/tripleo-heat-templates/blob/a12087715f0fe4251a95ab67120023d553c24a45/extraconfig/tasks/pacemaker_resource_restart.sh#L11
Result: The pacemaker resources are restarted during the final part of the upgrade, and the upgrade completes successfully. Overcloud glance and all other services are available as expected.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-04-20 11:22:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
Marios Andreou
2016-03-23 14:04:46 UTC
some debug info via jistr:

"Yeah indeed, haproxy wasn't listening on glance-registry port (9191) until manually reloaded. The root cause isn't known at the moment. I didn't have to edit the config file or anything, just a haproxy reload did the trick.

[root@overcloud-controller-0 ~]# netstat -tlpn | grep 9191
tcp        0      0 192.0.2.14:9191     0.0.0.0:*     LISTEN      20951/python2
[root@overcloud-controller-0 ~]# systemctl reload haproxy
[root@overcloud-controller-0 ~]# netstat -tlpn | grep 9191
tcp        0      0 192.0.2.6:9191      0.0.0.0:*     LISTEN      12135/haproxy
tcp        0      0 192.0.2.14:9191     0.0.0.0:*     LISTEN      20951/python2
"

update... just verified that the review at https://review.openstack.org/#/c/296491/ does fix this for now. However, I also noticed that the update_identifier isn't set during this post-puppet-pacemaker step (since it isn't an update...), meaning that the haproxy restart which would normally happen at https://github.com/openstack/tripleo-heat-templates/blob/1dd6de571c79625ccf5520895b764bb9c2dd75d3/extraconfig/tasks/pacemaker_resource_restart.sh does not happen:

Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,186] (heat-config) [INFO] deploy_signal_verb=POST
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,187] (heat-config) [DEBUG] Running /var/lib/heat-config/heat-config-script/31e75b2e-efc5-433c-8e32-2d8ff7a6e0ae
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,525] (heat-config) [INFO]
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,525] (heat-config) [DEBUG] ++ systemctl is-active pacemaker
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: + pacemaker_status=active
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: ++ hiera bootstrap_nodeid
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: ++ facter hostname
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: ++ hiera update_identifier
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: + '[' active = active -a overcloud-controller-0 = overcloud-controller-0 -a nil '!=' nil ']'
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: + '[' active = active ']'
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: + systemctl reload haproxy
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,525] (heat-config) [INFO] Completed /var/lib/heat-config/heat-config-script/31e75b2e-efc5-433c-8e32-2d8ff7a6e0ae
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,528] (heat-config) [INFO] Completed /var/lib/heat-config/hooks/script
Mar 23 16:44:52 overcloud-controller-0.localdomain os-collect-config[4756]: [2016-03-23 16:44:52,528] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/31e75b2e-efc5-433c-8e32-2d8ff7a6e0ae.json < /var/lib/heat-config/deployed/31e75b2e-efc5-433c-8e32-2d8ff7a6e0ae.notify.json

Comment 5
Marius Cornea

This change breaks external load balancer deployments because haproxy is not configured on the controller nodes:

[2016-03-31 11:29:04,061] (heat-config) [DEBUG] ++ systemctl is-active pacemaker
+ pacemaker_status=active
++ hiera bootstrap_nodeid
++ facter hostname
++ hiera update_identifier
+ '[' active = active -a overcloud-controller-0 = overcloud-controller-0 -a nil '!=' nil ']'
+ '[' active = active ']'
+ systemctl reload haproxy
Job for haproxy.service invalid.

[root@overcloud-controller-0 heat-admin]# systemctl status haproxy
● haproxy.service - HAProxy Load Balancer
   Loaded: loaded (/usr/lib/systemd/system/haproxy.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Mar 31 11:29:04 overcloud-controller-0.localdomain systemd[1]: Unit haproxy.service cannot be reloaded because it is inactive.
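The journal trace shows why the resource restart was skipped: `hiera update_identifier` returned "nil", so the three-way guard in pacemaker_resource_restart.sh evaluated false. A minimal sketch of that guard follows; `check_restart_needed` is a helper name invented here for illustration (the real script inlines the test), but the condition mirrors the `'[' active = active -a ... -a nil '!=' nil ']'` line in the trace above.

```shell
#!/bin/bash
# Sketch of the gating logic visible in the trace: resources are only
# restarted when pacemaker is active, this node is the bootstrap node,
# and an update_identifier has been set by the deployment.
check_restart_needed() {
    local pacemaker_status="$1"   # output of: systemctl is-active pacemaker
    local bootstrap_nodeid="$2"   # output of: hiera bootstrap_nodeid
    local node_hostname="$3"      # output of: facter hostname
    local update_identifier="$4"  # output of: hiera update_identifier

    if [ "$pacemaker_status" = "active" -a \
         "$bootstrap_nodeid" = "$node_hostname" -a \
         "$update_identifier" != "nil" ]; then
        echo "restart"   # real script would restart the pacemaker resources
    else
        echo "skip"      # this branch was taken during the broken upgrade
    fi
}

# During the failing upgrade converge, hiera returned "nil":
check_restart_needed active overcloud-controller-0 overcloud-controller-0 nil
# prints: skip

# Once the fix sets UpdateIdentifier, hiera returns a real value:
check_restart_needed active overcloud-controller-0 overcloud-controller-0 update-1
# prints: restart
```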
Marios Andreou

(In reply to Marius Cornea from comment #5)
> This change breaks external load balancer deployments because haproxy is not
> configured on the controller nodes:
>
> [2016-03-31 11:29:04,061] (heat-config) [DEBUG] ++ systemctl is-active
> pacemaker
> + pacemaker_status=active
> ++ hiera bootstrap_nodeid
> ++ facter hostname
> ++ hiera update_identifier
> + '[' active = active -a overcloud-controller-0 = overcloud-controller-0 -a
> nil '!=' nil ']'
> + '[' active = active ']'
> + systemctl reload haproxy
> Job for haproxy.service invalid.
>
> [root@overcloud-controller-0 heat-admin]# systemctl status haproxy
> ● haproxy.service - HAProxy Load Balancer
>    Loaded: loaded (/usr/lib/systemd/system/haproxy.service; disabled; vendor
> preset: disabled)
>    Active: inactive (dead)
>
> Mar 31 11:29:04 overcloud-controller-0.localdomain systemd[1]: Unit
> haproxy.service cannot be reloaded because it is inactive.

hey mcornea - I guess you were using the original fix here, which was https://review.openstack.org/#/c/297288/ "Add systemctl reload haproxy to the pacemaker_resource_restart.sh". Later we landed https://review.openstack.org/#/c/297175/ "Set UpdateIdentifier for upgrade converge, to prevent services down". This removes the original "systemctl reload haproxy" (which causes the issues in your external load balancer env as you have described) and replaces it with the usual service restart at https://github.com/openstack/tripleo-heat-templates/blob/a12087715f0fe4251a95ab67120023d553c24a45/extraconfig/tasks/pacemaker_resource_restart.sh
I am adding /#/c/297175/ to the external links here too, as ultimately it was the fix.

Unable to reproduce with:
openstack-tripleo-heat-templates-0.8.14-7.el7ost.noarch

[stack@undercloud72 ~]$ source overcloudrc
[stack@undercloud72 ~]$ glance image-list
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:251: SecurityWarning: Certificate has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SecurityWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:251: SecurityWarning: Certificate has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SecurityWarning
+--------------------------------------+----------------------------------+-------------+------------------+----------+--------+
| ID                                   | Name                             | Disk Format | Container Format | Size     | Status |
+--------------------------------------+----------------------------------+-------------+------------------+----------+--------+
| 7a3c64b0-eb2c-42a7-83ab-531b3dd488c5 | cirros-0.3.4-x86_64-disk.img     | qcow2       | bare             | 13287936 | active |
| 7cb1e97b-4a29-4b33-b34e-51c3c933a012 | cirros-0.3.4-x86_64-disk.img_alt | qcow2       | bare             | 13287936 | active |
+--------------------------------------+----------------------------------+-------------+------------------+----------+--------+

[root@overcloud-controller-0 ~]# netstat -tlpn | grep 9191
tcp        0      0 10.19.104.14:9191     0.0.0.0:*     LISTEN      15079/python2
tcp        0      0 10.19.104.10:9191     0.0.0.0:*     LISTEN      5044/haproxy
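For reference, the shape of the landed fix: the final (converge) step of the upgrade passes an environment file that sets the UpdateIdentifier parameter, so `hiera update_identifier` on the nodes no longer returns "nil" and pacemaker_resource_restart.sh performs the resource restart. A hypothetical sketch, assuming standard Heat environment file syntax; the file name and identifier value below are invented for illustration and the real value is supplied by the upgrade tooling.

```shell
# Write an example Heat environment file setting UpdateIdentifier
# (file name and value are illustrative, not the actual tooling output).
cat > upgrade-converge-example.yaml <<'EOF'
parameter_defaults:
  UpdateIdentifier: 'example-upgrade-identifier'
EOF

# The converge deploy would then include this environment, e.g.:
#   openstack overcloud deploy --templates -e upgrade-converge-example.yaml ...
# With UpdateIdentifier set, the guard in pacemaker_resource_restart.sh
# passes and the pacemaker resources (haproxy among them) are restarted
# after the new configuration is applied.
grep -q 'UpdateIdentifier' upgrade-converge-example.yaml && echo "identifier set"
# prints: identifier set
```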