Description of problem: see upstream bug.
Adding link to review.
Hi, after the upgrade we are bitten by the deprecation of the hdp plugin. This is fixed in the template here: https://bugs.launchpad.net/tripleo/+bug/1611107. It needs to be fixed during the upgrade as well, I think. Traceback for reference:

2016-08-19 17:11:18.916 9258 ERROR sahara Traceback (most recent call last):
2016-08-19 17:11:18.916 9258 ERROR sahara   File "/usr/bin/sahara-api", line 10, in <module>
2016-08-19 17:11:18.916 9258 ERROR sahara     sys.exit(main())
2016-08-19 17:11:18.916 9258 ERROR sahara   File "/usr/lib/python2.7/site-packages/sahara/cli/sahara_api.py", line 53, in main
2016-08-19 17:11:18.916 9258 ERROR sahara     app = setup_api()
2016-08-19 17:11:18.916 9258 ERROR sahara   File "/usr/lib/python2.7/site-packages/sahara/cli/sahara_api.py", line 43, in setup_api
2016-08-19 17:11:18.916 9258 ERROR sahara     server.setup_common(possible_topdir, 'API')
2016-08-19 17:11:18.916 9258 ERROR sahara   File "/usr/lib/python2.7/site-packages/sahara/main.py", line 84, in setup_common
2016-08-19 17:11:18.916 9258 ERROR sahara     plugins_base.setup_plugins()
2016-08-19 17:11:18.916 9258 ERROR sahara   File "/usr/lib/python2.7/site-packages/sahara/plugins/base.py", line 163, in setup_plugins
2016-08-19 17:11:18.916 9258 ERROR sahara     PLUGINS = PluginManager()
2016-08-19 17:11:18.916 9258 ERROR sahara   File "/usr/lib/python2.7/site-packages/sahara/plugins/base.py", line 85, in __init__
2016-08-19 17:11:18.916 9258 ERROR sahara     self._load_cluster_plugins()
2016-08-19 17:11:18.916 9258 ERROR sahara   File "/usr/lib/python2.7/site-packages/sahara/plugins/base.py", line 111, in _load_cluster_plugins
2016-08-19 17:11:18.916 9258 ERROR sahara     ", ".join(requested_plugins - loaded_plugins))
2016-08-19 17:11:18.916 9258 ERROR sahara ConfigurationError: Plugins couldn't be loaded: hdp
2016-08-19 17:11:18.916 9258 ERROR sahara Error ID: a4a4e95a-6384-4af5-95
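The traceback means the plugins list in /etc/sahara/sahara.conf still names the removed hdp plugin. A minimal sketch of the cleanup (pure string handling; the config path and option name match upstream Sahara defaults, and strip_hdp is a made-up helper, not part of any TripleO tooling):

```shell
# Drop the removed 'hdp' entry from a comma-separated plugins value,
# e.g. the DEFAULT/plugins option in /etc/sahara/sahara.conf.
strip_hdp() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -v '^hdp$' | paste -sd, -
}

# One would then write the cleaned value back, for instance with crudini:
#   crudini --set /etc/sahara/sahara.conf DEFAULT plugins "$(strip_hdp "$old")"
```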
Even if it is fixed in the template, you still have a problem. The template value, if I'm not mistaken, is only applied when Sahara is managed by TripleO, which means you need to pass something like -e environments/services/sahara.yaml. BUT when you no longer want Sahara after the upgrade, the packages are not removed; they are upgraded to the latest version, which then tries to run with the old configuration => BOOM. That is the bug described here. So on upgrade we probably also need to remove the Sahara packages and configuration by default when they are not needed (i.e. when sahara.yaml is not specified). Or re-enable it by default. Or...?
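The decision described above ("did the operator pass sahara.yaml to the deploy?") can be sketched like this; the function name is invented for illustration and this is not actual upgrade tooling:

```shell
# Return 0 if the operator opted back into Sahara by passing the
# services/sahara.yaml environment file to `openstack overcloud deploy`.
sahara_requested() {
  for env_file in "$@"; do
    case "$env_file" in
      */environments/services/sahara.yaml) return 0 ;;
    esac
  done
  return 1
}

# An upgrade step could then remove the unmanaged packages when it
# returns 1 (package names assumed from RDO naming):
#   sahara_requested "$@" || yum -y remove 'openstack-sahara*'
```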
I was trying the suggested w/a by Tosky:

(1) added -e sahara.yaml:

[stack@undercloud72 ~]$ sudo find / -name sahara.yaml
/usr/share/openstack-tripleo-heat-templates/environments/services/sahara.yaml
[stack@undercloud72 ~]$ openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server 10.5.26.10 --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services/sahara.yaml

(2) Applied that patch: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=0a3cd4dec3dcae5f8bc94e73436c2c76069762f1

Still got:

deploy_status_code: Deployment exited with non-zero status code: 1
Stack overcloud UPDATE_FAILED
Heat Stack update failed.
(In reply to Omri Hochman from comment #5)
> I was trying the suggested w/a by Tosky

It looks like the sahara service is running after applying the patch; gnocchi is down.

2016-09-22 14:10:57.320 19813 ERROR gnocchi.cli DBError: (pymysql.err.InternalError) (1054, u"Unknown column 'metric.unit' in 'field list'") [SQL: u'SELECT metric.id AS metric_id, metric.archive_policy_name AS metric_archive_policy_name, metric.created_by_user_id AS metric_created_by_user_id, metric.created_by_project_id AS metric_created_by_project_id, metric.resource_id AS metric_resource_id, metric.name AS metric_name, metric.unit AS metric_unit, metric.status AS metric_status, archive_policy_1.name AS archive_policy_1_name, archive_policy_1.back_window AS archive_policy_1_back_window, archive_policy_1.definition AS archive_policy_1_definition, archive_policy_1.aggregation_methods AS archive_policy_1_aggregation_methods \nFROM metric LEFT OUTER JOIN archive_policy AS archive_policy_1 ON archive_policy_1.name = metric.archive_policy_name \nWHERE metric.status = %s ORDER BY metric.id ASC'] [parameters: ('delete',)]
2016-09-22 14:10:57.320 19813 ERROR gnocchi.cli
Omri, the gnocchi bug is filed and fixed here: https://bugzilla.redhat.com/show_bug.cgi?id=1378497
Deployed RHOS 9 latest, upgraded to RHOS 10 with the latest puddle (2016-11-14.1). I no longer see this issue.

[stack@undercloud-0 ~]$ ssh heat-admin.2.10
Last login: Tue Nov 15 19:04:39 2016 from gateway
[heat-admin@controller-0 ~]$ sudo -i
[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Tue Nov 15 19:08:40 2016
Last change: Tue Nov 15 01:10:37 2016 by root via crm_resource on controller-0

3 nodes and 19 resources configured

Online: [ controller-0 controller-1 controller-2 ]

Full list of resources:

 ip-fd00.fd00.fd00.4000..10	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-192.0.2.6	(ocf::heartbeat:IPaddr2):	Started controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ controller-0 controller-1 controller-2 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ controller-0 controller-1 controller-2 ]
 ip-2620.52.0.13b8.5054.ff.fe3e.1	(ocf::heartbeat:IPaddr2):	Started controller-2
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ controller-0 controller-1 controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ controller-0 ]
     Slaves: [ controller-1 controller-2 ]
 ip-fd00.fd00.fd00.3000..10	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-fd00.fd00.fd00.2000..10	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-fd00.fd00.fd00.2000..11	(ocf::heartbeat:IPaddr2):	Started controller-2
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	Started controller-0

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
Apart from pcs, do you see the service running (IIRC it should be out of pacemaker now)? Can you contact it with something simple like `openstack dataprocessing plugin list`?
After checking the upgraded environment, we found that Sahara services are indeed running on all controllers (directly with systemctl, no more pacemaker), and also Sahara answers to CLI commands (starting from `openstack dataprocessing plugin list`).
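A quick way to re-run this kind of check later; pcs_has_resource is a made-up helper that just greps a captured `pcs status` dump, and the systemd unit names in the comment are assumptions from RDO packaging:

```shell
# Check whether a resource name appears in saved `pcs status` output;
# after the upgrade, sahara should NOT show up there anymore.
pcs_has_resource() {
  # $1: captured `pcs status` output, $2: resource name to look for
  printf '%s\n' "$1" | grep -qi "$2"
}

# On a live controller, the positive checks would be:
#   systemctl is-active openstack-sahara-api openstack-sahara-engine
#   openstack dataprocessing plugin list
```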
See above comments from Luigi. We confirmed this on 2016-11-16.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html