Description of problem:

We need to find the root cause of this issue, not just verify that the customer's procedure works.

The controllers were successfully upgraded to OSP11, but upgrade-non-controller.sh then started to fail, apparently because of changes to the hostnames. Some of them got a "t" in front of %index and a "plo-" prefix, so xxxxxxxx-controller-0 became plo-xxxxxxxxxxx-controller-t0:

Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.2.22 xxxxxxxxxx-compute-1.localdomain xxxxxxxxxx-compute-1
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.1.16 xxxxxxxxxx-compute-1.external.localdomain xxxxxxxxxx-compute-1.external
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.2.22 xxxxxxxxxx-compute-1.internalapi.localdomain xxxxxxxxxx-compute-1.internalapi
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.4.22 xxxxxxxxxx-compute-1.storage.localdomain xxxxxxxxxx-compute-1.storage
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.1.16 xxxxxxxxxx-compute-1.storagemgmt.localdomain xxxxxxxxxx-compute-1.storagemgmt
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.3.22 xxxxxxxxxx-compute-1.tenant.localdomain xxxxxxxxxx-compute-1.tenant
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.1.16 xxxxxxxxxx-compute-1.management.localdomain xxxxxxxxxx-compute-1.management
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.1.16 xxxxxxxxxx-compute-1.ctlplane.localdomain xxxxxxxxxx-compute-1.ctlplane
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: INFO: Updating hosts file /etc/cloud/templates/hosts.redhat.tmpl, check below for changes
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: 32,93c32,75
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.2.11 plo-xxxxxxxxxx-controller-t0.localdomain plo-xxxxxxxxxx-controller-t0
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 298.251.12.21 plo-xxxxxxxxxx-controller-t0.external.localdomain plo-xxxxxxxxxx-controller-t0.external
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.2.11 plo-xxxxxxxxxx-controller-t0.internalapi.localdomain plo-xxxxxxxxxx-controller-t0.internalapi
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.4.11 plo-xxxxxxxxxx-controller-t0.storage.localdomain plo-xxxxxxxxxx-controller-t0.storage
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.30.0.111 plo-xxxxxxxxxx-controller-t0.storagemgmt.localdomain plo-xxxxxxxxxx-controller-t0.storagemgmt
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.1.23 plo-xxxxxxxxxx-controller-t0.tenant.localdomain plo-xxxxxxxxxx-controller-t0.tenant
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.1.23 plo-xxxxxxxxxx-controller-t0.management.localdomain plo-xxxxxxxxxx-controller-t0.management
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.1.23 plo-xxxxxxxxxx-controller-t0.ctlplane.localdomain plo-xxxxxxxxxx-controller-t0.ctlplane

Now the final overcloud upgrade steps to OSP11 are failing because the incorrect hostnames are used everywhere, for example:

Failed to call refresh: /sbin/pcs cluster auth xxxxxxxxx-controller-0 xxxxxxx-controller-1 xxxxxxxxxx-controller-5 -u hacluster -p YYYYYYYYYYYYYY --force returned 1 instead of one of [0]
Error: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: /sbin/pcs cluster auth xxxxxxxxxx-controller-0 xxxxxxx-controller-1 xxxxxxxx-controller-5 -u hacluster -p YYYYYYYYYYYYYYYY --force returned 1 instead of one of [0]
"deploy_status_code": 6
(extracted from debug2.txt)

As you can see, the old hostnames are still used, so the nodes can't be reached.

Another thing that is not working correctly is the galera cluster: the wsrep string contains the wrong hostnames.

Another thing I have never seen before is this error:

$ openstack stack failures list --long testovercloud
ERROR: The specified reference "NetworkerAllNodesValidationDeployment" (in AllNodesExtraConfig) is incorrect.

~~~ heat-resource-list-output.txt (12 KB) / heat --debug resource-list pocjn output:

< snip >
"The server could not comply with the request since it is either malformed or otherwise incorrect.", "code": 400,
< snip >
InvalidTemplateReference: The specified reference \"NetworkerAllNodesValidationDeployment\" (in AllNodesExtraConfig) is incorrect.\n", "type": "InvalidTemplateReference"}, "title": "Bad Request"}

Another thing we wonder is whether this is still supported, since we have new overcloud parameters:

# cat templates/tunning-usage.yaml
ControllerExtraConfig:
  ceilometer::metering_time_to_live: 604800
  ceilometer::event_time_to_live: 604800
  nova::network::neutron::neutron_url_timeout: '60'
  neutron::plugins::ml2::path_mtu: 1550
NovaComputeExtraConfig:
  neutron::plugins::ml2::path_mtu: 1550
NetworkerExtraConfig:
  neutron::plugins::ml2::path_mtu: 1550
ExtraConfig:
  neutron::plugins::ml2::path_mtu: 1550

Version-Release number of selected component (if applicable):
OSP 11

How reproducible:
Always

Steps to Reproduce:
1. Upgrade from OSP10 to OSP11

Actual results:
Hostnames changed at some point; deployment unable to complete

Expected results:
Deployment finishes successfully
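The hostname drift above can be spotted mechanically. A minimal sketch (all names are the masked placeholders from the logs, and the regex is an assumption based on the two naming forms seen there) scans hosts-file lines and flags entries still carrying the pre-upgrade naming, i.e. a plain digit index with no "t" in front of it:

```python
import re

# Pre-upgrade names end in "controller-<digit>"; the renamed nodes use
# "controller-t<digit>" (plus a "plo-" prefix), so a digit directly after
# "controller-" marks a stale entry. This pattern is an assumption derived
# from the two forms seen in the log above.
OLD_STYLE = re.compile(r"controller-\d+(\.|$)")

# Sample hosts-file lines modeled on the report; real input would be /etc/hosts.
hosts_lines = [
    "10.10.2.11 plo-xxxxxxxxxx-controller-t0.localdomain plo-xxxxxxxxxx-controller-t0",
    "10.10.2.22 xxxxxxxxxx-controller-0.localdomain xxxxxxxxxx-controller-0",
]

def stale_entries(lines):
    """Return hosts-file lines whose names still use the old naming scheme."""
    return [line for line in lines
            if any(OLD_STYLE.search(name) for name in line.split()[1:])]

for line in stale_entries(hosts_lines):
    print("stale:", line)
# -> stale: 10.10.2.22 xxxxxxxxxx-controller-0.localdomain xxxxxxxxxx-controller-0
```

Running something like this against /etc/hosts on each node would show which entries the pcs cluster auth call and the galera wsrep string are still resolving through the old names.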
Created attachment 1304171 [details] last errors the customer got
Here's the traceback:

RESP BODY: {"explanation": "The server could not comply with the request since it is either malformed or otherwise incorrect.", "code": 400, "error": {"message": "The specified reference \"NetworkerAllNodesValidationDeployment\" (in AllNodesExtraConfig) is incorrect.", "traceback": "...", "type": "InvalidTemplateReference"}, "title": "Bad Request"}

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 407, in wrapped
    return func(self, ctx, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 2001, in list_stack_resources
    for resource in rsrcs]
  File "/usr/lib/python2.7/site-packages/heat/engine/api.py", line 345, in format_stack_resource
    rpc_api.RES_REQUIRED_BY: resource.required_by(),
  File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 665, in required_by
    return [r.name for r in self.stack.dependencies.required_by(self)]
  File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 403, in dependencies
    ignore_errors=self.id is not None)
  File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 477, in _get_dependencies
    res.add_explicit_dependencies(deps)
  File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 639, in add_explicit_dependencies
    for dep in self.t.dependencies(self.stack):
  File "/usr/lib/python2.7/site-packages/heat/engine/rsrc_defn.py", line 238, in dependencies
    filter(None, (get_resource(dep) for dep in explicit_depends)),
  File "/usr/lib/python2.7/site-packages/heat/engine/rsrc_defn.py", line 238, in <genexpr>
    filter(None, (get_resource(dep) for dep in explicit_depends)),
  File "/usr/lib/python2.7/site-packages/heat/engine/rsrc_defn.py", line 215, in get_resource
    key=self.name)
InvalidTemplateReference: The specified reference "NetworkerAllNodesValidationDeployment" (in AllNodesExtraConfig) is incorrect.

So it's failing to calculate the dependency graph (that's bad: without it you can't do an update, and this is not something that's ever supposed to get out of sync). Looking at the templates:

  AllNodesExtraConfig:
    type: OS::TripleO::AllNodesExtraConfig
    depends_on:
      - UpdateWorkflow
{% for role in roles %}
      - {{role.name}}AllNodesValidationDeployment
{% endfor %}

This is the depends_on we're failing to resolve. Somehow the template got stored with depends_on NetworkerAllNodesValidationDeployment in the AllNodesExtraConfig resource, but without actually having a NetworkerAllNodesValidationDeployment resource in the template. This shouldn't be possible even if an update fails part-way through: because of the dependency relationship, if a role is added then the {{role.name}}AllNodesValidationDeployment should be copied into the template before the new AllNodesExtraConfig definition (with the new depends_on) is copied in, and vice versa if a role is deleted.

There's no heat-engine log in the sosreport, so we don't have much information to go on to figure out what happened.
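The failure mode can be modelled in a few lines. This is a simplified sketch, not heat's actual implementation: depends_on names are resolved against the template's resource map, and a name with no matching resource raises the equivalent of InvalidTemplateReference before any graph can be built, which is why even read-only operations like resource-list fail.

```python
# Simplified model of heat's explicit-dependency resolution (not heat code).
class InvalidTemplateReference(Exception):
    pass

def build_dependencies(resources):
    """resources: {name: [depends_on names]} -> {name: set of resolved deps}.

    Every depends_on target must itself be a resource in the template;
    a dangling reference aborts graph construction entirely.
    """
    deps = {}
    for name, depends_on in resources.items():
        for dep in depends_on:
            if dep not in resources:
                raise InvalidTemplateReference(
                    'The specified reference "%s" (in %s) is incorrect.'
                    % (dep, name))
        deps[name] = set(depends_on)
    return deps

# The stored template references a role deployment that is no longer present:
template = {
    "UpdateWorkflow": [],
    "ControllerAllNodesValidationDeployment": [],
    "AllNodesExtraConfig": ["UpdateWorkflow",
                            "ControllerAllNodesValidationDeployment",
                            "NetworkerAllNodesValidationDeployment"],
}

try:
    build_dependencies(template)
except InvalidTemplateReference as exc:
    print(exc)
# -> The specified reference "NetworkerAllNodesValidationDeployment" (in AllNodesExtraConfig) is incorrect.
```

In this model, the only ways to end up with the dangling name are storing the new AllNodesExtraConfig definition before the role's deployment resource, or deleting the deployment resource without rewriting AllNodesExtraConfig, which matches the "shouldn't be possible" ordering argument above.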
Created attachment 1308006 [details] new errors