Description of problem:

We need to find the root cause of this issue, not just verify that the customer's procedure works.

The controllers were successfully upgraded to OSP11, but upgrade-non-controller.sh then started to fail, apparently because of changes to the hostnames. Some of them got a "t" in front of %index and a "plo-" prefix, so xxxxxxxx-controller-0 became plo-xxxxxxxxxxx-controller-t0:

Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.2.22 xxxxxxxxxx-compute-1.localdomain xxxxxxxxxx-compute-1
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.1.16 xxxxxxxxxx-compute-1.external.localdomain xxxxxxxxxx-compute-1.external
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.2.22 xxxxxxxxxx-compute-1.internalapi.localdomain xxxxxxxxxx-compute-1.internalapi
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.4.22 xxxxxxxxxx-compute-1.storage.localdomain xxxxxxxxxx-compute-1.storage
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.1.16 xxxxxxxxxx-compute-1.storagemgmt.localdomain xxxxxxxxxx-compute-1.storagemgmt
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.3.22 xxxxxxxxxx-compute-1.tenant.localdomain xxxxxxxxxx-compute-1.tenant
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.1.16 xxxxxxxxxx-compute-1.management.localdomain xxxxxxxxxx-compute-1.management
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.1.16 xxxxxxxxxx-compute-1.ctlplane.localdomain xxxxxxxxxx-compute-1.ctlplane
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: INFO: Updating hosts file /etc/cloud/templates/hosts.redhat.tmpl, check below for changes
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: 32,93c32,75
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.2.11 plo-xxxxxxxxxx-controller-t0.localdomain plo-xxxxxxxxxx-controller-t0
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 298.251.12.21 plo-xxxxxxxxxx-controller-t0.external.localdomain plo-xxxxxxxxxx-controller-t0.external
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.2.11 plo-xxxxxxxxxx-controller-t0.internalapi.localdomain plo-xxxxxxxxxx-controller-t0.internalapi
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.4.11 plo-xxxxxxxxxx-controller-t0.storage.localdomain plo-xxxxxxxxxx-controller-t0.storage
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.30.0.111 plo-xxxxxxxxxx-controller-t0.storagemgmt.localdomain plo-xxxxxxxxxx-controller-t0.storagemgmt
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.1.23 plo-xxxxxxxxxx-controller-t0.tenant.localdomain plo-xxxxxxxxxx-controller-t0.tenant
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.1.23 plo-xxxxxxxxxx-controller-t0.management.localdomain plo-xxxxxxxxxx-controller-t0.management
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.1.23 plo-xxxxxxxxxx-controller-t0.ctlplane.localdomain plo-xxxxxxxxxx-controller-t0.ctlplane

Now the final overcloud upgrade steps to OSP11 are failing because the incorrect hostnames are used everywhere, for example:

Failed to call refresh: /sbin/pcs cluster auth xxxxxxxxx-controller-0 xxxxxxx-controller-1 xxxxxxxxxx-controller-5 -u hacluster -p YYYYYYYYYYYYYY --force returned 1 instead of one of [0]
Error: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: /sbin/pcs cluster auth xxxxxxxxxx-controller-0 xxxxxxx-controller-1 xxxxxxxx-controller-5 -u hacluster -p YYYYYYYYYYYYYYYY --force returned 1 instead of one of [0]
"deploy_status_code": 6
(extracted from debug2.txt)

As you can see, the old hostnames are still used, so the nodes can't be reached.

Another thing that is not working correctly is the galera cluster: the wsrep string contains the wrong hostnames.

Another thing I have never seen before is this error:

$ openstack stack failures list --long testovercloud
ERROR: The specified reference "NetworkerAllNodesValidationDeployment" (in AllNodesExtraConfig) is incorrect.

~~~ heat-resource-list-output.txt (12 KB) / heat --debug resource-list pocjn output:

< snip >
"The server could not comply with the request since it is either malformed or otherwise incorrect.", "code": 400,
< snip >
InvalidTemplateReference: The specified reference \"NetworkerAllNodesValidationDeployment\" (in AllNodesExtraConfig) is incorrect.\n", "type": "InvalidTemplateReference"}, "title": "Bad Request"}

Another thing we wonder is whether this is still supported, since we have new overcloud parameters:

# cat templates/tunning-usage.yaml
ControllerExtraConfig:
  ceilometer::metering_time_to_live: 604800
  ceilometer::event_time_to_live: 604800
  nova::network::neutron::neutron_url_timeout: '60'
  neutron::plugins::ml2::path_mtu: 1550
NovaComputeExtraConfig:
  neutron::plugins::ml2::path_mtu: 1550
NetworkerExtraConfig:
  neutron::plugins::ml2::path_mtu: 1550
ExtraConfig:
  neutron::plugins::ml2::path_mtu: 1550

Version-Release number of selected component (if applicable):
OSP 11

How reproducible:
Always

Steps to Reproduce:
1. Upgrade from OSP10 to OSP11

Actual results:
Hostnames changed at some point; deployment unable to complete

Expected results:
Deployment finishes successfully
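The hostname drift above can be spotted mechanically. A minimal sketch (all names are the masked placeholders from the logs, and the regex is an assumption based on the two naming forms seen there) scans hosts-file lines and flags entries still carrying the pre-upgrade naming, i.e. a plain digit index with no "t" in front of it:

```python
import re

# Pre-upgrade names end in "controller-<digit>"; the renamed nodes use
# "controller-t<digit>" (plus a "plo-" prefix), so a digit directly after
# "controller-" marks a stale entry. This pattern is an assumption derived
# from the two forms seen in the log above.
OLD_STYLE = re.compile(r"controller-\d+(\.|$)")

# Sample hosts-file lines modeled on the report; real input would be /etc/hosts.
hosts_lines = [
    "10.10.2.11 plo-xxxxxxxxxx-controller-t0.localdomain plo-xxxxxxxxxx-controller-t0",
    "10.10.2.22 xxxxxxxxxx-controller-0.localdomain xxxxxxxxxx-controller-0",
]

def stale_entries(lines):
    """Return hosts-file lines whose names still use the old naming scheme."""
    return [line for line in lines
            if any(OLD_STYLE.search(name) for name in line.split()[1:])]

for line in stale_entries(hosts_lines):
    print("stale:", line)
# -> stale: 10.10.2.22 xxxxxxxxxx-controller-0.localdomain xxxxxxxxxx-controller-0
```

Running something like this against /etc/hosts on each node would show which entries the pcs cluster auth call and the galera wsrep string are still resolving through the old names.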
Created attachment 1304171 [details] last errors the customer got
Here's the traceback:

RESP BODY: {"explanation": "The server could not comply with the request since it is either malformed or otherwise incorrect.", "code": 400, "error": {"message": "The specified reference \"NetworkerAllNodesValidationDeployment\" (in AllNodesExtraConfig) is incorrect.", "traceback": "...", "type": "InvalidTemplateReference"}, "title": "Bad Request"}

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 407, in wrapped
    return func(self, ctx, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 2001, in list_stack_resources
    for resource in rsrcs]
  File "/usr/lib/python2.7/site-packages/heat/engine/api.py", line 345, in format_stack_resource
    rpc_api.RES_REQUIRED_BY: resource.required_by(),
  File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 665, in required_by
    return [r.name for r in self.stack.dependencies.required_by(self)]
  File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 403, in dependencies
    ignore_errors=self.id is not None)
  File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 477, in _get_dependencies
    res.add_explicit_dependencies(deps)
  File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 639, in add_explicit_dependencies
    for dep in self.t.dependencies(self.stack):
  File "/usr/lib/python2.7/site-packages/heat/engine/rsrc_defn.py", line 238, in dependencies
    filter(None, (get_resource(dep) for dep in explicit_depends)),
  File "/usr/lib/python2.7/site-packages/heat/engine/rsrc_defn.py", line 238, in <genexpr>
    filter(None, (get_resource(dep) for dep in explicit_depends)),
  File "/usr/lib/python2.7/site-packages/heat/engine/rsrc_defn.py", line 215, in get_resource
    key=self.name)
InvalidTemplateReference: The specified reference "NetworkerAllNodesValidationDeployment" (in AllNodesExtraConfig) is incorrect.

So it's failing to calculate the dependency graph (that's bad: without it you can't do an update, and this is not something that's ever supposed to get out of sync). Looking at the templates:

  AllNodesExtraConfig:
    type: OS::TripleO::AllNodesExtraConfig
    depends_on:
      - UpdateWorkflow
{% for role in roles %}
      - {{role.name}}AllNodesValidationDeployment
{% endfor %}

This is the depends_on we're failing to resolve. Somehow the template got stored with depends_on NetworkerAllNodesValidationDeployment in the AllNodesExtraConfig resource, but without actually having a NetworkerAllNodesValidationDeployment resource in the template. This shouldn't be possible even if an update fails part-way through: because of the dependency relationship, if a role is added then the {{role.name}}AllNodesValidationDeployment should be copied into the template before the new AllNodesExtraConfig definition (with the new depends_on) is copied in, and vice versa if a role is deleted.

There's no heat-engine log in the sosreport, so we don't have much information to go on to figure out what happened.
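The failure mode can be modelled in a few lines. This is a simplified sketch, not heat's actual implementation: depends_on names are resolved against the template's resource map, and a name with no matching resource raises the equivalent of InvalidTemplateReference before any graph can be built, which is why even read-only operations like resource-list fail.

```python
# Simplified model of heat's explicit-dependency resolution (not heat code).
class InvalidTemplateReference(Exception):
    pass

def build_dependencies(resources):
    """resources: {name: [depends_on names]} -> {name: set of resolved deps}.

    Every depends_on target must itself be a resource in the template;
    a dangling reference aborts graph construction entirely.
    """
    deps = {}
    for name, depends_on in resources.items():
        for dep in depends_on:
            if dep not in resources:
                raise InvalidTemplateReference(
                    'The specified reference "%s" (in %s) is incorrect.'
                    % (dep, name))
        deps[name] = set(depends_on)
    return deps

# The stored template references a role deployment that is no longer present:
template = {
    "UpdateWorkflow": [],
    "ControllerAllNodesValidationDeployment": [],
    "AllNodesExtraConfig": ["UpdateWorkflow",
                            "ControllerAllNodesValidationDeployment",
                            "NetworkerAllNodesValidationDeployment"],
}

try:
    build_dependencies(template)
except InvalidTemplateReference as exc:
    print(exc)
# -> The specified reference "NetworkerAllNodesValidationDeployment" (in AllNodesExtraConfig) is incorrect.
```

In this model, the only ways to end up with the dangling name are storing the new AllNodesExtraConfig definition before the role's deployment resource, or deleting the deployment resource without rewriting AllNodesExtraConfig, which matches the "shouldn't be possible" ordering argument above.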
Created attachment 1308006 [details] new errors