Bug 1461350 - unable to finish upgrade to OSP11, hostnames changed at some point of the deployment [NEEDINFO]
Status: CLOSED NOTABUG
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assigned To: Carlos Camacho
QA Contact: Arik Chernetsky
Keywords: Triaged
Depends On:
Blocks:
Reported: 2017-06-14 05:35 EDT by Eduard Barrera
Modified: 2017-07-24 09:26 EDT (History)
6 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-24 09:26:38 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
ccamacho: needinfo? (ebarrera)


Attachments: None
Description Eduard Barrera 2017-06-14 05:35:13 EDT
Description of problem:

After successfully upgrading the controllers to OSP11, things started to go bad once upgrade-non-controller.sh ran, apparently because of changes to the hostnames.

Some of them got a "t" in front of %index% and a "plo-" prefix, so xxxxxxxx-controller-0 became plo-xxxxxxxxxxx-controller-t0.
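For illustration, the renamed hosts match a hostname format of 'plo-%stackname%-controller-t%index%'. A minimal shell sketch of how such placeholders expand (the sed-based expansion is illustrative only, not the actual tripleo-heat-templates code):

```shell
# Sketch: expand TripleO-style hostname format placeholders.
# The format string below matches the renamed hosts seen in the logs;
# the sed substitution is a stand-in for the real template rendering.
format='plo-%stackname%-controller-t%index%'
stackname='overcloud'
index=0
hostname=$(printf '%s' "$format" | sed -e "s/%stackname%/$stackname/" -e "s/%index%/$index/")
echo "$hostname"
```

With the default format ('%stackname%-controller-%index%') the same node would instead render as overcloud-controller-0, which is why the cluster config still refers to the old names.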

 
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.2.22 xxxxxxxxxx-compute-1.localdomain xxxxxxxxxx-compute-1
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.1.16 xxxxxxxxxx-compute-1.external.localdomain xxxxxxxxxx-compute-1.external
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.2.22 xxxxxxxxxx-compute-1.internalapi.localdomain xxxxxxxxxx-compute-1.internalapi
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.4.22 xxxxxxxxxx-compute-1.storage.localdomain xxxxxxxxxx-compute-1.storage
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.1.16 xxxxxxxxxx-compute-1.storagemgmt.localdomain xxxxxxxxxx-compute-1.storagemgmt
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.3.22 xxxxxxxxxx-compute-1.tenant.localdomain xxxxxxxxxx-compute-1.tenant
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.1.16 xxxxxxxxxx-compute-1.management.localdomain xxxxxxxxxx-compute-1.management
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: > 10.10.1.16 xxxxxxxxxx-compute-1.ctlplane.localdomain xxxxxxxxxx-compute-1.ctlplane
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: INFO: Updating hosts file /etc/cloud/templates/hosts.redhat.tmpl, check below for changes
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: 32,93c32,75
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.2.11 plo-xxxxxxxxxx-controller-t0.localdomain plo-xxxxxxxxxx-controller-t0
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 298.251.12.21 plo-xxxxxxxxxx-controller-t0.external.localdomain plo-xxxxxxxxxx-controller-t0.external
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.2.11 plo-xxxxxxxxxx-controller-t0.internalapi.localdomain plo-xxxxxxxxxx-controller-t0.internalapi
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.4.11 plo-xxxxxxxxxx-controller-t0.storage.localdomain plo-xxxxxxxxxx-controller-t0.storage
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.30.0.111 plo-xxxxxxxxxx-controller-t0.storagemgmt.localdomain plo-xxxxxxxxxx-controller-t0.storagemgmt
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.1.23 plo-xxxxxxxxxx-controller-t0.tenant.localdomain plo-xxxxxxxxxx-controller-t0.tenant
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.1.23 plo-xxxxxxxxxx-controller-t0.management.localdomain plo-xxxxxxxxxx-controller-t0.management
Jun 01 11:41:24 plo-xxxxxxxxxx-controller-t0.localdomain os-collect-config[3148]: < 10.10.1.23 plo-xxxxxxxxxx-controller-t0.ctlplane.localdomain plo-xxxxxxxxxx-controller-t0.ctlplane

The final overcloud upgrade steps to OSP11 are now failing because incorrect hostnames are used everywhere, for example:

Failed to call refresh: /sbin/pcs cluster auth xxxxxxxxx-controller-0 xxxxxxx-controller-1 xxxxxxxxxx-controller-5 -u hacluster -p YYYYYYYYYYYYYY --force returned 1 instead of one of [0]\u001b[0m\n\u001b[1;31mError: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: /sbin/pcs cluster auth xxxxxxxxxx-controller-0 xxxxxxx-controller-1 xxxxxxxx-controller-5 -u hacluster -p YYYYYYYYYYYYYYYY --force returned 1 instead of one of [0]\u001b[0m\n", 
    "deploy_status_code": 6
(extracted from debug2.txt)

As you can see, the old hostnames are used, so the nodes can't be reached. The Galera cluster is also not working correctly: the wsrep string contains the wrong hostnames.


Another thing I have never seen before is this error:

$ openstack stack failures list --long testovercloud
ERROR: The specified reference "NetworkerAllNodesValidationDeployment" (in AllNodesExtraConfig) is incorrect.

~~~ heat-resource-list-output.txt (12 KB) / heat --debug resource-list pocjn output: < snip > "The server could not comply with the request since it is either malformed or otherwise incorrect.", "code": 400, < snip > InvalidTemplateReference: The specified reference \"NetworkerAllNodesValidationDeployment\" (in AllNodesExtraConfig) is incorrect.\n", "type": "InvalidTemplateReference"}, "title": "Bad Request"}

We also wonder whether the following is still supported, since we have new overcloud parameters:


# cat templates/tunning-usage.yaml
  ControllerExtraConfig:
      ceilometer::metering_time_to_live: 604800
      ceilometer::event_time_to_live: 604800
      nova::network::neutron::neutron_url_timeout: '60'
      neutron::plugins::ml2::path_mtu: 1550
  NovaComputeExtraConfig:
      neutron::plugins::ml2::path_mtu: 1550
  NetworkerExtraConfig:
      neutron::plugins::ml2::path_mtu: 1550
  ExtraConfig:
      neutron::plugins::ml2::path_mtu: 1550


Version-Release number of selected component (if applicable):
OSP 11

How reproducible:
Always

Steps to Reproduce:
1. Upgrade from OSP10 to OSP11

Actual results:
Hostnames changed at some point; the deployment is unable to complete.

Expected results:
Deployment finishes successfully.

Additional info:
Comment 3 Carlos Camacho 2017-06-19 08:47:40 EDT
Hello,

Can you provide more information about your environment? We can see you have this Heat hook:

NetworkerAllNodesValidationDeployment

Which means you should have a "Networker" role.

Is this right?

Can you provide info about your environment so we can reproduce this?


Thanks.
Comment 5 Carlos Camacho 2017-06-26 09:05:39 EDT
Hey Irina,

Sure, I ran an upgrade from OSP10 to OSP11 in my environment and wasn't able to reproduce this behavior on my controllers.

So, the thing is that the hostname schema is defined in roles_data.yaml, and this should not happen.

Also, IIRC, changing HostnameFormatDefault is not supported.
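For reference, a sketch of what the stock Controller role entry looks like (abridged; the default format string matches the '# defaults to ...' comment visible in the grep output later in this bug):

```yaml
# Sketch of the stock Controller role entry in roles_data.yaml (abridged).
# The default hostname format is '%stackname%-{{role.name.lower()}}-%index%',
# which for this role renders as '%stackname%-controller-%index%'.
# Overriding this key is the unsupported change being discussed.
- name: Controller
  HostnameFormatDefault: '%stackname%-controller-%index%'
```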

In the case of the Compute role we have:
https://code.engineering.redhat.com/gerrit/#/c/98491/
https://code.engineering.redhat.com/gerrit/#/c/86801/


So, can you tell me how you have configured your roles data file?
Comment 7 Carlos Camacho 2017-07-05 10:27:58 EDT
Hey Irina, I tried hard to reproduce this without much luck.

So, in my case I deployed my overcloud with the hostname change applied at the beginning of the deployment, then ran the upgrade from 10 to 11, and it went fine.

So, the question is: did you change the hostnames after the deployment?

If so, I'm afraid there may be some leftovers from the previous hostname format.

Have you done that?

If so, I'll check upstream for a fix and then the backport.

Thanks.
Comment 8 Carlos Camacho 2017-07-10 08:00:18 EDT
Hey Irina, this is what I executed in order to upgrade my env from OSP10 to OSP11 with the hostname change.

The process seems to work fine; you can check and reproduce my steps here:

https://gist.github.com/ccamacho/0018a2746f9caab0ec239775ebfbc99c


Please if you need further checks just let me know.

Thanks,
Carlos.
Comment 10 Carlos Camacho 2017-07-24 09:25:00 EDT
# Hey, related to what you stated the user did...

$ grep index * -R
roles_data.yaml:# defaults to '%stackname%-{{role.name.lower()}}-%index%'
roles_data.yaml:  HostnameFormatDefault: 'plo-%stackname%-controller-t%index%'
roles_data.yaml:  HostnameFormatDefault: 'plo-%stackname%-compute-t%index%'
roles_data.yaml:  HostnameFormatDefault: 'plo-%stackname%-network-t%index%'
scheduler_hints_env.yaml:    'capabilities:node': 'controller-%index%'
scheduler_hints_env.yaml:    'capabilities:node': 'computer-%index%'
scheduler_hints_env.yaml:    'capabilities:node': 'networker-%index%'

What you have in the lines above is actually WRONG!
######################################################
### YOU SHOULD NOT MODIFY THE roles_data.yaml FILE ###
######################################################

# So, please try to follow the instructions at https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/director_installation_and_usage/sect-controlling_node_placement and you will succeed for sure...


#############################################################
# I'll show you in my local tests...

# So, let's do this again, following accordingly the documentation...
  
[stack@undercloud ~]$ ironic node-list
+--------------------------------------+-----------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name      | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+-----------+---------------+-------------+--------------------+-------------+
| 9fb4afe9-13b8-44cd-881b-4c0aa8406821 | control-0 | None          | power off   | available          | False       |
| 40df682b-b8d8-4157-b2b7-b852b1da24d5 | compute-0 | None          | power off   | available          | False       |
+--------------------------------------+-----------+---------------+-------------+--------------------+-------------+


# Let's see what we have here...
# By default, the director randomly selects nodes for each role
# based on their profile tags...
# But you can do custom placement...
# So, what we will do is customize/define the node placement...

[stack@undercloud ~]$ ironic node-update 9fb4afe9-13b8-44cd-881b-4c0aa8406821 replace properties/capabilities='node:control-0,boot_option:local'
[stack@undercloud ~]$ ironic node-update 40df682b-b8d8-4157-b2b7-b852b1da24d5 replace properties/capabilities='node:compute-0,boot_option:local'

[stack@undercloud ~]$ ironic node-show control-0
.
.
| properties             | {u'memory_mb': u'8192', u'cpu_arch': u'x86_64', u'local_gb': u'49',      |
|                        | u'cpus': u'2', u'capabilities': u'node:control-0,boot_option:local'}     |
.
.
# Now we have the property configured correctly.

# And we will update the scheduler hints env file...

[stack@undercloud ~]$ cat scheduler_hints_env.yaml
parameter_defaults:
  ControllerSchedulerHints:
    'capabilities:node': 'control-%index%'
  NovaComputeSchedulerHints:
    'capabilities:node': 'compute-%index%'
  HostnameMap:
    overcloud-controller-0: plo-overcloud-controller-t0
    overcloud-compute-0: plo-overcloud-compute-t0

# I think your error probably comes from changing the roles_data.yaml file...
# overcloud-controller-0 matches the default HostnameFormatDefault parameter,
# and then with HostnameMap you explicitly change the name to whatever you want...

# And we will deploy/upgrade as usual...

# If you do this, nothing bad should happen to your deployment/upgrade...

[stack@undercloud ~]$ openstack overcloud deploy \
--libvirt-type qemu \
--ntp-server pool.ntp.org \
--templates /usr/share/openstack-tripleo-heat-templates \
-e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
-e /home/stack/scheduler_hints_env.yaml

Removing the current plan files
Uploading new plan files
Started Mistral Workflow. Execution ID: 4a364705-298a-469c-a3f9-9a7e4fd53166
Plan updated
Deploying templates in the directory /tmp/tripleoclient-HjvNl8/tripleo-heat-templates
Started Mistral Workflow. Execution ID: 6552a648-4da4-4a19-8d86-2350934915f1

.
.
.

[stack@undercloud ~]$ nova list
+--------------------------------------+-----------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name                        | Status | Task State | Power State | Networks              |
+--------------------------------------+-----------------------------+--------+------------+-------------+-----------------------+
| 8e1bd885-6789-4a47-941c-3ad918fabab0 | plo-overcloud-compute-t0    | ACTIVE | -          | Running     | ctlplane=192.168.24.7 |
| c56e40d3-bed3-4b57-bb65-ac5bc9cca3ef | plo-overcloud-controller-t0 | ACTIVE | -          | Running     | ctlplane=192.168.24.6 |
+--------------------------------------+-----------------------------+--------+------------+-------------+-----------------------+

[stack@undercloud ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+--------------------+----------------------+--------------+
| id                                   | stack_name | stack_status       | creation_time        | updated_time |
+--------------------------------------+------------+--------------------+----------------------+--------------+
| d1477214-1e95-466d-8fee-60aa742e6f70 | overcloud  | CREATE_IN_PROGRESS | 2017-07-24T10:30:26Z | None         |
+--------------------------------------+------------+--------------------+----------------------+--------------+


.
.
.
.
2017-07-24 11:04:22Z [overcloud.AllNodesDeploySteps.ControllerPostPuppet]: CREATE_COMPLETE  state changed
2017-07-24 11:04:22Z [overcloud.AllNodesDeploySteps]: CREATE_COMPLETE  Stack CREATE completed successfully
2017-07-24 11:04:23Z [overcloud.AllNodesDeploySteps]: CREATE_COMPLETE  state changed
2017-07-24 11:04:23Z [overcloud]: CREATE_COMPLETE  Stack CREATE completed successfully

 Stack overcloud CREATE_COMPLETE 

Overcloud Endpoint: http://192.168.24.9:5000/v2.0
Overcloud Deployed


[stack@undercloud ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+-----------------+----------------------+--------------+
| id                                   | stack_name | stack_status    | creation_time        | updated_time |
+--------------------------------------+------------+-----------------+----------------------+--------------+
| d1477214-1e95-466d-8fee-60aa742e6f70 | overcloud  | CREATE_COMPLETE | 2017-07-24T10:30:26Z | None         |
+--------------------------------------+------------+-----------------+----------------------+--------------+
##
## OVERCLOUD INSTALLED CORRECTLY.. WITH THE NEW HOSTNAMES...
##
# Bleh bleh, now we need to upgrade our undercloud...


# Upgrade your Undercloud
# You need to run some previous checks/tasks first; please follow the docs...


[stack@undercloud ~]$ openstack undercloud upgrade
.
.
2017-07-24 07:48:10,632 INFO: Not creating flavor "swift-storage" because it already exists.
2017-07-24 07:48:11,280 INFO: Not creating default plan "overcloud" because it already exists.
2017-07-24 07:48:12,621 INFO: 
#############################################################################
Undercloud upgrade complete.

The file containing this installation's passwords is at
/home/stack/undercloud-passwords.conf.

There is also a stackrc file at /home/stack/stackrc.

These files are needed to interact with the OpenStack services, and should be
secured.

#############################################################################



####
#### Overcloud Upgrade.
####




[stack@undercloud ~]$ /bin/bash -c "cat <<EOF>>$HOME/init-repo.yaml
parameter_defaults:
  UpgradeInitCommand: |
    set -e
    yum localinstall -y http://download.lab.bos.redhat.com/rcm-guest/puddles/OpenStack/rhos-release/rhos-release-latest.noarch.rpm
    rhos-release 11
    yum-config-manager --disable 'rhelosp-10.0*'
    yum clean all
EOF"

# Check that your controller can connect to the Internet and that DNS is working properly...



# Be sure you have your nodes registered correctly!
[heat-admin@plo-overcloud-controller-t0 ~]$ sudo subscription-manager register --username xxx --password xxx --auto-attach
[heat-admin@plo-overcloud-compute-t0 ~]$ sudo subscription-manager register --username xxx --password xxx --auto-attach

# Install CDN redhat cert
wget https://www.dropbox.com/s/8z2z3eg3jt7ziz1/cdn.redhat.com.crt
sudo cp cdn.redhat.com.crt /etc/pki/ca-trust/source/anchors/
sudo update-ca-trust extract

# Upgrade command... 
[stack@undercloud ~]$ openstack overcloud deploy \
--libvirt-type qemu \
--ntp-server pool.ntp.org \
--templates /usr/share/openstack-tripleo-heat-templates \
-e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
-e /home/stack/scheduler_hints_env.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps.yaml \
-e init-repo.yaml

Removing the current plan files
Uploading new plan files
Started Mistral Workflow tripleo.plan_management.v1.update_deployment_plan. Execution ID: b72ac40b-1a34-405d-9bf7-72f93b640e11
Plan updated
Deploying templates in the directory /tmp/tripleoclient-IdnWc_/tripleo-heat-templates
Started Mistral Workflow tripleo.deployment.v1.deploy_plan. Execution ID: dca54311-6be2-4dc7-ba15-3d6b50e839b6
2017-07-24 12:25:05Z [Networks]: UPDATE_IN_PROGRESS  state changed
2017-07-24 12:25:05Z [ServiceNetMap]: UPDATE_IN_PROGRESS  state changed
.
.
.


 2017-07-24 13:18:39Z [overcloud-AllNodesDeploySteps-cry7iioinxkm.AllNodesPostUpgradeSteps.CephStoragePostConfig]: CREATE_COMPLETE  state changed
2017-07-24 13:18:39Z [overcloud-AllNodesDeploySteps-cry7iioinxkm.AllNodesPostUpgradeSteps.BlockStoragePostConfig]: CREATE_COMPLETE  state changed
2017-07-24 13:20:50Z [overcloud-AllNodesDeploySteps-cry7iioinxkm.AllNodesPostUpgradeSteps.ControllerPostConfig]: CREATE_COMPLETE  state changed
2017-07-24 13:20:50Z [overcloud-AllNodesDeploySteps-cry7iioinxkm.AllNodesPostUpgradeSteps]: CREATE_COMPLETE  Stack CREATE completed successfully
2017-07-24 13:20:51Z [overcloud-AllNodesDeploySteps-cry7iioinxkm.AllNodesPostUpgradeSteps]: CREATE_COMPLETE  state changed
2017-07-24 13:20:59Z [overcloud-AllNodesDeploySteps-cry7iioinxkm]: UPDATE_COMPLETE  Stack UPDATE completed successfully
2017-07-24 13:21:00Z [AllNodesDeploySteps]: UPDATE_COMPLETE  state changed
2017-07-24 13:21:11Z [overcloud]: UPDATE_COMPLETE  Stack UPDATE completed successfully

 Stack overcloud UPDATE_COMPLETE 

Overcloud Endpoint: http://192.168.24.9:5000/v2.0
Overcloud Deployed
[stack@undercloud ~]$ 




So, this actually works.
Comment 11 Carlos Camacho 2017-07-24 09:26:38 EDT
Closing as this is working properly.
