Description of problem:

The customer has updated their overcloud images post-deployment due to the following issue: https://access.redhat.com/security/vulnerabilities/2359821

However, they are concerned that a future update of the overcloud (such as adding a node) may cause the existing overcloud nodes to be rebuilt, causing data loss. This is described in the following article under "The ugly experience with TripleO": http://alesnosek.com/blog/2016/03/27/tripleo-installer-the-good/

The customer would like to know whether this concern is valid and the best way to ensure this doesn't happen. The article suggests running the following command:
~~~
UPDATE instances SET disable_terminate = 1 WHERE uuid = '<uuid of the overcloud instance>';
~~~

I tried to reproduce this with the latest OSP 7 packages; unfortunately, I could not.

*) According to https://review.openstack.org/#/c/288273/ : "This change ensures that any property change in an OS::Nova::Server resource will not result in the server being replaced". Without that change, a change to any of the following properties should generate a complete redeploy.

/usr/share/openstack-tripleo-heat-templates/controller.yaml:
~~~
  Controller:
    type: OS::Nova::Server
    properties:
      image: {get_param: Image}
      image_update_policy: {get_param: ImageUpdatePolicy}
      flavor: {get_param: Flavor}
      key_name: {get_param: KeyName}
      networks:
        - network: ctlplane
      user_data_format: SOFTWARE_CONFIG
      user_data: {get_resource: NodeUserData}
      name: {get_param: Hostname}
~~~

With NodeUserData being defined as:
~~~
  NodeUserData:
    type: OS::TripleO::NodeUserData
~~~

*) I modified the contents of NodeUserData as well as the script's name in the resource registry file. Nothing triggered a rerun of the user data, nor a redeploy of the environment.
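Note the image_update_policy property in the Controller resource above: it is wired to the ImageUpdatePolicy template parameter. Heat's OS::Nova::Server accepts REBUILD, REBUILD_PRESERVE_EPHEMERAL or REPLACE for this property, and only REPLACE recreates the server. As a hedged sketch, not a verified customer fix (it assumes the customer's templates expose the parameter the same way as the shipped ones), the policy could be pinned explicitly in an environment file:

~~~
# Hedged example: pin the update policy so an image change triggers an
# in-place rebuild rather than a server replacement.
# REBUILD_PRESERVE_EPHEMERAL additionally asks nova to preserve the
# ephemeral disk during the rebuild.
parameter_defaults:
  ImageUpdatePolicy: 'REBUILD_PRESERVE_EPHEMERAL'
~~~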
~~~
[stack@undercloud-7 ~]$ cat templates/userdata-environment.yaml
resource_registry:
  OS::TripleO::NodeUserData: userdata2.yaml

[stack@undercloud-7 ~]$ cat templates/userdata2.yaml
heat_template_version: 2014-10-16

description: >
  Extra hostname configuration

resources:
  userdata2:
    type: OS::Heat::MultipartMime
    properties:
      parts:
      - config: {get_resource: nameserver_config2}

  nameserver_config2:
    type: OS::Heat::SoftwareConfig
    properties:
      config: |
        #!/bin/bash
        echo "nameserver 8.8.8.8" >> /etc/resolv.conf
        echo "userdata2.yaml" > /root/from-deploy.txt

outputs:
  OS::stack_id:
    value: {get_resource: userdata2}
~~~

*) I then added the following to the parameter_defaults: section of network-environment to force a change of the hostname:
~~~
  ControllerHostnameFormat: '%stackname%-ctrl-%index%'
~~~

While this changed the hostname in `nova list`, it did not have any impact on the controller itself:
~~~
+--------------------------------------+---------------------+--------+------------+-------------+---------------------+
| ID                                   | Name                | Status | Task State | Power State | Networks            |
+--------------------------------------+---------------------+--------+------------+-------------+---------------------+
| 962cec64-aa45-4773-97e0-bc80691d1955 | overcloud-compute-0 | ACTIVE | -          | Running     | ctlplane=192.0.2.20 |
| 52ce710f-4203-43f1-a98d-89831be177f5 | overcloud-ctrl-0    | ACTIVE | -          | Running     | ctlplane=192.0.2.21 |
+--------------------------------------+---------------------+--------+------------+-------------+---------------------+
~~~

All it did was make the deployment fail.
~~~
[stack@undercloud-7 ~]$ templates/deploy.sh
control_scale=1, compute_scale=1, ceph_scale=0
Deploying templates in the directory /usr/share/openstack-tripleo-heat-templates
Stack failed with status: resources.ControllerNodesPostDeployment: resources.ControllerLoadBalancerDeployment_Step1: Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
ERROR: openstack Heat Stack update failed.
~~~

~~~
+-----------------------------------------+--------------------------------------+----------------------------------------+-----------------+----------------------+-----------------------------------------+
| resource_name                           | physical_resource_id                 | resource_type                          | resource_status | updated_time         | parent_resource                         |
+-----------------------------------------+--------------------------------------+----------------------------------------+-----------------+----------------------+-----------------------------------------+
| ControllerNodesPostDeployment           | d39ede89-1177-4691-b8c2-c98866164abd | OS::TripleO::ControllerPostDeployment  | UPDATE_FAILED   | 2016-09-21T16:34:16Z |                                         |
| ControllerLoadBalancerDeployment_Step1  | a57cb984-5d6d-46f1-b361-1fafac199ca8 | OS::Heat::StructuredDeployments        | UPDATE_FAILED   | 2016-09-21T16:35:32Z | ControllerNodesPostDeployment           |
| 0                                       | 71cefc61-14cc-4085-896a-24c71a3b1c9b | OS::Heat::StructuredDeployment         | UPDATE_FAILED   | 2016-09-21T16:35:34Z | ControllerLoadBalancerDeployment_Step1  |
+-----------------------------------------+--------------------------------------+----------------------------------------+-----------------+----------------------+-----------------------------------------+
~~~

Running the update again led to the same error message.

*) This is a blocker for the customer's go-live.
*) I installed OSP 7 with the 7.3.1 overcloud images (this was a lab that I was already using for something else). I then downloaded the 7.3.2 images (all 3 of them) and updated them:

~~~
[stack@undercloud-7 ~]$ glance image-list
+--------------------------------------+----------------------------------------+-------------+------------------+------------+--------+
| ID                                   | Name                                   | Disk Format | Container Format | Size       | Status |
+--------------------------------------+----------------------------------------+-------------+------------------+------------+--------+
| ef7e2d56-639e-432e-a428-45f4e7b1dd8e | bm-deploy-kernel                       | aki         | aki              | 5027648    | active |
| 9cdd3161-af4c-4d81-b986-4d538ac04d5c | bm-deploy-ramdisk                      | ari         | ari              | 56384263   | active |
| f9380aa6-0398-4b3b-9861-fafc0ab4cb16 | overcloud-full                         | qcow2       | bare             | 1031872512 | active |
| ee1d0ab8-6948-4936-8e83-203b1347eda7 | overcloud-full-initrd                  | ari         | ari              | 40336571   | active |
| c00f99d4-e0ca-4a01-abbc-9829c8ea1e91 | overcloud-full-initrd_20160907T224835  | ari         | ari              | 40325640   | active |
| ffdc8a57-2033-4462-bce9-f085ae175541 | overcloud-full-vmlinuz                 | aki         | aki              | 5153536    | active |
| d65ba41f-0e31-4f81-85b9-cf0ba261683b | overcloud-full-vmlinuz_20160907T224832 | aki         | aki              | 5153184    | active |
| e7a3bb06-ed95-4623-b83d-c15e018b0932 | overcloud-full_20160907T224836         | qcow2       | bare             | 977790976  | active |
+--------------------------------------+----------------------------------------+-------------+------------------+------------+--------+
~~~

As you can see, the images are different. I then launched a stack update, which already got to the post-deployment stage ===> an image update did not cause heat to completely redeploy the computes.

And verifying in the database (just in case the above workaround had sneaked in via a patch):
~~~
MariaDB [nova]> select disable_terminate from instances;
+-------------------+
| disable_terminate |
+-------------------+
|                 0 |
|                 0 |
|                 0 |
|                 0 |
|                 0 |
|                 0 |
|                 0 |
|                 0 |
|                 0 |
+-------------------+
~~~
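The check above is easy to script. A minimal sketch: it runs here against a saved copy of the column values from this reproducer; feeding it live data via `sudo mysql -N nova -e 'SELECT disable_terminate FROM instances;'` on the undercloud is an assumption about the customer's setup, not a verified procedure.

```shell
# Count instances that have the disable_terminate workaround set.
# db_output is a saved sample of the column values from this ticket;
# on a live undercloud you could capture it instead with:
#   db_output=$(sudo mysql -N nova -e 'SELECT disable_terminate FROM instances;')
db_output='0
0
0
0
0
0
0
0
0'

# Any line that is not exactly "0" means the flag is set for that instance.
flagged=$(printf '%s\n' "$db_output" | grep -cv '^0$' || true)
echo "instances with disable_terminate set: $flagged"
```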
The ask here is to:
a) provide a definitive way to reproduce this issue
b) provide a workaround for the customer so that they won't run into this issue
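For (b), until there is a definitive answer, the article's UPDATE statement would have to be applied once per overcloud node UUID. A hedged sketch of how the statements could be generated, not a supported procedure: the `nova list` rows below are the saved sample from this reproducer, and actually running the generated SQL against the undercloud's nova database is left as a manual, reviewed step.

```shell
# Hypothetical helper: generate one UPDATE statement per overcloud node
# from `nova list` output. Here the table rows are a saved sample; on a
# real undercloud you would capture them with: nova_list=$(nova list)
nova_list='| 962cec64-aa45-4773-97e0-bc80691d1955 | overcloud-compute-0 | ACTIVE | - | Running | ctlplane=192.0.2.20 |
| 52ce710f-4203-43f1-a98d-89831be177f5 | overcloud-ctrl-0 | ACTIVE | - | Running | ctlplane=192.0.2.21 |'

# Field 2 of the pipe-separated table is the instance UUID; emit the SQL
# suggested by the blog post for every overcloud node.
sql=$(printf '%s\n' "$nova_list" | awk -F'|' '/overcloud-/ {
  gsub(/ /, "", $2)
  printf "UPDATE instances SET disable_terminate = 1 WHERE uuid = \047%s\047;\n", $2
}')
printf '%s\n' "$sql"
```

As described in the article, disable_terminate only makes nova refuse to terminate the instance; it is a safety net, not a fix for whatever would trigger the rebuild in the first place.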