Bug 1378268

Summary: Unintended Overcloud node replacement after changes to OS::Nova::Server
Product: Red Hat OpenStack
Reporter: Andreas Karis <akaris>
Component: openstack-tripleo
Assignee: James Slagle <jslagle>
Status: CLOSED NOTABUG
QA Contact: Arik Chernetsky <achernet>
Severity: high
Docs Contact:
Priority: high
Version: 7.0 (Kilo)
CC: jslagle, mburns, mcornea, rhel-osp-director-maint
Target Milestone: ---   
Target Release: 7.0 (Kilo)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-09-23 18:11:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Andreas Karis 2016-09-22 00:42:42 UTC
Description of problem:
Customer has updated their overcloud images post-deployment due to the following issue:  https://access.redhat.com/security/vulnerabilities/2359821
However, they are concerned that a future update of the overcloud (such as adding a node) may cause the existing overcloud nodes to be rebuilt, causing data loss.  This concern is described in the following article under "The ugly experience with TripleO":
http://alesnosek.com/blog/2016/03/27/tripleo-installer-the-good/

The customer would like to know if this concern is valid and the best way to ensure this doesn't happen.  The article suggests running the following command:
UPDATE instances SET disable_terminate = 1 WHERE uuid = '<uuid of the overcloud instance>';
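
As a minimal sketch of what that statement changes, the snippet below runs it against an in-memory SQLite table that stands in for nova's `instances` table. The two-column table layout, the placeholder UUIDs, and the `protect_instance` helper are assumptions made for illustration only; the real nova database is MariaDB with a much larger schema, and nothing here touches a real undercloud.

```python
import sqlite3


def protect_instance(conn, instance_uuid):
    """Set disable_terminate = 1 for one instance and return the rows changed."""
    cur = conn.execute(
        "UPDATE instances SET disable_terminate = 1 WHERE uuid = ?",
        (instance_uuid,),
    )
    conn.commit()
    return cur.rowcount


# Stand-in table: only the two columns the workaround refers to (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instances ("
    "uuid TEXT PRIMARY KEY, disable_terminate INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO instances (uuid) VALUES (?)",
    [("uuid-controller-0",), ("uuid-compute-0",)],  # placeholder UUIDs
)

changed = protect_instance(conn, "uuid-controller-0")
flags = dict(conn.execute("SELECT uuid, disable_terminate FROM instances"))
print(changed, flags)
```

Passing the UUID as a bound parameter rather than formatting it into the string keeps the statement limited to the single row that was intended.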

I tried to reproduce this with the latest OSP 7 packages, but I could not.

*) According to 
https://review.openstack.org/#/c/288273/
"This change ensures that any property change in an OS::Nova::Server resource will not result in the server being replaced"

Without that change, any modification of the following properties would be expected to trigger a complete redeploy:
/usr/share/openstack-tripleo-heat-templates/controller.yaml:
~~~
  Controller:
    type: OS::Nova::Server
    properties:
      image: {get_param: Image}
      image_update_policy: {get_param: ImageUpdatePolicy}
      flavor: {get_param: Flavor}
      key_name: {get_param: KeyName}
      networks:
        - network: ctlplane
      user_data_format: SOFTWARE_CONFIG
      user_data: {get_resource: NodeUserData}
      name: {get_param: Hostname}
~~~

With NodeUserData being defined as:
~~~
  NodeUserData:
    type: OS::TripleO::NodeUserData
~~~
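
One way to see which server properties a given stack update would touch is to diff the resolved property values before and after the change. The sketch below is a hypothetical helper, not part of Heat or TripleO; the property names come from the template above, while the sample values are made up for illustration.

```python
def changed_properties(old, new):
    """Return a sorted list of property names whose values differ."""
    keys = set(old) | set(new)
    return sorted(k for k in keys if old.get(k) != new.get(k))


# Sample values only; a real comparison would use the resolved parameters.
deployed = {
    "image": "overcloud-full",
    "flavor": "control",
    "key_name": "default",
    "name": "overcloud-controller-0",
}
proposed = dict(deployed, name="overcloud-ctrl-0")

print(changed_properties(deployed, proposed))
```

The helper only reports which properties differ. Whether Heat updates the server in place or replaces it for a given property depends on the Heat version and the resource's update policies, which this sketch does not model.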

*) I tried to modify the contents of NodeUserData as well as the script's name in the resource registry file. Nothing triggered a rerun of user data, nor a redeploy of the environment.
~~~
[stack@undercloud-7 ~]$ cat templates/userdata-environment.yaml 
resource_registry:
  OS::TripleO::NodeUserData: userdata2.yaml
[stack@undercloud-7 ~]$ cat templates/userdata2.yaml 
heat_template_version: 2014-10-16
description: >
  Extra hostname configuration

resources:
  userdata2:
    type: OS::Heat::MultipartMime
    properties:
      parts:
      - config: {get_resource: nameserver_config2}
  nameserver_config2:
    type: OS::Heat::SoftwareConfig
    properties:
      config: |
        #!/bin/bash
        echo "nameserver 8.8.8.8" >> /etc/resolv.conf
        echo "userdata2.yaml" > /root/from-deploy.txt

outputs:
  OS::stack_id:
    value: {get_resource: userdata2}
~~~

*) I then added the following to the parameter_defaults: section of network-environment:

~~~
  ControllerHostnameFormat: '%stackname%-ctrl-%index%'
~~~

to force a change of the hostname. While this changed the hostname in `nova list`, it did not have any impact on the controller.

~~~
+--------------------------------------+---------------------+--------+------------+-------------+---------------------+
| ID                                   | Name                | Status | Task State | Power State | Networks            |
+--------------------------------------+---------------------+--------+------------+-------------+---------------------+
| 962cec64-aa45-4773-97e0-bc80691d1955 | overcloud-compute-0 | ACTIVE | -          | Running     | ctlplane=192.0.2.20 |
| 52ce710f-4203-43f1-a98d-89831be177f5 | overcloud-ctrl-0    | ACTIVE | -          | Running     | ctlplane=192.0.2.21 |
+--------------------------------------+---------------------+--------+------------+-------------+---------------------+
~~~

All it did was make the deployment fail.

~~~
[stack@undercloud-7 ~]$ templates/deploy.sh 
control_scale=1, compute_scale=1, ceph_scale=0
Deploying templates in the directory /usr/share/openstack-tripleo-heat-templates
Stack failed with status: resources.ControllerNodesPostDeployment: resources.ControllerLoadBalancerDeployment_Step1: Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
ERROR: openstack Heat Stack update failed.
~~~
~~~
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
| resource_name                                 | physical_resource_id                          | resource_type                                     | resource_status | updated_time         | parent_resource                               |
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
| ControllerNodesPostDeployment                 | d39ede89-1177-4691-b8c2-c98866164abd          | OS::TripleO::ControllerPostDeployment             | UPDATE_FAILED   | 2016-09-21T16:34:16Z |                                               |
| ControllerLoadBalancerDeployment_Step1        | a57cb984-5d6d-46f1-b361-1fafac199ca8          | OS::Heat::StructuredDeployments                   | UPDATE_FAILED   | 2016-09-21T16:35:32Z | ControllerNodesPostDeployment                 |
| 0                                             | 71cefc61-14cc-4085-896a-24c71a3b1c9b          | OS::Heat::StructuredDeployment                    | UPDATE_FAILED   | 2016-09-21T16:35:34Z | ControllerLoadBalancerDeployment_Step1        |
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
~~~
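
To read a failure listing like the one above, it can help to follow the parent_resource column from the top-level resource down to the deployment that actually failed. The following is a minimal sketch, assuming the rows have already been extracted as (resource_name, parent_resource) pairs and that each parent has a single failed child; `failure_chain` is a hypothetical helper, not a Heat command.

```python
def failure_chain(rows):
    """Order failed resources from the outermost parent to the innermost child."""
    # Map each parent to its failed child (assumes one failed child per parent).
    children = {parent: name for name, parent in rows}
    chain, current = [], ""  # "" marks a resource with no parent
    while current in children:
        current = children[current]
        chain.append(current)
    return chain


rows = [
    ("0", "ControllerLoadBalancerDeployment_Step1"),
    ("ControllerNodesPostDeployment", ""),
    ("ControllerLoadBalancerDeployment_Step1", "ControllerNodesPostDeployment"),
]

print(" -> ".join(failure_chain(rows)))
```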

Running the update again led to the same error message.


*) This is a blocker for the customer's go-live.

Comment 3 Andreas Karis 2016-09-22 00:45:51 UTC
*) I installed OSP 7 with 7.3.1 overcloud images. This was a lab that I was already using for something else. I then downloaded the 7.3.2 images (all 3 of them) and updated the images:

[stack@undercloud-7 ~]$ glance image-list
+--------------------------------------+----------------------------------------+-------------+------------------+------------+--------+
| ID                                   | Name                                   | Disk Format | Container Format | Size       | Status |
+--------------------------------------+----------------------------------------+-------------+------------------+------------+--------+
| ef7e2d56-639e-432e-a428-45f4e7b1dd8e | bm-deploy-kernel                       | aki         | aki              | 5027648    | active |
| 9cdd3161-af4c-4d81-b986-4d538ac04d5c | bm-deploy-ramdisk                      | ari         | ari              | 56384263   | active |
| f9380aa6-0398-4b3b-9861-fafc0ab4cb16 | overcloud-full                         | qcow2       | bare             | 1031872512 | active |
| ee1d0ab8-6948-4936-8e83-203b1347eda7 | overcloud-full-initrd                  | ari         | ari              | 40336571   | active |
| c00f99d4-e0ca-4a01-abbc-9829c8ea1e91 | overcloud-full-initrd_20160907T224835  | ari         | ari              | 40325640   | active |
| ffdc8a57-2033-4462-bce9-f085ae175541 | overcloud-full-vmlinuz                 | aki         | aki              | 5153536    | active |
| d65ba41f-0e31-4f81-85b9-cf0ba261683b | overcloud-full-vmlinuz_20160907T224832 | aki         | aki              | 5153184    | active |
| e7a3bb06-ed95-4623-b83d-c15e018b0932 | overcloud-full_20160907T224836         | qcow2       | bare             | 977790976  | active |
+--------------------------------------+----------------------------------------+-------------+------------------+------------+--------+

As you can see, the images are different.
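
The difference can also be checked mechanically by comparing the Size column of an image against its timestamp-suffixed counterpart. This is a minimal sketch, assuming the relevant rows of the listing are pasted in as text; `parse_sizes` is a hypothetical helper and not part of the glance client.

```python
def parse_sizes(table):
    """Map image name -> size in bytes from glance image-list style rows."""
    sizes = {}
    for line in table.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows have six cells with a numeric Size in the fifth one;
        # header and border lines are skipped by this check.
        if len(cells) == 6 and cells[4].isdigit():
            sizes[cells[1]] = int(cells[4])
    return sizes


table = """
| f9380aa6-0398-4b3b-9861-fafc0ab4cb16 | overcloud-full                         | qcow2       | bare             | 1031872512 | active |
| e7a3bb06-ed95-4623-b83d-c15e018b0932 | overcloud-full_20160907T224836         | qcow2       | bare             | 977790976  | active |
"""

sizes = parse_sizes(table)
delta = sizes["overcloud-full"] - sizes["overcloud-full_20160907T224836"]
print(delta)
```

A non-zero difference only shows that the two images are not the same size; comparing checksums would be a stronger check, but the listing above does not include them.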

I then launched a stack update, and it has already reached the post-deployment stage, so the image update did not cause Heat to completely redeploy the compute nodes.

I also verified in the database, in case the workaround above had been introduced via a patch:

MariaDB [nova]> select disable_terminate from instances
    -> ;
+-------------------+
| disable_terminate |
+-------------------+
|                 0 |
|                 0 |
|                 0 |
|                 0 |
|                 0 |
|                 0 |
|                 0 |
|                 0 |
|                 0 |
+-------------------+

Comment 4 Andreas Karis 2016-09-22 00:46:29 UTC
The ask here is to

a) provide a definitive way to reproduce this issue

b) provide a workaround for the customer so that they won't run into this issue