Bug 1568561 - FFU: scaling out with an additional compute node after the upgrade procedure gets stuck
Summary: FFU: scaling out with an additional compute node after the upgrade procedure ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 13.0 (Queens)
Assignee: Emilien Macchi
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On:
Blocks: 1561169
Reported: 2018-04-17 18:44 UTC by Marius Cornea
Modified: 2018-06-27 13:52 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-8.0.2-23.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-27 13:52:00 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 568236 0 'None' MERGED FFU Set NetworkDeploymentActions CREATE,UPDATE for ffwd-upgrade prepare 2020-08-31 22:30:21 UTC
Red Hat Product Errata RHEA-2018:2086 0 None None None 2018-06-27 13:52:54 UTC

Description Marius Cornea 2018-04-17 18:44:35 UTC
Description of problem:

FFU: scaling out with an additional compute node after the upgrade procedure gets stuck:

We can see the nova compute node getting provisioned:

(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| fb10588b-594e-444c-9114-3189414a2f58 | compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.24.13 |
| 82c45537-eaf9-4115-8e54-0c48f5b831d8 | compute-1    | ACTIVE | -          | Running     | ctlplane=192.168.24.11 |
| 6b02eb1c-5790-4baf-ae75-d7f5e15ab186 | compute-4    | ACTIVE | -          | Running     | ctlplane=192.168.24.14 |
| c9cd690f-44f0-41a1-9b69-f6b5a13553ce | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.15 |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+

The new node (compute-4) is reachable via SSH, but the network configuration defined in the NIC templates is not applied and the stack update gets stuck.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.2-0.20180410170330.a39634a.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:

1. Deploy OSP10 with 1 controller + 3 computes
2. Upgrade to OSP13 via the FFU procedure
3. Update undercloud Glance images to OSP13 images
4. Update deploy_kernel and deploy_ramdisk for all ironic nodes:
openstack baremetal node set --driver-info deploy_kernel=$uuid --driver-info deploy_ramdisk=$uuid $uuid
5. Run overcloud deploy command with incremented --compute-scale
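
Step 4 could be scripted along these lines. This is only a sketch: it assumes the undercloud credentials are already sourced and that the OSP13 deploy images were uploaded under the names bm-deploy-kernel and bm-deploy-ramdisk (an assumption, not stated in this report). The function name update_deploy_images is hypothetical.

```shell
# Hypothetical helper: point every ironic node at the new deploy images.
# Assumes undercloud credentials are sourced and that the images are named
# bm-deploy-kernel / bm-deploy-ramdisk (assumption, adjust as needed).
update_deploy_images() {
    # Look up the Glance IDs of the new deploy kernel and ramdisk.
    kernel_id=$(openstack image show bm-deploy-kernel -f value -c id) || return 1
    ramdisk_id=$(openstack image show bm-deploy-ramdisk -f value -c id) || return 1

    # Apply both IDs to each registered ironic node.
    for node in $(openstack baremetal node list -f value -c UUID); do
        openstack baremetal node set \
            --driver-info deploy_kernel="$kernel_id" \
            --driver-info deploy_ramdisk="$ramdisk_id" \
            "$node"
    done
}
```

Calling update_deploy_images once on the undercloud would then replace the per-node command shown in step 4.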

Actual results:
Stack update gets stuck.

Expected results:
Stack update succeeds and a new compute node is configured.

Additional info:

I suspect this is related to the fact that the initial OSP10 environment was deployed with old-style NIC templates, while OSP13 supports only new-style NIC templates. I filed this ticket to formulate and document the steps required to allow post-upgrade scale-out within the FFU context. More specifically: where in the upgrade process, and how, do we update the NIC templates so that nodes can be added or replaced after the upgrade? I know there is a tool (BZ#1544571) that converts the NIC templates, but we need to document and test how it can be plugged into the upgrade procedure.

Comment 1 Marius Cornea 2018-04-17 21:30:58 UTC
Possible approach (I'm not sure it's valid, though, because it involves two consecutive stack updates, which is probably not acceptable for large environments where this operation could take a long time). In addition, this operation most likely won't fit into the 4h maintenance window required by some operators.

Post upgrade do:

1/ update the nic templates to the new format by using the script provided by BZ#1544571
2/ set NetworkDeploymentActions: ['CREATE', 'UPDATE'] in an environment file
3/ re-run openstack overcloud deploy command to update stack with the new nic templates
4/ set NetworkDeploymentActions: ['CREATE'] 
5/ re-run openstack overcloud deploy command to update stack to restore NetworkDeploymentActions to the original value
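
The environment file used in steps 2 and 4 above might be generated like this. It is a minimal sketch: the helper name and the output path are hypothetical, and it assumes NetworkDeploymentActions is set through parameter_defaults.

```shell
# Hypothetical helper: write a Heat environment file that sets
# NetworkDeploymentActions.
#   $1: output path
#   $2: the value as a YAML list, e.g. "['CREATE', 'UPDATE']"
write_network_actions_env() {
    printf 'parameter_defaults:\n  NetworkDeploymentActions: %s\n' "$2" > "$1"
}

# Usage (steps 2 and 4 from the list above):
#   write_network_actions_env "$HOME/network-actions.yaml" "['CREATE', 'UPDATE']"
#   ...re-run the original deploy command with: -e "$HOME/network-actions.yaml"
#   write_network_actions_env "$HOME/network-actions.yaml" "['CREATE']"
#   ...re-run the deploy command once more to restore the original value
```

Keeping the setting in its own small file means the second stack update only has to rewrite one value rather than touch the NIC templates again.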

Comment 2 Marios Andreou 2018-04-25 11:08:06 UTC
o/ I've taken this for triage this week. Not sure yet whether we will need a stand-alone step or whether we can put this into our existing workflow. There is only one stack update during the ffwd-upgrade, which is the initial ffwd-upgrade prepare; I'm wondering if we can:

1. convert the templates to use new os-net-config format
2. run the ffwd-upgrade prepare and make sure that Network resources are updated
3. 'reset' the NetworkDeploymentActions during converge (since this isn't currently running a stack update this 'reset' will be on the stored plan, but will be applied on the next stack update).

This is definitely related to BZ 1544571 as you pointed out and also BZ 1561255 and BZ 1559151 since they are all about the old vs new os-net-config format. Marking triaged for now but we should discuss during our next call for next steps and testing.

Comment 3 Marios Andreou 2018-05-08 15:29:23 UTC
o/ revisiting... we now *will* have a heat stack update on converge. I think we can try/test changing to the new format before the ffwd-upgrade prepare. The config won't be applied at that point since we no-op all the things (including the os-net-config script [0][1] AFAICS). Then the 'normal' heat stack update on converge will deliver the 'new' os-net-config configuration.

To do this I think we may need to set NetworkDeploymentActions as you mention in comment #1: set it to UPDATE in ffwd-upgrade-prepare.yaml and then set it back to CREATE in ffwd-upgrade-converge.yaml. Going to bring this to our next call; adding this comment for now.

[0] https://github.com/openstack/tripleo-heat-templates/blob/c7d18a4db3874b8a29db76aee2835422da69b40a/network/config/single-nic-vlans/role.role.j2.yaml#L50-L56
[1] https://github.com/openstack/tripleo-heat-templates/blob/c7d18a4db3874b8a29db76aee2835422da69b40a/network/config/single-nic-vlans/role.role.j2.yaml#L50-L56

Comment 4 Marios Andreou 2018-05-09 15:55:55 UTC
as just discussed on the phone mcornea here is a patch to test the approach and we can take it from there - tht @ https://review.openstack.org/567270 WIP: Set NetworkDeploymentActions to CREATE,UPDATE for ffwd prepare

Comment 5 Marius Cornea 2018-05-10 17:28:55 UTC
(In reply to Marios Andreou from comment #4)
> as just discussed on the phone mcornea here is a patch to test the approach
> and we can take it from there - tht @ https://review.openstack.org/567270
> WIP: Set NetworkDeploymentActions to CREATE,UPDATE for ffwd prepare

Thanks Marios. On a first test I got positive results with the following workflow:

1/ convert the nic template by using the script provided in tht before starting the fast forward process - https://review.openstack.org/#/c/567428/ - this would be a manual step the operator should do

2/ applied your patch https://review.openstack.org/#/c/567270/ and the workaround for BZ#1561255 to remove /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json from overcloud nodes

3/ run the fast forward upgrade on the overcloud nodes

4/ scale down and scale out worked fine
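
The file removal in step 2 above could be done across all overcloud nodes roughly as follows. This is a sketch under assumptions: it uses the heat-admin user with passwordless sudo (not stated in this report), reads the ctlplane addresses from the undercloud server list, and the helper names are hypothetical.

```shell
# Path taken from step 2 above.
STALE_CONFIG=/usr/libexec/os-apply-config/templates/etc/os-net-config/config.json

# Print the ctlplane IP of every overcloud node, one per line.
ctlplane_ips() {
    openstack server list -f value -c Networks |
        sed -n 's/.*ctlplane=\([0-9.]*\).*/\1/p'
}

# Hypothetical helper: remove the stale template from each node.
# Assumes the heat-admin user with passwordless sudo (assumption).
remove_stale_config() {
    for ip in $(ctlplane_ips); do
        ssh "heat-admin@$ip" sudo rm -f "$STALE_CONFIG"
    done
}
```

Running remove_stale_config from the undercloud, with its credentials sourced, would apply the BZ#1561255 workaround to every node in one pass.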

Comment 7 Marios Andreou 2018-05-11 14:23:04 UTC
(In reply to Marius Cornea from comment #5)

ack, thanks for the update @mcornea - gonna remove the WIP from https://review.openstack.org/#/c/567270/ and try and sell it :)

Comment 8 Marios Andreou 2018-05-14 09:01:56 UTC
adding queens removing master

Comment 18 errata-xmlrpc 2018-06-27 13:52:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

