Bug 1477962

Summary: OSP11 -> OSP12 upgrade: Ensure non-controller are usable after upgrade and before converge.
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Marios Andreou <mandreou>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 12.0 (Pike)
CC: dbecker, jschluet, lyarwood, mandreou, mbracho, mbultel, mburns, morazi, rhel-osp-director-maint, sathlang, sclewis
Target Milestone: beta
Keywords: Triaged
Target Release: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-7.0.1-0.20170928105409.el7ost python-tripleoclient-7.3.1-0.20170925220840.f114a61.el7ost openstack-tripleo-common-7.6.1-0.20170926174320.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-13 21:48:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1399762, 1477770

Description Marius Cornea 2017-08-03 10:35:20 UTC
Description of problem:

The previous iteration had a mechanism in place, https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/tasks/tripleo_upgrade_node.sh#L61-L69, to ensure that non-controller nodes were working after the upgrade and before the converge.

This is especially critical for compute nodes, which should be able to receive VMs before the converge step.

For compute nodes we also have to ensure that the RPC pin/unpin happens within the nova_compute container, using the UpgradeLevelNovaCompute parameter.

Comment 1 Marios Andreou 2017-08-16 12:23:58 UTC
Current proposal (duplicated from upstream bug for convenience) and adding the reviews to trackers above

With the help of a utility function in https://review.openstack.org/#/c/491749/ (python-tripleoclient) we can use the upgrade_tasks playbook generated by the tripleo-heat-templates at https://review.openstack.org/#/c/490848/ (note: this depends on a few shardy tht reviews see shortlog).

So, in the upgrade-non-controller.sh script, we add download and execution of both the upgrade_tasks and deploy_steps playbooks with https://review.openstack.org/#/c/490847/ (tripleo-common).

The generated playbooks look like https://paste.fedoraproject.org/paste/gUi5Ckq2qoTT~ed5kItxRw/raw (while it lasts). It seems that most of the things we need for the compute and swift nodes are in the upgrade_tasks (e.g. stopping openstack-nova-compute, which we recently had to add to tripleo_upgrade_node.sh).

Reviews:
     (tripleo-common): https://review.openstack.org/#/c/490847/ "Download and run upgrade/deploy_steps_playbooks for upgrade"
     |
     |Depends-On:
     |
     -->(tripleo-heat-templates): https://review.openstack.org/#/c/490848/ "Also write an upgrade_(batch)_tasks playbook" (&see shortlog!)
        |
        |Depends-On:
        |
        -->(python-tripleoclient): https://review.openstack.org/#/c/491749/ "Adds when in upgrade_tasks playbook written by config download"

Comment 7 Marios Andreou 2017-08-29 12:41:25 UTC
I also just posted https://review.openstack.org/498776 to disable the puppet config run and related workarounds in the tripleo_upgrade_node.sh script. If you are testing, you'll also need to apply this to your tripleo-heat-templates before running the major-upgrade-composable-steps-docker.yaml stage of the overcloud upgrade.

adding to trackers and for testing:

# tripleo-heat-templates: https://review.openstack.org/#/c/498776/ "Remove puppet run and workarounds from tripleo_upgrade_node.sh" 

 curl "https://review.openstack.org/changes/498776/revisions/current/patch?download" | base64 -d | sudo patch -d /usr/share/openstack-tripleo-heat-templates/ -p1
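
The one-liner above pipes the download straight into patch, and a patch that no longer applies cleanly can leave the templates partly modified. A more cautious variant, as a sketch only: decode to a temporary file, check it with patch --dry-run, and apply it only if the check passes. apply_b64_patch is a hypothetical helper name, not part of TripleO, and the sketch assumes GNU patch and base64 are installed.

```shell
#!/bin/bash
# Hypothetical helper (not part of TripleO): apply a base64-encoded patch
# to a directory only if a dry run succeeds first.
# Usage: apply_b64_patch ENCODED_FILE TARGET_DIR
apply_b64_patch() {
    local encoded_file=$1
    local target_dir=$2
    local decoded rc
    decoded=$(mktemp) || return 1

    # The downloaded patch is base64-encoded, hence the decode step.
    if ! base64 -d < "$encoded_file" > "$decoded"; then
        echo "could not decode $encoded_file" >&2
        rm -f "$decoded"
        return 1
    fi

    # --dry-run checks whether every hunk applies without changing files;
    # -f keeps patch from stopping to ask questions.
    if patch -f --dry-run -d "$target_dir" -p1 < "$decoded" > /dev/null; then
        patch -f -d "$target_dir" -p1 < "$decoded"
        rc=$?
    else
        echo "patch does not apply cleanly to $target_dir" >&2
        rc=1
    fi

    rm -f "$decoded"
    return "$rc"
}

# Example (run as a user allowed to write to the target directory):
#   curl -o /tmp/498776.b64 "https://review.openstack.org/changes/498776/revisions/current/patch?download"
#   apply_b64_patch /tmp/498776.b64 /usr/share/openstack-tripleo-heat-templates/
```

The helper returns non-zero and leaves the target directory untouched when the dry run fails, so it can be used in a script before continuing with the upgrade stage.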

Comment 9 Marius Cornea 2017-08-30 15:18:55 UTC
So I managed to get the RoleConfig output after applying the following patch and running the deploy command with the --setup-heat-outputs option.

I think we should include this step in the major-upgrade-composable-steps-docker.yaml step so we don't have to include an additional step in the upgrade procedure.

curl -4 "https://review.openstack.org/changes/495658/revisions/current/patch?download" | base64 -d | sudo patch -d /usr/lib/python2.7/site-packages/ -p1 -f

#!/bin/bash

timeout 180m openstack overcloud deploy \
--setup-heat-outputs \
--templates /usr/share/openstack-tripleo-heat-templates \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--environment-file /usr/share/openstack-tripleo-heat-templates/environments/services-docker/sahara.yaml \
--environment-file /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
-e /home/stack/docker-osp12.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml


After this I was able to run upgrade-non-controller.sh --upgrade compute-0, which failed with the error shown further down.

A quick note here: /usr/bin/tripleo-ansible-inventory --list takes around 2 minutes for a basic 1 controller + 1 compute deployment, so you get the impression that the command is stuck at:

Wed Aug 30 11:03:04 EDT 2017 upgrade-non-controller.sh Starting the upgrade steps playbook run for compute-0 from compute-0/tripleo-bVAAT_-config/

In the end the playbook fails with the following error:

TASK [Ensure empty directory: emptying.] ******************************************************************************************************************************************************************************************************
 [WARNING]: when statements should not include jinja2 templating delimiters such as {{ }} or {% %}. Found: ('2.5.0-14' in '{{ovs_version.stdout}}' or ovs_packaging_issue|succeeded) and (step == 2)

fatal: [192.168.24.13]: FAILED! => {"failed": true, "msg": "The conditional check '('2.5.0-14' in '{{ovs_version.stdout}}' or ovs_packaging_issue|succeeded) and (step == 2)' failed. The error was: error while evaluating conditional (('2.5.0-14' in '{{ovs_version.stdout}}' or ovs_packaging_issue|succeeded) and (step == 2)): 'dict object' has no attribute 'stdout'\n\nThe error appears to have been in '/home/stack/compute-0/tripleo-bVAAT_-config/Compute/upgrade_tasks.yaml': line 42, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n- block:\n  - file:\n    ^ here\n"}
	to retry, use: --limit @/home/stack/compute-0/tripleo-bVAAT_-config/upgrade_steps_playbook.retry

Comment 10 Marius Cornea 2017-08-30 15:24:34 UTC
This is the complete output:

You can see that the 'Check openvswitch version.' task is skipped, hence the "'dict object' has no attribute 'stdout'" error for ovs_version.stdout.


PLAY [overcloud] ******************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ************************************************************************************************************************************************************************************************************************
ok: [192.168.24.13]

TASK [include] ********************************************************************************************************************************************************************************************************************************
included: /home/stack/compute-0/tripleo-bVAAT_-config/upgrade_steps_tasks.yaml for 192.168.24.13
included: /home/stack/compute-0/tripleo-bVAAT_-config/upgrade_steps_tasks.yaml for 192.168.24.13
included: /home/stack/compute-0/tripleo-bVAAT_-config/upgrade_steps_tasks.yaml for 192.168.24.13
included: /home/stack/compute-0/tripleo-bVAAT_-config/upgrade_steps_tasks.yaml for 192.168.24.13
included: /home/stack/compute-0/tripleo-bVAAT_-config/upgrade_steps_tasks.yaml for 192.168.24.13

TASK [include] ********************************************************************************************************************************************************************************************************************************
skipping: [192.168.24.13]

TASK [include] ********************************************************************************************************************************************************************************************************************************
included: /home/stack/compute-0/tripleo-bVAAT_-config/Compute/upgrade_tasks.yaml for 192.168.24.13

TASK [Check if neutron_ovs_agent is deployed] *************************************************************************************************************************************************************************************************
changed: [192.168.24.13]

TASK [Check yum for rpm-python present] *******************************************************************************************************************************************************************************************************
skipping: [192.168.24.13]

TASK [Fail when rpm-python wasn't present] ****************************************************************************************************************************************************************************************************
skipping: [192.168.24.13]

TASK [PreUpgrade step0,validation: Check service neutron-openvswitch-agent is running] ********************************************************************************************************************************************************
skipping: [192.168.24.13]

TASK [Stop neutron_ovs_agent service] *********************************************************************************************************************************************************************************************************
skipping: [192.168.24.13]

TASK [Stop snmp service] **********************************************************************************************************************************************************************************************************************
skipping: [192.168.24.13]

TASK [Check openvswitch version.] *************************************************************************************************************************************************************************************************************
skipping: [192.168.24.13]

TASK [Check openvswitch packaging.] ***********************************************************************************************************************************************************************************************************
skipping: [192.168.24.13]

TASK [Ensure empty directory: emptying.] ******************************************************************************************************************************************************************************************************
 [WARNING]: when statements should not include jinja2 templating delimiters such as {{ }} or {% %}. Found: ('2.5.0-14' in '{{ovs_version.stdout}}' or ovs_packaging_issue|succeeded) and (step == 2)

fatal: [192.168.24.13]: FAILED! => {"failed": true, "msg": "The conditional check '('2.5.0-14' in '{{ovs_version.stdout}}' or ovs_packaging_issue|succeeded) and (step == 2)' failed. The error was: error while evaluating conditional (('2.5.0-14' in '{{ovs_version.stdout}}' or ovs_packaging_issue|succeeded) and (step == 2)): 'dict object' has no attribute 'stdout'\n\nThe error appears to have been in '/home/stack/compute-0/tripleo-bVAAT_-config/Compute/upgrade_tasks.yaml': line 42, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n- block:\n  - file:\n    ^ here\n"}
	to retry, use: --limit @/home/stack/compute-0/tripleo-bVAAT_-config/upgrade_steps_playbook.retry

PLAY RECAP ************************************************************************************************************************************************************************************************************************************
192.168.24.13              : ok=8    changed=1    unreachable=0    failed=1
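
The [WARNING] in the output above points at when conditions that embed {{ }} delimiters. A quick way to list such lines in a downloaded config directory is a recursive grep. This is a rough sketch (find_templated_when is a hypothetical helper name), and it only catches conditions written on the same line as the when: key.

```shell
#!/bin/bash
# Hypothetical helper: list "when:" lines that contain jinja2 delimiters
# ({{ or {%) in the YAML files under a directory.
# Only single-line conditions are detected; a "when:" written as a
# multi-line YAML list is not.
# Usage: find_templated_when CONFIG_DIR
find_templated_when() {
    local config_dir=$1
    grep -rnE --include='*.yaml' --include='*.yml' \
        'when:.*(\{\{|\{%)' "$config_dir"
}

# Example, using the directory from the log above:
#   find_templated_when /home/stack/compute-0/tripleo-bVAAT_-config/
```

Removing the delimiters alone would not fix the failure shown here: a skipped task still registers its variable, but without a stdout key, so the condition would also need a guard such as "ovs_version.stdout is defined".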

Comment 11 Marios Andreou 2017-08-31 11:03:14 UTC
we also need https://review.openstack.org/#/c/499540/ mcornea ++ adding to trackers

Comment 12 Marius Cornea 2017-09-01 20:51:22 UTC
Adding another review for allowing the upgrade tasks to run between steps:
https://review.openstack.org/#/c/499517/


Also I filed a BZ for tripleo-ansible-inventory being too slow:
https://bugzilla.redhat.com/show_bug.cgi?id=1487759

Comment 13 Marius Cornea 2017-09-04 16:31:14 UTC
Remaining issues that we need to track in this bug:

 - set up RoleConfig output during major-upgrade-composable-steps so we don't have to run an additional step with --setup-heat-outputs option

 - cache the tripleo-ansible-inventory so we don't waste 5 minutes per non controller node waiting for the output of tripleo-ansible-inventory
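
The caching idea in the second item can be sketched as follows: run the slow inventory command once, keep its output in a file, and reuse that file for later nodes. This only illustrates the idea and is not necessarily how it ended up being addressed; cached_output and the cache path in the example are hypothetical names.

```shell
#!/bin/bash
# Hypothetical helper: print the output of a slow command, running it only
# when no non-empty cache file exists yet.
# Usage: cached_output CACHE_FILE COMMAND [ARGS...]
cached_output() {
    local cache_file=$1
    shift
    if [ ! -s "$cache_file" ]; then
        local tmp
        tmp=$(mktemp "${cache_file}.XXXXXX") || return 1
        # Write to a temporary file first so a failed run never leaves
        # a partial cache file behind.
        if "$@" > "$tmp"; then
            mv "$tmp" "$cache_file"
        else
            rm -f "$tmp"
            return 1
        fi
    fi
    cat "$cache_file"
}

# Example: the first call pays the full cost, later calls read the file.
#   cached_output "$HOME/inventory-cache.json" /usr/bin/tripleo-ansible-inventory --list
```

A cache like this has to be removed by hand whenever the set of overcloud nodes changes, otherwise later runs would use a stale inventory.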

Comment 14 Marius Cornea 2017-09-07 08:39:47 UTC
(In reply to Marius Cornea from comment #13)
> Remaining issues that we need to track in this bug:
> 
>  - set up RoleConfig output during major-upgrade-composable-steps so we
> don't have to run an additional step with --setup-heat-outputs option
> 
>  - cache the tripleo-ansible-inventory so we don't waste 5 minutes per non
> controller node waiting for the output of tripleo-ansible-inventory

The slow inventory issue was addressed by https://review.openstack.org/#/c/501603/

In addition we need to address upgrading non controller nodes for split stack deployments.

Comment 15 Marius Cornea 2017-09-14 12:38:11 UTC
RoleConfig output issue is being tracked in bug 1490425

Remaining issues to be addressed by this bug:

 - upgrading non controller nodes on split stack deployments

Comment 16 Marius Cornea 2017-09-14 13:17:09 UTC
(In reply to Marius Cornea from comment #15)
> RoleConfig output issue is being tracked in bug 1490425
> 
> Remaining issues to be addressed by this bug:
> 
>  - upgrading non controller nodes on split stack deployments

We actually have a different BZ (bug 1474697) filed for split stack deployments so I think this bug can be moved to POST as all the patches attached to it are merged to stable/pike.

Comment 17 Marios Andreou 2017-09-18 10:07:21 UTC
(In reply to Marius Cornea from comment #16)
> (In reply to Marius Cornea from comment #15)
> > RoleConfig output issue is being tracked in bug 1490425
> > 
> > Remaining issues to be addressed by this bug:
> > 
> >  - upgrading non controller nodes on split stack deployments
> 
> We actually have a different BZ (bug 1474697) filed for split stack
> deployments so I think this bug can be moved to POST as all the patches
> attached to it are merged to stable/pike.

Thanks mcornea. I updated the trackers to point to stable/pike (the last two merged before pike was branched, and I checked that they are in stable/pike tripleo-heat-templates and tripleo-common for https://review.openstack.org/#/c/490848/ and https://review.openstack.org/#/c/490847/ respectively).

I'll bring this up on our call later and we can move to POST.

Comment 23 errata-xmlrpc 2017-12-13 21:48:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462