Bug 1477962 - OSP11 -> OSP12 upgrade: Ensure non-controller are usable after upgrade and before converge.
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 12.0 (Pike)
Hardware: Unspecified Unspecified
Priority: urgent Severity: urgent
Target Milestone: beta
Target Release: 12.0 (Pike)
Assigned To: Marios Andreou
QA Contact: Marius Cornea
Keywords: Triaged
Depends On:
Blocks: 1399762 1477770
Reported: 2017-08-03 06:35 EDT by Marius Cornea
Modified: 2017-12-13 16:48 EST (History)
Fixed In Version: openstack-tripleo-heat-templates-7.0.1-0.20170928105409.el7ost python-tripleoclient-7.3.1-0.20170925220840.f114a61.el7ost openstack-tripleo-common-7.6.1-0.20170926174320.el7ost
Last Closed: 2017-12-13 16:48:30 EST
Type: Bug

External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1708115 None None None 2017-08-03 06:35 EDT
OpenStack gerrit 490847 None None None 2017-08-16 08:24 EDT
OpenStack gerrit 490848 None None None 2017-08-16 08:25 EDT
OpenStack gerrit 498776 None None None 2017-08-29 08:41 EDT
OpenStack gerrit 499625 None None None 2017-10-10 10:08 EDT
OpenStack gerrit 500596 None None None 2017-09-18 05:59 EDT
OpenStack gerrit 500751 None None None 2017-09-18 05:59 EDT
OpenStack gerrit 500752 None None None 2017-09-18 05:58 EDT

Description Marius Cornea 2017-08-03 06:35:20 EDT
Description of problem:

In the previous iteration we had a mechanism in place (https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/tasks/tripleo_upgrade_node.sh#L61-L69) to ensure that non-controller nodes were working after the upgrade and before the converge.

This is especially critical for compute nodes, which should be able to get VMs before the convergence step.

For compute nodes we also have to ensure that the RPC pin/unpin happens within the nova_compute container, using the UpgradeLevelNovaCompute parameter.
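
As an illustration of how the parameter mentioned above could be passed to a deployment, here is a minimal sketch that writes a Heat environment file setting UpgradeLevelNovaCompute. The helper name, the file paths, and the values "auto" (pin) and empty string (unpin) are assumptions for the example, not taken from this report.

```shell
#!/bin/bash
# Hypothetical helper: write a Heat environment file that sets
# UpgradeLevelNovaCompute. Helper name and file paths are illustrative only.
write_upgrade_level_env() {
    local env_file="$1"   # where to write the environment file
    local level="$2"      # assumed values: "auto" to pin, "" to unpin
    cat > "$env_file" <<EOF
parameter_defaults:
  UpgradeLevelNovaCompute: '${level}'
EOF
}

# Assumed workflow: pin while compute nodes are being upgraded,
# unpin again at the converge step.
write_upgrade_level_env /tmp/nova-rpc-pin.yaml auto
write_upgrade_level_env /tmp/nova-rpc-unpin.yaml ''
cat /tmp/nova-rpc-pin.yaml
```

Either file would then be passed to the deploy command with an additional -e option.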
Comment 1 Marios Andreou 2017-08-16 08:23:58 EDT
Current proposal (duplicated from upstream bug for convenience) and adding the reviews to trackers above

With the help of a utility function in https://review.openstack.org/#/c/491749/ (python-tripleoclient) we can use the upgrade_tasks playbook generated by the tripleo-heat-templates at https://review.openstack.org/#/c/490848/ (note: this depends on a few shardy tht reviews see shortlog).

So, in the upgrade-non-controller.sh script, we add download and execution of both the upgrade_tasks and deploy_steps playbooks with https://review.openstack.org/#/c/490847/ (tripleo-common).

The generated playbooks look like https://paste.fedoraproject.org/paste/gUi5Ckq2qoTT~ed5kItxRw/raw (while it lasts)... it seems that most of the things we need for the compute and swift nodes are in the upgrade_tasks (e.g. stopping openstack-nova-compute, which we recently had to add to tripleo_upgrade_node.sh).

Reviews:
     (tripleo-common): https://review.openstack.org/#/c/490847/ "Download and run upgrade/deploy_steps_playbooks for upgrade"
     |
     |Depends-On:
     |
     -->(tripleo-heat-templates): https://review.openstack.org/#/c/490848/ "Also write an upgrade_(batch)_tasks playbook" (&see shortlog!)
        |
        |Depends-On:
        |
        -->(python-tripleoclient): https://review.openstack.org/#/c/491749/ "Adds when in upgrade_tasks playbook written by config download"
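
To make the proposal above more concrete, the following sketch prints (without running) the ansible-playbook invocations for one node, first for the upgrade steps and then for the deploy steps. It is not the code from the reviews: the function name, the config directory, the deploy_steps_playbook.yaml file name, and the use of --limit with the node name are assumptions.

```shell
#!/bin/bash
# Sketch only: print the ansible-playbook commands for one node (dry run).
# The playbook file names and the --limit usage are assumptions.
print_upgrade_cmds() {
    local node="$1"         # overcloud node name, e.g. compute-0
    local config_dir="$2"   # directory holding the downloaded config
    local inventory=/usr/bin/tripleo-ansible-inventory
    local playbook
    # Upgrade tasks run first, then the deploy steps.
    for playbook in upgrade_steps_playbook.yaml deploy_steps_playbook.yaml; do
        printf 'ansible-playbook -i %s --limit %s %s/%s\n' \
            "$inventory" "$node" "$config_dir" "$playbook"
    done
}

print_upgrade_cmds compute-0 /home/stack/compute-0/config
```

Printing the commands instead of executing them keeps the sketch safe to try on a machine that has no overcloud.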
Comment 7 Marios Andreou 2017-08-29 08:41:25 EDT
I also just posted https://review.openstack.org/498776 to disable the puppet config run and related workarounds in the tripleo_upgrade_node.sh script. If testing, you'll also need to apply this to your tripleo-heat-templates before running the major-upgrade-composable-steps-docker.yaml stage of the overcloud upgrade.

adding to trackers and for testing:

# tripleo-heat-templates: https://review.openstack.org/#/c/498776/ "Remove puppet run and workarounds from tripleo_upgrade_node.sh" 

 curl 'https://review.openstack.org/changes/498776/revisions/current/patch?download' | base64 -d | sudo patch -d /usr/share/openstack-tripleo-heat-templates/ -p1
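
A slightly more defensive variant of the one-liner above, as a sketch: it downloads the base64-encoded patch, decodes it, and checks with patch --dry-run that it applies cleanly before touching any file. The function names are made up for this example; run apply_gerrit_patch with enough privileges to write to the target directory.

```shell
#!/bin/bash
# Build the Gerrit URL for the patch of a change's current revision.
gerrit_patch_url() {
    local change="$1"
    printf 'https://review.openstack.org/changes/%s/revisions/current/patch?download\n' "$change"
}

# Hypothetical helper: fetch, decode, dry-run, then apply.
# The body runs in a subshell so the EXIT trap cleans up on every path.
apply_gerrit_patch() (
    change="$1"
    target_dir="$2"
    tmp="$(mktemp -d)" || exit 1
    trap 'rm -rf "$tmp"' EXIT
    # -f makes curl fail on HTTP errors instead of saving an error page.
    curl -fsSL -o "$tmp/patch.b64" "$(gerrit_patch_url "$change")" || exit 1
    # The download is base64-encoded, hence the decode step.
    base64 -d "$tmp/patch.b64" > "$tmp/change.patch" || exit 1
    # --dry-run reports problems without modifying any file.
    if ! patch --dry-run -d "$target_dir" -p1 < "$tmp/change.patch" >/dev/null; then
        echo "change ${change} does not apply cleanly to ${target_dir}" >&2
        exit 1
    fi
    patch -d "$target_dir" -p1 < "$tmp/change.patch"
)

# Example (not executed here):
# apply_gerrit_patch 498776 /usr/share/openstack-tripleo-heat-templates
gerrit_patch_url 498776
```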
Comment 9 Marius Cornea 2017-08-30 11:18:55 EDT
So I managed to get the RoleConfig output after applying the following patch and running the deploy command with --setup-heat-outputs option. 

I think we should include this step in the major-upgrade-composable-steps-docker.yaml step so we don't have to include an additional step in the upgrade procedure.

curl -4 'https://review.openstack.org/changes/495658/revisions/current/patch?download' | base64 -d | sudo patch -d /usr/lib/python2.7/site-packages/ -p1 -f

#!/bin/bash

timeout 180m openstack overcloud deploy \
--setup-heat-outputs \
--templates /usr/share/openstack-tripleo-heat-templates \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--environment-file /usr/share/openstack-tripleo-heat-templates/environments/services-docker/sahara.yaml \
--environment-file /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
-e /home/stack/docker-osp12.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml


After this I was able to run upgrade-non-controller.sh --upgrade compute-0, which failed with the error shown further below.

A quick note first: /usr/bin/tripleo-ansible-inventory --list takes around 2 minutes for a basic 1 controller + 1 compute deployment, so you get the impression that the command is stuck at:

Wed Aug 30 11:03:04 EDT 2017 upgrade-non-controller.sh Starting the upgrade steps playbook run for compute-0 from compute-0/tripleo-bVAAT_-config/

In the end the playbook fails with the following error:

TASK [Ensure empty directory: emptying.] ******************************************************************************************************************************************************************************************************
 [WARNING]: when statements should not include jinja2 templating delimiters such as {{ }} or {% %}. Found: ('2.5.0-14' in '{{ovs_version.stdout}}' or ovs_packaging_issue|succeeded) and (step == 2)

fatal: [192.168.24.13]: FAILED! => {"failed": true, "msg": "The conditional check '('2.5.0-14' in '{{ovs_version.stdout}}' or ovs_packaging_issue|succeeded) and (step == 2)' failed. The error was: error while evaluating conditional (('2.5.0-14' in '{{ovs_version.stdout}}' or ovs_packaging_issue|succeeded) and (step == 2)): 'dict object' has no attribute 'stdout'\n\nThe error appears to have been in '/home/stack/compute-0/tripleo-bVAAT_-config/Compute/upgrade_tasks.yaml': line 42, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n- block:\n  - file:\n    ^ here\n"}
	to retry, use: --limit @/home/stack/compute-0/tripleo-bVAAT_-config/upgrade_steps_playbook.retry
Comment 10 Marius Cornea 2017-08-30 11:24:34 EDT
This is the complete output:

You can see that the 'Check openvswitch version.' task is skipped, hence the "'dict object' has no attribute 'stdout'" error regarding ovs_version.stdout.


PLAY [overcloud] ******************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ************************************************************************************************************************************************************************************************************************
ok: [192.168.24.13]

TASK [include] ********************************************************************************************************************************************************************************************************************************
included: /home/stack/compute-0/tripleo-bVAAT_-config/upgrade_steps_tasks.yaml for 192.168.24.13
included: /home/stack/compute-0/tripleo-bVAAT_-config/upgrade_steps_tasks.yaml for 192.168.24.13
included: /home/stack/compute-0/tripleo-bVAAT_-config/upgrade_steps_tasks.yaml for 192.168.24.13
included: /home/stack/compute-0/tripleo-bVAAT_-config/upgrade_steps_tasks.yaml for 192.168.24.13
included: /home/stack/compute-0/tripleo-bVAAT_-config/upgrade_steps_tasks.yaml for 192.168.24.13

TASK [include] ********************************************************************************************************************************************************************************************************************************
skipping: [192.168.24.13]

TASK [include] ********************************************************************************************************************************************************************************************************************************
included: /home/stack/compute-0/tripleo-bVAAT_-config/Compute/upgrade_tasks.yaml for 192.168.24.13

TASK [Check if neutron_ovs_agent is deployed] *************************************************************************************************************************************************************************************************
changed: [192.168.24.13]

TASK [Check yum for rpm-python present] *******************************************************************************************************************************************************************************************************
skipping: [192.168.24.13]

TASK [Fail when rpm-python wasn't present] ****************************************************************************************************************************************************************************************************
skipping: [192.168.24.13]

TASK [PreUpgrade step0,validation: Check service neutron-openvswitch-agent is running] ********************************************************************************************************************************************************
skipping: [192.168.24.13]

TASK [Stop neutron_ovs_agent service] *********************************************************************************************************************************************************************************************************
skipping: [192.168.24.13]

TASK [Stop snmp service] **********************************************************************************************************************************************************************************************************************
skipping: [192.168.24.13]

TASK [Check openvswitch version.] *************************************************************************************************************************************************************************************************************
skipping: [192.168.24.13]

TASK [Check openvswitch packaging.] ***********************************************************************************************************************************************************************************************************
skipping: [192.168.24.13]

TASK [Ensure empty directory: emptying.] ******************************************************************************************************************************************************************************************************
 [WARNING]: when statements should not include jinja2 templating delimiters such as {{ }} or {% %}. Found: ('2.5.0-14' in '{{ovs_version.stdout}}' or ovs_packaging_issue|succeeded) and (step == 2)

fatal: [192.168.24.13]: FAILED! => {"failed": true, "msg": "The conditional check '('2.5.0-14' in '{{ovs_version.stdout}}' or ovs_packaging_issue|succeeded) and (step == 2)' failed. The error was: error while evaluating conditional (('2.5.0-14' in '{{ovs_version.stdout}}' or ovs_packaging_issue|succeeded) and (step == 2)): 'dict object' has no attribute 'stdout'\n\nThe error appears to have been in '/home/stack/compute-0/tripleo-bVAAT_-config/Compute/upgrade_tasks.yaml': line 42, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n- block:\n  - file:\n    ^ here\n"}
	to retry, use: --limit @/home/stack/compute-0/tripleo-bVAAT_-config/upgrade_steps_playbook.retry

PLAY RECAP ************************************************************************************************************************************************************************************************************************************
192.168.24.13              : ok=8    changed=1    unreachable=0    failed=1
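
The failure above comes from evaluating ovs_version.stdout after the task that registers ovs_version was skipped. One way to make such a condition tolerant of a skipped task is sketched below: drop the jinja2 delimiters that the warning complains about and fall back to an empty string with the default filter. This is an illustration, not the fix that merged; the task name, the debug module, and the scratch file path are made up, and the snippet is written from shell only so it can be inspected.

```shell
#!/bin/bash
# Sketch: write an example task whose "when" survives a skipped
# "Check openvswitch version." task. The file path is hypothetical.
example_file=/tmp/guarded_when_example.yaml

# Quoted heredoc delimiter: the shell must not expand anything inside.
cat > "$example_file" <<'EOF'
- name: example guarded task (hypothetical)
  debug:
    msg: "openvswitch special case applies"
  when:
    - step|int == 2
    # default('') yields an empty string when ovs_version has no stdout key
    - "('2.5.0-14' in (ovs_version.stdout | default(''))) or (ovs_packaging_issue | succeeded)"
EOF

cat "$example_file"
```

Each entry in a "when" list must be true for the task to run, which matches the "and (step == 2)" in the original expression.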
Comment 11 Marios Andreou 2017-08-31 07:03:14 EDT
we also need https://review.openstack.org/#/c/499540/ mcornea ++ adding to trackers
Comment 12 Marius Cornea 2017-09-01 16:51:22 EDT
Adding another review for allowing the upgrade tasks to run between steps:
https://review.openstack.org/#/c/499517/


Also I filed a BZ for tripleo-inventory being too slow:
https://bugzilla.redhat.com/show_bug.cgi?id=1487759
Comment 13 Marius Cornea 2017-09-04 12:31:14 EDT
Remaining issues that we need to track in this bug:

 - set up RoleConfig output during major-upgrade-composable-steps so we don't have to run an additional step with --setup-heat-outputs option

 - cache the tripleo-ansible-inventory so we don't waste 5 minutes per non-controller node waiting for the output of tripleo-ansible-inventory
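
The caching idea in the second item can be sketched generically: keep the output of a slow command in a file and reuse it while it is fresh. This is only an illustration under assumptions (the function name, cache path, and one-hour lifetime are invented, and stat -c is the GNU form); it does not describe how the inventory slowness was eventually addressed.

```shell
#!/bin/bash
# Sketch: reuse a cached copy of a slow command's output while it is fresh.
cached_output() {
    local cache_file="$1"
    local max_age="$2"      # cache lifetime in seconds
    shift 2                 # the remaining arguments are the command to run
    local now mtime tmp
    now="$(date +%s)"
    if [ -s "$cache_file" ]; then
        mtime="$(stat -c %Y "$cache_file")"   # GNU stat: mtime in epoch seconds
        if [ $(( now - mtime )) -lt "$max_age" ]; then
            cat "$cache_file"
            return 0
        fi
    fi
    # Write to a temp file first so a failed run never leaves a partial cache.
    tmp="${cache_file}.tmp.$$"
    if "$@" > "$tmp"; then
        mv "$tmp" "$cache_file"
        cat "$cache_file"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example (assumed cache path, not executed here):
# cached_output /tmp/tripleo-inventory.json 3600 /usr/bin/tripleo-ansible-inventory --list
```

With this in place, only the first non-controller node upgraded within the hour would pay the inventory cost.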
Comment 14 Marius Cornea 2017-09-07 04:39:47 EDT
(In reply to Marius Cornea from comment #13)
> Remaining issues that we need to track in this bug:
> 
>  - set up RoleConfig output during major-upgrade-composable-steps so we
> don't have to run an additional step with --setup-heat-outputs option
> 
>  - cache the tripleo-ansible-inventory so we don't waste 5 minutes per non
> controller node waiting for the output of tripleo-ansible-inventory

The slow inventory issue was addressed by https://review.openstack.org/#/c/501603/

In addition we need to address upgrading non controller nodes for split stack deployments.
Comment 15 Marius Cornea 2017-09-14 08:38:11 EDT
RoleConfig output issue is being tracked in bug 1490425

Remaining issues to be addressed by this bug:

 - upgrading non controller nodes on split stack deployments
Comment 16 Marius Cornea 2017-09-14 09:17:09 EDT
(In reply to Marius Cornea from comment #15)
> RoleConfig output issue is being tracked in bug 1490425
> 
> Remaining issues to be addressed by this bug:
> 
>  - upgrading non controller nodes on split stack deployments

We actually have a different BZ (bug 1474697) filed for split stack deployments so I think this bug can be moved to POST as all the patches attached to it are merged to stable/pike.
Comment 17 Marios Andreou 2017-09-18 06:07:21 EDT
(In reply to Marius Cornea from comment #16)
> (In reply to Marius Cornea from comment #15)
> > RoleConfig output issue is being tracked in bug 1490425
> > 
> > Remaining issues to be addressed by this bug:
> > 
> >  - upgrading non controller nodes on split stack deployments
> 
> We actually have a different BZ (bug 1474697) filed for split stack
> deployments so I think this bug can be moved to POST as all the patches
> attached to it are merged to stable/pike.

Thanks mcornea, I updated the trackers to point to stable/pike (the last two merged before pike was branched, and I checked that they are in stable/pike tripleo-heat-templates and tripleo-common for https://review.openstack.org/#/c/490848/ and https://review.openstack.org/#/c/490847/ respectively).

I'll bring this on our call later and we can move to POST
Comment 23 errata-xmlrpc 2017-12-13 16:48:30 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462
