Bug 1740325

Summary: [RFE] Provide tooling to remove Sahara prior to a 13-16 FFU
Product: Red Hat OpenStack Reporter: Gregory Charot <gcharot>
Component: openstack-tripleo-heat-templates Assignee: Giulio Fidente <gfidente>
Status: CLOSED ERRATA QA Contact: Luigi Toscano <ltoscano>
Severity: medium Docs Contact: Vlada Grosu <vgrosu>
Priority: medium    
Version: 16.0 (Train) CC: gfidente, hbrock, jfrancoa, jpretori, jslagle, kgilliga, lhh, ltoscano, mburns, mimccune, nlevinki, nwolf, shrjoshi, spower, tshefi, vgrosu
Target Milestone: Alpha Keywords: FutureFeature, TechPreview, TestOnly, Triaged
Target Release: 16.2 (Train on RHEL 8.4)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20200914170169.el8ost openstack-tripleo-common-11.4.1-1.20200914165651.el8ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2009693 (view as bug list) Environment:
Last Closed: 2021-09-15 07:07:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1791384, 2009693    
Attachments:
Description Flags
Related problem issue if Sahara isn't removed none

Description Gregory Charot 2019-08-12 16:40:12 UTC
Description of problem:

Since Sahara is targeted for removal in OSP16, we need a way to remove Sahara before a customer can FFU to OSP16.

Version-Release number of selected component (if applicable):

13 (although the upgrade target version is 16)


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


When a customer tries an FFU with Sahara enabled, it should fail by default.
There should be a process to remove Sahara from OSP13:

1/ An ad-hoc process that runs on OSP13
OR
2/ The process is part of the FFU, and the operator explicitly states that Sahara should be removed




Additional info:

Comment 7 Jose Luis Franco 2020-08-27 14:01:32 UTC
The code in https://code.engineering.redhat.com/gerrit/#/c/194590/3/deployment/sahara/disable-sahara-engine.yaml is making the upgrade fail with:

TASK [remove cinder_backup init container on upgrade-scaleup to force re-init] ***
Wednesday 26 August 2020  13:48:39 -0400 (0:00:00.175)       0:04:33.517 ******

TASK [tripleo-container-rm : include_tasks] ************************************
Wednesday 26 August 2020  13:48:40 -0400 (0:00:00.280)       0:04:33.797 ******
fatal: [controller-0]: FAILED! => {"reason": "Could not find or access '/var/lib/mistral/16cba9f9-7fc0-40c5-8598-5f684958137f/tripleo_['podman']_container_rm.yml' on the Ansible Controller."}
fatal: [controller-1]: FAILED! => {"reason": "Could not find or access '/var/lib/mistral/16cba9f9-7fc0-40c5-8598-5f684958137f/tripleo_['podman']_container_rm.yml' on the Ansible Controller."}

PLAY RECAP *********************************************************************
controller-0               : ok=56   changed=15   unreachable=0    failed=1    skipped=32   rescued=0    ignored=2
controller-1               : ok=55   changed=16   unreachable=0    failed=1    skipped=32   rescued=0    ignored=2

Wednesday 26 August 2020  13:48:40 -0400 (0:00:00.220)       0:04:34.017 ******
===============================================================================



This looks to be caused by this block of code:

        - name: Disable openstack-sahara-engine
          when:
            - step|int == 1
          block:
            - name: Disable openstack-sahara-engine
              import_role:
                name: tripleo-container-stop
              vars:
                tripleo_containers_to_stop:
                  - openstack-sahara-engine
              when:
                - sahara_engine_enabled|bool
          block:
            - name: Remove openstack-sahara-engine
              import_role:
                name: tripleo-container-rm
              vars:
                tripleo_containers_to_rm:
                  - openstack-sahara-engine
                tripleo_container_cli:
                  - podman
              when:
                - sahara_engine_enabled|bool

File: https://code.engineering.redhat.com/gerrit/#/c/194590/3/deployment/sahara/disable-sahara-engine.yaml

There is an error in this block as it contains another two blocks. When rendering the ansible code, it causes issues in the tripleo_container_rm role.
Changing this code into:

            - name: Disable openstack-sahara-engine
              when:
                - step|int == 1
                - sahara_engine_enabled|bool
              block:
                - name: Disable openstack-sahara-engine
                  import_role:
                    name: tripleo-container-stop
                  vars:
                    tripleo_containers_to_stop:
                      - openstack-sahara-engine
     
                - name: Remove openstack-sahara-engine
                  import_role:
                    name: tripleo-container-rm
                  vars:
                    tripleo_containers_to_rm:
                      - openstack-sahara-engine
                    tripleo_container_cli:
                      - podman

And relaunching the upgrade step made the upgrade continue.

Comment 8 Jose Luis Franco 2020-08-27 14:24:03 UTC
So, the reason for the failure wasn't the block syntax but the tripleo_container_cli parameter. It was being set as a list:

                tripleo_container_cli:
                  - podman

However, the parameter expects a single value. That is why we were seeing /var/lib/mistral/16cba9f9-7fc0-40c5-8598-5f684958137f/tripleo_['podman']_container_rm.yml in the error: the tripleo_container_cli list gets rendered as ['podman'] in the file name.

The solution is to change tripleo_container_cli in deployment/sahara/disable-sahara-engine.yaml and deployment/sahara/disable-sahara-api.yaml to a plain string:

tripleo_container_cli: "podman"
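
To illustrate why the list form produces the odd path in the error, here is a minimal Python sketch. It assumes the role builds the tasks file name by interpolating tripleo_container_cli as text, which is what the error message suggests; the helper name container_rm_tasks_file is hypothetical and is not taken from the role.

```python
def container_rm_tasks_file(container_cli):
    """Build a tasks file name by interpolating the CLI value as text.

    Hypothetical reconstruction based on the path in the error message;
    the real role may build the name differently.
    """
    return "tripleo_{}_container_rm.yml".format(container_cli)


# A list is rendered with its brackets and quotes, so the resulting
# file name does not match any real tasks file.
print(container_rm_tasks_file(["podman"]))

# A plain string yields a sensible file name.
print(container_rm_tasks_file("podman"))
```

The first call returns the same tripleo_['podman']_container_rm.yml name that appears in the failure, which is consistent with the list value being the cause.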

Comment 9 Tzach Shefi 2020-08-28 04:03:13 UTC
Created attachment 1712906 [details]
Related problem issue if Sahara isn't removed

Adding a related FYI.

If comment 8's change isn't implemented, we should open a new bug about OSP13 with Sahara installed getting stuck in the FFU's overcloud controller upgrade [0].

We also need a doc BZ on how to address Sahara's removal during FFU, or at least an FFU release note about this.


[0] If Sahara isn't removed on OSP13, this FFU step/command will fail:
# openstack overcloud upgrade run --stack overcloud --limit controller-0,controller-1 | tee oc-c1-upgrade-run.log

tail oc-c1-upgrade-run.log

TASK [tripleo-container-rm : include_tasks] ************************************
Wednesday 26 August 2020  13:56:04 -0400 (0:00:00.289)       0:01:06.323 ******
fatal: [controller-0]: FAILED! => {"reason": "Could not find or access '/var/lib/mistral/3dcc5a5c-046d-4765-92ad-bcd95c6e5cee/tripleo_['podman']_container_rm.yml' on the Ansible Controller."}
fatal: [controller-1]: FAILED! => {"reason": "Could not find or access '/var/lib/mistral/3dcc5a5c-046d-4765-92ad-bcd95c6e5cee/tripleo_['podman']_container_rm.yml' on the Ansible Controller."}

PLAY RECAP *********************************************************************
controller-0               : ok=56   changed=15   unreachable=0    failed=1    skipped=32   rescued=0    ignored=2
controller-1               : ok=55   changed=15   unreachable=0    failed=1    skipped=32   rescued=0    ignored=2

Wednesday 26 August 2020  13:56:04 -0400 (0:00:00.240)       0:01:06.563 ******
===============================================================================

Ansible failed, check log at /var/log/containers/mistral/package_update.log.
2020-08-26 13:56:05.145 529822 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun [-] Exception occured while running the command: RuntimeError: Update failed with: Ansible failed, check log at /var/log/containers/mistral/package_update.log.
2020-08-26 13:56:05.145 529822 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun Traceback (most recent call last):
2020-08-26 13:56:05.145 529822 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/tripleoclient/command.py", line 32, in run
2020-08-26 13:56:05.145 529822 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     super(Command, self).run(parsed_args)
2020-08-26 13:56:05.145 529822 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/osc_lib/command/command.py", line 41, in run
2020-08-26 13:56:05.145 529822 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     return super(Command, self).run(parsed_args)
2020-08-26 13:56:05.145 529822 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/cliff/command.py", line 185, in run
2020-08-26 13:56:05.145 529822 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     return_code = self.take_action(parsed_args) or 0
2020-08-26 13:56:05.145 529822 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_upgrade.py", line 238, in take_action
2020-08-26 13:56:05.145 529822 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     priv_key=key)
2020-08-26 13:56:05.145 529822 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/tripleoclient/utils.py", line 1245, in run_update_ansible_action
2020-08-26 13:56:05.145 529822 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     verbosity=verbosity, extra_vars=extra_vars)
2020-08-26 13:56:05.145 529822 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/package_update.py", line 127, in update_ansible
2020-08-26 13:56:05.145 529822 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     raise RuntimeError('Update failed with: {}'.format(payload['message']))
2020-08-26 13:56:05.145 529822 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun RuntimeError: Update failed with: Ansible failed, check log at /var/log/containers/mistral/package_update.log.
2020-08-26 13:56:05.145 529822 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun
2020-08-26 13:56:05.150 529822 ERROR openstack [-] Update failed with: Ansible failed, check log at /var/log/containers/mistral/package_update.log.: RuntimeError: Update failed with: Ansible failed, check log at /var/log/containers/mistral/package_update.log.
2020-08-26 13:56:05.150 529822 INFO osc_lib.shell [-] END return value: 1

Comment 10 Luigi Toscano 2020-08-28 07:47:19 UTC
(In reply to Tzach Shefi from comment #9)
> Created attachment 1712906 [details]
> Related problem issue if Sahara isn't removed
> 
> Adding related FYI,
> 
> We should report/open a new bug, about OSP13 with Sahara installed getting
> stuck in FFU's overcloud controller upgrade[0]. 
> If comment 8's change isn't implemented. 

This bug is enough: the usage of the special environment to remove sahara and the related settings are tracked here.

> 
> Also a doc bz per "How address Sahara's removal during FFU" or at least an
> FFU release note about this. 

Right now users are prevented from upgrading when sahara is installed, and that's expected at this stage, until this feature is implemented.
Workaround: they can still update the deployment on 13 without sahara before starting the upgrade to 16.1.

Comment 11 Luigi Toscano 2020-08-28 14:24:03 UTC
Another important detail: after fixing the parameter and completing the upgrade process, there are no more Sahara containers around, but the endpoint list still shows the Sahara ones:

(overcloud) [stack@undercloud-0 ~]$ openstack endpoint list | grep sahara
| 21c372a4d389400985d2efff57defdf3 | regionOne | sahara       | data-processing | True    | public    | http://10.0.0.141:8386/v1.1/%(tenant_id)s     |
| 84411bbc525e4c6fa518bf97e51f9e00 | regionOne | sahara       | data-processing | True    | admin     | http://172.17.1.44:8386/v1.1/%(tenant_id)s    |
| fc1dbb69f0874c0d9e9dac7f7f953ba7 | regionOne | sahara       | data-processing | True    | internal  | http://172.17.1.44:8386/v1.1/%(tenant_id)s    |
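
As a rough aid for checking for leftovers like these, the following Python sketch pulls the endpoint IDs for one service out of the table text printed by `openstack endpoint list`. It assumes the column order shown above (ID, region, service name, ...). The function name sahara_endpoint_ids is hypothetical, and the sketch only lists IDs for review; it is not the cleanup that the fix for this bug performs.

```python
def sahara_endpoint_ids(table_text, service_name="sahara"):
    """Return endpoint IDs from `openstack endpoint list` table rows
    whose service name column matches service_name."""
    ids = []
    for line in table_text.splitlines():
        # Data rows look like: | ID | Region | Service Name | Service Type | ...
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Border and header rows are skipped because their third cell
        # is missing or does not equal the service name.
        if len(cells) >= 3 and cells[2] == service_name:
            ids.append(cells[0])
    return ids


# Sample rows copied from the output above.
sample = """\
| 21c372a4d389400985d2efff57defdf3 | regionOne | sahara       | data-processing | True    | public    | http://10.0.0.141:8386/v1.1/%(tenant_id)s     |
| 84411bbc525e4c6fa518bf97e51f9e00 | regionOne | sahara       | data-processing | True    | admin     | http://172.17.1.44:8386/v1.1/%(tenant_id)s    |
| fc1dbb69f0874c0d9e9dac7f7f953ba7 | regionOne | sahara       | data-processing | True    | internal  | http://172.17.1.44:8386/v1.1/%(tenant_id)s    |
"""

print(sahara_endpoint_ids(sample))
```

An empty list after the upgrade would indicate that no Sahara endpoints remain in the catalog.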

Comment 14 Lon Hohberger 2020-10-21 10:53:36 UTC
According to our records, this should be resolved by openstack-tripleo-heat-templates-11.3.2-0.20200616081539.396affd.el8ost.  This build is available now.

Comment 16 Lon Hohberger 2020-11-02 11:51:50 UTC
According to our records, this should be resolved by openstack-tripleo-common-11.4.1-1.20200914165651.el8ost.  This build is available now.

Comment 26 errata-xmlrpc 2021-09-15 07:07:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483