Bug 1884677 - During FFWD 13>16 upgrade site-container.yml is triggered instead of docker-to-podman.yml even with --tags ceph_systemd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z3
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Giulio Fidente
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-02 16:03 UTC by Amedeo Salvati
Modified: 2020-12-15 18:37 UTC

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20200914170166.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-15 18:36:58 UTC
Target Upstream Version:
Embargoed:


Attachments
fail.log from comment #4 (255.24 KB, text/plain)
2020-10-02 17:00 UTC, John Fulton


Links
System ID | Private | Priority | Status | Summary | Last Updated
Launchpad 1898589 | 0 | None | None | None | 2020-10-05 16:15:28 UTC
OpenStack gerrit 756125 | 0 | None | MERGED | Force CephAnsiblePlaybook to its default value on FFU prepare | 2020-12-05 11:34:17 UTC
Red Hat Product Errata RHEA-2020:5413 | 0 | None | None | None | 2020-12-15 18:37:21 UTC

Description Amedeo Salvati 2020-10-02 16:03:51 UTC
Description of problem:
The Ansible playbook fails during the FFU because the undercloud uses Ansible 2.9, and ceph-ansible 3.2 contains statements that are deprecated in that version.


Version-Release number of selected component (if applicable):
ansible-2.9.13-1.el8ae.noarch
ceph-ansible-3.2.49-1.el7cp.noarch

How reproducible:


Steps to Reproduce:
1. Follow the Framework for Upgrades guide up to step 17.2.
2. Run:
openstack overcloud external-upgrade run --stack STACK NAME --tags ceph_systemd -e ceph_ansible_limit=overcloud-controller-0

Actual results:
fatal: [undercloud]: FAILED! => {
    "ceph_ansible_std_out_err": [
        "Using /usr/share/ceph-ansible/ansible.cfg as config file",
        "ERROR! 'delegate_to' is not a valid attribute for a TaskInclude",
        "",
        "The error appears to be in '/usr/share/ceph-ansible/roles/ceph-mon/tasks/main.yml': line 20, column 3, but may",
        "be elsewhere in the file depending on the exact syntax problem.",
        "The offending line appears to be:",
        "- name: include secure_cluster.yml",
        "  ^ here"
    ],

Expected results:


Additional info:

Comment 1 RHEL Program Management 2020-10-02 16:03:58 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 4 John Fulton 2020-10-02 16:52:45 UTC
We can reproduce the failure during the systemd unit file update in less than one minute and log the result [1].
If we then grep the log, we see the RIGHT playbook get set first, but later it gets set to the WRONG playbook [2].

Why does that happen?


[1]
(undercloud) [stack@undercloud ~]$ time openstack overcloud external-upgrade run --stack overcloud --tags ceph_systemd -e ceph_ansible_limit=ctr0 > fail.log
sys:1: ResourceWarning: unclosed <socket.socket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.200.2', 43666), raddr=('192.168.200.2', 5000)>
sys:1: ResourceWarning: unclosed <socket.socket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.200.2', 53504), raddr=('192.168.200.2', 8989)>

real    0m57.644s
user    0m2.097s
sys     0m0.728s
(undercloud) [stack@undercloud ~]$ 

[2]
(undercloud) [stack@undercloud ~]$ grep ceph_ansible_playbooks_default  fail.log                                                                                                              
TASK [set ceph_ansible_playbooks_default] **************************************
ok: [undercloud] => {"ansible_facts": {"ceph_ansible_playbooks_default": ["/usr/share/ceph-ansible/infrastructure-playbooks/docker-to-podman.yml"]}, "changed": false}
TASK [set ceph_ansible_playbooks_default] **************************************
TASK [set ceph_ansible_playbooks_default] **************************************
(undercloud) [stack@undercloud ~]$ egrep "ceph_ansible_playbooks_default|ceph_ansible_playbooks" fail.log                                                                                     
TASK [set ceph_ansible_playbooks_default] **************************************
ok: [undercloud] => {"ansible_facts": {"ceph_ansible_playbooks_default": ["/usr/share/ceph-ansible/infrastructure-playbooks/docker-to-podman.yml"]}, "changed": false}
TASK [set ceph_ansible_playbooks_default] **************************************
TASK [set ceph_ansible_playbooks_default] **************************************
ok: [undercloud] => {"ansible_facts": {"ceph_ansible_environment_variables": ["ANSIBLE_SSH_RETRIES=6", "DEFAULT_FORKS=100"], "ceph_ansible_playbook_verbosity": 1, "ceph_ansible_playbooks_param": ["/usr/share/ceph-ansible/site-docker.yml.sample"], "ceph_ansible_skip_tags": "package-install,with_pkg"}, "changed": false}
ok: [undercloud] => {"ansible_facts": {"ceph_ansible_playbooks": ["/usr/share/ceph-ansible/site-docker.yml.sample"]}, "changed": false}
(undercloud) [stack@undercloud ~]$

Comment 5 John Fulton 2020-10-02 17:00:07 UTC
Created attachment 1718483
fail.log from comment #4

Comment 6 John Fulton 2020-10-02 17:21:38 UTC
In config-download output file external_deploy_steps_tasks.yaml we see that CephAnsiblePlaybook has its default value /usr/share/ceph-ansible/site-docker.yml.sample so it got set as per the following:

https://github.com/openstack/tripleo-heat-templates/blob/094631e0437f1775601ceb49d398427214759f63/deployment/ceph-ansible/ceph-base.yaml#L661-L668

and we see that playbook being run:

(undercloud) [stack@undercloud b03ca383-1c38-40b8-9ebd-f68517883164]$ grep "set ceph-ansible facts" external_deploy_steps_tasks.yaml -A 10
  - name: set ceph-ansible facts
    set_fact:
      blacklisted_hostnames: []
      ceph_ansible_extra_vars:
        container_binary: podman
        fetch_directory: '{{playbook_dir}}/ceph-ansible/fetch_dir'
        health_osd_check_delay: 40
        health_osd_check_retries: 30
        ireallymeanit: 'yes'
        osd_pool_default_min_size: 2
        osd_pool_default_pg_num: 4
--
  - name: set ceph-ansible facts
    set_fact:
      ceph_ansible_environment_variables:
      - ANSIBLE_SSH_RETRIES=6
      - DEFAULT_FORKS=100
      ceph_ansible_playbook_verbosity: 1
      ceph_ansible_playbooks_param:
      - /usr/share/ceph-ansible/site-docker.yml.sample
      ceph_ansible_skip_tags: package-install,with_pkg
  - include_role:
      name: tripleo-ceph-work-dir
(undercloud) [stack@undercloud b03ca383-1c38-40b8-9ebd-f68517883164]$ pwd
/var/lib/mistral/b03ca383-1c38-40b8-9ebd-f68517883164
(undercloud) [stack@undercloud b03ca383-1c38-40b8-9ebd-f68517883164]$

Comment 7 John Fulton 2020-10-02 17:28:46 UTC
$  openstack stack environment show overcloud | grep -i ceph | grep -i playbook -A 1
  CephAnsiblePlaybook:
  - /usr/share/ceph-ansible/site-docker.yml.sample

Comment 8 Giulio Fidente 2020-10-02 17:30:12 UTC
(In reply to John Fulton from comment #6)
> In config-download output file external_deploy_steps_tasks.yaml we see that
> CephAnsiblePlaybook has its default value
> /usr/share/ceph-ansible/site-docker.yml.sample so it got set as per the
> following:

CephAnsiblePlaybook seems to be the problem; somehow it got set manually to site-docker by an environment file. Its default value should actually be 'default' [1].

> https://github.com/openstack/tripleo-heat-templates/blob/
> 094631e0437f1775601ceb49d398427214759f63/deployment/ceph-ansible/ceph-base.
> yaml#L661-L668

the above looks correct:

ceph_ansible_playbooks_param is set to the value provided by the user, and ceph_ansible_playbooks_default is set to a default list that we define based on --tags. Then in [2] we pick either _default or _param, depending on whether the user has actually customized, via THT, the playbook they want to run.

I think this would be solved by setting "CephAnsiblePlaybook: default" in an environment file and then rerunning the prepare step. I suspect CephAnsiblePlaybook was set once in the past and later removed from the env files; in that scenario Heat doesn't reset it to its default value but keeps the last value that was set for it.

1. https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ceph-ansible/ceph-base.yaml#L75
2. https://github.com/openstack/tripleo-ansible/blob/stable/train/tripleo_ansible/roles/tripleo-ceph-run-ansible/tasks/main.yml#L19
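
A minimal sketch of the selection described in comment 8, written as a shell function. The function name pick_ceph_playbook is hypothetical, and treating the literal string "default" as the only sentinel is an assumption drawn from the comment; the real logic lives in the tripleo-ansible role linked in [2] and may differ in detail.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the tripleo-ansible task itself.
# $1: value of CephAnsiblePlaybook as stored in Heat (ceph_ansible_playbooks_param)
# $2: playbook derived from --tags (ceph_ansible_playbooks_default)
pick_ceph_playbook() {
    param="$1"
    tag_default="$2"
    if [ "$param" = "default" ]; then
        # The user did not customize the playbook: use the tag-derived one.
        printf '%s\n' "$tag_default"
    else
        # Any other value is treated as an explicit user override.
        printf '%s\n' "$param"
    fi
}
```

With the stale value from comment 7 as the first argument, the function returns site-docker.yml.sample regardless of --tags, which matches the behaviour seen in fail.log.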

Comment 9 John Fulton 2020-10-02 17:59:29 UTC
WORKAROUND:

1. Create an environment file foo.yaml with the following content:

"""
parameter_defaults:
  CephAnsiblePlaybook: default
"""

2. Re-run 'openstack overcloud upgrade prepare' but include foo.yaml

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/framework_for_upgrades_13_to_16.1/index#running-the-overcloud-upgrade-preparation-upgrading-overcloud-standard

3. When you get to the point where you run "openstack overcloud external-upgrade run --stack STACK NAME --tags ceph_systemd ..." it should set up the correct ceph-ansible playbook. You can confirm this by looking at the generated shell script:

(undercloud) [stack@undercloud b03ca383-1c38-40b8-9ebd-f68517883164]$ cat /var/lib/mistral/c9ab0d1c-4127-4687-9d0e-5398c8710019/ceph-ansible/ceph_ansible_command.sh
#!/usr/bin/env bash
set -e
echo "Running $0" >> /var/lib/mistral/c9ab0d1c-4127-4687-9d0e-5398c8710019/ceph-ansible/ceph_ansible_command.log
ANSIBLE_ACTION_PLUGINS=/usr/share/ceph-ansible/plugins/actions/ ANSIBLE_CALLBACK_PLUGINS=/usr/share/ceph-ansible/plugins/callback/ ANSIBLE_FILTER_PLUGINS=/usr/share/ceph-ansible/plugins/filter/ ANSIBLE_ROLES_PATH=/usr/share/ceph-ansible/roles/ ANSIBLE_LOG_PATH="/var/lib/mistral/c9ab0d1c-4127-4687-9d0e-5398c8710019/ceph-ansible/ceph_ansible_command.log" ANSIBLE_SSH_CONTROL_PATH_DIR="/tmp/ceph_ansible_control_path" ANSIBLE_LIBRARY=/usr/share/ceph-ansible/library/ ANSIBLE_CONFIG=/usr/share/ceph-ansible/ansible.cfg ANSIBLE_REMOTE_TEMP="/tmp/ceph_ansible_tmp" ANSIBLE_FORKS=25 ANSIBLE_GATHER_TIMEOUT=60 ANSIBLE_CALLBACK_WHITELIST=profile_tasks ANSIBLE_STDOUT_CALLBACK=default  ANSIBLE_SSH_RETRIES=6 DEFAULT_FORKS=100 ansible-playbook --private-key /var/lib/mistral/c9ab0d1c-4127-4687-9d0e-5398c8710019/ssh_private_key -e ansible_python_interpreter=/usr/libexec/platform-python -v --skip-tags package-install,with_pkg --extra-vars @/var/lib/mistral/c9ab0d1c-4127-4687-9d0e-5398c8710019/ceph-ansible/extra_vars.yml --limit ctr0 -i /var/lib/mistral/c9ab0d1c-4127-4687-9d0e-5398c8710019/ceph-ansible/inventory.yml /usr/share/ceph-ansible/infrastructure-playbooks/docker-to-podman.yml 2>&1
(undercloud) [stack@undercloud b03ca383-1c38-40b8-9ebd-f68517883164]$
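
The check in step 3 can also be done without reading the whole command line. The following is only a sketch: playbook_in_script is a hypothetical helper name, and it assumes the playbook is the only path under /usr/share/ceph-ansible/ whose name contains ".yml", as is the case in the script shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ceph-ansible playbook path(s) found in a
# generated ceph_ansible_command.sh.
# $1: path to the generated script
playbook_in_script() {
    # -o prints only the matching part of each line. The pattern skips
    # plugin/role directories and ansible.cfg because they contain no ".yml".
    grep -o '/usr/share/ceph-ansible/[^ ]*\.yml[^ ]*' "$1"
}
```

Called on the script from step 3, the expected result is the docker-to-podman.yml path; seeing site-docker.yml.sample instead would indicate that the override is still stored in Heat.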

Comment 10 John Fulton 2020-10-02 18:04:06 UTC
ROOT CAUSE:

Someone must have run a stack update with CephAnsiblePlaybook overridden in the past. Even if your Heat env files no longer override this parameter, the overridden value may still be stored in Heat. This is because TripleO's Heat only lets you replace values, not delete them (it's a feature, not a bug, if you think about how this could save your deployment when you accidentally forget to -e an env file on update).

Proposed solution:

Make the upgrade procedure always set "CephAnsiblePlaybook: default", either through a docs change or by updating a default Heat environment parameter.
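
The behaviour described under ROOT CAUSE can be illustrated with a toy model. This is not Heat or TripleO code; stored_after_updates is a hypothetical function, and it assumes only what the comment states: a supplied value replaces the stored one, and an update that omits the parameter leaves the stored value alone.

```shell
#!/usr/bin/env bash
# Toy model only. Each argument is one stack update: either the value supplied
# for CephAnsiblePlaybook, or an empty string when no env file sets it.
stored_after_updates() {
    stored="default"    # template default before any override
    for supplied in "$@"; do
        if [ -n "$supplied" ]; then
            # A supplied value replaces whatever was stored.
            stored="$supplied"
        fi
        # An omitted value changes nothing: the old override persists.
    done
    printf '%s\n' "$stored"
}
```

For example, `stored_after_updates /usr/share/ceph-ansible/site-docker.yml.sample ""` still yields the site-docker path after the second update, while appending a third update that supplies `default` restores the default, which is what the workaround in comment 9 does.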

Comment 27 errata-xmlrpc 2020-12-15 18:36:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.3 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5413