Description of problem:
The ceph validations included in tripleo-heat-templates do not work when upgrading downstream. The validation assumes that the ceph-ansible package is always retrieved from "centos-ceph-nautilus" (https://github.com/openstack/tripleo-validations/blob/4899441f68a53ce4c547ca1d40e4b42609906b1a/playbooks/ceph-ansible-installed.yaml#L12), which is not the case in a RHOSP deployment. A better approach would be to retrieve the value of the registry from the containers-prepare-parameters.yaml file; this way we make sure that the package is updated and retrieved from the right registry. The workaround to avoid being blocked by this validation is to skip the "opendev-validation" tags; however, these validations are useful and valid for avoiding further trouble, so we would like to enable them again once this gets solved.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy OSP16 from a CI job with ceph.
2. Run:
   openstack overcloud external-upgrade run \
     --stack overcloud \
     --tags ceph_systemd \
     -e ceph_ansible_limit=controller-0
3.

Actual results:

Expected results:

Additional info:
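As a sketch of the suggested approach (this is illustrative only, not the actual tripleo-validations playbook; the variable name and the dnf invocation are assumptions), the hardcoded repo name could be replaced by a variable that downstream deployments can override, for example from the CephAnsibleRepo heat parameter:

```yaml
# Illustrative sketch only -- not the actual ceph-ansible-installed.yaml.
# The repo is taken from an overridable variable instead of being
# hardcoded to "centos-ceph-nautilus", so a RHOSP deployment can pass
# its own repo (e.g. via the CephAnsibleRepo heat parameter).
- name: Check that ceph-ansible is available from the expected repo
  hosts: undercloud
  gather_facts: false
  vars:
    ceph_ansible_repo: centos-ceph-nautilus  # overridden downstream
  tasks:
    - name: Verify ceph-ansible is provided by the configured repo
      command: dnf --repo "{{ ceph_ansible_repo }}" info ceph-ansible
      changed_when: false
```

With this shape, the downstream upgrade workflow would only need to set ceph_ansible_repo (e.g. as an extra var derived from containers-prepare-parameters.yaml) instead of skipping the whole "opendev-validation" tag.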
Thanks a lot for the comment Francesco. I will give that a try and comment back in the bugzilla.
Hello Francesco,

Reopening this bug: after making use of the CephAnsibleRepo variable, a new issue appears in a different validation, ceph-health.yaml:

Monday 06 April 2020  16:24:16 -0400 (0:00:00.441)       0:02:31.646 **********
ok: [undercloud -> 192.168.24.10] => {
    "inventory_hostname": "undercloud"
}

TASK [ceph : Set container_cli fact from the inventory] *********************************************************************************************************************
task path: /usr/share/openstack-tripleo-validations/roles/ceph/tasks/ceph-health.yaml:13
Monday 06 April 2020  16:24:17 -0400 (0:00:00.400)       0:02:32.046 **********
ok: [undercloud -> 192.168.24.10] => {"ansible_facts": {"container_cli": "podman"}, "changed": false}

TASK [ceph : Set container filter format] ***********************************************************************************************************************************
task path: /usr/share/openstack-tripleo-validations/roles/ceph/tasks/ceph-health.yaml:17
Monday 06 April 2020  16:24:18 -0400 (0:00:00.675)       0:02:32.722 **********
ok: [undercloud -> 192.168.24.10] => {"ansible_facts": {"container_filter_format": "--format '{{ .Names }}'"}, "changed": false}

TASK [ceph : Set ceph_mon_container name] ***********************************************************************************************************************************
task path: /usr/share/openstack-tripleo-validations/roles/ceph/tasks/ceph-health.yaml:21
Monday 06 April 2020  16:24:18 -0400 (0:00:00.414)       0:02:33.136 **********
fatal: [undercloud -> 192.168.24.10]: FAILED! => {"changed": false, "cmd": "podman ps --format '{{ .Names }}' | grep ceph-mon", "delta": "0:00:00.006923", "end": "2020-04-06 20:24:18.699377", "msg": "non-zero return code", "rc": 1, "start": "2020-04-06 20:24:18.692454", "stderr": "/bin/sh: podman: command not found", "stderr_lines": ["/bin/sh: podman: command not found"], "stdout": "", "stdout_lines": []}

The first debug task was added by me to check the content of inventory_hostname. Although this task is run from ceph-base.yaml using delegate_to, so it actually executes on a ceph node, inventory_hostname still points to the undercloud. This wouldn't be a problem in a fresh deployment, but here we are upgrading from OSP13 to OSP16, so the undercloud is on RHEL 8 (container_cli=podman) while the ceph nodes are still on RHEL 7 (container_cli=docker). Digging into Ansible's behavior a bit, this seems to be how delegate_to is meant to work. A solution would be to parametrize the target host in the ceph-health validation, or to find some other way to resolve the inventory hostname of the delegated target.
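The delegation pitfall can be illustrated with a minimal playbook (host and variable names here are illustrative, not taken from tripleo-validations): under delegate_to, inventory_hostname keeps the value of the delegating host, so facts derived from it (like container_cli) describe the undercloud rather than the ceph node the task runs on.

```yaml
# Minimal illustration of the delegate_to pitfall (hosts are illustrative).
- hosts: undercloud
  gather_facts: false
  tasks:
    # inventory_hostname still expands to "undercloud" here, even though
    # the task is delegated to (and runs on) the ceph node.
    - name: Show inventory_hostname under delegation
      debug:
        msg: "inventory_hostname is {{ inventory_hostname }}"
      delegate_to: controller-0

    # One workaround: look the fact up via hostvars of the delegated host,
    # so a RHEL 7 ceph node resolves to docker instead of podman.
    - name: Use the delegated host's container_cli instead
      debug:
        msg: "container_cli is {{ hostvars['controller-0']['container_cli'] | default('docker') }}"
      delegate_to: controller-0
```

Parametrizing the play's target host (so the validation runs directly against the ceph nodes instead of delegating from the undercloud) would avoid the problem entirely.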
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3148