
Bug 2210626

Summary: tripleo_cephadm: Bootstrap play fails on "Ensure cephadm uses image tags instead of digests" task when Ceph cluster name is not default
Product: Red Hat OpenStack
Reporter: Marian Krcmarik <mkrcmari>
Component: tripleo-ansible
Assignee: Francesco Pantano <fpantano>
Status: CLOSED ERRATA
QA Contact: Alfredo <alfrgarc>
Severity: high
Priority: high
Version: 17.1 (Wallaby)
CC: eshames, fpantano, gfidente, mkatari
Target Milestone: rc
Keywords: Automation, Regression, Triaged
Target Release: 17.1
Hardware: Unspecified
OS: Unspecified
Fixed In Version: tripleo-ansible-3.3.1-1.20230518201532.el9ost openstack-tripleo-heat-templates-14.3.1-1.20230519151005.el9ost
Doc Type: No Doc Update
Last Closed: 2023-08-16 01:15:24 UTC
Type: Bug

Description Marian Krcmarik 2023-05-29 03:42:48 UTC
Description of problem:
The bootstrap play of the Ceph cluster fails on the "Ensure cephadm uses image tags instead of digests" task.
The actual error output looks like the following:
FATAL | Ensure cephadm uses image tags instead of digests | controller-0 | error={"changed": false, "cmd": ["podman", "run", "--rm", "--net=host", "--ipc=host", "--volume", "/var/lib/ceph/2696bddc-047e-4751-bdf2-259c510254f2/config/:/etc/ceph:z", "--volume", "/home/ceph-admin/assimilate_central.conf:/home/assimilate_central.conf:z", "--entrypoint", "ceph", "rhos-qe-mirror-tlv.usersys.redhat.com:5002/rh-osbs/rhceph:6-115", "--fsid", "2696bddc-047e-4751-bdf2-259c510254f2", "-c", "/etc/ceph/central.conf", "-k", "/etc/ceph/central.client.admin.keyring", "config", "set", "mgr", "mgr/cephadm/use_repo_digest", "false"], "delta": "0:00:00.584120", "end": "2023-05-26 14:41:46.698484", "msg": "non-zero return code", "rc": 1, "start": "2023-05-26 14:41:46.114364", "stderr": "Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')", "stderr_lines": ["Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')"], "stdout": "", "stdout_lines": []}

It comes from the following playbook:
https://opendev.org/openstack/tripleo-ansible/src/branch/stable/wallaby/tripleo_ansible/roles/tripleo_cephadm/tasks/bootstrap.yaml#L78

The failing task was recently added to the downstream (d/s) build with this commit (the latest compose at the time of writing is RHOS-17.1-RHEL-9-20230525.n.1):
https://review.opendev.org/c/openstack/tripleo-ansible/+/883413
That is when it started to fail.

The task runs the following command right after the ceph cluster bootstrap, i.e.:
sudo podman run --rm --net=host --ipc=host --volume /var/lib/ceph/2696bddc-047e-4751-bdf2-259c510254f2/config/:/etc/ceph:z --volume /home/ceph-admin/assimilate_central.conf:/home/assimilate_central.conf:z --entrypoint ceph rhos-qe-mirror-tlv.usersys.redhat.com:5002/rh-osbs/rhceph:6-115 --fsid c970470c-329b-5df0-a3d5-28f6e4b4ff98 -c /etc/ceph/central.conf -k /etc/ceph/central.client.admin.keyring config set mgr mgr/cephadm/use_repo_digest false

It fails with:
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
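To see which host file the `-c /etc/ceph/central.conf` argument actually refers to, the `--volume` mappings from the command can be resolved by hand. The following is only an illustrative sketch: `parse_volume` and `host_path_for` are hypothetical helper names, not part of podman or tripleo-ansible, and the parsing assumes simple `HOST:CONTAINER[:OPTIONS]` specs with no colons inside the paths.

```python
from pathlib import PurePosixPath


def parse_volume(spec):
    """Split a 'HOST:CONTAINER[:OPTIONS]' volume spec into (host, container) paths.

    Assumption: neither path contains a colon.
    """
    parts = spec.split(":")
    return PurePosixPath(parts[0]), PurePosixPath(parts[1])


def host_path_for(container_path, volume_specs):
    """Return the host path backing container_path, or None if no volume covers it.

    The first matching volume wins, which is enough for non-overlapping mounts.
    """
    target = PurePosixPath(container_path)
    for spec in volume_specs:
        host, mount = parse_volume(spec)
        try:
            relative = target.relative_to(mount)
        except ValueError:
            continue  # container_path is not under this mount point
        return host / relative
    return None


# Volume specs copied from the failing task's command line in this report.
volumes = [
    "/var/lib/ceph/2696bddc-047e-4751-bdf2-259c510254f2/config/:/etc/ceph:z",
    "/home/ceph-admin/assimilate_central.conf:/home/assimilate_central.conf:z",
]

print(host_path_for("/etc/ceph/central.conf", volumes))
# -> /var/lib/ceph/2696bddc-047e-4751-bdf2-259c510254f2/config/central.conf
```

In other words, inside the container `/etc/ceph` is the per-FSID config directory under /var/lib/ceph, not the host's /etc/ceph, so the conf file has to exist there under the cluster-specific name.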

The content of directories with conf files:
[cloud-admin@controller-0 ~]$ ll /etc/ceph/
-rw-------. 1 root root  63 May 26 14:40 central.client.admin.keyring
-rw-r--r--. 1 root root 173 May 26 14:40 central.conf

BUT

[cloud-admin@controller-0 ~]$ sudo ls -la /var/lib/ceph/2696bddc-047e-4751-bdf2-259c510254f2/config/
-rw-r--r--. 1 root root   63 May 26 14:41 ceph.client.admin.keyring
-rw-r--r--. 1 root root  173 May 26 14:41 ceph.conf

The content of the conf files is the same in both directories, but the files in /var/lib/ceph are named after the default Ceph cluster name "ceph", whereas the correct names should include "central". This is a DCN deployment with multiple Ceph clusters with different names. Both locations seem to be created during the Ceph cluster bootstrap: first (based on timestamps) the files in /etc/ceph, then the files in /var/lib/ceph. I have no idea why the bootstrap creates the conf files under different names in two different locations. The command used in the failing task mounts /var/lib/ceph/$FSID/config as /etc/ceph in the container (since the commit: https://opendev.org/openstack/tripleo-ansible/commit/5e302b4ff7a4e211e885d9e5a298343ba15eab25), so it looks for central.conf there. Once I copy the content of /etc/ceph to /var/lib/ceph/$FSID/config, the task passes successfully.
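The manual workaround described above (copying the cluster-named files from /etc/ceph into the per-FSID config directory) could be scripted roughly as follows. This is a sketch of the workaround only, not the fix shipped in the erratum; `expected_files`, `missing_files`, and `copy_missing` are hypothetical helper names, and the two file name patterns are taken from the directory listings in this report.

```python
import shutil
from pathlib import Path


def expected_files(cluster):
    """File names the failing task expects for a given Ceph cluster name."""
    return [f"{cluster}.conf", f"{cluster}.client.admin.keyring"]


def missing_files(config_dir, cluster):
    """Return the expected file names that are absent from config_dir."""
    config_dir = Path(config_dir)
    return [name for name in expected_files(cluster)
            if not (config_dir / name).is_file()]


def copy_missing(source_dir, config_dir, cluster):
    """Copy missing cluster-named files from source_dir into config_dir.

    Returns the list of file names that were copied. Existing files in
    config_dir (for example ceph.conf) are left untouched.
    """
    copied = []
    for name in missing_files(config_dir, cluster):
        source = Path(source_dir) / name
        if source.is_file():
            shutil.copy2(source, Path(config_dir) / name)
            copied.append(name)
    return copied


# Example call for the environment in this report (needs root on the host):
# copy_missing("/etc/ceph",
#              "/var/lib/ceph/2696bddc-047e-4751-bdf2-259c510254f2/config",
#              "central")
```

Running `missing_files` first is a cheap way to confirm the naming mismatch before changing anything on the host.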

Version-Release number of selected component (if applicable):
tripleo-ansible-3.3.1-1.20230518201531.358f3c3.el9ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy OpenStack with a Ceph cluster that uses a non-default cluster name (that is, anything other than "ceph")

Comment 18 Marian Krcmarik 2023-06-12 06:12:19 UTC
The bug can be VERIFIED

Comment 26 errata-xmlrpc 2023-08-16 01:15:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577