2210626 – tripleo_cephadm: Bootstrap play fails on "Ensure cephadm uses image tags instead of digests" task when Ceph cluster name is not default

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	tripleo-ansible
Sub Component:
Version:	17.1 (Wallaby)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	17.1
Assignee:	Francesco Pantano
QA Contact:	Alfredo
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2023-05-29 03:42 UTC by Marian Krcmarik
Modified:	2023-09-23 17:40 UTC
CC List:	4 users
Fixed In Version:	tripleo-ansible-3.3.1-1.20230518201532.el9ost openstack-tripleo-heat-templates-14.3.1-1.20230519151005.el9ost
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2023-08-16 01:15:24 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	884332	None	ABANDONED	Fix permission issues on ceph conf	2023-07-31 07:21:47 UTC
OpenStack gerrit	884555	None	MERGED	Rework cephadm key and pools creation for Quincy+	2023-07-31 07:21:43 UTC
Red Hat Issue Tracker	OSP-25425	None	None	None	2023-05-29 03:43:13 UTC
Red Hat Product Errata	RHEA-2023:4577	None	None	None	2023-08-16 01:15:46 UTC

Description Marian Krcmarik 2023-05-29 03:42:48 UTC

Description of problem:
The bootstrap play of the Ceph cluster fails on the "Ensure cephadm uses image tags instead of digests" task.
The actual error output looks like the following:
FATAL | Ensure cephadm uses image tags instead of digests | controller-0 | error={"changed": false, "cmd": ["podman", "run", "--rm", "--net=host", "--ipc=host", "--volume", "/var/lib/ceph/2696bddc-047e-4751-bdf2-259c510254f2/config/:/etc/ceph:z", "--volume", "/home/ceph-admin/assimilate_central.conf:/home/assimilate_central.conf:z", "--entrypoint", "ceph", "rhos-qe-mirror-tlv.usersys.redhat.com:5002/rh-osbs/rhceph:6-115", "--fsid", "2696bddc-047e-4751-bdf2-259c510254f2", "-c", "/etc/ceph/central.conf", "-k", "/etc/ceph/central.client.admin.keyring", "config", "set", "mgr", "mgr/cephadm/use_repo_digest", "false"], "delta": "0:00:00.584120", "end": "2023-05-26 14:41:46.698484", "msg": "non-zero return code", "rc": 1, "start": "2023-05-26 14:41:46.114364", "stderr": "Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')", "stderr_lines": ["Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')"], "stdout": "", "stdout_lines": []}

It comes from the following playbook:
https://opendev.org/openstack/tripleo-ansible/src/branch/stable/wallaby/tripleo_ansible/roles/tripleo_cephadm/tasks/bootstrap.yaml#L78

The failing task was added recently to the downstream (d/s) build with this commit (the current latest compose is RHOS-17.1-RHEL-9-20230525.n.1):
https://review.opendev.org/c/openstack/tripleo-ansible/+/883413
That is the point at which it started to fail.

The task runs the following command right after the Ceph cluster bootstrap:
sudo podman run --rm --net=host --ipc=host --volume /var/lib/ceph/2696bddc-047e-4751-bdf2-259c510254f2/config/:/etc/ceph:z --volume /home/ceph-admin/assimilate_central.conf:/home/assimilate_central.conf:z --entrypoint ceph rhos-qe-mirror-tlv.usersys.redhat.com:5002/rh-osbs/rhceph:6-115 --fsid c970470c-329b-5df0-a3d5-28f6e4b4ff98 -c /etc/ceph/central.conf -k /etc/ceph/central.client.admin.keyring config set mgr mgr/cephadm/use_repo_digest false

It fails with:
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
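To make the dependency on the cluster name explicit, here is a minimal Python sketch of how a command like the one above could be assembled. It is not the tripleo-ansible implementation: the function name `build_ceph_cli` and its parameters are hypothetical, and the second `--volume` (the assimilate conf) is omitted for brevity. It only shows that the `-c` and `-k` paths are derived from the cluster name, so they can resolve only if the mounted config directory contains files with that name.

```python
def build_ceph_cli(fsid, cluster, image, ceph_args):
    """Assemble a podman command that runs the ceph CLI in a container.

    Hypothetical helper for illustration; the layout mirrors the command
    quoted in this report, minus the assimilate conf volume.
    """
    # Host directory that gets mounted as /etc/ceph inside the container.
    config_dir = f"/var/lib/ceph/{fsid}/config/"
    return [
        "podman", "run", "--rm", "--net=host", "--ipc=host",
        "--volume", f"{config_dir}:/etc/ceph:z",
        "--entrypoint", "ceph", image,
        "--fsid", fsid,
        # Both paths depend on the cluster name, not on the default "ceph".
        "-c", f"/etc/ceph/{cluster}.conf",
        "-k", f"/etc/ceph/{cluster}.client.admin.keyring",
        *ceph_args,
    ]


cmd = build_ceph_cli(
    "2696bddc-047e-4751-bdf2-259c510254f2",
    "central",
    "rhos-qe-mirror-tlv.usersys.redhat.com:5002/rh-osbs/rhceph:6-115",
    ["config", "set", "mgr", "mgr/cephadm/use_repo_digest", "false"],
)
print(" ".join(cmd))
```

With the cluster name "central", the container looks for /etc/ceph/central.conf, which matches the `-c` argument in the failing command.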

The content of directories with conf files:
[cloud-admin@controller-0 ~]$ ll /etc/ceph/
-rw-------. 1 root root  63 May 26 14:40 central.client.admin.keyring
-rw-r--r--. 1 root root 173 May 26 14:40 central.conf

but:

[cloud-admin@controller-0 ~]$ sudo ls -la /var/lib/ceph/2696bddc-047e-4751-bdf2-259c510254f2/config/
-rw-r--r--. 1 root root   63 May 26 14:41 ceph.client.admin.keyring
-rw-r--r--. 1 root root  173 May 26 14:41 ceph.conf

The content of the conf files is the same in both directories, but the files in /var/lib/ceph are named after the default Ceph cluster name "ceph", whereas the correct names should include "central". This is a DCN deployment with multiple Ceph clusters, each with a different name. Both locations appear to be created during the Ceph cluster bootstrap: based on the timestamps, the files in /etc/ceph are written first and those in /var/lib/ceph afterwards. I do not know why the conf files are created with different names in the two locations. The command used in the failing task reads from /var/lib/ceph (since the commit: https://opendev.org/openstack/tripleo-ansible/commit/5e302b4ff7a4e211e885d9e5a298343ba15eab25). Once I copy the content of /etc/ceph to /var/lib/ceph/$FSID/config, the task passes successfully.
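The mismatch and the manual workaround can be sketched in Python as follows. This is only an illustration under stated assumptions, not the fix that was merged: the helper names `expected_names`, `missing_cluster_files` and `copy_cluster_files` are hypothetical, and the only files considered are the conf file and the admin keyring shown in the listings above.

```python
import shutil
from pathlib import Path


def expected_names(cluster):
    """File names the ceph CLI is pointed at for a given cluster name."""
    return [f"{cluster}.conf", f"{cluster}.client.admin.keyring"]


def missing_cluster_files(config_dir, cluster):
    """Return the expected files that are absent from config_dir."""
    return [
        name for name in expected_names(cluster)
        if not (Path(config_dir) / name).is_file()
    ]


def copy_cluster_files(src_dir, dst_dir, cluster):
    """Copy the cluster-named files from src_dir to dst_dir.

    Mirrors the manual workaround of copying /etc/ceph into the
    per-FSID config directory.
    """
    for name in expected_names(cluster):
        shutil.copy2(Path(src_dir) / name, Path(dst_dir) / name)
```

On the affected controller, `missing_cluster_files` called on the per-FSID config directory with the cluster name "central" would report both files as missing, because that directory only holds ceph.conf and ceph.client.admin.keyring. Both directories are root-owned, so a real check or copy would need to run with sufficient privileges.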

Version-Release number of selected component (if applicable):
tripleo-ansible-3.3.1-1.20230518201531.358f3c3.el9ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy OpenStack with a Ceph cluster whose name is not the default (that is, not "ceph").

Comment 18 Marian Krcmarik 2023-06-12 06:12:19 UTC

The bug can be VERIFIED

Comment 26 errata-xmlrpc 2023-08-16 01:15:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577
