Bug 2168345 - OSP17.0 ceph Satellite Deploy fails due to expected osds not running
Summary: OSP17.0 ceph Satellite Deploy fails due to expected osds not running
Keywords:
Status: MODIFIED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Importance: high high
Target Milestone: z7
Target Release: 17.0
Assignee: Francesco Pantano
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On: 2193166
Blocks:
 
Reported: 2023-02-08 18:46 UTC by David Rosenfeld
Modified: 2023-05-19 09:33 UTC (History)
11 users

Fixed In Version: tripleo-ansible-3.3.1-0.20230518150455.fa5422f.el9ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2193166 (view as bug list)
Environment:
Last Closed:
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 883413 0 None NEW Do not use image digest when the Ceph cluster is deployed 2023-05-18 15:44:36 UTC
Red Hat Issue Tracker OSP-22181 0 None None None 2023-02-08 18:46:25 UTC

Description David Rosenfeld 2023-02-08 18:46:04 UTC
Description of problem: OSP 17 ceph satellite deploys fail with error:

FATAL | Wait for expected number of osds to be running | controller-0

Overcloud deploy failures with Satellite are specific to deployments that use ceph; LVM-backed Satellite deploys succeed.

After the overcloud deploy fails, the ceph container images are present on the undercloud:

(undercloud) [stack@undercloud-0 ~]$ openstack tripleo container image list | grep ceph
| docker://undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-openshift-ose-prometheus-alertmanager:v4.10   |
| docker://undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-grafana:latest                                |
| docker://undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-openshift-ose-prometheus:v4.10                |
| docker://undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-openshift-ose-prometheus-node-exporter:v4.10  |
| docker://undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-rhceph:5-359                                  |

On the ceph nodes, /var/log/ceph/cephadm.log contains 404 errors indicating the ceph images cannot be found on the undercloud registry:

2023-02-07 12:16:05,207 7f2f5388e740 DEBUG stat: Trying to pull undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-rhceph@sha256:61ca086e93f6c433d6673afbe4d224b9bc51defed2cd88baaf9849a6a81940ce...
2023-02-07 12:16:05,214 7f2f5388e740 DEBUG stat: Error: initializing source docker://undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-rhceph@sha256:61ca086e93f6c433d6673afbe4d224b9bc51defed2cd88baaf9849a6a81940ce: reading manifest sha256:61ca086e93f6c433d6673afbe4d224b9bc51defed2cd88baaf9849a6a81940ce in undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-rhceph: StatusCode: 404, <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">...
2023-02-07 12:16:05,217 7f2f5388e740 INFO Non-zero exit code 125 from /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint stat --init -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-rhceph@sha256:61ca086e93f6c433d6673afbe4d224b9bc51defed2cd88baaf9849a6a81940ce -e NODE_NAME=ceph-0 -e CEPH_USE_RANDOM_NONCE=1 undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-rhceph@sha256:61ca086e93f6c433d6673afbe4d224b9bc51defed2cd88baaf9849a6a81940ce -c %u %g /var/lib/ceph

So it's in a state where the ceph nodes can't pull container images that do appear to be present on the undercloud registry.
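Note that the cephadm log above shows the pull being attempted by digest (rhceph@sha256:61ca08...) while the undercloud registry lists the image by tag (rhceph:5-359). This is consistent with a digest mismatch: a manifest's digest is the sha256 of its serialized bytes, so if Satellite re-publishes the manifest with even slightly different content, the original digest no longer resolves. A minimal illustration of that property (the manifest dicts below are hypothetical, not the real rhceph manifests):

```python
import hashlib
import json

def manifest_digest(manifest: dict) -> str:
    # Registries address a manifest by the sha256 of its stored bytes;
    # for this illustration we canonicalize the JSON before hashing.
    raw = json.dumps(manifest, separators=(",", ":"), sort_keys=True).encode()
    return "sha256:" + hashlib.sha256(raw).hexdigest()

# Hypothetical original manifest vs. a re-published copy with one extra field.
original = {"schemaVersion": 2, "layers": [{"digest": "sha256:aaa"}]}
republished = {
    "schemaVersion": 2,
    "layers": [{"digest": "sha256:aaa"}],
    "annotations": {"source": "republished"},
}

# Any byte-level change yields a different digest, so a pull pinned to the
# original digest 404s even though the image is served under its tag.
print(manifest_digest(original) == manifest_digest(republished))  # prints False
```

This matches the direction of the linked gerrit change ("Do not use image digest when the Ceph cluster is deployed"): pulling by tag sidesteps the digest mismatch.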

One other thing found while debugging is that controller-0 could pull the ceph image:
[heat-admin@controller-0 ~]$ sudo podman images
REPOSITORY                                                                             TAG         IMAGE ID      CREATED      SIZE
undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-rhceph  5-359       412d7e4d681e  4 weeks ago  986 MB

However, controllers 1 and 2 show the same 404 in /var/log/ceph/cephadm.log as the ceph nodes.
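One way to narrow this down from each node is to ask the undercloud registry directly whether it can serve the manifest by tag and by digest, using the Docker Registry HTTP API V2 endpoint /v2/&lt;name&gt;/manifests/&lt;reference&gt;. A sketch (the registry and repository names are copied from the logs above; the helper function names are ours, and TLS/auth details of the undercloud registry are not handled here):

```python
import urllib.error
import urllib.request

# Values taken from the log excerpts in this report.
REGISTRY = "undercloud-0.ctlplane.redhat.local:8787"
REPO = "default_organization-ceph-5-containers-rhceph"

def manifest_url(registry: str, repo: str, reference: str) -> str:
    # Registry API v2 serves manifests at /v2/<name>/manifests/<tag-or-digest>.
    return f"https://{registry}/v2/{repo}/manifests/{reference}"

def manifest_exists(url: str) -> bool:
    # HEAD the manifest; a 404 means the registry cannot resolve the reference.
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10):
            return True
    except urllib.error.HTTPError as err:
        return err.code != 404

# Expected pattern given this bug, if run against the live registry:
# manifest_exists(manifest_url(REGISTRY, REPO, "5-359"))        -> True
# manifest_exists(manifest_url(REGISTRY, REPO, "sha256:61ca08...")) -> False
```

If the tag resolves but the digest does not, the registry contents and the digest recorded at deploy time have diverged, which is the scenario the gerrit fix addresses.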

Version-Release number of selected component (if applicable): OSP17


How reproducible: Every time


Steps to Reproduce:
1. Do a Satellite deployment with ceph

Actual results: Overcloud deploy fails with error:
FATAL | Wait for expected number of osds to be running | controller-0


Expected results: Overcloud successfully deploys

Additional info:

