Bug 2168345

Summary: OSP17.0 ceph Satellite Deploy fails due to expected osds not running
Product: Red Hat OpenStack
Reporter: David Rosenfeld <drosenfe>
Component: tripleo-ansible
Assignee: Francesco Pantano <fpantano>
Status: MODIFIED
QA Contact: Joe H. Rahme <jhakimra>
Severity: high
Priority: high
Version: 17.0 (Wallaby)
CC: alfrgarc, bshephar, eharney, elicohen, fpantano, gfidente, jslagle, mburns, mkatari, ramishra, slinaber
Target Milestone: z7
Keywords: Triaged
Target Release: 17.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: tripleo-ansible-3.3.1-0.20230518150455.fa5422f.el9ost
Doc Type: If docs needed, set a value
Clones: 2193166 (view as bug list)
Type: Bug
Bug Depends On: 2193166

Description David Rosenfeld 2023-02-08 18:46:04 UTC
Description of problem: OSP 17 ceph satellite deploys fail with error:

FATAL | Wait for expected number of osds to be running | controller-0

Overcloud deploy failures with Satellite occur only when Ceph is used; Satellite deploys using LVM succeed.

After the overcloud deploy fails, the Ceph container images are seen to be present on the undercloud:

(undercloud) [stack@undercloud-0 ~]$ openstack tripleo container image list | grep ceph
| docker://undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-openshift-ose-prometheus-alertmanager:v4.10   |
| docker://undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-grafana:latest                                |
| docker://undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-openshift-ose-prometheus:v4.10                |
| docker://undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-openshift-ose-prometheus-node-exporter:v4.10  |
| docker://undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-rhceph:5-359                                  |

On a Ceph node, /var/log/ceph/cephadm.log contains 404 errors indicating it cannot find the Ceph images on the undercloud:

2023-02-07 12:16:05,207 7f2f5388e740 DEBUG stat: Trying to pull undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-rhceph@sha256:61ca086e93f6c433d6673afbe4d224b9bc51defed2cd88baaf9849a6a81940ce...
2023-02-07 12:16:05,214 7f2f5388e740 DEBUG stat: Error: initializing source docker://undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-rhceph@sha256:61ca086e93f6c433d6673afbe4d224b9bc51defed2cd88baaf9849a6a81940ce: reading manifest sha256:61ca086e93f6c433d6673afbe4d224b9bc51defed2cd88baaf9849a6a81940ce in undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-rhceph: StatusCode: 404, <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">...
2023-02-07 12:16:05,217 7f2f5388e740 INFO Non-zero exit code 125 from /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint stat --init -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-rhceph@sha256:61ca086e93f6c433d6673afbe4d224b9bc51defed2cd88baaf9849a6a81940ce -e NODE_NAME=ceph-0 -e CEPH_USE_RANDOM_NONCE=1 undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-rhceph@sha256:61ca086e93f6c433d6673afbe4d224b9bc51defed2cd88baaf9849a6a81940ce -c %u %g /var/lib/ceph

So it's in a state where the Ceph nodes cannot find container images that do appear to be present on the undercloud.
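A quick way to triage which nodes hit this is to scan cephadm.log for pulls that ended in a registry 404. A minimal sketch (the log line format is taken from the output above; the regex and helper name are assumptions, not part of cephadm):

```python
import re

# Matches cephadm.log "Error: initializing source docker://..." lines that
# ended in a registry 404, capturing the image name and its digest.
PULL_404 = re.compile(
    r"initializing source docker://(?P<image>\S+?)@(?P<digest>sha256:[0-9a-f]{64})"
    r".*StatusCode: 404"
)

def failed_pulls(log_text: str) -> list[tuple[str, str]]:
    """Return (image, digest) pairs for every 404'd pull found in the log."""
    return [(m.group("image"), m.group("digest"))
            for m in PULL_404.finditer(log_text)]
```

Running this over /var/log/ceph/cephadm.log on each node would show whether every node fails on the same digest, which points at the registry contents rather than at a per-node networking problem.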

One other finding while debugging: controller-0 could pull the Ceph image:
[heat-admin@controller-0 ~]$ sudo podman images
REPOSITORY                                                                             TAG         IMAGE ID      CREATED      SIZE
undercloud-0.ctlplane.redhat.local:8787/default_organization-ceph-5-containers-rhceph  5-359       412d7e4d681e  4 weeks ago  986 MB

However, controllers 1 and 2 have the same 404 in /var/log/ceph/cephadm.log that the ceph nodes have.
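Since controller-0 can pull the image while controllers 1 and 2 get 404s for the same digest, the failing pull can be reproduced directly against the Docker Registry v2 manifest endpoint from each node. A minimal sketch of building that URL from the digest reference in the log (the plain-HTTP scheme is an assumption based on the default undercloud registry on port 8787):

```python
def manifest_url(image_ref: str) -> str:
    """Build the Registry v2 manifest URL for a 'host:port/name@digest'
    reference, so the failing pull can be replayed with curl from any node."""
    name_part, digest = image_ref.split("@", 1)
    host, name = name_part.split("/", 1)  # host keeps its :port
    return f"http://{host}/v2/{name}/manifests/{digest}"
```

Fetching the resulting URL (e.g. `curl -s -o /dev/null -w '%{http_code}\n' <url>`) from controller-0 versus controller-1 would show whether the registry answers 200 for one node and 404 for another, or 404 for everyone.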

Version-Release number of selected component (if applicable): OSP17


How reproducible: Every time


Steps to Reproduce:
1. Do a satellite deployment with ceph

Actual results: Overcloud deploy fails with error:
FATAL | Wait for expected number of osds to be running | controller-0


Expected results: Overcloud successfully deploys

Additional info: