Bug 1968177

Summary: switch-to-containerized fails and leaves cluster in degraded state
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Heðin <hej>
Component: Ceph-Ansible
Assignee: Dimitri Savineau <dsavinea>
Status: CLOSED ERRATA
QA Contact: Ameena Suhani S H <amsyedha>
Severity: high
Priority: unspecified
Version: 4.2
CC: aschoen, ceph-eng-bugs, gabrioux, gmeno, nthomas, tserlin, vereddy, ykaul
Target Milestone: ---
Target Release: 4.2z3
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-ansible-4.0.61-1.el8cp, ceph-ansible-4.0.61-1.el7cp
Doc Type: If docs needed, set a value
Last Closed: 2021-09-27 18:26:24 UTC
Type: Bug
Attachments:
See 2021-06-06 12:11:47,842

Description Heðin 2021-06-06 14:08:41 UTC
Created attachment 1789117 [details]
See 2021-06-06 12:11:47,842

Description of problem:
switch-to-containerized fails on RHCS-4 when the following variables are not set in all.yml:
ceph_docker_registry: "registry.redhat.io"
ceph_docker_registry_auth: true
ceph_docker_registry_username:
ceph_docker_registry_password:

However, the playbook does not fail until after the non-containerized mon service has been removed. This leaves the cluster missing a monitor, and subsequent playbook runs fail because the removed mon service can no longer be found.
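For reference, a complete registry section in group_vars/all.yml might look like the following sketch; the credential values are placeholders for illustration, not values taken from this report:

```yaml
# group_vars/all.yml -- registry settings required when pulling from
# registry.redhat.io (username/password values below are placeholders)
ceph_docker_registry: "registry.redhat.io"
ceph_docker_registry_auth: true
ceph_docker_registry_username: "<service-account-user>"
ceph_docker_registry_password: "<service-account-token>"
```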

Version-Release number of selected component (if applicable):
ceph-ansible.noarch                  4.0.41-1.el7cp          @rhel-7-server-rhceph-4-tools-rpms


How reproducible:
Deployed RHCS-3 non-containerized, upgraded to RHCS-4 non-containerized, then ran infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml without adding the above-mentioned ceph_docker_registry variables.

Steps to Reproduce:
1. Install RHCS-3 non-containerized
2. Upgrade to RHCS-4
3. Convert to containerized by running infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml with --limit rhceph01 (rhceph01 is the first monitor)

Actual results:
The mon on rhceph01 is removed and the cluster is left with two functioning mons and HEALTH_WARN.

Expected results:
The playbook should fail early, with a message pointing out that registry.redhat.io requires these variables to be set, while keeping the cluster in HEALTH_OK.
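The requested early failure could be implemented with a pre-flight assertion along these lines; this is a hypothetical sketch of the idea, not the actual fix that shipped in ceph-ansible 4.0.61:

```yaml
# Hypothetical pre-flight check (sketch only, not the shipped ceph-ansible
# fix): abort before any daemon is touched when registry auth is enabled
# but credentials are missing.
- name: fail early when registry credentials are unset
  fail:
    msg: >-
      ceph_docker_registry_auth is true but ceph_docker_registry_username
      and/or ceph_docker_registry_password are not set; aborting before
      any mon service is removed.
  when:
    - ceph_docker_registry_auth | bool
    - (ceph_docker_registry_username is not defined) or (ceph_docker_registry_password is not defined)
```

Running such a task at the top of the switch playbook would satisfy the expected result above: the run stops with an explanatory message while the cluster remains HEALTH_OK.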

Additional info:
See the log entry at 2021-06-06 12:11:47,842 in the attached ansible.log.

Comment 1 Heðin 2021-06-06 14:10:33 UTC
Setting priority to high because the cluster is left in a degraded state.

Comment 2 Guillaume Abrioux 2021-07-02 12:09:52 UTC
v4.0.59 available upstream

Comment 7 Ameena Suhani S H 2021-08-04 06:10:36 UTC
Verified using 

ansible-2.9.24-1.el8ae.noarch
ceph-ansible-4.0.62-1.el8cp.noarch

Comment 9 errata-xmlrpc 2021-09-27 18:26:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 4.2 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3670