Bug 2344947

Summary: [ceph-ansible] When performing a rolling update and adoption from Ceph 4 to 5, alertmanager goes into an error state on Controller nodes
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Alfredo <alfrgarc>
Component: Ceph-Ansible
Assignee: Teoman ONAY <tonay>
Status: NEW ---
QA Contact: Manisha Saini <msaini>
Severity: high
Docs Contact:
Priority: unspecified
Version: 5.3
CC: ceph-eng-bugs, cephqe-warriors, gfidente, gmeno, johfulto, mobisht, rsachere
Target Milestone: ---
Target Release: 5.3z9
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Alfredo 2025-02-11 16:31:23 UTC
Description of problem:
When performing a rolling update and adoption from Ceph 4 to 5, the following behavior was observed on the Controller nodes:
[ceph: root@controller-0 /]# ceph orch ps | grep error
alertmanager.controller-1       controller-1.redhat.local  *:9093,9094       error              9m ago    2h        -        -  <unknown>          <unknown>     <unknown>     
alertmanager.controller-2       controller-2.redhat.local  *:9093,9094       error              8m ago    2h        -        -  <unknown>          <unknown>     <unknown>     
[ceph: root@controller-0 /]#

Version-Release number of selected component (if applicable):
https://download-01.beak-001.prod.iad2.dc.redhat.com/brewroot/vol/rhel-8/packages/ceph-ansible/6.0.28.20/1.el8cp/noarch/ceph-ansible-6.0.28.20-1.el8cp.noarch.rpm

How reproducible:
Consistent

Steps to Reproduce:
See: https://bugzilla.redhat.com/show_bug.cgi?id=2269009

Actual results:
Alertmanager is in an error state

Expected results:
Alertmanager is in a running state

Additional info:
See: https://bugzilla.redhat.com/show_bug.cgi?id=2269009

Comment 1 John Fulton 2025-03-05 15:20:26 UTC
https://github.com/ceph/ceph-ansible/pull/7654 is necessary but not sufficient to fix this bug.

Comment 2 John Fulton 2025-03-05 16:11:53 UTC
Even though alertmanager is created with a correct spec during adoption, it comes up in an error state.

Workaround:

- `ceph orch rm alertmanager`
- re-apply the spec [1]

It should have worked the first time though. We shouldn't need to recreate it.

[1]
--
service_type: alertmanager
service_name: alertmanager
placement:
  hosts:
  - host1
  - host2
  - host3
networks:
- 10.10.42.0/24
- 10.11.42.0/24
- 10.12.42.0/24
spec:
  port: 9093
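
The workaround above can be sketched as a small shell helper. This is only an illustration, not part of the fix: the function names and the spec file path are hypothetical, and it assumes the spec [1] has been saved to a file and that `ceph` is available (for example inside `cephadm shell`). Service removal may not be immediate, so it may be worth confirming with `ceph orch ls` that alertmanager is gone before re-applying.

```shell
# Sketch of the workaround; function names and the spec path are assumptions.

# Read `ceph orch ps` output on stdin and print the names of daemons
# that have a field equal to "error" (the STATUS column).
errored_daemons() {
  awk '{ for (i = 2; i <= NF; i++) if ($i == "error") { print $1; break } }'
}

# Remove the alertmanager service, then re-apply its spec from the given file.
recreate_alertmanager() {
  ceph orch rm alertmanager
  ceph orch apply -i "$1"
}

# Usage (not run automatically):
#   recreate_alertmanager /root/alertmanager.yaml   # hypothetical path
#   ceph orch ps | errored_daemons                  # expect no alertmanager lines
```

If `errored_daemons` still lists alertmanager daemons after the spec is re-applied, the workaround did not take effect on that host.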

Comment 3 John Fulton 2025-03-05 17:18:31 UTC
See also https://bugzilla.redhat.com/show_bug.cgi?id=2350124