Bug 2344947

Summary:	[ceph-ansible] When performing a rolling update and adoption from Ceph 4 to 5, alertmanager goes into an error state on Controller nodes
Product:	[Red Hat Storage] Red Hat Ceph Storage	Reporter:	Alfredo <alfrgarc>
Component:	Ceph-Ansible	Assignee:	Teoman ONAY <tonay>
Status:	NEW ---	QA Contact:	Manisha Saini <msaini>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	5.3	CC:	ceph-eng-bugs, cephqe-warriors, gfidente, gmeno, johfulto, mobisht, rsachere
Target Milestone:	---
Target Release:	5.3z9
Hardware:	Unspecified
OS:	Unspecified
Last Closed:		Type:	Bug

Description Alfredo 2025-02-11 16:31:23 UTC

Description of problem:
When performing a rolling update and adoption from Ceph 4 to 5, the following behavior was observed on the Controller nodes:
[ceph: root@controller-0 /]# ceph orch ps | grep error
alertmanager.controller-1       controller-1.redhat.local  *:9093,9094       error              9m ago    2h        -        -  <unknown>          <unknown>     <unknown>     
alertmanager.controller-2       controller-2.redhat.local  *:9093,9094       error              8m ago    2h        -        -  <unknown>          <unknown>     <unknown>     
[ceph: root@controller-0 /]#
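
The `grep error` filter above matches the word anywhere on a line. A slightly stricter check could look like the sketch below. It assumes the column order seen in the output above (name, host, ports, status); `list_error_daemons` is a hypothetical helper name, not part of Ceph, and the status field would shift for daemons that report no ports.

```shell
# Hypothetical helper (not part of Ceph): read `ceph orch ps` text output on
# stdin and print "<daemon> <host>" for each row whose 4th column is "error".
# Assumes the leading columns are NAME HOST PORTS STATUS, as in the output above.
list_error_daemons() {
  awk '$4 == "error" { print $1, $2 }'
}

# Usage on a live cluster (requires the ceph CLI, e.g. inside `cephadm shell`):
#   ceph orch ps | list_error_daemons
```

Matching on the status column avoids false positives from daemon names or image tags that happen to contain the string "error".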

Version-Release number of selected component (if applicable):
https://download-01.beak-001.prod.iad2.dc.redhat.com/brewroot/vol/rhel-8/packages/ceph-ansible/6.0.28.20/1.el8cp/noarch/ceph-ansible-6.0.28.20-1.el8cp.noarch.rpm

How reproducible:
Consistent

Steps to Reproduce:
See: https://bugzilla.redhat.com/show_bug.cgi?id=2269009

Actual results:
Alertmanager is in an error state

Expected results:
Alertmanager is in a running state

Additional info:
See: https://bugzilla.redhat.com/show_bug.cgi?id=2269009

Comment 1 John Fulton 2025-03-05 15:20:26 UTC

https://github.com/ceph/ceph-ansible/pull/7654 is necessary but not sufficient to fix this bug.

Comment 2 John Fulton 2025-03-05 16:11:53 UTC

Even though alertmanager is created with a correct spec during adoption, it comes up in an error state.

Workaround:

- `ceph orch rm alertmanager`
- re-apply the spec [1]

It should have worked the first time though. We shouldn't need to recreate it.

[1]
---
service_type: alertmanager
service_name: alertmanager
placement:
  hosts:
  - host1
  - host2
  - host3
networks:
- 10.10.42.0/24
- 10.11.42.0/24
- 10.12.42.0/24
spec:
  port: 9093
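
The two workaround steps can be sketched as a small shell function. This is only an illustration, not a tested procedure: `recreate_alertmanager`, the `CEPH` variable and the spec file path are assumptions introduced here, and the spec [1] is assumed to be saved to a file first. Service removal is asynchronous, so it may be necessary to wait until alertmanager no longer appears in `ceph orch ls` before re-applying.

```shell
# Sketch of the workaround: remove the alertmanager service, then re-apply
# its spec from a file. Set CEPH="echo ceph" for a dry run that only prints
# the commands. CEPH is left unquoted on purpose so "echo ceph" splits in two.
CEPH="${CEPH:-ceph}"

recreate_alertmanager() {
  spec_file="$1"   # path to a file holding the spec [1] (hypothetical path)
  if [ -z "$spec_file" ]; then
    echo "usage: recreate_alertmanager <spec-file>" >&2
    return 2
  fi
  $CEPH orch rm alertmanager || return 1
  $CEPH orch apply -i "$spec_file" || return 1
  $CEPH orch ps   # inspect the status column of the alertmanager daemons
}

# Dry run:   CEPH="echo ceph" recreate_alertmanager alertmanager.yml
# Real run:  recreate_alertmanager alertmanager.yml
```

Running the dry run first makes it easy to confirm the exact commands before touching the cluster.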

Comment 3 John Fulton 2025-03-05 17:18:31 UTC

See also https://bugzilla.redhat.com/show_bug.cgi?id=2350124