2344947 – [ceph-ansible] When performing a rolling update and adoption from Ceph 4 to 5, alertmanager goes into an error state on Controller nodes

Keywords:
Status:	NEW
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	Ceph-Ansible
Sub Component:
Version:	5.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	5.3z9
Assignee:	Teoman ONAY
QA Contact:	Manisha Saini
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2025-02-11 16:31 UTC by Alfredo
Modified:	2025-04-04 12:48 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	ceph ceph-ansible pull 7654	None	Merged	Adopt with grafana_network not grafana_server_addr	2025-02-18 14:12:38 UTC
Red Hat Bugzilla	2350124	high	CLOSED	cephadm trys to bind RGW daemon to all (::) interfaces when valid networks list is provided.	2025-04-01 11:45:47 UTC
Red Hat Issue Tracker	RHCEPH-10589	None	None	None	2025-02-11 16:32:39 UTC

Description Alfredo 2025-02-11 16:31:23 UTC

Description of problem:
When performing a rolling update and adoption from Ceph 4 to 5, the following behavior was observed on the Controller nodes:
[ceph: root@controller-0 /]# ceph orch ps | grep error
alertmanager.controller-1       controller-1.redhat.local  *:9093,9094       error              9m ago    2h        -        -  <unknown>          <unknown>     <unknown>     
alertmanager.controller-2       controller-2.redhat.local  *:9093,9094       error              8m ago    2h        -        -  <unknown>          <unknown>     <unknown>     
[ceph: root@controller-0 /]#
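
A plain `grep error` matches the string anywhere on the line. As a minimal sketch, assuming the column layout shown above (NAME, HOST, PORTS, STATUS, with PORTS possibly empty), the status column can be checked directly. The function name `list_error_daemons` is illustrative and not part of ceph.

```shell
#!/usr/bin/env bash
# Print the names of daemons whose STATUS column is exactly "error".
# Reads the text output of `ceph orch ps` on stdin.
# Assumption: STATUS is field 4, or field 3 when the PORTS column is empty.
list_error_daemons() {
    awk '$3 == "error" || $4 == "error" { print $1 }'
}
```

Usage, from inside the cephadm shell: `ceph orch ps | list_error_daemons`.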

Version-Release number of selected component (if applicable):
https://download-01.beak-001.prod.iad2.dc.redhat.com/brewroot/vol/rhel-8/packages/ceph-ansible/6.0.28.20/1.el8cp/noarch/ceph-ansible-6.0.28.20-1.el8cp.noarch.rpm

How reproducible:
Consistent

Steps to Reproduce:
See: https://bugzilla.redhat.com/show_bug.cgi?id=2269009

Actual results:
Alertmanager is in an error state

Expected results:
Alertmanager is in a running state

Additional info:
See: https://bugzilla.redhat.com/show_bug.cgi?id=2269009

Comment 1 John Fulton 2025-03-05 15:20:26 UTC

https://github.com/ceph/ceph-ansible/pull/7654 is necessary but not sufficient to fix this bug.

Comment 2 John Fulton 2025-03-05 16:11:53 UTC

Even though alertmanager is created with a correct spec during adoption, it comes up in an error state.

Workaround:

- `ceph orch rm alertmanager`
- re-apply the spec [1]

It should have worked the first time though. We shouldn't need to recreate it.

[1]
---
service_type: alertmanager
service_name: alertmanager
placement:
  hosts:
  - host1
  - host2
  - host3
networks:
- 10.10.42.0/24
- 10.11.42.0/24
- 10.12.42.0/24
spec:
  port: 9093
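
The two workaround steps can be sketched as a small shell function. This is only an illustration, assuming the spec above has been saved to a file and that the commands run where the `ceph` CLI has admin access (for example inside the cephadm shell). The function name `recreate_alertmanager` and the `DRY_RUN` variable are hypothetical helpers, not part of ceph or ceph-ansible. The sketch does not wait for the removal to finish before re-applying.

```shell
#!/usr/bin/env bash
# Sketch of the workaround: remove the alertmanager service, then re-apply its spec.
# Usage: recreate_alertmanager <spec-file>
# Set DRY_RUN=1 to print the commands instead of running them.
recreate_alertmanager() {
    local spec_file="$1"
    local run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    # Step 1: remove the existing alertmanager service.
    $run ceph orch rm alertmanager || return 1
    # Step 2: re-apply the service spec from the given file.
    $run ceph orch apply -i "$spec_file"
}
```

For example, `DRY_RUN=1 recreate_alertmanager alertmanager.yaml` prints the two commands without touching the cluster, where `alertmanager.yaml` is an assumed file name for the spec.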

Comment 3 John Fulton 2025-03-05 17:18:31 UTC

See also https://bugzilla.redhat.com/show_bug.cgi?id=2350124
