Bug 2344947 - [ceph-ansible] When performing a rolling update and adoption from Ceph 4 to 5, alertmanager goes into an error state on Controller nodes
Keywords:
Status: NEW
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 5.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 5.3z9
Assignee: Teoman ONAY
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2025-02-11 16:31 UTC by Alfredo
Modified: 2025-04-04 12:48 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | ceph ceph-ansible pull 7654 | 0 | None | Merged | Adopt with grafana_network not grafana_server_addr | 2025-02-18 14:12:38 UTC
Red Hat Bugzilla | 2350124 | 0 | high | CLOSED | cephadm trys to bind RGW daemon to all (::) interfaces when valid networks list is provided. | 2025-04-01 11:45:47 UTC
Red Hat Issue Tracker | RHCEPH-10589 | 0 | None | None | None | 2025-02-11 16:32:39 UTC

Description Alfredo 2025-02-11 16:31:23 UTC
Description of problem:
When performing a rolling update and adoption from Ceph 4 to 5, the following behavior was observed on the Controller nodes:
[ceph: root@controller-0 /]# ceph orch ps | grep error
alertmanager.controller-1       controller-1.redhat.local  *:9093,9094       error              9m ago    2h        -        -  <unknown>          <unknown>     <unknown>     
alertmanager.controller-2       controller-2.redhat.local  *:9093,9094       error              8m ago    2h        -        -  <unknown>          <unknown>     <unknown>     
[ceph: root@controller-0 /]#
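
A minimal sketch for listing only the failed daemons by name, assuming the status appears as a standalone `error` field as in the output above. `failed_daemons` is a hypothetical helper, not part of Ceph.

```shell
# Hypothetical helper: reads `ceph orch ps` output on stdin and prints the
# name (first field) of every daemon with a field equal to "error".
# All fields after the name are scanned, so an empty PORTS column does not
# shift the match.
failed_daemons() {
  awk '{ for (i = 2; i <= NF; i++) if ($i == "error") { print $1; next } }'
}
```

Inside the cephadm shell this would be used as `ceph orch ps | failed_daemons`.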

Version-Release number of selected component (if applicable):
https://download-01.beak-001.prod.iad2.dc.redhat.com/brewroot/vol/rhel-8/packages/ceph-ansible/6.0.28.20/1.el8cp/noarch/ceph-ansible-6.0.28.20-1.el8cp.noarch.rpm

How reproducible:
Consistent

Steps to Reproduce:
See: https://bugzilla.redhat.com/show_bug.cgi?id=2269009

Actual results:
Alertmanager is in an error state

Expected results:
Alertmanager is in a running state

Additional info:
See: https://bugzilla.redhat.com/show_bug.cgi?id=2269009

Comment 1 John Fulton 2025-03-05 15:20:26 UTC
https://github.com/ceph/ceph-ansible/pull/7654 is necessary but not sufficient to fix this bug.

Comment 2 John Fulton 2025-03-05 16:11:53 UTC
Even though alertmanager is created with a correct spec during adoption, it comes up in an error state.

Workaround:

- `ceph orch rm alertmanager`
- re-apply the spec [1]

It should have worked the first time though. We shouldn't need to recreate it.

[1]
---
service_type: alertmanager
service_name: alertmanager
placement:
  hosts:
  - host1
  - host2
  - host3
networks:
- 10.10.42.0/24
- 10.11.42.0/24
- 10.12.42.0/24
spec:
  port: 9093
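
The two workaround steps above can be sketched as a small shell function. This assumes the spec has been saved to a file inside the cephadm shell. The function name `recreate_alertmanager` and the file path are hypothetical.

```shell
# Hypothetical helper: remove the alertmanager service, then re-apply its
# spec from the file given as the first argument.
recreate_alertmanager() {
  spec_file="$1"
  # Step 1: remove the existing (errored) alertmanager service.
  ceph orch rm alertmanager || return 1
  # Step 2: re-apply the service spec from the file.
  ceph orch apply -i "$spec_file"
}
```

For example, `recreate_alertmanager /tmp/alertmanager.yaml` followed by `ceph orch ps | grep alertmanager` to check the resulting daemon state. Removal may not complete immediately, so it may be necessary to wait until the old daemons are gone before re-applying.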

Comment 3 John Fulton 2025-03-05 17:18:31 UTC
See also https://bugzilla.redhat.com/show_bug.cgi?id=2350124
