Description of problem:
When performing a rolling update and adoption from Ceph 4 to 5, the following behavior was observed on the Controller nodes:

    [ceph: root@controller-0 /]# ceph orch ps | grep error
    alertmanager.controller-1  controller-1.redhat.local  *:9093,9094  error  9m ago  2h  -  -  <unknown>  <unknown>  <unknown>
    alertmanager.controller-2  controller-2.redhat.local  *:9093,9094  error  8m ago  2h  -  -  <unknown>  <unknown>  <unknown>
    [ceph: root@controller-0 /]#

Version-Release number of selected component (if applicable):
https://download-01.beak-001.prod.iad2.dc.redhat.com/brewroot/vol/rhel-8/packages/ceph-ansible/6.0.28.20/1.el8cp/noarch/ceph-ansible-6.0.28.20-1.el8cp.noarch.rpm

How reproducible:
Consistent

Steps to Reproduce:
See: https://bugzilla.redhat.com/show_bug.cgi?id=2269009

Actual results:
Alertmanager is in an error state

Expected results:
Alertmanager is in a running state

Additional info:
See: https://bugzilla.redhat.com/show_bug.cgi?id=2269009
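For anyone checking this after an adoption run, the `ceph orch ps | grep error` check from the description could be scripted along these lines. This is only a sketch: the function name `daemons_in_error` is hypothetical, and it assumes the plain-text layout shown in the description (daemon name first, host second, status as a whitespace-separated field further along the line).

```python
def daemons_in_error(ps_output):
    """Return (daemon_name, host) pairs for lines that carry an 'error' field.

    Assumes the text layout of `ceph orch ps` as pasted in this report:
    name in the first column, host in the second, status somewhere after.
    """
    failed = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Match 'error' as a whole field rather than a substring, so a
        # daemon or host whose name merely contains "error" is not reported.
        if len(fields) >= 3 and "error" in fields[2:]:
            failed.append((fields[0], fields[1]))
    return failed


# Sample taken from the output pasted in the description.
SAMPLE = """\
alertmanager.controller-1  controller-1.redhat.local  *:9093,9094  error  9m ago  2h  -  -  <unknown>  <unknown>  <unknown>
alertmanager.controller-2  controller-2.redhat.local  *:9093,9094  error  8m ago  2h  -  -  <unknown>  <unknown>  <unknown>
"""

if __name__ == "__main__":
    for name, host in daemons_in_error(SAMPLE):
        print(f"{name} on {host} is in error")
```

Matching on a whole field is a bit stricter than a plain `grep error`, which would also hit any line that contains the string anywhere.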
https://github.com/ceph/ceph-ansible/pull/7654 is necessary but not sufficient to fix this bug.
Even though alertmanager is created with a correct spec during adoption, it comes up in an error state.

Workaround:
- `ceph orch rm alertmanager`
- re-apply the spec [1]

It should have worked the first time, though. We shouldn't need to recreate it.

[1]
    ---
    service_type: alertmanager
    service_name: alertmanager
    placement:
      hosts:
        - host1
        - host2
        - host2
    networks:
      - 10.10.42.0/24
      - 10.11.42.0/24
      - 10.12.42.0/24
    spec:
      port: 9093
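The two workaround steps could be wrapped in a small helper like the one below. This is a rough sketch, not a tested procedure: the helper name `recreate_alertmanager` and the file name `alertmanager.yaml` are hypothetical, and it assumes the spec from [1] has been saved to a file that the `ceph` CLI can read (for example from inside the cephadm shell). The `run` parameter exists only so the commands can be inspected without a cluster.

```python
import subprocess


def recreate_alertmanager(spec_path, run=subprocess.run):
    """Remove the alertmanager service, then re-apply its spec.

    spec_path is a hypothetical path to a file holding the spec from [1].
    Returns the list of commands that were executed, in order.
    """
    commands = [
        # Step 1 of the workaround: drop the service that came up in error.
        ["ceph", "orch", "rm", "alertmanager"],
        # Step 2: re-apply the same spec from a file.
        ["ceph", "orch", "apply", "-i", spec_path],
    ]
    for cmd in commands:
        # check=True stops at the first failing command.
        run(cmd, check=True)
    return commands


if __name__ == "__main__":
    recreate_alertmanager("alertmanager.yaml")
```

Note that the sketch issues the re-apply immediately after the removal and does not wait for the old daemons to disappear; whether a pause is needed in between is not something this report establishes.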
See also https://bugzilla.redhat.com/show_bug.cgi?id=2350124