Description of problem:

This is basically a copy/paste from 2224351 [0] - this time the subject is alertmanager and not RGW.

During the FFU from 16.2 to 17.1, when alertmanager is deployed as part of director-deployed Ceph, the procedure fails on the next stack update. In particular, haproxy-bundle cannot start via pacemaker because it fails to bind to the alertmanager port (9093).

8<-------8<-------8<-------8<-------8<-------8<-------8<-------
2024-03-11T11:41:23.975991313+01:00 stderr F [ALERT] 070/114123 (7) : Starting proxy ceph_alertmanager: cannot bind socket [192.168.3.213:9093]
8<-------8<-------8<-------8<-------8<-------8<-------8<-------

After digging into the existing environment, we've seen that alertmanager has not been redeployed on the storage network and is bound on *. The resulting spec gathered from the adopted cluster shows:

---
service_type: alertmanager
service_name: alertmanager
placement:
  count: 3
  label: monitoring
---
service_type: crash
service_name: crash
placement:
  label: ceph
---
service_type: grafana
service_name: grafana
placement:
  count: 3
  label: monitoring
---
service_type: mds
service_id: cephfs
service_name: mds.cephfs
placement:
  count: 3
  label: mdss
---
service_type: mgr
service_name: mgr
placement:
  count: 3
  label: mgrs
---
service_type: mon
service_name: mon
placement:
  count: 3
  label: mons
---
service_type: node-exporter
service_name: node-exporter
placement:
  host_pattern: '*'
---
service_type: osd
service_name: osd
unmanaged: true
spec:
  filter_logic: AND
  objectstore: bluestore
---
service_type: prometheus
service_name: prometheus
placement:
  count: 3
  label: monitoring
---
[..]

Actual results:
alertmanager is not bound to the storage network, preventing haproxy from starting.

Expected results:
alertmanager is bound to the storage network.

[0] https://bugzilla.redhat.com/show_bug.cgi?id=2224351
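For reference, a quick way to confirm the wildcard bind described above (assuming shell access to a controller node; the port and service name are the ones from this report):

~~~
# check what is listening on the alertmanager port
ss -plnt | grep 9093

# dump the adopted spec for the affected service
ceph orch ls --export alertmanager
~~~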
Also, the following steps were required:

~~~
pcs resource disable haproxy-bundle
pcs resource enable haproxy-bundle
~~~
@fpiccion @tonay Kenny has shared additional steps which are required. What would be the way forward here? Or, if this is the solution, should it be documented in the known issues section?
(In reply to Francesco Pantano from comment #9)
> Hi Erin,
> can we add this bug as known issue in the FFU doc and mention the KCS as the
> current workaround?

I talked to Erin, and I will add the known issue to the FFU guide. I'm tracking the doc work in this Jira: https://issues.redhat.com/browse/OSPRH-7173
Hi Flavio and Francesco,

Can you please review the following known issue?
https://gitlab.cee.redhat.com/rhci-documentation/docs-Red_Hat_Enterprise_Linux_OpenStack_Platform/-/merge_requests/12707

One follow-up question: The KCS says that HAProxy does not restart on the next stack update. Does "the next stack update" refer to any point in the FFU procedure where the stack is updated, or does it refer to a specific step in the FFU procedure?
(In reply to Flavio Piccioni from comment #12)
> Hi Katie,
>
> let me try to recover some data from the original support case to see if we
> can maybe tune the KCS a little bit too:
>
> [customer's description]
> 8<-------8<-------8<-------8<-------8<-------8<-------8<-------
> Today I did some further testing by doing again a clean install in OSP 16.2.6
> and then doing the upgrade to 17.1.2.
>
> All runs smoothly until we do the Ceph 4 to 5 upgrade - step 6.2.5 "Update the
> packages on the Red Hat Ceph Storage nodes" - the cluster was still OK.
>
> But when doing step 6.2.6 "Configure the Red Hat Ceph Storage nodes to use
> cephadm", Ceph is absolutely not happy anymore after the command ran
> successfully:
>
> openstack overcloud external-upgrade run \
>   --skip-tags ceph_health,opendev-validation,ceph_ansible_remote_tmp \
>   --stack <stack> \
>   --tags cephadm_adopt 2>&1
>
> Afterwards Ceph is in WARN state:
>
> [root@oscar05ctr001 ~]# ceph -s
>   cluster:
>     id:     74b8145c-7206-4edb-a40d-6b653b116060
>     health: HEALTH_WARN
>             Failed to place 1 daemon(s)
>             2 failed cephadm daemon(s)
>             2 stray daemon(s) not managed by cephadm
>
>   services:
>     mon:     3 daemons, quorum oscar05ctr001,oscar05ctr002,oscar05ctr003 (age 10m)
>     mgr:     oscar05ctr001(active, since 9m), standbys: oscar05ctr002, oscar05ctr003
>     mds:     1/1 daemons up, 2 standby
>     osd:     2 osds: 2 up (since 8m), 2 in (since 18h)
>     rgw-nfs: 2 daemons active (2 hosts, 1 zones)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   12 pools, 193 pgs
>     objects: 19.37k objects, 3.5 GiB
>     usage:   6.6 GiB used, 2.9 TiB / 2.9 TiB avail
>     pgs:     193 active+clean
>
>   io:
>     client: 276 KiB/s rd, 1.4 MiB/s wr, 413 op/s rd, 1.62k op/s wr
>
> When looking at which daemons are unhealthy:
>
> [root@oscar05ctr001 ~]# ceph orch ls
> NAME               PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
> alertmanager       ?:9093,9094      0/3  9m ago     3h   count:3;label:monitoring
> crash                               5/5  9m ago     3h   label:ceph
> grafana            ?:3000           3/3  9m ago     3h   count:3;label:monitoring
> mds.cephfs                          3/3  9m ago     3h   count:3;label:mdss
> mgr                                 3/3  9m ago     3h   count:3;label:mgrs
> mon                                 3/3  9m ago     3h   count:3;label:mons
> node-exporter      ?:9100           5/5  9m ago     3h   *
> osd                                   2  48s ago    -    <unmanaged>
> prometheus         ?:9095           3/3  9m ago     3h   count:3;label:monitoring
> rgw.oscar05ctr001  ?:8080           0/1  -          2s   oscar05ctr001;count-per-host:1
> rgw.oscar05ctr002  ?:8080           0/1  -          8s   oscar05ctr002;count-per-host:1
> rgw.oscar05ctr003  ?:8080           0/1  -          0s   oscar05ctr003;count-per-host:1
>
> None of the alertmanager and rgw daemons can start.
> 8<-------8<-------8<-------8<-------8<-------8<-------8<-------
>
> So basically the problem started here [0]
>
> [0] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/17.1/html/framework_for_upgrades_16.2_to_17.1/upgrading-an-overcloud-with-director-deployed-ceph-deployments_preparing-overcloud#upgrading-to-ceph-storage-5-upgrading-ceph

Thank you Flavio for the explanation. I added more details in the MR. Can you please review?
https://gitlab.cee.redhat.com/rhci-documentation/docs-Red_Hat_Enterprise_Linux_OpenStack_Platform/-/merge_requests/12707/diffs

If you are okay with these changes, do you want me to add the same text to the KCS (except for the last sentence with the link to the KCS)?
After using the fix in this bug (ceph-ansible-6.0.28.17-1.el8cp), we still had to follow the workaround in the KCS https://access.redhat.com/solutions/7071082. We think this is because https://bugzilla.redhat.com/show_bug.cgi?id=2274719 is not yet resolved.
I had to follow https://access.redhat.com/solutions/7071082 not just for alertmanager but also for prometheus.

The log showed:

2025-02-05T16:27:13.783551140-05:00 stderr F [ALERT] 035/162713 (7) : Starting proxy ceph_prometheus: cannot bind socket [10.1.13.18:9092]

and ss showed:

# ss -plnt | grep 9092
LISTEN 0      2048    *:9092    *:*    users:(("prometheus",pid=435094,fd=7))

We see `ceph orch ls --export` showed:

service_type: prometheus
service_name: prometheus
placement:
  count: 3
  label: monitoring
spec:
  port: 9092
---
service_type: node-exporter
service_name: node-exporter
placement:
  host_pattern: '*'
---
service_type: grafana
service_name: grafana
placement:
  count: 3
  label: monitoring

The above spec is created when infrastructure-playbooks/cephadm-adopt.yml is run. It does not specify a networks list, so those services listen on *:9092 instead of a specific IP and port 9092. The workaround was to assign a networks list to prometheus.

I can pass the networks list to the RGW spec here by setting radosgw_address_block:
https://github.com/ceph/ceph-ansible/blob/main/infrastructure-playbooks/cephadm-adopt.yml#L953

and to alertmanager here by setting grafana_server_addr:
https://github.com/ceph/ceph-ansible/blob/main/infrastructure-playbooks/cephadm-adopt.yml#L1491

but I can't do the same for prometheus:
https://github.com/ceph/ceph-ansible/blob/main/infrastructure-playbooks/cephadm-adopt.yml#L1504

Maybe the playbook could be updated to take grafana_server_addr as an argument here? I could send in a PR.
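For context, the workaround mentioned above (assigning a networks list to the affected service) amounts to re-applying the spec with a networks entry. A minimal sketch, assuming 10.1.13.0/24 is the storage network (only an example derived from the IP in the log above; substitute the real subnet):

~~~
# prometheus_spec.yaml (sketch)
service_type: prometheus
service_name: prometheus
placement:
  count: 3
  label: monitoring
networks:
- 10.1.13.0/24
spec:
  port: 9092
~~~

Applied with `ceph orch apply -i prometheus_spec.yaml`, the orchestrator redeploys the daemons bound to an IP on that network rather than on *, which leaves the port free for HAProxy on the VIP.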
(In reply to John Fulton from comment #30)
> Maybe the playbook could be updated to take grafana_server_addr as an
> argument here? I could send in a PR.

https://github.com/ceph/ceph-ansible/pull/7649
Parameter grafana_server_addr needs to be an IP, not a range, since it is passed to the wait_for module in the ceph-dashboard role.

Parameter spec.networks[str] is usually a range [1] used to select an IP [2][3] with Python's ipaddress.overlaps() [4], which tolerates an IP being passed instead of a range. Thus, we can pass the grafana_server_addr in the spec, but we should not override it to a range when running ceph-ansible.

The grafana_server_addr is computed by ceph-facts from the grafana_network [5], which defaults to the public_network [6]. Thus, it is not necessary to pass an override to avoid this bug. We're testing without the override.

My PR would have been better if it had used grafana_network instead of grafana_server_addr, but it will still produce a working deployment given how ipaddress.overlaps() works.

[1] https://docs.ceph.com/en/latest/cephadm/services/#ceph.deployment.service_spec.ServiceSpec.networks
[2] https://github.com/ceph/ceph/blob/main/src/pybind/mgr/cephadm/services/monitoring.py
[3] https://github.com/ceph/ceph/blob/main/src/pybind/mgr/cephadm/module.py
[4] https://docs.python.org/3/library/ipaddress.html#ipaddress.IPv4Network.overlaps
[5] https://github.com/ceph/ceph-ansible/blob/main/roles/ceph-facts/tasks/grafana.yml
[6] https://github.com/ceph/ceph-ansible/blob/main/roles/ceph-defaults/defaults/main.yml#L618
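A small illustration of the overlaps() behaviour described above, using only the Python standard library (the addresses are hypothetical examples, not values from this deployment):

~~~
import ipaddress

# A spec "networks" entry is normally a range...
storage_net = ipaddress.ip_network("192.168.3.0/24")

# ...but a single IP is tolerated: ip_network() turns it into a /32,
# and overlaps() still matches it against the host's subnet.
single_ip = ipaddress.ip_network("192.168.3.213")

print(single_ip)                        # 192.168.3.213/32
print(storage_net.overlaps(single_ip))  # True
print(single_ip.overlaps(storage_net))  # True
~~~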
(In reply to John Fulton from comment #39)
> My PR would have been better if it had used grafana_network
> instead of grafana_server_addr but it will still produce a
> working deployment given how ipaddress.overlaps() works.

https://github.com/ceph/ceph-ansible/pull/7654
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.3 security and bug fix updates), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2025:1478
The fix for this bug is in https://github.com/ceph/ceph-ansible/pull/7654, but that patch is not in ceph-ansible 6.0.28.20-1.el8cp [1].

Thus, the target release was correctly changed on 2025-02-11 19:56:35 UTC (5.3z8 → 5.3z9) so that we could target the fix for this bug at z9 after discussion with Federico.

The bug report right now reads like the fix has been released and closed with errata, but it has not. The fix failed QA per my update on 2025-02-11 13:27:11 UTC.

If https://bugzilla.redhat.com/show_bug.cgi?id=2344947 will be the "release vehicle" for https://github.com/ceph/ceph-ansible/pull/7654 that's OK with me, BUT that alone will not fix the issue reported in 2344947.

[1]
[fultonj@runcible test]$ sha256sum ceph-ansible-6.0.28.20-1.el8cp.noarch.rpm
278042512c7f080bf0ef7bb4e77afffd576f5b552518ca0e60292f9545110b7c  ceph-ansible-6.0.28.20-1.el8cp.noarch.rpm
[fultonj@runcible test]$ rpm2cpio ceph-ansible-6.0.28.20-1.el8cp.noarch.rpm | cpio -idmv
<...>
[fultonj@runcible test]$ grep grafana_network usr/share/ceph-ansible/infrastructure-playbooks/cephadm-adopt.yml
[fultonj@runcible test]$ grep grafana_server_addr usr/share/ceph-ansible/infrastructure-playbooks/cephadm-adopt.yml
          {% if grafana_server_addr is defined %}
          networks: {{ grafana_server_addr.split(',') | list if ',' in grafana_server_addr else grafana_server_addr | string }}
          {% if grafana_server_addr is defined %}
          networks: {{ grafana_server_addr.split(',') | list if ',' in grafana_server_addr else grafana_server_addr | string }}
[fultonj@runcible test]$
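For illustration only, this is roughly what the shipped template above renders to in the generated spec (the addresses are hypothetical examples, not values from this deployment): a single-address grafana_server_addr produces a bare string, while a comma-separated value produces a list.

~~~
# grafana_server_addr: "192.168.3.213"
networks: 192.168.3.213

# grafana_server_addr: "192.168.3.213,192.168.4.213"
networks: ['192.168.3.213', '192.168.4.213']
~~~

Either form is accepted by the spec because, as noted in the earlier comment, ipaddress.overlaps() treats a single IP as a /32 network.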