Description of problem:

This is basically a copy/paste from 2224351 [0] - this time the subject is alertmanager and not RGW.

During the FFU from 16.2 to 17.1, when alertmanager is deployed as part of director-deployed Ceph, the procedure fails on the next stack update. In particular, haproxy-bundle cannot start via pacemaker because it fails to bind to the alertmanager port (9093).

8<-------8<-------8<-------8<-------8<-------8<-------8<-------
2024-03-11T11:41:23.975991313+01:00 stderr F [ALERT] 070/114123 (7) : Starting proxy ceph_alertmanager: cannot bind socket [192.168.3.213:9093]
8<-------8<-------8<-------8<-------8<-------8<-------8<-------

After digging into the existing environment, we've seen that alertmanager has not been redeployed on the storage network and is bound on *. The resulting spec gathered from the adopted cluster shows:

---
service_type: alertmanager
service_name: alertmanager
placement:
  count: 3
  label: monitoring
---
service_type: crash
service_name: crash
placement:
  label: ceph
---
service_type: grafana
service_name: grafana
placement:
  count: 3
  label: monitoring
---
service_type: mds
service_id: cephfs
service_name: mds.cephfs
placement:
  count: 3
  label: mdss
---
service_type: mgr
service_name: mgr
placement:
  count: 3
  label: mgrs
---
service_type: mon
service_name: mon
placement:
  count: 3
  label: mons
---
service_type: node-exporter
service_name: node-exporter
placement:
  host_pattern: '*'
---
service_type: osd
service_name: osd
unmanaged: true
spec:
  filter_logic: AND
  objectstore: bluestore
---
service_type: prometheus
service_name: prometheus
placement:
  count: 3
  label: monitoring
---
[..]

Actual results:
alertmanager is not bound to the storage network, preventing haproxy from starting.

Expected results:
alertmanager is bound to the storage network.

[0] https://bugzilla.redhat.com/show_bug.cgi?id=2224351
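For reference, a quick way to confirm the wildcard bind described above (assuming shell access to a controller node; the port and service name are the ones from this report):

~~~
# check what is listening on the alertmanager port
ss -plnt | grep 9093

# dump the adopted spec for the affected service
ceph orch ls --export alertmanager
~~~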
Also, the following steps were required:

~~~
pcs resource disable haproxy-bundle
pcs resource enable haproxy-bundle
~~~
@fpiccion @tonay Kenny has shared additional steps which are required. What would be the way forward here? Or, if this is the solution, should it be documented in the known issues section?
(In reply to Francesco Pantano from comment #9)
> Hi Erin,
> can we add this bug as known issue in the FFU doc and mention the KCS as the
> current workaround?

I talked to Erin, and I will add the known issue to the FFU guide. I'm tracking the doc work in this Jira: https://issues.redhat.com/browse/OSPRH-7173
Hi Flavio and Francesco,

Can you please review the following known issue?
https://gitlab.cee.redhat.com/rhci-documentation/docs-Red_Hat_Enterprise_Linux_OpenStack_Platform/-/merge_requests/12707

One follow-up question: The KCS says that HAProxy does not restart on the next stack update. Does "the next stack update" refer to any point in the FFU procedure where the stack is updated, or does it refer to a specific step in the FFU procedure?
(In reply to Flavio Piccioni from comment #12)
> Hi Katie,
>
> let me try to recover some data from the original support case to see if we
> can maybe tune the KCS a little bit too:
>
> [customer's description]
> 8<-------8<-------8<-------8<-------8<-------8<-------8<-------
> Today I did some further testing by doing again a clean install in OSP 16.2.6
> and then doing the upgrade to 17.1.2.
>
> All runs smoothly until we do the Ceph 4 to 5 upgrade - step 6.2.5 "Update the
> packages on the Red Hat Ceph Storage nodes" - the cluster was still OK.
>
> But when doing step 6.2.6 "Configure the Red Hat Ceph Storage nodes to use
> cephadm", Ceph is absolutely not happy anymore after the command ran
> successfully:
>
> openstack overcloud external-upgrade run \
>   --skip-tags ceph_health,opendev-validation,ceph_ansible_remote_tmp \
>   --stack <stack> \
>   --tags cephadm_adopt 2>&1
>
> Afterwards Ceph is in WARN state:
>
> [root@oscar05ctr001 ~]# ceph -s
>   cluster:
>     id:     74b8145c-7206-4edb-a40d-6b653b116060
>     health: HEALTH_WARN
>             Failed to place 1 daemon(s)
>             2 failed cephadm daemon(s)
>             2 stray daemon(s) not managed by cephadm
>
>   services:
>     mon:     3 daemons, quorum oscar05ctr001,oscar05ctr002,oscar05ctr003 (age 10m)
>     mgr:     oscar05ctr001(active, since 9m), standbys: oscar05ctr002, oscar05ctr003
>     mds:     1/1 daemons up, 2 standby
>     osd:     2 osds: 2 up (since 8m), 2 in (since 18h)
>     rgw-nfs: 2 daemons active (2 hosts, 1 zones)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   12 pools, 193 pgs
>     objects: 19.37k objects, 3.5 GiB
>     usage:   6.6 GiB used, 2.9 TiB / 2.9 TiB avail
>     pgs:     193 active+clean
>
>   io:
>     client: 276 KiB/s rd, 1.4 MiB/s wr, 413 op/s rd, 1.62k op/s wr
>
> When looking at which daemons are unhealthy:
>
> [root@oscar05ctr001 ~]# ceph orch ls
> NAME               PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
> alertmanager       ?:9093,9094      0/3  9m ago     3h   count:3;label:monitoring
> crash                               5/5  9m ago     3h   label:ceph
> grafana            ?:3000           3/3  9m ago     3h   count:3;label:monitoring
> mds.cephfs                          3/3  9m ago     3h   count:3;label:mdss
> mgr                                 3/3  9m ago     3h   count:3;label:mgrs
> mon                                 3/3  9m ago     3h   count:3;label:mons
> node-exporter      ?:9100           5/5  9m ago     3h   *
> osd                                   2  48s ago    -    <unmanaged>
> prometheus         ?:9095           3/3  9m ago     3h   count:3;label:monitoring
> rgw.oscar05ctr001  ?:8080           0/1  -          2s   oscar05ctr001;count-per-host:1
> rgw.oscar05ctr002  ?:8080           0/1  -          8s   oscar05ctr002;count-per-host:1
> rgw.oscar05ctr003  ?:8080           0/1  -          0s   oscar05ctr003;count-per-host:1
>
> None of the alertmanager and rgw daemons can start.
> 8<-------8<-------8<-------8<-------8<-------8<-------8<-------
>
> So basically the problem started here [0]
>
> [0] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/17.1/html/framework_for_upgrades_16.2_to_17.1/upgrading-an-overcloud-with-director-deployed-ceph-deployments_preparing-overcloud#upgrading-to-ceph-storage-5-upgrading-ceph

Thank you Flavio for the explanation. I added more details in the MR. Can you please review?
https://gitlab.cee.redhat.com/rhci-documentation/docs-Red_Hat_Enterprise_Linux_OpenStack_Platform/-/merge_requests/12707/diffs

If you are okay with these changes, do you want me to add the same text to the KCS (except for the last sentence with the link to the KCS)?
After using the fix in this bug (ceph-ansible-6.0.28.17-1.el8cp), we still had to follow the workaround in the KCS https://access.redhat.com/solutions/7071082. We think this is because https://bugzilla.redhat.com/show_bug.cgi?id=2274719 is not yet resolved.
I had to follow https://access.redhat.com/solutions/7071082 not just for alertmanager but also for prometheus.

The log showed:

2025-02-05T16:27:13.783551140-05:00 stderr F [ALERT] 035/162713 (7) : Starting proxy ceph_prometheus: cannot bind socket [10.1.13.18:9092]

and ss showed:

# ss -plnt | grep 9092
LISTEN 0      2048    *:9092    *:*    users:(("prometheus",pid=435094,fd=7))

We see `ceph orch ls --export` showed:

service_type: prometheus
service_name: prometheus
placement:
  count: 3
  label: monitoring
spec:
  port: 9092
---
service_type: node-exporter
service_name: node-exporter
placement:
  host_pattern: '*'
---
service_type: grafana
service_name: grafana
placement:
  count: 3
  label: monitoring

The above spec is created when infrastructure-playbooks/cephadm-adopt.yml is run. It does not specify a networks list, so those services listen on *:9092 instead of a specific IP and port 9092. The workaround was to assign a networks list to prometheus.

I can pass the networks list to the RGW spec here by setting radosgw_address_block:
https://github.com/ceph/ceph-ansible/blob/main/infrastructure-playbooks/cephadm-adopt.yml#L953

and to alertmanager here by setting grafana_server_addr:
https://github.com/ceph/ceph-ansible/blob/main/infrastructure-playbooks/cephadm-adopt.yml#L1491

but I can't do the same for prometheus:
https://github.com/ceph/ceph-ansible/blob/main/infrastructure-playbooks/cephadm-adopt.yml#L1504

Maybe the playbook could be updated to take grafana_server_addr as an argument here? I could send in a PR.
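For context, the workaround mentioned above (assigning a networks list to the affected service) amounts to re-applying the spec with a networks entry. A minimal sketch, assuming 10.1.13.0/24 is the storage network (only an example derived from the IP in the log above; substitute the real subnet):

~~~
# prometheus_spec.yaml (sketch)
service_type: prometheus
service_name: prometheus
placement:
  count: 3
  label: monitoring
networks:
- 10.1.13.0/24
spec:
  port: 9092
~~~

Applied with `ceph orch apply -i prometheus_spec.yaml`, the orchestrator redeploys the daemons bound to an IP on that network rather than on *, which leaves the port free for HAProxy on the VIP.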
(In reply to John Fulton from comment #30)
> Maybe the playbook could be updated to take grafana_server_addr as an
> argument here? I could send in a PR.

https://github.com/ceph/ceph-ansible/pull/7649
Parameter grafana_server_addr needs to be an IP, not a range, since it is passed to the wait_for module in the ceph-dashboard role.

Parameter spec.networks[str] is usually a range [1] used to select an IP [2][3] with Python's ipaddress.overlaps() [4], which tolerates an IP being passed instead of a range. Thus, we can pass the grafana_server_addr in the spec, but we should not override it to a range when running ceph-ansible.

The grafana_server_addr is computed by ceph-facts from the grafana_network [5], which defaults to the public_network [6]. Thus, it is not necessary to pass an override to avoid this bug. We're testing without the override.

My PR would have been better if it had used grafana_network instead of grafana_server_addr, but it will still produce a working deployment given how ipaddress.overlaps() works.

[1] https://docs.ceph.com/en/latest/cephadm/services/#ceph.deployment.service_spec.ServiceSpec.networks
[2] https://github.com/ceph/ceph/blob/main/src/pybind/mgr/cephadm/services/monitoring.py
[3] https://github.com/ceph/ceph/blob/main/src/pybind/mgr/cephadm/module.py
[4] https://docs.python.org/3/library/ipaddress.html#ipaddress.IPv4Network.overlaps
[5] https://github.com/ceph/ceph-ansible/blob/main/roles/ceph-facts/tasks/grafana.yml
[6] https://github.com/ceph/ceph-ansible/blob/main/roles/ceph-defaults/defaults/main.yml#L618
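A small illustration of the overlaps() behaviour described above, using only the Python standard library (the addresses are hypothetical examples, not values from this deployment):

~~~
import ipaddress

# A spec "networks" entry is normally a range...
storage_net = ipaddress.ip_network("192.168.3.0/24")

# ...but a single IP is tolerated: ip_network() turns it into a /32,
# and overlaps() still matches it against the host's subnet.
single_ip = ipaddress.ip_network("192.168.3.213")

print(single_ip)                        # 192.168.3.213/32
print(storage_net.overlaps(single_ip))  # True
print(single_ip.overlaps(storage_net))  # True
~~~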
(In reply to John Fulton from comment #39)
> My PR would have been better if it had used grafana_network
> instead of grafana_server_addr but it will still produce a
> working deployment given how ipaddress.overlaps() works.

https://github.com/ceph/ceph-ansible/pull/7654
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.3 security and bug fix updates), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2025:1478
The fix for this bug is in https://github.com/ceph/ceph-ansible/pull/7654, but that patch is not in ceph-ansible 6.0.28.20-1.el8cp [1].

Thus, the target release was correctly changed on 2025-02-11 19:56:35 UTC (5.3z8 → 5.3z9) so that we could target the fix for this bug at z9 after discussion with Federico.

The bug report right now reads like the fix has been released and closed with errata, but it has not. The fix failed QA per my update on 2025-02-11 13:27:11 UTC.

If https://bugzilla.redhat.com/show_bug.cgi?id=2344947 will be the "release vehicle" for https://github.com/ceph/ceph-ansible/pull/7654 that's OK with me, BUT that alone will not fix the issue reported in 2344947.

[1]
[fultonj@runcible test]$ sha256sum ceph-ansible-6.0.28.20-1.el8cp.noarch.rpm
278042512c7f080bf0ef7bb4e77afffd576f5b552518ca0e60292f9545110b7c  ceph-ansible-6.0.28.20-1.el8cp.noarch.rpm
[fultonj@runcible test]$ rpm2cpio ceph-ansible-6.0.28.20-1.el8cp.noarch.rpm | cpio -idmv
<...>
[fultonj@runcible test]$ grep grafana_network usr/share/ceph-ansible/infrastructure-playbooks/cephadm-adopt.yml
[fultonj@runcible test]$ grep grafana_server_addr usr/share/ceph-ansible/infrastructure-playbooks/cephadm-adopt.yml
          {% if grafana_server_addr is defined %}
          networks: {{ grafana_server_addr.split(',') | list if ',' in grafana_server_addr else grafana_server_addr | string }}
          {% if grafana_server_addr is defined %}
          networks: {{ grafana_server_addr.split(',') | list if ',' in grafana_server_addr else grafana_server_addr | string }}
[fultonj@runcible test]$
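For illustration only, this is roughly what the shipped template above renders to in the generated spec (the addresses are hypothetical examples, not values from this deployment): a single-address grafana_server_addr produces a bare string, while a comma-separated value produces a list.

~~~
# grafana_server_addr: "192.168.3.213"
networks: 192.168.3.213

# grafana_server_addr: "192.168.3.213,192.168.4.213"
networks: ['192.168.3.213', '192.168.4.213']
~~~

Either form is accepted by the spec because, as noted in the earlier comment, ipaddress.overlaps() treats a single IP as a /32 network.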