+++ This bug was initially created as a clone of Bug #2125107 +++

Description of problem:

Site weights are not monitored after stretch mode is deployed. The two sites can have different weights and the cluster does not raise any warning message. Before stretch mode is enabled, there are checks that ensure the weights of the two sites are the same; otherwise stretch mode deployment is not allowed.

# ceph mon enable_stretch_mode ceph-pdhiran-22eyie-node1-installer stretch_rule datacenter
Error EINVAL: the 2 datacenterinstances in the cluster have differing weights 25947 and 15728 but stretch mode currently requires they be the same!

# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                                STATUS  REWEIGHT  PRI-AFF
 -1         1.01788  root default
-19         0.39592      datacenter DC1
-15         0.15594          host ceph-pdhiran-22eyie-node11
 21    hdd  0.03899              osd.21                           up   1.00000  1.00000
 22    hdd  0.03899              osd.22                           up   1.00000  1.00000
 23    hdd  0.03899              osd.23                           up   1.00000  1.00000
 24    hdd  0.03899              osd.24                           up   1.00000  1.00000
 -7         0.10199          host ceph-pdhiran-22eyie-node3
  0    hdd  0.03099              osd.0                            up   1.00000  1.00000
 10    hdd  0.03000              osd.10                           up   1.00000  1.00000
 14    hdd  0.01999              osd.14                           up   1.00000  1.00000
 16    hdd  0.02100              osd.16                           up   1.00000  1.00000
-11         0.13799          host ceph-pdhiran-22eyie-node8
  3    hdd  0.04099              osd.3                            up   1.00000  1.00000
 13    hdd  0.06599              osd.13                           up   1.00000  1.00000
 19    hdd  0.03099              osd.19                           up   1.00000  1.00000
-20         0.23999      datacenter DC2
 -9         0.13799          host ceph-pdhiran-uur192-node10
  4    hdd  0.04099              osd.4                            up   1.00000  1.00000
 12    hdd  0.06599              osd.12                           up   1.00000  1.00000
 17    hdd  0.03099              osd.17                           up   1.00000  1.00000
 -3         0.10199          host ceph-pdhiran-22eyie-node4
  1    hdd  0.03099              osd.1                            up   1.00000  1.00000
  6    hdd  0.03000              osd.6                            up   1.00000  1.00000
  8    hdd  0.01999              osd.8                            up   1.00000  1.00000
 11    hdd  0.02100              osd.11                           up   1.00000  1.00000

We can remove the host causing the issue, deploy stretch mode, and add the host back without any problems. Should there be a check so that, if hosts (OSDs) are added to the sites after deployment, a warning message is shown suggesting that the weights be kept the same?

Version-Release number of selected component (if applicable):
4.x, 5.x

How reproducible:
Always

Steps to Reproduce:
1. Deploy a stretch mode cluster.
2. Add OSD(s) to one site, making the weights uneven (a command-level sketch of this appears after the backport discussion in the comments below).

Actual results:
The cluster works as expected, which is good, but should there be similar checks after stretch mode deployment as well?

Expected results:
A warning message, if required.

Additional info:

--- Additional comment from Vikhyat Umrao on 2022-09-08 11:26:18 EDT ---

Good catch, Pawan. Let me check with Greg/Neha whether this could be an extended RFE on top of the already existing check feature.

--- Additional comment from Vikhyat Umrao on 2022-09-09 15:59:16 EDT ---

Junior - as discussed in the ceph-rados Google Chat space!

Vikhyat Umrao, Yesterday 8:36 AM
@Gregory Farnum @Neha Ojha https://bugzilla.redhat.com/show_bug.cgi?id=2125107
FYI - nothing urgent - this one would need your attention - stretch mode - could be an extended RFE

Gregory Farnum, Yesterday 8:40 AM
Yeah I think that makes sense. It probably wouldn’t be hard for Junior to add in on a per-map-generation basis (in one of the do_stretch_mode function areas)

Vikhyat Umrao, Yesterday 8:40 AM
Thank you, Greg. @Kamoltat Sirivadhna FYI

--- Additional comment from Kamoltat (Junior) Sirivadhna on 2022-09-23 09:22:17 EDT ---

Here is the upstream patch; the PR includes the fix for the bug and also a reproducer.
https://github.com/ceph/ceph/pull/48209

--- Additional comment from Kamoltat (Junior) Sirivadhna on 2022-09-23 09:55:37 EDT ---

FYI: the reproducer I've included will not pass at this time because it is blocked by this issue: https://tracker.ceph.com/issues/57650?next_issue_id=57632. However, I have tested the patch with a modified reproducer that works around that issue. I think the best way forward is for QE to verify my fix by cherry-picking my commit into the build and verifying it.

--- Additional comment from Pawan on 2022-09-25 23:14:12 EDT ---

I will be able to test the fix and provide feedback once a build is available for me to pick up.

--- Additional comment from Kamoltat (Junior) Sirivadhna on 2022-09-28 14:26:40 EDT ---

Hi Pawan, here is a command to pull a container image of an upstream build I created with my fix:

docker pull quay.ceph.io/ceph-ci/ceph:f5acb0752663296c9686e7891b9a8dfbe3c76442

Let me know if this works for you. Thank you.

--- Additional comment from Pawan on 2022-09-30 00:39:35 EDT ---

Thank you for the build. I'll try the scenario and update the bug.

--- Additional comment from Radoslaw Zarzynski on 2023-03-27 14:16:59 EDT ---

The upstream PR is not merged yet.

--- Additional comment from Radoslaw Zarzynski on 2023-07-07 14:08:32 EDT ---

The patch is merged in main; we need to get the Reef backport (https://tracker.ceph.com/issues/61810).
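
As a reference for the Steps to Reproduce in the description above, here is a minimal command-level sketch of forcing uneven site weights on an already-deployed stretch cluster. The host name, device path, OSD id, and weight value are placeholders, and it assumes the cluster is managed by cephadm/ceph orch.

# Assumption: stretch mode is already enabled and the cluster is managed by cephadm.
# Add an OSD to only one datacenter (host and device are placeholders) ...
ceph orch daemon add osd ceph-node-dc1-extra:/dev/sdb

# ... or skew the CRUSH weight of an existing OSD in one site only
# (OSD id and weight are placeholders).
ceph osd crush reweight osd.21 0.10000

# Compare the per-datacenter totals and check whether the cluster reports anything.
ceph osd tree
ceph health detail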
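
To exercise the fix from the container image posted in the comments above, one possible workflow (a sketch only, assuming a throwaway cephadm-managed QE cluster; the image tag is the one Junior posted, everything else is the standard orchestrator upgrade path) is:

# Pull the upstream CI image containing the fix.
docker pull quay.ceph.io/ceph-ci/ceph:f5acb0752663296c9686e7891b9a8dfbe3c76442

# Move an existing cephadm test cluster onto that image via the orchestrator
# upgrade path, then wait for the monitors to restart on the patched build.
ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:f5acb0752663296c9686e7891b9a8dfbe3c76442
ceph orch upgrade status

# Re-create the uneven-weight condition from the sketch above and watch for
# the new health warning.
ceph health detail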
--- Additional comment from Kamoltat (Junior) Sirivadhna on 2023-12-14 11:46:39 EST ---

Thanks, Radek. I just cherry-picked the commits from https://github.com/ceph/ceph/pull/52457/ into downstream ceph-7.0-rhel-patches. Moving to POST ...

--- Additional comment from Kamoltat (Junior) Sirivadhna on 2023-12-14 13:44:16 EST ---

Downstream ceph-7.1-rhel-patches is not yet out for backporting.

--- Additional comment from on 2024-01-22 23:08:59 EST ---

(In reply to Kamoltat (Junior) Sirivadhna from comment #27)
> Downstream ceph-7.1-rhel-patches is not yet out for backporting.

The branch is out now, but I already see the three commits in https://github.com/ceph/ceph/pull/52457/commits included as part of the 18.2.1 rebase.

Thomas

--- Additional comment from errata-xmlrpc on 2024-01-22 23:09:33 EST ---

This bug has been added to advisory RHBA-2024:126567 by Thomas Serlin (tserlin)

--- Additional comment from errata-xmlrpc on 2024-01-22 23:09:33 EST ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2024:126567-01
https://errata.engineering.redhat.com/advisory/126567

--- Additional comment from Pawan on 2024-02-01 04:01:26 EST ---

Verified the warnings generated. The fix is working as expected.

[root@ceph-pdhiran-uur192-node8 ~]# ceph -s
  cluster:
    id:     c9a060e8-c032-11ee-b413-fa163ebc4e6f
    health: HEALTH_WARN
            Stretch mode buckets have different weights!

  services:
    mon: 5 daemons, quorum ceph-pdhiran-uur192-node1-installer,ceph-pdhiran-uur192-node6,ceph-pdhiran-uur192-node3,ceph-pdhiran-uur192-node5,ceph-pdhiran-uur192-node2 (age 20h)
    mgr: ceph-pdhiran-uur192-node1-installer.upzyqc(active, since 20h), standbys: ceph-pdhiran-uur192-node5.upvxfr, ceph-pdhiran-uur192-node6.qdppxk, ceph-pdhiran-uur192-node2.xzslqe, ceph-pdhiran-uur192-node3.ytewip
    mds: 1/1 daemons up, 1 standby
    osd: 23 osds: 23 up (since 21s), 23 in (since 20h)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   8 pools, 433 pgs
    objects: 247 objects, 457 KiB
    usage:   7.8 GiB used, 567 GiB / 575 GiB avail
    pgs:     433 active+clean

  io:
    client:   4.5 KiB/s rd, 0 B/s wr, 4 op/s rd, 2 op/s wr

[root@ceph-pdhiran-uur192-node8 ~]# ceph health detail
HEALTH_WARN Stretch mode buckets have different weights!
[WRN] UNEVEN_WEIGHTS_STRETCH_MODE: Stretch mode buckets have different weights!

[root@ceph-pdhiran-uur192-node8 ~]# ceph -s
  cluster:
    id:     c9a060e8-c032-11ee-b413-fa163ebc4e6f
    health: HEALTH_WARN
            Stretch mode buckets != 2

  services:
    mon: 5 daemons, quorum ceph-pdhiran-uur192-node1-installer,ceph-pdhiran-uur192-node6,ceph-pdhiran-uur192-node3,ceph-pdhiran-uur192-node5,ceph-pdhiran-uur192-node2 (age 20h)
    mgr: ceph-pdhiran-uur192-node1-installer.upzyqc(active, since 20h), standbys: ceph-pdhiran-uur192-node5.upvxfr, ceph-pdhiran-uur192-node6.qdppxk, ceph-pdhiran-uur192-node2.xzslqe, ceph-pdhiran-uur192-node3.ytewip
    mds: 1/1 daemons up, 1 standby
    osd: 23 osds: 23 up (since 2m), 23 in (since 20h)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   8 pools, 433 pgs
    objects: 247 objects, 457 KiB
    usage:   7.8 GiB used, 567 GiB / 575 GiB avail
    pgs:     433 active+clean

[root@ceph-pdhiran-uur192-node8 ~]# ceph health detail
HEALTH_WARN Stretch mode buckets != 2
[WRN] INCORRECT_NUM_BUCKETS_STRETCH_MODE: Stretch mode buckets != 2

--- Additional comment from Akash Raj on 2024-03-04 00:31:56 EST ---

Hi Kamoltat. Please confirm whether the doc text is required for inclusion in the release notes. If so, please provide the doc text and type.

Thanks.
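
For completeness, a minimal operator-side sketch of reacting to the UNEVEN_WEIGHTS_STRETCH_MODE warning seen in the verification above (the OSD id and weight are placeholders; ceph health mute is the generic health-mute mechanism, not something introduced by this fix):

# Identify which datacenter bucket is heavier.
ceph osd tree

# Bring the sites back to equal weight, e.g. by adjusting a CRUSH weight
# (placeholder id/weight) or by adding matching capacity to the lighter site.
ceph osd crush reweight osd.12 0.03099

# The warning should clear once the datacenter weights match again.
ceph health detail

# If the imbalance is intentional and temporary, the warning can be muted
# instead of being fixed right away.
ceph health mute UNEVEN_WEIGHTS_STRETCH_MODE 4h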
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:2743