Created attachment 2027762 [details]
JSON output for requested cephadm shell commands to assist debug

Description of problem:

The MAX_AVAIL statistic behaves unexpectedly when the cluster becomes unhealthy after losing contact with several OSD containers. Instead of getting smaller after losing 10 OSDs (25% of all OSD containers and NVMe storage devices in the pool), max_avail actually gets much larger (33% larger). After the cluster returns to a healthy state, MAX_AVAIL is close to its original value.

We care about the unexpected MAX_AVAIL value because our control plane uses this metric when selecting a pool/cluster to place a newly provisioned volume. We need reliable behavior for the MAX_AVAIL metric whether the cluster is healthy or not. With its current behavior when the cluster is not "ok_healthy", MAX_AVAIL isn't suitable for consideration when performing volume placement. Its value makes no sense for our use case because the metric *increases* by 33% after losing 25% of the OSD containers and storage devices in our dev cluster. This can cause our control plane to select the "wrong" pool/cluster when doing volume placement.

Version-Release number of selected component (if applicable):

"IBM_CEPH_IMAGE": "ibm-ceph/ceph-6-rhel9:6-20-1.0.0.build-13",
"RHCS_CEPH_VER": "17.2.6-196.el9cp",
"architecture": "x86_64",
"build-date": "2024-02-02T03:42:04",
"description": "IBM Storage Ceph 6",

How reproducible:
100%

Steps to Reproduce:
1. Save the current MAX_AVAIL value.
2. Apply iptables rules to block all ports (public, cluster, heartbeat-front, heartbeat-back) for several OSD containers. I took down 25% of the OSD containers in a dev cluster whose pool was almost empty (very low utilization).
3. When all impacted OSD containers have been stopped, check MAX_AVAIL. In my env it is 33% larger after losing 25% of the cluster's OSD devices.
4. Remove the iptables filters, restart the stopped OSD containers, and wait for the cluster to return to healthy status.
5. Check MAX_AVAIL again; it will be very close to its original value.

This issue was discussed with our Ceph dev team, and they requested that we collect the following data for analysis (a short script for comparing the resulting max_avail values across captures is included under Additional info below):

ceph status --format=json > $DIR/status_${TAG}.json
ceph df --format=json > $DIR/df_${TAG}.json
ceph osd df tree --format=json > $DIR/osd_dftree_${TAG}.json
ceph osd dump --format=json > $DIR/osd_dump_${TAG}.json
ceph pg dump_pools_json > $DIR/pg_dump_${TAG}.json

Datasets were collected at these times:
a. before the experiment started (*_ok_start.json)
b. soon after OSD network traffic was dropped (*_starting_down.json)
c. all OSDs down (*_all_down.json)
d. after firewall rules removed and OSDs restarted (*_osds_restarting.json)
e. after all OSDs up, cluster healthy (*_healthy_again.json)

The JSON files we collected are attached to this bugzilla in max_avail_json.tar.gz.

Actual results:

df_ok_start.json         "max_avail": 72158076207104
df_starting_down.json    "max_avail": 79563069587456
df_all_down.json         "max_avail": 96042674552832
df_osds_restarting.json  "max_avail": 72154544603136
df_healthy_again.json    "max_avail": 72153084985344

MAX_AVAIL increases when the cluster isn't healthy after OSDs are lost. After removing the iptables rules dropping all OSD network traffic and restarting the down OSD containers, MAX_AVAIL shrinks back to be close to its original value.

Expected results:

We expected MAX_AVAIL to decrease after losing 25% of its OSD containers and storage devices, and then, when the OSDs are brought online again, we expected MAX_AVAIL to increase back to its original value.
Instead, we see the opposite behavior.

Additional info:
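For anyone repeating the experiment, the per-pool MAX_AVAIL values can be pulled out of the `ceph df --format=json` captures with a few lines of Python. A minimal sketch, assuming the usual `pools[].stats.max_avail` layout of the df JSON output (the file names correspond to the attached datasets):

#!/usr/bin/env python3
# Minimal sketch: print max_avail per pool from the attached "ceph df --format=json"
# captures, assuming the usual {"pools": [{"name": ..., "stats": {"max_avail": ...}}]} layout.
import json

CAPTURES = [
    "df_ok_start.json",
    "df_starting_down.json",
    "df_all_down.json",
    "df_osds_restarting.json",
    "df_healthy_again.json",
]

for path in CAPTURES:
    with open(path) as f:
        report = json.load(f)
    for pool in report.get("pools", []):
        print(f'{path:26} {pool["name"]:16} max_avail={pool["stats"]["max_avail"]}')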
I've identified the code responsible for this discrepancy and confirmed the math using the provided pg dump.

1. The 'raw_used_rate' passed in to `PGMapDigest::dump_object_stat_sum` should be equal to:
   - the number of copies for replicated pools
   - or ( (K + M) / K ) for EC pools.

2. At https://github.com/ceph/ceph/blob/main/src/mon/PGMap.cc#L886

   raw_used_rate *= (float)(sum.num_object_copies - sum.num_objects_degraded) / sum.num_object_copies;

   - This applies a scaling factor equal to the fraction of non-degraded object copies (relative to the total object copy count).
   - Using the 'all_down' pg dump for libvirt-pool:
     num_object_copies:    7287540
     num_objects_degraded: 1812426
     Scaling factor applied: 0.7512979688619205

3. The 'MAX_AVAIL' value is calculated at https://github.com/ceph/ceph/blob/main/src/mon/PGMap.cc#L901

   auto avail_res = raw_used_rate ? avail / raw_used_rate : 0;

   - 'avail' is the raw available bytes.
   - 'raw_used_rate' is now ~75% of what it was; dividing by ~75% of the original rate makes 'MAX_AVAIL' increase by ~33%.

   avail: ( min(osd_avail_kbytes) * num_osds ) - ( sum(osd_max_kbytes) * ( 1 - mon_osd_full_ratio ) )
          ( 5597462260 * 40 ) - ( 250048839680 * ( 1 - 0.95 ) ) = 211396048416
   raw_used_rate: 3 * 0.7512979688619205 = 2.253893907
   max_avail: 211396048416 / 2.253893907 * 1024 = 96042476953095

   (A standalone re-derivation of these numbers is included at the end of this comment.)

I must admit I don't understand the reason for the scaling-factor logic, but the results do track with the code in place.

@Radek -- What are your thoughts on this logic?
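To double-check the math above, the calculation can be repeated outside of Ceph. A minimal standalone sketch in Python, using the constants from the attached all_down dumps and the simplified 'avail' expression from this comment (an approximation, not the exact `get_rule_avail` implementation):

# Standalone re-derivation of the all_down MAX_AVAIL, mirroring the PGMap.cc
# lines quoted above. Constants come from the attached all_down dumps; the
# simplified 'avail' expression is the approximation used in this comment,
# not the exact get_rule_avail() implementation.

num_object_copies    = 7287540       # pg_dump_all_down.json, libvirt-pool
num_objects_degraded = 1812426
replica_count        = 3             # replicated pool (raw_used_rate = 3 before scaling)

min_osd_avail_kb   = 5597462260      # smallest per-OSD 'avail' among the 40 IN OSDs (KiB)
sum_osd_max_kb     = 250048839680    # total raw capacity of the 40 IN OSDs (KiB)
num_in_osds        = 40
mon_osd_full_ratio = 0.95

# Scaling factor applied at PGMap.cc#L886
scale = (num_object_copies - num_objects_degraded) / num_object_copies
raw_used_rate = replica_count * scale            # ~2.2539 instead of 3

# Simplified 'avail' in KiB, then MAX_AVAIL in bytes as at PGMap.cc#L901
avail_kb = min_osd_avail_kb * num_in_osds - sum_osd_max_kb * (1 - mon_osd_full_ratio)
max_avail = avail_kb / raw_used_rate * 1024

print(f"scale={scale:.10f}  raw_used_rate={raw_used_rate:.9f}")
print(f"max_avail ~= {max_avail:.0f} bytes")     # ~9.604e13, close to df_all_down's 96042674552832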
In thinking on this a bit further, I suspect the 'avail' value supplied by `PGMap::get_rule_avail` is expected to include only the 'UP' OSDs. However, that function returns the total available bytes for all 'IN' OSDs. If I repeat the math using only the 'UP' OSDs, the MAX_AVAIL would be only 72031857714821, which tracks as expected (see the sketch below).

A bit of background: OSDs have two separate state flags:
- UP/DOWN: whether the OSD service is responding on the network
- IN/OUT: whether the OSD is still considered for data placement

Note that `PGMap::get_rule_avail` including all 'IN' OSDs fits the definition of the state flags: the 10 OSDs taken 'DOWN' are not marked 'OUT' yet, and thus should still be considered part of the data placement.

In my opinion, the scale factor being applied should be removed to provide a more realistic 'MAX_AVAIL' calculation.
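The same sketch as above, restricted to the 30 'UP' OSDs. The capacity split is an assumption (the 10 DOWN OSDs are simply dropped from both the OSD count and the total raw capacity, treating all 40 OSDs as identical); it is meant to illustrate the point, not to reproduce `get_rule_avail` exactly:

# Same calculation restricted to the 30 'UP' OSDs. Assumption: the 10 DOWN
# OSDs are dropped from both the OSD count and the total raw capacity
# (all 40 OSDs treated as identical).

num_object_copies    = 7287540
num_objects_degraded = 1812426
replica_count        = 3

min_osd_avail_kb   = 5597462260
num_up_osds        = 30
sum_osd_max_kb     = 250048839680 * 30 // 40     # capacity of the UP OSDs only
mon_osd_full_ratio = 0.95

raw_used_rate = replica_count * (num_object_copies - num_objects_degraded) / num_object_copies
avail_kb = min_osd_avail_kb * num_up_osds - sum_osd_max_kb * (1 - mon_osd_full_ratio)

# Prints roughly 7.203e13 bytes, within rounding of the 72031857714821 figure above.
print(f"max_avail (UP only) ~= {avail_kb / raw_used_rate * 1024:.0f} bytes")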
I've submitted an upstream PR and tracker for this:

https://github.com/ceph/ceph/pull/57003
https://tracker.ceph.com/issues/65591
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 8.1 security, bug fix and enhancement updates), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2025:17047