Bug 2275995 - MAX_AVAIL increases after losing OSD containers and storage devices
Summary: MAX_AVAIL increases after losing OSD containers and storage devices
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 6.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 8.1z3
Assignee: Michael J. Kidd
QA Contact: Harsh Kumar
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks: 2277857 2277178
 
Reported: 2024-04-18 20:45 UTC by jeff.a.smith
Modified: 2025-09-30 09:22 UTC
12 users

Fixed In Version: ceph-19.2.1-254.el9cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2277178 2277857 (view as bug list)
Environment:
Last Closed: 2025-09-30 09:21:50 UTC
Embargoed:
linuxkidd: needinfo-


Attachments
JSON output for requested cephadm shell commands to assist debug (52.96 KB, application/gzip)
2024-04-18 20:45 UTC, jeff.a.smith


Links:
- Ceph Project Bug Tracker 65591 (last updated 2024-04-19 14:52:58 UTC)
- GitHub ceph/ceph pull 57003: PGMap: remove pool max_avail scale factor (open; last updated 2024-04-19 14:52:58 UTC)
- Red Hat Issue Tracker RHCEPH-8840 (last updated 2024-04-18 20:46:48 UTC)
- Red Hat Product Errata RHBA-2025:17047 (last updated 2025-09-30 09:22:02 UTC)

Internal Links: 2277857

Description jeff.a.smith 2024-04-18 20:45:33 UTC
Created attachment 2027762 [details]
JSON output for requested cephadm shell commands to assist debug

Description of problem:

The MAX_AVAIL statistic behaves unexpectedly when the cluster becomes
unhealthy after losing contact with several OSD containers. Instead of
getting smaller after losing 10 OSDs (25% of all OSD containers and NVMe
storage devices in the pool), max_avail actually gets much larger (33%
larger). After the cluster returns to a healthy state, MAX_AVAIL is close
to its original value.

We care about the unexpected MAX_AVAIL value because our control plane
uses this metric when selecting a pool/cluster to place a newly
provisioned volume. We need reliable behavior from the MAX_AVAIL metric
whether the cluster is healthy or not. With its current behavior when the
cluster is not "ok_healthy", MAX_AVAIL isn't suitable for consideration
when performing volume placement. Its value makes no sense for our use
case because the metric *increases* by 33% after losing 25% of the OSD
containers and storage devices in our dev cluster. This can cause our
control plane to select the "wrong" pool/cluster when doing volume
placement.


Version-Release number of selected component (if applicable):

                    "IBM_CEPH_IMAGE": "ibm-ceph/ceph-6-rhel9:6-20-1.0.0.build-13",
                    "RHCS_CEPH_VER": "17.2.6-196.el9cp",
                    "architecture": "x86_64",
                    "build-date": "2024-02-02T03:42:04",
                    "description": "IBM Storage Ceph 6",

How reproducible:
100%

Steps to Reproduce:
1. Save current MAX_AVAIL value
2. Apply iptables rules to block all ports (public, cluster, heartbeat-front, heartbeat-back) for several OSD containers. I took down 25% of the OSD containers in a dev cluster whose pool was almost empty (very low utilization).
3. When all impacted OSD containers have been stopped, check MAX_AVAIL. In my env it is 33% larger after losing 25% of the cluster's OSD devices.
4. Remove the iptables filters, restart the stopped OSD containers, and wait for the cluster to return to healthy status.
5. Check MAX_AVAIL again; it will be very close to its original value.

This issue was discussed with our CEPH dev team, and they
requested that we collect the following data for analysis:

ceph status --format=json        > $DIR/status_${TAG}.json
ceph df --format=json            > $DIR/df_${TAG}.json
ceph osd df tree --format=json   > $DIR/osd_dftree_${TAG}.json
ceph osd dump --format=json      > $DIR/osd_dump_${TAG}.json
ceph pg dump_pools_json          > $DIR/pg_dump_${TAG}.json


Datasets were collected at these times:
a. before experiment started                  (*_ok_start.json)
b. soon after OSD network traffic dropped     (*_starting_down.json)
c. all OSDs down                              (*_all_down.json)
d. after firewall rules removed and OSDs restarted (*_osds_restarting.json)
e. after all OSDs up, cluster healthy         (*_healthy_again.json)

The json files we collected are attached to this bugzilla in max_avail_json.tar.gz


Actual results:

df_ok_start.json         "max_avail": 72158076207104
df_starting_down.json    "max_avail": 79563069587456
df_all_down.json         "max_avail": 96042674552832
df_osds_restarting.json  "max_avail": 72154544603136
df_healthy_again.json    "max_avail": 72153084985344

MAX_AVAIL increases when cluster isn't healthy after OSDs are lost.

After removing the iptables rules dropping all OSD network traffic and
restarting the down OSD containers, the MAX_AVAIL shrinks back to be close
to its original value.
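
(For reference: a minimal sketch of how the per-pool max_avail values can be pulled out of the attached df_*.json dumps. It assumes the nlohmann/json header-only library and the usual pools[].stats.max_avail layout of `ceph df --format=json`; it is not part of the original tooling.)

#include <fstream>
#include <iostream>
#include <nlohmann/json.hpp>

int main(int argc, char** argv) {
  // Usage: ./max_avail df_ok_start.json df_all_down.json ...
  for (int i = 1; i < argc; ++i) {
    std::ifstream f(argv[i]);
    nlohmann::json df;
    f >> df;  // one `ceph df --format=json` dump per file
    for (const auto& pool : df["pools"])  // assumed pools[].stats.max_avail layout
      std::cout << argv[i] << "  " << pool["name"].get<std::string>()
                << "  max_avail=" << pool["stats"]["max_avail"].get<uint64_t>()
                << "\n";
  }
  return 0;
}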

Expected results:

We expected MAX_AVAIL to decrease after losing 25% of the cluster's OSD
containers and storage devices, and then to return to its original value
once the OSDs are brought back online. Instead, we see the opposite
behavior.

Additional info:

Comment 1 Michael J. Kidd 2024-04-19 13:59:38 UTC
I've identified the code responsible for this discrepancy, and confirmed the math using the provided pg dump.

1. The 'raw_used_rate' is passed in to function `PGMapDigest::dump_object_stat_sum`, and should be equal to:
  - the number of copies for replicated pools,
  - or (K + M) / K for EC pools (e.g. 1.5 for a 4+2 profile).

2. At: https://github.com/ceph/ceph/blob/main/src/mon/PGMap.cc#L886
    raw_used_rate *= (float)(sum.num_object_copies - sum.num_objects_degraded) / sum.num_object_copies;

   - This applies a scaling factor equal to the fraction of object copies that are not degraded (relative to the total object copy count).
   - Using the 'all_down' pgdump for libvirt-pool:
     num_object_copies:    7287540
     num_objects_degraded: 1812426
     Scaling factor applied: 0.7512979688619205

3. The 'MAX_AVAIL' value is calculated at: https://github.com/ceph/ceph/blob/main/src/mon/PGMap.cc#L901
   auto avail_res = raw_used_rate ? avail / raw_used_rate : 0;
   
   - 'avail' is the raw available bytes
   - 'raw_used_rate' is now ~75% of what it was, so 'MAX_AVAIL' increases by ~33% (1 / 0.75)

   avail: ( min(osd_avail_kbytes) * num_osds ) - ( sum(osd_max_kbytes) * ( 1 - mon_osd_full_ratio ))
          (5597462260 * 40) - ( 250048839680 * ( 1 - 0.95 )) = 211396048416
   raw_used_rate: 3 * 0.7512979688619205 = 2.253893907
   max_avail: 211396048416 / 2.253893907 * 1024 = 96042476953095
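
The same figure can be reproduced outside the monitor. A minimal, self-contained sketch (illustration only, not the actual Ceph code) plugging in the constants quoted above from the all_down dumps:

#include <cstdio>

int main() {
  // Constants quoted above from the all_down dumps (libvirt-pool, 40 'IN' OSDs)
  const double num_object_copies    = 7287540;
  const double num_objects_degraded = 1812426;
  const double min_osd_avail_kb     = 5597462260;    // min(osd_avail_kbytes)
  const double num_osds             = 40;
  const double sum_osd_max_kb       = 250048839680;  // sum(osd_max_kbytes)
  const double mon_osd_full_ratio   = 0.95;
  double raw_used_rate              = 3;             // replicated pool, size=3

  // Scale factor from the dump_object_stat_sum line quoted above
  raw_used_rate *= (num_object_copies - num_objects_degraded) / num_object_copies;

  // 'avail' (KiB) per the formula above, then MAX_AVAIL in bytes
  double avail_kb  = (min_osd_avail_kb * num_osds)
                   - (sum_osd_max_kb * (1 - mon_osd_full_ratio));
  double max_avail = avail_kb / raw_used_rate * 1024;

  printf("raw_used_rate=%.9f max_avail=%.0f\n", raw_used_rate, max_avail);
  // prints ~2.253893907 and ~9.60e13, matching df_all_down.json
  return 0;
}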


I must admit, I don't understand the reason for the scaling factor logic, but the results do track with the code in place.

@Radek -- What are your thoughts on this logic?

Comment 2 Michael J. Kidd 2024-04-19 14:13:33 UTC
In thinking on this a bit further, I suspect the 'avail' value supplied by `PGMap::get_rule_avail` is expected to only include the 'UP' OSDs.
However, that function returns the total available bytes for all 'IN' OSDs.

If I repeat the math using only 'UP' OSDs, the MAX_AVAIL would only be 72031857714821, which tracks as expected.

A bit of background:
OSDs have two separate state flags:
- UP/DOWN: this corresponds to whether the OSD service is responding on the network
- IN/OUT: this corresponds to whether the OSD is still considered for data placement

Note that `PGMap::get_rule_avail` including all 'IN' OSDs fits the definition of the state flags.
The 10 OSDs taken 'DOWN' are not yet marked 'OUT', and thus should still be considered for data placement.
In my opinion, the scale factor should be removed to provide a more realistic 'MAX_AVAIL' calculation.
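
To make the UP/DOWN vs IN/OUT distinction concrete, here is a hypothetical sketch (the struct and function are illustrative only, not Ceph's actual osd_stat_t or PGMap::get_rule_avail) of summing raw availability over all 'IN' OSDs versus only those that are also 'UP':

#include <cstdint>
#include <vector>

// Hypothetical per-OSD record for illustration only; not Ceph's osd_stat_t.
struct OsdInfo {
  bool up;               // UP/DOWN: daemon reachable on the network
  bool in;               // IN/OUT: still considered for data placement
  uint64_t avail_bytes;  // raw free space reported by the OSD
};

// Sum raw availability over all 'IN' OSDs (what get_rule_avail effectively
// does today), or only over OSDs that are both 'IN' and 'UP' (require_up = true).
uint64_t sum_avail(const std::vector<OsdInfo>& osds, bool require_up) {
  uint64_t total = 0;
  for (const auto& o : osds) {
    if (!o.in) continue;                // OUT OSDs never count toward placement
    if (require_up && !o.up) continue;  // optionally skip DOWN-but-IN OSDs
    total += o.avail_bytes;
  }
  return total;
}

Repeating the Comment 1 arithmetic with the 'UP'-only sum is what yields the ~72 TB figure above.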

Comment 3 Michael J. Kidd 2024-04-19 14:52:59 UTC
I've submitted an upstream PR and tracker for this: 
https://github.com/ceph/ceph/pull/57003
https://tracker.ceph.com/issues/65591

Comment 18 errata-xmlrpc 2025-09-30 09:21:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 8.1 security, bug fix and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2025:17047

