2237880 – [5.3.z6 backport][cee/sd][BlueFS][RHCS 5.x] no BlueFS spillover health warning in RHCS 5.x

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	RADOS
Sub Component:
Version:	5.0
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	5.3z6
Assignee:	Radoslaw Zarzynski
QA Contact:	Pawan
Docs Contact:	Ranjini M N
URL:
Whiteboard:
Depends On:	2129414 2237881
Blocks:	2258797

Reported:	2023-09-07 13:20 UTC by Vikhyat Umrao
Modified:	2024-03-18 20:09 UTC
CC List:	26 users
Fixed In Version:	ceph-16.2.10-244.el8cp
Doc Type:	Bug Fix
Doc Text:	The detection code is reintroduced and spillover appears as expected. Previously, a refactor removed the spillover detection code, so spillover from the dedicated block.db device to the main block device was never detected. With this fix, the code is reintroduced and the spillover warning appears properly.
Clone Of:	2129414
Environment:
Last Closed:	2024-02-08 16:55:10 UTC
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	59340	None	None	None	2023-09-07 13:20:07 UTC
Red Hat Issue Tracker	RHCEPH-7340	None	None	None	2023-09-07 13:21:48 UTC
Red Hat Knowledge Base (Solution)	6977626	None	None	None	2024-03-18 20:09:22 UTC
Red Hat Product Errata	RHSA-2024:0745	None	None	None	2024-02-08 16:55:14 UTC

Description Vikhyat Umrao 2023-09-07 13:20:07 UTC

+++ This bug was initially created as a clone of Bug #2129414 +++

Description of problem:
In RHCS 5.x, no BlueFS spillover health warning is generated when RocksDB starts consuming space on the block (slower) device.

Version-Release number of selected component (if applicable): RHCS 5.0z4 and RHCS 5.2

How reproducible: Always

Steps to Reproduce:
1. Deploy a fresh RHCS 5 cluster, or upgrade a cluster from RHCS 4 to RHCS 5, with a small block DB size (for example, 10 MiB or 30 MiB)
   - For example:
~~~
service_type: osd
service_id: osd_nodeXY_paths
service_name: osd.osd_nodeXY_paths
placement:
  hosts:
  - nodeX
  - nodeY
spec:
  block_db_size: 10485760   # <---- 10 MiB
  data_devices:
    paths:
    - /dev/sdb
    - /dev/sdc
  db_devices:
    paths:
    - /dev/sdd
  filter_logic: AND
  objectstore: bluestore
~~~
2. Add some data into the cluster using RBD 
3. Collect the output of the below command and look for the "slow_used_bytes" parameter.
~~~
$ ceph daemon osd.<id> perf dump bluefs
~~~
   - If using non-colocated OSDs, also verify with the command below and look at the "SLOW" column
~~~
$ ceph daemon osd.<id> bluefs stats    
~~~

*NOTE*: non-colocated: OSDs having DB and Data on separate devices
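
The check in step 3 can be sketched in a few lines of Python. This is only an illustration, not part of Ceph: `bluefs_spillover_bytes` and `has_spillover` are hypothetical helper names, and treating any non-zero `slow_used_bytes` as spillover is an assumption based on the steps above. The sample counters are copied from the lab output in this bug.

```python
import json


def bluefs_spillover_bytes(perf_dump_text: str) -> int:
    """Return slow_used_bytes from `ceph daemon osd.<id> perf dump bluefs` output."""
    counters = json.loads(perf_dump_text)["bluefs"]
    return int(counters["slow_used_bytes"])


def has_spillover(perf_dump_text: str) -> bool:
    # Assumption: on a non-colocated OSD, any BlueFS data on the slow
    # device means RocksDB has spilled over from the DB device.
    return bluefs_spillover_bytes(perf_dump_text) > 0


# Trimmed sample taken from the perf dump attached to this bug.
perf_dump_sample = """
{
    "bluefs": {
        "db_total_bytes": 8380416,
        "db_used_bytes": 7340032,
        "slow_total_bytes": 10733223936,
        "slow_used_bytes": 786432
    }
}
"""

print("spillover bytes:", bluefs_spillover_bytes(perf_dump_sample))
print("spillover detected:", has_spillover(perf_dump_sample))
```

On a live node the text would come from running the command in step 3 and passing its output to the helper.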

Actual results: No BlueFS spillover health warning is shown.

Expected results: The BlueFS spillover health warning should be shown.


Additional info:

I tried to reproduce this issue in RHCS 4.2z4 and the BlueFS spillover health warning was generated as expected.

--- Additional comment from Kritik Sachdeva on 2022-09-23 17:02:23 UTC ---

Hello team,

Here are the results of the "perf dump" and "bluefs stats" commands from the lab environment.

+ bluefs stats
~~~
[ceph: root@node2 /]# ceph daemon osd.0 bluefs stats
1 : device size 0x7fe000 : using 0x700000(7 MiB)
2 : device size 0x27fc00000 : using 0x8137000(129 MiB)
RocksDBBlueFSVolumeSelector: wal_total:0, db_total:7969177, slow_total:10196562739, db_avail:0
Usage matrix:                        
DEV/LEV     WAL         DB          SLOW  <--      *           *           REAL        FILES       
LOG         0 B         4 MiB       0 B         0 B         0 B         688 KiB     1           
WAL         0 B         1 MiB       64 KiB      0 B         0 B         1.1 MiB     1           
DB          0 B         2 MiB       704 KiB     0 B         0 B         85 KiB      14          
SLOW        0 B         0 B         0 B         0 B         0 B         0 B         0           
TOTALS      0 B         7 MiB       768 KiB     0 B         0 B         0 B         16          
MAXIMUMS:
LOG         0 B         4 MiB       0 B         0 B         0 B         688 KiB     
WAL         0 B         1 MiB       64 KiB      0 B         0 B         1.1 MiB     
DB          0 B         3 MiB       768 KiB     0 B         0 B         121 KiB     
SLOW        0 B         0 B         0 B         0 B         0 B         0 B         
TOTALS      0 B         7 MiB       832 KiB     0 B         0 B         0 B  
~~~
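
The hexadecimal device sizes in the first two lines of the output above can be converted to bytes to cross-check them against `perf dump bluefs`. The sketch below is a hypothetical helper (`parse_devices` is not a Ceph tool) and assumes the line format shown in this output. Device 1 comes out as 8380416 bytes and device 2 as 10733223936 bytes, which match `db_total_bytes` and `slow_total_bytes` in the perf dump for the same OSD.

```python
import re

# Matches lines such as: "1 : device size 0x7fe000 : using 0x700000(7 MiB)"
DEVICE_LINE = re.compile(
    r"^(\d+) : device size (0x[0-9a-fA-F]+) : using (0x[0-9a-fA-F]+)"
)


def parse_devices(stats_text: str) -> dict:
    """Map BlueFS device id -> (size_bytes, used_bytes) from `bluefs stats` text."""
    devices = {}
    for line in stats_text.splitlines():
        match = DEVICE_LINE.match(line.strip())
        if match:
            dev_id = int(match.group(1))
            size = int(match.group(2), 16)  # hex string to bytes
            used = int(match.group(3), 16)
            devices[dev_id] = (size, used)
    return devices


# First two lines of the bluefs stats output above.
stats_sample = """1 : device size 0x7fe000 : using 0x700000(7 MiB)
2 : device size 0x27fc00000 : using 0x8137000(129 MiB)"""

bluefs_devices = parse_devices(stats_sample)
for dev_id, (size, used) in sorted(bluefs_devices.items()):
    print(f"device {dev_id}: size={size} bytes, used={used} bytes")
```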

+ perf dump bluefs
~~~
[ceph: root@node2 /]# ceph daemon osd.0 perf dump bluefs
{
    "bluefs": {
        "db_total_bytes": 8380416,
        "db_used_bytes": 7340032,
        "wal_total_bytes": 0,
        "wal_used_bytes": 0,
        "slow_total_bytes": 10733223936,
        "slow_used_bytes": 786432,      <----
        "num_files": 16,
        "log_bytes": 704512,
        "log_compactions": 0,
        "logged_bytes": 499712,
        "files_written_wal": 1,
        "files_written_sst": 4,
        "bytes_written_wal": 1564672,
        "bytes_written_sst": 20480,
        "bytes_written_slow": 139264,
        "max_bytes_wal": 0,
        "max_bytes_db": 7340032,
        "max_bytes_slow": 786432,
        "read_random_count": 41,
        "read_random_bytes": 11755,
        "read_random_disk_count": 1,
        "read_random_disk_bytes": 4168,
        "read_random_buffer_count": 40,
        "read_random_buffer_bytes": 7587,
        "read_count": 70,
        "read_bytes": 259859,
        "read_prefetch_count": 7,
        "read_prefetch_bytes": 7318,
        "read_zeros_candidate": 0,
        "read_zeros_errors": 0
    }
}
~~~
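
To compare these byte counters with the "bluefs stats" matrix, they can be converted to binary units. The helper below is a minimal sketch (`human` is a hypothetical name, and its rounding does not claim to match Ceph's own formatter). With the values above, `slow_used_bytes` of 786432 is 768 KiB, the same figure as the SLOW column in the TOTALS row, and `db_used_bytes` of 7340032 is 7 MiB.

```python
def human(num_bytes: float) -> str:
    """Format a byte count using binary units (KiB, MiB, ...)."""
    value = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == "TiB":
            return f"{value:g} {unit}"
        value /= 1024


# Counters copied from the perf dump above.
bluefs_counters = {
    "db_total_bytes": 8380416,
    "db_used_bytes": 7340032,
    "slow_used_bytes": 786432,
}

for name, value in bluefs_counters.items():
    print(f"{name}: {value} bytes = {human(value)}")
```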

+ Cluster health status
~~~
# ceph health detail
HEALTH_OK
~~~

+ ceph osd df tree
~~~
[ceph: root@node2 /]# ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA     OMAP  META     AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME     
-1         0.03918         -  40 GiB  1.9 GiB  1.9 GiB   0 B   39 MiB   38 GiB  4.82  1.00    -          root default  
-3         0.01959         -  20 GiB  987 MiB  965 MiB   0 B   20 MiB   19 GiB  4.82  1.00    -              host node2
 0    hdd  0.00980   1.00000  10 GiB  137 MiB  128 MiB   0 B  7.8 MiB  9.9 GiB  1.34  0.28    2      up          osd.0 
 1    hdd  0.00980   1.00000  10 GiB  850 MiB  837 MiB   0 B   12 MiB  9.2 GiB  8.29  1.72    7      up          osd.1 
-5         0.01959         -  20 GiB  987 MiB  965 MiB   0 B   20 MiB   19 GiB  4.82  1.00    -              host node3
 2    hdd  0.00980   1.00000  10 GiB  252 MiB  241 MiB   0 B   10 MiB  9.8 GiB  2.46  0.51    2      up          osd.2 
 3    hdd  0.00980   1.00000  10 GiB  735 MiB  725 MiB   0 B  9.1 MiB  9.3 GiB  7.17  1.49    7      up          osd.3 
                       TOTAL  40 GiB  1.9 GiB  1.9 GiB   0 B   39 MiB   38 GiB  4.82                                   
MIN/MAX VAR: 0.28/1.72  STDDEV: 2.97
~~~

Regards,
Kritik Sachdeva

--- Additional comment from Adam Kupczyk on 2022-09-26 13:52:37 UTC ---

I confirm that code to set spillover health warn has been removed.

This is an accidental byproduct of an improvement https://github.com/ceph/ceph/pull/30838 .

A new place where this trigger should be added must be devised.

--- Additional comment from Vikhyat Umrao on 2022-09-27 17:32:19 UTC ---

Good find, Kritik!

Adam - what is our plan? Do you think it can be fixed soon, as this is a regression?
Also, if you have created an upstream tracker, please let me know and I will attach it to the bug.

--- Additional comment from RHEL Program Management on 2022-09-27 17:32:29 UTC ---

This bug report has Keywords: Regression or TestBlocker.

Since no regressions or test blockers are allowed between releases, it is being proposed as a blocker for this release.

Please resolve/triage ASAP.

--- Additional comment from Kritik Sachdeva on 2023-02-16 04:10:41 UTC ---

Hello team,

After discussing with Michael & Vikhyat, I am attaching the test results from an RHCS 5.3 environment, where the issue is still present.

## Steps to reproduce  ##
--------------------------

1. Deploy a fresh RHCS 5 cluster with a small block DB size (for example, 10 MiB or 30 MiB)
   - For example:
~~~
service_type: osd
service_id: osd_nodeXY_paths
service_name: osd.osd_nodeXY_paths
placement:
  hosts:
  - nodeX
  - nodeY
spec:
  block_db_size: 10485760   # <---- 10 MiB
  data_devices:
    paths:
    - /dev/sdb
    - /dev/sdc
  db_devices:
    paths:
    - /dev/sdd
  filter_logic: AND
  objectstore: bluestore
~~~
2. Add some data into the cluster using RBD 
3. Collect the output of the below command and look for the "slow_used_bytes" parameter.
~~~
$ ceph daemon osd.<id> perf dump bluefs
~~~
   - If using non-colocated OSDs, also verify with the command below and look at the "SLOW" column
~~~
$ ceph daemon osd.<id> bluefs stats    
~~~
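
The `block_db_size` value in the spec from step 1 is given in bytes. As a small worked example, the sizes mentioned there convert as follows (`block_db_size_bytes` is a hypothetical helper used only for this arithmetic):

```python
MIB = 1024 * 1024  # bytes per MiB


def block_db_size_bytes(size_mib: int) -> int:
    """Convert a block DB size in MiB to the byte value used in the OSD spec."""
    return size_mib * MIB


for size_mib in (10, 30):
    print(f"block_db_size: {block_db_size_bytes(size_mib)}  # {size_mib} MiB")
```

The 10 MiB case gives 10485760, the value used in the spec above.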

=======================================================================================================

## Testing Results ##
---------------------

[root@node1ceph5 ~]# ceph -s
  cluster:
    id:     3357af48-acf0-11ed-a09d-001a4a00064d
    health: HEALTH_WARN
            1 slow ops, oldest one blocked for 302 sec, mon.node3ceph5 has slow ops
 
  services:
    mon: 3 daemons, quorum node1ceph5,node2ceph5,node3ceph5 (age 12m)
    mgr: node1ceph5.msurmv(active, since 23m), standbys: node2ceph5.bvkilb
    osd: 5 osds: 5 up (since 3m), 5 in (since 3m)
 
  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   45 MiB used, 110 GiB / 110 GiB avail
    pgs:     1 active+clean

[root@node1ceph5 ~]# ceph versions
{
    "mon": {
        "ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)": 2
    },
    "osd": {
        "ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)": 5
    },
    "mds": {},
    "overall": {
        "ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)": 10
    }
}
[root@node1ceph5 ~]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         0.10748  root default                                  
-7         0.02930      host node1ceph5                           
 4    hdd  0.02930          osd.4            up   1.00000  1.00000
-5         0.03909      host node2ceph5                           
 1    hdd  0.02930          osd.1            up   1.00000  1.00000
 3    hdd  0.00980          osd.3            up   1.00000  1.00000
-3         0.03909      host node3ceph5                           
 0    hdd  0.02930          osd.0            up   1.00000  1.00000
 2    hdd  0.00980          osd.2            up   1.00000  1.00000


[ceph: root@node2ceph5 /]# ceph daemon osd.1 perf dump bluefs
{
    "bluefs": {
        "db_total_bytes": 8380416,
        "db_used_bytes": 7340032,
        "wal_total_bytes": 0,
        "wal_used_bytes": 0,
        "slow_total_bytes": 32208060416,
        "slow_used_bytes": 786432,     <----- Non-empty value
        "num_files": 12,
        "log_bytes": 307200,
        "log_compactions": 0,
        "logged_bytes": 139264,
        "files_written_wal": 1,
        "files_written_sst": 3,
        "bytes_written_wal": 491520,
        "bytes_written_sst": 8192,
        "bytes_written_slow": 540672,
        "max_bytes_wal": 0,
        "max_bytes_db": 7340032,
        "max_bytes_slow": 720896,
        "read_random_count": 18,
        "read_random_bytes": 3522,
        "read_random_disk_count": 0,
        "read_random_disk_bytes": 0,
        "read_random_buffer_count": 18,
        "read_random_buffer_bytes": 3522,
        "read_count": 57,
        "read_bytes": 218464,
        "read_prefetch_count": 3,
        "read_prefetch_bytes": 3328,
        "read_zeros_candidate": 0,
        "read_zeros_errors": 0
    }
}



[ceph: root@node2ceph5 /]# ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL    USED  RAW USED  %RAW USED
hdd    110 GiB  110 GiB  45 MiB    45 MiB       0.04
TOTAL  110 GiB  110 GiB  45 MiB    45 MiB       0.04
 
--- POOLS ---
POOL                   ID  PGS  STORED  OBJECTS  USED  %USED  MAX AVAIL
device_health_metrics   1    1     0 B        0   0 B      0     35 GiB
[ceph: root@node2ceph5 /]# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP  META     AVAIL    %USE  VAR   PGS  STATUS
 4    hdd  0.02930   1.00000   30 GiB  8.9 MiB  248 KiB   0 B  7.7 MiB   30 GiB  0.03  0.73    1      up
 1    hdd  0.02930   1.00000   30 GiB  9.0 MiB  248 KiB   0 B  7.8 MiB   30 GiB  0.03  0.73    1      up
 3    hdd  0.00980   1.00000   10 GiB  8.9 MiB  248 KiB   0 B  7.7 MiB   10 GiB  0.09  2.19    0      up
 0    hdd  0.02930   1.00000   30 GiB  9.1 MiB  248 KiB   0 B  7.8 MiB   30 GiB  0.03  0.74    1      up
 2    hdd  0.00980   1.00000   10 GiB  9.0 MiB  248 KiB   0 B  7.8 MiB   10 GiB  0.09  2.20    0      up
                       TOTAL  110 GiB   45 MiB  1.2 MiB   0 B   39 MiB  110 GiB  0.04                   
MIN/MAX VAR: 0.73/2.20  STDDEV: 0.03


[ceph: root@node2ceph5 /]# ceph daemon osd.1 bluefs stats    
1 : device size 0x7fe000 : using 0x700000(7 MiB)
2 : device size 0x77fc00000 : using 0x100000(1 MiB)
RocksDBBlueFSVolumeSelector: wal_total:0, db_total:7969177, slow_total:30597657395, db_avail:0
Usage matrix:
DEV/LEV     WAL         DB          SLOW  <---  *           *           REAL        FILES       
LOG         0 B         4 MiB       0 B         0 B         0 B         300 KiB     1           
WAL         0 B         0 B         384 KiB     0 B         0 B         372 KiB     1           
DB          0 B         3 MiB       384 KiB     0 B         0 B         76 KiB      10          
SLOW        0 B         0 B         0 B         0 B         0 B         0 B         0           
TOTALS      0 B         7 MiB       768 KiB     0 B         0 B         0 B         12          
MAXIMUMS:
LOG         0 B         4 MiB       0 B         0 B         0 B         300 KiB     
WAL         0 B         0 B         384 KiB     0 B         0 B         372 KiB     
DB          0 B         3 MiB       512 KiB     0 B         0 B         112 KiB     
SLOW        0 B         0 B         0 B         0 B         0 B         0 B         
TOTALS      0 B         7 MiB       768 KiB     0 B         0 B         0 B    


// Still no health warning for BlueFS spillover    //

[ceph: root@node2ceph5 /]# ceph -s
  cluster:
    id:     3357af48-acf0-11ed-a09d-001a4a00064d
    health: HEALTH_WARN
            1 slow ops, oldest one blocked for 627 sec, mon.node3ceph5 has slow ops
 
  services:
    mon: 3 daemons, quorum node1ceph5,node2ceph5,node3ceph5 (age 18m)
    mgr: node1ceph5.msurmv(active, since 28m), standbys: node2ceph5.bvkilb
    osd: 5 osds: 5 up (since 8m), 5 in (since 8m)
 
  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   45 MiB used, 110 GiB / 110 GiB avail
    pgs:     1 active+clean




// Even after restarting the OSDs, the bluefs stats output is similar and there is no health warning   //

[ceph: root@node2ceph5 /]# ceph orch restart osd.osd_node23_paths
Scheduled to restart osd.1 on host 'node2ceph5'
Scheduled to restart osd.3 on host 'node2ceph5'
Scheduled to restart osd.0 on host 'node3ceph5'
Scheduled to restart osd.2 on host 'node3ceph5'
[ceph: root@node2ceph5 /]# 
[ceph: root@node2ceph5 /]# ceph daemon osd.1 perf dump bluefs
{
    "bluefs": {
        "db_total_bytes": 8380416,
        "db_used_bytes": 7340032,
        "wal_total_bytes": 0,
        "wal_used_bytes": 0,
        "slow_total_bytes": 32208060416,
        "slow_used_bytes": 983040,     <---- Non-empty value
        "num_files": 19,
        "log_bytes": 475136,
        "log_compactions": 0,
        "logged_bytes": 143360,
        "files_written_wal": 1,
        "files_written_sst": 7,
        "bytes_written_wal": 311296,
        "bytes_written_sst": 49152,
        "bytes_written_slow": 176128,
        "max_bytes_wal": 0,
        "max_bytes_db": 7340032,
        "max_bytes_slow": 1376256,
        "read_random_count": 59,
        "read_random_bytes": 38730,
        "read_random_disk_count": 1,
        "read_random_disk_bytes": 20621,
        "read_random_buffer_count": 58,
        "read_random_buffer_bytes": 18109,
        "read_count": 117,
        "read_bytes": 812174,
        "read_prefetch_count": 10,
        "read_prefetch_bytes": 10286,
        "read_zeros_candidate": 0,
        "read_zeros_errors": 0
    }
}

[ceph: root@node2ceph5 /]# ceph daemon osd.1 bluefs stats
1 : device size 0x7fe000 : using 0x700000(7 MiB)
2 : device size 0x77fc00000 : using 0x15a000(1.4 MiB)
RocksDBBlueFSVolumeSelector: wal_total:0, db_total:7969177, slow_total:30597657395, db_avail:0
Usage matrix:
DEV/LEV     WAL         DB          SLOW  <---   *           *           REAL        FILES       
LOG         0 B         4 MiB       0 B         0 B         0 B         476 KiB     1           
WAL         0 B         1 MiB       64 KiB      0 B         0 B         239 KiB     1           
DB          0 B         2 MiB       896 KiB     0 B         0 B         111 KiB     17          
SLOW        0 B         0 B         0 B         0 B         0 B         0 B         0           
TOTALS      0 B         7 MiB       960 KiB     0 B         0 B         0 B         19          
MAXIMUMS:
LOG         0 B         4 MiB       0 B         0 B         0 B         476 KiB     
WAL         0 B         1 MiB       448 KiB     0 B         0 B         417 KiB     
DB          0 B         3 MiB       960 KiB     0 B         0 B         147 KiB     
SLOW        0 B         0 B         0 B         0 B         0 B         0 B         
TOTALS      0 B         7 MiB       1.4 MiB     0 B         0 B         0 B 



[ceph: root@node2ceph5 /]# ceph -s
  cluster:
    id:     3357af48-acf0-11ed-a09d-001a4a00064d
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum node1ceph5,node2ceph5,node3ceph5 (age 22m)
    mgr: node1ceph5.msurmv(active, since 32m), standbys: node2ceph5.bvkilb
    osd: 5 osds: 5 up (since 50s), 5 in (since 12m)
 
  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   47 MiB used, 110 GiB / 110 GiB avail
    pgs:     1 active+clean
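
For reference, the two `perf dump bluefs` outputs for osd.1 above (before and after the restart) can be compared counter by counter. This is a minimal sketch with a hypothetical helper name (`counter_delta`). The values are copied from the outputs in this comment.

```python
def counter_delta(before: dict, after: dict, keys) -> dict:
    """Return after - before for each selected counter."""
    return {key: after[key] - before[key] for key in keys}


# Selected osd.1 counters before the OSD restart.
before_restart = {"slow_used_bytes": 786432, "max_bytes_slow": 720896, "num_files": 12}
# The same counters after `ceph orch restart`.
after_restart = {"slow_used_bytes": 983040, "max_bytes_slow": 1376256, "num_files": 19}

restart_delta = counter_delta(before_restart, after_restart, before_restart.keys())
for key, diff in restart_delta.items():
    print(f"{key}: {diff:+d}")
```

The slow device usage grew by 196608 bytes (192 KiB) across the restart, while the cluster still reported no spillover warning.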


Regards,
Kritik Sachdeva

--- Additional comment from Adam Kupczyk on 2023-03-29 13:36:58 UTC ---

I think we will be fixing the problem with this: https://github.com/ceph/ceph/pull/49987 + backports.

--- Additional comment from Vikhyat Umrao on 2023-08-30 17:17:07 UTC ---

Harsh, if you want, you can mark this verified, as you already saw this warning in the workload-dfg clusters :)

--- Additional comment from Vikhyat Umrao on 2023-08-30 17:17:49 UTC ---

(In reply to Vikhyat Umrao from comment #25)
> Harsh, if you want you can mark this verified as you already saw in
> workload-dfg clusters this warning :)

I mean when the bz goes ON_QA :)

--- Additional comment from errata-xmlrpc on 2023-08-30 19:40:00 UTC ---

This bug has been added to advisory RHBA-2023:118213 by Ken Dreyer (kdreyer)

--- Additional comment from errata-xmlrpc on 2023-08-30 19:40:00 UTC ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2023:118213-01
https://errata.devel.redhat.com/advisory/118213

--- Additional comment from Pawan on 2023-08-31 15:52:00 UTC ---

Verified the fix.

Deployed an OSD with data + db devices and observed the spillover. A very small DB device was created intentionally to check that the warning is generated.

[ceph: root@ceph-pdhiran-66-bq3qsx-node1-installer /]# ceph orch daemon add osd ceph-pdhiran-66-bq3qsx-node9:data_devices=/dev/data_vg/data-lv1,db_devices=/dev/db_vg/db-lv1
Created osd(s) 1 on host 'ceph-pdhiran-66-bq3qsx-node9'
[ceph: root@ceph-pdhiran-66-bq3qsx-node1-installer /]# ceph -s
  cluster:
    id:     c4a335be-47f3-11ee-a24b-fa163e51ef57
    health: HEALTH_WARN
            1 OSD(s) experiencing BlueFS spillover
            1 stray daemon(s) not managed by cephadm
            Degraded data redundancy: 9555/120660 objects degraded (7.919%), 27 pgs degraded

  services:
    mon: 5 daemons, quorum ceph-pdhiran-66-bq3qsx-node1-installer,ceph-pdhiran-66-bq3qsx-node2,ceph-pdhiran-66-bq3qsx-node11,ceph-pdhiran-66-bq3qsx-node8,ceph-pdhiran-66-bq3qsx-node6 (age 3h)
    mgr: ceph-pdhiran-66-bq3qsx-node1-installer.cxxbck(active, since 4h), standbys: ceph-pdhiran-66-bq3qsx-node6.twpvla, ceph-pdhiran-66-bq3qsx-node2.mnzoos
    osd: 17 osds: 17 up (since 17s), 17 in (since 3h); 28 remapped pgs

  data:
    pools:   2 pools, 129 pgs
    objects: 40.22k objects, 38 GiB
    usage:   118 GiB used, 306 GiB / 425 GiB avail
    pgs:     9555/120660 objects degraded (7.919%)
             247/120660 objects misplaced (0.205%)
             101 active+clean
             27  active+recovery_wait+undersized+degraded+remapped
             1   active+recovering+undersized+remapped

  io:
    recovery: 78 MiB/s, 79 objects/s

[ceph: root@ceph-pdhiran-66-bq3qsx-node1-installer /]# ceph health detail
HEALTH_WARN 1 OSD(s) experiencing BlueFS spillover; 1 stray daemon(s) not managed by cephadm; Degraded data redundancy: 8952/120660 objects degraded (7.419%), 25 pgs degraded
[WRN] BLUEFS_SPILLOVER: 1 OSD(s) experiencing BlueFS spillover
     osd.1 spilled over 65 MiB metadata from 'db' device (3 MiB used of 4.0 MiB) to slow device


# ceph version
ceph version 18.2.0-2.el9cp (cbed2329dbd9e1c06cf77afcfad901ae16cc5e6a) reef (stable)
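
For scripted verification, the BLUEFS_SPILLOVER detail line in the `ceph health detail` output above can be parsed as below. This is a sketch only: `parse_spillover` is a hypothetical helper, and the regular expression assumes the message wording shown in this comment, which may differ in other releases.

```python
import re

# Assumed message format, copied from the health detail output in this comment.
SPILLOVER_LINE = re.compile(
    r"osd\.(\d+) spilled over (.+?) metadata from 'db' device "
    r"\((.+?) used of (.+?)\) to slow device"
)


def parse_spillover(health_detail_text: str) -> list:
    """Return one dict per OSD reported as spilling over."""
    results = []
    for line in health_detail_text.splitlines():
        match = SPILLOVER_LINE.search(line)
        if match:
            results.append({
                "osd": int(match.group(1)),
                "spilled": match.group(2),
                "db_used": match.group(3),
                "db_total": match.group(4),
            })
    return results


health_sample = """[WRN] BLUEFS_SPILLOVER: 1 OSD(s) experiencing BlueFS spillover
     osd.1 spilled over 65 MiB metadata from 'db' device (3 MiB used of 4.0 MiB) to slow device
"""

spilled_osds = parse_spillover(health_sample)
print(spilled_osds)
```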

--- Additional comment from Pawan on 2023-09-07 03:18:33 UTC ---

Hello Neha, Vikhyat,  Do we have backports of this bug, for the issue to be fixed in 5.x & 6.x? I feel that we should include this fix in other releases as well.

--- Additional comment from Vikhyat Umrao on 2023-09-07 13:18:31 UTC ---

(In reply to Pawan from comment #30)
> Hello Neha, Vikhyat,  Do we have backports of this bug, for the issue to be
> fixed in 5.x & 6.x? I feel that we should include this fix in other releases
> as well.

I see that upstream we do have backports for Pacific and Quincy, so we could do it for 5.3.z and 6.1z. I agree it would be needed. Thank you for highlighting it.

Comment 2 Vikhyat Umrao 2023-09-07 13:21:04 UTC

Pacific backport - https://github.com/ceph/ceph/pull/50932

Comment 9 errata-xmlrpc 2024-02-08 16:55:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 Security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:0745
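
To check whether a daemon runs a build at or after the "Fixed In Version" of this bug (ceph-16.2.10-244.el8cp), the version strings from `ceph versions` can be compared numerically. This is a hedged sketch: `build_tuple` and `has_fix` are hypothetical helpers, and the tuple comparison is assumed to be meaningful only within the same 16.2.10 el8cp build stream.

```python
import re

FIXED_IN = "ceph-16.2.10-244.el8cp"  # "Fixed In Version" field of this bug


def build_tuple(version_text: str) -> tuple:
    """Extract (major, minor, patch, release) from a Ceph version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)-(\d+)", version_text)
    if match is None:
        raise ValueError(f"no version found in {version_text!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(version_text: str, fixed_in: str = FIXED_IN) -> bool:
    # Assumption: builds in the same stream sort by their release number.
    return build_tuple(version_text) >= build_tuple(fixed_in)


# Version string reported by the RHCS 5.3 test cluster earlier in this bug.
running = "ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)"
print(build_tuple(running), "has fix:", has_fix(running))
```

For the 16.2.10-94 build used in the RHCS 5.3 reproduction, this reports that the fix is not yet included.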
