Bug 2274175 - With Replica-1 enabled, replicated pool is spreading PGs across all OSDs
Summary: With Replica-1 enabled, replicated pool is spreading PGs across all OSDs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.16
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Malay Kumar parida
QA Contact: Aviad Polak
URL:
Whiteboard:
Duplicates: 2254344
Depends On:
Blocks: 2291321
 
Reported: 2024-04-09 14:59 UTC by Aviad Polak
Modified: 2024-08-26 11:59 UTC
CC List: 4 users

Fixed In Version: 4.16.0-126
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2291321
Environment:
Last Closed: 2024-07-17 13:18:41 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-ci pull 9501 0 None Merged Replica1 tests 2024-08-26 11:59:45 UTC
Github red-hat-storage ocs-operator pull 2615 0 None open Set deviceClasses to avoid replicated pool spreading PGs across all OSDs 2024-06-07 12:18:18 UTC
Github red-hat-storage ocs-operator pull 2662 0 None open Bug 2274175: Set deviceClasses to avoid replicated pool spreading PGs across all OSDs 2024-06-11 13:19:36 UTC
Red Hat Product Errata RHSA-2024:4591 0 None None None 2024-07-17 13:18:48 UTC

Description Aviad Polak 2024-04-09 14:59:05 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
While the replica-1 OSDs have the correct device class, the pre-existing OSDs kept the "ssd" class instead of getting the "replicated" class:

sh-5.1$ ceph osd df tree
ID   CLASS       WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP  META     AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME                                           
 -1              18.00000         -  18 TiB  430 MiB  274 MiB   0 B  155 MiB   18 TiB  0.00  1.00    -          root default                                        
 -5              18.00000         -  18 TiB  430 MiB  274 MiB   0 B  155 MiB   18 TiB  0.00  1.00    -              region us-east-1                                
-14               6.00000         -   6 TiB  149 MiB   91 MiB   0 B   58 MiB  6.0 TiB  0.00  1.04    -                  zone us-east-1a                             
-13               2.00000         -   2 TiB   48 MiB   15 MiB   0 B   32 MiB  2.0 TiB  0.00  1.00    -                      host ocs-deviceset-gp3-csi-2-data-05ggvh
  2         ssd   2.00000   1.00000   2 TiB   48 MiB   15 MiB   0 B   32 MiB  2.0 TiB  0.00  1.00   33      up                  osd.2                               
-41               2.00000         -   2 TiB   53 MiB   37 MiB   0 B   16 MiB  2.0 TiB  0.00  1.11    -                      host us-east-1a-data-0knkjg             
  3  us-east-1a   2.00000   1.00000   2 TiB   53 MiB   37 MiB   0 B   16 MiB  2.0 TiB  0.00  1.11   44      up                  osd.3                               
-61               2.00000         -   2 TiB   48 MiB   39 MiB   0 B  9.5 MiB  2.0 TiB  0.00  1.01    -                      host us-east-1a-data-1v5fwd             
  7  us-east-1a   2.00000   1.00000   2 TiB   48 MiB   39 MiB   0 B  9.5 MiB  2.0 TiB  0.00  1.01   36      up                  osd.7                               
 -4               6.00000         -   6 TiB  132 MiB   91 MiB   0 B   41 MiB  6.0 TiB  0.00  0.92    -                  zone us-east-1b                             
 -3               2.00000         -   2 TiB   65 MiB   43 MiB   0 B   22 MiB  2.0 TiB  0.00  1.36    -                      host ocs-deviceset-gp3-csi-1-data-0d98jz
  0         ssd   2.00000   1.00000   2 TiB   65 MiB   43 MiB   0 B   22 MiB  2.0 TiB  0.00  1.36   35      up                  osd.0                               
-51               2.00000         -   2 TiB   35 MiB   24 MiB   0 B   10 MiB  2.0 TiB  0.00  0.73    -                      host us-east-1b-data-0jp5x5             
  5  us-east-1b   2.00000   1.00000   2 TiB   35 MiB   24 MiB   0 B   10 MiB  2.0 TiB  0.00  0.73   40      up                  osd.5                               
-56               2.00000         -   2 TiB   32 MiB   24 MiB   0 B  8.4 MiB  2.0 TiB  0.00  0.67    -                      host us-east-1b-data-1hfftf             
  6  us-east-1b   2.00000   1.00000   2 TiB   32 MiB   24 MiB   0 B  8.4 MiB  2.0 TiB  0.00  0.67   38      up                  osd.6                               
-10               6.00000         -   6 TiB  149 MiB   92 MiB   0 B   57 MiB  6.0 TiB  0.00  1.04    -                  zone us-east-1c                             
 -9               2.00000         -   2 TiB   73 MiB   40 MiB   0 B   32 MiB  2.0 TiB  0.00  1.52    -                      host ocs-deviceset-gp3-csi-0-data-0mgxqb
  1         ssd   2.00000   1.00000   2 TiB   73 MiB   40 MiB   0 B   32 MiB  2.0 TiB  0.00  1.52   38      up                  osd.1                               
-46               2.00000         -   2 TiB   33 MiB   18 MiB   0 B   16 MiB  2.0 TiB  0.00  0.70    -                      host us-east-1c-data-0lxvf6             
  4  us-east-1c   2.00000   1.00000   2 TiB   33 MiB   18 MiB   0 B   16 MiB  2.0 TiB  0.00  0.70   40      up                  osd.4                               
-66               2.00000         -   2 TiB   43 MiB   34 MiB   0 B  8.8 MiB  2.0 TiB  0.00  0.90    -                      host us-east-1c-data-1rm7lr             
  8  us-east-1c   2.00000   1.00000   2 TiB   43 MiB   34 MiB   0 B  8.8 MiB  2.0 TiB  0.00  0.90   38      up                  osd.8                               
                              TOTAL  18 TiB  430 MiB  274 MiB   0 B  155 MiB   18 TiB  0.00                                                                         
MIN/MAX VAR: 0.67/1.52  STDDEV: 0


Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
The impact was found when trying to delete replica-1 and revert the cluster back to replica-3.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Steps to Reproduce:
1. Enable replica-1 (a sketch of how this is typically enabled follows below)
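
For reference, the replica-1 (non-resilient pools) feature is enabled through the StorageCluster CR. A minimal sketch is below; the field names are assumptions based on the ODF StorageCluster API and should be verified against the installed CRD:

  # Sketch: enabling replica-1 (non-resilient pools) on the StorageCluster.
  # Field names are assumptions and should be checked against the CRD.
  apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    name: ocs-storagecluster
    namespace: openshift-storage
  spec:
    managedResources:
      cephNonResilientPools:
        enable: true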


Actual results:


Expected results:


Additional info:

Comment 4 Travis Nielsen 2024-04-10 22:52:48 UTC
When the replica 1 feature is enabled, a couple of things go wrong:
1. The crushDeviceClass property on the storageClassDeviceSets is being updated to "replicated". However, the OSDs are **not** getting updated from their default deviceClass of ssd, as seen in the crush tree above. This is by design in Ceph: the device class of an existing OSD is not changed automatically, to prevent accidental updates. The upstream Rook issue [1] is open to consider supporting updates to the deviceClasses on the OSDs.
2. The default pools are updating their device classes to "replicated". However, their crush rules are **not** getting updated because the "replicated" deviceClass is not available on any OSD; the update fails silently with error [2], which was not even being logged by the Rook operator. This will be fixed by [3], which fails the reconcile if the deviceClass is not available on any OSD.

The user impact is that the main replicated pools are using all OSDs instead of only the expected OSDs that are independent from the replica 1 OSDs. During normal operation the user won't notice any issues other than the replicated pools consuming some space on the replica 1 OSDs. 

This issue would be very impactful if disabling the replica 1 feature were supported and the replica 1 OSDs were purged. In that case, the replicated pools could end up with lost PGs, as we noticed when QE (Aviad) was trying to clean up the test cluster and it was left with unclean PGs. This is a data loss scenario.

Fixing issue [1] is going to take a bit longer to implement, and updating the deviceClass on the OSDs could be a bit disruptive anyway. For a fix that is simple and has immediate impact, the OCS operator can set the deviceClass of the OSDs to "ssd". This is the default deviceClass in all clusters as far as I have seen. But to ensure an existing deviceClass is used, the OCS operator should confirm that the value is present in the CephCluster CR status. If replica 1 has already been enabled, the entries will likely be:
  status:
    storage:
      deviceClasses:
      - name: ssd
      - name: us-east-1a
      - name: us-east-1b
      - name: us-east-1c

If replica 1 was enabled in a clean cluster, or if the cluster was expanded, then the "replicated" device class would be set on the OSDs and appear in that list. In that case, we do need to keep "replicated" instead of "ssd". (A concrete sketch of a pinned pool deviceClass follows at the end of this comment.)

Since the replica 1 feature is in GA already, let's consider backporting to 4.15 after we confirm the solution is solid. Customers using replica 1 may not have noticed it, but it's an important fix for data integrity.


[1] https://github.com/rook/rook/issues/14056
[2] EINVAL: device class replicated does not exist: exit status 22
[3] https://github.com/rook/rook/pull/14057
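
To make the direction above concrete, here is a minimal sketch of what a default block pool with an explicitly pinned deviceClass could look like; the pool name and values are illustrative assumptions, not taken from the actual fix:

  # Illustrative sketch only: pin the pool's deviceClass to the class the
  # existing OSDs actually carry ("ssd" here), so its crush rule stops
  # falling back to all OSDs. The pool name is an assumption.
  apiVersion: ceph.rook.io/v1
  kind: CephBlockPool
  metadata:
    name: ocs-storagecluster-cephblockpool
    namespace: openshift-storage
  spec:
    failureDomain: zone
    replicated:
      size: 3
    deviceClass: ssd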

Comment 5 Malay Kumar parida 2024-04-17 05:04:19 UTC
@tnielsen, as per our discussion I thought through the implementation, but one issue that comes to mind is that the deviceClass field for any OSD is completely customizable, so the value would not necessarily always have been "ssd". How do we account for the case where a customer has multiple sets of OSDs with different deviceClasses?

Comment 6 Malay Kumar parida 2024-04-17 06:38:53 UTC
Also, there was another discussion about the deviceClass we would use, here: https://bugzilla.redhat.com/show_bug.cgi?id=2254344#c19

Comment 7 Travis Nielsen 2024-04-17 18:39:08 UTC
(In reply to Malay Kumar parida from comment #5)
> @tnielsen, as per our discussion I thought through the implementation,
> but one issue that comes to mind is that the deviceClass field for any OSD
> is completely customizable, so the value would not necessarily always have
> been "ssd". How do we account for the case where a customer has multiple
> sets of OSDs with different deviceClasses?

In the case where the customer has defined the deviceClass, they must have also set the deviceClass on the pools, right? So in that case we shouldn't need to worry about the default deviceClass? It's already determined, correct?

If they haven't set the deviceClass on the OSDs or the pools, then we expect only a single deviceClass to be found in the CephCluster status (other than the replica 1 device classes). If there is more than one deviceClass, then let's just not set the deviceClass on the pool by default. They would get an error in that case about the overlapping roots. To fix it, they would need to specify the deviceClass for the pools, which is already possible to configure, right?

Comment 10 Travis Nielsen 2024-05-02 23:09:13 UTC
To summarize, I believe we need the following changes in OCS operator:

1. If there are *no* deviceClasses found in the status on the CephCluster CR [1]
    a. Keep the deviceClass blank on the pools [2]
    b. If the replica1 feature is enabled, set the deviceClass to "replicated" on the pools [2] (to maintain today's behavior of green-field clusters)
2. If there is *one* deviceClass found in the status [1]
    a. Set the deviceClass to that value on the pools [2]
3. If there are *four* deviceClasses found in the status [1], and *three* of them are zone names for the replica 1 feature:
    a. Set the deviceClass on the pools [2] to the one value that is not from the replica 1 device classes
4. If there is any other set of deviceClasses in the status [1], leave the deviceClass blank, as the OCS operator does not know how to set it
    a. The admin needs to set the deviceClass on the pools [2] to avoid the error of multiple roots in the crush map. OCS operator should respect this desired value and not overwrite it. (Perhaps this means they need to disable the reconcile of blockpools, filesystems, and objectstores; a sketch of that workaround follows the footnotes below. Setting custom deviceClasses is an advanced scenario and less common, so that workaround seems sufficient.)

Rook will also need to complete the two fixes mentioned in Comment 4.

[1] status:
      storage:
        deviceClasses:
          ssd | replicated | whatever

[2] "pools" refers to pools created as part of the built-in CRs created by OCS operator: CephBlockPool, CephFilesystem, and CephObjectStore CRs

Comment 15 Malay Kumar parida 2024-05-15 17:50:05 UTC
@Travis, regarding comment 10: for the third scenario we can't assume we always have 3 replica-1 pools/OSDs; if the customer has more nodes, for example 4, we can have 4 replica-1 pools/OSDs. So we have to modify condition 3.

Comment 16 Travis Nielsen 2024-05-15 18:28:35 UTC
(In reply to Malay Kumar parida from comment #15)
> @Travis, regarding comment 10: for the third scenario we can't assume we
> always have 3 replica-1 pools/OSDs; if the customer has more nodes, for
> example 4, we can have 4 replica-1 pools/OSDs. So we have to modify condition 3.

Sounds good. Perhaps then this condition becomes:

3. If there are n+1 deviceClasses found in the status [1], where "n" is the number of replica 1 failure domain names:
    a. Set the deviceClass on the pools [2] to the one value that is not from the replica 1 device classes (a worked example is sketched below)
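
As a worked illustration of the n+1 condition (the fourth zone below is hypothetical, added only to show n = 4): the status carries five deviceClasses, and the single entry that is not a replica 1 failure-domain name is the value the pools should get:

  # Hypothetical CephCluster status with n = 4 replica 1 failure domains,
  # i.e. n+1 = 5 deviceClasses in total.
  status:
    storage:
      deviceClasses:
      - name: ssd          # not a failure domain -> set this on the pools
      - name: us-east-1a   # replica 1 failure domain
      - name: us-east-1b   # replica 1 failure domain
      - name: us-east-1c   # replica 1 failure domain
      - name: us-east-1d   # hypothetical fourth failure domain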

Comment 21 errata-xmlrpc 2024-07-17 13:18:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

