Description of problem (please be as detailed as possible and provide log snippets):

While the replica-1 OSDs have the correct device class, the old OSDs kept the "ssd" class instead of "replicated":

sh-5.1$ ceph osd df tree
ID   CLASS       WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP  META     AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME
 -1              18.00000         -   18 TiB  430 MiB  274 MiB   0 B  155 MiB   18 TiB  0.00  1.00    -          root default
 -5              18.00000         -   18 TiB  430 MiB  274 MiB   0 B  155 MiB   18 TiB  0.00  1.00    -          region us-east-1
-14               6.00000         -    6 TiB  149 MiB   91 MiB   0 B   58 MiB  6.0 TiB  0.00  1.04    -          zone us-east-1a
-13               2.00000         -    2 TiB   48 MiB   15 MiB   0 B   32 MiB  2.0 TiB  0.00  1.00    -          host ocs-deviceset-gp3-csi-2-data-05ggvh
  2  ssd          2.00000   1.00000    2 TiB   48 MiB   15 MiB   0 B   32 MiB  2.0 TiB  0.00  1.00   33      up  osd.2
-41               2.00000         -    2 TiB   53 MiB   37 MiB   0 B   16 MiB  2.0 TiB  0.00  1.11    -          host us-east-1a-data-0knkjg
  3  us-east-1a   2.00000   1.00000    2 TiB   53 MiB   37 MiB   0 B   16 MiB  2.0 TiB  0.00  1.11   44      up  osd.3
-61               2.00000         -    2 TiB   48 MiB   39 MiB   0 B  9.5 MiB  2.0 TiB  0.00  1.01    -          host us-east-1a-data-1v5fwd
  7  us-east-1a   2.00000   1.00000    2 TiB   48 MiB   39 MiB   0 B  9.5 MiB  2.0 TiB  0.00  1.01   36      up  osd.7
 -4               6.00000         -    6 TiB  132 MiB   91 MiB   0 B   41 MiB  6.0 TiB  0.00  0.92    -          zone us-east-1b
 -3               2.00000         -    2 TiB   65 MiB   43 MiB   0 B   22 MiB  2.0 TiB  0.00  1.36    -          host ocs-deviceset-gp3-csi-1-data-0d98jz
  0  ssd          2.00000   1.00000    2 TiB   65 MiB   43 MiB   0 B   22 MiB  2.0 TiB  0.00  1.36   35      up  osd.0
-51               2.00000         -    2 TiB   35 MiB   24 MiB   0 B   10 MiB  2.0 TiB  0.00  0.73    -          host us-east-1b-data-0jp5x5
  5  us-east-1b   2.00000   1.00000    2 TiB   35 MiB   24 MiB   0 B   10 MiB  2.0 TiB  0.00  0.73   40      up  osd.5
-56               2.00000         -    2 TiB   32 MiB   24 MiB   0 B  8.4 MiB  2.0 TiB  0.00  0.67    -          host us-east-1b-data-1hfftf
  6  us-east-1b   2.00000   1.00000    2 TiB   32 MiB   24 MiB   0 B  8.4 MiB  2.0 TiB  0.00  0.67   38      up  osd.6
-10               6.00000         -    6 TiB  149 MiB   92 MiB   0 B   57 MiB  6.0 TiB  0.00  1.04    -          zone us-east-1c
 -9               2.00000         -    2 TiB   73 MiB   40 MiB   0 B   32 MiB  2.0 TiB  0.00  1.52    -          host ocs-deviceset-gp3-csi-0-data-0mgxqb
  1  ssd          2.00000   1.00000    2 TiB   73 MiB   40 MiB   0 B   32 MiB  2.0 TiB  0.00  1.52   38      up  osd.1
-46               2.00000         -    2 TiB   33 MiB   18 MiB   0 B   16 MiB  2.0 TiB  0.00  0.70    -          host us-east-1c-data-0lxvf6
  4  us-east-1c   2.00000   1.00000    2 TiB   33 MiB   18 MiB   0 B   16 MiB  2.0 TiB  0.00  0.70   40      up  osd.4
-66               2.00000         -    2 TiB   43 MiB   34 MiB   0 B  8.8 MiB  2.0 TiB  0.00  0.90    -          host us-east-1c-data-1rm7lr
  8  us-east-1c   2.00000   1.00000    2 TiB   43 MiB   34 MiB   0 B  8.8 MiB  2.0 TiB  0.00  0.90   38      up  osd.8
                            TOTAL      18 TiB  430 MiB  274 MiB   0 B  155 MiB   18 TiB  0.00
MIN/MAX VAR: 0.67/1.52  STDDEV: 0

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
The impact was found when trying to delete replica-1 and revert the cluster back to replica-3.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Steps to Reproduce:
1. Enable replica-1
2.
3.

Actual results:

Expected results:

Additional info:
When the replica 1 feature is enabled, there are a couple of issues:

1. The crushDeviceClass property on the storageClassDeviceSets is being updated to "replicated". However, the OSDs are **not** being updated from their default deviceClass of "ssd", as seen in the crush tree above. This is by design in Ceph, to prevent accidental updates to the device class. The upstream Rook issue [1] has been opened to consider supporting updates to the deviceClass on existing OSDs.

2. The default pools are updating their device classes to "replicated". However, their crush rules are **not** getting updated because the "replicated" deviceClass is not available on any OSD, and the update is silently failing with error [2], which was not even being logged by the Rook operator. This will be fixed by [3], which fails the reconcile if the deviceClass is not available on any OSD.

The user impact is that the main replicated pools are using all OSDs instead of only the expected OSDs that are independent from the replica 1 OSDs. During normal operation the user won't notice any issues other than the replicated pools consuming some space on the replica 1 OSDs. This issue would be very impactful if we supported disabling the replica 1 feature and the replica 1 OSDs were purged. In that case, the replicated pools could have lost PGs, as we noticed when QE (Aviad) was trying to clean up the test cluster and it was left with unclean PGs. This is a data loss scenario.

Fixing issue [1] is going to take a bit longer to implement, and updating the deviceClass on the OSDs could be somewhat disruptive anyway. For a fix that is simple and has immediate impact on this issue, the OCS operator can set the deviceClass on the pools to "ssd", the deviceClass that the OSDs already have. This is the default deviceClass in all clusters as far as I have seen. But to ensure an existing deviceClass is used, the OCS operator should confirm that the value is present in the CephCluster CR status. If replica 1 has already been enabled, the entries will likely be:

status:
  storage:
    deviceClasses:
    - name: ssd
    - name: us-east-1a
    - name: us-east-1b
    - name: us-east-1c

If replica 1 was enabled in a clean cluster, or if the cluster was expanded, then the "replicated" device class would be set on the OSDs and would appear in that list. In that case, we do need to keep "replicated" instead of "ssd".

Since the replica 1 feature is already GA, let's consider backporting to 4.15 after we confirm the solution is solid. Customers using replica 1 may not have noticed it, but it's an important fix for data integrity.

[1] https://github.com/rook/rook/issues/14056
[2] EINVAL: device class replicated does not exist: exit status 22
[3] https://github.com/rook/rook/pull/14057
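For illustration only, a minimal Go sketch of that check, assuming the operator already has the deviceClass names read from status.storage.deviceClasses of the CephCluster CR. The package, function name, and signature are hypothetical, not the actual OCS operator code:

package deviceclass

// pickPoolDeviceClass illustrates the quick fix described above: prefer "replicated"
// only if it actually exists on the OSDs (i.e. it shows up in the CephCluster status),
// otherwise fall back to the class the existing OSDs already carry (typically "ssd").
func pickPoolDeviceClass(statusDeviceClasses []string) string {
	present := make(map[string]bool, len(statusDeviceClasses))
	for _, c := range statusDeviceClasses {
		present[c] = true
	}
	if present["replicated"] {
		// Clean or expanded cluster: the OSDs were created with the "replicated" class.
		return "replicated"
	}
	if present["ssd"] {
		// Existing cluster: the OSDs kept their default "ssd" class.
		return "ssd"
	}
	// Neither class is reported; leave the pool deviceClass unset rather than guess.
	return ""
}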
@tnielsen As per our discussion, I thought about the implementation, but one issue that comes to mind is that the deviceClass field for any OSD is completely customizable, so the value won't necessarily always have been "ssd". How do we account for the case where the customer has multiple sets of OSDs with different deviceClasses?
Also, there was another discussion about the deviceClass we would use, here: https://bugzilla.redhat.com/show_bug.cgi?id=2254344#c19
(In reply to Malay Kumar parida from comment #5)
> @tnielsen As per our discussion, I thought about the implementation, but one
> issue that comes to mind is that the deviceClass field for any OSD is
> completely customizable, so the value won't necessarily always have been
> "ssd". How do we account for the case where the customer has multiple sets
> of OSDs with different deviceClasses?

In the case where the customer has defined the deviceClass, they must have also set the deviceClass on the pools, right? So in that case we shouldn't need to worry about the default deviceClass? It's already determined, correct?

If they haven't set the deviceClass on the OSDs or the pools, then we expect only a single deviceClass to be found in the CephCluster status (other than the replica 1 device classes). If there is more than one deviceClass, then let's just not set the deviceClass on the pools by default. They would get an error in that case about the overlapping roots. To fix it, they would need to specify the deviceClass for the pools, which is already possible to configure, right?
To summarize, I believe we need the following changes in the OCS operator:

1. If there are *no* deviceClasses found in the status of the CephCluster CR [1]:
   a. Keep the deviceClass blank on the pools [2].
   b. If the replica 1 feature is enabled, set the deviceClass to "replicated" on the pools [2] (to maintain today's behavior on green-field clusters).
2. If there is *one* deviceClass found in the status [1]:
   a. Set the deviceClass on the pools [2] to that value.
3. If there are *four* deviceClasses found in the status [1], and *three* of them are zone names for the replica 1 feature:
   a. Set the deviceClass on the pools [2] to the value that is not from the replica 1 pools.
4. If there is any other set of deviceClasses in the status [1], leave the deviceClass blank, as the OCS operator does not know how to set it.
   a. The admin needs to set the deviceClass on the pools [2] to avoid the error of multiple roots in the crush map. The OCS operator should respect this desired value and not overwrite it. (Perhaps this means they need to disable the reconcile of blockpools, filesystems, and objectstores. Setting custom deviceClasses is an advanced scenario and less common, so that workaround seems sufficient.)

Rook will also need to complete the two fixes mentioned in comment 4.

[1] status.storage.deviceClasses in the CephCluster CR: ssd | replicated | whatever
[2] "pools" refers to pools created as part of the built-in CRs created by the OCS operator: the CephBlockPool, CephFilesystem, and CephObjectStore CRs
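A rough Go sketch of this selection logic, for illustration only; the inputs (the deviceClass names from the CephCluster status, the replica 1 failure-domain class names, and the replica 1 enablement flag) and all identifiers are assumptions, not the actual OCS operator implementation:

package deviceclass

// chooseDefaultPoolDeviceClass sketches the four cases above. An empty return
// value means "leave the deviceClass blank on the pools".
func chooseDefaultPoolDeviceClass(statusDeviceClasses, replica1Classes []string, replica1Enabled bool) string {
	isReplica1 := make(map[string]bool, len(replica1Classes))
	for _, c := range replica1Classes {
		isReplica1[c] = true
	}

	// Case 1: no deviceClasses reported yet.
	if len(statusDeviceClasses) == 0 {
		if replica1Enabled {
			return "replicated" // keep today's green-field behavior
		}
		return ""
	}

	// Case 2: exactly one deviceClass reported.
	if len(statusDeviceClasses) == 1 {
		return statusDeviceClasses[0]
	}

	// Case 3: four classes, three of which are the replica 1 zone names;
	// use the single class that does not belong to replica 1.
	var nonReplica1 []string
	for _, c := range statusDeviceClasses {
		if !isReplica1[c] {
			nonReplica1 = append(nonReplica1, c)
		}
	}
	if len(statusDeviceClasses) == 4 && len(nonReplica1) == 1 {
		return nonReplica1[0]
	}

	// Case 4: any other combination; the admin must set the deviceClass on the pools.
	return ""
}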
@Travis Regarding comment 10: for the third scenario we can't assume we always have 3 replica-1 pools/OSDs. If the customer has more nodes, for example 4, we can have 4 replica-1 pools/OSDs. So we have to modify condition 3.
(In reply to Malay Kumar parida from comment #15)
> @Travis Regarding comment 10: for the third scenario we can't assume we
> always have 3 replica-1 pools/OSDs. If the customer has more nodes, for
> example 4, we can have 4 replica-1 pools/OSDs. So we have to modify
> condition 3.

Sounds good. Perhaps then this condition becomes:

3. If there are n+1 deviceClasses found in the status [1], where "n" is the number of replica 1 failure domain names:
   a. Set the deviceClass on the pools [2] to the value that is not from the replica 1 pools.
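In terms of the earlier sketch, the revised condition could look like the following hypothetical helper (same assumed inputs as before, not actual operator code):

package deviceclass

// matchesGeneralizedCase3 checks the revised condition: the status reports n+1
// deviceClasses, where n is the number of replica 1 failure-domain names, leaving
// exactly one class for the default pools.
func matchesGeneralizedCase3(statusDeviceClasses, replica1Classes []string) (string, bool) {
	isReplica1 := make(map[string]bool, len(replica1Classes))
	for _, c := range replica1Classes {
		isReplica1[c] = true
	}
	var nonReplica1 []string
	for _, c := range statusDeviceClasses {
		if !isReplica1[c] {
			nonReplica1 = append(nonReplica1, c)
		}
	}
	if len(statusDeviceClasses) == len(replica1Classes)+1 && len(nonReplica1) == 1 {
		return nonReplica1[0], true
	}
	return "", false
}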
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591