Description of problem:
If 'portable: false' is used in a StorageCluster CR, it is ignored and the PVC ID is used as the name for the host CRUSH bucket in Ceph.
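For reference, a minimal sketch of the device-set portion of a StorageCluster CR that hits this, assuming an LSO-backed storage class; the names, count, and size below are placeholders, not taken from this report:

```yaml
# Hypothetical excerpt of a StorageCluster CR (names/sizes are placeholders)
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  storageDeviceSets:
    - name: ocs-deviceset
      count: 2
      portable: false          # expected: host CRUSH buckets named after nodes; actual: ignored
      dataPVCTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 2328Gi  # placeholder size
          storageClassName: localblock   # assumed LSO-provided storage class
          volumeMode: Block
```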
Version-Release number of selected component (if applicable):
sh-4.4# ceph -v
ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)
mini:~ kyle$ oc get pod ocs-operator-66977dc7fc-gzx95 -o yaml | grep image
Steps to Reproduce:
1. Deploy a cluster w/ LSO and 'portable: false'
2. Deploy Ceph toolbox
3. ceph osd tree
mini:~ kyle$ oc rsh rook-ceph-tools-5f7db56774-qslmn
sh-4.4# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 13.64000 root default
-5 13.64000 region us-west-2
-12 4.54700 zone us-west-2a
-11 2.27299 host ocs-deviceset-2-0-57szr
2 ssd 2.27299 osd.2 up 1.00000 1.00000
-15 2.27299 host ocs-deviceset-2-1-n9vkw
3 ssd 2.27299 osd.3 up 1.00000 1.00000
-4 4.54700 zone us-west-2b
-9 2.27299 host ocs-deviceset-0-0-wq5w4
1 ssd 2.27299 osd.1 up 1.00000 1.00000
-3 2.27299 host ocs-deviceset-0-1-wzpsh
0 ssd 2.27299 osd.0 up 1.00000 1.00000
-18 4.54700 zone us-west-2c
-17 2.27299 host ocs-deviceset-1-0-c8gjz
4 ssd 2.27299 osd.4 up 1.00000 1.00000
-21 2.27299 host ocs-deviceset-1-1-hv87f
5 ssd 2.27299 osd.5 up 1.00000 1.00000
The instance / hostname should be used as the name for the host CRUSH bucket.
mini:~ kyle$ cat storagecluster-ec2.yaml
- count: 2
Isn't this an upstream issue? I'm not sure the portable flag is even supported downstream.
@Kyle, can you confirm if you were using OCS 4.2? This should have been fixed in 4.3 by this upstream PR: https://github.com/rook/rook/pull/4658
@Travis, I can confirm I'm using 4.3
mini:~ kyle$ oc get pods rook-ceph-operator-577cb7dfd9-5lgxc -o yaml | grep image
@Yaniv, we set 'portable: true' with dynamically provisioned PVCs (e.g. aws-ebs/vsphere-volume) in OCS; 'portable: false' should work.
Independently, we could 'notfix' this, but combined with this issue we have a big problem. We don't need to fix both for safety, but at least one needs to be fixed for the 4.3 release.
What version and container image of Rook-Ceph is running?
It's hard for me to discern the tag because it's a private registry, but I provided the sha256 for it in the previous comment.
From chat with Kyle, the rook version is 4.3:
mini:~ kyle$ oc logs rook-ceph-operator-577cb7dfd9-5lgxc | head
2020-03-24 18:07:00.981562 I | rookcmd: starting Rook 4.3-32.07d83470.ocs_4.3 with arguments '/usr/local/bin/rook ceph operator'
I was not able to reproduce this on upstream Rook v1.2, which is the base for 4.3. I see the host name in the CRUSH map as expected when 'portable: false' is set:
sh-4.2$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.02939 root default
-5 0.02939 region us-east-1
-4 0.02939 zone us-east-1a
-3 0.02939 host ip-10-0-136-227
0 ssd 0.00980 osd.0 up 1.00000 1.00000
In OCS I see that portable is forced to true under certain conditions: if no placement is specified in the StorageCluster CR and if there are multiple topology keys:
From Kyle in chat, he is also seeing that portable: false is not being passed on to rook:
mini:~ kyle$ oc get storagecluster -o yaml | grep port
mini:~ kyle$ oc get cephcluster -o yaml | grep port
@Jose can you take a look?
Travis found the spot, yeah. Currently we can't change this behavior as the OCS console UI depends on it, so the UI needs to change first. As a workaround, setting an explicit Placement should suffice.
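A sketch of that workaround: adding an explicit placement to the device set, since the operator only forces portable when placement is left unset. The node-selector key below is an illustrative assumption, not taken from this report:

```yaml
# Hypothetical workaround sketch: explicit placement so portable: false is honored
spec:
  storageDeviceSets:
    - name: ocs-deviceset
      count: 2
      portable: false
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cluster.ocs.openshift.io/openshift-storage  # assumed storage-node label
                    operator: Exists
```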
I have been saying since early January, or even December, that we would need to rework/adapt how we're doing portable vs non-portable OSDs... :-/
Currently, the ocs-operator uses racks if it doesn't find enough zones: it sets the failure domain to rack and forces portable = true.
If it finds enough zone labels, it sets failure domain = zone and likewise forces portable = true.
If we look at the StorageCluster CRD, it explains that portable is ignored if "placement" is not set.
So if we rely on automatic placement, portable will be ignored. If we do manual placement, portable will be honored.
The UI doesn't do placement afaik, and so we are forcing portable = true.
Maybe we need to revert to doing manual setup (non-UI) for direct attached disks.
Anyway, I think we can live with it for the 4.3 Tech Preview. But we should fix it for 4.4.
UI seems to hardcode `portable:true`, so it wouldn't interfere with UI to honor `portable:false` in the CR.
(In reply to Rohan CJ from comment #10)
> UI seems to hardcode `portable:true`,
@Rohan, Can you please share the code reference?
Fixing severity to high.
I set it to medium by mistake.
> @Rohan, Can you please share the code reference?
I just created a StorageCluster via UI and checked.
Asked Umanga for the code ref: https://github.com/openshift/console/blob/master/frontend/packages/ceph-storage-plugin/src/constants/ocs-install.ts#L16
There is currently a workaround for this, and it does not impact supported functionality in OCS 4.3 or 4.4, so pushing back to OCS 4.5. But we do still need to fix it, so giving devel_ack+.
Kyle, can you confirm the workaround is valid?
This was a candidate blocker for 4.3; therefore, moving to 4.4 as a blocker candidate.
I see nowhere in the BZ history where this was marked as a blocker candidate, and this is not a functionality blocker for the product. To even expose this bug in the product would require a UI change which will not happen for the OCS 4.4 timeframe. As such, we will not be fixing this in OCS 4.4 and I am removing devel_ack+ until we push this back to OCS 4.5.
True, we should have added a summary in the bug itself.
This bug was discussed as one of the candidate blockers for 4.3 as the OSD distribution was severely impacted by this.
As the discussion happened with you (from the Eng side), please follow up on this.
This doesn't affect any of the supported OCS platforms that I'm aware of, and there is a workaround for bare metal. I'm not sure either why this would be a blocker.
Seems there was a misunderstanding: the UI no longer requires the behavior that led us to force "portable: true". As such, we can remove it and honor the "portable" setting in all configurations.
I still don't like doing this so late, but there's pressure from on high to do it so reinstating devel_ack.
Initial PR is submitted to master: https://github.com/openshift/ocs-operator/pull/480
We'll backport to release-4.4 once it merges.
Backport PR has merged! Ready for downstream build.
4.4.0-414.ci aka 4.4.0-rc2 is now available with this fix
Not requiring doc text, since this shows up mostly in LSO-based deployments, and this is the first time we are supporting them.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.