Description of problem:

If 'portable: false' is used in a StorageCluster CR, it is ignored and the PVC ID is used as the name for the host CRUSH bucket in Ceph.

Version-Release number of selected component (if applicable):

sh-4.4# ceph -v
ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)

mini:~ kyle$ oc get pod ocs-operator-66977dc7fc-gzx95 -o yaml | grep image
    image: quay.io/rhceph-dev/ocs-operator@sha256:40fd024ff48aa144df9b8147d893282069eeaa4016e16465a6b37e9d1296e4f5

Steps to Reproduce:
1. Deploy a cluster w/ LSO and 'portable: false'
2. Deploy the Ceph toolbox
3. ceph osd tree

Actual results:

mini:~ kyle$ oc rsh rook-ceph-tools-5f7db56774-qslmn
sh-4.4# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME                            STATUS REWEIGHT PRI-AFF
 -1       13.64000 root default
 -5       13.64000     region us-west-2
-12        4.54700         zone us-west-2a
-11        2.27299             host ocs-deviceset-2-0-57szr
  2   ssd  2.27299                 osd.2                    up  1.00000 1.00000
-15        2.27299             host ocs-deviceset-2-1-n9vkw
  3   ssd  2.27299                 osd.3                    up  1.00000 1.00000
 -4        4.54700         zone us-west-2b
 -9        2.27299             host ocs-deviceset-0-0-wq5w4
  1   ssd  2.27299                 osd.1                    up  1.00000 1.00000
 -3        2.27299             host ocs-deviceset-0-1-wzpsh
  0   ssd  2.27299                 osd.0                    up  1.00000 1.00000
-18        4.54700         zone us-west-2c
-17        2.27299             host ocs-deviceset-1-0-c8gjz
  4   ssd  2.27299                 osd.4                    up  1.00000 1.00000
-21        2.27299             host ocs-deviceset-1-1-hv87f
  5   ssd  2.27299                 osd.5                    up  1.00000 1.00000
sh-4.4# exit

Expected results:

The instance / hostname should be used as the name for the host CRUSH bucket.

Additional info:

mini:~ kyle$ cat storagecluster-ec2.yaml
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  manageNodes: false
  monPVCTemplate:
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: gp2
      volumeMode: Filesystem
  storageDeviceSets:
  - count: 2
    dataPVCTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1
        storageClassName: localblock
        volumeMode: Block
    name: ocs-deviceset
    placement: {}
    portable: false
    replica: 3
    resources: {}
Isn't this an upstream issue? I'm not sure this portable flag is supported downstream?
@Kyle, can you confirm if you were using OCS 4.2? This should have been fixed in 4.3 by this upstream PR: https://github.com/rook/rook/pull/4658
@Travis, I can confirm I'm using 4.3:

ocs-operator.v4.3.0-377.ci (rhdev-ceph)

mini:~ kyle$ oc get pods rook-ceph-operator-577cb7dfd9-5lgxc -o yaml | grep image
    image: quay.io/rhceph-dev/rook-ceph@sha256:f42ce65085719f31e23c3459d35ccff442c0eceb217fc724796c7dcb1ba829f4

@Yaniv, we set 'portable: true' with dynamically provisioned PVCs (e.g. aws-ebs/vsphere-volume) in OCS; 'portable: false' should work. Taken on its own, we could mark this 'notfix', but combined with this issue [1] we have a big problem. We don't need to fix both for safety, but at least one needs to be fixed for the 4.3 release.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1814681
What version and container image of Rook-Ceph is running?
It's hard for me to discern the tag because it's a private registry, but I provided the sha256 for it in the previous comment.
From chat with Kyle, the rook version is 4.3:

mini:~ kyle$ oc logs rook-ceph-operator-577cb7dfd9-5lgxc | head
2020-03-24 18:07:00.981562 I | rookcmd: starting Rook 4.3-32.07d83470.ocs_4.3 with arguments '/usr/local/bin/rook ceph operator'

I was not able to repro this on upstream rook v1.2, which is the base for 4.3. I see the host name in the CRUSH map as expected when portable: false:

sh-4.2$ ceph osd tree
ID CLASS WEIGHT  TYPE NAME                     STATUS REWEIGHT PRI-AFF
-1       0.02939 root default
-5       0.02939     region us-east-1
-4       0.02939         zone us-east-1a
-3       0.02939             host ip-10-0-136-227
 0   ssd 0.00980                 osd.0             up  1.00000 1.00000

In OCS I see that portable is forced to true under certain conditions: if there is no placement specified in the storagecluster CR and if there are more topology keys:
https://github.com/openshift/ocs-operator/blob/release-4.3/pkg/controller/storagecluster/reconcile.go#L902

From Kyle in chat, he is also seeing that portable: false is not being passed on to rook:

mini:~ kyle$ oc get storagecluster -o yaml | grep port
mini:~ kyle$ oc get cephcluster -o yaml | grep port
      portable: true
      portable: true
      portable: true

@Jose can you take a look?
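For context, the cephcluster lines grepped above come from the storageClassDeviceSets that ocs-operator generates in the CephCluster CR. A trimmed sketch of roughly what that generated section looks like (the device-set name and storage size here are illustrative, not taken from Kyle's cluster):

# Illustrative excerpt of what `oc get cephcluster -o yaml` shows.
# The operator has set portable: true on the generated device set,
# even though the StorageCluster asked for portable: false.
spec:
  storage:
    storageClassDeviceSets:
    - name: ocs-deviceset-0        # hypothetical name
      count: 2
      portable: true               # forced by the operator
      volumeClaimTemplates:
      - spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: "1"
          storageClassName: localblock
          volumeMode: Block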
Travis found the spot, yeah. Currently we can't change this behavior as the OCS console UI depends on it, so the UI needs to change first. As a workaround, setting an explicit Placement should suffice.
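For reference, a minimal sketch of that workaround, assuming the device set's placement just needs to be non-empty for portable to be honored (the node-affinity key used here is only an example and should be adjusted to whatever label actually selects the storage nodes):

# Sketch of a StorageCluster device set with an explicit placement.
# The label key below is an example, not a required value.
storageDeviceSets:
- name: ocs-deviceset
  count: 2
  replica: 3
  portable: false
  placement:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cluster.ocs.openshift.io/openshift-storage
            operator: Exists
  dataPVCTemplate:
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1
      storageClassName: localblock
      volumeMode: Block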
I have been saying since early January, or even December, that we would need to rework/adapt how we're doing portable vs. non-portable OSDs... :-/

Currently the ocs-operator uses racks if it doesn't find enough zones: it sets the failure domain to rack and forces portable = true. If it finds enough zone labels, it uses failure domain = zone and also forces portable = true.

If we look at the StorageCluster CRD, it explains that portable is ignored if "placement" is not set:
https://github.com/openshift/ocs-operator/blob/release-4.3/deploy/olm-catalog/ocs-operator/4.3.0/storagecluster.crd.yaml#L96

So if we rely on automatic placement, portable will be ignored. If we do manual placement, portable will be honored. The UI doesn't do placement afaik, so we are forcing portable = true. Maybe we need to revert to doing manual setup (non-UI) for direct-attached disks.

Anyway, I think we can live with it for the 4.3 Tech Preview, but we should fix it for 4.4.
UI seems to hardcode `portable:true`, so it wouldn't interfere with UI to honor `portable:false` in the CR.
(In reply to Rohan CJ from comment #10)
> UI seems to hardcode `portable:true`

@Rohan, can you please share the code reference?
re-establishing severity
Fixing severity to high. I set it to medium by mistake.
> @Rohan, Can you please share the code reference?

I just created a StorageCluster via UI and checked.
Asked Umanga for the code ref: https://github.com/openshift/console/blob/master/frontend/packages/ceph-storage-plugin/src/constants/ocs-install.ts#L16
There is currently a workaround for this, and it does not impact supported functionality in OCS 4.3 or 4.4, so pushing back to OCS 4.5. But we do still need to fix it, so giving devel_ack+. Kyle, can you confirm the workaround is valid?
This was a candidate blocker for 4.3; therefore, moving to 4.4 as a blocker candidate.
I see nowhere in the BZ history where this was marked as a blocker candidate, and this is not a functionality blocker for the product. To even expose this bug in the product would require a UI change which will not happen for the OCS 4.4 timeframe. As such, we will not be fixing this in OCS 4.4 and I am removing devel_ack+ until we push this back to OCS 4.5.
True, we should have added a summary in the bug itself. This bug was discussed as one of the candidate blockers for 4.3, as the OSD distribution was severely impacted by this. @Michael, @Anat, as the discussion happened with you (from the Eng side), please follow up on this.
This doesn't affect any of the supported OCS platforms that I'm aware of, and there is a workaround for bare metal. I'm also not sure why this would be a blocker.
Seems there was a misunderstanding: the UI no longer requires the behavior that led us to force "portable: true". As such, we can remove it and honor the "portable" setting in all configurations. I still don't like doing this so late, but there's pressure from on high to do it, so I'm reinstating devel_ack.
Initial PR is submitted to master: https://github.com/openshift/ocs-operator/pull/480

We'll backport to release-4.4 once it merges.
Backport PR: https://github.com/openshift/ocs-operator/pull/483
Backport PR has merged! Ready for downstream build.
4.4.0-414.ci aka 4.4.0-rc2 is now available with this fix
Not requiring doc text, since this shows up mostly when using LSO-based deployments and this is the first time we are supporting them.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2393
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days