1816820 – portable: false ignored

Bug 1816820 - portable: false ignored

Summary: portable: false ignored

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Container Storage
Classification:	Red Hat Storage
Component:	ocs-operator
Sub Component:
Version:	4.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	OCS 4.4.0
Assignee:	Rohan Gupta
QA Contact:	Neha Berry
Docs Contact:	Erin Donnelly
URL:
Whiteboard:
Depends On:
Blocks:	1826482
TreeView+	depends on / blocked

Reported:	2020-03-24 19:33 UTC by Kyle Bader
Modified:	2023-09-14 05:54 UTC (History)
CC List:	12 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	When `portable: false` was used in a StorageCluster CR; for example, Local Storage infrastructure, it was ignored and the PVC ID was used as the name for the host CRUSH bucket in Ceph. With this fix, `portable: false` is honored in the CephCluster and CRUSH bucket uses hostnames instead of PVC IDs for name of the host.
Clone Of:
Environment:
Last Closed:	2020-06-04 12:54:39 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift ocs-operator pull 480	None	closed	Removed hardcoded portable=true	2020-12-24 11:44:28 UTC
Github	openshift ocs-operator pull 483	None	closed	Bug 1816820: Removed hardcoded portable=true	2020-12-24 11:43:55 UTC
Red Hat Product Errata	RHBA-2020:2393	None	None	None	2020-06-04 12:54:53 UTC

Description Kyle Bader 2020-03-24 19:33:07 UTC

Description of problem:

If 'portable: false' is used in a StorageCluster CR, it is ignored and the PVC ID is used as the name for the host CRUSH bucket in Ceph.

Version-Release number of selected component (if applicable):

sh-4.4# ceph -v
ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)

mini:~ kyle$ oc get pod ocs-operator-66977dc7fc-gzx95 -o yaml | grep image
    image: quay.io/rhceph-dev/ocs-operator@sha256:40fd024ff48aa144df9b8147d893282069eeaa4016e16465a6b37e9d1296e4f5

Steps to Reproduce:
1. Deploy a cluster w/ LSO and 'portable: false"
2. Deploy Ceph toolbox
3. ceph osd tree

Actual results:

mini:~ kyle$ oc rsh rook-ceph-tools-5f7db56774-qslmn
sh-4.4# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME                                STATUS REWEIGHT PRI-AFF
 -1       13.64000 root default
 -5       13.64000     region us-west-2
-12        4.54700         zone us-west-2a
-11        2.27299             host ocs-deviceset-2-0-57szr
  2   ssd  2.27299                 osd.2                        up  1.00000 1.00000
-15        2.27299             host ocs-deviceset-2-1-n9vkw
  3   ssd  2.27299                 osd.3                        up  1.00000 1.00000
 -4        4.54700         zone us-west-2b
 -9        2.27299             host ocs-deviceset-0-0-wq5w4
  1   ssd  2.27299                 osd.1                        up  1.00000 1.00000
 -3        2.27299             host ocs-deviceset-0-1-wzpsh
  0   ssd  2.27299                 osd.0                        up  1.00000 1.00000
-18        4.54700         zone us-west-2c
-17        2.27299             host ocs-deviceset-1-0-c8gjz
  4   ssd  2.27299                 osd.4                        up  1.00000 1.00000
-21        2.27299             host ocs-deviceset-1-1-hv87f
  5   ssd  2.27299                 osd.5                        up  1.00000 1.00000
sh-4.4# exit


Expected results:

The instance / hostname should be used as the name for the host CRUSH bucket.

Additional info:

mini:~ kyle$ cat storagecluster-ec2.yaml
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  manageNodes: false
  monPVCTemplate:
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: gp2
      volumeMode: Filesystem
  storageDeviceSets:
  - count: 2
    dataPVCTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1
        storageClassName: localblock
        volumeMode: Block
    name: ocs-deviceset
    placement: {}
    portable: false
    replica: 3
    resources: {}

Comment 2 Yaniv Kaul 2020-03-25 08:50:49 UTC

Isn't this an upstream issue? I'm not sure this portable flag is supported downstream?

Comment 3 Travis Nielsen 2020-03-25 14:35:52 UTC

@Kyle, can you confirm if you were using OCS 4.2? This should have been fixed in 4.3 by this upstream PR: https://github.com/rook/rook/pull/4658

Comment 4 Kyle Bader 2020-03-25 14:52:24 UTC

@Travis, I can confirm I'm using 4.3

ocs-operator.v4.3.0-377.ci (rhdev-ceph)

mini:~ kyle$ oc get pods rook-ceph-operator-577cb7dfd9-5lgxc -o yaml | grep image
    image: quay.io/rhceph-dev/rook-ceph@sha256:f42ce65085719f31e23c3459d35ccff442c0eceb217fc724796c7dcb1ba829f4

@Yaniv, we set 'portable: true' with dynamically provisioned PVCs (eg. aws-ebs/vsphere-volume) in OCS, 'portable: false' should work.

Independently, we could 'notfix', but combined with this issue [1] we have a big problem. We don't need to fix both for safety, but at least one needs to be fixed for the 4.3 release.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1814681

Comment 5 Jose A. Rivera 2020-03-25 15:15:23 UTC

What version and container image of Rook-Ceph is running?

Comment 6 Kyle Bader 2020-03-25 15:32:04 UTC

It's hard for me to discern the tag because it's a private registry, but I provided the sha256 for it in the previous comment.

Comment 7 Travis Nielsen 2020-03-25 17:22:59 UTC

From chat with Kyle, the rook version is 4.3:

mini:~ kyle$ oc logs rook-ceph-operator-577cb7dfd9-5lgxc | head
2020-03-24 18:07:00.981562 I | rookcmd: starting Rook 4.3-32.07d83470.ocs_4.3 with arguments '/usr/local/bin/rook ceph operator'

I was not able to repro this on upstream rook v1.2, which is the base for 4.3. I see the host name in the CRUSH map as expected when portable: false

sh-4.2$ ceph osd tree
ID CLASS WEIGHT  TYPE NAME                        STATUS REWEIGHT PRI-AFF 
-1       0.02939 root default                                             
-5       0.02939     region us-east-1                                     
-4       0.02939         zone us-east-1a                                  
-3       0.02939             host ip-10-0-136-227                         
 0   ssd 0.00980                 osd.0                up  1.00000 1.00000 

In OCS I see that portable is forced to true under certain conditions if there is no placement specified in the storagecluster CR and if there are more topology keys:

https://github.com/openshift/ocs-operator/blob/release-4.3/pkg/controller/storagecluster/reconcile.go#L902

From Kyle in chat, he is also seeing that portable: false is not being passed on to rook:

mini:~ kyle$ oc get storagecluster -o yaml | grep port
mini:~ kyle$ oc get cephcluster -o yaml | grep port
        portable: true
        portable: true
        portable: true

@Jose can you take a look?

Comment 8 Jose A. Rivera 2020-03-25 21:19:43 UTC

Travis found the spot, yeah. Currently we can't change this behavior as the OCS console UI depends on it, so the UI needs to change first. As a workaround, setting an explicit Placement should suffice.

Comment 9 Michael Adam 2020-03-25 22:01:45 UTC

I was telling since early January or even December that we would need to rework/adapt how we're doing portable vs non-portable OSDs... :-/

Currently ocs operator does racks if it doesn't find enough zones and will set the failure domain to rack forces portable = true.
If it finds enough zone labels it will do failure domain=zone and forces portable = true.

If we look at the storagecluster crd, it explains that portable is ignored if "placement" is not set.

https://github.com/openshift/ocs-operator/blob/release-4.3/deploy/olm-catalog/ocs-operator/4.3.0/storagecluster.crd.yaml#L96

So if we rely on automatic placement, portable will be ignored. If we do manual placement, portable will be honored.
The UI doesn't do placement afaik, and so we are forcing portable = true.

Maybe we need to revert to doing manual setup (non-UI) for direct attached disks.

Anyway, I think we can live with it for the 4.3 Tech Preview. But we should fix it for 4.4.

Comment 10 Rohan CJ 2020-03-26 07:29:18 UTC

UI seems to hardcode `portable:true`, so it wouldn't interfere with UI to honor `portable:false` in the CR.

Comment 11 Michael Adam 2020-03-26 07:40:26 UTC

(In reply to Rohan CJ from comment #10)
> UI seems to hardcode `portable:true`,

@Rohan, Can you please share the code reference?

Comment 12 Michael Adam 2020-03-26 07:53:38 UTC

re-establishing severity

Comment 13 Raz Tamir 2020-03-26 07:58:20 UTC

fixing severity to high.
Set it to medium by mistake

Comment 14 Rohan CJ 2020-03-30 08:28:53 UTC

> @Rohan, Can you please share the code reference?

I just created a StorageCluster via UI and checked.

Comment 15 Rohan CJ 2020-03-30 08:32:22 UTC

Asked Umanga for the code ref: https://github.com/openshift/console/blob/master/frontend/packages/ceph-storage-plugin/src/constants/ocs-install.ts#L16

Comment 16 Jose A. Rivera 2020-04-01 13:51:33 UTC

There is currently a workaround for this, and it does not impact supported functionality in OCS 4.3 or 4.4, so pushing back to OCS 4.5. But we do still need to fix it, so giving devel_ack+.

Kyle, can you confirm the workaround is valid?

Comment 17 Raz Tamir 2020-04-12 13:22:43 UTC

This was a candidate blocker for 4.3, therefore, moving to .4.4 as a blocker candidate

Comment 20 Jose A. Rivera 2020-04-14 14:17:24 UTC

I see nowhere in the BZ history where this was marked as a blocker candidate, and this is not a functionality blocker for the product. To even expose this bug in the product would require a UI change which will not happen for the OCS 4.4 timeframe. As such, we will not be fixing this in OCS 4.4 and I am removing devel_ack+ until we push this back to OCS 4.5.

Comment 21 Raz Tamir 2020-04-16 08:42:14 UTC

True, we should have add a summary in the bug itself.
This bug was discussed as one of the candidate blockers for 4.3 as the OSD distribution was severely impacted by this.

@Michael, @Anat,

As the discussion happen with you (from Eng side), please follow up on this

Comment 22 Travis Nielsen 2020-04-16 16:43:49 UTC

This doesn't affect any of supported OCS platforms that I'm aware and there is a workaround for baremetal. Not sure either why this would be a blocker.

Comment 24 Jose A. Rivera 2020-04-21 14:01:28 UTC

Seems there was a misunderstanding: The UI no longer requires the behavior that led us to force "portable: true". as such we can remove it and honor the "portable" setting in all configurations.

I still don't like doing this so late, but there's pressure from on high to do it so reinstating devel_ack.

Comment 25 Jose A. Rivera 2020-04-21 22:28:24 UTC

Initial PR is submitted to master: https://github.com/openshift/ocs-operator/pull/480

We'll backport to release-4.4 once it merges.

Comment 26 Michael Adam 2020-04-23 09:34:45 UTC

https://github.com/openshift/ocs-operator/pull/483
backport PR

Comment 29 Jose A. Rivera 2020-04-23 20:10:17 UTC

Backport PR has merged! Ready for downstream build.

Comment 30 Michael Adam 2020-04-24 16:24:09 UTC

4.4.0-414.ci aka 4.4.0-rc2 is now available with this fix

Comment 35 Michael Adam 2020-05-04 20:54:52 UTC

not requiring doc text since this shows up mostly when using LSO based deployments, and this is the first time we are supporting it

Comment 38 errata-xmlrpc 2020-06-04 12:54:39 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2393

Comment 39 Red Hat Bugzilla 2023-09-14 05:54:40 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

Note You need to log in before you can comment on or make changes to this bug.