Bug 1816820 - portable: false ignored [NEEDINFO]
Summary: portable: false ignored
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat
Component: ocs-operator
Version: 4.3
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: OCS 4.4.0
Assignee: Rohan Gupta
QA Contact: Neha Berry
Erin Donnelly
URL:
Whiteboard:
Depends On:
Blocks: 1826482
TreeView+ depends on / blocked
 
Reported: 2020-03-24 19:33 UTC by Kyle Bader
Modified: 2020-06-17 18:10 UTC (History)
12 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When `portable: false` was used in a StorageCluster CR; for example, Local Storage infrastructure, it was ignored and the PVC ID was used as the name for the host CRUSH bucket in Ceph. With this fix, `portable: false` is honored in the CephCluster and CRUSH bucket uses hostnames instead of PVC IDs for name of the host.
Clone Of:
Environment:
Last Closed: 2020-06-04 12:54:39 UTC
Target Upstream Version:
jrivera: needinfo? (kbader)
nberry: needinfo? (madam)


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift ocs-operator pull 480 0 None closed Removed hardcoded portable=true 2020-12-24 11:44:28 UTC
Github openshift ocs-operator pull 483 0 None closed Bug 1816820: Removed hardcoded portable=true 2020-12-24 11:43:55 UTC
Red Hat Product Errata RHBA-2020:2393 0 None None None 2020-06-04 12:54:53 UTC

Description Kyle Bader 2020-03-24 19:33:07 UTC
Description of problem:

If 'portable: false' is used in a StorageCluster CR, it is ignored and the PVC ID is used as the name for the host CRUSH bucket in Ceph.

Version-Release number of selected component (if applicable):

sh-4.4# ceph -v
ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)

mini:~ kyle$ oc get pod ocs-operator-66977dc7fc-gzx95 -o yaml | grep image
    image: quay.io/rhceph-dev/ocs-operator@sha256:40fd024ff48aa144df9b8147d893282069eeaa4016e16465a6b37e9d1296e4f5

Steps to Reproduce:
1. Deploy a cluster w/ LSO and 'portable: false"
2. Deploy Ceph toolbox
3. ceph osd tree

Actual results:

mini:~ kyle$ oc rsh rook-ceph-tools-5f7db56774-qslmn
sh-4.4# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME                                STATUS REWEIGHT PRI-AFF
 -1       13.64000 root default
 -5       13.64000     region us-west-2
-12        4.54700         zone us-west-2a
-11        2.27299             host ocs-deviceset-2-0-57szr
  2   ssd  2.27299                 osd.2                        up  1.00000 1.00000
-15        2.27299             host ocs-deviceset-2-1-n9vkw
  3   ssd  2.27299                 osd.3                        up  1.00000 1.00000
 -4        4.54700         zone us-west-2b
 -9        2.27299             host ocs-deviceset-0-0-wq5w4
  1   ssd  2.27299                 osd.1                        up  1.00000 1.00000
 -3        2.27299             host ocs-deviceset-0-1-wzpsh
  0   ssd  2.27299                 osd.0                        up  1.00000 1.00000
-18        4.54700         zone us-west-2c
-17        2.27299             host ocs-deviceset-1-0-c8gjz
  4   ssd  2.27299                 osd.4                        up  1.00000 1.00000
-21        2.27299             host ocs-deviceset-1-1-hv87f
  5   ssd  2.27299                 osd.5                        up  1.00000 1.00000
sh-4.4# exit


Expected results:

The instance / hostname should be used as the name for the host CRUSH bucket.

Additional info:

mini:~ kyle$ cat storagecluster-ec2.yaml
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  manageNodes: false
  monPVCTemplate:
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: gp2
      volumeMode: Filesystem
  storageDeviceSets:
  - count: 2
    dataPVCTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1
        storageClassName: localblock
        volumeMode: Block
    name: ocs-deviceset
    placement: {}
    portable: false
    replica: 3
    resources: {}

Comment 2 Yaniv Kaul 2020-03-25 08:50:49 UTC
Isn't this an upstream issue? I'm not sure this portable flag is supported downstream?

Comment 3 Travis Nielsen 2020-03-25 14:35:52 UTC
@Kyle, can you confirm if you were using OCS 4.2? This should have been fixed in 4.3 by this upstream PR: https://github.com/rook/rook/pull/4658

Comment 4 Kyle Bader 2020-03-25 14:52:24 UTC
@Travis, I can confirm I'm using 4.3

ocs-operator.v4.3.0-377.ci (rhdev-ceph)

mini:~ kyle$ oc get pods rook-ceph-operator-577cb7dfd9-5lgxc -o yaml | grep image
    image: quay.io/rhceph-dev/rook-ceph@sha256:f42ce65085719f31e23c3459d35ccff442c0eceb217fc724796c7dcb1ba829f4

@Yaniv, we set 'portable: true' with dynamically provisioned PVCs (eg. aws-ebs/vsphere-volume) in OCS, 'portable: false' should work.

Independently, we could 'notfix', but combined with this issue [1] we have a big problem. We don't need to fix both for safety, but at least one needs to be fixed for the 4.3 release.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1814681

Comment 5 Jose A. Rivera 2020-03-25 15:15:23 UTC
What version and container image of Rook-Ceph is running?

Comment 6 Kyle Bader 2020-03-25 15:32:04 UTC
It's hard for me to discern the tag because it's a private registry, but I provided the sha256 for it in the previous comment.

Comment 7 Travis Nielsen 2020-03-25 17:22:59 UTC
From chat with Kyle, the rook version is 4.3:

mini:~ kyle$ oc logs rook-ceph-operator-577cb7dfd9-5lgxc | head
2020-03-24 18:07:00.981562 I | rookcmd: starting Rook 4.3-32.07d83470.ocs_4.3 with arguments '/usr/local/bin/rook ceph operator'

I was not able to repro this on upstream rook v1.2, which is the base for 4.3. I see the host name in the CRUSH map as expected when portable: false

sh-4.2$ ceph osd tree
ID CLASS WEIGHT  TYPE NAME                        STATUS REWEIGHT PRI-AFF 
-1       0.02939 root default                                             
-5       0.02939     region us-east-1                                     
-4       0.02939         zone us-east-1a                                  
-3       0.02939             host ip-10-0-136-227                         
 0   ssd 0.00980                 osd.0                up  1.00000 1.00000 

In OCS I see that portable is forced to true under certain conditions if there is no placement specified in the storagecluster CR and if there are more topology keys:

https://github.com/openshift/ocs-operator/blob/release-4.3/pkg/controller/storagecluster/reconcile.go#L902

From Kyle in chat, he is also seeing that portable: false is not being passed on to rook:

mini:~ kyle$ oc get storagecluster -o yaml | grep port
mini:~ kyle$ oc get cephcluster -o yaml | grep port
        portable: true
        portable: true
        portable: true

@Jose can you take a look?

Comment 8 Jose A. Rivera 2020-03-25 21:19:43 UTC
Travis found the spot, yeah. Currently we can't change this behavior as the OCS console UI depends on it, so the UI needs to change first. As a workaround, setting an explicit Placement should suffice.

Comment 9 Michael Adam 2020-03-25 22:01:45 UTC
I was telling since early January or even December that we would need to rework/adapt how we're doing portable vs non-portable OSDs... :-/

Currently ocs operator does racks if it doesn't find enough zones and will set the failure domain to rack forces portable = true.
If it finds enough zone labels it will do failure domain=zone and forces portable = true.

If we look at the storagecluster crd, it explains that portable is ignored if "placement" is not set.

https://github.com/openshift/ocs-operator/blob/release-4.3/deploy/olm-catalog/ocs-operator/4.3.0/storagecluster.crd.yaml#L96

So if we rely on automatic placement, portable will be ignored. If we do manual placement, portable will be honored.
The UI doesn't do placement afaik, and so we are forcing portable = true.

Maybe we need to revert to doing manual setup (non-UI) for direct attached disks.

Anyway, I think we can live with it for the 4.3 Tech Preview. But we should fix it for 4.4.

Comment 10 Rohan CJ 2020-03-26 07:29:18 UTC
UI seems to hardcode `portable:true`, so it wouldn't interfere with UI to honor `portable:false` in the CR.

Comment 11 Michael Adam 2020-03-26 07:40:26 UTC
(In reply to Rohan CJ from comment #10)
> UI seems to hardcode `portable:true`,

@Rohan, Can you please share the code reference?

Comment 12 Michael Adam 2020-03-26 07:53:38 UTC
re-establishing severity

Comment 13 Raz Tamir 2020-03-26 07:58:20 UTC
fixing severity to high.
Set it to medium by mistake

Comment 14 Rohan CJ 2020-03-30 08:28:53 UTC
> @Rohan, Can you please share the code reference?

I just created a StorageCluster via UI and checked.

Comment 16 Jose A. Rivera 2020-04-01 13:51:33 UTC
There is currently a workaround for this, and it does not impact supported functionality in OCS 4.3 or 4.4, so pushing back to OCS 4.5. But we do still need to fix it, so giving devel_ack+.

Kyle, can you confirm the workaround is valid?

Comment 17 Raz Tamir 2020-04-12 13:22:43 UTC
This was a candidate blocker for 4.3, therefore, moving to .4.4 as a blocker candidate

Comment 20 Jose A. Rivera 2020-04-14 14:17:24 UTC
I see nowhere in the BZ history where this was marked as a blocker candidate, and this is not a functionality blocker for the product. To even expose this bug in the product would require a UI change which will not happen for the OCS 4.4 timeframe. As such, we will not be fixing this in OCS 4.4 and I am removing devel_ack+ until we push this back to OCS 4.5.

Comment 21 Raz Tamir 2020-04-16 08:42:14 UTC
True, we should have add a summary in the bug itself.
This bug was discussed as one of the candidate blockers for 4.3 as the OSD distribution was severely impacted by this.

@Michael, @Anat,

As the discussion happen with you (from Eng side), please follow up on this

Comment 22 Travis Nielsen 2020-04-16 16:43:49 UTC
This doesn't affect any of supported OCS platforms that I'm aware and there is a workaround for baremetal. Not sure either why this would be a blocker.

Comment 24 Jose A. Rivera 2020-04-21 14:01:28 UTC
Seems there was a misunderstanding: The UI no longer requires the behavior that led us to force "portable: true". as such we can remove it and honor the "portable" setting in all configurations.

I still don't like doing this so late, but there's pressure from on high to do it so reinstating devel_ack.

Comment 25 Jose A. Rivera 2020-04-21 22:28:24 UTC
Initial PR is submitted to master: https://github.com/openshift/ocs-operator/pull/480

We'll backport to release-4.4 once it merges.

Comment 26 Michael Adam 2020-04-23 09:34:45 UTC
https://github.com/openshift/ocs-operator/pull/483
backport PR

Comment 29 Jose A. Rivera 2020-04-23 20:10:17 UTC
Backport PR has merged! Ready for downstream build.

Comment 30 Michael Adam 2020-04-24 16:24:09 UTC
4.4.0-414.ci aka 4.4.0-rc2 is now available with this fix

Comment 35 Michael Adam 2020-05-04 20:54:52 UTC
not requiring doc text since this shows up mostly when using LSO based deployments, and this is the first time we are supporting it

Comment 38 errata-xmlrpc 2020-06-04 12:54:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2393


Note You need to log in before you can comment on or make changes to this bug.