Bug 2152839 - [External Mode] If cert-manager is present when ODF gets installed a "rook-ceph-webhook" gets created that blocks deployment. [NEEDINFO]
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.12.0
Assignee: Subham Rai
QA Contact: Vijay Avuthu
URL:
Whiteboard:
Depends On:
Blocks: 2154163 2154164 2154165
Reported: 2022-12-13 08:46 UTC by Oscar Lindholm
Modified: 2023-08-09 17:03 UTC (History)

Fixed In Version: 4.12.0-156
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2154163 2154165 (view as bug list)
Environment:
Last Closed: 2023-02-08 14:06:28 UTC
Embargoed:
srai: needinfo? (mduasope)


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | red-hat-storage rook pull 442 | 0 | None | open | Bug 2152839: disable webhook | 2022-12-16 08:14:49 UTC
Github | rook rook pull 11432 | 0 | None | Merged | operator: disable webhook by default | 2022-12-16 03:41:23 UTC
Github | rook rook pull 11448 | 0 | None | open | operator: adding logs for debugging | 2022-12-16 03:41:23 UTC

Description Oscar Lindholm 2022-12-13 08:46:17 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

When setting up ODF for an external Ceph cluster on an OpenShift cluster where cert-manager is running, a validating webhook gets created. This validating webhook is called rook-ceph-webhook, and it denies the .json configuration with the following error message:
```
Error while reconciling: admission webhook "cephcluster-wh-rook-ceph-admission-controller-openshift-storage.rook.io" denied the request: invalid create : external mode enabled cannot have mon,dashboard,monitoring,network,disruptionManagement,storage fields in CR
``` 
Link to webhook code: https://github.com/rook/rook/blob/master/pkg/apis/ceph.rook.io/v1/cluster.go#L49
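
For illustration, the kind of check that produces this denial can be sketched as follows. This is a minimal Python sketch and not Rook's actual code (the real check is written in Go at the link above and may differ in detail). The field list is taken from the error message; the function name and the dict-based spec layout are hypothetical.

```python
from typing import List

# Fields named in the webhook's error message. Treating any non-empty value
# as "set" is an assumption made for this sketch.
FORBIDDEN_IN_EXTERNAL_MODE = (
    "mon",
    "dashboard",
    "monitoring",
    "network",
    "disruptionManagement",
    "storage",
)


def forbidden_fields_set(spec: dict) -> List[str]:
    """Return the forbidden fields that are set in an external-mode spec.

    An empty list means the spec would pass this (hypothetical) check.
    """
    external = spec.get("external", {}).get("enable", False)
    if not external:
        # The restriction only applies when external mode is enabled.
        return []
    return [name for name in FORBIDDEN_IN_EXTERNAL_MODE if spec.get(name)]


# An external-mode spec that also sets "monitoring" would be rejected.
rejected = forbidden_fields_set(
    {"external": {"enable": True}, "monitoring": {"enabled": True}}
)
```

Here `rejected` is `["monitoring"]`, which corresponds to the situation described above: the CR combines external mode with one of the listed fields, so the webhook refuses the create request.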

If the created webhook is deleted, the Ceph cluster connects to OpenShift as normal, with no observed issues. The connection keeps working even after the webhook is reapplied to the OpenShift cluster.

If cert-manager is not already installed, installing the ODF operator does not create the webhook at all. However, if cert-manager is installed afterwards and the ODF operator is then upgraded, the webhook gets created.

Upgrades with the webhook present do not seem to affect the Ceph cluster in any way. So far we have not observed any issues when upgrading the ODF operator with cert-manager installed. Thus, as far as we can see, the issue only affects the initial connection to the Ceph cluster.


Version of all relevant components (if applicable):

OpenShift
4.11.3
4.11.5
ODF
4.11.3
4.11.4
Cert-Manager
quay.io/jetstack/cert-manager-controller:v1.10.0


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

This blocks the integration of ODF external with OCP. 


Is there any workaround available to the best of your knowledge?

* Install OCP cluster
* Install ODF operator
  - Observe that rook-ceph-webhook is not created.
* Connect to External ceph cluster
* Add cert-manager helm chart.
  - Observe that the rook-ceph-webhook is not created
  - On the next ODF operator upgrade, the rook-ceph-webhook will be created
* This allows for upgrades of ODF operator, without losing the connection to the external ceph cluster.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1

Is this issue reproducible?

Yes.

Can this issue reproduce from the UI?

If a current OCP cluster exists, then yes.


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCP cluster
2. Add the cert-manager helm chart
3. Install the ODF operator
  - Observe that the rook-ceph-webhook gets created
4. Try to connect the ODF external ceph cluster.



Actual results:

ODF operator starts using cert-manager to set up resources in the cluster (Issuer, Certificate). It then creates the "rook-ceph-webhook" which blocks the integration of ODF external.

Expected results:

ODF should not start using other software in the cluster, make unannounced changes to the cluster, or block the connection to the external Ceph cluster.

Additional info:
Support case: https://access.redhat.com/support/cases/#/case/03381863

Comment 5 Subham Rai 2022-12-13 10:59:38 UTC
As mentioned in gChat, we'll disable the webhook in the downstream product until we officially support it.

Yes, we'll backport as far back as 4.10, or earlier if possible.

Comment 6 Travis Nielsen 2022-12-13 15:10:35 UTC
Marking as a blocker since it affects the first install experience, the workaround is difficult, and the fix is simple and low risk.

Comment 7 Subham Rai 2022-12-16 03:31:37 UTC
(In reply to Subham Rai from comment #5)
> As mentioned in gChat, we'll disable the webhook in the downstream product
> until we officially support it.
> 
> Yes, we'll backport as far back as 4.10, or earlier if possible.


The above issue does not exist in 4.10, since the webhook with cert-manager was only introduced in 4.11. So I'll backport the fix only as far back as 4.11.

Comment 14 Vijay Avuthu 2023-01-18 06:45:09 UTC
Update:
==========

Verified with the versions below:

openshift installer (4.12.0-0.nightly-2023-01-10-062211)
ocs-registry:4.12.0-167

1. install OCP (4.12.0-0.nightly-2023-01-10-062211)
2. install cert-manager
3. deploy ODF ( 4.12.0-167 )

ODF deployment is successful, without any issues.

> no rook-ceph-webhook is created

$ oc get validatingwebhookconfigurations.admissionregistration.k8s.io
NAME                                                 WEBHOOKS   AGE
admissionwebhook.noobaa.io-2hrfx                     1          13m
alertmanagerconfigs.openshift.io                     1          157m
autoscaling.openshift.io                             2          167m
cert-manager-webhook                                 1          24m
cluster-baremetal-validating-webhook-configuration   1          167m
controlplanemachineset.machine.openshift.io          1          167m
machine-api                                          2          168m
multus.openshift.io                                  1          170m
performance-addon-operator                           1          172m
prometheusrules.openshift.io                         1          157m
snapshot.storage.k8s.io                              1          168m
validation.csi.vsphere.vmware.com                    1          167m
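
The same check can be scripted rather than read by eye. Below is a minimal Python sketch that parses the text output of the command above and reports whether rook-ceph-webhook is listed. The function name is hypothetical, the sample is a shortened copy of the listing above, and the parsing assumes the default table output with NAME as the first column.

```python
def webhook_names(table_text):
    """Return the NAME column from the command's default table output."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    # The first non-empty line is the header (NAME WEBHOOKS AGE); skip it.
    return [line.split()[0] for line in lines[1:]]


# Shortened copy of the listing above, used only as sample input.
sample = """\
NAME                                                 WEBHOOKS   AGE
admissionwebhook.noobaa.io-2hrfx                     1          13m
cert-manager-webhook                                 1          24m
snapshot.storage.k8s.io                              1          168m
"""

names = webhook_names(sample)
rook_webhook_present = "rook-ceph-webhook" in names
```

With the sample input, `rook_webhook_present` is `False`, matching the verification result.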

$ oc get csv
NAME                              DISPLAY                                       VERSION   REPLACES   PHASE
mcg-operator.v4.12.0              NooBaa Operator                               4.12.0               Succeeded
ocs-operator.v4.12.0              OpenShift Container Storage                   4.12.0               Succeeded
odf-csi-addons-operator.v4.12.0   CSI Addons                                    4.12.0               Succeeded
odf-operator.v4.12.0              OpenShift Data Foundation                     4.12.0               Succeeded
openshift-cert-manager.v1.7.1     cert-manager Operator for Red Hat OpenShift   1.7.1-1              Succeeded

$ oc get storagesystem
NAME                                        STORAGE-SYSTEM-KIND                  STORAGE-SYSTEM-NAME
ocs-external-storagecluster-storagesystem   storagecluster.ocs.openshift.io/v1   ocs-external-storagecluster

$ oc get storagecluster
NAME                          AGE     PHASE   EXTERNAL   CREATED AT             VERSION
ocs-external-storagecluster   7m37s   Ready   true       2023-01-17T09:20:50Z   4.12.0

$ oc describe storagecluster
Name:         ocs-external-storagecluster
Namespace:    openshift-storage

Status:
  Conditions:
    Last Heartbeat Time:   2023-01-17T09:28:27Z
    Last Transition Time:  2023-01-17T09:20:51Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2023-01-17T09:28:27Z
    Last Transition Time:  2023-01-17T09:22:49Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  Available
    Last Heartbeat Time:   2023-01-17T09:28:27Z
    Last Transition Time:  2023-01-17T09:22:49Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                False
    Type:                  Progressing
    Last Heartbeat Time:   2023-01-17T09:28:27Z
    Last Transition Time:  2023-01-17T09:20:51Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2023-01-17T09:28:27Z
    Last Transition Time:  2023-01-17T09:22:49Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  Upgradeable
  External Secret Hash:    054181a0aa3997cfe5a1c170a0f97cebda33176bfaf358ad2a1648ffccfb1f25265818fc446f8c4831eacff2dc0715e5d621177a3c3bf632779585c621c86d6f
  External Storage:
    Granted Capacity:  0
  Images:
    Ceph:
      Desired Image:  quay.io/rhceph-dev/rhceph@sha256:957294824e1cbf89ca24a1a2aa2a8e8acd567cfb5a25535e2624989ad1046a60
    Noobaa Core:
      Actual Image:   quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:82bcc82a78933ae759127bc8917edbe91737c41e01f18638278f939a4548c8d3
      Desired Image:  quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:82bcc82a78933ae759127bc8917edbe91737c41e01f18638278f939a4548c8d3
    Noobaa DB:
      Actual Image:   quay.io/rhceph-dev/rhel8-postgresql-12@sha256:3d805540d777b09b4da6df99e7cddf9598d5ece4af9f6851721a9961df40f5a1
      Desired Image:  quay.io/rhceph-dev/rhel8-postgresql-12@sha256:3d805540d777b09b4da6df99e7cddf9598d5ece4af9f6851721a9961df40f5a1
  Kms Server Connection:
  Phase:  Ready
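
The condition block above can also be checked programmatically. The following is a minimal Python sketch, assuming the conditions are available as a list of dicts with "type" and "status" keys (the usual shape of status.conditions in JSON output). The expected values are simply those seen in the output above, not an official definition of a healthy StorageCluster, and the function name is hypothetical.

```python
# Expected status per condition type, as read from the describe output above.
# This mapping is an assumption made for the sketch.
EXPECTED = {
    "ReconcileComplete": "True",
    "Available": "True",
    "Progressing": "False",
    "Degraded": "False",
    "Upgradeable": "True",
}


def unexpected_conditions(conditions):
    """Return {type: status} for conditions that are missing or differ from EXPECTED."""
    found = {c["type"]: c["status"] for c in conditions}
    return {
        ctype: found.get(ctype)
        for ctype, want in EXPECTED.items()
        if found.get(ctype) != want
    }


# Conditions as shown in the describe output above.
observed = [
    {"type": "ReconcileComplete", "status": "True"},
    {"type": "Available", "status": "True"},
    {"type": "Progressing", "status": "False"},
    {"type": "Degraded", "status": "False"},
    {"type": "Upgradeable", "status": "True"},
]
problems = unexpected_conditions(observed)
```

For the conditions shown above, `problems` is an empty dict. A missing condition type is reported with a value of `None`.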

> From the rook-ceph-operator log, the webhook resources are deleted as expected.

2023-01-17 09:10:55.226404 I | rookcmd: starting Rook v4.12.0-0.f4e99907f9b9f05a190303465f61d12d5d24cace with arguments '/usr/local/bin/rook ceph operator'
2023-01-17 09:10:55.226461 I | rookcmd: flag values: --enable-machine-disruption-budget=false, --help=false, --kubeconfig=, --log-level=INFO, --operator-image=, --service-account=
2023-01-17 09:10:55.226464 I | cephcmd: starting Rook-Ceph operator
2023-01-17 09:10:55.357359 I | cephcmd: base ceph version inside the rook operator image is "ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)"
2023-01-17 09:10:55.371796 I | op-k8sutil: ROOK_CURRENT_NAMESPACE_ONLY="true" (env var)
2023-01-17 09:10:55.371812 I | operator: watching the current namespace "openshift-storage" for a Ceph CRs
2023-01-17 09:10:55.371845 I | operator: setting up schemes
2023-01-17 09:10:55.373409 I | operator: setting up the controller-runtime manager
I0117 09:10:56.424146       1 request.go:601] Waited for 1.04665259s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/tuned.openshift.io/v1?timeout=32s
2023-01-17 09:10:58.227602 I | operator: delete webhook resources since webhook is disabled
2023-01-17 09:10:58.227619 I | operator: deleting validating webhook rook-ceph-webhook
2023-01-17 09:10:58.231537 I | operator: deleting webhook cert manager Certificate rook-admission-controller-cert
2023-01-17 09:10:58.289564 I | operator: deleting webhook cert manager Issuer %sselfsigned-issuer
2023-01-17 09:10:58.393056 I | operator: deleting validating webhook service %srook-ceph-admission-controller
2023-01-17 09:10:58.396828 I | ceph-cluster-controller: successfully started
2023-01-17 09:10:58.396894 I | ceph-cluster-controller: hotplug orchestration disabled
2023-01-17 09:10:58.396903 I | ceph-crashcollector-controller: successfully started
2023-01-17 09:10:58.396920 I | ceph-block-pool-controller: successfully started
2023-01-17 09:10:58.396934 I | ceph-object-store-user-controller: successfully started
2023-01-17 09:10:58.396947 I | ceph-object-realm-controller: successfully started

> The logs contain a literal %s; I will raise a separate bug for the logging issue.
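
One plausible cause of the literal %s, offered as an assumption rather than a confirmed diagnosis: a format string was passed to a print-style logging call that concatenates its arguments instead of a printf-style call that substitutes them. The Python sketch below only mimics that difference; `info` and `infof` are hypothetical stand-ins and not Rook's logger API.

```python
def info(*args):
    """Print-style stand-in: concatenates arguments, leaving format verbs untouched."""
    return "".join(str(arg) for arg in args)


def infof(fmt, *args):
    """Printf-style stand-in: substitutes arguments into the format string."""
    return fmt % args


# The print-style call reproduces the text seen in the operator log.
broken = info("deleting webhook cert manager Issuer %s", "selfsigned-issuer")

# The printf-style call gives what was presumably intended.
intended = infof("deleting webhook cert manager Issuer %s", "selfsigned-issuer")
```

Here `broken` is "deleting webhook cert manager Issuer %sselfsigned-issuer", the same text as in the log above, while `intended` is "deleting webhook cert manager Issuer selfsigned-issuer".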

