Bug 2124379

Summary: ODF4.12 Installation, ocs-operator.v4.12.0 and mcg-operator.v4.12.0 failed
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Oded <oviner>
Component: ocs-operator
Assignee: umanga <uchapaga>
Status: CLOSED CURRENTRELEASE
QA Contact: Oded <oviner>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.12
CC: mparida, muagarwa, nberry, ocs-bugs, odf-bz-bot, sostapov, tnielsen, uchapaga, vavuthu
Target Milestone: ---
Target Release: ODF 4.12.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Clones: 2124591, 2124593 (view as bug list)
Environment:
Last Closed: 2023-02-08 14:06:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2124591, 2124593

Description Oded 2022-09-05 22:26:20 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
ODF 4.12 installation: ocs-operator.v4.12.0 and mcg-operator.v4.12.0 failed.

Version of all relevant components (if applicable):
OCP Version: 4.12.0-0.nightly-2022-09-05-090751
ODF Version: 4.12.0-29
Provider: Tested on AWS and VMware

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:

Test Process:
1. Install ODF 4.12 via the UI:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr6408b3472/jnk-pr6408b3472_20220905T195700/logs/ui_logs_dir_1662411046/screenshots_ui/test_deployment/

2. Disable the default source redhat-operators:
$ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge
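A quick way to confirm the default source was removed (illustrative command, not part of the original run) is to list the catalog sources; redhat-operators should no longer appear until the custom one is added in the next step:
$ oc get catalogsource -n openshift-marketplace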

3. Add a CatalogSource:
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  labels:
    ocs-operator-internal: 'true'
  name: redhat-operators
  namespace: openshift-marketplace
spec:
  displayName: Openshift Container Storage
  icon:
    base64data: ''
    mediatype: ''
  image: quay.io/rhceph-dev/ocs-registry:4.12.0-29
  priority: 100
  publisher: Red Hat
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 15m

$ oc apply -f /tmp/catalog_source_manifestjjp4iruz

4. Verify the CatalogSource redhat-operators is in the READY state:
$ oc get CatalogSource redhat-operators -n openshift-marketplace -o yaml
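For a quicker check than reading the full YAML, the connection state can be queried directly (illustrative invocation, not from the original run); it should print READY:
$ oc get catalogsource redhat-operators -n openshift-marketplace -o jsonpath='{.status.connectionState.lastObservedState}'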

5. Check the CSVs:
$ oc get csv -A
NAMESPACE                              NAME                              DISPLAY                       VERSION   REPLACES   PHASE
openshift-operator-lifecycle-manager   packageserver                     Package Server                0.19.0               Succeeded
openshift-storage                      mcg-operator.v4.12.0              NooBaa Operator               4.12.0               Failed
openshift-storage                      ocs-operator.v4.12.0              OpenShift Container Storage   4.12.0               Failed
openshift-storage                      odf-csi-addons-operator.v4.12.0   CSI Addons                    4.12.0               Succeeded
openshift-storage                      odf-operator.v4.12.0              OpenShift Data Foundation     4.12.0               Succeeded


$ oc get csv ocs-operator.v4.12.0 
NAME                   DISPLAY                       VERSION   REPLACES   PHASE
ocs-operator.v4.12.0   OpenShift Container Storage   4.12.0               Failed
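The events for the failed CSV below were presumably gathered with a describe; an illustrative invocation (not verbatim from the original run):
$ oc -n openshift-storage describe csv ocs-operator.v4.12.0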

Events:
  Type     Reason               Age                From                        Message
  ----     ------               ----               ----                        -------
  Normal   RequirementsUnknown  46m                operator-lifecycle-manager  requirements not yet checked
  Normal   RequirementsNotMet   46m                operator-lifecycle-manager  one or more requirements couldn't be found
  Normal   InstallWaiting       46m                operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: deployment "ocs-operator" not available: Deployment does not have minimum availability.
  Warning  InstallCheckFailed   41m                operator-lifecycle-manager  install timeout
  Normal   NeedsReinstall       41m (x2 over 41m)  operator-lifecycle-manager  installing: waiting for deployment rook-ceph-operator to become ready: deployment "rook-ceph-operator" not available: Deployment does not have minimum availability.
  Normal   AllRequirementsMet   41m (x3 over 46m)  operator-lifecycle-manager  all requirements found, attempting install
  Normal   InstallSucceeded     41m (x3 over 46m)  operator-lifecycle-manager  waiting for install components to report healthy
  Normal   InstallWaiting       41m (x3 over 45m)  operator-lifecycle-manager  installing: waiting for deployment rook-ceph-operator to become ready: deployment "rook-ceph-operator" not available: Deployment does not have minimum availability.
  Warning  InstallCheckFailed   36m                operator-lifecycle-manager  install failed: deployment rook-ceph-operator not ready before timeout: deployment "rook-ceph-operator" exceeded its progress deadline
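To see why the rook-ceph-operator deployment never reached minimum availability, its status and pods could be inspected next (illustrative commands, not from the original run):
$ oc -n openshift-storage describe deployment rook-ceph-operator
# pod label below is an assumption
$ oc -n openshift-storage get pods -l app=rook-ceph-operator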
  
6. Check the StorageCluster:
$ oc get storageclusters.ocs.openshift.io 
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   32m   Error              2022-09-05T19:32:18Z   4.11.0
Status:
  Conditions:
    Last Heartbeat Time:   2022-09-05T19:54:12Z
    Last Transition Time:  2022-09-05T19:32:19Z
    Message:               Error while reconciling: some StorageClasses [ocs-storagecluster-cephfs,ocs-storagecluster-ceph-rbd] were skipped while waiting for pre-requisites to be met
    Reason:                ReconcileFailed
    Status:                False
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2022-09-05T19:32:19Z
    Last Transition Time:  2022-09-05T19:32:19Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2022-09-05T19:32:19Z
    Last Transition Time:  2022-09-05T19:32:19Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2022-09-05T19:32:19Z
    Last Transition Time:  2022-09-05T19:32:19Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2022-09-05T19:32:19Z
    Last Transition Time:  2022-09-05T19:32:19Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable
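Since the reconcile error reports the ocs-storagecluster-cephfs and ocs-storagecluster-ceph-rbd StorageClasses being skipped, a natural follow-up (illustrative commands, not from the original run) is to check whether the CephCluster and the expected StorageClasses were created:
$ oc -n openshift-storage get cephcluster
$ oc get storageclass | grep ocs-storagecluster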



Actual results:


Expected results:


Additional info:
OCP+ODF Must Gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr6408b3472/jnk-pr6408b3472_20220905T195700/logs/failed_testcase_ocs_logs_1662411046/deployment_ocs_logs/

Comment 5 Mudit Agarwal 2022-09-06 15:00:52 UTC
Keeping this BZ for ocs-metrics-exporter; two BZs have been cloned, one for rook and another for noobaa.

Comment 11 umanga 2022-10-13 06:31:06 UTC
https://github.com/red-hat-storage/ocs-operator/pull/1813 removes privileged access from ocs-metrics-exporter and should fix these SCC errors. Any of the latest 4.12 builds can be used to test.
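A possible way to verify this on a new build (illustrative; the deployment name and field path are assumptions) is to check that the exporter containers no longer set a privileged security context:
# deployment name assumed to be ocs-metrics-exporter
$ oc -n openshift-storage get deployment ocs-metrics-exporter -o jsonpath='{.spec.template.spec.containers[*].securityContext}'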

Comment 12 Oded 2022-10-20 10:38:36 UTC
Bug fixed.
PR validation job passed:
https://github.com/red-hat-storage/ocs-ci/pull/6573/files

OCP Version: 4.12.0-0.nightly-2022-10-18-192348
ODF Version: 4.12.0-77
Provider: VMware

Comment 13 Oded 2022-10-24 12:58:48 UTC
ODF 4.12 installation failed on AWS_UPI_RHEL without the following workaround (WA):

$ kubectl label --overwrite ns openshift-storage \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/warn=baseline \
  pod-security.kubernetes.io/audit=baseline
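The labels can be confirmed after applying the workaround (illustrative command, not from the original run):
$ oc get ns openshift-storage --show-labels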


SetUp:
OCP Version: 4.12
ODF Version: 4.12
Provider: AWS_RHEL_UPI

OCP MG:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr6573b3719/jnk-pr6573b3719_20221024T112000/logs/failed_testcase_ocs_logs_1666610685/deployment_ocs_logs/

Jenkins Job:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-test-pr/3719/testReport/tests.ecosystem.deployment/test_deployment/test_deployment/

Comment 15 Mudit Agarwal 2022-10-31 02:58:05 UTC
This is now fixed in OLM, please try with the latest build.

Comment 16 Oded 2022-10-31 11:17:40 UTC
Bug reproduced on the latest version.

OCP Version: 4.12
ODF Version: 4.12
Provider: AWS_UPI


https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-test-pr/3733/testReport/tests.ecosystem.deployment/test_deployment/test_deployment/


Failed on setup with:
ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n default create -f /tmp/POD_43ysp0qc -o yaml.
Error is Error from server (Forbidden): error when creating "/tmp/POD_43ysp0qc": pods "rhel-ansible" is forbidden: violates PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "rhel" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "rhel" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "rhel" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "rhel" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "rhel" must not set runAsUser=0), seccompProfile (pod or container "rhel" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

Comment 17 Malay Kumar parida 2022-11-04 05:36:21 UTC
Hi Oded, I see you are trying to create a pod in the default namespace, and that is the cause of the error.

With the latest OLM changes, only namespaces prefixed with openshift- are labelled automatically; other namespaces are left untouched. Since the namespace in question is not an openshift-* namespace, you hit this error.

I am not sure how the installation happens with the different methods, but it seems that if we want to use the default namespace, we have to label it beforehand.
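A sketch of that labelling, mirroring the openshift-storage workaround from comment 13 (whether these exact pod-security levels are appropriate for the default namespace is an assumption):
# levels copied from the openshift-storage workaround; adjust as needed
$ oc label --overwrite ns default \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/warn=baseline \
  pod-security.kubernetes.io/audit=baseline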

Comment 18 Oded 2022-11-07 07:56:49 UTC
Bug fixed.
The rhel-ansible pod is part of the OCS-CI infra.
PR validation job passed on AWS_UPI: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-test-pr/3744/