Bug 2222184

Summary: upgrade failed from odf 4.13.1 to 4.14 due to NON_EXISTING_ROOT_KEY in noobaa-core and CSI_ENABLE_TOPOLOGY not found in rook-ceph-operator
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Vijay Avuthu <vavuthu>
Component: Multi-Cloud Object Gateway
Assignee: Liran Mauda <lmauda>
Status: ASSIGNED
QA Contact: Vijay Avuthu <vavuthu>
Severity: urgent
Priority: unspecified
Version: 4.14
CC: kramdoss, muagarwa, nbecker, odf-bz-bot
Keywords: Automation, Regression
Target Release: ODF 4.14.0
Flags: nbecker: needinfo? (vavuthu)
Hardware: Unspecified
OS: Unspecified
Type: Bug

Description Vijay Avuthu 2023-07-12 07:31:26 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

upgrade failed from 4.13.1 to 4.14.0-61


Version of all relevant components (if applicable):
openshift installer (4.14.0-0.nightly-2023-07-11-092038)
upgrade from 4.13.1 to 4.14.0-61

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
2/2

Can this issue be reproduced from the UI?
Not tried


If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Install ODF 4.13.1 and upgrade to 4.14.0-61 using ocs-ci
2. Verify all CSVs

Actual results:
$ oc get csv
NAME                                        DISPLAY                       VERSION            REPLACES                                PHASE
mcg-operator.v4.14.0-61.stable              NooBaa Operator               4.14.0-61.stable   mcg-operator.v4.13.1-rhodf              Succeeded
ocs-operator.v4.13.1-rhodf                  OpenShift Container Storage   4.13.1-rhodf       ocs-operator.v4.13.0-rhodf              Replacing
ocs-operator.v4.14.0-61.stable              OpenShift Container Storage   4.14.0-61.stable   ocs-operator.v4.13.1-rhodf              Failed
odf-csi-addons-operator.v4.14.0-61.stable   CSI Addons                    4.14.0-61.stable   odf-csi-addons-operator.v4.13.1-rhodf   Succeeded
odf-operator.v4.14.0-61.stable              OpenShift Data Foundation     4.14.0-61.stable   odf-operator.v4.13.1-rhodf              Succeeded


Expected results:
All CSVs should be in the Succeeded state.
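A verification step of the kind ocs-ci performs can be sketched as follows (illustrative only; this is not ocs-ci's actual check, and the `openshift-storage` namespace is assumed):

```shell
# List every CSV in the storage namespace with its phase, and flag any
# that are not Succeeded. Namespace is an assumption for this cluster.
oc get csv -n openshift-storage \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' \
  | awk '$2 != "Succeeded" {print "not ready:", $0; bad=1} END {exit bad}'
```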


Additional info:

$ oc describe csv ocs-operator.v4.14.0-61.stable

Events:
  Type     Reason               Age                    From                        Message
  ----     ------               ----                   ----                        -------
  Normal   RequirementsUnknown  12m                    operator-lifecycle-manager  requirements not yet checked
  Normal   RequirementsNotMet   11m                    operator-lifecycle-manager  one or more requirements couldn't be found
  Normal   InstallWaiting       11m                    operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: deployment "ocs-operator" not available: Deployment does not have minimum availability.
  Warning  InstallCheckFailed   6m34s (x2 over 6m34s)  operator-lifecycle-manager  install timeout
  Normal   NeedsReinstall       6m33s (x2 over 6m34s)  operator-lifecycle-manager  installing: waiting for deployment rook-ceph-operator to become ready: deployment "rook-ceph-operator" not available: Deployment does not have minimum availability.
  Normal   AllRequirementsMet   6m30s (x4 over 11m)    operator-lifecycle-manager  all requirements found, attempting install
  Normal   InstallSucceeded     6m30s (x2 over 11m)    operator-lifecycle-manager  waiting for install components to report healthy
  Normal   InstallWaiting       6m29s (x3 over 11m)    operator-lifecycle-manager  installing: waiting for deployment rook-ceph-operator to become ready: deployment "rook-ceph-operator" not available: Deployment does not have minimum availability.
  Warning  InstallCheckFailed   92s (x2 over 92s)      operator-lifecycle-manager  install failed: deployment rook-ceph-operator not ready before timeout: deployment "rook-ceph-operator" exceeded its progress deadline

> pods not in Running state
$ oc get pods | egrep -v "Running|Completed"
NAME                                                              READY   STATUS                       RESTARTS      AGE
noobaa-core-0                                                     0/1     CrashLoopBackOff             7 (25s ago)   11m
rook-ceph-operator-6fd47df694-gwtqz                               0/1     CreateContainerConfigError   0             12m

> $ oc describe pod rook-ceph-operator-6fd47df694-gwtqz


Events:
  Type     Reason          Age                  From               Message
  ----     ------          ----                 ----               -------
  Normal   Scheduled       17m                  default-scheduler  Successfully assigned openshift-storage/rook-ceph-operator-6fd47df694-gwtqz to compute-1
  Normal   AddedInterface  17m                  multus             Add eth0 [10.131.0.47/23] from ovn-kubernetes
  Normal   Pulling         17m                  kubelet            Pulling image "registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:d19bd025bd17d8db3f918ed8ef65188a4a1d58f7756bb15d7e0504ee5fcf26cb"
  Normal   Pulled          16m                  kubelet            Successfully pulled image "registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:d19bd025bd17d8db3f918ed8ef65188a4a1d58f7756bb15d7e0504ee5fcf26cb" in 15.950938092s (15.950952025s including waiting)
  Warning  Failed          14m (x12 over 16m)   kubelet            Error: couldn't find key CSI_ENABLE_TOPOLOGY in ConfigMap openshift-storage/ocs-operator-config
  Normal   Pulled          112s (x71 over 16m)  kubelet            Container image "registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:d19bd025bd17d8db3f918ed8ef65188a4a1d58f7756bb15d7e0504ee5fcf26cb" already present on machine
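A possible way to confirm, and temporarily work around, the missing key, sketched from the event above. The ConfigMap name comes from the kubelet message; the patch value "false" is an assumption, since ocs-operator is normally the component that reconciles this key:

```shell
# Confirm that the key referenced by the rook-ceph-operator container env
# is actually absent from the ConfigMap named in the kubelet event.
oc get configmap ocs-operator-config -n openshift-storage -o yaml

# Debug-only unblock (hypothetical): add the key by hand so the pod can
# start. ocs-operator should normally set this during its reconcile.
oc patch configmap ocs-operator-config -n openshift-storage \
  --type merge -p '{"data":{"CSI_ENABLE_TOPOLOGY":"false"}}'
```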

> $ oc describe pod noobaa-core-0
Name:             noobaa-core-0
Namespace:        openshift-storage
Priority:         0
Service Account:  noobaa
Node:             compute-2/10.1.112.178
Start Time:       Wed, 12 Jul 2023 12:42:39 +0530
Labels:           app=noobaa
                  controller-revision-hash=noobaa-core-5656895cf5
                  noobaa-core=noobaa
                  noobaa-mgmt=noobaa
                  statefulset.kubernetes.io/pod-name=noobaa-core-0
Events:
  Type     Reason          Age                 From               Message
  ----     ------          ----                ----               -------
  Normal   Scheduled       16m                 default-scheduler  Successfully assigned openshift-storage/noobaa-core-0 to compute-2
  Normal   AddedInterface  16m                 multus             Add eth0 [10.128.2.35/23] from ovn-kubernetes
  Normal   Pulled          15m (x5 over 16m)   kubelet            Container image "registry.redhat.io/odf4/mcg-core-rhel9@sha256:b51a63bc588431acc0306703a99562e7c2c35266bf90d16b69146944911728cd" already present on machine
  Normal   Created         15m (x5 over 16m)   kubelet            Created container core
  Normal   Started         15m (x5 over 16m)   kubelet            Started container core
  Warning  BackOff         93s (x70 over 16m)  kubelet            Back-off restarting failed container core in pod noobaa-core-0_openshift-storage(48b662a5-ce7c-4552-b7c9-7c197b852268)

The job is still running; I will update once the must-gather is collected. The kubeconfig has been provided to dev for live debugging.

job: https://url.corp.redhat.com/c447398

Comment 3 Vijay Avuthu 2023-07-12 07:33:09 UTC
> noobaa-core-0 pod log

Jul-12 7:28:57.433 [Upgrade/20] [ERROR] core.server.system_services.system_store:: SystemStore: load failed Error: NON_EXISTING_ROOT_KEY
    at MasterKeysManager.load_root_key (/root/node_modules/noobaa-core/src/server/system_services/master_key_manager.js:64:40)
    at /root/node_modules/noobaa-core/src/server/system_services/system_store.js:414:41
    at Semaphore.surround (/root/node_modules/noobaa-core/src/util/semaphore.js:71:90)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Jul-12 7:28:57.433 [Upgrade/20] [ERROR] core.server.system_services.system_store:: SystemStore: load failed Error: NON_EXISTING_ROOT_KEY
    at MasterKeysManager.load_root_key (/root/node_modules/noobaa-core/src/server/system_services/master_key_manager.js:64:40)
    at /root/node_modules/noobaa-core/src/server/system_services/system_store.js:414:41
    at Semaphore.surround (/root/node_modules/noobaa-core/src/util/semaphore.js:71:90)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Jul-12 7:28:57.434 [Upgrade/20] [ERROR] UPGRADE:: failed to load system store!! Error: NON_EXISTING_ROOT_KEY
    at MasterKeysManager.load_root_key (/root/node_modules/noobaa-core/src/server/system_services/master_key_manager.js:64:40)
    at /root/node_modules/noobaa-core/src/server/system_services/system_store.js:414:41
    at Semaphore.surround (/root/node_modules/noobaa-core/src/util/semaphore.js:71:90)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Jul-12 7:28:57.434 [Upgrade/20] [ERROR] UPGRADE:: failed to init upgrade process!!
upgrade_manager failed with exit code 1
noobaa_init.sh finished
noobaa_init failed with exit code 1. aborting
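A starting point for debugging NON_EXISTING_ROOT_KEY might look like this (the secret naming is an assumption based on noobaa keeping its root master key in a Kubernetes secret; confirm the exact name for this version):

```shell
# Look for the root master key material the core pod fails to load, and
# check how the statefulset mounts or references it.
oc get secrets -n openshift-storage | grep -i master-key
oc describe statefulset noobaa-core -n openshift-storage | grep -i key
```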

Comment 4 Nitin Goyal 2023-07-12 08:44:59 UTC
We looked at the setup and found a version mismatch error:

"Storage cluster version (4.13.1) is higher than the OCS Operator version (4.13.0)"

This is due to a PR merged a few days ago: https://github.com/red-hat-storage/ocs-operator/pull/2089. The downstream Dockerfile needs to be updated to use the package's new path when passing the ldflags that set the version. We have already notified Boris; he will fix the downstream Dockerfile, which will resolve the rook issue.
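The version stamping referred to above works through Go's -X linker flag: if the version package's import path changed in the PR, a stale path in the downstream Dockerfile leaves the binary reporting the old default version. A rough sketch (the import path shown is an assumption, not the exact downstream build line):

```shell
# Illustrative only: the -X flag must name the *current* import path of
# the version variable; a moved package makes the old path a silent no-op.
go build \
  -ldflags "-X github.com/red-hat-storage/ocs-operator/v4/version.Version=4.13.1" \
  -o ocs-operator .
```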

For the noobaa issue, someone from the noobaa team needs to take a look, so I am moving this to the noobaa component.