Previously, as part of the Postgres upgrade, noobaa-operator configured a new init container to handle the upgrade, and when no init container already existed, the operator crashed. This was because, in OpenShift Data Foundation 4.15, the Postgres upgrade flow assumed that the noobaa-db-pg-0 pod has an init container. This assumption does not hold for systems that were originally installed with early versions of OpenShift Data Foundation (the affected cluster was first installed with version 4.4).
With this fix, the reconciliation of noobaa-db handles the case where there is no init container. As a result, the Postgres upgrade starts without crashing.
Description of problem (please be as detailed as possible and provide log snippets):
On upgrading ODF from 4.14 to 4.15, noobaa-operator panics and goes into CrashLoopBackOff (CLBO):
~~~
noobaa-operator-55954db6bf-4lrwv 0/1 Running 231 21h
~~~
The operator crashed while upgrading the DB image and running /init/dumpdb.sh:
less namespaces/openshift-storage/pods/noobaa-operator-55954db6bf-4lrwv/noobaa-operator/noobaa-operator/logs/current.log
~~~
2024-05-16T10:12:41.338813083Z time="2024-05-16T10:12:41Z" level=info msg="UpgradePostgresDB: current phase is Preparing" sys=openshift-storage/noobaa
2024-05-16T10:12:41.338813083Z time="2024-05-16T10:12:41Z" level=info msg="SetEndpointsDeploymentReplicas:: setting endpoints replica count to 0" sys=openshift-storage/noobaa
2024-05-16T10:12:41.339969703Z time="2024-05-16T10:12:41Z" level=info msg="ReconcileObject: Done - unchanged Deployment noobaa-endpoint " sys=openshift-storage/noobaa
2024-05-16T10:12:41.339969703Z time="2024-05-16T10:12:41Z" level=info msg="ReconcileSetDbImageAndInitCode:: changing DB image: registry.redhat.io/rhel8/postgresql-12@sha256:16a0cb66818ab8acb68abf40ac075eadd10a94612067769e055222dd412f0a16 and init contatiners script: /init/dumpdb.sh" sys=openshift-storage/noobaa
2024-05-16T10:12:41.343294392Z panic: runtime error: index out of range [0] with length 0 [recovered]
2024-05-16T10:12:41.343294392Z panic: runtime error: index out of range [0] with length 0
2024-05-16T10:12:41.343294392Z
2024-05-16T10:12:41.343294392Z goroutine 3003 [running]:
2024-05-16T10:12:41.343294392Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
2024-05-16T10:12:41.343321837Z /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:116 +0x1fa
2024-05-16T10:12:41.343321837Z panic({0x25b28c0, 0xc001d03728})
2024-05-16T10:12:41.343321837Z /usr/lib/golang/src/runtime/panic.go:884 +0x213
2024-05-16T10:12:41.343321837Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).ReconcileSetDbImageAndInitCode.func1()
2024-05-16T10:12:41.343334234Z /remote-source/app/pkg/system/phase2_creating.go:1519 +0x336
2024-05-16T10:12:41.343334234Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).reconcileObjectAndGetResult.func1()
2024-05-16T10:12:41.343334234Z /remote-source/app/pkg/system/reconciler.go:639 +0x22
2024-05-16T10:12:41.343357633Z sigs.k8s.io/controller-runtime/pkg/controller/controllerutil.mutate(0xc000c9f900?, {{0xc0014cc168?, 0x0?}, {0xc0015119b0?, 0x2d7bb40?}}, {0x2d9a500, 0xc000c9f900})
2024-05-16T10:12:41.343378459Z /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/controller/controllerutil/controllerutil.go:340 +0x4f
2024-05-16T10:12:41.343388804Z sigs.k8s.io/controller-runtime/pkg/controller/controllerutil.CreateOrUpdate({0x2d7bb40, 0xc000056058}, {0x2d89e40, 0xc000d0ed80}, {0x2d9a500?, 0xc000c9f900}, 0xc0001e4b10?)
2024-05-16T10:12:41.343410134Z /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/controller/controllerutil/controllerutil.go:212 +0x274
2024-05-16T10:12:41.343421242Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).reconcileObjectAndGetResult(0xc00180a780, {0x2d9a500, 0xc000c9f900}, 0xc001af3550, 0x0)
2024-05-16T10:12:41.343431520Z /remote-source/app/pkg/system/reconciler.go:636 +0x169
2024-05-16T10:12:41.343442147Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).reconcileObject(...)
2024-05-16T10:12:41.343442147Z /remote-source/app/pkg/system/reconciler.go:627
2024-05-16T10:12:41.343442147Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).ReconcileObject(...)
2024-05-16T10:12:41.343442147Z /remote-source/app/pkg/system/reconciler.go:618
2024-05-16T10:12:41.343452838Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).ReconcileSetDbImageAndInitCode(0xc00180a780, {0xc001394150, 0x6e}, {0x27627dd, 0xf}, 0x1)
2024-05-16T10:12:41.343470896Z /remote-source/app/pkg/system/phase2_creating.go:1515 +0x186
2024-05-16T10:12:41.343481167Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).UpgradePostgresDB(0xc00180a780)
2024-05-16T10:12:41.343481167Z /remote-source/app/pkg/system/phase2_creating.go:1642 +0xe51
2024-05-16T10:12:41.343491532Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).ReconcilePhaseCreatingForMainClusters(0xc00180a780)
2024-05-16T10:12:41.343491532Z /remote-source/app/pkg/system/phase2_creating.go:138 +0x465
2024-05-16T10:12:41.343503080Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).ReconcilePhaseCreating(0xc00180a780)
2024-05-16T10:12:41.343503080Z /remote-source/app/pkg/system/phase2_creating.go:66 +0x1e5
2024-05-16T10:12:41.343513196Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).ReconcilePhases(0x27f95ed?)
2024-05-16T10:12:41.343513196Z /remote-source/app/pkg/system/reconciler.go:541 +0x47
2024-05-16T10:12:41.343523620Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).Reconcile(0xc00180a780)
2024-05-16T10:12:41.343523620Z /remote-source/app/pkg/system/reconciler.go:422 +0x33b
2024-05-16T10:12:41.343534342Z github.com/noobaa/noobaa-operator/v5/pkg/controller/noobaa.Add.func1({0xc001323ce0?, 0x40dd8a?}, {{{0xc0014cc168?, 0x30?}, {0xc001f2a316?, 0x2595ae0?}}})
2024-05-16T10:12:41.343564901Z /remote-source/app/pkg/controller/noobaa/noobaa_controller.go:53 +0xe5
~~~
The operator is running on the 4.15 image; however, Postgres is still not on the updated image. The noobaa-db pod is running on the image below while noobaa-operator is trying to update the DB image to postgresql-12@sha256:16a0cb66818ab8acb68abf40ac075eadd10a94612067769e055222dd412f0a16:
~~~
image: registry.redhat.io/rhel8/postgresql-12@sha256:b96be9d3e8512046bae7d5a3e04fa151043eca051416305629b3ffd547370453
~~~
The MCG CSV is in the Installing state:
cat namespaces/openshift-storage/oc_output/csv
~~~
NAME DISPLAY VERSION REPLACES PHASE
gitlab-runner-operator.v1.21.0 GitLab Runner 1.21.0 gitlab-runner-operator.v1.18.1 Succeeded
mcg-operator.v4.15.2-rhodf NooBaa Operator 4.15.2-rhodf mcg-operator.v4.14.6-rhodf Installing
ocs-operator.v4.15.2-rhodf OpenShift Container Storage 4.15.2-rhodf ocs-operator.v4.14.6-rhodf Succeeded
odf-csi-addons-operator.v4.15.2-rhodf CSI Addons 4.15.2-rhodf odf-csi-addons-operator.v4.14.6-rhodf Succeeded
odf-operator.v4.15.2-rhodf OpenShift Data Foundation 4.15.2-rhodf odf-operator.v4.14.6-rhodf Succeeded
~~~
less namespaces/openshift-storage/operators.coreos.com/clusterserviceversions/mcg-operator.v4.15.2-rhodf.yaml
~~~
- lastTransitionTime: "2024-05-16T10:06:28Z"
lastUpdateTime: "2024-05-16T10:06:28Z"
message: 'installing: waiting for deployment noobaa-operator to become ready:
deployment "noobaa-operator" not available: Deployment does not have minimum
availability.'
phase: Installing
reason: InstallWaiting
~~~
The init container is missing in the noobaa-db pod:
$ omg get pod noobaa-db-pg-0 -o yaml | grep -i initContainers
$
From dumpdb.sh, I see it checks whether the used space of /var/lib/pgsql/data is greater than THRESHOLD (33%) and refuses to start the upgrade if so. (Note that the script's echo messages say "Free space" although $USE is actually the used percentage.)
~~~
cat /init/dumpdb.sh
set -e
sed -i -e 's/^\(postgres:[^:]\):[0-9]*:[0-9]*:/\1:10001:0:/' /etc/passwd
su postgres -c "bash -x /usr/bin/run-postgresql" &
THRESHOLD=33
USE=$(df -h --output=pcent "/$HOME/data" | tail -n 1 | tr -d '[:space:]%')
# Check if the used space is more than the threshold
if [ "$USE" -gt "$THRESHOLD" ]; then
echo "Warning: Free space $USE% is above $THRESHOLD% threshold. Can't start upgrade!"
exit 1
fi
echo "Info: Free space $USE% is below $THRESHOLD% threshold. Starting upgrade!"
until pg_isready; do sleep 1; done;
pg_dumpall -U postgres > /$HOME/data/dump.sql
exit 0
~~~
The /var/lib/pgsql filesystem is only 2% used, which is below the threshold:
~~~
$ oc rsh noobaa-db-pg-0
sh-4.4$ df -h /var/lib/pgsql/data
Filesystem Size Used Avail Use% Mounted on
/dev/rbd0 49G 776M 49G 2% /var/lib/pgsql
~~~
Version of all relevant components (if applicable):
ODF 4.15.2
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
This is causing the noobaa service to be down.
It is also preventing the customer from upgrading ODF to 4.15 on other clusters.
Is there any workaround available to the best of your knowledge?
No
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3
Is this issue reproducible?
Yes, in customer environment
Can this issue reproduce from the UI?
If this is a regression, please provide more details to justify this:
The upgrade from 4.14 to 4.15 worked fine in another cluster. The customer suspects this is a regression that only manifests on "old" ODF clusters. This cluster was first installed with version 4.4.
Steps to Reproduce:
1. Install ODF at a version earlier than 4.14 (this cluster was first installed with 4.4) and upgrade it to 4.14
2. Upgrade ODF from 4.14 to 4.15
Actual results:
noobaa-operator panics on upgrade
Expected results:
The ODF upgrade should complete smoothly.
Additional info:
The customer ran the steps in c#7, which resolved the issue:
~~~
We just added the initContainer to the noobaa-db-pg StatefulSet and that made the noobaa-operator happy again.
The upgrade of noobaa then started again and finished successfully.
~~~
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Red Hat OpenShift Data Foundation 4.15.6 Bug Fix Update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2024:6397