Bug 2281839 - [GSS] noobaa-operator panics on upgrading ODF 4.14 to 4.15
Summary: [GSS] noobaa-operator panics on upgrading ODF 4.14 to 4.15
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.15
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.15.6
Assignee: Nimrod Becker
QA Contact: Uday kurundwade
URL:
Whiteboard:
Duplicates: 2281604 2293068
Depends On:
Blocks:
Reported: 2024-05-20 15:53 UTC by Sonal
Modified: 2025-01-04 04:25 UTC
CC List: 12 users

Fixed In Version: 4.15.6-2
Doc Type: Bug Fix
Doc Text:
Previously, as part of the Postgres upgrade, noobaa-operator configured a new init container to handle the upgrade, and when there was no existing init container, the operator crashed. This was because, in OpenShift Data Foundation 4.15, the Postgres upgrade flow assumed that noobaa-db-pg-0 has an init container. This assumption is wrong for systems that were originally installed on older OpenShift Data Foundation versions. With this fix, reconciliation of noobaa-db handles the case where there is no init container. As a result, the Postgres upgrade starts without the operator crashing.
Clone Of:
Environment:
Last Closed: 2024-09-05 04:53:55 UTC
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Github noobaa noobaa-operator pull 1367 0 None Merged [Direct To 5.15] add InitCotnainers to DB sts if not exist 2024-07-30 13:38:55 UTC
Github noobaa noobaa-operator pull 1411 0 None Merged [Direct to 5.15] Handle empty initContainers array in postgres upgrade flow 2024-08-15 05:51:40 UTC
Red Hat Product Errata RHBA-2024:6397 0 None None None 2024-09-05 04:53:59 UTC

Description Sonal 2024-05-20 15:53:15 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

When upgrading ODF from 4.14 to 4.15, noobaa-operator panics and goes into CrashLoopBackOff (CLBO).

~~~
noobaa-operator-55954db6bf-4lrwv                                 0/1    Running  231       21h
~~~

The operator crashed while changing the DB image and setting the init container script to /init/dumpdb.sh:

less namespaces/openshift-storage/pods/noobaa-operator-55954db6bf-4lrwv/noobaa-operator/noobaa-operator/logs/current.log
~~~
2024-05-16T10:12:41.338813083Z time="2024-05-16T10:12:41Z" level=info msg="UpgradePostgresDB: current phase is Preparing" sys=openshift-storage/noobaa
2024-05-16T10:12:41.338813083Z time="2024-05-16T10:12:41Z" level=info msg="SetEndpointsDeploymentReplicas:: setting endpoints replica count to 0" sys=openshift-storage/noobaa
2024-05-16T10:12:41.339969703Z time="2024-05-16T10:12:41Z" level=info msg="ReconcileObject: Done - unchanged Deployment noobaa-endpoint " sys=openshift-storage/noobaa
2024-05-16T10:12:41.339969703Z time="2024-05-16T10:12:41Z" level=info msg="ReconcileSetDbImageAndInitCode:: changing DB image: registry.redhat.io/rhel8/postgresql-12@sha256:16a0cb66818ab8acb68abf40ac075eadd10a94612067769e055222dd412f0a16 and init contatiners script: /init/dumpdb.sh" sys=openshift-storage/noobaa
2024-05-16T10:12:41.343294392Z panic: runtime error: index out of range [0] with length 0 [recovered]
2024-05-16T10:12:41.343294392Z  panic: runtime error: index out of range [0] with length 0
2024-05-16T10:12:41.343294392Z
2024-05-16T10:12:41.343294392Z goroutine 3003 [running]:
2024-05-16T10:12:41.343294392Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
2024-05-16T10:12:41.343321837Z  /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:116 +0x1fa
2024-05-16T10:12:41.343321837Z panic({0x25b28c0, 0xc001d03728})
2024-05-16T10:12:41.343321837Z  /usr/lib/golang/src/runtime/panic.go:884 +0x213
2024-05-16T10:12:41.343321837Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).ReconcileSetDbImageAndInitCode.func1()
2024-05-16T10:12:41.343334234Z  /remote-source/app/pkg/system/phase2_creating.go:1519 +0x336
2024-05-16T10:12:41.343334234Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).reconcileObjectAndGetResult.func1()
2024-05-16T10:12:41.343334234Z  /remote-source/app/pkg/system/reconciler.go:639 +0x22
2024-05-16T10:12:41.343357633Z sigs.k8s.io/controller-runtime/pkg/controller/controllerutil.mutate(0xc000c9f900?, {{0xc0014cc168?, 0x0?}, {0xc0015119b0?, 0x2d7bb40?}}, {0x2d9a500, 0xc000c9f900})
2024-05-16T10:12:41.343378459Z  /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/controller/controllerutil/controllerutil.go:340 +0x4f
2024-05-16T10:12:41.343388804Z sigs.k8s.io/controller-runtime/pkg/controller/controllerutil.CreateOrUpdate({0x2d7bb40, 0xc000056058}, {0x2d89e40, 0xc000d0ed80}, {0x2d9a500?, 0xc000c9f900}, 0xc0001e4b10?)
2024-05-16T10:12:41.343410134Z  /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/controller/controllerutil/controllerutil.go:212 +0x274
2024-05-16T10:12:41.343421242Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).reconcileObjectAndGetResult(0xc00180a780, {0x2d9a500, 0xc000c9f900}, 0xc001af3550, 0x0)
2024-05-16T10:12:41.343431520Z  /remote-source/app/pkg/system/reconciler.go:636 +0x169
2024-05-16T10:12:41.343442147Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).reconcileObject(...)
2024-05-16T10:12:41.343442147Z  /remote-source/app/pkg/system/reconciler.go:627
2024-05-16T10:12:41.343442147Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).ReconcileObject(...)
2024-05-16T10:12:41.343442147Z  /remote-source/app/pkg/system/reconciler.go:618
2024-05-16T10:12:41.343452838Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).ReconcileSetDbImageAndInitCode(0xc00180a780, {0xc001394150, 0x6e}, {0x27627dd, 0xf}, 0x1)
2024-05-16T10:12:41.343470896Z  /remote-source/app/pkg/system/phase2_creating.go:1515 +0x186
2024-05-16T10:12:41.343481167Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).UpgradePostgresDB(0xc00180a780)
2024-05-16T10:12:41.343481167Z  /remote-source/app/pkg/system/phase2_creating.go:1642 +0xe51
2024-05-16T10:12:41.343491532Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).ReconcilePhaseCreatingForMainClusters(0xc00180a780)
2024-05-16T10:12:41.343491532Z  /remote-source/app/pkg/system/phase2_creating.go:138 +0x465
2024-05-16T10:12:41.343503080Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).ReconcilePhaseCreating(0xc00180a780)
2024-05-16T10:12:41.343503080Z  /remote-source/app/pkg/system/phase2_creating.go:66 +0x1e5
2024-05-16T10:12:41.343513196Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).ReconcilePhases(0x27f95ed?)
2024-05-16T10:12:41.343513196Z  /remote-source/app/pkg/system/reconciler.go:541 +0x47
2024-05-16T10:12:41.343523620Z github.com/noobaa/noobaa-operator/v5/pkg/system.(*Reconciler).Reconcile(0xc00180a780)
2024-05-16T10:12:41.343523620Z  /remote-source/app/pkg/system/reconciler.go:422 +0x33b
2024-05-16T10:12:41.343534342Z github.com/noobaa/noobaa-operator/v5/pkg/controller/noobaa.Add.func1({0xc001323ce0?, 0x40dd8a?}, {{{0xc0014cc168?, 0x30?}, {0xc001f2a316?, 0x2595ae0?}}})
2024-05-16T10:12:41.343564901Z  /remote-source/app/pkg/controller/noobaa/noobaa_controller.go:53 +0xe5
~~~
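
The trace points at ReconcileSetDbImageAndInitCode (phase2_creating.go:1519), and the "index out of range [0] with length 0" message is consistent with reading the first element of an empty initContainers list. The sketch below is a Python model of that pattern, not the operator's Go code: Container, PodSpec, both function names, the "init" container name, and the command layout are assumptions made for illustration. The guarded variant follows the idea stated in the linked PR titles (add the init container if it does not exist).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Container:
    name: str
    image: str = ""
    command: List[str] = field(default_factory=list)


@dataclass
class PodSpec:
    containers: List[Container] = field(default_factory=list)
    init_containers: List[Container] = field(default_factory=list)


def set_db_image_unguarded(spec: PodSpec, image: str, script: str) -> None:
    """Model of the failing pattern: assumes init_containers[0] exists."""
    spec.containers[0].image = image
    # Raises IndexError when the list is empty (Go: index out of range panic).
    spec.init_containers[0].image = image
    spec.init_containers[0].command = ["sh", "-x", script]  # assumed layout


def set_db_image_guarded(spec: PodSpec, image: str, script: str) -> None:
    """Creates the init container first when the spec has none."""
    spec.containers[0].image = image
    if not spec.init_containers:
        # "init" is a placeholder name, not taken from the operator source.
        spec.init_containers.append(Container(name="init"))
    spec.init_containers[0].image = image
    spec.init_containers[0].command = ["sh", "-x", script]  # assumed layout


if __name__ == "__main__":
    # A spec with a DB container but no init container, as on the old cluster.
    legacy = PodSpec(containers=[Container(name="db")])
    try:
        set_db_image_unguarded(legacy, "postgresql-12", "/init/dumpdb.sh")
    except IndexError as err:
        print("unguarded access failed:", err)
    set_db_image_guarded(legacy, "postgresql-12", "/init/dumpdb.sh")
    print("init containers after guarded call:", len(legacy.init_containers))
```

In this model the unguarded call fails on a spec with no init containers, while the guarded call adds one and then proceeds, which matches the behavior described in the Doc Text.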


The operator is running the 4.15 image, but Postgres has not been updated yet. The noobaa-db pod is still running the image below, while noobaa-operator is trying to update the DB image to postgresql-12@sha256:16a0cb66818ab8acb68abf40ac075eadd10a94612067769e055222dd412f0a16:

~~~
    image: registry.redhat.io/rhel8/postgresql-12@sha256:b96be9d3e8512046bae7d5a3e04fa151043eca051416305629b3ffd547370453
~~~

The MCG CSV is in the Installing state:

cat namespaces/openshift-storage/oc_output/csv
~~~
NAME                                    DISPLAY                       VERSION        REPLACES                                PHASE
gitlab-runner-operator.v1.21.0          GitLab Runner                 1.21.0         gitlab-runner-operator.v1.18.1          Succeeded
mcg-operator.v4.15.2-rhodf              NooBaa Operator               4.15.2-rhodf   mcg-operator.v4.14.6-rhodf              Installing
ocs-operator.v4.15.2-rhodf              OpenShift Container Storage   4.15.2-rhodf   ocs-operator.v4.14.6-rhodf              Succeeded
odf-csi-addons-operator.v4.15.2-rhodf   CSI Addons                    4.15.2-rhodf   odf-csi-addons-operator.v4.14.6-rhodf   Succeeded
odf-operator.v4.15.2-rhodf              OpenShift Data Foundation     4.15.2-rhodf   odf-operator.v4.14.6-rhodf              Succeeded
~~~

less namespaces/openshift-storage/operators.coreos.com/clusterserviceversions/mcg-operator.v4.15.2-rhodf.yaml
~~~
  - lastTransitionTime: "2024-05-16T10:06:28Z"
    lastUpdateTime: "2024-05-16T10:06:28Z"
    message: 'installing: waiting for deployment noobaa-operator to become ready:
      deployment "noobaa-operator" not available: Deployment does not have minimum
      availability.'
    phase: Installing
    reason: InstallWaiting
~~~

The init container is missing from the noobaa-db pod:

$ omg get pod noobaa-db-pg-0 -o yaml | grep -i initContainers
$
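
The empty grep output above can also be checked structurally. This is a minimal sketch, assuming the pod manifest is available as JSON (for example from `oc get pod noobaa-db-pg-0 -o json`); init_container_names and is_missing_init_container are hypothetical helper names, and the sample manifest is illustrative rather than taken from the customer cluster.

```python
import json
from typing import List


def init_container_names(pod_json: str) -> List[str]:
    """Return the names of the init containers declared in a pod manifest."""
    pod = json.loads(pod_json)
    # A missing key and an empty or null array are treated the same way.
    init = pod.get("spec", {}).get("initContainers") or []
    return [c.get("name", "") for c in init]


def is_missing_init_container(pod_json: str) -> bool:
    """True when the pod has no init container at all."""
    return not init_container_names(pod_json)


if __name__ == "__main__":
    # Illustrative manifest with a single regular container and no initContainers.
    sample = json.dumps({
        "metadata": {"name": "noobaa-db-pg-0"},
        "spec": {"containers": [{"name": "db"}]},
    })
    print("missing init container:", is_missing_init_container(sample))
```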

From dumpdb.sh, I see that it checks whether the used space of /var/lib/pgsql/data is greater than THRESHOLD (33%):

~~~
cat /init/dumpdb.sh 
set -e
sed -i -e 's/^\(postgres:[^:]\):[0-9]*:[0-9]*:/\1:10001:0:/' /etc/passwd
su postgres -c "bash -x /usr/bin/run-postgresql" &
THRESHOLD=33
USE=$(df -h --output=pcent "/$HOME/data" | tail -n 1 | tr -d '[:space:]%')
# Check if the used space is more than the threshold
if [ "$USE" -gt "$THRESHOLD" ]; then
  echo "Warning: Free space $USE% is above $THRESHOLD% threshold. Can't start upgrade!"
  exit 1
fi
echo "Info: Free space $USE% is below $THRESHOLD% threshold. Starting upgrade!"
until pg_isready; do sleep 1; done;
  pg_dumpall -U postgres > /$HOME/data/dump.sql
exit 0
~~~


/var/lib/pgsql is at only 2% use, which is below the threshold:

~~~
$ oc rsh noobaa-db-pg-0
sh-4.4$ df -h /var/lib/pgsql/data
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd0        49G  776M   49G   2% /var/lib/pgsql
~~~
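
To make the comparison explicit, the space check in dumpdb.sh reduces to parsing the percentage from `df --output=pcent` and testing it against THRESHOLD with a strict greater-than. The sketch below models only that decision; parse_use_percent and upgrade_allowed are hypothetical names, and the sample input assumes the usual two-line output (a "Use%" header followed by the value).

```python
THRESHOLD = 33  # same value as THRESHOLD in dumpdb.sh


def parse_use_percent(df_output: str) -> int:
    """Take the last line of `df --output=pcent` output and strip spaces and '%'."""
    last_line = df_output.strip().splitlines()[-1]
    return int(last_line.strip().rstrip("%"))


def upgrade_allowed(use_percent: int, threshold: int = THRESHOLD) -> bool:
    """The script exits with an error only when usage is strictly above the threshold."""
    return use_percent <= threshold


if __name__ == "__main__":
    use = parse_use_percent("Use%\n  2%\n")
    print("use percent:", use, "upgrade allowed:", upgrade_allowed(use))
```

With the 2% usage reported on this cluster, the check passes, so the space threshold does not explain the failure.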

Version of all relevant components (if applicable):
ODF 4.15.2

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

This is causing the noobaa service to be down.
It is also preventing the customer from upgrading ODF to 4.15 in other clusters.


Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes, in the customer environment

Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:
The upgrade from 4.14 to 4.15 worked fine in another cluster. The customer suspects this is a regression that only affects "old" ODF clusters. This cluster was first installed with version 4.4.


Steps to Reproduce:
1. Install ODF < 4.14
2. Upgrade ODF from 4.14 to 4.15


Actual results:
noobaa-operator panics on upgrade

Expected results:
ODF upgrade should complete smoothly

Additional info:

Comment 9 kelwhite 2024-05-24 16:31:22 UTC
Customer ran the steps in c#7 and resolved their issue:

~~~~~~
We just added the initContainer to the noobaa-db-pg StatefulSet and that made the noobaa-operator happy again. 
The upgrade of noobaa then started again and finished successfully.
~~~~~~

Comment 11 Liran Mauda 2024-05-27 13:02:50 UTC
*** Bug 2281604 has been marked as a duplicate of this bug. ***

Comment 15 Sonal 2024-06-13 14:27:10 UTC
Hi Nimrod,

The customer is waiting for the fix. Can you please prioritize this bug? 

Regards,
Sonal Arora

Comment 29 Sunil Kumar Acharya 2024-07-25 12:36:27 UTC
Please backport the fix to ODF-4.15 and update the RDT flag/text appropriately.

Comment 46 errata-xmlrpc 2024-09-05 04:53:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.15.6 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:6397

Comment 47 Red Hat Bugzilla 2025-01-04 04:25:06 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

