Bug 1995271 - [GSS] noobaa-db-pg-0 pod gets stuck in CrashLoopBackOff state when enabling hugepages
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.9.0
Assignee: Alexander Indenbaum
QA Contact: Petr Balogh
URL:
Whiteboard:
Depends On:
Blocks: 2001933 2001935 2006036 2011326
 
Reported: 2021-08-18 17:27 UTC by khover
Modified: 2023-12-08 04:25 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.`CrashLoopBackOff` state of `noobaa-db-pg-0` pod when enabling `hugepages`
Previously, enabling `hugepages` on an OpenShift Container Platform cluster caused the Multicloud Object Gateway (MCG) database pod to go into a `CrashLoopBackOff` state. This was due to incorrect initialization of PostgreSQL. With this release, the MCG database pod's initialization of PostgreSQL is fixed.
Clone Of:
Clones: 2001933 2001935 2006036
Environment:
Last Closed: 2021-12-13 17:45:28 UTC
Embargoed:




Links
System ID Status Summary Last Updated
Github noobaa/noobaa-operator pull 715    Merged   initdb huge_pages                                            2021-11-02 10:42:35 UTC
Github noobaa/noobaa-operator pull 740    Open     Backport to 5.9 initdb huge_pages                            2021-10-04 07:31:39 UTC
Github red-hat-storage/ocs-ci pull 5472   Merged   Applies huge pages and verifies cluster pods are running    2022-03-03 07:47:00 UTC
Red Hat Product Errata RHSA-2021:5086                                                                           2021-12-13 17:46:09 UTC
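
The Doc Text above and the PR titles ("initdb huge_pages") point at PostgreSQL initialization. Purely as an illustration of the underlying idea, and not the actual operator change, explicitly disabling huge pages for a PostgreSQL data directory looks roughly like this (the paths are assumptions):

# Illustration only: initialize the data directory, then tell PostgreSQL not to request huge pages.
initdb -D /var/lib/pgsql/data
echo "huge_pages = off" >> /var/lib/pgsql/data/postgresql.conf
pg_ctl -D /var/lib/pgsql/data -w start

PostgreSQL's huge_pages setting defaults to "try"; forcing it to "off" keeps the server from depending on huge pages the pod may not be entitled to use, which is consistent with the CrashLoopBackOff described here.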

Comment 11 khover 2021-09-06 15:04:06 UTC
@bkunal

As discussed in the meeting, I asked the customer if they can wait to upgrade to 4.9.

If not, a backport fix is needed for 4.7.5 and 4.8.3.

Comment 12 khover 2021-09-13 12:56:13 UTC
@bkunal

The customer stated that no backport fix is needed for 4.7.5 and 4.8.3.

It can wait until the 4.9 release.

Comment 18 Petr Balogh 2021-10-26 14:02:21 UTC
Verification of enabling huge pages after OCP deployment and before ODF deployment is here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier1/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-tier1/38/

Here I scheduled a job that will do a regular deployment and pause before tier execution:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-tier1/39/

Once it's paused, we can enable huge pages and continue the run.

Comment 19 Petr Balogh 2021-10-26 17:31:36 UTC
I enabled huge pages on a cluster which already had ODF installed.

Enabled by applying this file:
oc apply -f https://raw.githubusercontent.com/red-hat-storage/ocs-ci/master/ocs_ci/templates/ocp-deployment/huge_pages.yaml
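
For reference, boot-time huge pages on OCP nodes are typically enabled through a MachineConfig that adds kernel arguments to a pool. A minimal sketch of such a manifest (not necessarily the contents of the linked huge_pages.yaml; the page size and count are assumptions):

cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 50-worker-hugepages
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - hugepagesz=2M     # huge page size (assumed)
    - hugepages=1024    # pages reserved per node (assumed)
EOF

Applying a MachineConfig like this is what triggers the rolling reboot of the worker pool shown below.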

Then I waited for the nodes to restart:
pbalogh@pbalogh-mac hugepages $ oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-5285892ff5c4de19c01780772d80a409   True      False      False      3              3                   3                     0                      176m
worker   rendered-worker-272f125863faaffd629cf8e12356da2e   False     True       False      3              2                   2                     0                      176m
pbalogh@pbalogh-mac hugepages $ oc get node
NAME                                         STATUS                     ROLES    AGE    VERSION
ip-10-0-129-165.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   169m   v1.22.0-rc.0+a44d0f0
ip-10-0-147-46.us-east-2.compute.internal    Ready                      master   176m   v1.22.0-rc.0+a44d0f0
ip-10-0-174-6.us-east-2.compute.internal     Ready                      worker   170m   v1.22.0-rc.0+a44d0f0
ip-10-0-179-40.us-east-2.compute.internal    Ready                      master   177m   v1.22.0-rc.0+a44d0f0
ip-10-0-206-103.us-east-2.compute.internal   Ready                      worker   170m   v1.22.0-rc.0+a44d0f0
ip-10-0-219-185.us-east-2.compute.internal   Ready                      master   176m   v1.22.0-rc.0+a44d0f0
pbalogh@pbalogh-mac hugepages $ oc get node
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-129-165.us-east-2.compute.internal   Ready    worker   169m   v1.22.0-rc.0+a44d0f0
ip-10-0-147-46.us-east-2.compute.internal    Ready    master   176m   v1.22.0-rc.0+a44d0f0
ip-10-0-174-6.us-east-2.compute.internal     Ready    worker   170m   v1.22.0-rc.0+a44d0f0
ip-10-0-179-40.us-east-2.compute.internal    Ready    master   177m   v1.22.0-rc.0+a44d0f0
ip-10-0-206-103.us-east-2.compute.internal   Ready    worker   170m   v1.22.0-rc.0+a44d0f0
ip-10-0-219-185.us-east-2.compute.internal   Ready    master   177m   v1.22.0-rc.0+a44d0f0
pbalogh@pbalogh-mac hugepages $ oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-5285892ff5c4de19c01780772d80a409   True      False      False      3              3                   3                     0                      176m
worker   rendered-worker-7b55731a49f9190cf8b37fcc7e23ca77   True      False      False      3              3                   3                     0                      176m
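
The same wait can be scripted instead of re-running oc get machineconfigpool by hand; a sketch (the timeouts are assumptions):

# Wait for both pools to finish rolling out the new rendered config, then for all nodes to be Ready.
oc wait machineconfigpool/worker --for=condition=Updated=True --timeout=30m
oc wait machineconfigpool/master --for=condition=Updated=True --timeout=30m
oc wait node --all --for=condition=Ready --timeout=15m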

Checking the status of the cluster:

pbalogh@pbalogh-mac hugepages $ oc rsh -n openshift-storage rook-ceph-tools-57b9b69bc5-765r6
sh-4.4$ ceph status
  cluster:
    id:     7e48a5d1-14df-49fa-aee1-408264005a2f
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 101s)
    mgr: a(active, since 5m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 99s), 3 in (since 2h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 522 objects, 1.3 GiB
    usage:   3.0 GiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     97 active+clean

  io:
    client:   853 B/s rd, 5.7 KiB/s wr, 1 op/s rd, 0 op/s wr

sh-4.4$ ^C
sh-4.4$ exit
command terminated with exit code 130
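
The same check can be run non-interactively against the tools pod used above, which is handier in automation (a sketch using the pod name from this cluster):

oc exec -n openshift-storage rook-ceph-tools-57b9b69bc5-765r6 -- ceph health detail
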
pbalogh@pbalogh-mac hugepages $ oc get noobaa -n openshift-storage
NAME     MGMT-ENDPOINTS                   S3-ENDPOINTS                     IMAGE                                                                                                 PHASE   AGE
noobaa   ["https://10.0.206.103:32651"]   ["https://10.0.206.103:30801"]   quay.io/rhceph-dev/mcg-core@sha256:ff043dde04a8b83f10be1a2437c88b3cfd0c7e691868ed418b191a02fb8129c8   Ready   142m
pbalogh@pbalogh-mac hugepages $ oc get pod -n openshift-storage -o wide
NAME                                                              READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
csi-cephfsplugin-89zzv                                            3/3     Running   3          152m    10.0.129.165   ip-10-0-129-165.us-east-2.compute.internal   <none>           <none>
csi-cephfsplugin-hpvrh                                            3/3     Running   3          152m    10.0.174.6     ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
csi-cephfsplugin-provisioner-f5485c88c-599c4                      6/6     Running   0          9m20s   10.131.0.9     ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
csi-cephfsplugin-provisioner-f5485c88c-gfjfv                      6/6     Running   0          6m29s   10.128.2.21    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
csi-cephfsplugin-tvx2x                                            3/3     Running   3          152m    10.0.206.103   ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
csi-rbdplugin-fk5zw                                               3/3     Running   3          152m    10.0.174.6     ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
csi-rbdplugin-kfdb8                                               3/3     Running   3          152m    10.0.206.103   ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
csi-rbdplugin-provisioner-65d9bf8587-2hjrf                        6/6     Running   0          9m19s   10.131.0.14    ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
csi-rbdplugin-provisioner-65d9bf8587-4g6ck                        6/6     Running   0          6m27s   10.128.2.25    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
csi-rbdplugin-z2xbb                                               3/3     Running   3          152m    10.0.129.165   ip-10-0-129-165.us-east-2.compute.internal   <none>           <none>
noobaa-core-0                                                     1/1     Running   0          8m49s   10.131.0.21    ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
noobaa-db-pg-0                                                    1/1     Running   0          9m2s    10.131.0.24    ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
noobaa-endpoint-7bb45cc5c8-xgpkn                                  1/1     Running   0          9m20s   10.131.0.13    ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
noobaa-operator-5b96d4cf64-fw8g6                                  1/1     Running   0          6m31s   10.128.2.7     ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
ocs-metrics-exporter-c894f4fd5-5q6sv                              1/1     Running   0          6m30s   10.128.2.17    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
ocs-operator-5454cb86b7-9s5sg                                     1/1     Running   0          9m20s   10.131.0.12    ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
odf-console-77dc4875d4-82gl6                                      1/1     Running   0          9m21s   10.131.0.8     ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
odf-operator-controller-manager-568f657687-qlttw                  2/2     Running   0          6m28s   10.128.2.23    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
rook-ceph-crashcollector-32e003117c14a6a8adbfda64bd1f34bd-sq2kc   1/1     Running   0          9m28s   10.131.0.6     ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
rook-ceph-crashcollector-a20d1188be8da66ce800d2f1ff2d5c6c-2lblx   1/1     Running   0          6m38s   10.128.2.6     ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
rook-ceph-crashcollector-f4797d902b7bc0a6bbc07b7c6fa6f896-k5ptn   1/1     Running   0          2m40s   10.129.2.6     ip-10-0-129-165.us-east-2.compute.internal   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7c8577f5qg5lt   2/2     Running   0          6m29s   10.128.2.19    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7b4fb48fgnrz7   2/2     Running   0          9m20s   10.131.0.15    ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
rook-ceph-mgr-a-9bd978474-nrkxw                                   2/2     Running   0          6m30s   10.128.2.16    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
rook-ceph-mon-a-64986d5cfc-z569s                                  2/2     Running   0          11m     10.131.0.19    ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-b-6468dc46cf-pb7wh                                  2/2     Running   0          8m53s   10.128.2.26    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
rook-ceph-mon-c-c87988669-xt9z6                                   2/2     Running   0          5m51s   10.129.2.9     ip-10-0-129-165.us-east-2.compute.internal   <none>           <none>
rook-ceph-operator-86bc97678-46vcd                                1/1     Running   0          6m28s   10.128.2.24    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
rook-ceph-osd-0-d57ccfbcb-grk8m                                   2/2     Running   0          3m56s   10.129.2.8     ip-10-0-129-165.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-1-84768d9c6f-p89tg                                  2/2     Running   0          11m     10.131.0.18    ip-10-0-206-103.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-2-5c4c88b9bd-cr2lv                                  2/2     Running   0          7m54s   10.128.2.27    ip-10-0-174-6.us-east-2.compute.internal     <none>           <none>
rook-ceph-tools-57b9b69bc5-765r6                                  1/1     Running   0          6m31s   10.0.174.6     ip-10-0-174-6.us-east-2.compute.internal     <none>           <none> 
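
Since this bug is specifically about noobaa-db-pg-0 crash-looping, a few targeted spot checks on that pod (standard oc usage; a sketch):

oc get pod noobaa-db-pg-0 -n openshift-storage                    # expect Running with 0 restarts
oc describe pod noobaa-db-pg-0 -n openshift-storage | grep -i -A2 'last state'
oc logs noobaa-db-pg-0 -n openshift-storage --previous 2>/dev/null || echo 'no previously crashed container'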

$ oc get csv -n openshift-storage
NAME                     DISPLAY                       VERSION   REPLACES   PHASE
noobaa-operator.v4.9.0   NooBaa Operator               4.9.0                Succeeded
ocs-operator.v4.9.0      OpenShift Container Storage   4.9.0                Succeeded
odf-operator.v4.9.0      OpenShift Data Foundation     4.9.0                Succeeded

So for one of the scenarios, so far so good.
Let's see how the tier1 results will look here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-tier1/39/

Comment 20 Petr Balogh 2021-10-27 14:00:16 UTC
The first job failed in deployment, so it got re-triggered here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-tier1/40/

The second scenario described above also finished tier1, and the results looked OK aside from the known issue with the GCP MCG tests.

Comment 25 errata-xmlrpc 2021-12-13 17:45:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086

Comment 26 Red Hat Bugzilla 2023-12-08 04:25:58 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

