Bug 1946792

Summary:	noobaa-db-pg-0 Pod gets stuck in CrashLoopBackOff state when enabling hugepages
Product:	[Red Hat Storage] Red Hat OpenShift Container Storage	Reporter:	Denis Ollier <dollierp>
Component:	Multi-Cloud Object Gateway	Assignee:	Nobody <nobody>
Status:	VERIFIED ---	QA Contact:	Petr Balogh <pbalogh>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.8	CC:	achernet, aindenba, ddemoiti, ebenahar, edonnell, fbalak, fdeutsch, jsco, khover, kjosy, muagarwa, nberry, pbalogh, tdesala
Target Milestone:	---	Keywords:	AutomationBackLog
Target Release:	OCS 4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	v4.8.0-424.ci	Doc Type:	Known Issue
Doc Text:	Previously, the Multicloud Object Gateway (MCG) database pod crashed because PostgreSQL failed to run on Kubernetes when hugepages were enabled. With this update, hugepages are disabled for the MCG PostgreSQL pods, so the MCG database pods no longer crash.	Story Points:	---
Clone Of:		Environment:
Last Closed:		Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1938134, 1968438

Description Denis Ollier 2021-04-06 20:56:45 UTC

Disclaimer
----------

Since I don't understand how the noobaa-db-pg-0 Pod failure and enabling hugepages can be related, I hesitated to open this issue.

However, I spent a long time enabling and disabling hugepages multiple times, and I observed the noobaa-db-pg-0 Pod crashing each time I enabled hugepages and recovering each time I disabled them.

Description of problem
----------------------

When enabling hugepages on my OCP cluster, the noobaa-db-pg-0 Pod gets stuck in CrashLoopBackOff state.


Version of all relevant components
----------------------------------

OCP: 4.8 nightly (also reproduced with 4.7)
OCS: v4.8.0-303.ci (also reproduced with 4.7)


Does this issue impact your ability to continue to work with the product
------------------------------------------------------------------------

Yes, for instance, OpenShift Virtualization workloads can't be used with both OCS and hugepages enabled.


Is there any workaround available to the best of your knowledge
---------------------------------------------------------------

No, the only "solution" I found is to disable hugepages.


Rate from 1 - 5 the complexity of the scenario you performed that caused this bug
---------------------------------------------------------------------------------

1 - Reproduced easily by creating a single Tuned Resource.


Is this issue reproducible
--------------------------

Yes.


Can this issue reproduce from the UI
------------------------------------

Yes since it's not related to the UI.


Steps to Reproduce
------------------

1. Enable hugepages on an OCP cluster with the following Tuned Resource:

> kind: Tuned
> apiVersion: tuned.openshift.io/v1
> metadata:
>   name: hugepages
>   namespace: openshift-cluster-node-tuning-operator
> spec:
>   profile:
>     - name: openshift-node-hugepages
>       data: |
>         [main]
>         summary=Boot time configuration for hugepages
>         include=openshift-node
>         [bootloader]
>         # Allocate 32x2Mi pages and 1x1Gi pages
>         cmdline_openshift_node_hugepages=default_hugepagesz=2M hugepages=32 hugepagesz=1G hugepages=1
>   recommend:
>     - machineConfigLabels:
>         machineconfiguration.openshift.io/role: "worker"
>       priority: 25
>       profile: openshift-node-hugepages

Note: the issue also occurs when enabling hugepages via the Performance Addon Operator (https://docs.openshift.com/container-platform/latest/scalability_and_performance/cnf-performance-addon-operator-for-low-latency-nodes.html#cnf-allocating-multiple-huge-page-sizes_cnf-master).

2. Wait for the MachineConfigOperator to reboot each node to apply the changes.

3. If OCS was not deployed on the OCP cluster already, deploy it.

Note: the issue also occurs if OCS is already deployed when enabling hugepages.

4. Look at noobaa-db-pg-0 Pod status.
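Step 4 can be scripted. The following is only a sketch: `pod_status` is a hypothetical helper (not part of `oc`) that reads the tabular output of `oc get pod` on stdin and prints the STATUS column for one Pod.

```shell
# pod_status NAME: print the STATUS column for Pod NAME.
# Expects `oc get pod` table output on stdin
# (columns: NAME READY STATUS RESTARTS AGE).
pod_status() {
    awk -v pod="$1" '$1 == pod { print $3 }'
}

# Example (needs a logged-in cluster, so it is left commented out):
#   oc -n openshift-storage get pod | pod_status noobaa-db-pg-0
```

On an affected cluster this is expected to print CrashLoopBackOff for noobaa-db-pg-0.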


Actual results
--------------

noobaa-db-pg-0 Pod gets stuck in CrashLoopBackOff state.


Expected results
----------------

noobaa-db-pg-0 Pod should start properly with hugepages enabled.


Additional info
---------------

When disabling hugepages by removing the Tuned Resource, the noobaa-db-pg-0 Pod returns to the Running state after some time.

Logs from the noobaa-db-pg-0 Pod do not provide useful information:

> oc -n openshift-storage logs noobaa-db-pg-0 -c init
> 
> uid change has been identified - will change from uid: 0 to new uid: 10001
> setting permissions of /var/lib/pgsql/lost+found for user 10001
> changed permissions of /var/lib/pgsql/lost+found successfully
> setting permissions of /var/lib/pgsql for user 10001
> changed permissions of /var/lib/pgsql successfully
> 
> real  0m0.005s
> user  0m0.003s
> sys   0m0.002s

> oc -n openshift-storage logs noobaa-db-pg-0 -c db
> 
> pg_ctl: another server might be running; trying to start server anyway
> waiting for server to start....2021-04-06 17:00:28.142 UTC [22] LOG:  starting PostgreSQL 12.5 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
> 2021-04-06 17:00:28.143 UTC [22] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
> 2021-04-06 17:00:28.151 UTC [22] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
>  stopped waiting
> pg_ctl: could not start server
> Examine the log output.

Comment 21 Petr Balogh 2021-06-28 10:07:17 UTC

Triggered verification jobs:

4.8 deployment verification
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Deployment/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-deployment/3/console

Upgrade from 4.7.2 internal build to 4.8 internal build scheduled here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Upgrade-OCS/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto/36/

Comment 22 Denis Ollier 2021-07-12 17:03:09 UTC

Hi,

I managed to verify the fix on my envs with OCS-4.8 and OCS-4.9 development versions.

Thanks!

Comment 23 Petr Balogh 2021-07-13 13:41:19 UTC

The deployment of 4.8 was verified via the link in my last comment.

The upgrade was blocked by another BZ.

Retrying the upgrade here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Upgrade-OCS/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto-acceptance/1/

Comment 25 Petr Balogh 2021-07-14 18:39:47 UTC

Based on this execution:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Upgrade-OCS/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto-acceptance/1/testReport/
where both the upgrade and the acceptance suite passed, I am moving this bug to VERIFIED.

Comment 26 Jimmy Scott 2021-07-28 11:50:31 UTC

Hi all, I installed OCP yesterday, which went fine, but creating a StorageCluster failed, and I traced the failure back to this issue.

After some debugging I found out that initdb does not use the workaround from /opt/app-root/src/postgresql-cfg/noobaa-postgres.conf.
Instead, initdb copies a template from somewhere and uses that to start up postgres during initdb (strace -f -e openat,execve).

After disabling hugepages on the hosts, initdb was successful and we managed to create the StorageCluster.
Then we enabled hugepages again, and postgres is now working because after initdb the config changes from noobaa-postgres.conf are in effect.
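One way to check which configuration is in effect is to look at the `huge_pages` parameter, which PostgreSQL accepts as `try` (the default), `on`, or `off`. The sketch below assumes, without confirmation from this report, that the workaround in noobaa-postgres.conf is a `huge_pages = off` line. `huge_pages_setting` is a hypothetical helper that prints the last uncommented value in a given file, or `try` when the parameter is absent.

```shell
# huge_pages_setting FILE: print the effective huge_pages value found in a
# postgresql.conf-style file. The last uncommented assignment wins; when the
# parameter is absent, PostgreSQL's built-in default ("try") is printed.
huge_pages_setting() {
    awk -v q="'" '
        /^[ \t]*huge_pages[ \t=]/ {
            line = $0
            sub(/#.*/, "", line)                           # drop trailing comment
            sub(/^[ \t]*huge_pages[ \t]*=?[ \t]*/, "", line) # drop the key and "="
            gsub("[ \t" q "\"]", "", line)                 # strip blanks and quotes
            if (line != "") value = line
        }
        END { print (value == "" ? "try" : value) }
    ' "$1"
}

# Example, run against a local copy of the file (the copy step is up to you):
#   huge_pages_setting ./noobaa-postgres.conf
```

Running the helper against both the noobaa-postgres.conf file and the configuration that initdb generates would show whether the two differ on this parameter.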

My error logs were however a bit different (see below).
Using image: registry.redhat.io/rhel8/postgresql-12@sha256:f486bbe07f1ddef166bab5a2a6bdcd0e63e6e14d15b42d2425762f83627747bf

### oc -n openshift-storage logs noobaa-db-pg-0 #######################################################################

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... sh: line 1:    22 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=100 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    30 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=50 -c shared_buffers=500 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    33 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=40 -c shared_buffers=400 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    35 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=30 -c shared_buffers=300 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    37 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
20
selecting default shared_buffers ... sh: line 1:    39 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=16384 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    41 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=8192 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    43 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=4096 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    45 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=3584 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    47 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=3072 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    49 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=2560 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    51 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=2048 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    53 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=1536 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    55 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    57 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=900 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    59 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=800 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    61 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=700 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    63 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=600 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    65 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=500 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    67 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=400 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    69 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=300 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    71 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    73 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=100 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    75 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=50 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
400kB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... child process was terminated by signal 7: Bus error
initdb: removing contents of data directory "/var/lib/pgsql/data/userdata"
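Signal 7 is SIGBUS on Linux x86_64, so every "Bus error" line above is a separate postgres process killed during initdb's probing. As a rough triage aid, a log can be scanned for this signature. `count_bus_errors` is a hypothetical helper name, not an existing tool.

```shell
# count_bus_errors: read a Pod log on stdin and print how many lines
# report a "Bus error". Prints 0 (and still succeeds) when there are none.
count_bus_errors() {
    grep -c 'Bus error' || true
}

# Example (needs cluster access, so it is left commented out):
#   oc -n openshift-storage logs noobaa-db-pg-0 | count_bus_errors
```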

Comment 27 Petr Balogh 2021-08-02 12:59:24 UTC

Hello Jimmy,

I am trying another deployment with hugepages enabled here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Deployment/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-deployment/4/

With the latest 4.8 build, 4.8.0-175.ci.

What build have you tried?

Let's see how my deployment goes.

Petr

Comment 28 Jimmy Scott 2021-08-02 15:57:49 UTC

Hi Petr,

Thanks for your quick response!

We're using latest stable-4.7 version:
    Provider
        BareMetal
    OpenShift version
        4.7.21
    Update channel
        stable-4.7

Kind regards,
Jimmy Scott

Comment 29 Petr Balogh 2021-08-02 16:24:21 UTC

Hey Jimmy,

I see that my 4.8 deployment passed:

$ oc get node
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-138-189.us-east-2.compute.internal   Ready    worker   140m   v1.21.1+38b3ecc
ip-10-0-159-209.us-east-2.compute.internal   Ready    master   148m   v1.21.1+38b3ecc
ip-10-0-178-116.us-east-2.compute.internal   Ready    worker   140m   v1.21.1+38b3ecc
ip-10-0-186-127.us-east-2.compute.internal   Ready    master   148m   v1.21.1+38b3ecc
ip-10-0-197-178.us-east-2.compute.internal   Ready    master   148m   v1.21.1+38b3ecc
ip-10-0-223-61.us-east-2.compute.internal    Ready    worker   140m   v1.21.1+38b3ecc
$ oc debug node/ip-10-0-223-61.us-east-2.compute.internal
Creating debug namespace/openshift-debug-node-lsz4q ...
Starting pod/ip-10-0-223-61us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.223.61
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host

sh-4.4# grep -i HugePages /proc/meminfo
AnonHugePages:   1544192 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:      32
HugePages_Free:       32
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
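For reference, the memory preallocated for huge pages in this output is HugePages_Total x Hugepagesize = 32 x 2048 kB = 65536 kB (64 MiB). The HugePages_* counters in /proc/meminfo cover only the default huge page size. A small sketch of the same calculation follows; `hugepages_kb` is a hypothetical helper that reads /proc/meminfo-style text on stdin.

```shell
# hugepages_kb: read /proc/meminfo-style text on stdin and print the memory
# reserved for default-size huge pages in kB (HugePages_Total * Hugepagesize).
hugepages_kb() {
    awk '
        $1 == "HugePages_Total:" { total = $2 }
        $1 == "Hugepagesize:"    { size_kb = $2 }
        END { print total * size_kb }
    '
}

# Example on a node (inside `chroot /host`):
#   grep -i hugepages /proc/meminfo | hugepages_kb
```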

I see that OCS is running well:
$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.8.0-175.ci   OpenShift Container Storage   4.8.0-175.ci              Succeeded
$ oc get pod -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-c8mcj                                            3/3     Running     0          108m
csi-cephfsplugin-lcrn9                                            3/3     Running     0          108m
csi-cephfsplugin-provisioner-5dd599f584-fnrjs                     6/6     Running     0          108m
csi-cephfsplugin-provisioner-5dd599f584-zjwft                     6/6     Running     0          108m
csi-cephfsplugin-w49sg                                            3/3     Running     0          108m
csi-rbdplugin-gq7t2                                               3/3     Running     0          108m
csi-rbdplugin-l7j8m                                               3/3     Running     0          108m
csi-rbdplugin-provisioner-85b4b68989-qdrnd                        6/6     Running     0          108m
csi-rbdplugin-provisioner-85b4b68989-rl9mf                        6/6     Running     0          108m
csi-rbdplugin-wth74                                               3/3     Running     0          108m
noobaa-core-0                                                     1/1     Running     0          105m
noobaa-db-pg-0                                                    1/1     Running     0          105m
noobaa-endpoint-69f747b466-4hdcb                                  1/1     Running     0          103m
noobaa-operator-5949d9576f-fx7bw                                  1/1     Running     0          108m
ocs-metrics-exporter-69896b547b-kt62k                             1/1     Running     0          108m
ocs-operator-59d47555b5-qn4mv                                     1/1     Running     0          108m
rook-ceph-crashcollector-ip-10-0-138-189-5d7d75bf97-2bpnh         1/1     Running     0          106m
rook-ceph-crashcollector-ip-10-0-178-116-55bdf67865-jg2gt         1/1     Running     0          106m
rook-ceph-crashcollector-ip-10-0-223-61-75b85947f5-lflcp          1/1     Running     0          105m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-d66b8f7cbrfpr   2/2     Running     0          105m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5f68d596l9mlx   2/2     Running     0          105m
rook-ceph-mgr-a-745c8c4b48-wxwvt                                  2/2     Running     0          106m
rook-ceph-mon-a-544d55cc56-ctct9                                  2/2     Running     0          107m
rook-ceph-mon-b-854f6db444-8p5t8                                  2/2     Running     0          107m
rook-ceph-mon-c-5df87545cb-49z2g                                  2/2     Running     0          106m
rook-ceph-operator-7d7cf8b6b4-p2qlk                               1/1     Running     0          108m
rook-ceph-osd-0-cf77f9677-r4rvh                                   2/2     Running     0          105m
rook-ceph-osd-1-6dd4d6df95-k5m2s                                  2/2     Running     0          105m
rook-ceph-osd-2-6ffdbfd78-sgv8g                                   2/2     Running     0          105m
rook-ceph-osd-prepare-ocs-deviceset-0-data-04zc85-8r97j           0/1     Completed   0          106m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0q2w4t-pq285           0/1     Completed   0          106m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0cl5s6-t4gvl           0/1     Completed   0          106m
rook-ceph-tools-bd9b4677b-k9kw7                                   1/1     Running     0          103m


I now see that you are on version 4.7, but this bug is for 4.8. Could you please update the 4.7 bug instead?

Can you please check in the OCS CSV whether you are using 4.7.2?

https://bugzilla.redhat.com/show_bug.cgi?id=1968438

This is the 4.7 bug.

Comment 30 Petr Balogh 2021-08-02 16:26:41 UTC

BTW, I am trying the 4.7 GA version here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-deployment/5/

I am testing this on AWS, not on bare metal as you are.

Petr

Comment 31 khover 2021-08-02 18:45:17 UTC

VZw, using hugepages, is hitting this bug (case # 02998338).

[corona.mec.ouroath.com ~]$ oc logs noobaa-db-pg-0
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... sh: line 1:    22 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=100 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    24 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=50 -c shared_buffers=500 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    26 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=40 -c shared_buffers=400 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    28 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=30 -c shared_buffers=300 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    30 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
20
selecting default shared_buffers ... sh: line 1:    32 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=16384 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    34 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=8192 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    36 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=4096 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    38 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=3584 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    40 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=3072 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    42 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=2560 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    44 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=2048 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    46 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=1536 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    48 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    50 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=900 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    52 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=800 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    54 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=700 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    56 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=600 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    58 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=500 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    60 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=400 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    62 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=300 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    64 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    66 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=100 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    68 Bus error               (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=50 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
400kB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... child process was terminated by signal 7: Bus error
initdb: removing contents of data directory "/var/lib/pgsql/data/userdata"

Comment 32 Petr Balogh 2021-08-02 19:34:33 UTC

The 4.7.2 deployment passed for me as well with hugepages enabled.

Comment 33 khover 2021-08-02 20:45:10 UTC

(In reply to khover from comment #31)
> VZw using hugepages hitting this bug case # 02998338

OCS Version is : 4.7.2

Comment 34 Alexander Indenbaum 2021-08-03 17:09:09 UTC

(In reply to khover from comment #33)
> (In reply to khover from comment #31)
> > VZw using hugepages hitting this bug case # 02998338
> OCS Version is : 4.7.2

Hello @khover,

Looking at noobaa-operator-6874fd5d96-pcxcd-noobaa-operator.log, part of noobaa_diagnostics_1627661740.tar.gz, mustgather-ocs.tar.gz attached at case/02998338:

> time="2021-07-28T04:03:38Z" level=info msg="CLI version: 5.7.0\n"
> time="2021-07-28T04:03:38Z" level=info msg="noobaa-image: noobaa/noobaa-core:5.7.0\n"
> time="2021-07-28T04:03:38Z" level=info msg="operator-image: noobaa/noobaa-operator:5.7.0\n"

Version 5.7.0 is expected to fail. Please try NooBaa >= 5.7.2

Comment 35 khover 2021-08-03 19:45:21 UTC

(In reply to Alexander Indenbaum from comment #34)
> (In reply to khover from comment #33)
> > (In reply to khover from comment #31)
> > > VZw using hugepages hitting this bug case # 02998338
> > OCS Version is : 4.7.2
> 
> Hello @khover,
> 
> Looking at noobaa-operator-6874fd5d96-pcxcd-noobaa-operator.log part of
> noobaa_diagnostics_1627661740.tar.gz, mustgather-ocs.tar.gz attached at
> case/02998338
> 
> > time="2021-07-28T04:03:38Z" level=info msg="CLI version: 5.7.0\n"
> > time="2021-07-28T04:03:38Z" level=info msg="noobaa-image: noobaa/noobaa-core:5.7.0\n"
> > time="2021-07-28T04:03:38Z" level=info msg="operator-image: noobaa/noobaa-operator:5.7.0\n"
> 
> Version 5.7.0 is expected to fail. Please try NooBaa >= 5.7.2

Hello @aindenba

How do I get NooBaa from version 5.7.0 to >= 5.7.2 in the current version of OCS?

I don't think the noobaa operator was manually installed.

Comment 36 Petr Balogh 2021-08-04 07:51:53 UTC

In my 4.7.2 I have the same:
time="2021-08-02T17:42:59Z" level=info msg="CLI version: 5.7.0\n"
time="2021-08-02T17:42:59Z" level=info msg="noobaa-image: noobaa/noobaa-core:5.7.0\n"
time="2021-08-02T17:42:59Z" level=info msg="operator-image: noobaa/noobaa-operator:5.7.0\n"

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j005ai3c33-d/j005ai3c33-d_20210802T162953/logs/deployment_1627922635/ocs_must_gather/registry-redhat-io-ocs4-ocs-must-gather-rhel8-sha256-1949179411885858ec719ab052868c734b98b49787498a8297f1a4ace0283eae/noobaa/logs/openshift-storage/noobaa-operator-66d84fc498-qkvsw.log

This was fixed in 4.7.2 as well as part of this BZ.


What about opening a new bug for the issue you see?

Comment 37 Alexander Indenbaum 2021-08-04 12:16:51 UTC

The version string might just be a printed label that is common to all 5.7.* branches.
The actual noobaa-operator image is: registry.redhat.io/ocs4/mcg-rhel8-operator@sha256:6faecc43b775d9083d01f11705334e2afdee11eb585b7761851781c94df124ee

@khover,

Could you please share the output of the following commands:

> oc get -n openshift-storage pod noobaa-db-pg-0 -o yaml
> oc get -n openshift-storage cm noobaa-postgres-config -o yaml

Is it possible to get access to the cluster running the setup, so I could take a closer look?

Thank you!

Comment 38 khover 2021-08-04 12:59:10 UTC

@aindenba


omg get pod -o yaml noobaa-operator-6874fd5d96-pcxcd

    - name: NOOBAA_CORE_IMAGE
      value: registry.redhat.io/ocs4/mcg-core-rhel8@sha256:6ff8645efdde95fa97d496084d3555b7680895f0b79c147f2a880b43742af3a4
    - name: NOOBAA_DB_IMAGE
      value: registry.redhat.io/rhel8/postgresql-12@sha256:f486bbe07f1ddef166bab5a2a6bdcd0e63e6e14d15b42d2425762f83627747bf
    - name: OPERATOR_CONDITION_NAME
      value: ocs-operator.v4.7.2
    image: registry.redhat.io/ocs4/mcg-rhel8-operator@sha256:6faecc43b775d9083d01f11705334e2afdee11eb585b7761851781c94df124ee
    imagePullPolicy: IfNotPresent
    name: noobaa-operator


omg get cm -o yaml noobaa-postgres-config

apiVersion: v1
data:
  noobaa-postgres.conf: '# disable huge_pages trial

    # see https://bugzilla.redhat.com/show_bug.cgi?id=1946792

    huge_pages = off

    '
kind: ConfigMap
metadata:
  creationTimestamp: '2021-07-28T04:16:11Z'
  labels:
    app: noobaa
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:noobaa-postgres.conf: {}
      f:metadata:
        f:labels:
          .: {}
          f:app: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"f77c19b7-dbb4-4cb7-920b-681888196944"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
    manager: noobaa-operator
    operation: Update
    time: '2021-07-28T04:16:11Z'
  name: noobaa-postgres-config
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: noobaa.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: NooBaa
    name: noobaa
    uid: f77c19b7-dbb4-4cb7-920b-681888196944
  resourceVersion: '97630'
  uid: 66883910-0430-475b-9955-11c48d9f3fdd


I will check on customer availability for remote session.

Comment 39 Alexander Indenbaum 2021-08-04 15:16:43 UTC

(In reply to khover from comment #38)
> @aindenba
> 
> 
> omg get pod -o yaml noobaa-operator-6874fd5d96-pcxcd
> 
>    - name: NOOBAA_CORE_IMAGE
>       value:
> registry.redhat.io/ocs4/mcg-core-rhel8@sha256:
> 6ff8645efdde95fa97d496084d3555b7680895f0b79c147f2a880b43742af3a4
>     - name: NOOBAA_DB_IMAGE
>       value:
> registry.redhat.io/rhel8/postgresql-12@sha256:
> f486bbe07f1ddef166bab5a2a6bdcd0e63e6e14d15b42d2425762f83627747bf
>     - name: OPERATOR_CONDITION_NAME
>       value: ocs-operator.v4.7.2
>     image:
> registry.redhat.io/ocs4/mcg-rhel8-operator@sha256:
> 6faecc43b775d9083d01f11705334e2afdee11eb585b7761851781c94df124ee
>     imagePullPolicy: IfNotPresent
>     name: noobaa-operator

@khover, thank you!

It would be interesting to check whether the noobaa-postgres-config is mounted by the Postgres DB pod:

> ➜ kubectl get pod -n openshift-storage noobaa-db-pg-0 -o yaml
> apiVersion: v1
> kind: Pod
> metadata:
> ...
> spec:
>   containers:
> ...
>     image: centos/postgresql-12-centos7
>     imagePullPolicy: IfNotPresent
>     name: db
> ...
>     volumeMounts:
>     - mountPath: /var/lib/pgsql
>       name: db
>     - mountPath: /opt/app-root/src/postgresql-cfg
>       name: noobaa-postgres-config-volume
>     - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
>       name: noobaa-token-8w5r6
>       readOnly: true
> ...
>   volumes:
>   - name: db
>     persistentVolumeClaim:
>       claimName: db-noobaa-db-pg-0
>   - configMap:
>       defaultMode: 420
>       name: noobaa-postgres-config
>     name: noobaa-postgres-config-volume
>   - name: noobaa-token-8w5r6
>     secret:
>       defaultMode: 420
>       secretName: noobaa-token-8w5r6
> status:
> ...

Do you see /opt/app-root/src/postgresql-cfg/noobaa-postgres.conf in the Postgres DB pod filesystem?

> ➜  kubectl exec -ti -n openshift-storage noobaa-db-pg-0 -c db -- bash
> bash-4.2$ cat /opt/app-root/src/postgresql-cfg/noobaa-postgres.conf
> # disable huge_pages trial
> # see https://bugzilla.redhat.com/show_bug.cgi?id=1946792
> huge_pages = off
> 
> # postgres tuning
> max_connections = 300
> ...
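
To answer the question above without reading the file by eye, a small shell helper can print the effective huge_pages value. This is only a sketch: the function name huge_pages_setting is made up for illustration, and it assumes the setting is written in the "name = value" form shown above.

```shell
#!/bin/sh
# Print the last uncommented huge_pages value found in a postgresql.conf-style
# file (or on stdin when no file is given). Comment lines are skipped because
# the pattern is anchored at the start of the line.
# Assumption: the setting uses the "huge_pages = value" form with an equals sign.
huge_pages_setting() {
    sed -n 's/^[[:space:]]*huge_pages[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$@" | tail -n 1
}

# Possible usage against the pod, using the path from the comment above:
#   kubectl exec -n openshift-storage noobaa-db-pg-0 -c db -- \
#     cat /opt/app-root/src/postgresql-cfg/noobaa-postgres.conf | huge_pages_setting
```

An empty result would mean that the file does not set huge_pages at all. Note that this only inspects the mounted file; it says nothing about what initdb uses.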

Best regards,
~baum

Comment 40 khover 2021-08-04 16:08:58 UTC

(In reply to Alexander Indenbaum from comment #39)

@aindenba


From the customer:
To access this BR lab cluster, we need the incoming IP so we can add it to the IP ACL allow list. Then we can share the private key to access the cluster's admin node.

Comment 41 Alexander Indenbaum 2021-08-05 11:59:57 UTC

Kevan, thank you! Sending my IP address in the email.

Comment 42 Jimmy Scott 2021-08-05 13:44:55 UTC

Hi @aindenba

> Do you see /opt/app-root/src/postgresql-cfg/noobaa-postgres.conf in the Postgres DB pod filesystem?

For me this was the case, BUT, the core dump happens during 'initdb', and initdb doesn't use that file, but another template instead.

I'm using version 4.7.2 as well.

Comment 43 Alexander Indenbaum 2021-08-08 13:08:27 UTC

(In reply to Jimmy Scott from comment #42)
> Hi @aindenba
> 
> > Do you see /opt/app-root/src/postgresql-cfg/noobaa-postgres.conf in the Postgres DB pod filesystem?
> 
> For me this was the case, BUT, the core dump happens during 'initdb', and
> initdb doesn't use that file, but another template instead.
> 
> I'm using version 4.7.2 as well.

@Jimmy Scott, nice to meet you virtually. Thank you for your input.

From what I can see now, this is the failing scenario:

(a) Enable huge pages
(b) Install OCS

In this scenario, the Postgres startup script tries to initialize the DB by running initdb from here:

/usr/share/container-scripts/postgresql/common.sh, line 195


> function initialize_database() {
>   initdb_wrapper initdb
>   ^^^^^^^^^^^^^^^^^^^^^ (*)
>
>   # PostgreSQL configuration.
>   cat >> "$PGDATA/postgresql.conf" <<EOF
>
> # Custom OpenShift configuration:
> include '${POSTGRESQL_CONFIG_FILE}'
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  (**)
> EOF
> ...
> }

(*) executes initdb: src/bin/initdb/initdb.c, test_config_settings() function, line 955

>     /*
>      * Probe for max_connections before shared_buffers, since it is subject to
>      * more constraints than shared_buffers.
>      */
>     printf(_("selecting default max_connections ... "));
>     fflush(stdout);
>
>     for (i = 0; i < connslen; i++)
>     {
>         test_conns = trial_conns[i];
>         test_buffs = MIN_BUFS_FOR_CONNS(test_conns);
>
>         snprintf(cmd, sizeof(cmd),
>                  "\"%s\" --check %s %s "
>                  "-c max_connections=%d "
>                  "-c shared_buffers=%d "
>                  "-c dynamic_shared_memory_type=%s "
>                  "< \"%s\" > \"%s\" 2>&1",
>                  backend_exec, boot_options, extra_options,
>                  test_conns, test_buffs,
>                  dynamic_shared_memory_type,
>                  DEVNULL, DEVNULL);
>         status = system(cmd);
>         ^^^^^^^^^^^^^^^^^^^^^  (***)

Executing Postgres at this stage causes a bus error (from the pod logs above):

> selecting default shared_buffers ... sh: line 1:    32 Bus error            
> (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c
> shared_buffers=16384 -c dynamic_shared_memory_type=posix < "/dev/null" >
> "/dev/null" 2>&1
> sh: line 1:    34 Bus error               (core dumped) "/usr/bin/postgres"

The reason for the bus error is that initialize_database() in common.sh first runs initdb (*) and only then appends the custom Postgres configuration openshift-custom-postgresql.conf (**), which includes noobaa-postgres.conf and so disables the Postgres huge pages probe.

So in this scenario the Postgres binary runs before huge_pages is disabled by the noobaa-postgres.conf configuration.

As a workaround, either install OCS first and then enable huge pages using tuned, or disable huge pages temporarily during OCS installation.
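
Before choosing a workaround, it may help to confirm whether huge pages are actually configured on the node that will run the DB pod. The sketch below reads HugePages_Total from /proc/meminfo; the function names are hypothetical, and the optional file argument exists only so the logic can be tried against a saved copy of meminfo.

```shell
#!/bin/sh
# Print the HugePages_Total value from a meminfo-style file.
# Defaults to /proc/meminfo when no file is given.
hugepages_total() {
    awk '/^HugePages_Total:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Summarize whether huge pages are configured (a count greater than zero).
hugepages_report() {
    total=$(hugepages_total "$1")
    if [ "${total:-0}" -gt 0 ]; then
        echo "huge pages enabled (HugePages_Total=$total)"
    else
        echo "huge pages not enabled"
    fi
}

# Possible usage on the node itself:
#   hugepages_report
```

If the report shows huge pages enabled before OCS is installed, that matches the failing scenario (a) then (b) described above.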

Best regards,
~baum

Comment 44 Jimmy Scott 2021-08-12 07:42:03 UTC

Hi @aindenba,

Nice to meet you as well! And thank you very much for your great analysis!

This is indeed exactly what is happening. Disabling huge pages works around the issue, and they can be enabled again afterwards without problems.

We are, however, worried that we will never be able to do that on a production system in case of issues (or maybe even operator upgrades?), since we need the huge pages for SR-IOV.

Thank you very much for your great work so far!

Kind regards,
Jimmy Scott