Bug 1946792
Summary: | noobaa-db-pg-0 Pod gets stuck in CrashLoopBackOff state when enabling hugepages | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Denis Ollier <dollierp> |
Component: | Multi-Cloud Object Gateway | Assignee: | Nobody <nobody> |
Status: | VERIFIED --- | QA Contact: | Petr Balogh <pbalogh> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.8 | CC: | achernet, aindenba, belimele, ddemoiti, ebenahar, edonnell, fbalak, fdeutsch, jsco, khover, kjosy, muagarwa, nberry, pbalogh, tdesala |
Target Milestone: | --- | Keywords: | AutomationBackLog |
Target Release: | OCS 4.8.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | v4.8.0-424.ci | Doc Type: | Known Issue |
Doc Text: |
Previously, the Multicloud Object Gateway (MCG) database pod crashed because PostgreSQL failed to run on Kubernetes when hugepages were enabled. With the current update, hugepages are disabled for the MCG PostgreSQL pods, so the MCG database pods no longer crash.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | Type: | Bug | |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1938134, 1968438 |
Description
Denis Ollier
2021-04-06 20:56:45 UTC
Triggered verification jobs:
4.8 deployment verification: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Deployment/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-deployment/3/console
Upgrade from a 4.7.2 internal build to a 4.8 internal build, scheduled here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Upgrade-OCS/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto/36/

Hi, I managed to verify the fix on my environments with OCS 4.8 and OCS 4.9 development versions. Thanks!

The 4.8 deployment is verified (see the link in my last comment). The upgrade was blocked by another BZ. Trying the upgrade again here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Upgrade-OCS/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto-acceptance/1/

Based on this execution: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Upgrade-OCS/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto-acceptance/1/testReport/
where both the upgrade and the acceptance suite passed, I am moving this one to VERIFIED.

Hi all,
I installed OCP yesterday, which went fine, but creating a StorageCluster failed, and I traced the failure back to this issue. After some debugging I found that initdb does not use the workaround from /opt/app-root/src/postgresql-cfg/noobaa-postgres.conf. Instead, initdb copies a template from somewhere and uses that to start postgres during initdb (found with strace -f -e openat,execve). After disabling hugepages on the hosts, initdb succeeded and we managed to create the StorageCluster. We then enabled hugepages again, and postgres now works because, after initdb, the config changes from noobaa-postgres.conf are in effect. My error logs were, however, a bit different (see below).
Using image: registry.redhat.io/rhel8/postgresql-12@sha256:f486bbe07f1ddef166bab5a2a6bdcd0e63e6e14d15b42d2425762f83627747bf ### oc -n openshift-storage logs noobaa-db-pg-0 ####################################################################### The files belonging to this database system will be owned by user "postgres". This user must also own the server process. The database cluster will be initialized with locale "en_US.utf8". The default database encoding has accordingly been set to "UTF8". The default text search configuration will be set to "english". Data page checksums are disabled. fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok creating subdirectories ... ok selecting dynamic shared memory implementation ... posix selecting default max_connections ... sh: line 1: 22 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=100 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 30 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=50 -c shared_buffers=500 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 33 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=40 -c shared_buffers=400 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 35 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=30 -c shared_buffers=300 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 37 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 20 selecting default shared_buffers ... 
sh: line 1: 39 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=16384 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 41 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=8192 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 43 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=4096 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 45 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=3584 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 47 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=3072 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 49 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=2560 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 51 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=2048 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 53 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=1536 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 55 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 57 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=900 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 59 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=800 -c 
dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 61 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=700 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 63 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=600 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 65 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=500 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 67 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=400 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 69 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=300 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 71 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 73 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=100 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 75 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=50 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 400kB selecting default time zone ... Etc/UTC creating configuration files ... ok running bootstrap script ... 
child process was terminated by signal 7: Bus error
initdb: removing contents of data directory "/var/lib/pgsql/data/userdata"

Hello Jimmy,
I am trying another deployment with hugepages enabled here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Deployment/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-deployment/4/
with the latest 4.8 build, 4.8.0-175.ci. Which build have you tried? Let's see how my deployment goes.
Petr

Hi Petr,
Thanks for your quick response! We're using the latest stable-4.7 version:
Provider: BareMetal
OpenShift version: 4.7.21
Update channel: stable-4.7
Kind regards,
Jimmy Scott

Hey Jimmy,
I see that my 4.8 deployment passed:
$ oc get node
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-138-189.us-east-2.compute.internal   Ready    worker   140m   v1.21.1+38b3ecc
ip-10-0-159-209.us-east-2.compute.internal   Ready    master   148m   v1.21.1+38b3ecc
ip-10-0-178-116.us-east-2.compute.internal   Ready    worker   140m   v1.21.1+38b3ecc
ip-10-0-186-127.us-east-2.compute.internal   Ready    master   148m   v1.21.1+38b3ecc
ip-10-0-197-178.us-east-2.compute.internal   Ready    master   148m   v1.21.1+38b3ecc
ip-10-0-223-61.us-east-2.compute.internal    Ready    worker   140m   v1.21.1+38b3ecc
$ oc debug node/ip-10-0-223-61.us-east-2.compute.internal
Creating debug namespace/openshift-debug-node-lsz4q ...
Starting pod/ip-10-0-223-61us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.223.61
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# grep -i HugePages /proc/meminfo
AnonHugePages:   1544192 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:      32
HugePages_Free:       32
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
I see that OCS is running well:
$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.8.0-175.ci   OpenShift Container Storage   4.8.0-175.ci              Succeeded
$ oc get pod -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-c8mcj                                            3/3     Running     0          108m
csi-cephfsplugin-lcrn9                                            3/3     Running     0          108m
csi-cephfsplugin-provisioner-5dd599f584-fnrjs                     6/6     Running     0          108m
csi-cephfsplugin-provisioner-5dd599f584-zjwft                     6/6     Running     0          108m
csi-cephfsplugin-w49sg                                            3/3     Running     0          108m
csi-rbdplugin-gq7t2                                               3/3     Running     0          108m
csi-rbdplugin-l7j8m                                               3/3     Running     0          108m
csi-rbdplugin-provisioner-85b4b68989-qdrnd                        6/6     Running     0          108m
csi-rbdplugin-provisioner-85b4b68989-rl9mf                        6/6     Running     0          108m
csi-rbdplugin-wth74                                               3/3     Running     0          108m
noobaa-core-0                                                     1/1     Running     0          105m
noobaa-db-pg-0                                                    1/1     Running     0          105m
noobaa-endpoint-69f747b466-4hdcb                                  1/1     Running     0          103m
noobaa-operator-5949d9576f-fx7bw                                  1/1     Running     0          108m
ocs-metrics-exporter-69896b547b-kt62k                             1/1     Running     0          108m
ocs-operator-59d47555b5-qn4mv                                     1/1     Running     0          108m
rook-ceph-crashcollector-ip-10-0-138-189-5d7d75bf97-2bpnh         1/1     Running     0          106m
rook-ceph-crashcollector-ip-10-0-178-116-55bdf67865-jg2gt         1/1     Running     0          106m
rook-ceph-crashcollector-ip-10-0-223-61-75b85947f5-lflcp          1/1     Running     0          105m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-d66b8f7cbrfpr   2/2     Running     0          105m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5f68d596l9mlx   2/2     Running     0          105m
rook-ceph-mgr-a-745c8c4b48-wxwvt                                  2/2     Running     0          106m
rook-ceph-mon-a-544d55cc56-ctct9                                  2/2     Running     0          107m
rook-ceph-mon-b-854f6db444-8p5t8                                  2/2     Running     0          107m
rook-ceph-mon-c-5df87545cb-49z2g                                  2/2     Running     0          106m
rook-ceph-operator-7d7cf8b6b4-p2qlk                               1/1     Running     0          108m
rook-ceph-osd-0-cf77f9677-r4rvh                                   2/2     Running     0          105m
rook-ceph-osd-1-6dd4d6df95-k5m2s                                  2/2     Running     0          105m
rook-ceph-osd-2-6ffdbfd78-sgv8g                                   2/2     Running     0          105m
rook-ceph-osd-prepare-ocs-deviceset-0-data-04zc85-8r97j           0/1     Completed   0          106m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0q2w4t-pq285           0/1     Completed   0          106m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0cl5s6-t4gvl           0/1     Completed   0          106m
rook-ceph-tools-bd9b4677b-k9kw7                                   1/1     Running     0          103m
I now see that you mentioned version 4.7, but this bug is for 4.8. Could you please update the second bug, which is for 4.7, instead? Can you please check in the OCS CSV whether you are using 4.7.2?
https://bugzilla.redhat.com/show_bug.cgi?id=1968438
This is the 4.7 bug.
BTW, I am trying the 4.7 GA version here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-deployment/5/
I am testing this on AWS, not on bare metal as you are.
Petr

VZw using hugepages hitting this bug case # 02998338
[corona.mec.ouroath.com ~]$ oc logs noobaa-db-pg-0
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 
sh: line 1: 22 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=100 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 24 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=50 -c shared_buffers=500 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 26 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=40 -c shared_buffers=400 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 28 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=30 -c shared_buffers=300 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 30 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 20 selecting default shared_buffers ... sh: line 1: 32 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=16384 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 34 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=8192 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 36 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=4096 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 38 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=3584 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 40 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=3072 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 42 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c 
max_connections=20 -c shared_buffers=2560 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 44 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=2048 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 46 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=1536 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 48 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 50 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=900 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 52 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=800 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 54 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=700 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 56 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=600 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 58 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=500 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 60 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=400 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 62 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=300 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1 sh: line 1: 64 Bus error 
(core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 66 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=100 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 68 Bus error (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=50 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
400kB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... child process was terminated by signal 7: Bus error
initdb: removing contents of data directory "/var/lib/pgsql/data/userdata"

4.7.2 deployment passed for me as well with hugepages enabled.

(In reply to khover from comment #31)
> VZw using hugepages hitting this bug case # 02998338
>
> [corona.mec.ouroath.com ~]$ oc logs noobaa-db-pg-0
> [...]
OCS Version is : 4.7.2

(In reply to khover from comment #33)
> (In reply to khover from comment #31)
> > VZw using hugepages hitting this bug case # 02998338
> OCS Version is : 4.7.2
Hello @khover,
Looking at noobaa-operator-6874fd5d96-pcxcd-noobaa-operator.log, part of noobaa_diagnostics_1627661740.tar.gz, mustgather-ocs.tar.gz attached at case/02998338:
> time="2021-07-28T04:03:38Z" level=info msg="CLI version: 5.7.0\n"
> time="2021-07-28T04:03:38Z" level=info msg="noobaa-image: noobaa/noobaa-core:5.7.0\n"
> time="2021-07-28T04:03:38Z" level=info msg="operator-image: noobaa/noobaa-operator:5.7.0\n"
Version 5.7.0 is expected to fail. Please try NooBaa >= 5.7.2

(In reply to Alexander Indenbaum from comment #34)
> Version 5.7.0 is expected to fail. Please try NooBaa >= 5.7.2
Hello @aindenba
How do I get the version of noobaa from 5.7.0 to > 5.7.2 in the current version of OCS? I don't think the noobaa operator was manually installed.
In my 4.7.2 I have the same:
time="2021-08-02T17:42:59Z" level=info msg="CLI version: 5.7.0\n"
time="2021-08-02T17:42:59Z" level=info msg="noobaa-image: noobaa/noobaa-core:5.7.0\n"
time="2021-08-02T17:42:59Z" level=info msg="operator-image: noobaa/noobaa-operator:5.7.0\n"
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j005ai3c33-d/j005ai3c33-d_20210802T162953/logs/deployment_1627922635/ocs_must_gather/registry-redhat-io-ocs4-ocs-must-gather-rhel8-sha256-1949179411885858ec719ab052868c734b98b49787498a8297f1a4ace0283eae/noobaa/logs/openshift-storage/noobaa-operator-66d84fc498-qkvsw.log
This was fixed in 4.7.2 as well, as part of this BZ. What about opening a new bug for the issue you see?

The version string might be just a print, common to the 5.7.* branches.
The actual noobaa-operator image is: registry.redhat.io/ocs4/mcg-rhel8-operator@sha256:6faecc43b775d9083d01f11705334e2afdee11eb585b7761851781c94df124ee
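Since the version string in the operator log may be shared across 5.7.* builds, comparing image digests is probably a more reliable way to tell builds apart. A minimal sketch, assuming the image reference has the NAME@sha256:HEX form shown above; image_digest is a hypothetical helper, not part of oc or NooBaa:

```shell
# Hypothetical helper (not an oc or NooBaa command): print the digest part
# of an image reference of the form NAME@sha256:HEX.
# Prints nothing and returns 1 when the reference carries no digest.
image_digest() {
    case "$1" in
        *@sha256:*) printf '%s\n' "${1##*@}" ;;   # strip everything up to the last '@'
        *) return 1 ;;
    esac
}

# Example with the operator image quoted in this comment.
image_digest "registry.redhat.io/ocs4/mcg-rhel8-operator@sha256:6faecc43b775d9083d01f11705334e2afdee11eb585b7761851781c94df124ee"
```

Two deployments that print the same digest run the same operator image, whatever the log banner says.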
@khover,
Could you please share the output of the following commands:
> oc get -n openshift-storage pod noobaa-db-pg-0 -o yaml
> oc get -n openshift-storage cm noobaa-postgres-config -o yaml
Is it possible to get access to the cluster running the setup, so I could take a closer look?
Thank you!
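The two commands requested above could be wrapped so that their output lands in files that are easy to attach to the case. This is only a sketch: collect_noobaa_db_info and the output file names are hypothetical, and the namespace and object names are the ones used in this thread.

```shell
# Hypothetical helper: save the two requested outputs into a directory.
# $1 is the output directory; the optional $2 is the CLI to use
# (oc by default; kubectl is assumed to accept the same arguments).
collect_noobaa_db_info() {
    outdir=$1
    cli=${2:-oc}
    mkdir -p "$outdir" || return 1
    # Pod spec of the Postgres DB pod, including its volume mounts.
    "$cli" get -n openshift-storage pod noobaa-db-pg-0 -o yaml > "$outdir/noobaa-db-pg-0.yaml"
    # ConfigMap that carries the huge_pages workaround.
    "$cli" get -n openshift-storage cm noobaa-postgres-config -o yaml > "$outdir/noobaa-postgres-config.yaml"
}
```

For example, running collect_noobaa_db_info ./bz1946792 while logged in to the cluster should leave both YAML files in ./bz1946792.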
@aindenba
omg get pod -o yaml noobaa-operator-6874fd5d96-pcxcd
    - name: NOOBAA_CORE_IMAGE
      value: registry.redhat.io/ocs4/mcg-core-rhel8@sha256:6ff8645efdde95fa97d496084d3555b7680895f0b79c147f2a880b43742af3a4
    - name: NOOBAA_DB_IMAGE
      value: registry.redhat.io/rhel8/postgresql-12@sha256:f486bbe07f1ddef166bab5a2a6bdcd0e63e6e14d15b42d2425762f83627747bf
    - name: OPERATOR_CONDITION_NAME
      value: ocs-operator.v4.7.2
    image: registry.redhat.io/ocs4/mcg-rhel8-operator@sha256:6faecc43b775d9083d01f11705334e2afdee11eb585b7761851781c94df124ee
    imagePullPolicy: IfNotPresent
    name: noobaa-operator
omg get cm -o yaml noobaa-postgres-config
apiVersion: v1
data:
  noobaa-postgres.conf: '# disable huge_pages trial

    # see https://bugzilla.redhat.com/show_bug.cgi?id=1946792

    huge_pages = off

    '
kind: ConfigMap
metadata:
  creationTimestamp: '2021-07-28T04:16:11Z'
  labels:
    app: noobaa
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:noobaa-postgres.conf: {}
      f:metadata:
        f:labels:
          .: {}
          f:app: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"f77c19b7-dbb4-4cb7-920b-681888196944"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
    manager: noobaa-operator
    operation: Update
    time: '2021-07-28T04:16:11Z'
  name: noobaa-postgres-config
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: noobaa.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: NooBaa
    name: noobaa
    uid: f77c19b7-dbb4-4cb7-920b-681888196944
  resourceVersion: '97630'
  uid: 66883910-0430-475b-9955-11c48d9f3fdd
I will check on customer availability for a remote session.
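To confirm that a given configuration file really turns hugepages off, the effective huge_pages line can be extracted mechanically. A minimal sketch, assuming the setting is written unquoted with an equals sign, as in the ConfigMap above; huge_pages_setting is a hypothetical helper:

```shell
# Hypothetical helper: read postgresql.conf-style text on stdin and print
# the value of the last uncommented "huge_pages = VALUE" line (a later
# setting overrides an earlier one). Prints nothing if the option is unset.
# Assumption: the value is unquoted and an '=' sign is present.
huge_pages_setting() {
    sed -n 's/^[[:space:]]*huge_pages[[:space:]]*=[[:space:]]*\([A-Za-z]*\).*/\1/p' | tail -n 1
}

# Example with the content of noobaa-postgres.conf from the ConfigMap above.
printf '# disable huge_pages trial\nhuge_pages = off\n' | huge_pages_setting   # prints: off
```

Inside a running cluster the same check could be fed from the mounted file, for instance with oc exec -n openshift-storage noobaa-db-pg-0 -c db -- cat /opt/app-root/src/postgresql-cfg/noobaa-postgres.conf piped into huge_pages_setting.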
(In reply to khover from comment #38)
> @aindenba
>
> omg get pod -o yaml noobaa-operator-6874fd5d96-pcxcd
> [...]
@khover, thank you!
It would be interesting to know whether the noobaa-postgres-config is mounted by the Postgres DB pod:
> ➜ kubectl get pod -n openshift-storage noobaa-db-pg-0 -o yaml
> apiVersion: v1
> kind: Pod
> metadata:
>   ...
> spec:
>   containers:
>   ...
>     image: centos/postgresql-12-centos7
>     imagePullPolicy: IfNotPresent
>     name: db
>     ...
>     volumeMounts:
>     - mountPath: /var/lib/pgsql
>       name: db
>     - mountPath: /opt/app-root/src/postgresql-cfg
>       name: noobaa-postgres-config-volume
>     - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
>       name: noobaa-token-8w5r6
>       readOnly: true
>   ...
>   volumes:
>   - name: db
>     persistentVolumeClaim:
>       claimName: db-noobaa-db-pg-0
>   - configMap:
>       defaultMode: 420
>       name: noobaa-postgres-config
>     name: noobaa-postgres-config-volume
>   - name: noobaa-token-8w5r6
>     secret:
>       defaultMode: 420
>       secretName: noobaa-token-8w5r6
> status:
>   ...
Do you see /opt/app-root/src/postgresql-cfg/noobaa-postgres.conf in the Postgres DB pod filesystem?
> ➜ kubectl exec -ti -n openshift-storage noobaa-db-pg-0 -c db -- bash
> bash-4.2$ cat /opt/app-root/src/postgresql-cfg/noobaa-postgres.conf
> # disable huge_pages trial
> # see https://bugzilla.redhat.com/show_bug.cgi?id=1946792
> huge_pages = off
>
> # postgres tuning
> max_connections = 300
> ...
Best regards,
~baum

(In reply to Alexander Indenbaum from comment #39)
> [...]
> Do you see /opt/app-root/src/postgresql-cfg/noobaa-postgres.conf in the
> Postgres DB pod filesystem?
> [...]
@aindenba
From the customer:
To access this BR lab cluster, we need the incoming IP so we can add it to the IP ACL allow list. Then we can share the private key to access the cluster's admin node.

Kevan, thank you! Sending my IP address in the email.

Hi @aindenba
> Do you see /opt/app-root/src/postgresql-cfg/noobaa-postgres.conf in the Postgres DB pod filesystem?
For me this was the case, BUT, the core dump happens during 'initdb', and initdb doesn't use that file, but another template instead.
I'm using version 4.7.2 as well.
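
Because the crash is tied to initdb running while huge pages are present on the node, it can help to check the node before a first-time initialization. The following is a minimal sketch, not part of OCS or the Postgres image: the function names are hypothetical, and it only assumes the standard Linux /proc/meminfo layout with its HugePages_Total field.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from OCS or the Postgres image).
# Assumption: the input file follows the Linux /proc/meminfo layout.

# Print the number of huge pages allocated on the node (0 if the field is absent).
hugepages_total() {
  local meminfo="${1:-/proc/meminfo}"
  awk '$1 == "HugePages_Total:" { print $2; found = 1; exit }
       END { if (!found) print 0 }' "$meminfo"
}

# Succeed (exit status 0) when huge pages are allocated, i.e. when a
# first-time initdb may hit the bus error discussed in this bug.
first_init_may_crash() {
  [ "$(hugepages_total "$1")" -gt 0 ]
}
```

Calling `first_init_may_crash && echo "huge pages are allocated on this node"` on a host gives a quick indication; a zero HugePages_Total only means that no huge pages are currently allocated.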
(In reply to Jimmy Scott from comment #42)
> Hi @aindenba
>
> > Do you see /opt/app-root/src/postgresql-cfg/noobaa-postgres.conf in the Postgres DB pod filesystem?
>
> For me this was the case, BUT, the core dump happens during 'initdb', and
> initdb doesn't use that file, but another template instead.
>
> I'm using version 4.7.2 as well.

@Jimmy Scott, nice to meet you virtually. Thank you for your input.

From what I can see now, this is the failing scenario:
(a) Enable huge pages
(b) Install OCS

In this scenario, the Postgres startup script tries to initialize the DB by running initdb from here:

/usr/share/container-scripts/postgresql/common.sh, line 195
> function initialize_database() {
>   initdb_wrapper initdb
>   ^^^^^^^^^^^^^^^^^^^^^ (*)
>
>   # PostgreSQL configuration.
>   cat >> "$PGDATA/postgresql.conf" <<EOF
>
> # Custom OpenShift configuration:
> include '${POSTGRESQL_CONFIG_FILE}'
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (**)
> EOF
>   ...
> }

(*) executes initdb:

src/bin/initdb/initdb.c, test_config_settings() function, line 955
>     /*
>      * Probe for max_connections before shared_buffers, since it is subject to
>      * more constraints than shared_buffers.
>      */
>     printf(_("selecting default max_connections ... "));
>     fflush(stdout);
>
>     for (i = 0; i < connslen; i++)
>     {
>         test_conns = trial_conns[i];
>         test_buffs = MIN_BUFS_FOR_CONNS(test_conns);
>
>         snprintf(cmd, sizeof(cmd),
>                  "\"%s\" --check %s %s "
>                  "-c max_connections=%d "
>                  "-c shared_buffers=%d "
>                  "-c dynamic_shared_memory_type=%s "
>                  "< \"%s\" > \"%s\" 2>&1",
>                  backend_exec, boot_options, extra_options,
>                  test_conns, test_buffs,
>                  dynamic_shared_memory_type,
>                  DEVNULL, DEVNULL);
>         status = system(cmd);
>         ^^^^^^^^^^^^^^^^^^^^^ (***)

Executing Postgres at this stage causes a bus error (from the pod logs above):

> selecting default shared_buffers ... sh: line 1: 32 Bus error
> (core dumped) "/usr/bin/postgres" --boot -x0 -F -c max_connections=20 -c
> shared_buffers=16384 -c dynamic_shared_memory_type=posix < "/dev/null" >
> "/dev/null" 2>&1
> sh: line 1: 34 Bus error (core dumped) "/usr/bin/postgres"

The reason for the bus error is that initialize_database() in common.sh first runs initdb (*) and only then appends the custom Postgres configuration openshift-custom-postgresql.conf (**), which includes noobaa-postgres.conf, the file that disables the Postgres huge pages probe. So in this scenario the Postgres binary runs before huge_pages is disabled by the noobaa-postgres.conf configuration.

So as a workaround, either install OCS first and then enable huge pages using tuned, or disable huge pages temporarily during OCS installation.

Best regards,
~baum

Hi @aindenba,

Nice to meet you as well! And thank you very much for your great analysis!

This is indeed exactly what is happening: disabling huge pages works around the issue, and they can be enabled again afterwards without problems.

We are, however, worried that we will never be able to do that on a production system in case of issues (or maybe even operator upgrades?), since we need the huge pages for SR-IOV.

Thank you very much for your great work so far!

Kind regards,
Jimmy Scott
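
The ordering problem analyzed above can be illustrated with a toy model. This is a sketch under stated assumptions, not the real container script: both function names are hypothetical, the config parser handles only simple unquoted `key = value` lines, and appending the custom file with `cat` stands in for the `include` directive that the real script writes. The only PostgreSQL fact it relies on is that the built-in default for huge_pages is `try`.

```shell
#!/usr/bin/env bash
# Toy model of the ordering in initialize_database(); NOT the real container script.
# Both function names are hypothetical.

# Print the huge_pages value a server reading this config file would use.
# PostgreSQL's built-in default is "try"; the last assignment in the file wins.
effective_huge_pages() {
  local conf="$1" value="try" line
  [ -f "$conf" ] || { echo "$value"; return 0; }
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      huge_pages*=*)
        value="${line#*=}"               # text after the first "="
        value="${value%%#*}"             # drop a trailing comment
        value="${value//[[:space:]]/}"   # drop whitespace
        ;;
    esac
  done < "$conf"
  echo "$value"
}

# Model a first start: the probe runs before the custom settings are appended.
simulate_first_start() {
  local pgdata="$1" custom_conf="$2"
  # Stand-in: at probe time no custom settings exist in the data directory.
  : > "$pgdata/postgresql.conf"
  echo "probe: huge_pages=$(effective_huge_pages "$pgdata/postgresql.conf")"   # step (*)
  # Stand-in for appending the include of the custom config file.
  cat "$custom_conf" >> "$pgdata/postgresql.conf"                              # step (**)
  echo "server: huge_pages=$(effective_huge_pages "$pgdata/postgresql.conf")"
}
```

With a custom file that contains `huge_pages = off`, the model prints `probe: huge_pages=try` followed by `server: huge_pages=off`. That matches the observed behavior: the first initialization still probes huge pages, while later starts of an already initialized database honor the override.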