Description of problem (please be as detailed as possible and provide log snippets):

After a rebuild of ODF and a rebuild of NooBaa, noobaa-db-pg-0 is stuck in a CLBO (CrashLoopBackOff) state.

grep -A6 'conditions' noobaa/namespaces/openshift-storage/noobaa.io/noobaas/noobaa.yaml
  conditions:
  - lastHeartbeatTime: "2022-02-04T19:22:35Z"
    lastTransitionTime: "2022-02-04T19:22:35Z"
    message: 'RPC: connection (0xc0016d0000) already closed &{RPC:0xc000502a50 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}'
    reason: TemporaryError
    status: "False"
    type: Available

omg get pods | grep -iv running
NAME                                                             READY  STATUS     RESTARTS  AGE
noobaa-db-pg-0                                                   0/1    Pending    0         1h44m
rook-ceph-osd-prepare-ocs-deviceset-localocs-0-data-0zs6mfhwlmm  0/1    Succeeded  0         2h47m
rook-ceph-osd-prepare-ocs-deviceset-localocs-0-data-12p4t7qvxp6  0/1    Succeeded  0         2h47m
rook-ceph-osd-prepare-ocs-deviceset-localocs-0-data-2q4mnmwg254  0/1    Succeeded  0         2h47m

namespaces/openshift-storage/pods/noobaa-db-pg-0/noobaa-db-pg-0.yaml
  state:
    waiting:
      message: back-off 5m0s restarting failed container=initialize-database pod=noobaa-db-pg-0_openshift-storage(546f8321-e79b-4a81-b5d7-e69fa8e45b61)
      reason: CrashLoopBackOff
  phase: Pending

Logs from noobaa-operator-5746b8bf88-jqnjk:
2022-02-04T22:08:10.611587696Z time="2022-02-04T22:08:10Z" level=info msg="Update event detected for ocs-storagecluster-cephcluster (openshift-storage), queuing Reconcile"
2022-02-04T22:08:10.632800121Z time="2022-02-04T22:08:10Z" level=info msg="✅ Exists: \"ocs-storagecluster-cephcluster\"\n"
2022-02-04T22:08:10.643587507Z time="2022-02-04T22:08:10Z" level=info msg="✅ Exists: NooBaa \"noobaa\"\n"
2022-02-04T22:08:10.646713150Z time="2022-02-04T22:08:10Z" level=info msg="✅ Exists: Service \"noobaa-mgmt\"\n"
2022-02-04T22:08:10.649641242Z time="2022-02-04T22:08:10Z" level=info msg="❌ Not Found: Secret \"noobaa-operator\"\n"
2022-02-04T22:08:10.649665470Z time="2022-02-04T22:08:10Z" level=error msg="Could not connect to system Connect(): SecretOp not found"

Event:
openshift-storage  noobaa-db-pg-0  FailedScheduling  0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.

Version of all relevant components (if applicable):
ocs-operator.v4.8.6

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Blocking ODF deployment

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Had the customer run the following:

$ oc rsh noobaa-operator-5746b8bf88-jqnjk
sh-4.4$ curl -kv https://noobaa-mgmt.openshift-storage.svc:443
* Rebuilt URL to: https://noobaa-mgmt.openshift-storage.svc:443/
*   Trying 172.30.88.33...
* TCP_NODELAY set
* Connected to noobaa-mgmt.openshift-storage.svc (172.30.88.33) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
    CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=noobaa-mgmt.openshift-storage.svc
*  start date: Feb  4 19:22:39 2022 GMT
*  expire date: Feb  4 19:22:40 2024 GMT
*  issuer: CN=openshift-service-serving-signer@1638811499
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
> GET / HTTP/1.1
> Host: noobaa-mgmt.openshift-storage.svc
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 302 Found
< Location: /fe/
< Vary: Accept, Accept-Encoding
< Content-Type: text/plain; charset=utf-8
< Content-Length: 26
< Date: Mon, 07 Feb 2022 15:25:49 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
* Connection #0 to host noobaa-mgmt.openshift-storage.svc left intact
Found. Redirecting to /fe/
sh-4.4$
Hello,

Any updates or workaround we can share with the customer? The customer is losing patience and faith in the product.
Hello,

If the workaround includes disabling huge pages, what are the steps for that?
Hello,

According to the ODF must-gather, the Postgres container image is registry.redhat.io/rhel8/postgresql-12@sha256:623bdaa1c6ae047db7f62d82526220fac099837afd8770ccc6acfac4c7cff100, i.e. this image uses RHEL8 as a base:

> bash-4.4$ cat /etc/redhat-release
> Red Hat Enterprise Linux release 8.5 (Ootpa)

Previously, Postgres container images used RHEL7 as a base. For instance, the upstream uses:

> bash-4.2$ cat /etc/redhat-release
> CentOS Linux release 7.8.2003 (Core)

Is the Postgres container image base OS change expected?

@khover the posted workaround PR would support the RHEL8 fs layout and run Postgres with huge pages disabled.

Best regards!
My customer already has huge pages enabled in the cluster. As per my understanding so far, the workaround is: uninstall OCS, disable huge pages, reinstall OCS. How do I disable huge pages temporarily during OCS installation? The customer's temperature is high, so I just want to be sure that what we try next succeeds.
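For reference, here is a minimal sketch of how static huge pages can be inspected, and released temporarily, from a node shell (e.g. via `oc debug node/<node-name>`). This assumes the pages were allocated through the vm.nr_hugepages sysctl; if they were configured via kernel boot arguments or a MachineConfig, that configuration would have to be changed instead:

```shell
# Show the node's current static huge page allocation
grep -E 'HugePages_(Total|Free)' /proc/meminfo

# Temporarily release all static huge pages (requires root).
# This does not survive a reboot and will be undone by whatever
# mechanism originally allocated the pages:
# sysctl -w vm.nr_hugepages=0
```

After OCS/NooBaa finishes installing, huge pages could be re-enabled the same way (sysctl -w vm.nr_hugepages=<previous value>), per the alternative workaround mentioned later in this bug.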
@khover any input about why a RHEL8-based Postgres container is used?

As a workaround, see https://github.com/noobaa/noobaa-operator/pull/853/files; it is a tiny change. To get through DB initialization, run "oc edit cm noobaa-postgres-initdb-sh" and add the if block below, marked by plus signs. This procedure does not require reinstalling OCS.

  # Wrap the postgres binary, force huge_pages=off for initdb
  # see https://bugzilla.redhat.com/show_bug.cgi?id=1946792
  p=/opt/rh/rh-postgresql12/root/usr/bin/postgres
+
+ # Latest RH images moved the postgres binary
+ # from /opt/rh/rh-postgresql12/root/usr/bin/postgres to /usr/bin/postgres
+ # see https://bugzilla.redhat.com/show_bug.cgi?id=2051249
+ if [ ! -x $p ]; then
+     p=/usr/bin/postgres
+ fi
+
  mv $p $p.orig
  echo exec $p.orig \"\$@\" -c huge_pages=off > $p
  chmod 755 $p

Alternatively:
- you could disable huge pages during the OCS/NooBaa installation and then re-enable huge pages
- use a RHEL7-based Postgres container

Let me know if you need any additional help.

Best regards!
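To make the wrapper trick concrete, it can be exercised end-to-end outside the cluster. The sketch below is only a demonstration against a throwaway directory: the fake postgres binary just echoes its arguments, and the temp-dir paths stand in for the real image layout.

```shell
#!/bin/sh
set -e

# Throwaway layout standing in for the container filesystem,
# with a fake RHEL8-style /usr/bin/postgres that echoes its args.
tmp=$(mktemp -d)
mkdir -p "$tmp/usr/bin"
printf '#!/bin/sh\necho postgres "$@"\n' > "$tmp/usr/bin/postgres"
chmod 755 "$tmp/usr/bin/postgres"

# Path fallback from the workaround PR: prefer the RHEL7/SCL
# location, fall back to /usr/bin when it is absent (RHEL8 layout).
p="$tmp/opt/rh/rh-postgresql12/root/usr/bin/postgres"
if [ ! -x "$p" ]; then
    p="$tmp/usr/bin/postgres"
fi

# Wrap the binary so every invocation gets huge_pages=off appended.
mv "$p" "$p.orig"
{ echo '#!/bin/sh'; echo "exec $p.orig \"\$@\" -c huge_pages=off"; } > "$p"
chmod 755 "$p"

# The wrapped binary forwards its arguments and forces huge_pages=off:
"$p" -D /var/lib/pgsql/data
# prints: postgres -D /var/lib/pgsql/data -c huge_pages=off

rm -rf "$tmp"
```

In the real configmap the wrapper is written without a shebang; the explicit `#!/bin/sh` here is only for robustness of the standalone demo.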
Hi Alex,

Thanks for all your help on this.

Re: @khover any input about why a RHEL8-based Postgres container is used?

I honestly don't know; this is an install of ocs-operator.v4.8.6. If there is some way to check, or info is needed, I'd be happy to help you collect it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:1372
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days