Bug 2010702
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | [Tracker] Pod noobaa-db-pg-0 crash loops (CrashLoopBackOff) upon reschedule during ocs-ci test run | | |
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Dirk Haubenreisser <dhaubenr> |
| Component: | Multi-Cloud Object Gateway | Assignee: | Nimrod Becker <nbecker> |
| Status: | NEW --- | QA Contact: | Raz Tamir <ratamir> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.9 | CC: | belimele, muagarwa, odf-bz-bot |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | s390x | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-25 07:57:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2013698 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Description
Dirk Haubenreisser
2021-10-05 11:37:34 UTC
Increasing the severity as it is blocking the IBM team.

Hi Dirk,

Can you please provide the must-gather? Can you also provide the DB logs (including logs from the pod before the first crash)? Is this happening on platforms other than s390x?

Created attachment 1829899 [details]
OCP must_gather report for cluster exhibiting issue
Created a must_gather report for the cluster exhibiting the issue:
- ran must_gather on the cluster bastion host with this command:
```
oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.9 --dest-dir=./ocs_must_gather
```
- data compiled by must_gather:
```
ClusterID: 755a520c-29ad-4f64-bfc6-020f9b796108
ClusterVersion: Stable at "4.9.0-0.nightly-s390x-2021-09-09-135631"
ClusterOperators:
All healthy and stable
```
> Can you please provide the must-gather?

Please see the report attached to this BZ. The report was generated with the noobaa-db-pg-0 pod crashlooping and no other workload running on the test cluster.

> Can you also provide the DB logs (including logs from the pod before the first crash)?

I will do so, I just need some guidance on how to fetch the actual PostgreSQL logs - are the noobaa-db-pg-0 pod logs sufficient? Or do I need to exec into the pod and fetch the log files from somewhere in the filesystem?

> Is this happening on platforms other than s390x?

I don't know, I only have access to s390x - since no one else seems to have reported this issue before, I can only assume that it happens on s390x only.

I've been in touch with Liran via Slack and we both agreed to re-run the ocs-ci test cases that led to the issue and then try to fetch a PostgreSQL database dump via `pg_dump nbcore` from the crashlooping pod `noobaa-db-pg-0`; this would involve some trickery editing the noobaa PostgreSQL StatefulSet.

I re-ran the ocs-ci test against the very same test cluster using the same OCS and LSO driver builds and the same ocs-ci release as in the previous tests, only to realize that this time the tests ran just fine and the issues with the PostgreSQL pod did not occur again. Everything was fine. While investigating why the tests were successful this time around, I found out that a seemingly unrelated ocs-ci framework test parameter had indeed been changed for this latest test run. The parameter determines where the log information generated by ocs-ci is actually written to: a local filesystem (on the same host that is running ocs-ci) or a remote filesystem. The previous test runs stored the log files on the remote filesystem, whereas the last one stored them on the local filesystem. Long story short - with that said I would assume that PostgreSQL is working as expected, but I will look further into this.

Created attachment 1831745 [details]
database log after crash
This is the database log I fetched from the db container after the crash happened.
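For reference, a minimal sketch of how such logs can be pulled from the pod; the `log` subdirectory under the data directory is an assumption based on the 'Future log output will appear in directory "log"' hint in the server startup output:
```
# Pod logs from the current and the previous (crashed) container instance
oc logs noobaa-db-pg-0 -n openshift-storage > noobaa-db-pg-0.log
oc logs noobaa-db-pg-0 -n openshift-storage --previous > noobaa-db-pg-0-previous.log

# PostgreSQL server log files from inside the container (path assumed)
oc cp openshift-storage/noobaa-db-pg-0:/var/lib/pgsql/data/userdata/log ./pg-logs
```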
My suspicions about the ocs-ci framework configuration were not confirmed: another test run failed again with the same error. I've attached the database log file to this ticket. Additionally, I sent Liran the whole content of the database container's '/var/lib/pgsql/data/userdata' directory for further analysis.

Created attachment 1831747 [details]
database files
This is the contents of the database 'userdata' directory
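A minimal sketch of how the 'userdata' directory can be copied out of the db container, assuming it is archived in-place first so the many small files and their permissions survive the transfer (archive name and target path are illustrative):
```
# Archive the PostgreSQL data directory inside the db container
oc exec -n openshift-storage noobaa-db-pg-0 -- \
  tar czf /tmp/userdata.tar.gz -C /var/lib/pgsql/data userdata

# Copy the archive to the local machine for analysis
oc cp openshift-storage/noobaa-db-pg-0:/tmp/userdata.tar.gz ./userdata.tar.gz
```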
With the help of the support article https://access.redhat.com/solutions/5668581 we were able to get the noobaa-db-pg-0 pod back into a working state. After that I created a database dump, as suggested by Liran, via `pg_dump nbcore`.

Created attachment 1831783 [details]
nbcore database dump
This is the database dump for PostgreSQL database nbcore
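A minimal sketch of taking such a dump from the running pod, assuming the container's default database user has access to the `nbcore` database:
```
# Dump the nbcore database from inside the db pod to a local SQL file
oc exec -n openshift-storage noobaa-db-pg-0 -- pg_dump nbcore > nbcore-dump.sql
```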
Created attachment 1831850 [details]
database logs
These are the database logs
This is a Postgres bug that has a workaround written here: https://access.redhat.com/solutions/5668581

Another tier4a test run late last night using the most recent ODF and LSO builds did *not* have this issue at all. I'm going to do more test runs this week, so fingers crossed this works fine now and the issue is a thing of the past.

The issue manifested again on the latest tier4a test run:
```
[haubenr@m4208001 ~]$ oc logs noobaa-db-pg-0 -n openshift-storage
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2021-10-20 14:07:26.117 UTC [27] LOG:  starting PostgreSQL 12.7 on s390x-ibm-linux-gnu, compiled by gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), 64-bit
2021-10-20 14:07:26.119 UTC [27] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-10-20 14:07:26.139 UTC [27] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-10-20 14:07:26.188 UTC [27] LOG:  redirecting log output to logging collector process
2021-10-20 14:07:26.188 UTC [27] HINT:  Future log output will appear in directory "log".
 done
server started
/var/run/postgresql:5432 - accepting connections
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
ERROR:  tuple already updated by self
```
Installed the latest available release of ODF via ocs-ci prior to the test run:
```
[haubenr@m4208001 ~]$ oc get csv -n openshift-storage
NAME                     DISPLAY                       VERSION   REPLACES   PHASE
noobaa-operator.v4.9.0   NooBaa Operator               4.9.0                Succeeded
ocs-operator.v4.9.0      OpenShift Container Storage   4.9.0                Succeeded
odf-operator.v4.9.0      OpenShift Data Foundation     4.9.0                Succeeded
```

I dug into the test suite some more and tried to narrow down the circumstances under which the issue occurs. I'm pretty confident that the issue is provoked by the test case 'test_drain_mcg_pod_node' (https://github.com/red-hat-storage/ocs-ci/blob/40a14ab5cfb447a14e81277bae2b94d86c11b8a1/tests/manage/mcg/test_mcg_resources_disruptions.py#L130), which evicts all pods from the cluster worker node that hosts the noobaa pods (noobaa-core as well as noobaa-db-pg). Prior to this particular test case, noobaa was running fine (including the noobaa-db-pg pod). Here's the relevant part of the test logs:
```
tests/manage/mcg/test_mcg_resources_disruptions.py::TestMCGResourcesDisruptions::test_drain_mcg_pod_node[noobaa_core]
-------------------------------- live log setup --------------------------------
14:04:38 - MainThread - tests.conftest - INFO - Checking for Ceph Health OK
14:04:38 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc wait --for condition=ready pod -l app=rook-ceph-tools -n openshift-storage --timeout=120s
14:04:38 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get pod -l 'app=rook-ceph-tools' -o jsonpath='{.items[0].metadata.name}'
14:04:38 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-6bb55cd9f7-jwf4g -- ceph health
14:04:40 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.
14:04:40 - MainThread - tests.conftest - INFO - Ceph health check passed at setup
14:04:40 - MainThread - ocs_ci.utility.utils - INFO - Executing command: ['oc', 'login', '-u', 'kubeadmin', '-p', '*****']
14:04:40 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-monitoring whoami --show-token
14:04:40 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-monitoring get Route prometheus-k8s -n openshift-monitoring -o yaml
14:04:41 - MainThread - /opt/ocs-ci/ocs_ci/utility/prometheus.py - INFO - Generated CA certification file: /tmp/tmpj66g6_ea
14:04:41 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get storagecluster -n openshift-storage -o yaml
-------------------------------- live log call ---------------------------------
14:04:41 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod -n openshift-storage --selector=noobaa-core=noobaa -o yaml
14:04:41 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod noobaa-core-0 -n openshift-storage -o yaml
14:04:41 - MainThread - ocs_ci.ocs.node - INFO - Draining nodes worker-001.m4208001ocs.lnxne.boe
14:04:41 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc adm drain worker-001.m4208001ocs.lnxne.boe --force=true --ignore-daemonsets --delete-emptydir-data
14:06:03 - MainThread - ocs_ci.utility.utils - WARNING - Command stderr: WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-tjhrg, openshift-dns/dns-default-bjhb4, openshift-dns/node-resolver-8w7h9, openshift-image-registry/node-ca-l2bqw, openshift-ingress-canary/ingress-canary-rtc8g, openshift-local-storage/diskmaker-discovery-r5cks, openshift-local-storage/diskmaker-manager-6q7q5, openshift-machine-config-operator/machine-config-daemon-l4j69, openshift-monitoring/node-exporter-jlxjm, openshift-multus/multus-additional-cni-plugins-jvrrf, openshift-multus/multus-wsm97, openshift-multus/network-metrics-daemon-tq9ps, openshift-network-diagnostics/network-check-target-vjbc5, openshift-sdn/sdn-vlc27, openshift-storage/csi-cephfsplugin-pjhmr, openshift-storage/csi-rbdplugin-7p49g
I1020 14:04:44.397121    4018 request.go:665] Waited for 1.002263087s due to client-side throttling, not priority and fairness, request: POST:https://api.m4208001ocs.lnxne.boe:6443/api/v1/namespaces/openshift-monitoring/pods/prometheus-adapter-8648594f68-hfth9/eviction
I1020 14:04:54.838507    4018 request.go:665] Waited for 1.054854154s due to client-side throttling, not priority and fairness, request: GET:https://api.m4208001ocs.lnxne.boe:6443/api/v1/namespaces/openshift-storage/pods/noobaa-core-0
I1020 14:05:05.838443    4018 request.go:665] Waited for 1.056408596s due to client-side throttling, not priority and fairness, request: GET:https://api.m4208001ocs.lnxne.boe:6443/api/v1/namespaces/openshift-storage/pods/noobaa-core-0
14:06:03 - MainThread - ocs_ci.ocs.node - INFO - Waiting for nodes ['worker-001.m4208001ocs.lnxne.boe'] to reach status Ready,SchedulingDisabled
14:06:03 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc get node -o yaml
14:06:04 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc get Node worker-001.m4208001ocs.lnxne.boe
14:06:04 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc get Node -o yaml
14:06:04 - MainThread - ocs_ci.ocs.node - INFO - Node worker-001.m4208001ocs.lnxne.boe reached status Ready,SchedulingDisabled
14:06:04 - MainThread - ocs_ci.ocs.node - INFO - The following nodes reached status Ready,SchedulingDisabled: ['worker-001.m4208001ocs.lnxne.boe']
14:06:04 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod -n openshift-storage --selector=noobaa-core=noobaa -o yaml
14:06:04 - MainThread - ocs_ci.ocs.ocp - INFO - Waiting for a resource(s) of kind Pod identified by name 'noobaa-core-0' using selector None at column name STATUS to reach desired condition Running
14:06:04 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod noobaa-core-0 -n openshift-storage -o yaml
14:06:05 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod noobaa-core-0 -n openshift-storage
14:06:05 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod -n openshift-storage -o yaml
14:06:07 - MainThread - ocs_ci.ocs.ocp - INFO - status of noobaa-core-0 at STATUS reached condition!
14:06:07 - MainThread - ocs_ci.helpers.helpers - INFO - Pod noobaa-core-0 reached state Running
14:06:07 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get noobaa noobaa -n openshift-storage -o yaml
14:06:08 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get backingstore noobaa-default-backing-store -n openshift-storage -o yaml
14:06:08 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get bucketclass noobaa-default-bucket-class -n openshift-storage -o yaml
14:06:08 - MainThread - ocs_ci.utility.retry - WARNING - Cluster health is NOT OK, Retrying in 5 seconds...
```
Note that the occurrence of the message 'WARNING - Cluster health is NOT OK, Retrying in 5 seconds' indicates that the noobaa component is not running correctly, due to the pod 'noobaa-db-pg-0' not coming up correctly. Note also the intermittent '... client-side throttling ...' log lines that may indicate a negative impact on overall cluster health, or noobaa health in particular.

I was able to get a different test cluster running on z/VM on s390x and I ran the tier4a test on that one. This test run did not exhibit the issue. So I assume that some sort of hardware issue exists on the original cluster that I used for testing initially. I would need to dig deeper into it together with the hardware admins to get to the bottom of the issue, but for now I am closing this ticket.

I'd like to reopen the bug - we keep on seeing it in the OCS-CI regression runs. It was initially discussed in the email thread of run 1635778034, and then Jacky Albo and I messaged about it privately on Slack. Jacky found the same workaround, and since then we did not run into the issue again, until now.

This issue has been pestering me for quite some time - on and off, which is why it's so annoying. I've seen test runs where it didn't occur and other test runs where it occurred repeatedly. Evicting the pod off of the node it runs on using the `force` option seems to be the key trigger. I can only guess that there is some sort of race condition going on here where the PGSQL process doesn't get enough time (sometimes, not all the time) to free all resources, especially open file handles. But that's only an educated guess.

This seems to be an issue in the latest ODF 4.10 as well, which is deployed on OCP 4.10. There is a noobaa defect opened for this, which is being tracked as issue 6953 in the noobaa-core repo.
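For anyone trying to reproduce the reschedule outside of ocs-ci, a minimal sketch of the drain sequence, mirroring the `oc adm drain` invocation from the test log above (the node name is specific to this cluster and purely illustrative):
```
# Drain the worker node hosting the noobaa pods, as the test case does
oc adm drain worker-001.m4208001ocs.lnxne.boe --force=true --ignore-daemonsets --delete-emptydir-data

# Make the node schedulable again, then watch whether noobaa-db-pg-0 comes back
# up cleanly or enters CrashLoopBackOff after being rescheduled
oc adm uncordon worker-001.m4208001ocs.lnxne.boe
oc get pod noobaa-db-pg-0 -n openshift-storage -w
```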