Bug 2010702

Summary: [Tracker] Pod noobaa-db-pg-0 crash loops (CrashLoopBackOff) upon reschedule during ocs-ci test run
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Dirk Haubenreisser <dhaubenr>
Component: Multi-Cloud Object Gateway
Assignee: Nimrod Becker <nbecker>
Status: NEW
QA Contact: Raz Tamir <ratamir>
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.9
CC: belimele, muagarwa, odf-bz-bot
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: s390x
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-25 07:57:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2013698
Bug Blocks:
Attachments (all flags: none):
- OCP must_gather report for cluster exhibiting issue
- database log after crash
- database files
- nbcore database dump
- database logs

Description Dirk Haubenreisser 2021-10-05 11:37:34 UTC
Description of problem (please be as detailed as possible and provide log snippets):

I've been running the ocs-ci *tier4a* tests on the s390x platform (see version details below). Many of the test cases that are part of the tier4a suite fail due to an issue with noobaa (more specifically, the noobaa-db-pg-0 pod). It seems that when this pod is terminated abruptly, the PostgreSQL database process in the pod is unable to shut down in an orderly fashion, most likely leaving unwanted artifacts behind on the persistent volume associated with the pod via the claim 'db-noobaa-db-pg-0'. Upon reschedule of the pod, the main PostgreSQL process refuses to start with this error message:

```
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2021-10-05 11:09:12.175 UTC [28] LOG:  starting PostgreSQL 12.7 on s390x-ibm-linux-gnu, compiled by gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), 64-bit
2021-10-05 11:09:12.176 UTC [28] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-10-05 11:09:12.180 UTC [28] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-10-05 11:09:12.226 UTC [28] LOG:  redirecting log output to logging collector process
2021-10-05 11:09:12.226 UTC [28] HINT:  Future log output will appear in directory "log".
 done
server started
/var/run/postgresql:5432 - accepting connections
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
ERROR:  tuple already updated by self
```

This effectively puts the pod into an infinite crash loop. The only way to recover from this situation is to scale the statefulset 'noobaa-db-pg' down to zero replicas and then delete the persistent volume claim 'db-noobaa-db-pg-0'. Naturally this means total loss of data for the PostgreSQL database (and, I would assume, for noobaa in general).
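
For reference, the destructive recovery described above boils down to roughly the following commands (a sketch only, using the namespace and resource names from this report; that scaling back up recreates a fresh PVC from the statefulset's volume claim templates, and that the NooBaa operator does not interfere with the manual scaling, are assumptions):

```
# Sketch of the destructive recovery described above (this wipes all NooBaa data).
oc scale statefulset noobaa-db-pg --replicas=0 -n openshift-storage
oc delete pvc db-noobaa-db-pg-0 -n openshift-storage

# Scaling back up should let the statefulset recreate a fresh PVC (assumption).
oc scale statefulset noobaa-db-pg --replicas=1 -n openshift-storage
```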


Version of all relevant components (if applicable):

---------- oc version ----------
Client Version: 4.9.0-0.nightly-s390x-2021-09-09-135631
Server Version: 4.9.0-0.nightly-s390x-2021-09-09-135631
Kubernetes Version: v1.22.0-rc.0+75ee307
----------

---------- oc -n openshift-storage get csv ----------
NAME                            DISPLAY                       VERSION        REPLACES   PHASE
noobaa-operator.v4.9.0-164.ci   NooBaa Operator               4.9.0-164.ci              Succeeded
ocs-operator.v4.9.0-164.ci      OpenShift Container Storage   4.9.0-164.ci              Succeeded
odf-operator.v4.9.0-164.ci      OpenShift Data Foundation     4.9.0-164.ci              Succeeded
----------


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

GA readiness for ODF / OCS 4.9 is at risk as the reported issue repeatedly blocks the successful execution of the tier4a ocs-ci tests on platform s390x.


Is there any workaround available to the best of your knowledge?

No.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

Hard to tell (as the issue was discovered by running the ocs-ci tier4a tests); maybe 3 (?)


Is this issue reproducible?

Yes, running the ocs-ci tier4 test suite against a vanilla OCP and ODF / OCS deployment reliably produces the same issue over and over again. I have tested it multiple times.


Can this issue reproduce from the UI?

Not that I'm aware of.


If this is a regression, please provide more details to justify this:

n/a


Steps to Reproduce:

1. Install OCP 4.9 on an s390x z/VM-based cluster (UPI)
2. Deploy ODF / OCS 4.9 via ocs-ci (using the 'deployment' test suite)
3. Run the ocs-ci tier4 test suite against said cluster (a sketch of the invocation follows below)
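
For step 3, a hypothetical ocs-ci invocation could look like the following (a sketch only; the `run-ci` entry point and flags are assumed from the ocs-ci framework, and the cluster name, cluster path, and config file are environment-specific placeholders):

```
# Hypothetical invocation of the tier4a suite via ocs-ci (paths and config are placeholders).
run-ci -m tier4a \
  --cluster-name <cluster-name> \
  --cluster-path <path-to-cluster-dir> \
  --ocsci-conf <platform-specific-config>.yaml
```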


Actual results:

Pod 'noobaa-db-pg-0' enters crash loop after being rescheduled (see detailed error description above).


Expected results:

ocs-ci tier4 tests are completed successfully.


Additional info:

Comment 2 Mudit Agarwal 2021-10-06 08:29:34 UTC
Increasing the severity as it is blocking the IBM team.

Comment 4 Liran Mauda 2021-10-06 10:53:27 UTC
Hi Dirk,
Can you please provide the must-gather? 

Can you also provide the DB logs (including logs from the pod before the first crash)?

Is this happening on platforms other than s390x?

Comment 5 Dirk Haubenreisser 2021-10-06 14:01:23 UTC
Created attachment 1829899 [details]
OCP must_gather report for cluster exhibiting issue

Created must_gather report for cluster exhibiting the issue:

- ran must_gather on cluster bastion host with this command:

```
oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.9 --dest-dir=./ocs_must_gather
```

- data compiled by must_gather:

```
ClusterID: 755a520c-29ad-4f64-bfc6-020f9b796108
ClusterVersion: Stable at "4.9.0-0.nightly-s390x-2021-09-09-135631"
ClusterOperators:
	All healthy and stable
```

Comment 6 Dirk Haubenreisser 2021-10-07 08:01:54 UTC
> Can you please provide the must-gather?

Please see the report attached to this BZ. The report was generated with the noobaa-db-pg-0 pod crashlooping and no other workload running on the test cluster.

> Can you also provide the DB logs (including logs from the pod before the first crash)?

I will do so; I just need some guidance on how to fetch the actual PostgreSQL logs. Are the noobaa-db-pg-0 pod logs sufficient, or do I need to exec into the pod and fetch the log files from the filesystem?
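
(For reference, a sketch of both options, assuming the container's stdout holds the startup messages and the PostgreSQL server logs live under the '/var/lib/pgsql/data/userdata' directory referenced later in this report:)

```
# Option 1: pod logs, including the previous (crashed) container instance.
oc logs noobaa-db-pg-0 -n openshift-storage
oc logs noobaa-db-pg-0 -n openshift-storage --previous

# Option 2: the PostgreSQL server log files inside the data directory
# (path assumed from the 'userdata' directory and the "log" directory hinted at in the crash output).
oc exec noobaa-db-pg-0 -n openshift-storage -- ls /var/lib/pgsql/data/userdata/log
oc cp openshift-storage/noobaa-db-pg-0:/var/lib/pgsql/data/userdata/log ./noobaa-db-pg-logs
```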

> Is this happening on platforms other than s390x?

I don't know; I only have access to s390x. As no one else seems to have reported this issue before, I can only assume that it happens on s390x only.

Comment 7 Dirk Haubenreisser 2021-10-08 08:49:53 UTC
I've been in touch with Liran via Slack and we both agreed to re-run the ocs-ci test cases that led to the issue and then try to fetch a PostgreSQL database dump via `pg_dump nbcore` from the crashlooping pod `noobaa-db-pg-0`, which would involve some trickery editing the noobaa PostgreSQL statefulset.

I re-ran the ocs-ci tests against the very same test cluster, using the same OCS and LSO driver builds and the same ocs-ci release as in the previous runs, only to find that this time the tests ran just fine and the issue with the PostgreSQL pod did not occur again.

While investigating why the tests were successful this time around, I found that a seemingly unrelated ocs-ci framework test parameter had indeed been changed for this latest run. The parameter determines where the log information generated by ocs-ci is written: a local filesystem (on the same host that is running ocs-ci) or a remote filesystem. The previous test runs stored the log files on the remote filesystem, whereas the last one stored them on the local filesystem.

Long story short: with that said, I would assume that PostgreSQL is working as expected, but I will look further into this.

Comment 8 Dirk Haubenreisser 2021-10-11 07:40:57 UTC
Created attachment 1831745 [details]
database log after crash

This is the database log I fetched from the db container after the crash happened.

Comment 9 Dirk Haubenreisser 2021-10-11 07:43:06 UTC
My suspicions about the ocs-ci framework configuration turned out to be unfounded: another test run failed again with the same error. I've attached the database log file to this ticket. Additionally, I sent Liran the whole content of the database container's '/var/lib/pgsql/data/userdata' directory for further analysis.

Comment 10 Dirk Haubenreisser 2021-10-11 07:47:06 UTC
Created attachment 1831747 [details]
database files

This is the contents of the database 'userdata' directory

Comment 11 Dirk Haubenreisser 2021-10-11 09:16:18 UTC
With the help of support article https://access.redhat.com/solutions/5668581 we were able to get the noobaa-db-pg-0 pod back into a working state. After that, I created a database dump via `pg_dump nbcore`, as suggested by Liran.
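
(For reference, with the pod back in a working state, such a dump can be taken directly through the pod, roughly as in the following sketch; the exact invocation used is not recorded here.)

```
# Sketch: logical dump of the nbcore database taken through the running db pod.
oc exec noobaa-db-pg-0 -n openshift-storage -- pg_dump nbcore > nbcore.sql
```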

Comment 12 Dirk Haubenreisser 2021-10-11 09:17:01 UTC
Created attachment 1831783 [details]
nbcore database dump

This is the database dump for PostgreSQL database nbcore

Comment 13 Dirk Haubenreisser 2021-10-11 11:43:05 UTC
Created attachment 1831850 [details]
database logs

These are the database logs

Comment 14 Liran Mauda 2021-10-14 06:45:09 UTC
This is a Postgres bug that has a workaround documented here: https://access.redhat.com/solutions/5668581

Comment 15 Dirk Haubenreisser 2021-10-20 08:39:37 UTC
Another tier4a test run late last night, using the most recent ODF and LSO builds, did *not* have this issue at all. I'm going to do more test runs this week, so fingers crossed this keeps working and the issue is a thing of the past.

Comment 16 Dirk Haubenreisser 2021-10-20 14:09:39 UTC
The issue manifested again on the latest tier4a test run:

```
[haubenr@m4208001 ~]$ oc logs noobaa-db-pg-0 -n openshift-storage
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2021-10-20 14:07:26.117 UTC [27] LOG:  starting PostgreSQL 12.7 on s390x-ibm-linux-gnu, compiled by gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), 64-bit
2021-10-20 14:07:26.119 UTC [27] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-10-20 14:07:26.139 UTC [27] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-10-20 14:07:26.188 UTC [27] LOG:  redirecting log output to logging collector process
2021-10-20 14:07:26.188 UTC [27] HINT:  Future log output will appear in directory "log".
 done
server started
/var/run/postgresql:5432 - accepting connections
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
ERROR:  tuple already updated by self
```

Installed latest available release of ODF via ocs-ci prior to the test run:

```
[haubenr@m4208001 ~]$ oc get csv -n openshift-storage
NAME                     DISPLAY                       VERSION   REPLACES   PHASE
noobaa-operator.v4.9.0   NooBaa Operator               4.9.0                Succeeded
ocs-operator.v4.9.0      OpenShift Container Storage   4.9.0                Succeeded
odf-operator.v4.9.0      OpenShift Data Foundation     4.9.0                Succeeded
```

Comment 17 Dirk Haubenreisser 2021-10-21 08:45:35 UTC
I dug into the test suite some more and tried to narrow down the circumstances under which the issue occurs. I'm pretty confident that the issue is provoked by the test case 'test_drain_mcg_pod_node' (https://github.com/red-hat-storage/ocs-ci/blob/40a14ab5cfb447a14e81277bae2b94d86c11b8a1/tests/manage/mcg/test_mcg_resources_disruptions.py#L130), which evicts all pods from the cluster worker node that hosts the noobaa pods (noobaa-core as well as noobaa-db-pg).
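
(For reference, the disruption that the test performs can be reproduced manually with the same drain command that appears in the log excerpt below; the uncordon and watch steps afterwards are my addition as a sketch.)

```
# Manual equivalent of the disruption performed by test_drain_mcg_pod_node
# (node name taken from the log excerpt below).
oc adm drain worker-001.m4208001ocs.lnxne.boe --force=true --ignore-daemonsets --delete-emptydir-data

# Re-enable scheduling and watch whether noobaa-db-pg-0 comes back up or starts crash looping.
oc adm uncordon worker-001.m4208001ocs.lnxne.boe
oc get pod noobaa-db-pg-0 -n openshift-storage -w
```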

Prior to this particular test case, noobaa was running fine (including the noobaa-db-pg pod).

Here's the relevant part of the test logs:

tests/manage/mcg/test_mcg_resources_disruptions.py::TestMCGResourcesDisruptions::test_drain_mcg_pod_node[noobaa_core] 
-------------------------------- live log setup --------------------------------
14:04:38 - MainThread - tests.conftest - INFO - Checking for Ceph Health OK 
14:04:38 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc wait --for condition=ready pod -l app=rook-ceph-tools -n openshift-storage --timeout=120s
14:04:38 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get pod -l 'app=rook-ceph-tools' -o jsonpath='{.items[0].metadata.name}'
14:04:38 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-6bb55cd9f7-jwf4g -- ceph health
14:04:40 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.
14:04:40 - MainThread - tests.conftest - INFO - Ceph health check passed at setup
14:04:40 - MainThread - ocs_ci.utility.utils - INFO - Executing command: ['oc', 'login', '-u', 'kubeadmin', '-p', '*****']
14:04:40 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-monitoring whoami --show-token
14:04:40 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-monitoring get Route prometheus-k8s -n openshift-monitoring -o yaml
14:04:41 - MainThread - /opt/ocs-ci/ocs_ci/utility/prometheus.py - INFO - Generated CA certification file: /tmp/tmpj66g6_ea
14:04:41 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get storagecluster  -n openshift-storage -o yaml
-------------------------------- live log call ---------------------------------
14:04:41 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod  -n openshift-storage --selector=noobaa-core=noobaa -o yaml
14:04:41 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod noobaa-core-0 -n openshift-storage -o yaml
14:04:41 - MainThread - ocs_ci.ocs.node - INFO - Draining nodes worker-001.m4208001ocs.lnxne.boe
14:04:41 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc adm drain worker-001.m4208001ocs.lnxne.boe --force=true --ignore-daemonsets --delete-emptydir-data
14:06:03 - MainThread - ocs_ci.utility.utils - WARNING - Command stderr: WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-tjhrg, openshift-dns/dns-default-bjhb4, openshift-dns/node-resolver-8w7h9, openshift-image-registry/node-ca-l2bqw, openshift-ingress-canary/ingress-canary-rtc8g, openshift-local-storage/diskmaker-discovery-r5cks, openshift-local-storage/diskmaker-manager-6q7q5, openshift-machine-config-operator/machine-config-daemon-l4j69, openshift-monitoring/node-exporter-jlxjm, openshift-multus/multus-additional-cni-plugins-jvrrf, openshift-multus/multus-wsm97, openshift-multus/network-metrics-daemon-tq9ps, openshift-network-diagnostics/network-check-target-vjbc5, openshift-sdn/sdn-vlc27, openshift-storage/csi-cephfsplugin-pjhmr, openshift-storage/csi-rbdplugin-7p49g
I1020 14:04:44.397121    4018 request.go:665] Waited for 1.002263087s due to client-side throttling, not priority and fairness, request: POST:https://api.m4208001ocs.lnxne.boe:6443/api/v1/namespaces/openshift-monitoring/pods/prometheus-adapter-8648594f68-hfth9/eviction
I1020 14:04:54.838507    4018 request.go:665] Waited for 1.054854154s due to client-side throttling, not priority and fairness, request: GET:https://api.m4208001ocs.lnxne.boe:6443/api/v1/namespaces/openshift-storage/pods/noobaa-core-0
I1020 14:05:05.838443    4018 request.go:665] Waited for 1.056408596s due to client-side throttling, not priority and fairness, request: GET:https://api.m4208001ocs.lnxne.boe:6443/api/v1/namespaces/openshift-storage/pods/noobaa-core-0

14:06:03 - MainThread - ocs_ci.ocs.node - INFO - Waiting for nodes ['worker-001.m4208001ocs.lnxne.boe'] to reach status Ready,SchedulingDisabled
14:06:03 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc get node  -o yaml
14:06:04 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc get Node worker-001.m4208001ocs.lnxne.boe
14:06:04 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc get Node  -o yaml
14:06:04 - MainThread - ocs_ci.ocs.node - INFO - Node worker-001.m4208001ocs.lnxne.boe reached status Ready,SchedulingDisabled
14:06:04 - MainThread - ocs_ci.ocs.node - INFO - The following nodes reached status Ready,SchedulingDisabled: ['worker-001.m4208001ocs.lnxne.boe']
14:06:04 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod  -n openshift-storage --selector=noobaa-core=noobaa -o yaml
14:06:04 - MainThread - ocs_ci.ocs.ocp - INFO - Waiting for a resource(s) of kind Pod identified by name 'noobaa-core-0' using selector None at column name STATUS to reach desired condition Running
14:06:04 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod noobaa-core-0 -n openshift-storage -o yaml
14:06:05 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod noobaa-core-0 -n openshift-storage
14:06:05 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod  -n openshift-storage -o yaml
14:06:07 - MainThread - ocs_ci.ocs.ocp - INFO - status of noobaa-core-0 at STATUS reached condition!
14:06:07 - MainThread - ocs_ci.helpers.helpers - INFO - Pod noobaa-core-0 reached state Running
14:06:07 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get noobaa noobaa -n openshift-storage -o yaml
14:06:08 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get backingstore noobaa-default-backing-store -n openshift-storage -o yaml
14:06:08 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get bucketclass noobaa-default-bucket-class -n openshift-storage -o yaml
14:06:08 - MainThread - ocs_ci.utility.retry - WARNING - Cluster health is NOT OK, Retrying in 5 seconds...

Note that the message 'WARNING - Cluster health is NOT OK, Retrying in 5 seconds' indicates that the noobaa component is not healthy because the pod 'noobaa-db-pg-0' is not coming up correctly.

Note also the intermittent '... client-side throttling ...' log lines, which may indicate a negative impact on overall cluster health, or on noobaa health in particular.

Comment 18 Dirk Haubenreisser 2021-10-25 07:57:28 UTC
I was able to get a different test cluster running on z/VM on s390x and I ran the tier4a test on that one. This test run did not exhibit the issue. So I assume that some sort of hardware issue exists on the original cluster that I used for testing initially. I would need to dig deeper into it together with the hardware admins to get to the bottom of the issue but for now I am closing this ticket.

Comment 19 Ben Eli 2022-01-10 12:13:00 UTC
I'd like to reopen the bug - we keep seeing it in the OCS-CI regression runs.
It was initially discussed in the email thread of run 1635778034, and then Jacky Albo and I messaged about it privately on Slack.
Jacky found the same workaround, and since then we had not run into the issue again, until now.

Comment 20 Dirk Haubenreisser 2022-01-10 13:28:38 UTC
This issue has been pestering me for quite some time - on and off, which is why it's so annoying. I've seen test runs where it didn't occur and other test runs where it occurred repeatedly. Forcibly evicting the pod from the node it runs on (using the `force` option) seems to be the key trigger. I can only guess that there is some sort of race condition going on here where the PostgreSQL process doesn't always get enough time to free all its resources, especially open file handles. But that's only an educated guess.
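
(If that theory holds, a stale PostgreSQL lock file should still be present in the data directory after the reschedule. A sketch of a quick check, assuming the 'pg_ctl: another server might be running' warning points at a leftover postmaster.pid and the data directory path mentioned earlier in this report:)

```
# Sketch: check for a stale lock file left behind by the abrupt termination
# (data directory path assumed from the 'userdata' directory mentioned earlier).
oc exec noobaa-db-pg-0 -n openshift-storage -- cat /var/lib/pgsql/data/userdata/postmaster.pid
```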

Comment 21 Ravi K Komanduri 2022-09-29 09:49:59 UTC
This seems to be an issue in the latest ODF 4.10 as well, deployed on OCP 4.10. There is a noobaa defect opened for this, which is being tracked as issue 6953 in the noobaa-core repo.