Bug 2237742 - Prometheus stops responding. Error 504
Summary: Prometheus stops responding. Error 504
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: arun kumar mohan
QA Contact: Harish NV Rao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-09-06 16:33 UTC by Daniel Osypenko
Modified: 2024-08-30 10:27 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:


Attachments

Description Daniel Osypenko 2023-09-06 16:33:22 UTC
Created attachment 1987371 [details]
no-metrics-1

Description of problem (please be as detailed as possible and provide log
snippets):

During the tests test_cephfs_capacity_workload_alerts and test_rbd_capacity_workload_alerts, the cluster is filled with data up to the Ceph full ratio to trigger the CephClusterNearFull and CephClusterCriticallyFull alerts.
During the test, Prometheus is queried every 3 seconds to retrieve the existing alerts.
The test spends on average 25 minutes filling the storage on a 100 GB cluster.

3 out of 5 tests fail because Prometheus crashes and stops responding. Prometheus recovers and starts responding again only after 4-5 minutes. During this time the UI does not show any metric updates, alerts, or cluster utilization graphs (see attachments), and a curl request to https://prometheus-k8s-openshift-monitoring.apps.dosypenk-39.qe.rh-ocs.com/api/v1/alerts receives Error 504.
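
For reference, a minimal sketch (not the actual ocs-ci fixture) of the 3-second alert polling described above; the route URL and the bearer token are placeholders:

import time
import requests

PROM_ROUTE = "https://prometheus-k8s-openshift-monitoring.apps.example.com"  # placeholder route
TOKEN = "sha256~..."  # placeholder bearer token for the monitoring route

def poll_alerts(interval=3, timeout=30):
    """Poll /api/v1/alerts every few seconds and log any non-200 response."""
    headers = {"Authorization": f"Bearer {TOKEN}"}
    while True:
        try:
            resp = requests.get(
                f"{PROM_ROUTE}/api/v1/alerts",
                params={"silenced": "False", "inhibited": "False"},
                headers=headers,
                verify=False,
                timeout=timeout,
            )
            if resp.status_code != 200:
                print(f"Request failed: {resp.status_code}\n{resp.text}")
            else:
                print([a["labels"]["alertname"] for a in resp.json()["data"]["alerts"]])
        except requests.RequestException as exc:
            print(f"Request error: {exc}")
        time.sleep(interval)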



17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR  - Request https://prometheus-k8s-openshift-monitoring.apps.dosypenk-39.qe.rh-ocs.com/api/v1/alerts?silenced=False&inhibited=False failed
17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR  - 504
17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR  - <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>


Version of all relevant components (if applicable):

OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-09-02-132842
Kubernetes Version: v1.27.4+2c83a9f

OCS version:
ocs-operator.v4.14.0-125.stable              OpenShift Container Storage   4.14.0-125.stable              Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-09-02-132842   True        False         3d23h   Cluster version is 4.14.0-0.nightly-2023-09-02-132842

Rook version:
rook: v4.14.0-0.194fd1e22bcb701c69a6d80f1b051f210ff89ee0
go: go1.20.5

Ceph version:
ceph version 17.2.6-115.el9cp (968b780fae1bced13d322da769a9d7223d701a01) quincy (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Create a CephFS or RBD PVC with a size of 95% of the cluster capacity and attach it to a Deployment pod
2. Run a load command via dd (or another tool) on the pod to fill up the storage (see the sketch after these steps)
3. Periodically request the alerts via https://$ROUTE:443/api/v1/alerts
4. Open a browser and log in to the management console as the kubeadmin user to observe the ODF storage Overview page
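
A rough illustration of step 2 (a sketch, not the actual test code), assuming a Deployment pod named "fio-pod" in namespace "test-ns" with the PVC mounted at /mnt/data; all names are hypothetical:

import subprocess

def fill_storage(pod="fio-pod", namespace="test-ns", mount="/mnt/data", size_gb=90):
    """Write a large file inside the pod so the cluster approaches the Ceph full ratio."""
    subprocess.run(
        [
            "oc", "-n", namespace, "exec", pod, "--",
            "dd", "if=/dev/zero", f"of={mount}/fill.bin",
            "bs=1M", f"count={size_gb * 1024}",
        ],
        check=True,
    )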


Actual results:

Prometheus returns a 504 response to any request. The management console stops showing alerts / metrics / utilization on the Overview pages.

Expected results:

Prometheus returns a response to any legitimate request, as described in the API documentation. The management console shows alerts on the Alerts page, metrics on the Metrics page, and utilization on the Overview pages.

Additional info:
The issue happens on ODF 4.12, 4.13, and 4.14 deployments.

must-gather logs: https://drive.google.com/file/d/1i5Sf2T5XDKxdD4zKnf9AG78r-5x_brt2/view?usp=sharing - issue happened at 17:56:03

test log: http://pastebin.test.redhat.com/1108931

metrics: 
ceph_cluster_total_used_bytes.json
{"status": "success", "data": {"resultType": "matrix", "result": [{"metric": {"__name__": "ceph_cluster_total_used_bytes", "container": "mgr", "endpoint": "http-metrics", "instance": "10.131.0.24:9283", "job": "rook-ceph-mgr", "managedBy": "ocs-storagecluster", "namespace": "openshift-storage", "pod": "rook-ceph-mgr-a-68fb789468-6kqnf", "service": "rook-ceph-mgr"}, "values": [[1694012693.494, "75253993472"], [1694012694.494, "75253993472"], [1694012695.494, "75253993472"], [1694012696.494, "75253993472"]]}]}}

cluster/memory_usage_bytes/sum.json 
{"status": "success", "data": {"resultType": "matrix", "result": [{"metric": {"__name__": "cluster:memory_usage_bytes:sum"}, "values": [[1694012693.494, "53300436992"], [1694012694.494, "53300436992"], [1694012695.494, "53300436992"], [1694012696.494, "53300436992"]]}]}}

Comment 4 Juan Miguel Olmo 2023-09-08 10:47:39 UTC
@dosypenk :

Unfortunately, I cannot find any error, and the dates I can see in the must-gather information (basically the log files for the Prometheus/Alertmanager pods) do not match the dates I can see in the test error log.
Can you confirm that you collected the logs at the moment the error was happening? If not, can you execute the test again and launch must-gather at the moment the test fails?

Test log (I do not know the date of the execution, but the registered time is completely outside the time range covered by the Prometheus and Alertmanager logs):
...
17:55:29 - MainThread - ocs_ci.ocs.ocp - INFO  - fetching logs from Pod/fio-8w4bm
17:55:29 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n namespace-test-7c961e5d531d49ee92f0994cc logs Pod/fio-8w4bm
17:55:30 - MainThread - ocs_ci.ocs.fiojob - INFO  - fio disk_util stats: None
17:55:30 - MainThread - ocs_ci.utility.workloadfixture - INFO  - Starting 480.0s sleep for the purposes of measurement.
17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR  - Request https://prometheus-k8s-openshift-monitoring.apps.dosypenk-39.qe.rh-ocs.com/api/v1/alerts?silenced=False&inhibited=False failed
17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR  - 504
17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR  - <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
....

==================
prometheus-k8s-0 (prometheus-k8s-1 is similar)
The log covers the time range between 2023-09-06T13:42:12 and 2023-09-06T15:00:33; the error in the test falls outside this logged range.

previous log:
2023-09-06T13:42:12.001001009Z ts=2023-09-06T13:42:12.000Z caller=main.go:585 level=info msg="Starting Prometheus Server" mode=server version="(version=2.46.0, branch=rhaos-4.14-rhel-8, revision=6db7b73de85c077a10d1740a32e04f1aaa4220a2)"
2023-09-06T13:42:12.001001009Z ts=2023-09-06T13:42:12.000Z caller=main.go:590 level=info build_context="(go=go1.20.5 X:strictfipsruntime, platform=linux/amd64, user=root@d83135830a30, date=20230726-15:37:03, tags=netgo,builtinassets,stringlabels,strictfipsruntime)"
.....
2023-09-06T14:58:27.112304472Z ts=2023-09-06T14:58:27.112Z caller=notifier.go:603 level=info component=notifier msg="Stopping notification manager..."
2023-09-06T14:58:27.112405373Z ts=2023-09-06T14:58:27.112Z caller=main.go:1157 level=info msg="Notifier manager stopped"
2023-09-06T14:58:27.113329681Z ts=2023-09-06T14:58:27.113Z caller=main.go:1169 level=info msg="See you next time!"

current log:
2023-09-06T14:58:27.479547353Z ts=2023-09-06T14:58:27.479Z caller=main.go:585 level=info msg="Starting Prometheus Server" mode=server version="(version=2.46.0, branch=rhaos-4.14-rhel-8, revision=6db7b73de85c077a10d1740a32e04f1aaa4220a2)"
2023-09-06T14:58:27.479547353Z ts=2023-09-06T14:58:27.479Z caller=main.go:590 level=info build_context="(go=go1.20.5 X:strictfipsruntime, platform=linux/amd64, user=root@d83135830a30, date=20230726-15:37:03, tags=netgo,builtinassets,stringlabels,strictfipsruntime)"
....
2023-09-06T14:58:41.817471180Z ts=2023-09-06T14:58:41.817Z caller=main.go:1011 level=info msg="Server is ready to receive web requests."
2023-09-06T14:58:41.817471180Z ts=2023-09-06T14:58:41.817Z caller=manager.go:1009 level=info component="rule manager" msg="Starting rule manager..."
2023-09-06T15:00:27.687311691Z ts=2023-09-06T15:00:27.687Z caller=compact.go:523 level=info component=tsdb msg="write block" mint=1694001600000 maxt=1694008800000 ulid=01H9NG3RJ70MVC95A6KXFKQEW1 duration=27.487906607s
2023-09-06T15:00:28.347761310Z ts=2023-09-06T15:00:28.347Z caller=head.go:1293 level=info component=tsdb msg="Head GC completed" caller=truncateMemory duration=656.195866ms
2023-09-06T15:00:28.389830034Z ts=2023-09-06T15:00:28.389Z caller=checkpoint.go:100 level=info component=tsdb msg="Creating checkpoint" from_segment=166 to_segment=170 mint=1694008800000
2023-09-06T15:00:33.987206774Z ts=2023-09-06T15:00:33.987Z caller=head.go:1261 level=info component=tsdb msg="WAL checkpoint complete" first=166 last=170 duration=5.597474311s

No requests

Comment 6 Daniel Osypenko 2023-09-10 15:28:35 UTC
"""Can you confirm that you collected the logs in the moment where the error was happening?"""  - the must-gather logs where captured in the next 20 minutes after the test failed. It may be so the timezone shifted 

must-gather time is a UTC timestamp and the tests time is in Israeli Standard Time UTC +2
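
A quick sanity check of that offset (a sketch assuming the fixed UTC+2 offset stated above):

from datetime import datetime, timedelta, timezone

# Test-log time of the first failure, interpreted with the UTC+2 offset noted above.
local = datetime(2023, 9, 6, 17, 56, 3, tzinfo=timezone(timedelta(hours=2)))
print(local.astimezone(timezone.utc))  # 2023-09-06 15:56:03+00:00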

Running another test_cephfs_capacity_workload_alerts job on ODF 4.14 on IBMCloud, 3M, 3W

test execution - http://pastebin.test.redhat.com/1109122
---
First error appeared at 17:21:22
First successful request after multiple failed requests happened at 17:25:05
---
During those 3.5 minutes Prometheus was not responding to requests to https://prometheus-k8s-openshift-monitoring.apps.dosypenk-109-ibm.ibmcloud2.qe.rh-ocs.com/api/v1/alerts
- "The application is currently not serving requests at this endpoint. It may not have been started or is still starting"
Although Prometheus recovered and the test passed this time, in most cases it keeps failing to respond for 4-5 minutes, resulting in a test failure.
During this malfunction the management console looks like this: https://bugzilla.redhat.com/attachment.cgi?id=1987977

---
Must-gather logs were taken immediately after the test execution (it took 20 minutes to collect all the logs): 17:32:20 - 17:52:35
https://drive.google.com/file/d/1N9VU39sEe0xu8U5S3CA1KvEpJarOxcCM/view?usp=sharing - must-gather-ocp
https://drive.google.com/file/d/1KbliB5Hwu12FpDnbMhiRxLCJXoSQirRJ/view?usp=sharing - must-gather-ocs

Additional info: 
I checked whether the malfunction shows up in the mgr pod logs, and it seems there is no problem there:
debug 2023-09-10T14:21:22.288+0000 7f7aa6066640 0 log_channel(cluster) log [DBG] : pgmap v3893: 113 pgs: 12 active+clean+laggy, 101 active+clean; 72 GiB data, 217 GiB used, 83 GiB / 300 GiB avail; 85 B/s rd, 68 MiB/s wr, 18 op/s
::ffff:10.131.0.29 - - [10/Sep/2023:14:21:23] "GET /metrics HTTP/1.1" 200 36917 "" "Prometheus/2.46.0"

Comment 10 Daniel Osypenko 2023-09-21 15:21:57 UTC
Issue confirmed on 3 recent test runs on an external mode deployment, ODF 4.11

Comment 12 Daniel Osypenko 2024-01-21 12:48:08 UTC
Issue confirmed on ODF 4.13.7; it reproduces consistently across all ODF versions.

Behaviour:
 
While the storage was being filled with Flexible I/O (fio, https://linux.die.net/man/1/fio), the issue started to appear when the cluster was filled from the initial 27% to only 34%. After this first Prometheus response failure ("<html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. </body></html>") the failures start to happen more frequently and take more time to recover from.
Along with this behavior, System Degraded notifications were shown: 2 MDSs report slow metadata IOs, 1 MDSs report slow requests

On the screen recording https://drive.google.com/file/d/1qemRIo243eJHKRnIMpvKIA1JQ5GgSjvs/view?usp=sharing the IOPS, Latency, and Throughput graphs are visible.
Must-gather OCP and OCS logs: https://drive.google.com/file/d/18AUa_B2e5mY6ZFq3A2XE4Rn2w8COKEOL/view?usp=sharing

