Created attachment 1987371 [details]
no-metrics-1

Description of problem (please be as detailed as possible and provide log snippets):

During the tests test_cephfs_capacity_workload_alerts and test_rbd_capacity_workload_alerts, the cluster is filled with data up to the Ceph full ratio in order to trigger the CephClusterNearFull and CephClusterCriticallyFull alerts. During the test, Prometheus is queried every 3 seconds for the currently firing alerts. The test takes ~25 minutes on average to fill the storage on a 100 GB cluster.

3 out of 5 test runs fail because Prometheus crashes and stops responding. Prometheus recovers and starts responding only after 4-5 minutes. During this time the UI does not show any metrics updates, alerts, or cluster utilization graphs (see attachments), and a curl request to https://prometheus-k8s-openshift-monitoring.apps.dosypenk-39.qe.rh-ocs.com/api/v1/alerts returns error 504:

17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR - Request https://prometheus-k8s-openshift-monitoring.apps.dosypenk-39.qe.rh-ocs.com/api/v1/alerts?silenced=False&inhibited=False failed
17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR - 504
17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR - <html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. </body></html>

Version of all relevant components (if applicable):

OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-09-02-132842
Kubernetes Version: v1.27.4+2c83a9f

OCS version:
ocs-operator.v4.14.0-125.stable   OpenShift Container Storage   4.14.0-125.stable   Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-09-02-132842   True        False         3d23h   Cluster version is 4.14.0-0.nightly-2023-09-02-132842

Rook version:
rook: v4.14.0-0.194fd1e22bcb701c69a6d80f1b051f210ff89ee0
go: go1.20.5

Ceph version:
ceph version 17.2.6-115.el9cp (968b780fae1bced13d322da769a9d7223d701a01) quincy (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Create a CephFS or RBD PVC sized at 95% of the cluster capacity and attach it to a Deployment pod
2. Run a load command via dd (or another tool) on the pod to fill up the storage
3. Periodically request alerts via https://$ROUTE:443/api/v1/alerts (a minimal polling sketch is included at the end of this comment)
4. Open a browser, log in to the management console as the kubeadmin user, and observe the ODF storage Overview page

Actual results:
Prometheus returns response 504 on any request. The management console stops showing alerts / metrics / utilization on the Overview pages.

Expected results:
Prometheus responds to any legitimate request, as per the API documentation.
The management console shows alerts on the Alerts page, metrics on the Metrics page, and utilization on the Overview pages.

Additional info:
The issue happens on ODF 4.12, 4.13, and 4.14 deployments.

must-gather logs: https://drive.google.com/file/d/1i5Sf2T5XDKxdD4zKnf9AG78r-5x_brt2/view?usp=sharing - issue happened at 17:56:03
test log: http://pastebin.test.redhat.com/1108931

metrics:
ceph_cluster_total_used_bytes.json
{"status": "success", "data": {"resultType": "matrix", "result": [{"metric": {"__name__": "ceph_cluster_total_used_bytes", "container": "mgr", "endpoint": "http-metrics", "instance": "10.131.0.24:9283", "job": "rook-ceph-mgr", "managedBy": "ocs-storagecluster", "namespace": "openshift-storage", "pod": "rook-ceph-mgr-a-68fb789468-6kqnf", "service": "rook-ceph-mgr"}, "values": [[1694012693.494, "75253993472"], [1694012694.494, "75253993472"], [1694012695.494, "75253993472"], [1694012696.494, "75253993472"]]}]}}
cluster/memory_usage_bytes/sum.json
{"status": "success", "data": {"resultType": "matrix", "result": [{"metric": {"__name__": "cluster:memory_usage_bytes:sum"}, "values": [[1694012693.494, "53300436992"], [1694012694.494, "53300436992"], [1694012695.494, "53300436992"], [1694012696.494, "53300436992"]]}]}}
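For reference, a minimal sketch of the step-3 polling loop. The route URL and bearer token below are placeholders, and the helper is illustrative only, not the actual ocs_ci code; it merely mirrors the request the test log shows (including the silenced/inhibited query parameters):

import time
import requests

# Placeholders: the real Prometheus route and a token with cluster-monitoring-view access.
PROMETHEUS_ROUTE = "https://prometheus-k8s-openshift-monitoring.apps.example.com"
TOKEN = "<service-account-token>"

def poll_alerts(interval=3, duration=1500):
    """Query /api/v1/alerts every `interval` seconds for `duration` seconds and report failures."""
    headers = {"Authorization": f"Bearer {TOKEN}"}
    deadline = time.time() + duration
    while time.time() < deadline:
        try:
            resp = requests.get(
                f"{PROMETHEUS_ROUTE}/api/v1/alerts",
                params={"silenced": "False", "inhibited": "False"},  # same query string as in the test log
                headers=headers,
                verify=False,  # QE clusters typically use self-signed certificates
                timeout=30,
            )
            if resp.status_code != 200:
                print(f"Request failed: {resp.status_code} {resp.text[:120]}")
            else:
                alerts = resp.json()["data"]["alerts"]
                print(f"{len(alerts)} alerts returned")
        except requests.RequestException as exc:
            print(f"Request error: {exc}")
        time.sleep(interval)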
@dosypenk : Unfortunately, I cannot find any error, and the dates I can see in the must-gather information (basically the log files for the Prometheus/Alertmanager pods) do not match the dates I can see in the test error log. Can you confirm that you collected the logs at the moment the error was happening? If not, can you execute the test again and run must-gather at the moment the test fails?

Test log: (I do not know the date of the execution, but the time registered is completely outside the time range covered by the Prometheus and Alertmanager logs.)

...
17:55:29 - MainThread - ocs_ci.ocs.ocp - INFO - fetching logs from Pod/fio-8w4bm
17:55:29 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n namespace-test-7c961e5d531d49ee92f0994cc logs Pod/fio-8w4bm
17:55:30 - MainThread - ocs_ci.ocs.fiojob - INFO - fio disk_util stats: None
17:55:30 - MainThread - ocs_ci.utility.workloadfixture - INFO - Starting 480.0s sleep for the purposes of measurement.
17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR - Request https://prometheus-k8s-openshift-monitoring.apps.dosypenk-39.qe.rh-ocs.com/api/v1/alerts?silenced=False&inhibited=False failed
17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR - 504
17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR - <html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. </body></html>
....

==================
prometheus-k8s-0 (prometheus-k8s-1 is similar): time range between 2023-09-06T13:42:12 and 2023-09-06T15:00:33. The error in the test is outside the logged time range.

previous log:
2023-09-06T13:42:12.001001009Z ts=2023-09-06T13:42:12.000Z caller=main.go:585 level=info msg="Starting Prometheus Server" mode=server version="(version=2.46.0, branch=rhaos-4.14-rhel-8, revision=6db7b73de85c077a10d1740a32e04f1aaa4220a2)"
2023-09-06T13:42:12.001001009Z ts=2023-09-06T13:42:12.000Z caller=main.go:590 level=info build_context="(go=go1.20.5 X:strictfipsruntime, platform=linux/amd64, user=root@d83135830a30, date=20230726-15:37:03, tags=netgo,builtinassets,stringlabels,strictfipsruntime)"
.....
2023-09-06T14:58:27.112304472Z ts=2023-09-06T14:58:27.112Z caller=notifier.go:603 level=info component=notifier msg="Stopping notification manager..."
2023-09-06T14:58:27.112405373Z ts=2023-09-06T14:58:27.112Z caller=main.go:1157 level=info msg="Notifier manager stopped"
2023-09-06T14:58:27.113329681Z ts=2023-09-06T14:58:27.113Z caller=main.go:1169 level=info msg="See you next time!"

current log:
2023-09-06T14:58:27.479547353Z ts=2023-09-06T14:58:27.479Z caller=main.go:585 level=info msg="Starting Prometheus Server" mode=server version="(version=2.46.0, branch=rhaos-4.14-rhel-8, revision=6db7b73de85c077a10d1740a32e04f1aaa4220a2)"
2023-09-06T14:58:27.479547353Z ts=2023-09-06T14:58:27.479Z caller=main.go:590 level=info build_context="(go=go1.20.5 X:strictfipsruntime, platform=linux/amd64, user=root@d83135830a30, date=20230726-15:37:03, tags=netgo,builtinassets,stringlabels,strictfipsruntime)"
....
2023-09-06T14:58:41.817471180Z ts=2023-09-06T14:58:41.817Z caller=main.go:1011 level=info msg="Server is ready to receive web requests."
2023-09-06T14:58:41.817471180Z ts=2023-09-06T14:58:41.817Z caller=manager.go:1009 level=info component="rule manager" msg="Starting rule manager..."
2023-09-06T15:00:27.687311691Z ts=2023-09-06T15:00:27.687Z caller=compact.go:523 level=info component=tsdb msg="write block" mint=1694001600000 maxt=1694008800000 ulid=01H9NG3RJ70MVC95A6KXFKQEW1 duration=27.487906607s
2023-09-06T15:00:28.347761310Z ts=2023-09-06T15:00:28.347Z caller=head.go:1293 level=info component=tsdb msg="Head GC completed" caller=truncateMemory duration=656.195866ms
2023-09-06T15:00:28.389830034Z ts=2023-09-06T15:00:28.389Z caller=checkpoint.go:100 level=info component=tsdb msg="Creating checkpoint" from_segment=166 to_segment=170 mint=1694008800000
2023-09-06T15:00:33.987206774Z ts=2023-09-06T15:00:33.987Z caller=head.go:1261 level=info component=tsdb msg="WAL checkpoint complete" first=166 last=170 duration=5.597474311s

No requests
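Incidentally, a small helper one could use to pull the covered time range straight out of a saved Prometheus pod log instead of eyeballing it. The must-gather path below is only a placeholder; the helper parses just the ts= fields that Prometheus emits:

import re
from datetime import datetime, timezone

# Placeholder path: point it at the prometheus-k8s-0 container log inside the must-gather.
LOG_FILE = "must-gather/.../prometheus-k8s-0/current.log"

TS_RE = re.compile(r"\bts=(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")

def log_time_range(path):
    """Return the first and last ts= timestamps (UTC) found in a Prometheus pod log."""
    first = last = None
    with open(path) as f:
        for line in f:
            m = TS_RE.search(line)
            if not m:
                continue
            stamp = datetime.fromisoformat(m.group(1)).replace(tzinfo=timezone.utc)
            first = first or stamp
            last = stamp
    return first, last

print(log_time_range(LOG_FILE))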
"""Can you confirm that you collected the logs in the moment where the error was happening?""" - the must-gather logs where captured in the next 20 minutes after the test failed. It may be so the timezone shifted must-gather time is a UTC timestamp and the tests time is in Israeli Standard Time UTC +2 running another job test_cephfs_capacity_workload_alerts on 4.14 ODF on IBMCloud 3M, 3W test execution - http://pastebin.test.redhat.com/1109122 --- First error appeared at 17:21:22 First successful request after multiple failed requests happened at 17:25:05 --- During the 3.5 min Prometheus was not responding on req https://prometheus-k8s-openshift-monitoring.apps.dosypenk-109-ibm.ibmcloud2.qe.rh-ocs.com/api/v1/alerts - "The application is currently not serving requests at this endpoint. It may not have been started or is still starting" Although Prometheus has been restored and the test passed, in most cases, it still fails to respond within 4-5 minutes, resulting in a test failure. During this malfunction management-console looks like this https://bugzilla.redhat.com/attachment.cgi?id=1987977 --- Must gather logs where taken immediately after the test execution (took 20 min to collect all logs) 17:32:20 - 17:52:35 https://drive.google.com/file/d/1N9VU39sEe0xu8U5S3CA1KvEpJarOxcCM/view?usp=sharing - must-gather-ocp https://drive.google.com/file/d/1KbliB5Hwu12FpDnbMhiRxLCJXoSQirRJ/view?usp=sharing - must-gather-ocs Additional info: I was checking if malfunction happens in mgr pod logs, and it seems that there is no problem there debug 2023-09-10T14:21:22.288+0000 7f7aa6066640 0 log_channel(cluster) log [DBG] : pgmap v3893: 113 pgs: 12 active+clean+laggy, 101 active+clean; 72 GiB data, 217 GiB used, 83 GiB / 300 GiB avail; 85 B/s rd, 68 MiB/s wr, 18 op/s ::ffff:10.131.0.29 - - [10/Sep/2023:14:21:23] "GET /metrics HTTP/1.1" 200 36917 "" "Prometheus/2.46.0"
Issue confirmed on 3 recent test runs on an External mode deployment, ODF 4.11.
Issue confirmed on ODF 4.13.7, and it reproduces consistently across all ODF versions.

Behaviour: while the storage was being filled with the Flexible I/O tester (fio, https://linux.die.net/man/1/fio), the issue started to appear when the cluster had been filled from the initial 27% to only 34%. After this first Prometheus response failure ("<html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. </body></html>"), the failures start to happen more frequently and take more time to recover from. Along with this behaviour, System Degraded notifications were shown: "2 MDSs report slow metadata IOs", "1 MDSs report slow requests".

On the screen recording https://drive.google.com/file/d/1qemRIo243eJHKRnIMpvKIA1JQ5GgSjvs/view?usp=sharing the IOPS, Latency and Throughput graphs are visible.

Must-gather OCP and OCS logs: https://drive.google.com/file/d/18AUa_B2e5mY6ZFq3A2XE4Rn2w8COKEOL/view?usp=sharing
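To correlate the 504s with the fill level after the fact, here is a rough sketch that pulls the utilization ratio over the test window from the Prometheus range API. The route and token are placeholders; ceph_cluster_total_used_bytes is the metric already captured in comment 1, and ceph_cluster_total_bytes is assumed to be exported alongside it by the mgr:

import requests

# Placeholders: the real Prometheus route and a token with cluster-monitoring-view access.
PROMETHEUS_ROUTE = "https://prometheus-k8s-openshift-monitoring.apps.example.com"
TOKEN = "<service-account-token>"

def fill_ratio(start, end, step="30s"):
    """Print the cluster fill ratio over [start, end] (RFC3339 or unix timestamps)."""
    resp = requests.get(
        f"{PROMETHEUS_ROUTE}/api/v1/query_range",
        params={
            "query": "ceph_cluster_total_used_bytes / ceph_cluster_total_bytes",
            "start": start,
            "end": end,
            "step": step,
        },
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # QE clusters typically use self-signed certificates
        timeout=60,
    )
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        for ts, value in series["values"]:
            print(ts, f"{float(value):.1%}")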