Description of problem:
-----------------------
Had 3 OCS nodes with 1 OSD each of size 4 TiB. Filled the cluster to 85% and saw a prometheus-k8s pod stuck in Terminating state, and did not get any alerts in the UI.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-04-222846   True        False         3d2h    Error while reconciling 4.3.0-0.nightly-2020-03-04-222846: the cluster operator monitoring is degraded

prometheus-k8s-0                        6/7   Running       1    12h
prometheus-k8s-1                        0/7   Terminating   19   2d22h
prometheus-operator-65ff485b85-v5cw8    1/1   Running       0    2d22h
telemeter-client-68c9f8cbf6-g9jqd       3/3   Running       3    3d1h
thanos-querier-674bf557f8-27xlc         4/4   Running       4    3d1h
thanos-querier-674bf557f8-fqjzd         0/4   Pending       0    120m

Version of all relevant components (if applicable):
---------------------------------------------------
OCP: 4.3.0-0.nightly-2020-03-04-222846

sh-4.4# ceph version
ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)

rook: 4.3-34.0ea5457b.ocs_4.3
go: go1.11.13

ocs-operator.v4.3.0-379.ci   OpenShift Container Storage   4.3.0-379.ci   Succeeded

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes. Unable to access the UI with ease; no alerts seen.

Is there any workaround available to the best of your knowledge?
I can add capacity and recover the cluster.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
1/1

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
-------------------
1. Install monitoring backed by PVC
2. Fill the cluster up to 85% capacity
3. Check the UI for alerts

Actual results:
---------------
The UI does not show metrics and fails to show alerts.

Expected results:
-----------------
The UI should show metrics/alerts even when the cluster capacity is at 85%. According to https://issues.redhat.com/browse/KNIP-1223, the monitoring-backed PVC should not become read-only.
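For reference, "monitoring backed by PVC" (step 1 above) is configured through the cluster-monitoring-config ConfigMap in openshift-monitoring. A minimal sketch that would produce PVC names like the ones listed below; the claim names and sizes are assumptions for illustration, not copied from this cluster:

$ cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Persist Prometheus and Alertmanager data on OCS-backed PVCs.
    prometheusK8s:
      volumeClaimTemplate:
        metadata:
          name: my-prometheus-claim
        spec:
          storageClassName: ocs-storagecluster-ceph-rbd
          resources:
            requests:
              storage: 40Gi
    alertmanagerMain:
      volumeClaimTemplate:
        metadata:
          name: my-alertmanager-claim
        spec:
          storageClassName: ocs-storagecluster-ceph-rbd
          resources:
            requests:
              storage: 40Gi
EOF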
Additional info:
----------------
$ oc get pods -n openshift-monitoring
NAME                                           READY   STATUS        RESTARTS   AGE
alertmanager-main-0                            3/3     Running       0          2d23h
alertmanager-main-1                            3/3     Running       0          2d23h
alertmanager-main-2                            3/3     Running       0          2d23h
cluster-monitoring-operator-66dc99c65b-spdrx   1/1     Running       1          3d2h
grafana-5f4fc9f9cc-qrlc2                       2/2     Running       2          3d2h
kube-state-metrics-6bcc97c9d6-slr97            3/3     Running       3          3d2h
node-exporter-2mvf7                            2/2     Running       2          3d1h
node-exporter-2rz6s                            2/2     Running       2          3d1h
node-exporter-448bc                            2/2     Running       2          3d1h
node-exporter-574td                            2/2     Running       2          3d1h
node-exporter-66ckd                            2/2     Running       2          3d1h
node-exporter-6bwjb                            2/2     Running       2          3d1h
node-exporter-6w92l                            2/2     Running       2          3d1h
node-exporter-78s8d                            2/2     Running       2          3d1h
node-exporter-7m49w                            2/2     Running       2          3d1h
node-exporter-7rc9b                            2/2     Running       2          3d1h
node-exporter-7sdsd                            2/2     Running       2          3d1h
node-exporter-7t9gh                            2/2     Running       2          3d1h
node-exporter-7w67v                            2/2     Running       2          3d1h
node-exporter-7zxq2                            2/2     Running       2          3d2h
node-exporter-8wbnx                            2/2     Running       2          3d1h
node-exporter-9bjdw                            2/2     Running       2          3d1h
node-exporter-9k5dv                            2/2     Running       2          3d2h
node-exporter-9n9b7                            2/2     Running       0          3d2h
node-exporter-9pj5w                            2/2     Running       2          3d1h
node-exporter-b2k4v                            2/2     Running       2          3d1h
node-exporter-djhlg                            2/2     Running       2          3d1h
node-exporter-dz49q                            2/2     Running       2          3d1h
node-exporter-fl2ld                            2/2     Running       2          3d1h
node-exporter-flwf9                            2/2     Running       2          3d1h
node-exporter-fnqcz                            2/2     Running       4          3d1h
node-exporter-g5cp2                            2/2     Running       2          3d1h
node-exporter-gq4jd                            2/2     Running       4          3d1h
node-exporter-hfs64                            2/2     Running       2          3d1h
node-exporter-jrpmx                            2/2     Running       2          3d2h
node-exporter-jw8lw                            2/2     Running       0          3d2h
node-exporter-k9pqk                            2/2     Running       2          3d1h
node-exporter-kd7tv                            2/2     Running       2          3d1h
node-exporter-l27lj                            2/2     Running       2          3d1h
node-exporter-lcsnp                            2/2     Running       2          3d1h
node-exporter-lg5h4                            2/2     Running       2          3d1h
node-exporter-lnvpw                            2/2     Running       2          3d1h
node-exporter-mkx2f                            2/2     Running       4          3d1h
node-exporter-nx9bz                            2/2     Running       2          3d1h
node-exporter-p442g                            2/2     Running       2          3d1h
node-exporter-p5qjk                            2/2     Running       2          3d1h
node-exporter-rq8nk                            2/2     Running       2          3d1h
node-exporter-rvtnf                            2/2     Running       2          3d1h
node-exporter-vdz6l                            2/2     Running       2          3d1h
node-exporter-vzw9d                            2/2     Running       2          3d1h
node-exporter-w5f24                            2/2     Running       2          3d2h
node-exporter-wb9k7                            2/2     Running       4          3d1h
node-exporter-wf2k4                            2/2     Running       2          3d1h
node-exporter-xmnhr                            2/2     Running       2          3d1h
node-exporter-xssxc                            2/2     Running       2          3d1h
node-exporter-z7ckm                            2/2     Running       2          3d1h
node-exporter-zhd7p                            2/2     Running       2          3d1h
node-exporter-zmdlt                            2/2     Running       2          3d1h
node-exporter-zmtrd                            2/2     Running       2          3d1h
node-exporter-zrprz                            2/2     Running       4          3d1h
openshift-state-metrics-7479448b69-5ws2c       3/3     Running       0          167m
prometheus-adapter-559b8c5c5b-c6vnz            1/1     Running       0          2d2h
prometheus-adapter-559b8c5c5b-vbtzr            1/1     Running       0          2d2h
prometheus-k8s-0                               6/7     Running       1          13h
prometheus-k8s-1                               0/7     Terminating   19         2d23h
prometheus-operator-65ff485b85-v5cw8           1/1     Running       0          2d23h
telemeter-client-68c9f8cbf6-g9jqd              3/3     Running       3          3d2h
thanos-querier-674bf557f8-27xlc                4/4     Running       4          3d2h
thanos-querier-674bf557f8-fqjzd                0/4     Pending       0          167m

$ oc get pvc -n openshift-monitoring
NAME                                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
my-alertmanager-claim-alertmanager-main-0   Bound    pvc-d4ce98d4-2498-4a81-a045-7135f45e007f   40Gi       RWO            ocs-storagecluster-ceph-rbd   2d23h
my-alertmanager-claim-alertmanager-main-1   Bound    pvc-bff677cf-387d-486c-95b4-d9f5b6c6d609   40Gi       RWO            ocs-storagecluster-ceph-rbd   2d23h
my-alertmanager-claim-alertmanager-main-2   Bound    pvc-43ebae72-998a-4c5c-9f01-f29ba18b7945   40Gi       RWO            ocs-storagecluster-ceph-rbd   2d23h
my-prometheus-claim-prometheus-k8s-0        Bound    pvc-36dcd278-28e9-441a-9a3b-5a7f69cf80b6   40Gi       RWO            ocs-storagecluster-ceph-rbd   2d23h
my-prometheus-claim-prometheus-k8s-1        Bound    pvc-86d68810-4fd0-439b-955c-9060b3b0f40a   40Gi       RWO            ocs-storagecluster-ceph-rbd   2d23h

sh-4.4# ceph -s
  cluster:
    id:     6ca0c5b2-1051-4ee8-92c5-ed439d213a96
    health: HEALTH_ERR
            3 full osd(s)
            10 pool(s) full
            Degraded data redundancy: 30/2708412 objects degraded (0.001%), 24 pgs degraded
            Full OSDs blocking recovery: 25 pgs recovery_toofull
            1 subtrees have overcommitted pool target_size_ratio

  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: a (active, since 2h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 2h), 3 in (since 3d)

  data:
    pools:   10 pools, 192 pgs
    objects: 902.80k objects, 3.4 TiB
    usage:   10 TiB used, 1.8 TiB / 12 TiB avail
    pgs:     30/2708412 objects degraded (0.001%)
             167 active+clean
             24  active+recovery_toofull+degraded
             1   active+recovery_toofull

  io:
    client: 852 B/s rd, 1 op/s rd, 0 op/s wr

sh-4.4# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS
 0   hdd 3.99899  1.00000 4.0 TiB 3.4 TiB 3.4 TiB 1.3 MiB 5.0 GiB 609 GiB 85.14 1.00 192     up
 1   hdd 3.99899  1.00000 4.0 TiB 3.4 TiB 3.4 TiB 1.3 MiB 5.0 GiB 609 GiB 85.14 1.00 192     up
 2   hdd 3.99899  1.00000 4.0 TiB 3.4 TiB 3.4 TiB 1.3 MiB 4.9 GiB 609 GiB 85.13 1.00 192     up
                    TOTAL  12 TiB  10 TiB  10 TiB 4.0 MiB  15 GiB 1.8 TiB 85.14
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

Note: Reboots of the worker nodes and master nodes were also performed on the cluster.
Screenshot here http://rhsqe-repo.lab.eng.blr.redhat.com/cns/ocs-qe-bugs/bz-1818736/
Logs?!
@sraghave, Is this behavior consistent? Are you able to reproduce it every time capacity reaches 85%?
@yaniv The logs are at the same location as comment #2; they are still being copied due to some network issues, apologies.

@Nishanth I tried once and hit it on the first attempt. My understanding is that the PVC backing monitoring is becoming read-only, hence the UI is crashing, very slow, and not showing alerts. Not sure, but the above might be the issue.
I see some old alerts in your cluster (will attach the screenshot). However, the screenshot you've provided looks more like a network issue, since OCP metrics are also not being shown (CPU utilization, etc.).
Created attachment 1675025 [details]
Alert Screenshot

Screenshot for the above comment.
@Bipul The alerts are seen when the cluster is at 75%... The screenshot you attached shows the same. But this BZ was raised when all the OSDs are full, i.e., when the cluster reaches 85% capacity.

sh-4.4# ceph -s
  cluster:
    id:     6ca0c5b2-1051-4ee8-92c5-ed439d213a96
    health: HEALTH_ERR
            3 full osd(s)
            10 pool(s) full
            Degraded data redundancy: 30/2708412 objects degraded (0.001%), 24 pgs degraded
            Full OSDs blocking recovery: 25 pgs recovery_toofull
            1 subtrees have overcommitted pool target_size_ratio

  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: a (active, since 2h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 2h), 3 in (since 3d)

  data:
    pools:   10 pools, 192 pgs
    objects: 902.80k objects, 3.4 TiB
    usage:   10 TiB used, 1.8 TiB / 12 TiB avail
    pgs:     30/2708412 objects degraded (0.001%)
             167 active+clean
             24  active+recovery_toofull+degraded
             1   active+recovery_toofull

  io:
    client: 852 B/s rd, 1 op/s rd, 0 op/s wr

sh-4.4# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS
 0   hdd 3.99899  1.00000 4.0 TiB 3.4 TiB 3.4 TiB 1.3 MiB 5.0 GiB 609 GiB 85.14 1.00 192     up
 1   hdd 3.99899  1.00000 4.0 TiB 3.4 TiB 3.4 TiB 1.3 MiB 5.0 GiB 609 GiB 85.14 1.00 192     up
 2   hdd 3.99899  1.00000 4.0 TiB 3.4 TiB 3.4 TiB 1.3 MiB 4.9 GiB 609 GiB 85.13 1.00 192     up
                    TOTAL  12 TiB  10 TiB  10 TiB 4.0 MiB  15 GiB 1.8 TiB 85.14
MIN/MAX VAR: 1.00/1.00  STDDEV: 0
The UI is not crashing; however, the metrics are not being displayed. This is not a UI issue, it's a backend issue. @Nishanth please move it to the correct component.
@sraghave I think this is happening because the PVCs backing the Prometheus datastore are getting full quickly. Can you please verify this? Also, what size of PVCs are being used?
@anmol The Prometheus pod backed by a PVC goes into Terminating state every time the OSDs reach 85%, and hence the alerts are not seen in the UI. I added capacity and waited until the OSDs' utilization dropped below 85%, and then the UI looked fine again. We cannot add capacity every time to escape this issue, as only 3 OSDs per node are supported. When the OSDs reach 85% the cluster becomes read-only, which stops the alerts from reaching the UI. Please have a look at KNIP-1223.

The PVC size we are using is 40G, as mentioned in the OCS and OCP docs.
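For reference, the "add capacity" workaround mentioned above can also be done outside the UI by increasing the storageDeviceSets count on the StorageCluster CR. A minimal sketch, assuming the default CR name and a single device set; the values are illustrative, not taken from this cluster:

# Bump the device-set count from 1 to 2 to add another set of OSDs.
$ oc -n openshift-storage patch storagecluster ocs-storagecluster \
    --type json \
    -p '[{"op": "replace", "path": "/spec/storageDeviceSets/0/count", "value": 2}]'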
So this is working as expected. Or am I missing something?
I think it's a very bad experience. However, it's not a regression, to the best of my understanding.
(In reply to Bipul Adhikari from comment #9)
> The UI is not crashing however the metrics are not being displayed,

Can we have the description of the BZ updated to reflect this?
(In reply to Nishanth Thomas from comment #12)
> So this is working as expected. Or am I missing something?

Do I get it right that this is basically a chicken-and-egg problem, with Prometheus running on PVs affected by the very OSD-full state that it is supposed to be monitoring?

Moving to 4.5 for further processing (4.3 is about to be released, and 4.4 is essentially closed).
(In reply to Michael Adam from comment #15)
> (In reply to Nishanth Thomas from comment #12)
> > So this is working as expected. Or am I missing something?
>
> Do I get it right that this is basically a chicken-and-egg problem with
> prometheus running on PVs affected by the OSD filling state that it is
> supposed to be monitoring?
>
> Moving to 4.5 for further processing (4.3 is about to be released, and 4.4
> is essentially closed).

As we discussed on the gchat thread, the best solution at the moment is to lower the monitoring thresholds so that it gets a chance to send alerts, though there are some corner cases where this might not work.

@Anmol, have you checked the feasibility of cleaning up the historical data, or can we trigger auto-expansion of the PVs upon reaching the first threshold?
> As we discussed on the gchat thread, the best solution at the moment is to
> lower the monitoring thresholds so that it will get a chance to send alerts,
> though there are some corner cases where this might not work.

We will be changing the monitoring thresholds according to this bug, https://bugzilla.redhat.com/show_bug.cgi?id=1809248, and the changed mon_osd_full_ratio and mon_osd_nearfull_ratio. So there are a few cases that need to be taken care of together.

> @Anmol, Have you checked the feasibility of cleaning up the historical data.

That is OCP functionality and should not be affected by OCS.

> or can we trigger auto expand Pvs up on reaching the first threshold?

In the OCS product, we do not configure OCP Prometheus. It is essentially the user's responsibility to expand the PVC. This particular use case comes from OCS CI and should be handled there.
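To make the two points above concrete, two hedged sketches (the values are illustrative only): adjusting the Ceph nearfull/full ratios, and expanding a monitoring PVC, which only works if the StorageClass has allowVolumeExpansion enabled:

# Adjust the nearfull/full thresholds (run from the rook-ceph toolbox pod).
sh-4.4# ceph osd set-nearfull-ratio 0.75
sh-4.4# ceph osd set-full-ratio 0.85

# Expand one of the monitoring PVCs (100Gi is an arbitrary example target).
$ oc -n openshift-monitoring patch pvc my-prometheus-claim-prometheus-k8s-0 \
    -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'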
This has been discussed in November already (and likely even before, but I can't find any reference):
http://post-office.corp.redhat.com/archives/ocs-qe/2019-November/msg00271.html

Kyle Bader suggested 2 approaches, but both boil down to significant changes in OCS:
http://post-office.corp.redhat.com/archives/ocs-qe/2019-November/msg00272.html

Josh Salomon highlighted the importance of monitoring:

> As I see it, the preferred way is a reliable alert mechanism with the
> addition of enough spares to complete inflight operations - If the customer
> does not monitor the alerts, there is practically nothing we can do. By
> reliable alert mechanism, I mean we need to be able to reliably export our
> alerts to standard monitoring system (for example via SNMP traps). I don't
> know what we currently have in this domain, but we should be able to send our
> alerts to the popular monitoring systems rather than ask the customer to
> integrate with Ceph or Prometheus - we should assume that the main dashboard
> of the IT is not the openshift dashboard, and the alerts should always be
> pushed to the main dashboard.

http://post-office.corp.redhat.com/archives/ocs-qe/2019-November/msg00281.html

This supports Anmol's comment 17, in which he notes that the most important task here is to fix alerting. The follow-up may include whether we should support/document how to configure OCP Prometheus to send selected critical alerts via SNMP (maybe via prometheus-webhook-snmp or another existing Alertmanager plugin?), but this requires collaboration with the OCP Alerting team.
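For illustration only (this is not something OCS configures today): forwarding selected critical alerts to an SNMP trap sender such as prometheus-webhook-snmp would be an Alertmanager configuration concern. A minimal sketch of such a routing fragment; the receiver name, service host, and port are assumptions:

$ cat > alertmanager-snmp-fragment.yaml <<'EOF'
# Fragment to merge into the Alertmanager configuration (alertmanager.yaml).
route:
  routes:
  - match:
      severity: critical          # only forward critical alerts
    receiver: snmp-trap
    continue: true                # keep delivering to the default receiver too
receivers:
- name: snmp-trap
  webhook_configs:
  - url: http://prometheus-webhook-snmp.example.svc:9099/alerts   # assumed endpoint
EOF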
This bug was discussed in today's "Monitoring BZ discussion with QE" among Anmol, Elena, Filip and me. The conclusion was that:

- Alerts need to be fixed (BZ 1809248) for an admin to be able to avoid the problem.
- Documentation needs to describe this (why, action items) as part of the description of the storage utilization alerts. I reported new doc BZ 1834440 to make sure we update the docs accordingly when the alerts are fixed.
Added a new critical alert, CephClusterReadOnly, which fires when the cluster becomes read-only. Updated the existing warning and critical alerts to accommodate the new one.
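To illustrate what such a rule looks like (the exact expression, threshold, and labels in the shipped ocs-operator rule may differ; treat this as a sketch only, with illustrative names):

$ cat <<'EOF' | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ceph-cluster-readonly-example     # illustrative name, not the shipped rule
  namespace: openshift-storage
spec:
  groups:
  - name: cluster-utilization-alert.rules
    rules:
    - alert: CephClusterReadOnly
      # Fire when raw utilization crosses the ratio at which Ceph goes read-only.
      expr: ceph_cluster_total_used_raw_bytes / ceph_cluster_total_bytes >= 0.85
      labels:
        severity: critical
      annotations:
        message: >-
          Storage cluster utilization has crossed 85% and the cluster is read-only.
          Free up space or expand the storage cluster immediately.
EOF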
Please provide devel-ack.
Tested on an AWS IPI environment (internal mode):
- 3 masters
- 3 workers (OCS)
- Monitoring is backed by OCS

Version
OCP: 4.5.0-0.nightly-2020-07-25-031342
OCS: ocs-operator.v4.5.0-508.ci

Observations
At 85% the cluster went into a read-only state (i.e., all client IOs stopped and only read IOs remained), but I was able to expand the cluster through the UI. As expected, we see both alerts, at 75% and at 85%. One of the monitoring pods went into CreateContainerError state since it is backed by OCS, but the UI was accessible. Once the cluster was expanded, the monitoring pod came back to Running state.

- Snapshot when the cluster reached ~75%:
  http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1818736/after-logs/snapshots/when-77%25/
- Snapshot when the cluster reached ~85%:
  http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1818736/after-logs/snapshots/when-85%25/
- Snapshot of add capacity when the cluster reached ~85%:
  http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1818736/after-logs/snapshots/add-capacity/

Logs:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1818736/
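One quick way to confirm the new read-only alert is present during such a fill test; the rule and alert names are taken from comment 22, everything else is an assumption:

# Check that the alert is defined in one of the PrometheusRules in the OCS namespace.
$ oc -n openshift-storage get prometheusrule -o yaml | grep -B2 -A6 'alert: CephClusterReadOnly'

# The firing alert itself can then be checked in the console under Monitoring -> Alerting.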
Based on comment#26 moving bz to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754