Bug 1818736 - UI crashed after cluster reached 85% capacity
Summary: UI crashed after cluster reached 85% capacity
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.5.0
Assignee: Anmol Sachan
QA Contact: akarsha
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-30 08:47 UTC by Shrivaibavi Raghaventhiran
Modified: 2020-09-23 09:05 UTC
CC: 12 users

Fixed In Version: quay.io/rhceph-dev/ocs-olm-operator:4.5.0-475.ci
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-15 10:16:04 UTC
Embargoed:


Attachments
Alert Screenshot (70.34 KB, image/jpeg), uploaded 2020-03-31 09:20 UTC by Bipul Adhikari


Links
- Github openshift/rook pull 71 (closed): Release 1.3 (last updated 2021-01-16 02:21:35 UTC)
- Github rook/rook pull 5673 (closed): ceph: Added CephClusterReadOnly alert and updated utilization alerts. (last updated 2021-01-16 02:21:35 UTC)
- Red Hat Bugzilla 1809248 (last updated 2022-02-22 15:46:16 UTC)
- Red Hat Product Errata RHBA-2020:3754 (last updated 2020-09-15 10:16:32 UTC)

Description Shrivaibavi Raghaventhiran 2020-03-30 08:47:50 UTC
Description of problem:
-----------------------
Had 3 OCS nodes with 1 OSD of size 4 TB each. Filled the cluster to 85% and saw prometheus-k8s-1 in Terminating state, and did not get any alerts in the UI.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-04-222846   True        False         3d2h    Error while reconciling 4.3.0-0.nightly-2020-03-04-222846: the cluster operator monitoring is degraded

prometheus-k8s-0                               6/7     Running       1          12h
prometheus-k8s-1                               0/7     Terminating   19         2d22h
prometheus-operator-65ff485b85-v5cw8           1/1     Running       0          2d22h
telemeter-client-68c9f8cbf6-g9jqd              3/3     Running       3          3d1h
thanos-querier-674bf557f8-27xlc                4/4     Running       4          3d1h
thanos-querier-674bf557f8-fqjzd                0/4     Pending       0          120m

Version of all relevant components (if applicable):
---------------------------------------------------
OCP- 4.3.0-0.nightly-2020-03-04-222846

sh-4.4# ceph version
ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)

rook: 4.3-34.0ea5457b.ocs_4.3
go: go1.11.13

ocs-operator.v4.3.0-379.ci                  OpenShift Container Storage   4.3.0-379.ci                    Succeeded


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Yes. The UI cannot be accessed with ease and no alerts are seen.

Is there any workaround available to the best of your knowledge?
Yes; I can add capacity and recover the cluster.
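For context, a minimal sketch of adding capacity from the CLI (the UI flow is equivalent); the StorageCluster name below is the default for this deployment and may differ elsewhere:

$ oc -n openshift-storage get storagecluster
$ oc -n openshift-storage edit storagecluster ocs-storagecluster
# then increase spec.storageDeviceSets[0].count by 1 to add another set of OSDs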

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
1/1

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
------------------
1. Install monitoring backed by PVC
2. Fill the cluster up to 85% capacity
3. See UI for alerts
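The fill level can be tracked from the rook-ceph toolbox while running step 2; illustrative commands only (the toolbox pod has to be enabled and is not part of the reproduction steps above):

$ oc -n openshift-storage rsh $(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
sh-4.4# ceph df
sh-4.4# ceph osd df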


Actual results:
---------------
The UI does not show metrics and fails to show alerts.

Expected results:
----------------
The UI should show metrics and alerts even when the cluster is at 85% capacity.
According to https://issues.redhat.com/browse/KNIP-1223, the PVC backing monitoring should not become read-only.

Additional info:
----------------
$ oc get pods -n openshift-monitoring
NAME                                           READY   STATUS        RESTARTS   AGE
alertmanager-main-0                            3/3     Running       0          2d23h
alertmanager-main-1                            3/3     Running       0          2d23h
alertmanager-main-2                            3/3     Running       0          2d23h
cluster-monitoring-operator-66dc99c65b-spdrx   1/1     Running       1          3d2h
grafana-5f4fc9f9cc-qrlc2                       2/2     Running       2          3d2h
kube-state-metrics-6bcc97c9d6-slr97            3/3     Running       3          3d2h
node-exporter-2mvf7                            2/2     Running       2          3d1h
node-exporter-2rz6s                            2/2     Running       2          3d1h
node-exporter-448bc                            2/2     Running       2          3d1h
node-exporter-574td                            2/2     Running       2          3d1h
node-exporter-66ckd                            2/2     Running       2          3d1h
node-exporter-6bwjb                            2/2     Running       2          3d1h
node-exporter-6w92l                            2/2     Running       2          3d1h
node-exporter-78s8d                            2/2     Running       2          3d1h
node-exporter-7m49w                            2/2     Running       2          3d1h
node-exporter-7rc9b                            2/2     Running       2          3d1h
node-exporter-7sdsd                            2/2     Running       2          3d1h
node-exporter-7t9gh                            2/2     Running       2          3d1h
node-exporter-7w67v                            2/2     Running       2          3d1h
node-exporter-7zxq2                            2/2     Running       2          3d2h
node-exporter-8wbnx                            2/2     Running       2          3d1h
node-exporter-9bjdw                            2/2     Running       2          3d1h
node-exporter-9k5dv                            2/2     Running       2          3d2h
node-exporter-9n9b7                            2/2     Running       0          3d2h
node-exporter-9pj5w                            2/2     Running       2          3d1h
node-exporter-b2k4v                            2/2     Running       2          3d1h
node-exporter-djhlg                            2/2     Running       2          3d1h
node-exporter-dz49q                            2/2     Running       2          3d1h
node-exporter-fl2ld                            2/2     Running       2          3d1h
node-exporter-flwf9                            2/2     Running       2          3d1h
node-exporter-fnqcz                            2/2     Running       4          3d1h
node-exporter-g5cp2                            2/2     Running       2          3d1h
node-exporter-gq4jd                            2/2     Running       4          3d1h
node-exporter-hfs64                            2/2     Running       2          3d1h
node-exporter-jrpmx                            2/2     Running       2          3d2h
node-exporter-jw8lw                            2/2     Running       0          3d2h
node-exporter-k9pqk                            2/2     Running       2          3d1h
node-exporter-kd7tv                            2/2     Running       2          3d1h
node-exporter-l27lj                            2/2     Running       2          3d1h
node-exporter-lcsnp                            2/2     Running       2          3d1h
node-exporter-lg5h4                            2/2     Running       2          3d1h
node-exporter-lnvpw                            2/2     Running       2          3d1h
node-exporter-mkx2f                            2/2     Running       4          3d1h
node-exporter-nx9bz                            2/2     Running       2          3d1h
node-exporter-p442g                            2/2     Running       2          3d1h
node-exporter-p5qjk                            2/2     Running       2          3d1h
node-exporter-rq8nk                            2/2     Running       2          3d1h
node-exporter-rvtnf                            2/2     Running       2          3d1h
node-exporter-vdz6l                            2/2     Running       2          3d1h
node-exporter-vzw9d                            2/2     Running       2          3d1h
node-exporter-w5f24                            2/2     Running       2          3d2h
node-exporter-wb9k7                            2/2     Running       4          3d1h
node-exporter-wf2k4                            2/2     Running       2          3d1h
node-exporter-xmnhr                            2/2     Running       2          3d1h
node-exporter-xssxc                            2/2     Running       2          3d1h
node-exporter-z7ckm                            2/2     Running       2          3d1h
node-exporter-zhd7p                            2/2     Running       2          3d1h
node-exporter-zmdlt                            2/2     Running       2          3d1h
node-exporter-zmtrd                            2/2     Running       2          3d1h
node-exporter-zrprz                            2/2     Running       4          3d1h
openshift-state-metrics-7479448b69-5ws2c       3/3     Running       0          167m
prometheus-adapter-559b8c5c5b-c6vnz            1/1     Running       0          2d2h
prometheus-adapter-559b8c5c5b-vbtzr            1/1     Running       0          2d2h
prometheus-k8s-0                               6/7     Running       1          13h
prometheus-k8s-1                               0/7     Terminating   19         2d23h
prometheus-operator-65ff485b85-v5cw8           1/1     Running       0          2d23h
telemeter-client-68c9f8cbf6-g9jqd              3/3     Running       3          3d2h
thanos-querier-674bf557f8-27xlc                4/4     Running       4          3d2h
thanos-querier-674bf557f8-fqjzd                0/4     Pending       0          167m

$ oc get pvc -n openshift-monitoring
NAME                                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
my-alertmanager-claim-alertmanager-main-0   Bound    pvc-d4ce98d4-2498-4a81-a045-7135f45e007f   40Gi       RWO            ocs-storagecluster-ceph-rbd   2d23h
my-alertmanager-claim-alertmanager-main-1   Bound    pvc-bff677cf-387d-486c-95b4-d9f5b6c6d609   40Gi       RWO            ocs-storagecluster-ceph-rbd   2d23h
my-alertmanager-claim-alertmanager-main-2   Bound    pvc-43ebae72-998a-4c5c-9f01-f29ba18b7945   40Gi       RWO            ocs-storagecluster-ceph-rbd   2d23h
my-prometheus-claim-prometheus-k8s-0        Bound    pvc-36dcd278-28e9-441a-9a3b-5a7f69cf80b6   40Gi       RWO            ocs-storagecluster-ceph-rbd   2d23h
my-prometheus-claim-prometheus-k8s-1        Bound    pvc-86d68810-4fd0-439b-955c-9060b3b0f40a   40Gi       RWO            ocs-storagecluster-ceph-rbd   2d23h

sh-4.4# ceph -s
  cluster:
    id:     6ca0c5b2-1051-4ee8-92c5-ed439d213a96
    health: HEALTH_ERR
            3 full osd(s)
            10 pool(s) full
            Degraded data redundancy: 30/2708412 objects degraded (0.001%), 24 pgs degraded
            Full OSDs blocking recovery: 25 pgs recovery_toofull
            1 subtrees have overcommitted pool target_size_ratio
 
  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: a(active, since 2h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 2h), 3 in (since 3d)
 
  data:
    pools:   10 pools, 192 pgs
    objects: 902.80k objects, 3.4 TiB
    usage:   10 TiB used, 1.8 TiB / 12 TiB avail
    pgs:     30/2708412 objects degraded (0.001%)
             167 active+clean
             24  active+recovery_toofull+degraded
             1   active+recovery_toofull
 
  io:
    client:   852 B/s rd, 1 op/s rd, 0 op/s wr
 
sh-4.4# 
sh-4.4# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS 
 0   hdd 3.99899  1.00000 4.0 TiB 3.4 TiB 3.4 TiB 1.3 MiB 5.0 GiB 609 GiB 85.14 1.00 192     up 
 1   hdd 3.99899  1.00000 4.0 TiB 3.4 TiB 3.4 TiB 1.3 MiB 5.0 GiB 609 GiB 85.14 1.00 192     up 
 2   hdd 3.99899  1.00000 4.0 TiB 3.4 TiB 3.4 TiB 1.3 MiB 4.9 GiB 609 GiB 85.13 1.00 192     up 
                    TOTAL  12 TiB  10 TiB  10 TiB 4.0 MiB  15 GiB 1.8 TiB 85.14                 
MIN/MAX VAR: 1.00/1.00  STDDEV: 0


Note: Reboots of the worker and master nodes were also performed on the cluster.

Comment 2 Shrivaibavi Raghaventhiran 2020-03-30 09:22:01 UTC
Screenshot here http://rhsqe-repo.lab.eng.blr.redhat.com/cns/ocs-qe-bugs/bz-1818736/

Comment 3 Yaniv Kaul 2020-03-30 10:47:32 UTC
Logs?!

Comment 4 Nishanth Thomas 2020-03-30 13:15:07 UTC
@sraghave, Is this behavior consistent? Are you able to reproduce it every time capacity reaches 85%?

Comment 5 Shrivaibavi Raghaventhiran 2020-03-30 13:34:47 UTC
@yaniv The logs are in the same location as comment #2. They are still being copied due to some network issues; apologies.

@Nishanth I tried once and hit it on the first attempt.
My understanding is that the PVC backing monitoring becomes read-only, which is why the UI is crashing, is very slow, and is not showing alerts.
I am not sure, but the above might be the issue.
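One way to check the read-only hypothesis (illustrative commands, not output from this cluster) is to look for read-only filesystem errors on the affected Prometheus pod and in its logs:

$ oc -n openshift-monitoring describe pod prometheus-k8s-1 | grep -i -A2 "read-only"
$ oc -n openshift-monitoring logs prometheus-k8s-1 -c prometheus --previous | grep -i "read-only file system"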

Comment 6 Bipul Adhikari 2020-03-31 09:18:50 UTC
I see some old alerts in your cluster (screenshot attached). However, the screenshot you've provided looks more like a network issue, since OCP metrics (CPU utilization, etc.) are also not being shown.

Comment 7 Bipul Adhikari 2020-03-31 09:20:21 UTC
Created attachment 1675025 [details]
Alert Screenshot

Screenshot for the above comment.

Comment 8 Shrivaibavi Raghaventhiran 2020-03-31 11:44:16 UTC
@Bipul The alerts are seen at 75%; the screenshot you attached shows the same. But this BZ was raised for when all the OSDs are full, i.e. when the cluster reaches 85% capacity.

sh-4.4# ceph -s
  cluster:
    id:     6ca0c5b2-1051-4ee8-92c5-ed439d213a96
    health: HEALTH_ERR
            3 full osd(s)
            10 pool(s) full
            Degraded data redundancy: 30/2708412 objects degraded (0.001%), 24 pgs degraded
            Full OSDs blocking recovery: 25 pgs recovery_toofull
            1 subtrees have overcommitted pool target_size_ratio
 
  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: a(active, since 2h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 2h), 3 in (since 3d)
 
  data:
    pools:   10 pools, 192 pgs
    objects: 902.80k objects, 3.4 TiB
    usage:   10 TiB used, 1.8 TiB / 12 TiB avail
    pgs:     30/2708412 objects degraded (0.001%)
             167 active+clean
             24  active+recovery_toofull+degraded
             1   active+recovery_toofull
 
  io:
    client:   852 B/s rd, 1 op/s rd, 0 op/s wr
 
sh-4.4# 
sh-4.4# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS 
 0   hdd 3.99899  1.00000 4.0 TiB 3.4 TiB 3.4 TiB 1.3 MiB 5.0 GiB 609 GiB 85.14 1.00 192     up 
 1   hdd 3.99899  1.00000 4.0 TiB 3.4 TiB 3.4 TiB 1.3 MiB 5.0 GiB 609 GiB 85.14 1.00 192     up 
 2   hdd 3.99899  1.00000 4.0 TiB 3.4 TiB 3.4 TiB 1.3 MiB 4.9 GiB 609 GiB 85.13 1.00 192     up 
                    TOTAL  12 TiB  10 TiB  10 TiB 4.0 MiB  15 GiB 1.8 TiB 85.14                 
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

Comment 9 Bipul Adhikari 2020-03-31 12:23:52 UTC
The UI is not crashing; however, the metrics are not being displayed. This is not a UI issue, it's a backend issue. @Nishanth please move it to the correct component.

Comment 10 Anmol Sachan 2020-04-01 08:41:23 UTC
@sraghave I think this is happening because the PVCs backing the Prometheus datastore are getting full quickly. Can you please verify this? Also, what size of PVCs are being used?

Comment 11 Shrivaibavi Raghaventhiran 2020-04-01 10:18:13 UTC
@anmol
The Prometheus pod backed by the PVC goes into Terminating state every time the OSDs reach 85%, and hence the alerts are not seen in the UI.
I added capacity, waited for OSD utilization to drop below 85%, and then the UI looked fine again.
We cannot add capacity every time to escape this issue, as only 3 OSDs per node are supported. When the OSDs reach 85% they become read-only, which stops the alerts from reaching the UI.

Please have a look at KNIP-1223.

The size we are using is 40Gi, as mentioned in the OCS and OCP docs.
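For reference, the PVC for OCP Prometheus is requested through the cluster-monitoring-config ConfigMap, as described in the OCP/OCS docs; a minimal sketch of the relevant stanza, using the claim name and storage class visible in this cluster (the exact ConfigMap contents were not captured in this bug):

$ oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml
...
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        metadata:
          name: my-prometheus-claim
        spec:
          storageClassName: ocs-storagecluster-ceph-rbd
          resources:
            requests:
              storage: 40Gi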

Comment 12 Nishanth Thomas 2020-04-01 10:54:25 UTC
So this is working as expected. Or am I missing something?

Comment 13 Yaniv Kaul 2020-04-01 13:36:26 UTC
I think it's a very bad experience. However, it's not a regression, to the best of my understanding.

Comment 14 Michael Adam 2020-04-07 08:22:30 UTC
(In reply to Bipul Adhikari from comment #9)
> The UI is not crashing; however, the metrics are not being displayed,

Can we have the description of the BZ updated to reflect this?

Comment 15 Michael Adam 2020-04-07 08:27:30 UTC
(In reply to Nishanth Thomas from comment #12)
> So this is working as expected. Or am I missing something?

Do I get it right that this is basically a chicken-and-egg problem with prometheus running on PVs affected by the OSD filling state that it is supposed to be monitoring?

Moving to 4.5 for further processing (4.3 is about to be released, and 4.4 is essentially closed).

Comment 16 Nishanth Thomas 2020-04-08 07:37:15 UTC
(In reply to Michael Adam from comment #15)
> (In reply to Nishanth Thomas from comment #12)
> > So this is working as expected. Or am I missing something?
> 
> Do I get it right that this is basically a chicken-and-egg problem with
> prometheus running on PVs affected by the OSD filling state that it is
> supposed to be monitoring?
> 
> Moving to 4.5 for further processing (4.3 is about to be released, and 4.4
> is essentially closed).

As we discussed on the gchat thread, the best solution at the moment is to lower the monitoring thresholds so that it will get a chance to send alerts, though there are some corner cases where this might not work.

@Anmol, have you checked the feasibility of cleaning up the historical data, or can we trigger auto-expansion of PVs upon reaching the first threshold?

Comment 17 Anmol Sachan 2020-04-20 06:52:00 UTC
> As we discussed on the gchat thread, the best solution at the moment is to
> lower the monitoring thresholds so that it will get a chance to send alerts,
> though there are some corner cases where this might not work.

We will be changing the monitoring thresholds according to this bug, https://bugzilla.redhat.com/show_bug.cgi?id=1809248, and the changed mon_osd_full_ratio and mon_osd_nearfull_ratio. So there are a few cases that need to be taken care of together.

> 
> @Anmol, have you checked the feasibility of cleaning up the historical data,

That is OCP functionality and should not be affected by OCS.

> or can we trigger auto-expansion of PVs upon reaching the first threshold?

In the OCS product, we do not configure OCP Prometheus; it is essentially the user's responsibility to expand the PVC. This particular use case comes from OCS CI and should be handled there.
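For reference, the Ceph full/nearfull ratios mentioned above can be inspected (and, if ever needed, adjusted) from the toolbox; illustrative commands only, not part of the proposed fix:

sh-4.4# ceph osd dump | grep ratio
sh-4.4# ceph osd set-nearfull-ratio 0.75
sh-4.4# ceph osd set-full-ratio 0.85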

Comment 18 Martin Bukatovic 2020-05-11 16:31:45 UTC
This has been discussed in November already (and likely even before, but I can't
find any reference):

http://post-office.corp.redhat.com/archives/ocs-qe/2019-November/msg00271.html

Kyle Bader suggested 2 approaches, but both boiled down to significant changes in OCS:

http://post-office.corp.redhat.com/archives/ocs-qe/2019-November/msg00272.html

While Josh Salomon highlighted the importance of monitoring:

> As I see it, the preferred way is a reliable alert mechanism with the
> addition of enough spares to complete inflight operations - If the customer
> does not monitor the alerts, there is practically nothing we can do. By
> reliable alert mechanism, I mean we need to be able to reliably export our
> alerts to standard monitoring system (for example via SNMP traps). I don't
> know what we currently have in this domain, but we should be able to send our
> alerts to the popular monitoring systems rather than ask the customer to
> integrate with Ceph or Prometheus - we should assume that the main dashboard
> of the IT is not the openshift dashboard, and the alerts should always be
> pushed to the main dashboard. 

http://post-office.corp.redhat.com/archives/ocs-qe/2019-November/msg00281.html

This supports Anmol's comment 17, in which he notes that the most important
task here is to fix alerting.

The follow-up may include deciding whether we should support/document how to configure
OCP Prometheus to send selected critical alerts via SNMP (maybe via
prometheus-webhook-snmp or another existing Alertmanager plugin?), but this
requires collaboration with the OCP Alerting team.

Comment 19 Martin Bukatovic 2020-05-11 18:03:23 UTC
This bug was discussed in today's "Monitoring BZ discussion with QE" among Anmol, Elena, Filip, and me. The conclusions were:

- alerts need to be fixed (BZ 1809248) for an admin to be able to avoid the problem
- documentation needs to describe this (why, action items) as part of the description of the storage utilization alerts; I reported new doc BZ 1834440 to make sure we update the docs accordingly when the alerts are fixed

Comment 20 Anmol Sachan 2020-06-22 14:21:56 UTC
Added a new critical alert, CephClusterReadOnly, which notifies when the cluster becomes read-only.
Updated the warning and critical utilization alerts to accommodate the new alert.
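The exact rule is in the linked PR (rook/rook pull 5673); as an illustration only, an alert of this kind can be expressed against the mgr-exported utilization metrics roughly as follows (the expression and threshold here are an assumption, not necessarily the shipped rule):

- alert: CephClusterReadOnly
  expr: ceph_cluster_total_used_raw_bytes / ceph_cluster_total_bytes >= 0.85
  labels:
    severity: critical
  annotations:
    message: Storage cluster utilization has crossed 85% and the cluster will become read-only. Free up or add capacity.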

Comment 21 Yaniv Kaul 2020-06-24 15:35:25 UTC
Please provide devel-ack.

Comment 26 akarsha 2020-07-29 13:38:21 UTC
Tested on an AWS IPI environment (internal mode):
- 3 master
- 3 worker (OCS)
- Monitoring is backed by OCS

Version
OCP: 4.5.0-0.nightly-2020-07-25-031342
OCS: ocs-operator.v4.5.0-508.ci

Observations
At 85% the cluster went into a read-only state (i.e., client write IO stopped and only read IO continued), but I was able to expand the cluster through the UI.
As expected, we see both alerts, at 75% and at 85%. One of the monitoring pods went into CreateContainerError state since it is backed by OCS, but the UI was still accessible.
Once the cluster was expanded, the monitoring pod came back to Running state.

- snapshot when cluster reached ~75%
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1818736/after-logs/snapshots/when-77%25/

- snapshot when cluster reached ~85%
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1818736/after-logs/snapshots/when-85%25/

- snapshot add capacity when cluster reached ~85%
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1818736/after-logs/snapshots/add-capacity/


Logs
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1818736/
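For completeness, one way to confirm the new rule is installed in a cluster under test (illustrative command, not taken from this verification run):

$ oc -n openshift-storage get prometheusrules -o yaml | grep -B2 -A8 "CephClusterReadOnly"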

Comment 27 akarsha 2020-07-29 13:40:45 UTC
Based on comment #26, moving the BZ to verified.

Comment 29 errata-xmlrpc 2020-09-15 10:16:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

