Bug 2228359
| Summary: | Alerts CephOSDSlowOps and CephMdsMissingReplicas do not appear on 4.10 cluster | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Daniel Osypenko <dosypenk> |
| Component: | ceph-monitoring | Assignee: | Divyansh Kamboj <dkamboj> |
| Status: | CLOSED WORKSFORME | QA Contact: | Daniel Osypenko <dosypenk> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10 | CC: | nthomas, odf-bz-bot |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-08-07 12:51:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Daniel Osypenko
2023-08-02 07:44:20 UTC
I tried reproducing CephMdsMissingReplicas on an ODF 4.10.14 cluster and was able to get the alert on a fresh cluster, but I noticed a caveat: the alert will only trigger when there is one mds pod; if there are zero mds pods in total, we do not get the alert. For CephOSDSlowOps, can you provide the steps you took that would result in the alert firing?

(apologies, Bugzilla posts the message if I press Enter)

@dosypenk can you confirm the behaviour in comment 4?

Steps to reproduce CephOSDSlowOps:
1. Reduce the osd_op_complaint_time value to 0.1: `ceph config set osd osd_op_complaint_time 0.1`
2. Create 2 PVCs with overall capacity equal to 90% of the storage capacity.
3. Fill up the PVCs and verify, using the Prometheus API, that CephOSDSlowOps appears while data is being written to the PVCs.

https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/27670/console
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/27662/console
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/27671/console

Tried reproducing CephOSDSlowOps, and the alert works fine on a fresh cluster. @dosypenk can you provide details for a cluster that has this issue reproduced, so we can look at the metric data? That isn't available in the must-gather.

(In reply to Divyansh Kamboj from comment #8)

Hello Divyansh, I've added must-gather logs to the body of the bug report. Unfortunately the cluster has already been destroyed.

(In reply to Daniel Osypenko from comment #9)
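The "verify via the Prometheus API" step in the reproduction instructions above can be sketched as follows. The payload shape follows the standard Prometheus `/api/v1/alerts` response; the helper name and the sample response are illustrative, not taken from the bug's cluster:

```python
import json

def alert_is_firing(alerts_response: dict, alert_name: str) -> bool:
    """Return True if the named alert is present and in the 'firing'
    state in a Prometheus /api/v1/alerts response body."""
    alerts = alerts_response.get("data", {}).get("alerts", [])
    return any(
        a.get("labels", {}).get("alertname") == alert_name
        and a.get("state") == "firing"
        for a in alerts
    )

# Hypothetical payload shaped like a Prometheus /api/v1/alerts response.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "alerts": [
      {"labels": {"alertname": "CephOSDSlowOps", "severity": "warning"},
       "state": "firing",
       "activeAt": "2023-08-07T07:37:58Z"}
    ]
  }
}
""")

print(alert_is_firing(sample, "CephOSDSlowOps"))          # True
print(alert_is_firing(sample, "CephMdsMissingReplicas"))  # False
```

In a real test the response body would come from the in-cluster Prometheus route rather than a hard-coded string.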
must-gather logs don't provide the metric data of the cluster. We'll need to look at the metrics for a few hours and then correlate them with the alerts and the logs to understand why the issue is happening (or which component is malfunctioning). If possible, can you reproduce it and send the details? The issues are not reproducible on the fresh clusters I create following the instructions provided in the bug.

(In reply to Divyansh Kamboj from comment #10)

With a fresh cluster I reran the test with osd_op_complaint_time set to 0.1; at 14:23 IST it filled the capacity to 84.4% and the test failed. During 50 minutes, CephOSDSlowOps did not appear.
credentials -> https://url.corp.redhat.com/cluster

Thanks for investigating.

(In reply to Daniel Osypenko from comment #11)

What time was the test run? I can see the query for the alert pop up values around 7th Aug, 13:07 IST for almost 1.5 minutes; the threshold for the metric is 30 s. So it looks like the alert was triggered when SLOW_OPS was reported by Ceph. The Ceph logs have the last mention of SLOW_OPS around 2023-08-07T07:38:28.770+0000 (7th Aug, 13:08 IST), which correlates with the data the metric shows us. The alert behaves as intended in the cluster above. Closing as not reproducible.
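The closing reasoning above (alert values present for ~1.5 minutes against a 30 s threshold) can be sketched as a rough analogue of a Prometheus `for:` hold; the sampling interval and function name are illustrative, not the actual ODF alert rule:

```python
def condition_held(samples, threshold_seconds=30.0):
    """Given (timestamp, value) samples of a slow-ops metric, return True
    if value > 0 held continuously for at least threshold_seconds --
    a rough analogue of a Prometheus 'for:' clause."""
    start = None
    for ts, value in samples:
        if value > 0:
            if start is None:
                start = ts
            if ts - start >= threshold_seconds:
                return True
        else:
            start = None  # condition broke; the hold timer resets
    return False

# SLOW_OPS reported for ~90 seconds (sampled every 15 s) crosses the 30 s hold:
print(condition_held([(0, 0), (15, 1), (30, 1), (45, 2),
                      (60, 1), (75, 1), (90, 1), (105, 0)]))  # True

# A single 15 s blip does not:
print(condition_held([(0, 0), (15, 1), (30, 0)]))  # False
```

This matches the conclusion in the comment: ~1.5 minutes of reported SLOW_OPS comfortably exceeds the 30 s threshold, so the alert firing there was expected behaviour.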