Bug 1999753 - OSD flapping alert not triggered
Summary: OSD flapping alert not triggered
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: gowtham
QA Contact: Anna Sandler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-31 17:14 UTC by Anna Sandler
Modified: 2023-08-09 16:37 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-24 10:04:06 UTC
Embargoed:



Description Anna Sandler 2021-08-31 17:14:28 UTC
Description of problem (please be as detailed as possible and provide log
snippets): deleting the OSD pod with the same id more than 5 times in less than 5 minutes does not trigger the alert "CephOSDFlapping".


Version of all relevant components (if applicable):
OCS 4.9

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)? No


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 1


Is this issue reproducible? Yes


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Run:
for i in {1..5}; do x=$(oc get pods -A | grep rook-ceph-osd-0 | awk '{print $2}'); oc delete pod $x -n openshift-storage; done

2. See that the alert is not shown in the UI.

* This was also tested in ocs-ci automation: https://github.com/red-hat-storage/ocs-ci/pull/4651

Actual results: the alert is not triggered


Expected results: the alert should be triggered


Additional info:

Comment 2 Travis Nielsen 2021-08-31 19:56:43 UTC
The alert is likely sensitive to how quickly the OSD is restarting, how long the OSD was down, and other factors. Ceph owns the health warning so I am not aware of the details of what it takes to trigger the warning. 

Do you see the health warning from the "ceph status" in the toolbox? If so, then the operator should detect that warning within a minute of it being raised, then it would show up on the CephCluster CR status, and then the alert would be triggered. But if Ceph isn't raising the health warning in the first place, the OCS alert cannot be triggered.
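
As a minimal sketch of that check (assuming the rook-ceph-tools toolbox deployment exists in openshift-storage; the name may differ per install):

     # Check Ceph health from the toolbox (deployment name is an assumption)
     oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph status
     oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph health detail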

Another approach instead of restarting the OSDs could be to scale down an OSD deployment, wait a few seconds, then scale it back up. This way, perhaps the OSD will be down long enough for Ceph to notice. If it's just a pod deletion, the OSD will typically start back up quickly.
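
A rough sketch of that approach (the deployment name rook-ceph-osd-0 and the 45s wait are assumptions; adjust as needed):

     # Scale the OSD deployment down, keep it down long enough for Ceph to notice, then scale it back up
     oc -n openshift-storage scale deployment rook-ceph-osd-0 --replicas=0
     sleep 45
     oc -n openshift-storage scale deployment rook-ceph-osd-0 --replicas=1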

Comment 3 Anna Sandler 2021-09-02 02:05:29 UTC
By the documentation of the alert in bug https://bugzilla.redhat.com/show_bug.cgi?id=1935342,
the alert should be triggered if the OSD restarts 5 times in less than 5 minutes.

My approaches to testing were:

1. Delete the rook-ceph-osd-0-****** pod 6 times or more in a row, or with some timeout between deletions.
This approach did not work since the new pod comes up very quickly.

2. Restart the deployment/rook-ceph-osd 6 times in a row, with some timeout between restarts.
Same effect as in the first approach.

3. Scale the deployment down and up 6 times in a row with 45s between the scale-down and the scale-up (45*6=270s < 300s, which is 5 minutes).
I did manage to trigger the CephOSDDiskNotResponding alert, which stated that there is a problem with the OSD, but only 4 out of 6 scale-downs raised this alert, and 5 are needed to trigger the CephOSDFlapping alert. I could have done the scaling a couple more times, but it takes the cluster 30-60 seconds to notice that the OSD is down, and I have to stay within the 300s window.

It's not really a bug, but this alert is just not triggerable IMO.

Comment 4 Travis Nielsen 2021-09-13 16:00:10 UTC
Since a Ceph health issue was not raised, Rook cannot trigger the alert. There is not a Rook issue here.

Comment 5 Filip Balák 2021-09-16 11:31:31 UTC
(In reply to Anna Sandler from comment #3)
> 3. Scale the deployment down and up 6 times in a row with 45s between the
> scale-down and the scale-up (45*6=270s < 300s, which is 5 minutes).
> I did manage to trigger the CephOSDDiskNotResponding alert, which stated that
> there is a problem with the OSD, but only 4 out of 6 scale-downs raised this
> alert, and 5 are needed to trigger the CephOSDFlapping alert. I could have
> done the scaling a couple more times, but it takes the cluster 30-60 seconds
> to notice that the OSD is down, and I have to stay within the 300s window.
> 
> It's not really a bug, but this alert is just not triggerable IMO.

The bug is that the CephOSDFlapping alert is not triggerable due to Ceph restrictions (Ceph won't notice that the OSD went down 5 times in 5 minutes), as Anna pointed out.
The alert rules should be updated so that the alert can be triggered and adds value to the customer, or, if it is not possible to raise the alert, it should be removed from ODF alerting.

Comment 6 gowtham 2021-09-22 12:31:12 UTC
As per the comment: https://bugzilla.redhat.com/show_bug.cgi?id=1935342#c10

Going from up to down and back to up is counted as 2 changes, so restarting/deleting the OSD pod 5 times within 5 minutes gives 5*2 = 10 changes.

The alerting condition is correct: changes(ceph_osd_up[5m]) >= 10.
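
For reference, a minimal way to evaluate that expression against the cluster Prometheus (a sketch; the prometheus-k8s route in openshift-monitoring and the token handling are assumptions that may differ per cluster):

     # Query the alert expression directly from Prometheus
     TOKEN=`oc whoami -t`
     PROM=`oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}'`
     curl -sk -H "Authorization: Bearer $TOKEN" \
       --data-urlencode 'query=changes(ceph_osd_up[5m])' \
       "https://$PROM/api/v1/query"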


I tested this in 3 ways:
  1. Deleted the pod 5 times using the script without any sleep in between:
       for i in {1..5}; do x=$(oc get pods -n openshift-storage | grep rook-ceph-osd-0 | awk '{print $1}'); oc delete pod $x -n openshift-storage; done
  2. Manually deleted the OSD pod each time it came back up and was running successfully after the previous deletion.
  3. Deleted the pod using the same script with a 45-second interval (overall it took around 4 min 20 sec to finish):
       for i in {1..5}; do x=$(oc get pods -n openshift-storage | grep rook-ceph-osd-0 | awk '{print $1}'); oc delete pod $x -n openshift-storage; sleep 45s; done



In each case, I saw different results in the metric values. None of them were correct, and each result differed from the others.

For more info, please check the graph (each spike is the result of one of the above testing methods, in the same order).

Comment 8 gowtham 2021-09-22 13:38:46 UTC
This bug depends on how quickly the restart happens and on what the status of the OSD is at the moment the Prometheus exporter collects metrics. The OSD may have restarted and Ceph may have reported it correctly, but the status may have changed again before the exporter saw it.

So instead of pushing up/down as a metric, is it possible to push a count itself? Can ceph/rook maintain the last 5 minutes of OSD status? I feel we can't solve this race condition in the alert condition alone.

This is my understanding.
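
One rough way to see how briefly Ceph itself reports the OSD as down during a pod delete, which is what has to be caught between metric collections (a sketch; assumes the rook-ceph-tools toolbox deployment and an arbitrary 10s polling interval):

     # Poll the OSD up count here while deleting the OSD pod in another terminal
     while true; do
         date '+%T'
         oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph osd stat
         sleep 10
     done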

Comment 9 Anna Sandler 2021-09-22 22:04:46 UTC
This bug was raised because this alert is not triggerable in any way we have tried.
Since it's not triggerable, QE can't test it and be sure it will really work when needed.

So there are 2 ways to fix this:
- provide steps to test it 
- change the triggering conditions of this alert

Comment 11 gowtham 2021-09-23 16:18:39 UTC
Hi Anna and Filip,
   I found the intention of this alert. It does not work as "restart 5 times and the alert is raised". When restarts keep happening (any number of times), the Prometheus exporter will be able to detect at least 5 of them and confirm that the OSD keeps restarting. So the 5 is there for the Prometheus exporter to deduce that restarts are ongoing, not an exact restart count.

I can raise this alert every time within 5 mins when I keep restarting.

The script which I used (you can adjust the count):

 for i in {1..30}
 do
      x=`oc get pods -n openshift-storage | grep rook-ceph-osd-0 | awk '{print $1}'`
      oc delete pod "$x" -n openshift-storage
      # Wait until the replacement OSD pod reports Running
      result=`oc get pods -n openshift-storage -o=jsonpath='{.items[?(@.metadata.labels.osd=="0")].status.phase}'`
      until [ "$result" == "Running" ]
      do
          echo "pod is down"
          sleep 5s
          result=`oc get pods -n openshift-storage -o=jsonpath='{.items[?(@.metadata.labels.osd=="0")].status.phase}'`
      done
      echo "Pod is running"
      # Give the OSD time to report back as up before the next delete
      sleep 20s
 done

It's not a bug, I think it's a misunderstanding.

Comment 13 gowtham 2021-09-27 08:05:29 UTC
Hi Filip and Anna,

I had discussed this alert with @pcuzner, @shan, and @tnielsen. 

A few findings:

    1. As per the documentation, OSD flapping happens when the cluster network (or just a host's NIC) experiences latency, packet drops, errors, or outages, triggering erroneous updates to the mons over the public network. The doc clearly says OSD flapping is not caused by daemon restarts.

       Ref: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-osds#flapping-osds_diag

    2. The flapping OSDs scenario does not include the situation where the OSD processes are started and then immediately killed.

Based on the above findings, 
    * The above reproduction steps are not valid. There is no such case for OSD flapping.
    * I would suggest QE test with a valid scenario.
    * If the expectation is to detect OSD restarts, then please create a new RFE BZ to initiate further discussion. (To test an OSD restart, we only need to bring down the OSD container, not the entire pod.)
      Command to restart the OSD:
         x=`oc get pods -n openshift-storage | grep rook-ceph-osd-0 | awk '{print $1}'`
         oc -n openshift-storage exec "$x" -c osd -- bash -c "unset CEPH_ARGS && ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok assert"

    * Removing the blocker flag.

Comment 15 Anna Sandler 2021-09-29 07:09:25 UTC
Will try reproducing it this way.

