Bug 1904966 - Etcd alert rules are not written as per the etcd backend performance requirement
Summary: Etcd alert rules are not written as per the etcd backend performance requirement
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Dean West
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-07 09:31 UTC by Aditya Deshpande
Modified: 2024-12-20 19:27 UTC (History)
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-27 11:26:51 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 6326071 0 None None None 2021-09-13 03:38:51 UTC

Description Aditya Deshpande 2020-12-07 09:31:03 UTC
Description of problem:

The following two alerts, etcdHighFsyncDurations and etcdHighCommitDurations, are meant to check whether the underlying storage is performing well:
~~~
      - alert: etcdHighFsyncDurations
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
        expr: |
          histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
          > 0.5
        for: 10m
        labels:
          severity: warning

      - alert: etcdHighCommitDurations
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.'
        expr: |
          histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
          > 0.25
        for: 10m
        labels:
          severity: warning

~~~

According to these alert rules, the alerts fire when the p99 wal_fsync_duration_seconds exceeds 500ms and the p99 backend_commit_duration_seconds exceeds 250ms.

The threshold values of 500ms and 250ms are far too high compared to the etcd backend storage performance requirement.
Backend performance requirement:
~~~
To rule out a slow disk from causing Etcd warnings, you can monitor metrics backend_commit_duration_seconds (p99 duration should be less than 25ms) and wal_fsync_duration_seconds (p99 duration should be less than 10ms) to confirm the storage is reasonably fast.
~~~
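
For reference, the current p99 values can be checked directly in Prometheus with the same expressions the alert rules use (metric names and label matchers are taken from the rules quoted above); this is just a convenience for comparing against the documented targets:
~~~
# p99 WAL fsync duration per etcd instance; documented target is < 10ms (0.01s)
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))

# p99 backend commit duration per etcd instance; documented target is < 25ms (0.025s)
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
~~~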

So, can we revise the threshold values in the alert rules?

Customers rely on these alerts to tell them whether their OCP clusters are working properly or misbehaving, and the gap relative to the best-practice values is very large (10ms versus 500ms).
If wal_fsync_duration_seconds consistently stays below 500ms and backend_commit_duration_seconds below 250ms, yet both remain out of line with the etcd benchmarks, no one will be notified and the problem will persist as it is.
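
For illustration only, a rule aligned with the documented fsync target could look like the sketch below; the 0.01s threshold simply mirrors the requirement quoted above and is not an agreed-upon change, and the commit-duration rule would change analogously to 0.025s:
~~~
      - alert: etcdHighFsyncDurations
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
        expr: |
          histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
          > 0.01
        for: 10m
        labels:
          severity: warning
~~~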


Version-Release number of selected component (if applicable):
This likely affects all OCP 4.x releases; I checked the alert rules on 4.6.
The alert rules are the same in 3.11 as well.

How reproducible:
Always

Actual results:
The alerting rules are not set according to the etcd backend performance requirement.

Expected results:
Revise the alerting rules so that issues with etcd backend storage are surfaced and can be addressed.

Additional info:
I have discussed this situation on #forum-etcd slack channel with Sam Batschelet.

Comment 1 Sam Batschelet 2020-12-21 19:51:07 UTC
While I agree alerts should be accurately defined, the goal of an alert is to inform the admin that they have an abnormal issue with the cluster. At runtime today, most if not all of the fleet runs above the 10ms threshold. Beyond that, the impact depends on etcd tuning: a 200ms p99 could be very disruptive on an AWS cluster and result in leader elections, while on Azure you would not see that disruption. So we must be careful, as alerts today are static for every cluster. We don't want to flap on expected conditions, but we also want users to understand when disks are not performant enough for the workload they are running.

For now we can review the current alert thresholds, but I believe a more conditional approach to these alerts based on observed runtime is probably the path forward. For example, on Azure we might consider X critical while on AWS we consider Y. The most important thing is that if disks are the root cause of extended disruption we alert on that, rather than on a transient spike. If we are not doing that today, we should consider retuning.
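
As a hypothetical illustration of alerting on sustained disruption rather than a transient spike, a rule could evaluate over a longer window and require the condition to hold longer; the alert name, window, threshold, and severity below are illustrative only and not a proposed change:
~~~
      # Hypothetical sketch: sustained disk slowness, not a one-off spike
      - alert: etcdSustainedHighFsyncDurations
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": sustained high p99 fsync durations ({{ $value }}s) on etcd instance {{ $labels.instance }}.'
        expr: |
          histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[30m]))
          > 0.5
        for: 30m
        labels:
          severity: critical
~~~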

We are leaning towards a few proactive approaches:

[1]: https://issues.redhat.com/browse/ETCD-119 - Run baseline fio performance analysis on the cluster to ensure disks are adequate to run OCP. Goal: never create a cluster that has invalid disks.
[2]: https://issues.redhat.com/browse/CORS-1575 - Provide users with a benchmark subcommand of the installer so that the cluster can be benchmark tested. Goal: ensure the hardware configuration of the control plane meets customer expectations for the expected workloads.
[3]: https://issues.redhat.com/browse/ETCD-121 - Adjust the etcd runtime to tolerate underperforming disks and alert the admin that they must upgrade hardware.

