Bug 1701805 - etcdHighNumberOfFailedGRPCRequests alerts are activated shortly after the cluster is UP
Keywords:
Status: CLOSED DUPLICATE of bug 1701154
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-04-22 02:53 UTC by Junqi Zhao
Modified: 2019-04-24 16:45 UTC (History)
CC List: 0 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-23 12:43:02 UTC
Target Upstream Version:
Embargoed:


Attachments
etcdHighNumberOfFailedGRPCRequests alerts (140.28 KB, image/png)
2019-04-22 02:53 UTC, Junqi Zhao

Description Junqi Zhao 2019-04-22 02:53:55 UTC
Created attachment 1557071
etcdHighNumberOfFailedGRPCRequests alerts

Description of problem:
There are two etcdHighNumberOfFailedGRPCRequests rules, one with severity warning (> 1% for 10m) and one with severity critical (> 5% for 5m):
**********************************************
alert: etcdHighNumberOfFailedGRPCRequests
expr: 100
  * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m]))
  / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m]))
  > 1
for: 10m
labels:
  severity: warning
annotations:
  message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for
    {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
**********************************************

alert: etcdHighNumberOfFailedGRPCRequests
expr: 100
  * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m]))
  / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m]))
  > 5
for: 5m
labels:
  severity: critical
annotations:
  message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for
    {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
**********************************************
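
As a rough illustration of how the two rules above behave, the following Python sketch mirrors the rule expression (100 * failed rate / total rate) and the "for" hold time. The function names, the 30-second evaluation interval, and the sample values are assumptions made for the example; they do not come from the cluster in this report.

```python
def failed_percent(failed_rate, total_rate):
    """Mirror of the rule expression: 100 * failed / total.

    Returns None when there is no traffic, since the comparison in the
    rule would not produce a result in that case.
    """
    if total_rate == 0:
        return None
    return 100.0 * failed_rate / total_rate


def fires(percents, threshold, for_seconds, interval_seconds=30):
    """Return True once percent > threshold has held for at least for_seconds.

    percents holds one value per rule evaluation. interval_seconds is an
    assumed evaluation interval, not a value taken from the report.
    """
    active_since = None
    for i, percent in enumerate(percents):
        now = i * interval_seconds
        if percent is not None and percent > threshold:
            if active_since is None:
                active_since = now  # condition just became true (pending)
            if now - active_since >= for_seconds:
                return True  # condition held long enough (firing)
        else:
            active_since = None  # condition broken, hold timer resets
    return False


# Hypothetical series: every evaluation reports 100% failed requests.
always_failing = [failed_percent(2.0, 2.0)] * 21  # covers 0s..600s

critical = fires(always_failing, threshold=5, for_seconds=300)
warning = fires(always_failing, threshold=1, for_seconds=600)
```

With a steady 100% failure ratio, the critical rule (5m) would be expected to fire before the warning rule (10m), because its hold time is shorter.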

etcd monitoring is now enabled by default, and the etcdHighNumberOfFailedGRPCRequests alerts are activated shortly after the cluster is UP.
result for
100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5

Element 											Value
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.164.132:9979",job="etcd"}	100
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.144.28:9979",job="etcd"}	100
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.129.99:9979",job="etcd"}	100

result for
100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 1

Element 											Value
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.164.132:9979",job="etcd"}	100
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.144.28:9979",job="etcd"}	100
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.129.99:9979",job="etcd"}	100
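
The results above show only that 100% of Watch requests carry a grpc_code other than "OK"; they do not show which codes are involved. One possible way to look further is to break the same counter down by grpc_code. The sketch below only builds such a query and the matching Prometheus HTTP API URL. The helper names and the base URL are hypothetical, and the query is a suggestion, not something taken from this report.

```python
from urllib.parse import urlencode


def breakdown_query(method="Watch", window="5m"):
    """PromQL that splits the handled-request rate by instance and grpc_code."""
    return (
        "sum by(instance, grpc_code) "
        f'(rate(grpc_server_handled_total{{job=~".*etcd.*",grpc_method="{method}"}}[{window}]))'
    )


def query_url(base_url, promql):
    """Instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical Prometheus address; substitute the one for your cluster.
url = query_url("http://localhost:9090", breakdown_query())
```

The resulting URL can be fetched with any HTTP client that is authorized against the cluster's Prometheus instance.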


Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-04-20-175518

How reproducible:
Always

Steps to Reproduce:
1. Check alerts in alertmanager after the cluster is UP.

Actual results:
etcdHighNumberOfFailedGRPCRequests alerts are activated shortly after the cluster is UP

Expected results:
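etcd should perform well, and the etcdHighNumberOfFailedGRPCRequests alerts should not be activated on a freshly started cluster.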

Additional info:

Comment 1 Sam Batschelet 2019-04-23 12:43:02 UTC

*** This bug has been marked as a duplicate of bug 1701154 ***

