Created attachment 1557071
etcdHighNumberOfFailedGRPCRequests alerts

Description of problem:
There are two etcdHighNumberOfFailedGRPCRequests rules, see below:
**********************************************
alert: etcdHighNumberOfFailedGRPCRequests
expr: 100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 1
for: 10m
labels:
  severity: warning
annotations:
  message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
**********************************************
alert: etcdHighNumberOfFailedGRPCRequests
expr: 100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5
for: 5m
labels:
  severity: critical
annotations:
  message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
**********************************************

etcd monitoring is now enabled by default, and the etcdHighNumberOfFailedGRPCRequests alerts are activated shortly after the cluster is up.

Result for
100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5

Element    Value
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.164.132:9979",job="etcd"}    100
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.144.28:9979",job="etcd"}    100
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.129.99:9979",job="etcd"}    100

Result for
100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 1

Element    Value
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.164.132:9979",job="etcd"}    100
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.144.28:9979",job="etcd"}    100
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.129.99:9979",job="etcd"}    100

Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-04-20-175518

How reproducible:
Always

Steps to Reproduce:
1. Check alerts in alertmanager after the cluster is up.

Actual results:
Both etcdHighNumberOfFailedGRPCRequests alerts (warning and critical) are activated shortly after the cluster is up, with 100% of Watch requests reported as failed on all three etcd instances.

Expected results:
etcd should perform well; the etcdHighNumberOfFailedGRPCRequests alerts should not fire shortly after the cluster is up.

Additional info:
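To see which gRPC status codes account for the 100% value, the same metric can be broken down by grpc_code. This is only a sketch: it reuses the metric name, the job matcher, and the 5m window from the alert expression above, and restricts the query to grpc_method="Watch" because that is the only method present in the results. Which codes it returns on this cluster was not checked for this report.

```promql
# Per-instance request rate for the Watch method, split by gRPC status code.
# Assumption: the grpc_code label carries values such as "OK", "Canceled",
# or "Unavailable", as it does in the alert's grpc_code!="OK" matcher.
sum by (job, instance, grpc_method, grpc_code) (
  rate(grpc_server_handled_total{job=~".*etcd.*", grpc_method="Watch"}[5m])
)
```

If no series with grpc_code="OK" comes back for Watch, the numerator and denominator of the alert expression are equal for that method, which would explain a value of exactly 100.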
*** This bug has been marked as a duplicate of bug 1701154 ***