Bug 1701805
Summary: | etcdHighNumberOfFailedGRPCRequests alerts are activated shortly after the cluster is UP | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
Component: | Etcd | Assignee: | Sam Batschelet <sbatsche> |
Status: | CLOSED DUPLICATE | QA Contact: | ge liu <geliu> |
Severity: | medium | Docs Contact: | |
Priority: | medium | | |
Version: | 4.1.0 | | |
Target Milestone: | --- | | |
Target Release: | 4.1.0 | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2019-04-23 12:43:02 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Attachments: | | | |
*** This bug has been marked as a duplicate of bug 1701154 ***
Created attachment 1557071
etcdHighNumberOfFailedGRPCRequests alerts

Description of problem:
There are two etcdHighNumberOfFailedGRPCRequests rules, shown below:
**********************************************
alert: etcdHighNumberOfFailedGRPCRequests
expr: 100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 1
for: 10m
labels:
  severity: warning
annotations:
  message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
**********************************************
alert: etcdHighNumberOfFailedGRPCRequests
expr: 100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5
for: 5m
labels:
  severity: critical
annotations:
  message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
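
Both rules above share the same expression and differ only in threshold and "for" duration. The arithmetic can be sketched in Python as follows. This is only an illustration, not how Prometheus evaluates the rule: the instance address, the rates, and the helper names failed_percent and severity are hypothetical, and the "for" durations are not modeled.

```python
from collections import defaultdict

# Hypothetical per-second request rates keyed by
# (instance, grpc_method, grpc_code). Illustrative values only,
# standing in for rate(grpc_server_handled_total[5m]).
RATES = {
    ("10.0.0.1:9979", "Watch", "Unavailable"): 0.25,
    ("10.0.0.1:9979", "Range", "OK"): 49.5,
    ("10.0.0.1:9979", "Range", "Unavailable"): 0.5,
    ("10.0.0.1:9979", "Put", "OK"): 10.0,
}


def failed_percent(rates):
    """Return the percentage of non-OK requests per (instance, method)."""
    failed = defaultdict(float)
    total = defaultdict(float)
    for (instance, method, code), rate in rates.items():
        key = (instance, method)
        total[key] += rate
        if code != "OK":
            failed[key] += rate
    # Note: PromQL returns no sample for a group that has no non-OK
    # series at all; this sketch reports 0.0 for such a group instead.
    return {k: 100 * failed[k] / total[k] for k in total if total[k] > 0}


def severity(percent):
    """Map a percentage to the highest matching rule threshold."""
    if percent > 5:
        return "critical"
    if percent > 1:
        return "warning"
    return None


for key, pct in sorted(failed_percent(RATES).items()):
    print(key, pct, severity(pct))
```

Because the result is a ratio, a method whose every request ends with a non-OK code evaluates to 100 no matter how low its request rate is, while a method at exactly 1% does not cross the "> 1" threshold.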
**********************************************
etcd monitoring is now enabled by default, and the etcdHighNumberOfFailedGRPCRequests alerts are activated shortly after the cluster is UP.

Result for
100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5

Element    Value
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.164.132:9979",job="etcd"}    100
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.144.28:9979",job="etcd"}    100
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.129.99:9979",job="etcd"}    100

Result for
100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 1

Element    Value
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.164.132:9979",job="etcd"}    100
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.144.28:9979",job="etcd"}    100
{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.0.129.99:9979",job="etcd"}    100

Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-04-20-175518

How reproducible:
Always

Steps to Reproduce:
1. Check alerts in alertmanager after the cluster is UP.

Actual results:
etcdHighNumberOfFailedGRPCRequests alerts are activated shortly after the cluster is UP

Expected results:
etcd should perform well

Additional info: