Bug 1840150

Summary: Report event in case of excessive leader changes that include disk metrics
Product: OpenShift Container Platform
Reporter: Suresh Kolichala <skolicha>
Component: Etcd Operator
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED ERRATA
QA Contact: ge liu <geliu>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.5
CC: geliu, ingvarr.zhmakin, mfojtik, sbatsche, wlewis
Target Milestone: ---
Target Release: 4.4.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1827585
Environment:
Last Closed: 2020-09-15 17:32:44 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1827585
Bug Blocks:

Comment 11 ge liu 2020-09-09 08:56:44 UTC
Verified with 4.4.0-0.nightly-2020-09-08-111845

Forced etcd leader changes by introducing a typo into /etc/kubernetes/manifests/etcd-pod.yaml to bring the etcd pods down, then checked the events. The warning message fired:

3m47s       Warning   EtcdLeaderChangeMetrics                 deployment/etcd-operator                         Detected 2.5 leader changes in last 5 minutes on "AWS" disk metrics are: etcd-ip-10-0-148-126.us-east-2.compute.internal=0.001993,etcd-ip-10-0-185-97.us-east-2.compute.internal=0.003454999999999999,etcd-ip-10-0-197-59.us-east-2.compute.internal=0.004200000000000011
2m48s       Warning   EtcdLeaderChangeMetrics                 deployment/etcd-operator                         Detected 2.5 leader changes in last 5 minutes on "AWS" disk metrics are: etcd-ip-10-0-148-126.us-east-2.compute.internal=0.001992999999999999,etcd-ip-10-0-185-97.us-east-2.compute.internal=0.003455000000000009,etcd-ip-10-0-197-59.us-east-2.compute.internal=0.00419999999999997
2m36s       Warning   ClusterMemberControllerUpdatingStatus   deployment/etcd-operator                         rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field
89s         Warning   UnhealthyEtcdMember                     deployment/etcd-operator                         unhealthy members: ip-10-0-197-59.us-east-2.compute.internal,ip-10-0-185-97.us-east-2.compute.internal
89s         Warning   EtcdLeaderChangeMetrics                 deployment/etcd-operator                         Detected 5 leader changes in last 5 minutes on "AWS" disk metrics are: etcd-ip-10-0-148-126.us-east-2.compute.internal=0.001992999999999999,etcd-ip-10-0-185-97.us-east-2.compute.internal=0.003454999999999981,etcd-ip-10-0-197-59.us-east-2.compute.internal=0.004199999999999969
48s         Warning   EtcdLeaderChangeMetrics                 deployment/etcd-operator                         Detected 6.25 leader changes in last 5 minutes on "AWS" disk metrics are: etcd-ip-10-0-148-126.us-east-2.compute.internal=0.0019929999999999995,etcd-ip-10-0-185-97.us-east-2.compute.internal=0.003454999999999998,etcd-ip-10-0-197-59.us-east-2.compute.internal=0.004199999999999979
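
When verifying output like the above, it can help to pull the leader-change count and the per-member disk metrics out of each EtcdLeaderChangeMetrics message. The following is a minimal sketch only: the message layout is inferred from the sample events in this comment, not from the operator source, and `EVENT_RE` and `parse_leader_change_event` are hypothetical names introduced for illustration.

```python
import re

# Pattern inferred from the sample events above (an assumption, not the
# operator's documented format).
EVENT_RE = re.compile(
    r'Detected (?P<changes>[\d.]+) leader changes in last (?P<window>\d+) minutes '
    r'on "(?P<platform>[^"]+)" disk metrics are: (?P<metrics>.+)$'
)


def parse_leader_change_event(message):
    """Return a dict with the parsed fields, or None if the message does not match."""
    match = EVENT_RE.search(message)
    if match is None:
        return None
    disk_metrics = {}
    # Metrics are a comma-separated list of "<member>=<value>" pairs.
    for pair in match.group("metrics").split(","):
        member, _, value = pair.strip().rpartition("=")
        disk_metrics[member] = float(value)
    return {
        "changes": float(match.group("changes")),
        "window_minutes": int(match.group("window")),
        "platform": match.group("platform"),
        "disk_metrics": disk_metrics,
    }
```

Passing the message column of one of the events above to `parse_leader_change_event` yields the reported change count, the platform string, and one disk metric value per etcd member; messages from other event types return `None`.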

Comment 13 errata-xmlrpc 2020-09-15 17:32:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.4.21 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3605