This bug was initially created as a copy of Bug #1730413.

I am copying this bug because: target 4.2.0.

A number of clusters in the wild on 4.1.z (15-20?) are reporting one etcd member down via the `up` metric, but no alerts related to etcd failure are being reported. Other clusters with one etcd member reported down ARE reporting alerts related to a bad member.

Cluster 80f5da7e-7527-41d2-8d6e-774b388a42a4 reports the following alerts: KubeDeploymentReplicasMismatch, KubePodNotReady, TargetDown, Watchdog, and two down services:

  up{_id="80f5da7e-7527-41d2-8d6e-774b388a42a4",endpoint="etcd-metrics",instance="172.16.0.34:9979",job="etcd",monitor="prometheus",namespace="openshift-etcd",pod="etcd-member-host-172-16-0-34",prometheus="openshift-monitoring/k8s",prometheus_replica="prometheus-telemeter-0",replica="$(HOSTNAME)",service="etcd"} 0

  up{_id="80f5da7e-7527-41d2-8d6e-774b388a42a4",endpoint="metrics",instance="172.16.0.40:9101",job="sdn",monitor="prometheus",namespace="openshift-sdn",pod="sdn-wtl58",prometheus="openshift-monitoring/k8s",prometheus_replica="prometheus-telemeter-0",replica="$(HOSTNAME)",service="sdn"}

This is a UPI cluster at 4.1.4. We may have a scheduling issue with the etcd proxy, but more data needs to be gathered.
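For reference, an alert on down etcd members can be expressed as a Prometheus alerting rule over the same `up` metric shown above. The rule below is only a hedged sketch: the rule name, threshold, `for` duration, and annotation text are illustrative and may differ from what actually ships in OpenShift monitoring.

```yaml
# Illustrative sketch (not the shipped rule): fire a critical alert
# when one or more etcd scrape targets report up == 0.
groups:
- name: etcd
  rules:
  - alert: etcdMembersDown
    # Count etcd targets whose last scrape reported down.
    expr: count(up{job="etcd"} == 0) > 0
    # Require the condition to persist briefly to avoid flapping.
    for: 3m
    labels:
      severity: critical
    annotations:
      message: 'etcd cluster "etcd": members are down ({{ $value }}).'
```

If no such rule is loaded (or the rule's selectors do not match the etcd targets), Prometheus will show the member as down in `up` without ever raising an alert, which matches the symptom reported here.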
Sam, this issue sounds like it falls within the Monitoring team's scope. Monitoring would watch for an etcd member being down and then report an alert, right? Is there any alert message reported by etcd itself?
Verified with 4.2.0-0.nightly-2019-08-07-214151. In the Monitoring section of the web console, got the alert message: etcdMembersDown - etcd cluster "etcd": members are down (1). Pending since a minute ago. Critical.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922