Description of problem:

https://coreos.slack.com/archives/CHY2E1BL4/p1614781326060700

[FIRING:1] ProbeFailing blackbox (https://api.build01.ci.devcluster.openshift.com:6443/healthz prow-monitoring/prow critical)
Probing the instance https://api.build01.ci.devcluster.openshift.com:6443/healthz has been failing for the past minute.

While upgrading build01 from 4.7.0-rc.1 to 4.7.0, the above alert fired because https://api.build01.ci.devcluster.openshift.com:6443/healthz did not return 200. This happened several times and eventually recovered; the longest hiccup lasted about 1.5m, which is what triggered the alert.

The alert is not built into OpenShift; DPTP created it with https://github.com/prometheus/blackbox_exporter

Must-gather: https://coreos.slack.com/archives/CHY2E1BL4/p1614783623067200?thread_ts=1614781326.060700&cid=CHY2E1BL4
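For context, a minimal sketch of reproducing the probe by hand; the curl invocation below is an assumption about what the blackbox probe effectively does (an HTTP GET expecting a 200), not the actual DPTP probe configuration:

```
# Hypothetical manual check of the endpoint the alert probes: print only the
# HTTP status code of the kube-apiserver health endpoint. During the upgrade
# this intermittently returned a non-200 code.
curl -sk -o /dev/null -w '%{http_code}\n' \
  https://api.build01.ci.devcluster.openshift.com:6443/healthz
```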
Sorry, I haven't looked into this BZ yet. I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Oops, sorry, I haven't looked into this BZ yet. I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
It looks like /healthz returned a non-200 status code because the etcd health check failed. For example, the check failed for kube-apiserver-ip-10-0-159-123.ec2.internal for about 1m:

2021-03-03T14:17:30.758649313Z [-]etcd failed: error getting data from etcd: context deadline exceeded
…
2021-03-03T14:18:26.813925086Z [-]etcd failed: error getting data from etcd: context deadline exceeded

During that time the API server was restarted by the kubelet due to a failed liveness probe:

14:18:00 openshift-kube-apiserver kubelet kube-apiserver-ip-10-0-159-123.ec2.internal Killing Container kube-apiserver failed liveness probe, will be restarted
14:19:17 openshift-kube-apiserver apiserver kube-apiserver-ip-10-0-159-123.ec2.internal TerminationMinimalShutdownDurationFinished The minimal shutdown duration of 1m10s finished

Moving to the etcd team to investigate why etcd was unavailable during that time.
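As a side note, the failing check can be narrowed down directly against the kube-apiserver; a minimal sketch, assuming cluster-admin access with `oc` (the verbose report and per-check endpoints such as /healthz/etcd are standard kube-apiserver health endpoints):

```
# Show the verbose health report, which lists each individual check
# as [+] ok or [-] failed.
oc get --raw '/healthz?verbose'

# Query only the etcd health check; during this incident it was the check
# reporting "error getting data from etcd: context deadline exceeded".
oc get --raw '/healthz/etcd'
```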
# etcd-ip-10-0-140-81.ec2.internal
2021-03-03T14:17:20.585755025Z 2021-03-03 14:17:20.585711 N | pkg/osutil: received terminated signal, shutting down...

# etcd-ip-10-0-159-123.ec2.internal
2021-03-03T14:17:27.738750499Z 2021-03-03 14:17:27.738708 W | etcdserver/api/etcdhttp: /health error; QGET failed etcdserver: request timed out (status code 503)

# etcd-ip-10-0-175-171.ec2.internal
2021-03-03T14:17:26.862141318Z 2021-03-03 14:17:26.862127 W | etcdserver: read-only range request "key:\"/kubernetes.io/controlplane.operator.openshift.io/podnetworkconnectivitychecks/ci-op-vjx98ktb/\" range_end:\"/kubernetes.io/controlplane.operator.openshift.io/podnetworkconnectivitychecks/ci-op-vjx98ktb0\" " with result "range_response_count:0 size:8" took too long (6.187236627s) to execute
2021-03-03T14:17:26.862263572Z 2021-03-03 14:17:26.862243 W | etcdserver: read-only range request "key:\"/kubernetes.io/poddisruptionbudgets/ci-op-nshpn573/ci-operator-created-by-ci\" " with result "range_response_count:1 size:541" took too long (6.177427407s) to execute
2021-03-03T14:17:26.872591815Z 2021-03-03 14:17:26.872533 W | etcdserver: read-only range request "key:\"/kubernetes.io/ingress/\" range_end:\"/kubernetes.io/ingress0\" limit:10000 " with result "range_response_count:1 size:1528" took too long (5.179564511s) to execute

The theory here is that the kube-apiserver going down for the upgrade put excessive CPU pressure on the other two nodes, resulting in a high iowait condition. This left etcd unable to make progress and timing out; the latency observed in the logs is very high, with many requests taking over 2 seconds.

Another data point is that the etcd instance was working much harder than it needed to because the backend was heavily fragmented. Defragmentation would have reclaimed about 3 GB, which would have improved write efficiency and reduced memory utilization. In 4.9 we introduced the defragmentation controller [1] to mitigate this risk.

```
$ cat etcd_info/endpoint_status.json | jq '(.[0].Status.dbSize - .[0].Status.dbSizeInUse)/1000/1000'
3095.384064
```

Without metrics and a deeper look I am not sure we can get a true RCA, and at this point I am not sure it would be actionable. Hongkai, how would you like to proceed?

[1] https://github.com/openshift/enhancements/pull/836
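For reference, on clusters older than 4.9 (without the defragmentation controller from [1]), the fragmentation can be inspected and reclaimed manually. A minimal sketch, assuming it is run from inside an etcd pod where etcdctl is already configured with the cluster endpoints and certificates:

```
# Compare dbSize and dbSizeInUse for every member; a large gap (~3 GB here)
# indicates a heavily fragmented backend.
etcdctl endpoint status --cluster -w table

# Reclaim the fragmented space on the endpoint(s) etcdctl is pointed at.
# Run against one member at a time during a quiet period, since
# defragmentation blocks reads and writes on that member while it runs.
etcdctl defrag
```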
Sam, thanks for digging into this. The issue in this bug has not happened since; build01 is now on `4.8.5`. With that being said, let us wait for the new controller in 4.9 (we are not far away from it). I am fine with closing this now.