Bug 1934628

Summary: API server stopped reporting healthy during upgrade to 4.7.0
Product: OpenShift Container Platform
Component: Etcd
Version: 4.7
Reporter: Hongkai Liu <hongkliu>
Assignee: Sam Batschelet <sbatsche>
QA Contact: ge liu <geliu>
CC: aos-bugs, jeder, mfojtik, wking
Status: CLOSED NOTABUG
Severity: medium
Priority: low
Keywords: Upgrades
Flags: mfojtik: needinfo?
Whiteboard: LifecycleReset
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: ---
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2021-08-13 21:21:05 UTC

Description Hongkai Liu 2021-03-03 15:16:50 UTC
Description of problem:

https://coreos.slack.com/archives/CHY2E1BL4/p1614781326060700
[FIRING:1] ProbeFailing blackbox (https://api.build01.ci.devcluster.openshift.com:6443/healthz prow-monitoring/prow critical)
Probing the instance https://api.build01.ci.devcluster.openshift.com:6443/healthz has been failing for the past minute.

During the upgrade of build01 from 4.7.0-rc.1 to 4.7.0, the above alert fired
because https://api.build01.ci.devcluster.openshift.com:6443/healthz did not return 200.

This happened several times and eventually recovered. The longest hiccup lasted about 1.5 minutes, which is what triggered the alert.

The alert is not built into OpenShift; DPTP created it with https://github.com/prometheus/blackbox_exporter.
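
For context, the probe boils down to an HTTP GET against /healthz; a minimal manual equivalent (a sketch, not the actual blackbox_exporter configuration) looks like this:

```
# Minimal stand-in for the blackbox probe: fetch /healthz and print only
# the HTTP status code. Anything other than 200 would trip the alert.
$ curl -sk -o /dev/null -w '%{http_code}\n' \
    https://api.build01.ci.devcluster.openshift.com:6443/healthz
```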


Must-Gather:
https://coreos.slack.com/archives/CHY2E1BL4/p1614783623067200?thread_ts=1614781326.060700&cid=CHY2E1BL4

Comment 1 Lukasz Szaszkiewicz 2021-03-19 16:25:19 UTC
Sorry, I haven't looked into this BZ yet. I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 2 Lukasz Szaszkiewicz 2021-04-09 14:47:26 UTC
Oops, sorry, I haven't looked into this BZ yet. I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 3 Michal Fojtik 2021-05-09 15:14:27 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 4 Lukasz Szaszkiewicz 2021-05-10 08:20:50 UTC
Sorry, I haven't looked into this BZ yet. I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 5 Michal Fojtik 2021-05-10 09:14:37 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 6 Lukasz Szaszkiewicz 2021-05-24 07:38:13 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 7 Michal Fojtik 2021-06-09 10:03:33 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 8 Lukasz Szaszkiewicz 2021-06-11 11:58:52 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 9 Lukasz Szaszkiewicz 2021-06-17 18:27:08 UTC
It looks like /healthz returned a non-200 status code because the etcd health check failed.

For example, the check failed on kube-apiserver-ip-10-0-159-123.ec2.internal for about a minute:

2021-03-03T14:17:30.758649313Z [-]etcd failed: error getting data from etcd: context deadline exceeded
…
2021-03-03T14:18:26.813925086Z [-]etcd failed: error getting data from etcd: context deadline exceeded
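
For reference, the per-check breakdown behind these `[-]etcd failed` lines can also be queried directly against a live cluster (a sketch assuming cluster-admin credentials):

```
# List the status of every /healthz sub-check; a failing etcd check shows
# up as "[-]etcd failed".
$ oc get --raw '/healthz?verbose'

# Query only the etcd health check.
$ oc get --raw /healthz/etcd
```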

During that time the API server was restarted by the kubelet due to a failed liveness probe:


14:18:00	openshift-kube-apiserver	kubelet	kube-apiserver-ip-10-0-159-123.ec2.internal	Killing	Container kube-apiserver failed liveness probe, will be restarted
14:19:17	openshift-kube-apiserver	apiserver	kube-apiserver-ip-10-0-159-123.ec2.internal	TerminationMinimalShutdownDurationFinished	The minimal shutdown duration of 1m10s finished
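
These events can be pulled straight from the namespace (or read from the must-gather); for example:

```
# Recent events in the kube-apiserver namespace, oldest first; the Killing /
# failed-liveness-probe events above appear here.
$ oc get events -n openshift-kube-apiserver --sort-by=.lastTimestamp
```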


Moving to the etcd team to investigate why etcd was unavailable during that time.

Comment 10 Michal Fojtik 2021-06-17 18:29:50 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 11 Michal Fojtik 2021-07-17 18:45:37 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 12 Sam Batschelet 2021-08-12 15:21:03 UTC
# etcd-ip-10-0-140-81.ec2.internal
> 2021-03-03T14:17:20.585755025Z 2021-03-03 14:17:20.585711 N | pkg/osutil: received terminated signal, shutting down...

# etcd-ip-10-0-159-123.ec2.internal
2021-03-03T14:17:27.738750499Z 2021-03-03 14:17:27.738708 W | etcdserver/api/etcdhttp: /health error; QGET failed etcdserver: request timed out (status code 503)

# etcd-ip-10-0-175-171.ec2.internal
> 2021-03-03T14:17:26.862141318Z 2021-03-03 14:17:26.862127 W | etcdserver: read-only range request "key:\"/kubernetes.io/controlplane.operator.openshift.io/podnetworkconnectivitychecks/ci-op-vjx98ktb/\" range_end:\"/kubernetes.io/controlplane.operator.openshift.io/podnetworkconnectivitychecks/ci-op-vjx98ktb0\" " with result "range_response_count:0 size:8" took too long (6.187236627s) to execute
2021-03-03T14:17:26.862263572Z 2021-03-03 14:17:26.862243 W | etcdserver: read-only range request "key:\"/kubernetes.io/poddisruptionbudgets/ci-op-nshpn573/ci-operator-created-by-ci\" " with result "range_response_count:1 size:541" took too long (6.177427407s) to execute
2021-03-03T14:17:26.872591815Z 2021-03-03 14:17:26.872533 W | etcdserver: read-only range request "key:\"/kubernetes.io/ingress/\" range_end:\"/kubernetes.io/ingress0\" limit:10000 " with result "range_response_count:1 size:1528" took too long (5.179564511s) to execute

The theory here is that the kube-apiserver going down for the upgrade put excessive CPU pressure on the other two control-plane nodes, resulting in a high iowait condition. This left etcd unable to make progress and caused requests to time out. The latency observed in the logs is very high, with many requests taking over 2 seconds.
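
A rough way to test that theory after the fact (a sketch, assuming host access to a control-plane node and an etcd member log saved locally as etcd.log, which is a hypothetical filename):

```
# Sustained high %iowait on the control-plane node would support the
# CPU/disk-pressure theory.
$ iostat -x 5 3

# Count how many etcd requests were logged as slow during the incident
# window (14:17-14:18 UTC on 2021-03-03).
$ grep 'took too long' etcd.log | grep -c '2021-03-03T14:1[78]'
```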


Another data point: the etcd instance was working much harder than it needed to because the backend was heavily fragmented. Defragmentation would have reclaimed about 3 GB, which would have improved write efficiency and reduced memory utilization. In 4.9 we introduced the defragmentation controller[1] to mitigate this risk.

```
# Reclaimable (fragmented) space in the etcd backend, in MB
$ cat etcd_info/endpoint_status.json | jq '(.[0].Status.dbSize - .[0].Status.dbSizeInUse)/1000/1000'
3095.384064
```
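
On releases before 4.9 the reclaimable space can be freed by hand; a sketch of the manual defragmentation, one member at a time since defrag blocks the member while it runs:

```
# Defragment one etcd member at a time (defrag is blocking; avoid doing
# this during an upgrade or under heavy load).
$ oc rsh -n openshift-etcd etcd-ip-10-0-159-123.ec2.internal
sh-4.4# etcdctl endpoint status -w table   # dbSize before
sh-4.4# etcdctl defrag
sh-4.4# etcdctl endpoint status -w table   # dbSize should drop toward dbSizeInUse
```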

Without metrics and a deeper look I am not sure we can get a true RCA, and at this point I am not sure it would be actionable anyway. Hongkai, how would you like to proceed?

[1] https://github.com/openshift/enhancements/pull/836

Comment 13 Michal Fojtik 2021-08-12 15:53:28 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 14 Hongkai Liu 2021-08-13 20:19:15 UTC
Sam, thanks for digging into this.
The issue described in this bug has not happened since.
build01 is now on `4.8.5`.

With that said, let us wait for the new controller in 4.9 (we are not far from it).
I am fine with closing it now.