Description of problem:
We tried upgrading a 250-node 4.5.16 cluster loaded with 4000 projects to 4.6.1, and the cluster became unresponsive with a highly unstable control plane. etcd went through more than 200 leader changes and API server resource usage spiked.
Each namespace has the following objects deployed:
- 12 imagestreams
- 3 buildconfigs
- 6 builds
- 1 deployment with 2 pod replicas (sleep) mounting two secrets each. deployment-2pod
- 2 deployments with 1 pod replicas (sleep) mounting two secrets. deployment-1pod
- 3 services: one pointing to deployment-2pod and the other two pointing to deployment-1pod
- 3 routes: 1 pointing to the service for deployment-2pod and the other two pointing to deployment-1pod
- 20 secrets
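Adding up the inventory above (the pods themselves excluded), each namespace carries about 50 API objects, so 4000 projects put roughly 200,000 objects into etcd. A quick sanity check of that arithmetic:

```shell
# Object count per namespace from the list above (pods themselves excluded)
per_ns=$((12 + 3 + 6 + 1 + 2 + 3 + 3 + 20))
echo "$per_ns"            # objects per namespace -> 50
echo $((per_ns * 4000))   # total across 4000 projects -> 200000
```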
Looking at the etcd logs, etcdserver goes through multiple leader elections. It times out waiting for read index responses because requests are taking too long to execute (~13 seconds):
2020-11-03T13:04:16.534490976+00:00 stderr F raft2020/11/03 13:04:16 INFO: ebe6ccd39aeec40f is starting a new election at term 19158
2020-11-03T13:04:16.534490976+00:00 stderr F raft2020/11/03 13:04:16 INFO: ebe6ccd39aeec40f became candidate at term 19159
2020-11-03T13:04:16.534490976+00:00 stderr F raft2020/11/03 13:04:16 INFO: ebe6ccd39aeec40f received MsgVoteResp from ebe6ccd39aeec40f at term 19159
2020-11-03T13:04:16.534534599+00:00 stderr F raft2020/11/03 13:04:16 INFO: ebe6ccd39aeec40f [logterm: 18923, index: 2708076] sent MsgVote request to 95c065288b6f55f2 at term 19159
2020-11-03T13:04:16.534534599+00:00 stderr F raft2020/11/03 13:04:16 INFO: ebe6ccd39aeec40f [logterm: 18923, index: 2708076] sent MsgVote request to ece85e3cd31e9a37 at term 19159
2020-11-03T13:04:16.643921330+00:00 stderr F 2020-11-03 13:04:16.643875 W | etcdserver: timed out waiting for read index response (local node might have slow network)
2020-11-03T13:04:16.643962091+00:00 stderr F 2020-11-03 13:04:16.643955 W | etcdserver: read-only range request "key:\"/openshift.io/templates\" range_end:\"/openshift.io/templatet\" count_only:true " with result "error:etcdserver: request timed out" took too long (13.44553171s) to execute
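Slow-request warnings like the one above can be pulled out of the captured pod logs and ranked by duration; a minimal sketch, where etcd.log is a placeholder for the captured log file:

```shell
# Extract the "took too long (<N>s)" durations from etcdserver warnings and
# list the slowest requests first. etcd.log is a placeholder filename.
grep 'took too long' etcd.log \
  | sed -E 's/.*took too long \(([0-9.]+)s\).*/\1/' \
  | sort -rn | head
```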
We could not run must-gather as the API server was not responding but we did capture the journal and pod logs from the master nodes: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.6-sdn/upgrades/4.6-cluster-crash/.
Version-Release number of selected component (if applicable):
4.5.16 -> 4.6.1 upgrade
This was the first upgrade run at scale. We can re-run the test on a similar cluster environment to see if we can reproduce it if needed.
Steps to Reproduce:
1. Install a cluster using 4.5.16 bits.
2. Load the cluster with a large number of objects. https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner can be used to do this.
3. Try to upgrade to 4.6.1.
Actual results:
Upgrade failed. Cluster is unstable, with the API server not responding to client requests.
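Even with the API server unresponsive, leader churn can be confirmed from an etcd metrics scrape. A minimal parsing sketch, assuming a /metrics snapshot was saved to a file (metrics.txt is a placeholder; the metric name is the standard etcd leader-change counter):

```shell
# Print the per-member leader-change counter from a saved etcd /metrics snapshot.
# metrics.txt is a placeholder for a scrape captured with e.g. curl.
leader_changes() {
  grep '^etcd_server_leader_changes_seen_total' "$1" | awk '{print $2}'
}
leader_changes metrics.txt
```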
Expected results:
Cluster upgrades successfully to 4.6.1.
We tried upgrading 4.5.16 -> 4.6.2 at scale and it worked this time. We loaded up the cluster with 4k projects but waited a few hours for the cluster to settle before kicking off the upgrade; we are not sure how or if that helped with cluster stability, since we previously observed the control plane take a hit when we triggered the upgrade immediately after loading the cluster. etcd used roughly 10x the resources for most of the upgrade (etcd memory was around 25 GB vs 2.5 GB on the cluster before the upgrade), and fsync latency ranged from ~30 ms to ~100 ms during the upgrade. Raul Sevilla and Sam Batschelet found that this is caused by the change of I/O scheduler from mq-deadline to bfq. We are repeating the test run with mq-deadline as the I/O scheduler.
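The active scheduler for the etcd disk can be verified directly from sysfs, where the kernel prints the active scheduler in brackets; a sketch, with the device name nvme0n1 being an assumption:

```shell
# Show which I/O scheduler is active for a block device; the active one is the
# bracketed entry, e.g. "[mq-deadline] bfq none". Device name is an assumption.
active_scheduler() {
  sed -E 's/.*\[([^]]+)\].*/\1/' "$1"
}
active_scheduler /sys/block/nvme0n1/queue/scheduler

# To switch back for a test run (runtime only, not persistent across reboots):
# echo mq-deadline > /sys/block/nvme0n1/queue/scheduler
```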
This could be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1899600 which switches the IO scheduler only during upgrades.
(In reply to Dan Mace from comment #3)
> This could be a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=1899600 which switches the IO
> scheduler only during upgrades.
I don't think I was clear in my explanation there. https://bugzilla.redhat.com/show_bug.cgi?id=1899600 is the canonical tracker for an IO performance regression which probably explains the problem reported in this bug, and we intend to mark examples (e.g. this bug) as duplicates of https://bugzilla.redhat.com/show_bug.cgi?id=1899600.
*** This bug has been marked as a duplicate of bug 1900666 ***