Description of problem:
We tried upgrading a 250-node 4.5.16 cluster loaded with 4000 projects to 4.6.1, and the cluster became unresponsive with a highly unstable control plane. etcd went through more than 200 leader changes and API server resource usage spiked.
Each namespace has the following objects deployed:
- 12 imagestreams
- 3 buildconfigs
- 6 builds
- 1 deployment with 2 pod replicas (sleep) mounting two secrets each. deployment-2pod
- 2 deployments with 1 pod replicas (sleep) mounting two secrets. deployment-1pod
- 3 services: one pointing to deployment-2pod and the other two pointing to deployment-1pod
- 3 routes: 1 pointing to the service for deployment-2pod and the other two pointing to deployment-1pod
- 20 secrets
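Adding up the inventory above (the pods themselves excluded), each namespace carries about 50 API objects, so 4000 projects put roughly 200,000 objects into etcd. A quick sanity check of that arithmetic:

```shell
# Object count per namespace from the list above (pods themselves excluded)
per_ns=$((12 + 3 + 6 + 1 + 2 + 3 + 3 + 20))
echo "$per_ns"            # objects per namespace -> 50
echo $((per_ns * 4000))   # total across 4000 projects -> 200000
```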
Looking at the etcd logs, etcdserver goes through multiple leader elections. It times out waiting for read index responses because requests are taking too long to execute (~13 seconds):
2020-11-03T13:04:16.534490976+00:00 stderr F raft2020/11/03 13:04:16 INFO: ebe6ccd39aeec40f is starting a new election at term 19158
2020-11-03T13:04:16.534490976+00:00 stderr F raft2020/11/03 13:04:16 INFO: ebe6ccd39aeec40f became candidate at term 19159
2020-11-03T13:04:16.534490976+00:00 stderr F raft2020/11/03 13:04:16 INFO: ebe6ccd39aeec40f received MsgVoteResp from ebe6ccd39aeec40f at term 19159
2020-11-03T13:04:16.534534599+00:00 stderr F raft2020/11/03 13:04:16 INFO: ebe6ccd39aeec40f [logterm: 18923, index: 2708076] sent MsgVote request to 95c065288b6f55f2 at term 19159
2020-11-03T13:04:16.534534599+00:00 stderr F raft2020/11/03 13:04:16 INFO: ebe6ccd39aeec40f [logterm: 18923, index: 2708076] sent MsgVote request to ece85e3cd31e9a37 at term 19159
2020-11-03T13:04:16.643921330+00:00 stderr F 2020-11-03 13:04:16.643875 W | etcdserver: timed out waiting for read index response (local node might have slow network)
2020-11-03T13:04:16.643962091+00:00 stderr F 2020-11-03 13:04:16.643955 W | etcdserver: read-only range request "key:\"/openshift.io/templates\" range_end:\"/openshift.io/templatet\" count_only:true " with result "error:etcdserver: request timed out" took too long (13.44553171s) to execute
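Slow-request warnings like the one above can be pulled out of the captured pod logs and ranked by duration; a minimal sketch, where etcd.log is a placeholder for the captured log file:

```shell
# Extract the "took too long (<N>s)" durations from etcdserver warnings and
# list the slowest requests first. etcd.log is a placeholder filename.
grep 'took too long' etcd.log \
  | sed -E 's/.*took too long \(([0-9.]+)s\).*/\1/' \
  | sort -rn | head
```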
We could not run must-gather as the API server was not responding but we did capture the journal and pod logs from the master nodes: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.6-sdn/upgrades/4.6-cluster-crash/.
Version-Release number of selected component (if applicable):
4.5.16 -> 4.6.1 upgrade
This was the first upgrade run at scale. We can re-run the test on a similar cluster environment to see if we can reproduce it if needed.
Steps to Reproduce:
1. Install a cluster using 4.5.16 bits.
2. Load the cluster with a large number of objects. https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner can be used to do this.
3. Try to upgrade to 4.6.1.
Actual results:
Upgrade failed. Cluster is unstable, with the API server not responding to client requests.
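Even with the API server unresponsive, leader churn can be confirmed from an etcd metrics scrape. A minimal parsing sketch, assuming a /metrics snapshot was saved to a file (metrics.txt is a placeholder; the metric name is the standard etcd leader-change counter):

```shell
# Print the per-member leader-change counter from a saved etcd /metrics snapshot.
# metrics.txt is a placeholder for a scrape captured with e.g. curl.
leader_changes() {
  grep '^etcd_server_leader_changes_seen_total' "$1" | awk '{print $2}'
}
leader_changes metrics.txt
```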
Expected results:
Cluster upgrades successfully to 4.6.1.
We tried upgrading 4.5.16 -> 4.6.2 at scale and it worked this time. We loaded up the cluster with 4k projects but waited a few hours for the cluster to settle before kicking off the upgrade; we are not sure how or if that helped with cluster stability, since we previously observed the control plane take a hit when we triggered the upgrade immediately after loading the cluster. etcd used roughly 10x the resources for most of the upgrade (etcd memory was around 25 GB vs 2.5 GB on the cluster before the upgrade), and fsync latency ranged from ~30 ms to ~100 ms during the upgrade. Raul Sevilla and Sam Batschelet found that this is caused by the change of I/O scheduler from mq-deadline to bfq. We are repeating the test run with mq-deadline as the I/O scheduler.
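The active scheduler for the etcd disk can be verified directly from sysfs, where the kernel prints the active scheduler in brackets; a sketch, with the device name nvme0n1 being an assumption:

```shell
# Show which I/O scheduler is active for a block device; the active one is the
# bracketed entry, e.g. "[mq-deadline] bfq none". Device name is an assumption.
active_scheduler() {
  sed -E 's/.*\[([^]]+)\].*/\1/' "$1"
}
active_scheduler /sys/block/nvme0n1/queue/scheduler

# To switch back for a test run (runtime only, not persistent across reboots):
# echo mq-deadline > /sys/block/nvme0n1/queue/scheduler
```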
This could be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1899600 which switches the IO scheduler only during upgrades.
(In reply to Dan Mace from comment #3)
> This could be a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=1899600 which switches the IO
> scheduler only during upgrades.
I don't think I was clear in my explanation there. https://bugzilla.redhat.com/show_bug.cgi?id=1899600 is the canonical tracker for an IO performance regression which probably explains the problem reported in this bug, and we intend to mark examples (e.g. this bug) as duplicates of https://bugzilla.redhat.com/show_bug.cgi?id=1899600.
*** This bug has been marked as a duplicate of bug 1900666 ***