Bug 1868750 - aws upgrade is super slow
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Tomáš Nožička
QA Contact: RamaKasturi
URL:
Whiteboard:
Duplicates: 1865857 1866782
Depends On:
Blocks:
 
Reported: 2020-08-13 17:39 UTC by David Eads
Modified: 2020-09-04 12:33 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
[sig-cluster-lifecycle] cluster upgrade should be fast
Last Closed:
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift kubernetes pull 320 None closed bug 1868750: UPSTREAM: <drop>: don't use dynamic tokens for KCM 2020-09-21 09:21:49 UTC

Description David Eads 2020-08-13 17:39:28 UTC
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci

This job is consistently failing because the upgrade is very slow. We extended the timeout to see whether the upgrade eventually completes, and it does; however, we see times approaching 80 minutes, versus the "normal" GCP time of roughly 30 minutes.

When I investigated https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1293834530519519232 I found that the KCM acquired a lease and then did nothing for about 40 minutes. This left a networking daemonset not fully rolled out for about 40 minutes, which delayed the upgrade.

We don't have any logs for that time period because we're missing the capability to collect them, but given the frequency of the failure and the wide window, this should be reproducible locally. Remember: you need to catch this *during* an upgrade and figure out why the KCM isn't completing the daemonset rollout.

Comment 1 David Eads 2020-08-17 15:34:08 UTC
Referencing the test so Sippy can mark it:

cluster upgrade should be fast

Comment 2 Tomáš Nožička 2020-08-18 14:37:17 UTC
This definitely seems to be an issue. I was able to reproduce the half-stuck upgrade from
quay.io/openshift-release-dev/ocp-release:4.5.6-x86_64
to
registry.svc.ci.openshift.org/ocp/release@sha256:27f71a857a17d47260026efb94343acfe0df693766de2071c7a35f30e660c306 (latest 4.6.0)

It seems to be caused by the KCM getting Unauthorized:

E0818 13:08:07.744923       1 daemon_controller.go:331] openshift-sdn/ovs failed with : error storing status for daemon set &v1.DaemonSet ...: Unauthorized

That held for a while, and then it started to get 409s because of stuck informers. I suppose it eventually succeeded (maybe by hitting a different apiserver through the LB) and then the nodes got restarted. So it either recovered before the restart, or made enough progress against a good kube-apiserver to reach the restart point and was fixed by the restart.


However, the credentials on disk seemed to be valid:

sh-4.4# openssl verify -CAfile /etc/kubernetes/static-pod-resources/kube-apiserver-certs/configmaps/client-ca/ca-bundle.crt /etc/kubernetes/static-pod-resources/kube-controller-manager-certs/secrets/kube-controller-manager-client-cert-key/tls.crt 
/etc/kubernetes/static-pod-resources/kube-controller-manager-certs/secrets/kube-controller-manager-client-cert-key/tls.crt: OK

and I could use them to connect to a localhost KAS before the restart (I want to try that again tomorrow, right at the start of the outage).


I am not yet sure how we got there but I'll continue tomorrow.

Comment 3 Tomáš Nožička 2020-08-21 11:49:42 UTC
*** Bug 1866782 has been marked as a duplicate of this bug. ***

Comment 4 Stefan Schimanski 2020-08-21 11:59:11 UTC
*** Bug 1865857 has been marked as a duplicate of this bug. ***

Comment 5 Tomáš Nožička 2020-08-21 12:00:32 UTC
Still trying to figure it out. KAS rolls out an upgrade, then the cloud credential operator changes authConfig.Spec.ServiceAccountIssuer:

  https://github.com/openshift/cloud-credential-operator/blob/8d54516/pkg/operator/oidcdiscoveryendpoint/controller.go#L244-L271

which triggers another KAS rollout. During this rollout, KCM clients using service account tokens get Unauthorized for several minutes (roughly 5 to 20), but it seems to recover eventually.

Comment 6 Tomáš Nožička 2020-08-25 08:14:08 UTC
To sum up the slack threads:
- KCM is using the bound tokens to create kubeconfigs for its controllers

1. KASO fully rolls out the new KAS
2. KCM updates fine
3. other operators update
4. CCO updates and changes the service account issuer
5. KASO does a second full rollout of KAS to reflect the change and activates the new service account issuer which invalidates all bound tokens
6. KCM gets Unauthorized for tens of minutes at least

The PR disables the bound token auth in KCM and falls back to SA tokens.
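Step 5 above is the crux: bound (projected) service account tokens carry the issuer in their `iss` claim, so once the KAS is reconfigured with a new issuer, every token minted under the old issuer fails validation until it is reissued. A minimal Python sketch of that mismatch (the issuer URLs, claims, and token here are hypothetical examples, not values taken from the cluster):

```python
import base64
import json

def jwt_issuer(token: str) -> str:
    """Return the `iss` claim of a JWT payload without verifying the signature."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))["iss"]

def encode(obj) -> str:
    """base64url-encode a JSON object the way JWT segments are encoded (no padding)."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()

# Hypothetical bound token minted before the issuer change.
old_claims = {
    "iss": "https://kubernetes.default.svc",
    "sub": "system:serviceaccount:kube-system:kube-controller-manager",
}
token = ".".join([encode({"alg": "RS256"}), encode(old_claims), "signature"])

# Hypothetical new issuer configured by the cloud credential operator.
new_issuer = "https://oidc-discovery.example.com"

print(jwt_issuer(token) == new_issuer)  # prints: False -- the old token no longer matches
```

With the issuers out of sync, the apiserver rejects the old bound token as Unauthorized, which is consistent with the KCM symptoms above; falling back to legacy SA tokens sidesteps the issuer check entirely.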

Comment 9 RamaKasturi 2020-08-28 05:24:42 UTC
Verification of the bug is blocked due to https://bugzilla.redhat.com/show_bug.cgi?id=1868786

