Bug 1868750 - aws upgrade is super slow
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Tomáš Nožička
QA Contact: RamaKasturi
URL:
Whiteboard:
Duplicates: 1865857 1866782
Depends On:
Blocks:
 
Reported: 2020-08-13 17:39 UTC by David Eads
Modified: 2020-09-04 12:33 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
[sig-cluster-lifecycle] cluster upgrade should be fast
Last Closed:
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift kubernetes pull 320 None closed bug 1868750: UPSTREAM: <drop>: don't use dynamic tokens for KCM 2020-09-21 09:21:49 UTC

Description David Eads 2020-08-13 17:39:28 UTC
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci

This job is consistently failing because the upgrade is very slow. We extended the timeout to see whether the upgrade eventually completes, and it does; however, we see times approaching 80 minutes, versus the "normal" GCP time of roughly 30 minutes.

When I investigated https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1293834530519519232 I found that the KCM acquired a lease and then did nothing for about 40 minutes. This left a networking daemonset not fully rolled out for about 40 minutes, which delayed the upgrade.

We don't have any logs for that time period because we're missing the capability to collect them, but given the frequency of the failure and the wide window, this should be reproducible locally. Remember: you need to catch this *during* an upgrade and figure out why the KCM isn't completing the daemonset rollout.

Comment 1 David Eads 2020-08-17 15:34:08 UTC
Referencing the test so Sippy can mark it:

cluster upgrade should be fast

Comment 2 Tomáš Nožička 2020-08-18 14:37:17 UTC
This definitely seems to be an issue. I was able to reproduce the half-stuck upgrade from
quay.io/openshift-release-dev/ocp-release:4.5.6-x86_64
to
registry.svc.ci.openshift.org/ocp/release@sha256:27f71a857a17d47260026efb94343acfe0df693766de2071c7a35f30e660c306 (latest 4.6.0)

It seems to be caused by the KCM getting Unauthorized:

E0818 13:08:07.744923       1 daemon_controller.go:331] openshift-sdn/ovs failed with : error storing status for daemon set &v1.DaemonSet ...: Unauthorized

That held for a while, and then it started to get 409s because of stuck informers. I suppose it eventually succeeded (maybe by hitting a different apiserver through the LB) and then the nodes got restarted. So it either recovered before the restart, or made enough progress against a good kube-apiserver to reach the restart point and was fixed by the restart.


However, the credentials on disk seemed to be valid:

sh-4.4# openssl verify -CAfile /etc/kubernetes/static-pod-resources/kube-apiserver-certs/configmaps/client-ca/ca-bundle.crt /etc/kubernetes/static-pod-resources/kube-controller-manager-certs/secrets/kube-controller-manager-client-cert-key/tls.crt 
/etc/kubernetes/static-pod-resources/kube-controller-manager-certs/secrets/kube-controller-manager-client-cert-key/tls.crt: OK

and I could use them to connect to a localhost KAS before the restart (I want to try that again tomorrow, right at the start of the outage).


I am not yet sure how we got there but I'll continue tomorrow.

Comment 3 Tomáš Nožička 2020-08-21 11:49:42 UTC
*** Bug 1866782 has been marked as a duplicate of this bug. ***

Comment 4 Stefan Schimanski 2020-08-21 11:59:11 UTC
*** Bug 1865857 has been marked as a duplicate of this bug. ***

Comment 5 Tomáš Nožička 2020-08-21 12:00:32 UTC
Still trying to figure it out. KAS rolls out an upgrade, then the cloud credential operator changes authConfig.Spec.ServiceAccountIssuer:

  https://github.com/openshift/cloud-credential-operator/blob/8d54516/pkg/operator/oidcdiscoveryendpoint/controller.go#L244-L271

which triggers another KAS rollout. During this rollout, KCM clients using service account tokens get Unauthorized for several minutes (roughly 5 to 20), but it seems to recover eventually.

Comment 6 Tomáš Nožička 2020-08-25 08:14:08 UTC
To sum up the slack threads:
- KCM is using the bound tokens to create kubeconfigs for its controllers

1. KASO fully rolls out the new KAS
2. KCM updates fine
3. other operators update
4. CCO updates and changes the service account issuer
5. KASO does a second full rollout of KAS to reflect the change and activates the new service account issuer which invalidates all bound tokens
6. KCM gets Unauthorized for tens of minutes at least

The PR disables the bound token auth in KCM and falls back to SA tokens.
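Step 5 above is the crux: bound (projected) service account tokens carry the issuer in their `iss` claim, so once the KAS is reconfigured with a new issuer, every token minted under the old issuer fails validation until it is reissued. A minimal Python sketch of that mismatch (the issuer URLs, claims, and token here are hypothetical examples, not values taken from the cluster):

```python
import base64
import json

def jwt_issuer(token: str) -> str:
    """Return the `iss` claim of a JWT payload without verifying the signature."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))["iss"]

def encode(obj) -> str:
    """base64url-encode a JSON object the way JWT segments are encoded (no padding)."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()

# Hypothetical bound token minted before the issuer change.
old_claims = {
    "iss": "https://kubernetes.default.svc",
    "sub": "system:serviceaccount:kube-system:kube-controller-manager",
}
token = ".".join([encode({"alg": "RS256"}), encode(old_claims), "signature"])

# Hypothetical new issuer configured by the cloud credential operator.
new_issuer = "https://oidc-discovery.example.com"

print(jwt_issuer(token) == new_issuer)  # prints: False -- the old token no longer matches
```

With the issuers out of sync, the apiserver rejects the old bound token as Unauthorized, which is consistent with the KCM symptoms above; falling back to legacy SA tokens sidesteps the issuer check entirely.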

Comment 9 RamaKasturi 2020-08-28 05:24:42 UTC
Verification of the bug is blocked due to https://bugzilla.redhat.com/show_bug.cgi?id=1868786

