Bug 1868750

Summary: aws upgrade is super slow
Product: OpenShift Container Platform
Component: kube-controller-manager
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Status: CLOSED ERRATA
Target Milestone: ---
Target Release: 4.6.0
Reporter: David Eads <deads>
Assignee: Tomáš Nožička <tnozicka>
QA Contact: RamaKasturi <knarra>
Docs Contact:
CC: aos-bugs, jsafrane, juzhao, knarra, mfojtik, vareti, wking
Keywords: Upgrades
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment: [sig-cluster-lifecycle] cluster upgrade should be fast
Last Closed: 2020-10-27 16:28:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description David Eads 2020-08-13 17:39:28 UTC
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci

This job is constantly failing to upgrade because the upgrade is very slow. We extended the timeout to see whether it eventually completes, and it does; however, we see times approaching 80 minutes instead of the "normal" GCP time of roughly 30 minutes.

When I investigated https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1293834530519519232 I found that the KCM acquired a lease and then did nothing for about 40 minutes. This left a networking daemonset not fully rolled out for about 40 minutes, which delayed the upgrade.

We don't have any logs for that time period because we're missing the capability, but given the frequency of the failure and the wide window, this should be reproducible locally. Remember, you need to catch this *during* an upgrade and figure out why the KCM isn't completing the daemonset rollout.
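
A minimal way to watch for this during a live upgrade (a sketch; it assumes cluster-admin access, that the 4.6 KCM records its leader election in a Lease in kube-system, and uses the openshift-sdn/ovs daemonset seen in the failing run):

  $ oc get lease -n kube-system                     # leader-election records; check who holds kube-controller-manager and when it last renewed
  $ oc get daemonset -n openshift-sdn               # daemonset status in the networking namespace
  $ oc rollout status daemonset/ovs -n openshift-sdn   # blocks until the ovs daemonset rollout completes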

Comment 1 David Eads 2020-08-17 15:34:08 UTC
Referencing the test to mark it in Sippy:

cluster upgrade should be fast

Comment 2 Tomáš Nožička 2020-08-18 14:37:17 UTC
This definitely seems to be an issue. I was able to reproduce the half-stuck upgrade from
quay.io/openshift-release-dev/ocp-release:4.5.6-x86_64
to
registry.svc.ci.openshift.org/ocp/release@sha256:27f71a857a17d47260026efb94343acfe0df693766de2071c7a35f30e660c306 (latest 4.6.0)
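
For reference, an upgrade to an unsigned CI payload like that can be driven with something along these lines (a sketch; --allow-explicit-upgrade and --force are needed because the CI release image is not signed or in the update graph):

  $ oc adm upgrade --allow-explicit-upgrade --force \
      --to-image registry.svc.ci.openshift.org/ocp/release@sha256:27f71a857a17d47260026efb94343acfe0df693766de2071c7a35f30e660c306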

This seems to be caused by the KCM getting Unauthorized:

E0818 13:08:07.744923       1 daemon_controller.go:331] openshift-sdn/ovs failed with : error storing status for daemon set &v1.DaemonSet ...: Unauthorized

That held for a while, and then it started to get 409s because of stuck informers. I suppose it eventually succeeded (maybe by hitting a different apiserver through the LB) and then the nodes got restarted. So it either recovered before the restart, or it made enough progress against a good kube-apiserver to reach the restart point and got fixed by the restart.


Although the credentials on disk seemed to be valid:

sh-4.4# openssl verify -CAfile /etc/kubernetes/static-pod-resources/kube-apiserver-certs/configmaps/client-ca/ca-bundle.crt /etc/kubernetes/static-pod-resources/kube-controller-manager-certs/secrets/kube-controller-manager-client-cert-key/tls.crt 
/etc/kubernetes/static-pod-resources/kube-controller-manager-certs/secrets/kube-controller-manager-client-cert-key/tls.crt: OK

and I could use them to connect to a localhost KAS before the restart (I want to try this again tomorrow right at the start of the outage).
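
Roughly, that check looks like this from the node (a sketch; the tls.key path is assumed to sit next to tls.crt, and -k skips verification of the apiserver serving certificate):

sh-4.4# curl -sk https://localhost:6443/healthz \
    --cert /etc/kubernetes/static-pod-resources/kube-controller-manager-certs/secrets/kube-controller-manager-client-cert-key/tls.crt \
    --key /etc/kubernetes/static-pod-resources/kube-controller-manager-certs/secrets/kube-controller-manager-client-cert-key/tls.key

An "ok" response means the client-cert path works, which points the Unauthorized errors at the token/auth layer rather than the cert itself.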


I am not yet sure how we got there, but I'll continue tomorrow.

Comment 3 Tomáš Nožička 2020-08-21 11:49:42 UTC
*** Bug 1866782 has been marked as a duplicate of this bug. ***

Comment 4 Stefan Schimanski 2020-08-21 11:59:11 UTC
*** Bug 1865857 has been marked as a duplicate of this bug. ***

Comment 5 Tomáš Nožička 2020-08-21 12:00:32 UTC
Still trying to figure it out. The KAS rolls out an upgrade, then the cloud credential operator changes authConfig.Spec.ServiceAccountIssuer

  https://github.com/openshift/cloud-credential-operator/blob/8d54516/pkg/operator/oidcdiscoveryendpoint/controller.go#L244-L271

which triggers another KAS rollout. During this rollout, KCM clients using service account tokens get Unauthorized for several minutes (roughly 5 to 20), but they seem to recover eventually.
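
The issuer change itself is easy to spot from the outside (a sketch; it assumes the field lives on the cluster-scoped Authentication config, which is what the CCO controller linked above writes):

  $ oc get authentication.config.openshift.io cluster -o jsonpath='{.spec.serviceAccountIssuer}{"\n"}'
  $ oc get clusteroperator kube-apiserver        # the second rollout shows up as Progressing=True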

Comment 6 Tomáš Nožička 2020-08-25 08:14:08 UTC
To sum up the Slack threads:
- KCM is using bound tokens to create kubeconfigs for its controllers

1. KASO fully rolls out the new KAS
2. KCM updates fine
3. other operators update
4. CCO updates and changes the service account issuer
5. KASO does a second full rollout of KAS to reflect the change and activates the new service account issuer, which invalidates all bound tokens
6. KCM gets Unauthorized for at least tens of minutes

The PR disables the bound token auth in KCM and falls back to SA tokens.
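
To see step 6 happening on a live cluster, one can tail the KCM pods for Unauthorized errors while the second KAS rollout is in progress (a sketch; the app=kube-controller-manager label and the kube-controller-manager container name on the static pods are assumptions):

  $ oc logs -n openshift-kube-controller-manager -l app=kube-controller-manager \
      -c kube-controller-manager -f --tail=50 | grep -i unauthorized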

Comment 9 RamaKasturi 2020-08-28 05:24:42 UTC
Verification of the bug is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1868786

Comment 12 errata-xmlrpc 2020-10-27 16:28:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196