Bug 1868750
| Summary: | aws upgrade is super slow | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | David Eads <deads> |
| Component: | kube-controller-manager | Assignee: | Tomáš Nožička <tnozicka> |
| Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.6 | CC: | aos-bugs, jsafrane, juzhao, knarra, mfojtik, vareti, wking |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | [sig-cluster-lifecycle] cluster upgrade should be fast |
| Last Closed: | 2020-10-27 16:28:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| 
        
Description
David Eads, 2020-08-13 17:39:28 UTC

Referencing the test so sippy can mark it: cluster upgrade should be fast

This definitely seems to be an issue. I was able to reproduce the half-stuck upgrade from quay.io/openshift-release-dev/ocp-release:4.5.6-x86_64 to registry.svc.ci.openshift.org/ocp/release@sha256:27f71a857a17d47260026efb94343acfe0df693766de2071c7a35f30e660c306 (latest 4.6.0).

It seems to be caused by KCM getting Unauthorized:

E0818 13:08:07.744923 1 daemon_controller.go:331] openshift-sdn/ovs failed with : error storing status for daemon set &v1.DaemonSet ...: Unauthorized

That held for a while, and then it started to get 409s because of stuck informers. I suppose it eventually succeeded (maybe by hitting a different apiserver through the LB) and then the nodes got restarted. So it either recovered before the restart, or it processed enough with a good kube-apiserver to get up to the restart point and got fixed by the restart.

The creds on disk seemed to be valid, though:

sh-4.4# openssl verify -CAfile /etc/kubernetes/static-pod-resources/kube-apiserver-certs/configmaps/client-ca/ca-bundle.crt /etc/kubernetes/static-pod-resources/kube-controller-manager-certs/secrets/kube-controller-manager-client-cert-key/tls.crt
/etc/kubernetes/static-pod-resources/kube-controller-manager-certs/secrets/kube-controller-manager-client-cert-key/tls.crt: OK

and I could use them to connect to a localhost KAS before the restart (I want to try it again tomorrow right at the start of the outage).

I am not yet sure how we got there, but I'll continue tomorrow.

*** Bug 1866782 has been marked as a duplicate of this bug. ***

*** Bug 1865857 has been marked as a duplicate of this bug. ***

Still trying to figure it out. KAS rolls out an upgrade, then the cloud credential operator changes authConfig.Spec.ServiceAccountIssuer (https://github.com/openshift/cloud-credential-operator/blob/8d54516/pkg/operator/oidcdiscoveryendpoint/controller.go#L244-L271), which triggers another KAS rollout.

During this rollout, KCM clients using service accounts get Unauthorized for several (about 5 to 20) minutes, but it seems to recover eventually.

To sum up the slack threads:
- KCM is using the bound tokens to create kubeconfigs for its controllers
1. KASO fully rolls out the new KAS
2. KCM updates fine
3. other operators update
4. CCO updates and changes the service account issuer
5. KASO does a second full rollout of KAS to reflect the change and activates the new service account issuer, which invalidates all bound tokens
6. KCM gets Unauthorized for tens of minutes at least

The PR disables the bound token auth in KCM and falls back to SA tokens.

Verification of the bug is blocked due to https://bugzilla.redhat.com/show_bug.cgi?id=1868786

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196 |
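
The rollout sequence summarized above comes down to one check: a token minted under the old service account issuer stops matching once the API server switches to the new one. The following is a minimal Python sketch of that check only, not the kube-apiserver implementation. It assumes a simplified authenticator that compares the `iss` claim and nothing else (no signature, expiry, audience, or bound-object checks). The issuer URLs, the subject, and all function names are hypothetical. It also assumes that legacy SA tokens carry the issuer string `kubernetes/serviceaccount` and stay accepted after the change, which is consistent with the fallback described in the PR comment but is not confirmed by this report.

```python
import base64
import json

# Assumption: issuer string carried by legacy (non-bound) SA tokens.
LEGACY_ISSUER = "kubernetes/serviceaccount"

# Hypothetical issuer values, before and after step 4 of the summary.
OLD_ISSUER = "https://old-issuer.example.invalid"
NEW_ISSUER = "https://new-issuer.example.invalid"

# Illustrative subject for a KCM controller client.
SUBJECT = "system:serviceaccount:kube-system:daemon-set-controller"


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, the way JWT segments are encoded."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_unsigned_token(issuer: str, subject: str) -> str:
    """Build a JWT-shaped token with an empty signature (illustration only)."""
    header = b64url(json.dumps({"alg": "none", "typ": "JWT"}).encode())
    payload = b64url(json.dumps({"iss": issuer, "sub": subject}).encode())
    return f"{header}.{payload}."


def token_issuer(token: str) -> str:
    """Read the `iss` claim from the payload segment without verifying anything."""
    segment = token.split(".")[1]
    padded = segment + "=" * (-len(segment) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))["iss"]


def authenticate(token: str, configured_issuer: str) -> int:
    """Return an HTTP-style status: 200 if the issuer is accepted, else 401."""
    accepted = {configured_issuer, LEGACY_ISSUER}
    return 200 if token_issuer(token) in accepted else 401


bound_token = make_unsigned_token(OLD_ISSUER, SUBJECT)
legacy_token = make_unsigned_token(LEGACY_ISSUER, SUBJECT)

before_rollout = authenticate(bound_token, OLD_ISSUER)     # steps 1 to 3
after_rollout = authenticate(bound_token, NEW_ISSUER)      # steps 5 and 6
legacy_after = authenticate(legacy_token, NEW_ISSUER)      # fallback to SA tokens

print(before_rollout, after_rollout, legacy_after)  # prints: 200 401 200
```

In this simplified model the bound token only becomes valid again once the client obtains a fresh token under the new issuer, which would match the observation that KCM recovers eventually or after a restart.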