This job constantly fails because the upgrade is very slow. We extended the timeout to see whether it eventually completes, and it does. However, we see times approaching 80 minutes instead of the "normal" GCP time of roughly 30 minutes.
When I investigated https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1293834530519519232 I found that the KCM got a lease and then did nothing for about 40 minutes. This left a networking daemonset not fully rolled out for the same period, which delayed the upgrade.
We don't have any logs for the time period because we're missing the capability, but given the frequency of the failure and the wide window, this should be reproducible locally. Remember, you need to catch this *during* an upgrade and figure out why the KCM isn't completing the daemonset rollout.
referencing test to mark sippy
cluster upgrade should be fast
This definitely seems to be an issue. I was able to reproduce the half-stuck upgrade from
registry.svc.ci.openshift.org/ocp/release@sha256:27f71a857a17d47260026efb94343acfe0df693766de2071c7a35f30e660c306 (latest 4.6.0)
It seems to be caused by the KCM getting Unauthorized:
E0818 13:08:07.744923 1 daemon_controller.go:331] openshift-sdn/ovs failed with : error storing status for daemon set &v1.DaemonSet ...: Unauthorized
That persisted for a while, and then the KCM started getting 409s because of stuck informers. I suppose it eventually succeeded (maybe by hitting a different apiserver through the LB) and then the nodes got restarted. So it either recovered before the restart, or it processed enough with a good kube-apiserver to reach the restart point and was fixed by the restart.
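To measure how long the KCM was stuck, the log can be scanned for the first and last Unauthorized error. The following is a minimal sketch, not a tool used in this investigation. It assumes the standard klog line layout (severity letter, MMDD, time, thread id, file:line). The function name `unauthorized_window`, the `year` parameter, and every sample line except the first are made up for illustration. klog lines carry no year, so the caller has to supply one.

```python
import re
from datetime import datetime

# klog header: <severity><MMDD> <hh:mm:ss.uuuuuu> <thread id> <file:line>] <message>
KLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ (?P<src>[^\]]+)\] (?P<msg>.*)$"
)


def unauthorized_window(lines, year):
    """Return (first, last, count) for log lines whose message mentions Unauthorized.

    Returns None when no such line is found. `year` must be supplied by the
    caller because klog timestamps do not include it.
    """
    stamps = []
    for line in lines:
        m = KLOG_RE.match(line)
        if not m or "Unauthorized" not in m["msg"]:
            continue
        stamp = f"{year}{m['month']}{m['day']} {m['time']}"
        stamps.append(datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f"))
    if not stamps:
        return None
    return min(stamps), max(stamps), len(stamps)


# First line is the error quoted in this report; the other two are synthetic.
SAMPLE = [
    "E0818 13:08:07.744923 1 daemon_controller.go:331] openshift-sdn/ovs failed with : "
    "error storing status for daemon set &v1.DaemonSet ...: Unauthorized",
    "I0818 13:10:00.000000 1 example.go:1] synthetic informational line",
    "E0818 13:13:07.744923 1 daemon_controller.go:331] synthetic repeat: Unauthorized",
]

first, last, count = unauthorized_window(SAMPLE, year=2020)  # year is an assumption
print(f"{count} Unauthorized errors over {(last - first).total_seconds():.0f}s")
```

Run against a real KCM log, the gap between `first` and `last` gives a lower bound on the outage, since the controller may have been stuck before the first logged error.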
The creds on disk seemed to be valid, though:
sh-4.4# openssl verify -CAfile /etc/kubernetes/static-pod-resources/kube-apiserver-certs/configmaps/client-ca/ca-bundle.crt /etc/kubernetes/static-pod-resources/kube-controller-manager-certs/secrets/kube-controller-manager-client-cert-key/tls.crt
I could also use them to connect to a localhost KAS before the restart (I want to try it again tomorrow, right at the start of the outage).
I am not yet sure how we got there but I'll continue tomorrow.
*** Bug 1866782 has been marked as a duplicate of this bug. ***
*** Bug 1865857 has been marked as a duplicate of this bug. ***
Still trying to figure it out. KAS rolls out an upgrade, then the cloud credential operator changes authConfig.Spec.ServiceAccountIssuer, which triggers another KAS rollout. During this rollout, KCM clients using service accounts get Unauthorized for several (about 5 to 20) minutes, but they seem to recover eventually.
To sum up the slack threads:
- KCM is using the bound tokens to create kubeconfigs for its controllers
1. KASO fully rolls out the new KAS
2. KCM updates fine
3. other operators update
4. CCO updates and changes the service account issuer
5. KASO does a second full rollout of KAS to reflect the change and activates the new service account issuer, which invalidates all bound tokens
6. KCM gets Unauthorized for tens of minutes at least
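The effect of steps 4 to 6 can be illustrated with a toy model of the issuer check. This is only a sketch under stated assumptions, not the kube-apiserver implementation: the real authenticator also verifies the signature, expiry, and audience, and the issuer URLs, the service account name, and all function names below are hypothetical. It shows only why a token minted under the old issuer stops being accepted once the apiserver is configured with a new one.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used in JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_unsigned_token(issuer: str) -> str:
    """Build an unsigned JWT-shaped token carrying only the claims needed here."""
    header = b64url(json.dumps({"alg": "none"}).encode())
    claims = {"iss": issuer, "sub": "system:serviceaccount:example-ns:example-sa"}
    payload = b64url(json.dumps(claims).encode())
    return f"{header}.{payload}."


def token_issuer(token: str) -> str:
    """Read the `iss` claim from the payload segment of a token."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore the stripped padding
    return json.loads(base64.urlsafe_b64decode(payload))["iss"]


def authenticate(token: str, configured_issuer: str) -> int:
    """Toy check: 200 if the token's issuer matches the configured one, else 401."""
    return 200 if token_issuer(token) == configured_issuer else 401


# Hypothetical issuer values standing in for the pre- and post-CCO settings.
OLD_ISSUER = "https://old-issuer.example"
NEW_ISSUER = "https://new-issuer.example"

token = make_unsigned_token(OLD_ISSUER)        # bound token held by the KCM
before = authenticate(token, OLD_ISSUER)       # after the first KAS rollout
after = authenticate(token, NEW_ISSUER)        # after the second KAS rollout
print(before, after)
```

In this model the KCM keeps getting 401 until it obtains a token minted under the new issuer, which matches the observed behaviour of recovering only after a long delay or a restart.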
The PR disables the bound token auth in KCM and falls back to SA tokens.
Verification of the bug is blocked due to https://bugzilla.redhat.com/show_bug.cgi?id=1868786