Bug 1813512

Summary: 4.3 -> 4.4 update hangs on ClusterOperatorDegraded, NodeInstallerDegraded, RevisionControllerDegraded, TargetConfigControllerDegraded
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: kube-controller-managerAssignee: Tomáš Nožička <tnozicka>
Status: CLOSED ERRATA QA Contact: zhou ying <yinzhou>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.4CC: aos-bugs, maszulik, mfojtik, sdodson, tnozicka, vrutkovs
Target Milestone: ---Keywords: Upgrades
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Upgrade race when snaphotting KCM localhost-recovery-client-token Consequence: Failed upgrade Fix: Fixed secret snaphotting to wait for token to be injected before making a snaphot Result: Upgrade works
Story Points: ---
Clone Of:
: 1819256 (view as bug list) Environment:
Last Closed: 2020-05-04 11:46:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1819256    
Bug Blocks:    

Description W. Trevor King 2020-03-14 04:58:22 UTC
Job [1].  Build-log symptoms:

fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:136]: during upgrade to registry.svc.ci.openshift.org/ocp/release@sha256:6fa3e6520d6668737d29a68ef7d7189642b07dba9b17511316210f336e9492b0
Unexpected error:
    <*errors.errorString | 0xc00470cf90>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.4.0-rc.1: 18% complete",
    }
    Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.4.0-rc.1: 18% complete
occurred
...
Mar 13 22:52:57.703 E clusterversion/version changed Failing to True: ClusterOperatorDegraded: Cluster operator kube-controller-manager is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 298:\nNodeInstallerDegraded: \nRevisionControllerDegraded: conflicting latestAvailableRevision 303\nTargetConfigControllerDegraded: "serviceaccount/localhost-recovery-client": secret openshift-kube-controller-manager/localhost-recovery-client-token hasn't been populated with SA token yet: missing SA UID

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/92/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-69d0187ff513f67a3dbe37c843e318a59ff27699f3c38dcd2d79df13bc176def/namespaces/openshift-cluster-version/pods/cluster-version-operator-5fffc549d9-kmq5k/cluster-version-operator/cluster-version-operator/logs/current.log | grep 'Running sync.*in state\|Result of work' | tail -n3
2020-03-13T23:01:42.3198065Z I0313 23:01:42.319768       1 task_graph.go:596] Result of work: [Cluster operator kube-controller-manager is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 346:
2020-03-13T23:04:40.0548178Z I0313 23:04:40.054748       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release@sha256:6fa3e6520d6668737d29a68ef7d7189642b07dba9b17511316210f336e9492b0 (force=true) on generation 2 in state Updating at attempt 9
2020-03-13T23:10:25.1064177Z I0313 23:10:25.106409       1 task_graph.go:596] Result of work: [Cluster operator kube-controller-manager is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 393:

Not sure what's going on with that.  Currently it's only turned up in a single 4.3.5 -> 4.4.0-rc.1 CI job [2], but hanging an update is bad.  We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it’s always been like this we just never noticed
  Yes, from 4.2.z and 4.3.1

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/92
[2]: https://search.svc.ci.openshift.org/?name=upgrade&search=localhost-recovery-client-token+hasn%27t+been+populated+with+SA+token+yet%3A+missing+SA+UID&maxAge=336h&context=-1&type=all

Comment 1 Tomáš Nožička 2020-03-16 16:01:33 UTC
The number of installer revisions is outstanding.

Mar 13 22:16:57.566 E clusterversion/version changed Failing to True: ClusterOperatorDegraded: Cluster operator kube-controller-manager is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 111:\nNodeInstallerDegraded: \nTargetConfigControllerDegraded: "serviceaccount/localhost-recovery-client": secret openshift-kube-controller-manager/localhost-recovery-client-token hasn't been populated with SA token yet: missing SA UID
Mar 13 22:26:12.601 E clusterversion/version changed Failing to True: ClusterOperatorDegraded: Cluster operator kube-controller-manager is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 155:\nNodeInstallerDegraded: \nTargetConfigControllerDegraded: "serviceaccount/localhost-recovery-client": secret openshift-kube-controller-manager/localhost-recovery-client-token hasn't been populated with SA token yet: missing SA UID
Mar 13 22:44:12.670 E clusterversion/version changed Failing to True: ClusterOperatorDegraded: Cluster operator kube-controller-manager is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 250:\nNodeInstallerDegraded: \nTargetConfigControllerDegraded: "serviceaccount/localhost-recovery-client": secret openshift-kube-controller-manager/localhost-recovery-client-token hasn't been populated with SA token yet: missing SA UID
Mar 13 22:52:57.703 E clusterversion/version changed Failing to True: ClusterOperatorDegraded: Cluster operator kube-controller-manager is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 298:\nNodeInstallerDegraded: \nRevisionControllerDegraded: conflicting latestAvailableRevision 303\nTargetConfigControllerDegraded: "serviceaccount/localhost-recovery-client": secret openshift-kube-controller-manager/localhost-recovery-client-token hasn't been populated with SA token yet: missing SA UID


Also this looks pretty fatal:

Mar 13 23:05:31.216 W ns/openshift-kube-controller-manager pod/kube-controller-manager-ci-op-vf25zxdf-7f637-cb7p6-master-0 Unable to attach or mount volumes: unmounted volumes=[resource-dir cert-dir], unattached volumes=[resource-dir cert-dir]: timed out waiting for the condition


I didn't find the exact cause yet but I'll continue in the morning. Moving it to 4.4.0 as it looks fairly bad so far.

Comment 2 Tomáš Nožička 2020-03-17 18:12:32 UTC
```
{
  "apiVersion": "v1",
  "kind": "Secret",
  "metadata": {
    "annotations": {
      "kubernetes.io/service-account.name": "localhost-recovery-client"
    },
    "creationTimestamp": "2020-03-13T21:57:13Z",
    "name": "localhost-recovery-client-token",
    "namespace": "openshift-kube-controller-manager",
    "resourceVersion": "24994",
    "selfLink": "/api/v1/namespaces/openshift-kube-controller-manager/secrets/localhost-recovery-client-token",
    "uid": "853d9788-b36f-459c-b26b-ee8e5e70b45f"
  },
  "type": "kubernetes.io/service-account-token"
}
```

is missing the token that should have been injected by kube-controller-manager.
I was considering it could be fighting with apply but the kube-controller-manager-operator logs are missing SecretUpdated events.

Unfortunately, the audit logs that could prove either way are corrupted and contain no mention of secret openshift-kube-controller-manager/localhost-recovery-client-token, not even the create audit event.

```
{
  "apiVersion": "v1",
  "data": {
    "ca.crt": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURNakNDQWhxZ0F3SUJBZ0lJSCs1L0tCZ21DdWt3RFFZSktvWklodmNOQVFFTEJRQXdOekVTTUJBR0ExVUUKQ3hNSmIzQmxibk5vYVdaME1TRXdId1lEVlFRREV4aHJkV0psTFdGd2FYTmxjblpsY2kxc1lpMXphV2R1WlhJdwpIaGNOTWpBd016RXpNakExT0RRMVdoY05NekF3TXpFeE1qQTFPRFExV2pBM01SSXdFQVlEVlFRTEV3bHZjR1Z1CmMyaHBablF4SVRBZkJnTlZCQU1UR0d0MVltVXRZWEJwYzJWeWRtVnlMV3hpTFhOcFoyNWxjakNDQVNJd0RRWUoKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBS2VlTWZTUC9jUld4N1hXTGhqZ3hKV1F0WTNlVzY3SgptL1FBSHY1SGRtaGttdlZRMHplWitIRHVtSktWYTFjNmpyL240NXh6OUs2R2VhSFZoVTlxVUZubzhDZXFpcStVCkZ3dzNXNERNaWJoRWpVUmVQbWNGZys1TnZxcjJLaVQ0U3J0SUZtMTRmMWRINzcyNElaUUZ6ckxjRUF2QUVCTnUKK05ieG9MbnU2WVNhR1BYZTFDbTNiNkViUTdLeTMraGczZGxMcVRTTzlQNFhPRmtrdDdKRHhsdkd5K3Vydnd2OApNcGlKa1NMSDVITEppY2VFMHo2czlZbnAvd0lHNTFqZERUKzEzRDZ3VVZKYUEwd0gzSXdINy9lejNEZTZEZ0JRCldLdlNtMXJtNkl0aTJublN5ZE51MXdUNDVrSCtscHhMdEJ3NldWL1BMVk5kR0Frb3d3dk84MTBDQXdFQUFhTkMKTUVBd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3SFFZRFZSME9CQllFRk53awpwKzhrZjJtVTBaU00wRFZYaHkveFc3Mi9NQTBHQ1NxR1NJYjNEUUVCQ3dVQUE0SUJBUUNmQ2JiakoxamtjU2xECjFqVnN5VE5EZmg1a3RLV0t2TENZNHJ1ZnhhN3ZIaE9waVk0UHIrdnB2akRDUzlrZWkrU3VJWjA5YU5sRkIvOFcKM1krckloVUZmME1tOVlqTTgwRmFWR1ZRYUFTNUhJaGw2aUFGYXNIK24yRUkycWNNdldwR3JIcHhudTFMdkJkeApHTTh6elNSeFFycURzaWRoRFJZaVlxb1A2MW9PL01Qalptb0x5UkZRdmpiVTBPaEZCaUNmUWFYUi9VZEJqQWJsCmNQSnFmdlpxenZDRDBmOGN1dncyR1M5endDWXNtR2pkWnZxNi9nQnBNY3FSUERCZmtNQjNUdlpPdzQ5T3VnT1kKNHZza3ZlcHR5eEdMZWFoRVdyb3djOStlWGdhUFQrZjIzU3pJZzNqRklIMldveGFSQ2ZhekhMRkw5WmVSQ2haVgpVN0ZldTFSRAotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCi0tLS0tQkVHSU4gQ0VSVElGSUNBVEUtLS0tLQpNSUlEUURDQ0FpaWdBd0lCQWdJSWRjQkRJZ0RXMFp3d0RRWUpLb1pJaHZjTkFRRUxCUUF3UGpFU01CQUdBMVVFCkN4TUpiM0JsYm5Ob2FXWjBNU2d3SmdZRFZRUURFeDlyZFdKbExXRndhWE5sY25abGNpMXNiMk5oYkdodmMzUXQKYzJsbmJtVnlNQjRYRFRJd01ETXhNekl3TlRnME5Gb1hEVE13TURNeE1USXdOVGcwTkZvd1BqRVNNQkFHQTFVRQpDeE1KYjNCbGJuTm9hV1owTVNnd0pnWURWUVFERXg5cmRXSmxMV0Z3YVhObGNuWmxjaTFzYjJOaGJHaHZjM1F0CmMybG5ibVZ5TUlJQklqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQ0FROEFNSUlCQ2dLQ0FRRUEzY1ZWQWlDbTN5T2gKN3ZJSE9DeTRFSlpPUDVlbHBIVkJDNEdwdWVRMEVlek9HbHV6Mnk4TmphMHIxZWxRZUJGeFpzK3dBODZwSEEwSQphMUJwdXVWRVNoNHN4UytkSXhuSmtDeXhrTVBhU1NKc3VvUlphUk11ZkFQMHRweUMxb2pqYnhWOXFFQUREdlVtCmRjbVlHbVd1S292WC9kR0Z5VEgxWG1TSzBZb3RZamNuWUhjWnVlOTVrRHloOUJUQXUvbktSVjZoZWMxbVV4eFQKUlRSYnlJc3l6MEJFZkQwcmtaM2QxOGU4MkRGWjNYQk8vNCtFUi94UERBcm91czJpNHBtSHA1Vm10OWVPTE5wUApnaUpxMFZ1ZHcrMTJKUkl4QXNGOVFvemVId2s0K0JlNVZieVdGWjRNSkdiZHlEd09hOUJVM3FxUEJsSE9TcXJFCmZUbHRGVVliTXdJREFRQUJvMEl3UURBT0JnTlZIUThCQWY4RUJBTUNBcVF3RHdZRFZSMFRBUUgvQkFVd0F3RUIKL3pBZEJnTlZIUTRFRmdRVTJIbWgzUWx5Z1RLRmlza3BXNmZVMTlUYkN3VXdEUVlKS29aSWh2Y05BUUVMQlFBRApnZ0VCQUQ4WEZOZlRtQmt0Ni9ET0hWYUUvRW0wY3BCR3NqMmdpRExGaS83ZlRZVWltSXZ4NXJFUzFwQ2YrZHNWCmxYQ2NaUEt0dzR1aGJOdnlGM04yRUxqdWRWdWlVaEd4cVFvamZMc25WQi8zbnVqT2VhNWo1a1dvKzc4clc1RE0KNXBnY3hsTVM4MElTcGdVV1BKN1ovSzhuV0hVT3NoR2I5bjNwTEgwaWxLOGR4WlRsa0hpekx5WVVuWE80N0FLTQpEMi82eXF6ZWZjSHkyVU84ZnluNkVRRDF2NWtDbGFHUHNIQ2E0WXNRL3ltYUd2UGtpL2RTaG1VN05mT1prc1FkClVUTGxyclFuTGJuT3doN3Y3dGZZZWFPMU9ITDVrZ0VlVFZqVWFBYklDZTE1eEJpdVJ3RnM1NTd6U045bTE1M0IKakQ0VW03bk9FckUvNHNCNUJUdk5DMHdNVFFFPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCi0tLS0tQkVHSU4gQ0VSVElGSUNBVEUtLS0tLQpNSUlEVERDQ0FqU2dBd0lCQWdJSUpMQmJUY2lUeWxnd0RRWUpLb1pJaHZjTkFRRUxCUUF3UkRFU01CQUdBMVVFCkN4TUpiM0JsYm5Ob2FXWjBNUzR3TEFZRFZRUURFeVZyZFdKbExXRndhWE5sY25abGNpMXpaWEoyYVdObExXNWwKZEhkdmNtc3RjMmxuYm1WeU1CNFhEVEl3TURNeE16SXdOVGcwTlZvWERUTXdNRE14TVRJd05UZzBOVm93UkRFUwpNQkFHQTFVRUN4TUpiM0JsYm5Ob2FXWjBNUzR3TEFZRFZRUURFeVZyZFdKbExXRndhWE5sY25abGNpMXpaWEoyCmFXTmxMVzVsZEhkdmNtc3RjMmxuYm1WeU1JSUJJakFOQmdrcWhraUc5dzBCQVFFRkFBT0NBUThBTUlJQkNnS0MKQVFFQTJLcFFXemRnS0lvUEtiR2hVWDN4TkVtdkEwN1J6UFJDWUk1dFViUTJhZ1ZDY2VqaHg1UkVSL1dyTTJGVQp1S0FIY0M2cnVzQnFVUnhNVnJQQ05lKzJ2S3BJV0h4ZnZKa3JOcXkxb2FmdDJWelpqd3JLQk01MHZqQUsxdDF4ClFpNU5aZTFQNHgrTlpRRDdQYnpaMFlvb0l6eEorVzQveTljdFM2Q1hWc0tmaVh5SUw0Z1o1Q0NudHE1bnZiZnQKcHNPTUdiWEV6dTdiT3pXN29wRGdSUHAvai94TzYyTm4wRWs0emNVZ29FZDdpMHRwd3l0ZFJRaXMzbFpJbHJwRApnRjVqK21SSjNBR1REYjY3bmNnVS9nUFFKak5sV1ZOS0drSVdaOVQzRUx2RmgxUHg2MWZyZGdJVXQ1Z2ExVkVFClpGM3NCL0VaTEpMbGRRNis4VVBKNGpucmxRSURBUUFCbzBJd1FEQU9CZ05WSFE4QkFmOEVCQU1DQXFRd0R3WUQKVlIwVEFRSC9CQVV3QXdFQi96QWRCZ05WSFE0RUZnUVVWcWJyTTJkdDJDZlA0bVdnZkt4TWNpZHdpVkV3RFFZSgpLb1pJaHZjTkFRRUxCUUFEZ2dFQkFMcnJaU2FpYW9xRXhBRWl2d1AzMWlQaXVVQVlFQWZKcnlIcnU4NkM3TFpHCi9uMWx1bTQxWTAvek9Pc3VXdTJhWldTM0kzU3BlZUxpK2dLZUJIVlNMMjMveHZCbWgwNVIrSHhLS2Izd25jYkQKYkNDSElxSkd4NUNZMXhRY0xyNVpQcTYxRUx3Uk1Zam9OTG9pdUtTc2x5Qm4zZFovRUtjY2VmY20zY2E2d2ZkUwpma1FlRmY1QkhuM3E5clNNZ093U1lBdkM2SU1HYjRjSTRDckZlNEJQcWVIZnhudGJRdGFXd1JPSytGUTF3alNRClVZdnlSNjVZT2dxaDY2bzFRSVFtT2tJYjh3b0dLTEd5NFdQYnE5eHJ6eUROMVNxNXN1dUZkMm54TUh1dStWajQKK3ZzWC8xMFFMdWxpN214ZGdvMHdDc0M5a09iR0RmYkJRT0t4S2pSVlgvND0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQotLS0tLUJFR0lOIENFUlRJRklDQVRFLS0tLS0KTUlJRGtEQ0NBbmlnQXdJQkFnSUJBVEFOQmdrcWhraUc5dzBCQVFzRkFEQlpNVmN3VlFZRFZRUURERTV2Y0dWdQpjMmhwWm5RdGEzVmlaUzFoY0dselpYSjJaWEl0YjNCbGNtRjBiM0pmYkc5allXeG9iM04wTFhKbFkyOTJaWEo1CkxYTmxjblpwYm1jdGMybG5ibVZ5UURFMU9EUXhNelEwTnpNd0hoY05NakF3TXpFek1qRXlNVEV6V2hjTk16QXcKTXpFeE1qRXlNVEUwV2pCWk1WY3dWUVlEVlFRRERFNXZjR1Z1YzJocFpuUXRhM1ZpWlMxaGNHbHpaWEoyWlhJdApiM0JsY21GMGIzSmZiRzlqWVd4b2IzTjBMWEpsWTI5MlpYSjVMWE5sY25acGJtY3RjMmxuYm1WeVFERTFPRFF4Ck16UTBOek13Z2dFaU1BMEdDU3FHU0liM0RRRUJBUVVBQTRJQkR3QXdnZ0VLQW9JQkFRREd3N0JVR3NqQ1Q3dVQKSGdzaHoyLzRuNUpEbXVKNmVOQzdCV3RIcDZDUEZwSElXNUkzY3Evc1p5MG4rQVNhZXdycTNHelc2WmR0U2ZIWApyaElLZnBhOHEwL1dmcVJqYVVxbENpbEFteU9XY25IYW5pQnJUMzRxY3pONFh2cGk5VkhjSGFxTnM5K0JvVGNaCmJkZ1prT0hqN3VuTmtMa1RoRFNIcXR5TDlyMjc2NFN4ekxiaGZlaXpIbHM3R3VjN0xyVm1KUUN3VUE2SHhISUoKUWgrOXZPZHJ1TFBhWHM0dCtkL1hJWURIbXlzem0rUzZyYnN0SzFGa1pKUzE0TkYwVG9odllYVUdITGcxUEF2Sgo2TUt0N3htOCtNaG53MjFvYjYyQ1ZENGRGM1ozV0JvQmlHY1haalYyNHBXRWw4d0VSUW96K3diYWNEQVVDY1hkCndaNUhTOWRyQWdNQkFBR2pZekJoTUE0R0ExVWREd0VCL3dRRUF3SUNwREFQQmdOVkhSTUJBZjhFQlRBREFRSC8KTUIwR0ExVWREZ1FXQkJSbWdkVXlpTzc5RzA4LytVdk82QTNKUEh6UnVqQWZCZ05WSFNNRUdEQVdnQlJtZ2RVeQppTzc5RzA4LytVdk82QTNKUEh6UnVqQU5CZ2txaGtpRzl3MEJBUXNGQUFPQ0FRRUFyQkRxWm9hY3FNSHpvSDRKCnVZSHovRjZmNVdsVUlNVU9NZG9ta1VmaW1teUQxNFdRWkR0T0d1cUpWMGFLVGo2QllKSmUwMC9jOGNFRnZrcy8KSVVYSklac2taWjM5ZWlCb3JKcXNqcnYzN2EvSzNYUDVQbmRZWWRNazBId1M2SGsyZmVNSFN1aGV3QzUwaFVseAovaHU0REI3Sk1ieW0zVldxK3oyaGRtSHUyUmZZQ0EvYldnT005aVhQUHJ5ZDdia3A5SmVPeGFRZ2pDdzVzem1NClRQNExMWHRpWmhUa2pIdlBsMG90YUg5VGMrMFFIS3FmYm5EcVBkVnk5Yy9kV3ZxaWw5cUUrL3hGR2tUVFg2clcKMzB0WHhmbklDTlU2MC9zMEk1ckdRdytnVkVJckd5bjhUVEJLS29xU3hmOHhvVHJxbmtnNTZhNDFsV00zTU9Bdwp0RmN6OUE9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCi0tLS0tQkVHSU4gQ0VSVElGSUNBVEUtLS0tLQpNSUlDN1RDQ0FkV2dBd0lCQWdJQkFUQU5CZ2txaGtpRzl3MEJBUXNGQURBbU1TUXdJZ1lEVlFRRERCdHBibWR5ClpYTnpMVzl3WlhKaGRHOXlRREUxT0RReE16UTNOVEl3SGhjTk1qQXdNekV6TWpFeU5UVXlXaGNOTWpJd016RXoKTWpFeU5UVXpXakFtTVNRd0lnWURWUVFEREJ0cGJtZHlaWE56TFc5d1pYSmhkRzl5UURFMU9EUXhNelEzTlRJdwpnZ0VpTUEwR0NTcUdTSWIzRFFFQkFRVUFBNElCRHdBd2dnRUtBb0lCQVFDN1FGVGkwRUhKNEkvcEZHazVZeFlMCllSUVRUUEYxcHBjRllUMEc4WnlGTHA5QzdVcVBSc091VW5hWDVVMUdWUXI5VmUySXJYaVBxaXhHK0ZabC9zeHcKUjlkdkh5Snpqd21KMEg0NnR3TmNDY0FodTRzcTV5dnhjZm9Hbit1UEpWTGFEVHJEcEpiaFlYUTBoeDQ1Z3lUUQpPS2MralhnTTJ5Sys1TzlwSWNSZlB6Ly9RWTNSRVJPTlZpY0d3eW52UXZmdXhuMGJsZDIzZ0YxQkl1bG84aFhLCm1sWDNyaG11em1RUFNoL2s0YXdyY09qN2hEZWNteitSRW5La0VUQTIxclhZUnJSTmFxZGZEQ2MyYTZKdmgvQUEKNkpZVE5VMGh4OWdidFNnYWJSMHQ5QnJxZHFmT2ZVdlV4N0ZjcDY2VDV2MHlvcndTaTNHZTlmVUE5VW1EQkVXUApBZ01CQUFHakpqQWtNQTRHQTFVZER3RUIvd1FFQXdJQ3BEQVNCZ05WSFJNQkFmOEVDREFHQVFIL0FnRUFNQTBHCkNTcUdTSWIzRFFFQkN3VUFBNElCQVFCY0NRc3E3ekgrbmpmUDFPTWI1Ym9haUtrWmRrMG40dU5wZWhlSEtuenoKWWhSbGUzTm4rcG9qb0xIU1NXVUJoT3Znb3JmZFMxN2RTQ1dkbU5URFB3V3V3blBJTmtyMWRUcmx5emZLZUlobAp0TjYxQS9haHRJNDNkR2hpUzBnZCtQTnpOQ0RnVjFxNzRRU0ZIRzEzdUc5dXV5dEJQL1dYa0hoUjJFaE54SzRxClZGU0pJNVVQOHdaMEZ2UUZnZ05ZMWk3S3lNTUZRTVV6UkhYUUt1MThIbmFiRTdjVUV1clArbUgzM054Q1VFRUgKU21BY29reU9oU013WnpRdjgxbGxUNGd5OHdUbmVzUkN4aXpmNGJ3MnJ5QmQyTnExZGdjZkJXclF1SjBiRy9MMgplRjVWRTlpNzEybmRBNXJDZDZ2MkU0UjJLNDNuQ3V3R0cvSkdUZGFUYkNQRAotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==",
    "namespace": "",
    "service-ca.crt": "-----BEGIN CERTIFICATE-----
MIIDMjCCAhqgAwIBAgIIH+5/KBgmCukwDQYJKoZIhvcNAQELBQAwNzESMBAGA1UE
CxMJb3BlbnNoaWZ0MSEwHwYDVQQDExhrdWJlLWFwaXNlcnZlci1sYi1zaWduZXIw
HhcNMjAwMzEzMjA1ODQ1WhcNMzAwMzExMjA1ODQ1WjA3MRIwEAYDVQQLEwlvcGVu
c2hpZnQxITAfBgNVBAMTGGt1YmUtYXBpc2VydmVyLWxiLXNpZ25lcjCCASIwDQYJ
KoZIhvcNAQEBBQADggEPADCCAQoCggEBAKeeMfSP/cRWx7XWLhjgxJWQtY3eW67J
m/QAHv5HdmhkmvVQ0zeZ+HDumJKVa1c6jr/n45xz9K6GeaHVhU9qUFno8Ceqiq+U
Fww3W4DMibhEjURePmcFg+5Nvqr2KiT4SrtIFm14f1dH7724IZQFzrLcEAvAEBNu
+NbxoLnu6YSaGPXe1Cm3b6EbQ7Ky3+hg3dlLqTSO9P4XOFkkt7JDxlvGy+urvwv8
MpiJkSLH5HLJiceE0z6s9Ynp/wIG51jdDT+13D6wUVJaA0wH3IwH7/ez3De6DgBQ
WKvSm1rm6Iti2nnSydNu1wT45kH+lpxLtBw6WV/PLVNdGAkowwvO810CAwEAAaNC
MEAwDgYDVR0PAQH/BAQDAgKkMA8GA1UdEwEB/wQFMAMBAf8wHQYDVR0OBBYEFNwk
p+8kf2mU0ZSM0DVXhy/xW72/MA0GCSqGSIb3DQEBCwUAA4IBAQCfCbbjJ1jkcSlD
1jVsyTNDfh5ktKWKvLCY4rufxa7vHhOpiY4Pr+vpvjDCS9kei+SuIZ09aNlFB/8W
3Y+rIhUFf0Mm9YjM80FaVGVQaAS5HIhl6iAFasH+n2EI2qcMvWpGrHpxnu1LvBdx
GM8zzSRxQrqDsidhDRYiYqoP61oO/MPjZmoLyRFQvjbU0OhFBiCfQaXR/UdBjAbl
cPJqfvZqzvCD0f8cuvw2GS9zwCYsmGjdZvq6/gBpMcqRPDBfkMB3TvZOw49OugOY
4vskveptyxGLeahEWrowc9+eXgaPT+f23SzIg3jFIH2WoxaRCfazHLFL9ZeRChZV
U7Feu1RD
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
MIIDQDCCAiigAwIBAgIIdcBDIgDW0ZwwDQYJKoZIhvcNAQELBQAwPjESMBAGA1UE
CxMJb3BlbnNoaWZ0MSgwJgYDVQQDEx9rdWJlLWFwaXNlcnZlci1sb2NhbGhvc3Qt
c2lnbmVyMB4XDTIwMDMxMzIwNTg0NFoXDTMwMDMxMTIwNTg0NFowPjESMBAGA1UE
CxMJb3BlbnNoaWZ0MSgwJgYDVQQDEx9rdWJlLWFwaXNlcnZlci1sb2NhbGhvc3Qt
c2lnbmVyMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA3cVVAiCm3yOh
7vIHOCy4EJZOP5elpHVBC4GpueQ0EezOGluz2y8Nja0r1elQeBFxZs+wA86pHA0I
a1BpuuVESh4sxS+dIxnJkCyxkMPaSSJsuoRZaRMufAP0tpyC1ojjbxV9qEADDvUm
dcmYGmWuKovX/dGFyTH1XmSK0YotYjcnYHcZue95kDyh9BTAu/nKRV6hec1mUxxT
RTRbyIsyz0BEfD0rkZ3d18e82DFZ3XBO/4+ER/xPDArous2i4pmHp5Vmt9eOLNpP
giJq0Vudw+12JRIxAsF9QozeHwk4+Be5VbyWFZ4MJGbdyDwOa9BU3qqPBlHOSqrE
fTltFUYbMwIDAQABo0IwQDAOBgNVHQ8BAf8EBAMCAqQwDwYDVR0TAQH/BAUwAwEB
/zAdBgNVHQ4EFgQU2Hmh3QlygTKFiskpW6fU19TbCwUwDQYJKoZIhvcNAQELBQAD
ggEBAD8XFNfTmBkt6/DOHVaE/Em0cpBGsj2giDLFi/7fTYUimIvx5rES1pCf+dsV
lXCcZPKtw4uhbNvyF3N2ELjudVuiUhGxqQojfLsnVB/3nujOea5j5kWo+78rW5DM
5pgcxlMS80ISpgUWPJ7Z/K8nWHUOshGb9n3pLH0ilK8dxZTlkHizLyYUnXO47AKM
D2/6yqzefcHy2UO8fyn6EQD1v5kClaGPsHCa4YsQ/ymaGvPki/dShmU7NfOZksQd
UTLlrrQnLbnOwh7v7tfYeaO1OHL5kgEeTVjUaAbICe15xBiuRwFs557zSN9m153B
jD4Um7nOErE/4sB5BTvNC0wMTQE=
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
MIIDTDCCAjSgAwIBAgIIJLBbTciTylgwDQYJKoZIhvcNAQELBQAwRDESMBAGA1UE
CxMJb3BlbnNoaWZ0MS4wLAYDVQQDEyVrdWJlLWFwaXNlcnZlci1zZXJ2aWNlLW5l
dHdvcmstc2lnbmVyMB4XDTIwMDMxMzIwNTg0NVoXDTMwMDMxMTIwNTg0NVowRDES
MBAGA1UECxMJb3BlbnNoaWZ0MS4wLAYDVQQDEyVrdWJlLWFwaXNlcnZlci1zZXJ2
aWNlLW5ldHdvcmstc2lnbmVyMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKC
AQEA2KpQWzdgKIoPKbGhUX3xNEmvA07RzPRCYI5tUbQ2agVCcejhx5RER/WrM2FU
uKAHcC6rusBqURxMVrPCNe+2vKpIWHxfvJkrNqy1oaft2VzZjwrKBM50vjAK1t1x
Qi5NZe1P4x+NZQD7PbzZ0YooIzxJ+W4/y9ctS6CXVsKfiXyIL4gZ5CCntq5nvbft
psOMGbXEzu7bOzW7opDgRPp/j/xO62Nn0Ek4zcUgoEd7i0tpwytdRQis3lZIlrpD
gF5j+mRJ3AGTDb67ncgU/gPQJjNlWVNKGkIWZ9T3ELvFh1Px61frdgIUt5ga1VEE
ZF3sB/EZLJLldQ6+8UPJ4jnrlQIDAQABo0IwQDAOBgNVHQ8BAf8EBAMCAqQwDwYD
VR0TAQH/BAUwAwEB/zAdBgNVHQ4EFgQUVqbrM2dt2CfP4mWgfKxMcidwiVEwDQYJ
KoZIhvcNAQELBQADggEBALrrZSaiaoqExAEivwP31iPiuUAYEAfJryHru86C7LZG
/n1lum41Y0/zOOsuWu2aZWS3I3SpeeLi+gKeBHVSL23/xvBmh05R+HxKKb3wncbD
bCCHIqJGx5CY1xQcLr5ZPq61ELwRMYjoNLoiuKSslyBn3dZ/EKccefcm3ca6wfdS
fkQeFf5BHn3q9rSMgOwSYAvC6IMGb4cI4CrFe4BPqeHfxntbQtaWwROK+FQ1wjSQ
UYvyR65YOgqh66o1QIQmOkIb8woGKLGy4WPbq9xrzyDN1Sq5suuFd2nxMHuu+Vj4
+vsX/10QLuli7mxdgo0wCsC9kObGDfbBQOKxKjRVX/4=
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
MIIDkDCCAnigAwIBAgIBATANBgkqhkiG9w0BAQsFADBZMVcwVQYDVQQDDE5vcGVu
c2hpZnQta3ViZS1hcGlzZXJ2ZXItb3BlcmF0b3JfbG9jYWxob3N0LXJlY292ZXJ5
LXNlcnZpbmctc2lnbmVyQDE1ODQxMzQ0NzMwHhcNMjAwMzEzMjEyMTEzWhcNMzAw
MzExMjEyMTE0WjBZMVcwVQYDVQQDDE5vcGVuc2hpZnQta3ViZS1hcGlzZXJ2ZXIt
b3BlcmF0b3JfbG9jYWxob3N0LXJlY292ZXJ5LXNlcnZpbmctc2lnbmVyQDE1ODQx
MzQ0NzMwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDGw7BUGsjCT7uT
Hgshz2/4n5JDmuJ6eNC7BWtHp6CPFpHIW5I3cq/sZy0n+ASaewrq3GzW6ZdtSfHX
rhIKfpa8q0/WfqRjaUqlCilAmyOWcnHaniBrT34qczN4Xvpi9VHcHaqNs9+BoTcZ
bdgZkOHj7unNkLkThDSHqtyL9r2764SxzLbhfeizHls7Guc7LrVmJQCwUA6HxHIJ
Qh+9vOdruLPaXs4t+d/XIYDHmyszm+S6rbstK1FkZJS14NF0TohvYXUGHLg1PAvJ
6MKt7xm8+Mhnw21ob62CVD4dF3Z3WBoBiGcXZjV24pWEl8wERQoz+wbacDAUCcXd
wZ5HS9drAgMBAAGjYzBhMA4GA1UdDwEB/wQEAwICpDAPBgNVHRMBAf8EBTADAQH/
MB0GA1UdDgQWBBRmgdUyiO79G08/+UvO6A3JPHzRujAfBgNVHSMEGDAWgBRmgdUy
iO79G08/+UvO6A3JPHzRujANBgkqhkiG9w0BAQsFAAOCAQEArBDqZoacqMHzoH4J
uYHz/F6f5WlUIMUOMdomkUfimmyD14WQZDtOGuqJV0aKTj6BYJJe00/c8cEFvks/
IUXJIZskZZ39eiBorJqsjrv37a/K3XP5PndYYdMk0HwS6Hk2feMHSuhewC50hUlx
/hu4DB7JMbym3VWq+z2hdmHu2RfYCA/bWgOM9iXPPryd7bkp9JeOxaQgjCw5szmM
TP4LLXtiZhTkjHvPl0otaH9Tc+0QHKqfbnDqPdVy9c/dWvqil9qE+/xFGkTTX6rW
30tXxfnICNU60/s0I5rGQw+gVEIrGyn8TTBKKoqSxf8xoTrqnkg56a41lWM3MOAw
tFcz9A==
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
MIIC7TCCAdWgAwIBAgIBATANBgkqhkiG9w0BAQsFADAmMSQwIgYDVQQDDBtpbmdy
ZXNzLW9wZXJhdG9yQDE1ODQxMzQ3NTIwHhcNMjAwMzEzMjEyNTUyWhcNMjIwMzEz
MjEyNTUzWjAmMSQwIgYDVQQDDBtpbmdyZXNzLW9wZXJhdG9yQDE1ODQxMzQ3NTIw
ggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQC7QFTi0EHJ4I/pFGk5YxYL
YRQTTPF1ppcFYT0G8ZyFLp9C7UqPRsOuUnaX5U1GVQr9Ve2IrXiPqixG+FZl/sxw
R9dvHyJzjwmJ0H46twNcCcAhu4sq5yvxcfoGn+uPJVLaDTrDpJbhYXQ0hx45gyTQ
OKc+jXgM2yK+5O9pIcRfPz//QY3RERONVicGwynvQvfuxn0bld23gF1BIulo8hXK
mlX3rhmuzmQPSh/k4awrcOj7hDecmz+REnKkETA21rXYRrRNaqdfDCc2a6Jvh/AA
6JYTNU0hx9gbtSgabR0t9BrqdqfOfUvUx7Fcp66T5v0yorwSi3Ge9fUA9UmDBEWP
AgMBAAGjJjAkMA4GA1UdDwEB/wQEAwICpDASBgNVHRMBAf8ECDAGAQH/AgEAMA0G
CSqGSIb3DQEBCwUAA4IBAQBcCQsq7zH+njfP1OMb5boaiKkZdk0n4uNpeheHKnzz
YhRle3Nn+pojoLHSSWUBhOvgorfdS17dSCWdmNTDPwWuwnPINkr1dTrlyzfKeIhl
tN61A/ahtI43dGhiS0gd+PNzNCDgV1q74QSFHG13uG9uuytBP/WXkHhR2EhNxK4q
VFSJI5UP8wZ0FvQFggNY1i7KyMMFQMUzRHXQKu18HnabE7cUEurP+mH33NxCUEEH
SmAcokyOhSMwZzQv81llT4gy8wTnesRCxizf4bw2ryBd2Nq1dgcfBWrQuJ0bG/L2
eF5VE9i712ndA5rCd6v2E4R2K43nCuwGG/JGTdaTbCPD
-----END CERTIFICATE-----

-----BEGIN CERTIFICATE-----
MIIDSjCCAjKgAwIBAgIBATANBgkqhkiG9w0BAQsFADA2MTQwMgYDVQQDDCtvcGVu
c2hpZnQtc2VydmljZS1zZXJ2aW5nLXNpZ25lckAxNTg0MTM0NDc2MB4XDTIwMDMx
MzIxMjExNVoXDTIyMDUxMjIxMjExNlowNjE0MDIGA1UEAwwrb3BlbnNoaWZ0LXNl
cnZpY2Utc2VydmluZy1zaWduZXJAMTU4NDEzNDQ3NjCCASIwDQYJKoZIhvcNAQEB
BQADggEPADCCAQoCggEBALCvJkeCR9wU8Xuyaq8ER47Be0vKPX/pzibAf+YWri9I
5fN1ajVLyLIHE7hxk/32gMQ2u0RSVFBdwI3Jr5a5Li7N3hGJUtZdJmI4o5C9y3Mi
Ow82ZMdgiR++XlZV2H6v4Uoc6UBdLnO4c1L+Gu+GcfU/hIUzxoESg0aPgYznYRpj
4aT1epoXwHnTIdo/Knuijg0mJ70HExW9UgWSv/HJ3PzW0FxL/kAMUdIuDzfwu1RZ
1uRqn5qv3W0BdcasAO2LMJ4TYFHwVjXp51s6+N1Cy+TFVfaniqqMtR5ZuZRJV+i1
4w17rBsxI/MW77pxm4sApX/SOU/1bzrxOs7WG2t4L68CAwEAAaNjMGEwDgYDVR0P
AQH/BAQDAgKkMA8GA1UdEwEB/wQFMAMBAf8wHQYDVR0OBBYEFIduF72fksyRn+jc
Ug3Ymsa+tV2UMB8GA1UdIwQYMBaAFIduF72fksyRn+jcUg3Ymsa+tV2UMA0GCSqG
SIb3DQEBCwUAA4IBAQAaz9Jz5KEcnlIM3WSTGlHiJP1095ivVyhoRjnDNq8ToZhp
F3VbyzyN70w2V75dMPoROAsGhsavyyr/+iIZMFdpZD6RRjR0OYYUF+h/MJZvRwsd
yl9+zsiry3D4X9T9pqZTHE6BKn/vNqaA04Yvgd4OokFdM+zrWxSUdaIZRJEVuRgb
1Rlvon65CTBf50cupQiDeG19tDZpIzdDdf+OhqeH7gmTbwl6vCxeTtNgcJE9/Md0
zQHzyfnVGy0eHAasPoKFwQ94/7/RFzkfJunb+o/yX2w3ULkuYfLUq1bXOgeCGwbI
XWgyjY3fVZL3EDxsN2A2Nw4kkzUdHzSkt7Q0i08W
-----END CERTIFICATE-----
",
    "token": ""
  },
  "kind": "Secret",
  "metadata": {
    "annotations": {
      "kubernetes.io/service-account.name": "localhost-recovery-client",
      "kubernetes.io/service-account.uid": "5ec083e7-7f42-45a6-b3d0-f3d03eea45de"
    },
    "creationTimestamp": "2020-03-13T23:10:59Z",
    "name": "localhost-recovery-client-token-400",
    "namespace": "openshift-kube-controller-manager",
    "ownerReferences": [
      {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "name": "revision-status-400",
        "uid": "8bda187a-6ee8-4851-b2f4-58500631d220"
      }
    ],
    "resourceVersion": "65608",
    "selfLink": "/api/v1/namespaces/openshift-kube-controller-manager/secrets/localhost-recovery-client-token-400",
    "uid": "ce6980c0-4f45-49a3-aae1-454ad783a948"
  },
  "type": "kubernetes.io/service-account-token"
}
```

The revisioned versions contain "ca.crt" and "service-ca.crt" values injected (possibly some admission for service-ca - might worth to have a look as it isn't present on localhost-recovery-client-token).

This is what triggers the 400 rollouts as it is hot looping on DeepEqual between those two.

ACTION ITEM #1 Do not rollout a revision if createTargetConfigController failed 
maybe blocking the managePod would prevent the install before the token is ready https://github.com/openshift/cluster-kube-controller-manager-operator/blob/e976dbc5f71d9b8c988a802a96073a929b7bdae1/pkg/operator/targetconfigcontroller/targetconfigcontroller.go#L234-L241

The lack of injected token is likely a case where no kube-controller-manager is running.

=== kube-controller-manager-ci-op-vf25zxdf-7f637-cb7p6-master-0
 - is the new version deploying but waiting for the token to be injected
 - pending
 - it shouldn't install before the revisioned resource is ready

=== kube-controller-manager-ci-op-vf25zxdf-7f637-cb7p6-master-1
2020-03-13T21:34:29.9862257Z E0313 21:34:29.986173       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://localhost:6443/api/v1/namespaces/kube-system/configmaps/kube-controller-manager?timeout=10s: dial tcp [::1]:6443: connect: connection refused
2020-03-13T21:34:36.5130096Z E0313 21:34:36.512916       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: configmaps "kube-controller-manager" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system": RBAC: [clusterrole.rbac.authorization.k8s.io "system:webhook" not found, clusterrole.rbac.authorization.k8s.io "system:discovery" not found, clusterrole.rbac.authorization.k8s.io "system:public-info-viewer" not found, clusterrole.rbac.authorization.k8s.io "system:basic-user" not found, clusterrole.rbac.authorization.k8s.io "system:oauth-token-deleter" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-jenkinspipeline" not found, clusterrole.rbac.authorization.k8s.io "system:openshift:discovery" not found, clusterrole.rbac.authorization.k8s.io "self-access-reviewer" not found, clusterrole.rbac.authorization.k8s.io "basic-user" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-docker" not found, clusterrole.rbac.authorization.k8s.io "system:scope-impersonation" not found, clusterrole.rbac.authorization.k8s.io "cluster-status" not found, clusterrole.rbac.authorization.k8s.io "system:openshift:public-info-viewer" not found, clusterrole.rbac.authorization.k8s.io "system:kube-controller-manager" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-source" not found, clusterrole.rbac.authorization.k8s.io "console-extensions-reader" not found]
2020-03-13T22:01:14.6880472Z E0313 22:01:14.687968       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://localhost:6443/api/v1/namespaces/kube-system/configmaps/kube-controller-manager?timeout=10s: dial tcp [::1]:6443: connect: connection refused
2020-03-13T22:01:19.4268737Z E0313 22:01:19.426825       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: configmaps "kube-controller-manager" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system"

=== kube-controller-manager-ci-op-vf25zxdf-7f637-cb7p6-master-2
2020-03-13T23:12:18.3357629Z E0313 23:12:18.335746       1 azure_standard.go:714] error: az.EnsureHostInPool(ci-op-vf25zxdf-7f637-cb7p6-master-1), az.vmSet.GetPrimaryInterface.Get(ci-op-vf25zxdf-7f637-cb7p6-master-1, ), err=azure - cloud provider rate limited(read) for operation:NicGet

2020-03-13T23:12:33.8048022Z E0313 23:12:33.804725       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.PodSecurityPolicy: the server could not find the requested resource
2020-03-13T23:12:33.8058897Z E0313 23:12:33.805858       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.ReplicaSet: the server could not find the requested resource
2020-03-13T23:12:33.8073024Z E0313 23:12:33.807277       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.NetworkPolicy: the server could not find the requested resource
...

ACTION ITEM #2
- we should never have all kube-controller-managers down during upgrade
- some of the errors might be connected to kube-apiservers

Maciej, would you mind looking into this further while I am out? Thanks.

Comment 3 Maciej Szulik 2020-03-26 18:35:54 UTC
I looked into this and confirmed most of that Tomas wrote in the previous comment,
that kcm-o is hot-looping while waiting for localhost-recovery-client-token SA token.
The problem is that this is a new resource which is additionally dependant on kcm
to inject additional data. This makes this mechanism very racy. I spoke with Stefan
about it and we both agree that blocking managePod would only hide the issue here.
The best option and recommended approach for the future cases like this one is
to create a "blueprint" resource and allow the controller to inject that token.
Only then sync the blueprint into its actual location. This will lengthen the process
but only the first time, at the same time preventing any races like the one 
described above.

Comment 4 W. Trevor King 2020-03-26 18:58:22 UTC
Re-instating the needsinfo.  Comments above make it sound like folks have a good handle on the underlying issue, but the impact-statement questions from comment 0 ("Who is impacted?" through "Is this a regression?" remain unanswered, so it's not clear if this should remain an UpgradeBlocker or, if so, which edges should be blocked by it.

Comment 5 Maciej Szulik 2020-03-27 09:08:03 UTC
(In reply to W. Trevor King from comment #4)
> Re-instating the needsinfo.  Comments above make it sound like folks have a
> good handle on the underlying issue, but the impact-statement questions from
> comment 0 ("Who is impacted?" through "Is this a regression?" remain
> unanswered, so it's not clear if this should remain an UpgradeBlocker or, if
> so, which edges should be blocked by it.

Sorry, I missed that. No this is not a regression, in 4.4 we've introduced new mechanism for auto recovery,
which was not present previously in the product. Specifically the operator dependence on a new secret, namely 
localhost-recovery-client-token which additionally needs to be reacted on by the operand introduce a race
which was identified and I'm working on a fix.

Comment 6 W. Trevor King 2020-03-27 17:21:38 UTC
So everyone updating from 4.3.z to 4.4 (until your fix lands) is impacted?  Or just some subset of those folks?  Also, can you answer the "What is the impact?" and "How involved is remediation?", or do we not have a handle on those yet?

Comment 7 Maciej Szulik 2020-03-30 13:55:30 UTC
(In reply to W. Trevor King from comment #6)
> So everyone updating from 4.3.z to 4.4 (until your fix lands) is impacted? 
> Or just some subset of those folks?  Also, can you answer the "What is the
> impact?" and "How involved is remediation?", or do we not have a handle on
> those yet?

So this is time dependant iow. unpredictable, it may or may not happen. The
ordering here is important. We have it all identified and actively working 
on a fix.

Comment 8 W. Trevor King 2020-03-30 14:33:37 UTC
> We have it all identified and actively working on a fix.

But you cannot retroactively fix existing releases like 4.4.0-rc.1.  My current understanding is that this issue impacts some racy fraction of 4.3.z -> 4.4 updates, and that it does not impact other updates like 4.3.z -> 4.3.z.  I am still not clear on whether it impacts 4.4 -> 4.4 updates.  Or, when a given cluster is impacted, what the impact is, or how involved remediation is.  Or are we saying something like "only 4.4 RCs are impacted, we don't really care about managing update edges for RCs, we'll have this fixed before 4.4.0 GAs, and we don't want to bother working up remediation details for the folks who do happen to hit this on a 4.4 RC"?

Comment 9 Tomáš Nožička 2020-03-31 10:15:06 UTC
> My current understanding is that this issue impacts some racy fraction of 4.3.z -> 4.4 updates, and that it does not impact other updates like 4.3.z -> 4.3.z.
correct

> I am still not clear on whether it impacts 4.4 -> 4.4 updates.
I don't think it does. The SA would already be created and provided with the long lived token. The issue comes from a race when creating the long lived SA the first time (4.3->4.4) and also hitting other issues.

The impact is likely fatal (manual intervention) as you can't update the operator to newer version afterwards to remedy the situation. My hope is we fix this before 4.4 ships.

Comment 12 zhou ying 2020-04-13 06:49:01 UTC
Confirmed with payload: 4.4.0-0.nightly-2020-04-09-220855, the issue has fixed:
1) Run this command in a loop to make sure that data.token is deleted from openshift-kube-controller-manager/localhost-recovery-client-token

# for i in {1..100} ; do oc patch secret localhost-recovery-client-token --type='json' -p='[{"op": "replace", "path": "/data/token", "value": ""}]'; done
2) In second terminal ran the command below to see that revisions have not made the token empty(token with name "localhost-recovery-client-token-<number>" when run the command on second terminal).

# for i in {1..100}; do oc get secrets -o json | jq '.items[] | select(.metadata.name | test("localhost-recovery-client-token-*")) | select(.data.token == null or .data.token == "")'; done

No revisions had the token empty.

Comment 16 errata-xmlrpc 2020-05-04 11:46:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Comment 17 W. Trevor King 2021-04-05 17:47:14 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add keyword again.

[1]: https://github.com/openshift/enhancements/pull/475