Description of problem:
This is observed in our QE Upgrade CI profile 37_IPI on Azure & Private Cluster: Upgrade from 4.5.0-0.nightly-2021-06-21-181416 to 4.6.0-0.nightly-2021-06-24-080044 fails with cluster operator monitoring failing to roll out and/or degraded.

Post action (CI log at 2021-06-24T16:09:30Z): # oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2021-06-24-080044   True        False         False      141m
cloud-credential                           4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h38m
cluster-autoscaler                         4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h25m
config-operator                            4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h25m
console                                    4.6.0-0.nightly-2021-06-24-080044   True        False         False      154m
csi-snapshot-controller                    4.6.0-0.nightly-2021-06-24-080044   True        False         False      141m
dns                                        4.5.0-0.nightly-2021-06-21-181416   True        True          False      4h33m
etcd                                       4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h32m
image-registry                             4.6.0-0.nightly-2021-06-24-080044   True        True          False      4h19m
ingress                                    4.6.0-0.nightly-2021-06-24-080044   True        False         False      155m
insights                                   4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h27m
kube-apiserver                             4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h31m
kube-controller-manager                    4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h32m
kube-scheduler                             4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h31m
kube-storage-version-migrator              4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h19m
machine-api                                4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h23m
machine-approver                           4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h30m
machine-config                             4.5.0-0.nightly-2021-06-21-181416   True        False         False      4h23m
marketplace                                4.6.0-0.nightly-2021-06-24-080044   True        False         False      154m
monitoring                                 4.6.0-0.nightly-2021-06-24-080044   False       True          True       136m
network                                    4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h34m
node-tuning                                4.6.0-0.nightly-2021-06-24-080044   True        False         False      155m
openshift-apiserver                        4.6.0-0.nightly-2021-06-24-080044   True        False         False      142m
openshift-controller-manager               4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h27m
openshift-samples                          4.6.0-0.nightly-2021-06-24-080044   True        False         False      154m
operator-lifecycle-manager                 4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h34m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h34m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2021-06-24-080044   True        False         False      140m
service-ca                                 4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h34m
storage                                    4.6.0-0.nightly-2021-06-24-080044   True        False         False      156m

Version-Release number of selected component (if applicable):
kubernetes v1.18.3+d8ef5ad
Red Hat Enterprise Linux CoreOS 45.82.202106211530-0 (Ootpa)
4.18.0-193.56.1.el8_2.x86_64
cri-o://1.18.4-11.rhaos4.5.gitfa57051.el8

How reproducible:
Once in our CI

Steps to Reproduce (in detail):
1. Create an OCP 4.5.0-0.nightly-2021-06-21-181416 IPI private cluster on Azure.
2. Upgrade to 4.6.0-0.nightly-2021-06-24-080044.
3.
oc get co

Actual results:
Upgrade fails: cluster operators image-registry and monitoring do not roll out (progressing and/or degraded).

Expected results:
Upgrade succeeds and all cluster operators roll out successfully.

Additional info:
Link to must-gather tar ball in next private comment
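For quickly spotting which operators are blocking the upgrade in output like the table above, the captured `oc get co` table can be filtered for any operator that is not AVAILABLE=True / PROGRESSING=False / DEGRADED=False. A minimal sketch (the `unsettled` helper name and the column order are assumptions based on the default `oc get co` layout):

```shell
# Print cluster operators that are not fully settled, i.e. anything other than
# AVAILABLE=True PROGRESSING=False DEGRADED=False.
# Assumes default column order: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE.
# On a live cluster this would be fed from:  oc get co --no-headers | unsettled
unsettled() {
  awk '$3 != "True" || $4 != "False" || $5 != "False" { print $1, $3, $4, $5 }'
}

# Example on a captured excerpt of the table above:
unsettled <<'EOF'
etcd       4.6.0-0.nightly-2021-06-24-080044 True  False False 4h32m
dns        4.5.0-0.nightly-2021-06-21-181416 True  True  False 4h33m
monitoring 4.6.0-0.nightly-2021-06-24-080044 False True  True  136m
EOF
```

This prints only dns (still progressing) and monitoring (unavailable and degraded), which matches the failing operators in the run above.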
From must-gather file namespaces/openshift-monitoring/core/events.yaml:

- apiVersion: v1
  count: 875
  eventTime: null
  firstTimestamp: "2021-06-24T13:47:43Z"
  involvedObject:
    apiVersion: v1
    fieldPath: spec.containers{grafana-proxy}
    kind: Pod
    name: grafana-7ff876c957-9z5bm
    namespace: openshift-monitoring
    resourceVersion: "72697"
    uid: 433cf0ff-5326-4a1e-86c0-4fcfab73ab0c
  kind: Event
  lastTimestamp: "2021-06-24T16:13:23Z"
  message: 'Readiness probe failed: Get https://10.129.2.35:3000/oauth/healthz: net/http:
    request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)'

There are many "Client.Timeout exceeded while awaiting headers" errors across different projects, so this looks like a network issue.

$ grep -r "Client.Timeout exceeded while awaiting headers"
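To see how widespread the timeouts are, the grep above can be tallied per namespace. A sketch, assuming an unpacked must-gather with the usual namespaces/<namespace>/... directory layout (the `count_timeouts` helper name is hypothetical):

```shell
# Tally "Client.Timeout exceeded while awaiting headers" occurrences per
# namespace in an unpacked must-gather directory ($1). Each grep -ro match is
# printed as "file:match"; we keep the file path, reduce it to the namespace
# component, and count occurrences, most-affected namespace first.
count_timeouts() {
  grep -ro "Client.Timeout exceeded while awaiting headers" "$1" 2>/dev/null \
    | cut -d: -f1 \
    | sed 's|.*/namespaces/\([^/]*\)/.*|\1|' \
    | sort | uniq -c | sort -rn
}

# Usage (path is illustrative):
#   count_timeouts ./must-gather.local/
```

If most hits cluster in a handful of namespaces on the same nodes, that would further support the network-issue theory rather than a problem in the monitoring stack itself.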