Description of problem:
This is observed in our QE Upgrade CI profile 37_IPI on Azure & Private Cluster: Upgrade from 4.5.0-0.nightly-2021-06-21-181416 to 4.6.0-0.nightly-2021-06-24-080044 fails with cluster operator monitoring failing to roll out and/or degraded.

Post action (CI log at 2021-06-24T16:09:30Z): # oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2021-06-24-080044   True        False         False      141m
cloud-credential                           4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h38m
cluster-autoscaler                         4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h25m
config-operator                            4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h25m
console                                    4.6.0-0.nightly-2021-06-24-080044   True        False         False      154m
csi-snapshot-controller                    4.6.0-0.nightly-2021-06-24-080044   True        False         False      141m
dns                                        4.5.0-0.nightly-2021-06-21-181416   True        True          False      4h33m
etcd                                       4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h32m
image-registry                             4.6.0-0.nightly-2021-06-24-080044   True        True          False      4h19m
ingress                                    4.6.0-0.nightly-2021-06-24-080044   True        False         False      155m
insights                                   4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h27m
kube-apiserver                             4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h31m
kube-controller-manager                    4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h32m
kube-scheduler                             4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h31m
kube-storage-version-migrator              4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h19m
machine-api                                4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h23m
machine-approver                           4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h30m
machine-config                             4.5.0-0.nightly-2021-06-21-181416   True        False         False      4h23m
marketplace                                4.6.0-0.nightly-2021-06-24-080044   True        False         False      154m
monitoring                                 4.6.0-0.nightly-2021-06-24-080044   False       True          True       136m
network                                    4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h34m
node-tuning                                4.6.0-0.nightly-2021-06-24-080044   True        False         False      155m
openshift-apiserver                        4.6.0-0.nightly-2021-06-24-080044   True        False         False      142m
openshift-controller-manager               4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h27m
openshift-samples                          4.6.0-0.nightly-2021-06-24-080044   True        False         False      154m
operator-lifecycle-manager                 4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h34m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h34m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2021-06-24-080044   True        False         False      140m
service-ca                                 4.6.0-0.nightly-2021-06-24-080044   True        False         False      4h34m
storage                                    4.6.0-0.nightly-2021-06-24-080044   True        False         False      156m

Version-Release number of selected component (if applicable):
kubernetes v1.18.3+d8ef5ad
Red Hat Enterprise Linux CoreOS 45.82.202106211530-0 (Ootpa)
4.18.0-193.56.1.el8_2.x86_64
cri-o://1.18.4-11.rhaos4.5.gitfa57051.el8

How reproducible:
Once in our CI

Steps to Reproduce (in detail):
1. Create an OCP 4.5.0-0.nightly-2021-06-21-181416 IPI private cluster on Azure.
2. Upgrade to 4.6.0-0.nightly-2021-06-24-080044.
3.
oc get co

Actual results:
Upgrade fails: cluster operators image-registry and monitoring do not roll out (progressing and/or degraded).

Expected results:
Upgrade succeeds and all cluster operators roll out successfully.

Additional info:
Link to must-gather tar ball in next private comment
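For quickly spotting which operators are blocking the upgrade in output like the table above, the captured `oc get co` table can be filtered for any operator that is not AVAILABLE=True / PROGRESSING=False / DEGRADED=False. A minimal sketch (the `unsettled` helper name and the column order are assumptions based on the default `oc get co` layout):

```shell
# Print cluster operators that are not fully settled, i.e. anything other than
# AVAILABLE=True PROGRESSING=False DEGRADED=False.
# Assumes default column order: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE.
# On a live cluster this would be fed from:  oc get co --no-headers | unsettled
unsettled() {
  awk '$3 != "True" || $4 != "False" || $5 != "False" { print $1, $3, $4, $5 }'
}

# Example on a captured excerpt of the table above:
unsettled <<'EOF'
etcd       4.6.0-0.nightly-2021-06-24-080044 True  False False 4h32m
dns        4.5.0-0.nightly-2021-06-21-181416 True  True  False 4h33m
monitoring 4.6.0-0.nightly-2021-06-24-080044 False True  True  136m
EOF
```

This prints only dns (still progressing) and monitoring (unavailable and degraded), which matches the failing operators in the run above.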
From must-gather file namespaces/openshift-monitoring/core/events.yaml:

- apiVersion: v1
  count: 875
  eventTime: null
  firstTimestamp: "2021-06-24T13:47:43Z"
  involvedObject:
    apiVersion: v1
    fieldPath: spec.containers{grafana-proxy}
    kind: Pod
    name: grafana-7ff876c957-9z5bm
    namespace: openshift-monitoring
    resourceVersion: "72697"
    uid: 433cf0ff-5326-4a1e-86c0-4fcfab73ab0c
  kind: Event
  lastTimestamp: "2021-06-24T16:13:23Z"
  message: 'Readiness probe failed: Get https://10.129.2.35:3000/oauth/healthz: net/http:
    request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)'

There are many "Client.Timeout exceeded while awaiting headers" errors across different projects, so this looks like a network issue.

$ grep -r "Client.Timeout exceeded while awaiting headers"
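To see how widespread the timeouts are, the grep above can be tallied per namespace. A sketch, assuming an unpacked must-gather with the usual namespaces/<namespace>/... directory layout (the `count_timeouts` helper name is hypothetical):

```shell
# Tally "Client.Timeout exceeded while awaiting headers" occurrences per
# namespace in an unpacked must-gather directory ($1). Each grep -ro match is
# printed as "file:match"; we keep the file path, reduce it to the namespace
# component, and count occurrences, most-affected namespace first.
count_timeouts() {
  grep -ro "Client.Timeout exceeded while awaiting headers" "$1" 2>/dev/null \
    | cut -d: -f1 \
    | sed 's|.*/namespaces/\([^/]*\)/.*|\1|' \
    | sort | uniq -c | sort -rn
}

# Usage (path is illustrative):
#   count_timeouts ./must-gather.local/
```

If most hits cluster in a handful of namespaces on the same nodes, that would further support the network-issue theory rather than a problem in the monitoring stack itself.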