Bug 1854175

Summary: [OVN] 4.4.11-x86_64 upgrade to 4.5.0-rc.6-x86_64 failed due to the console operator
Product: OpenShift Container Platform
Reporter: Simon <skordas>
Component: Networking
Assignee: Ricardo Carrillo Cruz <ricarril>
Networking sub component: ovn-kubernetes
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED WONTFIX
Severity: medium
Priority: unspecified
CC: aos-bugs, bbennett, dmellado, jhou, jokerman, mifiedle, pweil, scuppett, spadgett, xtian, yanpzhan, yapei
Version: 4.5
Keywords: Reopened, Upgrades
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-08-21 13:36:05 UTC
Type: Bug

Description Simon 2020-07-06 16:14:53 UTC
Description of problem:
During an upgrade from 4.4.11-x86_64 to 4.5.0-rc.6-x86_64,
the console stays in an unavailable state.

Version-Release number of selected component (if applicable):
4.4.11-x86_64

How reproducible:
1 out of 1 attempt


Steps to Reproduce:
1. Upgrade 4.4.11-x86_64 to 4.5.0-rc.6-x86_64
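
For reference, a minimal sketch of the upgrade invocation (the release image pullspec is an assumption, not taken from this bug; candidate payloads are outside the stable upgrade graph, so an explicit image is typically required):

$ oc adm upgrade \
    --to-image=quay.io/openshift-release-dev/ocp-release:4.5.0-rc.6-x86_64 \
    --allow-explicit-upgrade --force
$ oc get clusterversion -w    # watch upgrade progress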

Actual results:

$ oc get co console
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
console                                    4.5.0-rc.6   False       True          False      106m

Name:         console
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-07-06T08:06:40Z
  Generation:          1
  Resource Version:    79114
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/console
  UID:                 d0126bb1-1460-4d1a-9edb-0a30529b8e3e
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-07-06T08:06:41Z
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2020-07-06T10:04:33Z
    Message:               SyncLoopRefreshProgressing: Working toward version 4.5.0-rc.6
    Reason:                SyncLoopRefresh_InProgress
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-07-06T10:04:37Z
    Message:               DeploymentAvailable: 2 replicas ready at version 4.5.0-rc.6
    Reason:                Deployment_FailedUpdate
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-07-06T08:06:40Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:      operator.openshift.io
    Name:       cluster
    Resource:   consoles
    Group:      config.openshift.io
    Name:       cluster
    Resource:   consoles
    Group:      config.openshift.io
    Name:       cluster
    Resource:   infrastructures
    Group:      config.openshift.io
    Name:       cluster
    Resource:   proxies
    Group:      oauth.openshift.io
    Name:       console
    Resource:   oauthclients
    Group:      
    Name:       openshift-console-operator
    Resource:   namespaces
    Group:      
    Name:       openshift-console
    Resource:   namespaces
    Group:      
    Name:       console-public
    Namespace:  openshift-config-managed
    Resource:   configmaps
  Versions:
    Name:     operator
    Version:  4.5.0-rc.6
Events:       <none>
Name:         dns
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-07-06T08:01:15Z
  Generation:          1
  Resource Version:    47582
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/dns
  UID:                 e6369fae-94ad-4719-84f6-96329f087541
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-07-06T08:15:24Z
    Message:               All desired DNS DaemonSets available and operand Namespace exists
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2020-07-06T09:17:18Z
    Message:               Desired and available number of DNS DaemonSets are equal
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-07-06T08:01:36Z
    Message:               At least 1 DNS DaemonSet available
    Reason:                AsExpected
    Status:                True
    Type:                  Available
  Extension:               <nil>
  Related Objects:
    Group:     
    Name:      openshift-dns-operator
    Resource:  namespaces
    Group:     
    Name:      openshift-dns
    Resource:  namespaces
    Group:     operator.openshift.io
    Name:      
    Resource:  DNS
  Versions:
    Name:     operator
    Version:  4.4.11
    Name:     coredns
    Version:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1cfcdb3c2406c10e980eabd454ef2640877b15d6576e7dfae2beaf129ec94f03
    Name:     openshift-cli
    Version:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7cb23c271c3b40a1733f7ae366167cdb91050a449c263811c066f582b772054c
Events:       <none>
Name:         machine-config
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-07-06T08:00:50Z
  Generation:          1
  Resource Version:    49273
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 bf1c40f8-77e5-4838-a4b1-ea091f526f41
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-07-06T09:21:46Z
    Message:               Cluster has deployed 4.4.11
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-07-06T08:01:53Z
    Message:               Cluster version is 4.4.11
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-07-06T09:21:46Z
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2020-07-06T08:01:53Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
  Related Objects:
    Group:     
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      master
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      worker
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      machine-config-controller
    Resource:  controllerconfigs
  Versions:
    Name:     operator
    Version:  4.4.11
Events:       <none>
Name:         network
Namespace:    
Labels:       <none>
Annotations:  network.operator.openshift.io/last-seen-state: {"DaemonsetStates":[],"DeploymentStates":[]}
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-07-06T07:59:07Z
  Generation:          1
  Resource Version:    78767
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/network
  UID:                 28323b4a-9fa9-4cc6-89b5-0bdc19d01894
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-07-06T07:59:56Z
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2020-07-06T07:59:07Z
    Status:                True
    Type:                  Upgradeable
    Last Transition Time:  2020-07-06T10:04:07Z
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-07-06T08:00:31Z
    Status:                True
    Type:                  Available
  Extension:               <nil>
  Related Objects:
    Group:      
    Name:       applied-cluster
    Namespace:  openshift-network-operator
    Resource:   configmaps
    Group:      apiextensions.k8s.io
    Name:       network-attachment-definitions.k8s.cni.cncf.io
    Resource:   customresourcedefinitions
    Group:      apiextensions.k8s.io
    Name:       ippools.whereabouts.cni.cncf.io
    Resource:   customresourcedefinitions
    Group:      
    Name:       openshift-multus
    Resource:   namespaces
    Group:      rbac.authorization.k8s.io
    Name:       multus
    Resource:   clusterroles
    Group:      
    Name:       multus
    Namespace:  openshift-multus
    Resource:   serviceaccounts
    Group:      rbac.authorization.k8s.io
    Name:       multus
    Resource:   clusterrolebindings
    Group:      rbac.authorization.k8s.io
    Name:       multus-whereabouts
    Resource:   clusterrolebindings
    Group:      rbac.authorization.k8s.io
    Name:       whereabouts-cni
    Resource:   clusterroles
    Group:      
    Name:       cni-binary-copy-script
    Namespace:  openshift-multus
    Resource:   configmaps
    Group:      apps
    Name:       multus
    Namespace:  openshift-multus
    Resource:   daemonsets
    Group:      
    Name:       multus-admission-controller
    Namespace:  openshift-multus
    Resource:   services
    Group:      rbac.authorization.k8s.io
    Name:       multus-admission-controller-webhook
    Resource:   clusterroles
    Group:      rbac.authorization.k8s.io
    Name:       multus-admission-controller-webhook
    Resource:   clusterrolebindings
    Group:      admissionregistration.k8s.io
    Name:       multus.openshift.io
    Resource:   validatingwebhookconfigurations
    Group:      
    Name:       openshift-service-ca
    Namespace:  openshift-network-operator
    Resource:   configmaps
    Group:      apps
    Name:       multus-admission-controller
    Namespace:  openshift-multus
    Resource:   daemonsets
    Group:      monitoring.coreos.com
    Name:       monitor-multus-admission-controller
    Namespace:  openshift-multus
    Resource:   servicemonitors
    Group:      rbac.authorization.k8s.io
    Name:       prometheus-k8s
    Namespace:  openshift-multus
    Resource:   roles
    Group:      rbac.authorization.k8s.io
    Name:       prometheus-k8s
    Namespace:  openshift-multus
    Resource:   rolebindings
    Group:      monitoring.coreos.com
    Name:       prometheus-k8s-rules
    Namespace:  openshift-multus
    Resource:   prometheusrules
    Group:      
    Name:       openshift-ovn-kubernetes
    Resource:   namespaces
    Group:      
    Name:       ovn-kubernetes-node
    Namespace:  openshift-ovn-kubernetes
    Resource:   serviceaccounts
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-node
    Resource:   clusterroles
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-node
    Resource:   clusterrolebindings
    Group:      
    Name:       ovn-kubernetes-controller
    Namespace:  openshift-ovn-kubernetes
    Resource:   serviceaccounts
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-controller
    Resource:   clusterroles
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-controller
    Resource:   clusterrolebindings
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-sbdb
    Namespace:  openshift-ovn-kubernetes
    Resource:   roles
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-sbdb
    Namespace:  openshift-ovn-kubernetes
    Resource:   rolebindings
    Group:      
    Name:       ovnkube-config
    Namespace:  openshift-ovn-kubernetes
    Resource:   configmaps
    Group:      
    Name:       ovnkube-db
    Namespace:  openshift-ovn-kubernetes
    Resource:   services
    Group:      apps
    Name:       ovs-node
    Namespace:  openshift-ovn-kubernetes
    Resource:   daemonsets
    Group:      network.operator.openshift.io
    Name:       ovn
    Namespace:  openshift-ovn-kubernetes
    Resource:   operatorpkis
    Group:      monitoring.coreos.com
    Name:       master-rules
    Namespace:  openshift-ovn-kubernetes
    Resource:   prometheusrules
    Group:      monitoring.coreos.com
    Name:       networking-rules
    Namespace:  openshift-ovn-kubernetes
    Resource:   prometheusrules
    Group:      monitoring.coreos.com
    Name:       monitor-ovn-master
    Namespace:  openshift-ovn-kubernetes
    Resource:   servicemonitors
    Group:      
    Name:       ovn-kubernetes-master
    Namespace:  openshift-ovn-kubernetes
    Resource:   services
    Group:      monitoring.coreos.com
    Name:       monitor-ovn-node
    Namespace:  openshift-ovn-kubernetes
    Resource:   servicemonitors
    Group:      
    Name:       ovn-kubernetes-node
    Namespace:  openshift-ovn-kubernetes
    Resource:   services
    Group:      rbac.authorization.k8s.io
    Name:       prometheus-k8s
    Namespace:  openshift-ovn-kubernetes
    Resource:   roles
    Group:      rbac.authorization.k8s.io
    Name:       prometheus-k8s
    Namespace:  openshift-ovn-kubernetes
    Resource:   rolebindings
    Group:      policy
    Name:       ovn-raft-quorum-guard
    Namespace:  openshift-ovn-kubernetes
    Resource:   poddisruptionbudgets
    Group:      apps
    Name:       ovnkube-master
    Namespace:  openshift-ovn-kubernetes
    Resource:   daemonsets
    Group:      apps
    Name:       ovnkube-node
    Namespace:  openshift-ovn-kubernetes
    Resource:   daemonsets
    Group:      
    Name:       openshift-network-operator
    Resource:   namespaces
  Versions:
    Name:     operator
    Version:  4.4.11
Events:       <none>

Expected results:
The console should be available and upgraded.

Comment 2 Samuel Padgett 2020-07-06 18:28:41 UTC
The console pod is unable to get the OAuth well-known endpoint. This could be a networking issue.

2020-07-06T11:52:09.957946336Z 2020-07-06T11:52:09Z auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
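
One way to confirm the symptom is to hit the same well-known endpoint from the console pod itself (a diagnostic sketch; it assumes the deployment is named console and that curl is present in the image):

$ oc -n openshift-console exec deploy/console -- \
    curl -sk --max-time 10 \
    https://kubernetes.default.svc/.well-known/oauth-authorization-server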

Comment 3 Zac Herman 2020-07-06 19:26:35 UTC
It looks like we've seen this before, possibly.  One bug notes restarting the OVS pod on the node fixed the issue.

https://bugzilla.redhat.com/show_bug.cgi?id=1760103
https://bugzilla.redhat.com/show_bug.cgi?id=1760948
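
On an OVN-Kubernetes cluster the equivalent workaround would target the ovs-node DaemonSet pods in openshift-ovn-kubernetes, e.g. (a sketch; the app=ovs-node label is an assumption):

$ oc -n openshift-ovn-kubernetes get pods -l app=ovs-node -o wide
$ oc -n openshift-ovn-kubernetes delete pod <ovs-node pod on the affected node>   # the DaemonSet recreates it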

Comment 4 zhaozhanqi 2020-07-07 02:54:07 UTC
(In reply to Zac Herman from comment #3)
> It looks like we've seen this before, possibly.  One bug notes restarting
> the OVS pod on the node fixed the issue.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1760103
> https://bugzilla.redhat.com/show_bug.cgi?id=1760948

From the install-config.yaml, the cluster in this issue is running the 'OVNKubernetes' network plugin, while the two bugs above are for 'openshift-sdn'.
So I updated the subcomponent.

Comment 6 Mike Fiedler 2020-07-07 12:24:31 UTC
I hit this issue last night on an OVN cluster on Azure, the same config this was originally reported on.

1 of 2 console pods is crashlooping with "error contacting auth provider".

I tried deleting the crashing console pod, but its replacement is stuck in ContainerCreating with this event:

  Warning  FailedCreatePodSandBox  2s         kubelet, ugdci06153010c-v4ksw-master-2  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_console-764486b8cd-gxp5c_openshift-console_213dfedd-5d81-424b-9270-c7ea20e21b5e_0(89c6fbf63d99c330d48f08a5dbba695d29983f95ea5bf1502193dfc1ff7e6356): Multus: [openshift-console/console-764486b8cd-gxp5c]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking confAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-console/console-764486b8cd-gxp5c] failed to configure pod interface: timed out dumping br-int flow entries for sandbox: timed out waiting for the condition
'

So, another "timed out waiting for condition"
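
To check whether br-int on that node is responsive at all, one could try dumping its flows directly from the OVS pod there (a sketch; the app=ovs-node label and single-container exec are assumptions):

$ OVS_POD=$(oc -n openshift-ovn-kubernetes get pods -l app=ovs-node \
    --field-selector spec.nodeName=<node> -o name)
$ oc -n openshift-ovn-kubernetes exec $OVS_POD -- ovs-ofctl dump-flows br-int | head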


@yadan: in my case the upgrade was successful (console not Progressing), but the console was unavailable.

Comment 7 zhaozhanqi 2020-07-08 08:32:55 UTC
(In reply to Mike Fiedler from comment #6)
> I hit this issue last night on an ovn cluster on Azure.   Same config this
> was originally reported on.
> 
> 1/2 console pods crashlooping with "error contacting auth provider"
> 
> I tried deleting the crashing console pod, but it's replacement stuck in
> ContainerCreating with this event:
> 
>   Warning  FailedCreatePodSandBox  2s         kubelet,
> ugdci06153010c-v4ksw-master-2  Failed to create pod sandbox: rpc error: code
> = Unknown desc = failed to create pod network sandbox
> k8s_console-764486b8cd-gxp5c_openshift-console_213dfedd-5d81-424b-9270-
> c7ea20e21b5e_0(89c6fbf63d99c330d48f08a5dbba695d29983f95ea5bf1502193dfc1ff7e63
> 56): Multus: [openshift-console/console-764486b8cd-gxp5c]: error adding
> container to network "ovn-kubernetes": delegateAdd: error invoking confAdd -
> "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request
> failed with status 400: '[openshift-console/console-764486b8cd-gxp5c] failed
> to configure pod interface: timed out dumping br-int flow entries for
> sandbox: timed out waiting for the condition
> '
> 
> So, another "timed out waiting for condition"
> 
> 
> @yadan for my case, the upgrade was successful (console not progressing) but
> the console was unavailable.

I guess this issue happens to all pods on that node, not only the console pod. Could you help provide the must-gather logs? Thanks, @Mike.
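
(For reference, the standard collection is along these lines; the destination directory is illustrative:)

$ oc adm must-gather --dest-dir=./must-gather-1854175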

Comment 8 Yanping Zhang 2020-07-08 09:01:22 UTC
Checked the upgrade CI: an upgrade from 4.4.11-x86_64 to 4.5.0-rc.7-x86_64 on a cluster with IPI on Azure (FIPS on) and OVN succeeded, and the console could be accessed successfully.
$ oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-rc.7   True        False         False      146m
cloud-credential                           4.5.0-rc.7   True        False         False      176m
cluster-autoscaler                         4.5.0-rc.7   True        False         False      161m
config-operator                            4.5.0-rc.7   True        False         False      105m
console                                    4.5.0-rc.7   True        False         False      64m
csi-snapshot-controller                    4.5.0-rc.7   True        False         False      74m
dns                                        4.5.0-rc.7   True        False         False      169m
etcd                                       4.5.0-rc.7   True        False         False      169m
image-registry                             4.5.0-rc.7   True        False         False      153m
ingress                                    4.5.0-rc.7   True        False         False      152m
insights                                   4.5.0-rc.7   True        False         False      163m
kube-apiserver                             4.5.0-rc.7   True        False         False      167m
kube-controller-manager                    4.5.0-rc.7   True        False         False      168m
kube-scheduler                             4.5.0-rc.7   True        False         False      168m
kube-storage-version-migrator              4.5.0-rc.7   True        False         False      68m
machine-api                                4.5.0-rc.7   True        False         False      163m
machine-approver                           4.5.0-rc.7   True        False         False      92m
machine-config                             4.5.0-rc.7   True        False         False      101m
marketplace                                4.5.0-rc.7   True        False         False      62m
monitoring                                 4.5.0-rc.7   True        False         False      88m
network                                    4.5.0-rc.7   True        False         False      171m
node-tuning                                4.5.0-rc.7   True        False         False      92m
openshift-apiserver                        4.5.0-rc.7   True        False         False      62m
openshift-controller-manager               4.5.0-rc.7   True        False         False      164m
openshift-samples                          4.5.0-rc.7   True        False         False      91m
operator-lifecycle-manager                 4.5.0-rc.7   True        False         False      170m
operator-lifecycle-manager-catalog         4.5.0-rc.7   True        False         False      170m
operator-lifecycle-manager-packageserver   4.5.0-rc.7   True        False         False      62m
service-ca                                 4.5.0-rc.7   True        False         False      171m
service-catalog-apiserver                  4.4.11       True        False         False      61m
service-catalog-controller-manager         4.4.11       True        False         False      65m
storage                                    4.5.0-rc.7   True        False         False      92m
[zyp@MiWiFi-R1CM ~]$ oc get pod -n openshift-console
NAME                         READY   STATUS    RESTARTS   AGE
console-dc6dc747-57v5l       1/1     Running   0          70m
console-dc6dc747-zwrz7       1/1     Running   0          65m
downloads-8546cb9cff-4hhlj   1/1     Running   0          65m
downloads-8546cb9cff-m9qlc   1/1     Running   0          70m

Comment 10 Ricardo Carrillo Cruz 2020-07-14 19:42:28 UTC
I downloaded 4.5.1 and upgraded from the latest 4.4 just fine:

[ricky@localhost openshift-installer]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.12    True        True          2m45s   Working towards 4.5.1: 27% complete
[ricky@localhost openshift-installer]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.1     True        False         49m     Cluster version is 4.5.1
[ricky@localhost openshift-installer]$ oc get co console
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
console   4.5.1     True        False         False      57m

As such I'm closing; it seems this is fixed or was an environmental issue.

Comment 13 Ricardo Carrillo Cruz 2020-08-19 08:15:51 UTC
Punting to 4.7 as I'm unable to reproduce it and it's very intermittent.

Comment 14 Ben Bennett 2020-08-20 13:28:15 UTC
There is no supported upgrade path for ovn-kube from 4.4 to 4.5. The only customer with a supported 4.4 ovn-kube is not upgrading clusters; they are reinstalling.