Created attachment 1655101 [details]
journalctl -u bootkube logs

Description of problem:
This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=1751274, which contains mixed causes of failures, so this new bug points at the one ongoing problem.

Tried installing a cluster on 4.4, but apparently the workers are not coming up, and kube-apiserver and openshift-apiserver are degraded. CVO complains:

I0124 16:00:44.878372 1 leaderelection.go:246] failed to acquire lease openshift-cluster-version/version
E0124 16:01:42.617225 1 leaderelection.go:330] error retrieving resource lock openshift-cluster-version/version

Check the additional info below for the CLI-level investigation. I can share cluster info as well for debugging.

Version-Release number of selected component (if applicable): 4.4.0-0.nightly-2020-01-23-130817

How reproducible: Always

Steps to Reproduce:
1. Install an OVNKubernetes cluster on UPI bare metal

Actual results:
Cluster fails to come up; workers are down.

Expected results:
Cluster should be installed successfully.

Additional info:

$ oc get nodes
NAME             STATUS   ROLES    AGE   VERSION
ip-10-0-57-54    Ready    master   12h   v1.17.1
ip-10-0-59-6     Ready    master   12h   v1.17.1
ip-10-0-67-130   Ready    master   12h   v1.17.1

$ oc get csr
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-b8hx6   12h   system:node:ip-10-0-57-54                                                   Approved,Issued
csr-hwtwx   12h   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-m5kp9   12h   system:node:ip-10-0-59-6                                                    Approved,Issued
csr-pj98n   12h   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-pw4gm   12h   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-rzfbk   12h   system:node:ip-10-0-67-130                                                  Approved,Issued
csr-vljz6   12h   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                                 Unknown     Unknown       True       12h
cloud-credential                           4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
dns                                        4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
insights                                   4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
kube-apiserver                                                                 False       True          True       12h
kube-controller-manager                    4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
kube-scheduler                             4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
kube-storage-version-migrator              4.4.0-0.nightly-2020-01-23-130817   False       False         False      12h
machine-api                                4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
machine-config                             4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
network                                    4.4.0-0.nightly-2020-01-23-130817   True        True          True       12h
node-tuning                                4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
openshift-apiserver                        4.4.0-0.nightly-2020-01-23-130817   False       False         True       12h
openshift-controller-manager               4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
operator-lifecycle-manager                 4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
operator-lifecycle-manager-packageserver                                       False       True          False      12h
service-ca                                 4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
service-catalog-apiserver                  4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
service-catalog-controller-manager         4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          12h     Unable to apply 4.4.0-0.nightly-2020-01-23-130817: an unknown error has occurred

$ oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-289h7   4/4     Running   0          12h
ovnkube-master-bhqs5   4/4     Running   0          12h
ovnkube-master-jhcs9   0/4     Pending   0          12h
ovnkube-node-9fxkj     2/2     Running   0          12h
ovnkube-node-jvtlb     2/2     Running   0          12h
ovnkube-node-mm6f5     2/2     Running   0          12h
ovs-node-mmvdj         1/1     Running   0          13h
ovs-node-mtxl8         1/1     Running   0          13h
ovs-node-xp82s         1/1     Running   0          13h
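Note the third ovnkube-master pod stuck in Pending (0/4). A minimal sketch of how to pull the scheduler's reasoning for that, using the pod name from the listing above (plain oc commands, nothing cluster-specific):

$ oc describe pod ovnkube-master-jhcs9 -n openshift-ovn-kubernetes
$ oc get events -n openshift-ovn-kubernetes --sort-by=.lastTimestamp | tail -20

The Events section of the describe output normally carries a FailedScheduling message (taints, unsatisfied node selector, etc.) explaining why the pod never landed on a node.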
*** Bug 1751274 has been marked as a duplicate of this bug. ***
Assigning to Ricky.
Hi there, can you please reproduce this and give me a link to the environment? Thanks
The kube-apiserver static pods died:

Name:         kube-apiserver
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-02-03T15:29:11Z
  Generation:          1
  Resource Version:    278447
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/kube-apiserver
  UID:                 160bc85d-87e6-4a2b-84a9-6133866285f7
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-02-03T15:31:25Z
    Message:               NodeInstallerDegraded: 1 nodes are failing on revision 4:
                           NodeInstallerDegraded:
                           StaticPodsDegraded: pods "kube-apiserver-ip-10-0-55-223" not found
                           StaticPodsDegraded: pods "kube-apiserver-ip-10-0-76-10" not found
                           StaticPodsDegraded: pods "kube-apiserver-ip-10-0-53-159" not found
    Reason:                NodeInstaller_InstallerPodFailed::StaticPods_Error
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-02-03T15:29:17Z
    Message:               Progressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 5
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-02-03T15:29:11Z
    Message:               Available: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 5
    Reason:                _ZeroNodesActive
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-02-03T15:29:11Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  kubeapiservers
    Group:     apiextensions.k8s.io
    Name:
    Resource:  customresourcedefinitions
    Group:
    Name:      openshift-config
    Resource:  namespaces
    Group:
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:
    Name:      openshift-kube-apiserver-operator
    Resource:  namespaces
    Group:
    Name:      openshift-kube-apiserver
    Resource:  namespaces
  Versions:
    Name:     raw-internal
    Version:  4.4.0-0.nightly-2020-02-03-081920
Events:  <none>

The network operator was healthy, though:

Name:         network
Namespace:
Labels:       <none>
Annotations:  network.operator.openshift.io/last-seen-state: {"DaemonsetStates":[],"DeploymentStates":[]}
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-02-03T15:26:49Z
  Generation:          1
  Resource Version:    86143
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/network
  UID:                 22cb8503-d6d6-4d15-ba32-8dd964f2c6f3
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-02-03T20:31:56Z
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2020-02-03T15:26:49Z
    Status:                True
    Type:                  Upgradeable
    Last Transition Time:  2020-02-03T15:35:40Z
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-02-03T15:28:17Z
    Status:                True
    Type:                  Available
  Extension:  <nil>
  Related Objects:
    Group:
    Name:       applied-cluster
    Namespace:  openshift-network-operator
    Resource:   configmaps
    Group:      apiextensions.k8s.io
    Name:       network-attachment-definitions.k8s.cni.cncf.io
    Resource:   customresourcedefinitions
    Group:
    Name:       openshift-multus
    Resource:   namespaces
    Group:      rbac.authorization.k8s.io
    Name:       multus
    Resource:   clusterroles
    Group:
    Name:       multus
    Namespace:  openshift-multus
    Resource:   serviceaccounts
    Group:      rbac.authorization.k8s.io
    Name:       multus
    Resource:   clusterrolebindings
    Group:      apps
    Name:       multus
    Namespace:  openshift-multus
    Resource:   daemonsets
    Group:
    Name:       multus-admission-controller
    Namespace:  openshift-multus
    Resource:   services
    Group:      rbac.authorization.k8s.io
    Name:       multus-admission-controller-webhook
    Resource:   clusterroles
    Group:      rbac.authorization.k8s.io
    Name:       multus-admission-controller-webhook
    Resource:   clusterrolebindings
    Group:      admissionregistration.k8s.io
    Name:       multus.openshift.io
    Resource:   validatingwebhookconfigurations
    Group:
    Name:       openshift-service-ca
    Namespace:  openshift-network-operator
    Resource:   configmaps
    Group:      apps
    Name:       multus-admission-controller
    Namespace:  openshift-multus
    Resource:   daemonsets
    Group:
    Name:       multus-admission-controller-monitor-service
    Namespace:  openshift-multus
    Resource:   services
    Group:      rbac.authorization.k8s.io
    Name:       prometheus-k8s
    Namespace:  openshift-multus
    Resource:   roles
    Group:      rbac.authorization.k8s.io
    Name:       prometheus-k8s
    Namespace:  openshift-multus
    Resource:   rolebindings
    Group:
    Name:       openshift-ovn-kubernetes
    Resource:   namespaces
    Group:
    Name:       ovn-kubernetes-node
    Namespace:  openshift-ovn-kubernetes
    Resource:   serviceaccounts
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-node
    Resource:   clusterroles
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-node
    Resource:   clusterrolebindings
    Group:
    Name:       ovn-kubernetes-controller
    Namespace:  openshift-ovn-kubernetes
    Resource:   serviceaccounts
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-controller
    Resource:   clusterroles
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-controller
    Resource:   clusterrolebindings
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-sbdb
    Namespace:  openshift-ovn-kubernetes
    Resource:   roles
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-sbdb
    Namespace:  openshift-ovn-kubernetes
    Resource:   rolebindings
    Group:
    Name:       ovnkube-config
    Namespace:  openshift-ovn-kubernetes
    Resource:   configmaps
    Group:
    Name:       ovnkube-db
    Namespace:  openshift-ovn-kubernetes
    Resource:   services
    Group:      apps
    Name:       ovs-node
    Namespace:  openshift-ovn-kubernetes
    Resource:   daemonsets
    Group:      network.operator.openshift.io
    Name:       ovn
    Namespace:  openshift-ovn-kubernetes
    Resource:   operatorpkis
    Group:
    Name:       ovn-kubernetes-master
    Namespace:  openshift-ovn-kubernetes
    Resource:   services
    Group:
    Name:       ovn-kubernetes-node
    Namespace:  openshift-ovn-kubernetes
    Resource:   services
    Group:      rbac.authorization.k8s.io
    Name:       prometheus-k8s
    Namespace:  openshift-ovn-kubernetes
    Resource:   roles
    Group:      rbac.authorization.k8s.io
    Name:       prometheus-k8s
    Namespace:  openshift-ovn-kubernetes
    Resource:   rolebindings
    Group:      policy
    Name:       ovn-raft-quorum-guard
    Namespace:  openshift-ovn-kubernetes
    Resource:   poddisruptionbudgets
    Group:      apps
    Name:       ovnkube-master
    Namespace:  openshift-ovn-kubernetes
    Resource:   daemonsets
    Group:      apps
    Name:       ovnkube-node
    Namespace:  openshift-ovn-kubernetes
    Resource:   daemonsets
    Group:
    Name:       openshift-network-operator
    Resource:   namespaces
  Versions:
    Name:     operator
    Version:  4.4.0-0.nightly-2020-02-03-081920
Events:  <none>

We need to look at the node logs to ascertain why the apiserver died. Is it possible to SSH into them? Thanks
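If anyone can get onto a node before the logs rotate, a sketch of the node-level triage I have in mind (node names taken from the degraded condition above; standard OpenShift/RHCOS tooling, adapt as needed):

# Without SSH, pull the kubelet journal through the API:
$ oc adm node-logs ip-10-0-55-223 -u kubelet | grep -i kube-apiserver

# With SSH on a master:
$ journalctl -u kubelet --no-pager | grep -i kube-apiserver
$ sudo crictl ps -a | grep kube-apiserver    # exited containers stick around for a while
$ sudo crictl logs <container-id>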
I couldn't find the kube-apiserver logs; they must have been GC'd. I'd need someone from QE closer to my region to spin it up, so I can jump on the env quickly and tail the logs before they eventually get lost.
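For the next reproduction, a rough sketch of what to capture before the logs are lost (assuming SSH access to a master; /var/log/pods is the standard kubelet log location on RHCOS, so the exact path may differ):

# From outside, snapshot whatever the cluster can still report:
$ oc adm must-gather

# On the master itself, container logs survive on disk after the pods are deleted:
$ ls /var/log/pods/ | grep openshift-kube-apiserver
$ sudo tail -n 200 /var/log/pods/openshift-kube-apiserver_kube-apiserver-*/kube-apiserver/0.log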
Hi Ricardo, I still see this issue in the latest 4.4 image. QE can retest it after https://bugzilla.redhat.com/show_bug.cgi?id=1796844 is fixed.
Moving to 4.5 since this is an ovn-kubernetes issue.
This issue could not be reproduced with the latest 4.4 and 4.5 nightly builds; it appears to have been fixed by a merged PR. Moving this bug to VERIFIED.
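For reference, a quick health check one might run against a fresh nightly install to confirm (plain oc commands; the awk filter prints any operator that is not Available=True/Progressing=False/Degraded=False, and it relies on the VERSION column being populated, which holds once installation succeeds):

$ oc get clusterversion
$ oc get co --no-headers | awk '$3 != "True" || $4 != "False" || $5 != "False"'
$ oc get pods -n openshift-ovn-kubernetes -o wide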
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409