Created attachment 1655101 [details]
journalctl -u bootkube logs

Description of problem:
This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=1751274, which contains mixed causes of failures, so this new bug points at the one ongoing problem.

Tried installing a cluster on 4.4, but apparently the workers are not coming up, and kube-apiserver and openshift-apiserver are degraded. CVO complains:

I0124 16:00:44.878372 1 leaderelection.go:246] failed to acquire lease openshift-cluster-version/version
E0124 16:01:42.617225 1 leaderelection.go:330] error retrieving resource lock openshift-cluster-version/version

Check the additional info below for the CLI-level investigation. I can share cluster info as well for debugging.

Version-Release number of selected component (if applicable): 4.4.0-0.nightly-2020-01-23-130817

How reproducible: Always

Steps to Reproduce:
1. Install an OVNKubernetes cluster on UPI bare metal

Actual results:
Cluster fails to come up; workers are down.

Expected results:
Cluster should be installed successfully.

Additional info:

$ oc get nodes
NAME             STATUS   ROLES    AGE   VERSION
ip-10-0-57-54    Ready    master   12h   v1.17.1
ip-10-0-59-6     Ready    master   12h   v1.17.1
ip-10-0-67-130   Ready    master   12h   v1.17.1

$ oc get csr
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-b8hx6   12h   system:node:ip-10-0-57-54                                                   Approved,Issued
csr-hwtwx   12h   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-m5kp9   12h   system:node:ip-10-0-59-6                                                    Approved,Issued
csr-pj98n   12h   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-pw4gm   12h   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-rzfbk   12h   system:node:ip-10-0-67-130                                                  Approved,Issued
csr-vljz6   12h   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                                 Unknown     Unknown       True       12h
cloud-credential                           4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
dns                                        4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
insights                                   4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
kube-apiserver                                                                 False       True          True       12h
kube-controller-manager                    4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
kube-scheduler                             4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
kube-storage-version-migrator              4.4.0-0.nightly-2020-01-23-130817   False       False         False      12h
machine-api                                4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
machine-config                             4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
network                                    4.4.0-0.nightly-2020-01-23-130817   True        True          True       12h
node-tuning                                4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
openshift-apiserver                        4.4.0-0.nightly-2020-01-23-130817   False       False         True       12h
openshift-controller-manager               4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
operator-lifecycle-manager                 4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
operator-lifecycle-manager-packageserver                                       False       True          False      12h
service-ca                                 4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
service-catalog-apiserver                  4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h
service-catalog-controller-manager         4.4.0-0.nightly-2020-01-23-130817   True        False         False      12h

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          12h     Unable to apply 4.4.0-0.nightly-2020-01-23-130817: an unknown error has occurred

$ oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-289h7   4/4     Running   0          12h
ovnkube-master-bhqs5   4/4     Running   0          12h
ovnkube-master-jhcs9   0/4     Pending   0          12h
ovnkube-node-9fxkj     2/2     Running   0          12h
ovnkube-node-jvtlb     2/2     Running   0          12h
ovnkube-node-mm6f5     2/2     Running   0          12h
ovs-node-mmvdj         1/1     Running   0          13h
ovs-node-mtxl8         1/1     Running   0          13h
ovs-node-xp82s         1/1     Running   0          13h
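Note the third ovnkube-master pod stuck in Pending (0/4). A minimal sketch of how to pull the scheduler's reasoning for that, using the pod name from the listing above (plain oc commands, nothing cluster-specific):

$ oc describe pod ovnkube-master-jhcs9 -n openshift-ovn-kubernetes
$ oc get events -n openshift-ovn-kubernetes --sort-by=.lastTimestamp | tail -20

The Events section of the describe output normally carries a FailedScheduling message (taints, unsatisfied node selector, etc.) explaining why the pod never landed on a node.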
*** Bug 1751274 has been marked as a duplicate of this bug. ***
Assigning to Ricky.
Hi there, can you please reproduce this and give me a link to the environment? Thanks
The kube-apiserver static pods died:

Name:         kube-apiserver
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-02-03T15:29:11Z
  Generation:          1
  Resource Version:    278447
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/kube-apiserver
  UID:                 160bc85d-87e6-4a2b-84a9-6133866285f7
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-02-03T15:31:25Z
    Message:               NodeInstallerDegraded: 1 nodes are failing on revision 4:
                           NodeInstallerDegraded:
                           StaticPodsDegraded: pods "kube-apiserver-ip-10-0-55-223" not found
                           StaticPodsDegraded: pods "kube-apiserver-ip-10-0-76-10" not found
                           StaticPodsDegraded: pods "kube-apiserver-ip-10-0-53-159" not found
    Reason:                NodeInstaller_InstallerPodFailed::StaticPods_Error
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-02-03T15:29:17Z
    Message:               Progressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 5
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-02-03T15:29:11Z
    Message:               Available: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 5
    Reason:                _ZeroNodesActive
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-02-03T15:29:11Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  kubeapiservers
    Group:     apiextensions.k8s.io
    Name:
    Resource:  customresourcedefinitions
    Group:
    Name:      openshift-config
    Resource:  namespaces
    Group:
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:
    Name:      openshift-kube-apiserver-operator
    Resource:  namespaces
    Group:
    Name:      openshift-kube-apiserver
    Resource:  namespaces
  Versions:
    Name:     raw-internal
    Version:  4.4.0-0.nightly-2020-02-03-081920
Events:  <none>

The network operator was healthy, though:

Name:         network
Namespace:
Labels:       <none>
Annotations:  network.operator.openshift.io/last-seen-state: {"DaemonsetStates":[],"DeploymentStates":[]}
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-02-03T15:26:49Z
  Generation:          1
  Resource Version:    86143
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/network
  UID:                 22cb8503-d6d6-4d15-ba32-8dd964f2c6f3
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-02-03T20:31:56Z
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2020-02-03T15:26:49Z
    Status:                True
    Type:                  Upgradeable
    Last Transition Time:  2020-02-03T15:35:40Z
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-02-03T15:28:17Z
    Status:                True
    Type:                  Available
  Extension:  <nil>
  Related Objects:
    Group:
    Name:       applied-cluster
    Namespace:  openshift-network-operator
    Resource:   configmaps
    Group:      apiextensions.k8s.io
    Name:       network-attachment-definitions.k8s.cni.cncf.io
    Resource:   customresourcedefinitions
    Group:
    Name:       openshift-multus
    Resource:   namespaces
    Group:      rbac.authorization.k8s.io
    Name:       multus
    Resource:   clusterroles
    Group:
    Name:       multus
    Namespace:  openshift-multus
    Resource:   serviceaccounts
    Group:      rbac.authorization.k8s.io
    Name:       multus
    Resource:   clusterrolebindings
    Group:      apps
    Name:       multus
    Namespace:  openshift-multus
    Resource:   daemonsets
    Group:
    Name:       multus-admission-controller
    Namespace:  openshift-multus
    Resource:   services
    Group:      rbac.authorization.k8s.io
    Name:       multus-admission-controller-webhook
    Resource:   clusterroles
    Group:      rbac.authorization.k8s.io
    Name:       multus-admission-controller-webhook
    Resource:   clusterrolebindings
    Group:      admissionregistration.k8s.io
    Name:       multus.openshift.io
    Resource:   validatingwebhookconfigurations
    Group:
    Name:       openshift-service-ca
    Namespace:  openshift-network-operator
    Resource:   configmaps
    Group:      apps
    Name:       multus-admission-controller
    Namespace:  openshift-multus
    Resource:   daemonsets
    Group:
    Name:       multus-admission-controller-monitor-service
    Namespace:  openshift-multus
    Resource:   services
    Group:      rbac.authorization.k8s.io
    Name:       prometheus-k8s
    Namespace:  openshift-multus
    Resource:   roles
    Group:      rbac.authorization.k8s.io
    Name:       prometheus-k8s
    Namespace:  openshift-multus
    Resource:   rolebindings
    Group:
    Name:       openshift-ovn-kubernetes
    Resource:   namespaces
    Group:
    Name:       ovn-kubernetes-node
    Namespace:  openshift-ovn-kubernetes
    Resource:   serviceaccounts
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-node
    Resource:   clusterroles
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-node
    Resource:   clusterrolebindings
    Group:
    Name:       ovn-kubernetes-controller
    Namespace:  openshift-ovn-kubernetes
    Resource:   serviceaccounts
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-controller
    Resource:   clusterroles
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-controller
    Resource:   clusterrolebindings
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-sbdb
    Namespace:  openshift-ovn-kubernetes
    Resource:   roles
    Group:      rbac.authorization.k8s.io
    Name:       openshift-ovn-kubernetes-sbdb
    Namespace:  openshift-ovn-kubernetes
    Resource:   rolebindings
    Group:
    Name:       ovnkube-config
    Namespace:  openshift-ovn-kubernetes
    Resource:   configmaps
    Group:
    Name:       ovnkube-db
    Namespace:  openshift-ovn-kubernetes
    Resource:   services
    Group:      apps
    Name:       ovs-node
    Namespace:  openshift-ovn-kubernetes
    Resource:   daemonsets
    Group:      network.operator.openshift.io
    Name:       ovn
    Namespace:  openshift-ovn-kubernetes
    Resource:   operatorpkis
    Group:
    Name:       ovn-kubernetes-master
    Namespace:  openshift-ovn-kubernetes
    Resource:   services
    Group:
    Name:       ovn-kubernetes-node
    Namespace:  openshift-ovn-kubernetes
    Resource:   services
    Group:      rbac.authorization.k8s.io
    Name:       prometheus-k8s
    Namespace:  openshift-ovn-kubernetes
    Resource:   roles
    Group:      rbac.authorization.k8s.io
    Name:       prometheus-k8s
    Namespace:  openshift-ovn-kubernetes
    Resource:   rolebindings
    Group:      policy
    Name:       ovn-raft-quorum-guard
    Namespace:  openshift-ovn-kubernetes
    Resource:   poddisruptionbudgets
    Group:      apps
    Name:       ovnkube-master
    Namespace:  openshift-ovn-kubernetes
    Resource:   daemonsets
    Group:      apps
    Name:       ovnkube-node
    Namespace:  openshift-ovn-kubernetes
    Resource:   daemonsets
    Group:
    Name:       openshift-network-operator
    Resource:   namespaces
  Versions:
    Name:     operator
    Version:  4.4.0-0.nightly-2020-02-03-081920
Events:  <none>

We need to look at the node logs to ascertain why the apiserver died. Is it possible to SSH into them? Thanks
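If anyone can get onto a node before the logs rotate, a sketch of the node-level triage I have in mind (node names taken from the degraded condition above; standard OpenShift/RHCOS tooling, adapt as needed):

# Without SSH, pull the kubelet journal through the API:
$ oc adm node-logs ip-10-0-55-223 -u kubelet | grep -i kube-apiserver

# With SSH on a master:
$ journalctl -u kubelet --no-pager | grep -i kube-apiserver
$ sudo crictl ps -a | grep kube-apiserver    # exited containers stick around for a while
$ sudo crictl logs <container-id>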
I couldn't find the kube-apiserver logs; they must have been GC'd. I'd need someone from QE closer to my region to spin it up, so I can jump on the env quickly and tail the logs before they eventually get lost.
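For the next reproduction, a rough sketch of what to capture before the logs are lost (assuming SSH access to a master; /var/log/pods is the standard kubelet log location on RHCOS, so the exact path may differ):

# From outside, snapshot whatever the cluster can still report:
$ oc adm must-gather

# On the master itself, container logs survive on disk after the pods are deleted:
$ ls /var/log/pods/ | grep openshift-kube-apiserver
$ sudo tail -n 200 /var/log/pods/openshift-kube-apiserver_kube-apiserver-*/kube-apiserver/0.log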
Hi Ricardo, I still see this issue in the latest 4.4 image. QE can retest it after https://bugzilla.redhat.com/show_bug.cgi?id=1796844 is fixed.
Moving to 4.5 since this is an ovn-kubernetes issue.
This issue could not be reproduced with the latest 4.4 and 4.5 nightly builds; it appears to have been fixed by a merged PR. Moving this bug to VERIFIED.
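For reference, a quick health check one might run against a fresh nightly install to confirm (plain oc commands; the awk filter prints any operator that is not Available=True/Progressing=False/Degraded=False, and it relies on the VERSION column being populated, which holds once installation succeeds):

$ oc get clusterversion
$ oc get co --no-headers | awk '$3 != "True" || $4 != "False" || $5 != "False"'
$ oc get pods -n openshift-ovn-kubernetes -o wide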
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409