Bug 1801898 - [etcd-operator] etcd operator failing due to node name inconsistencies across platforms
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd Operator
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Duplicates: 1802649 1802678
Depends On:
Blocks:
Reported: 2020-02-11 21:02 UTC by Yu Qi Zhang
Modified: 2020-06-02 09:41 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:36:06 UTC
Target Upstream Version:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-etcd-operator pull 143 | 0 | None | closed | Bug 1801898: remove dependency on node internal DNS name | 2021-02-06 14:26:35 UTC
Red Hat Product Errata | RHBA-2020:0581 | 0 | None | None | None | 2020-05-04 11:36:27 UTC

Description Yu Qi Zhang 2020-02-11 21:02:40 UTC
Description of problem:

On azure/metal/ovirt, installs are failing with:

level=fatal msg="failed to initialize the cluster: Cluster operator etcd is reporting a failure: InstallerControllerDegraded: missing required resources: [configmaps: config-1,etcd-metrics-proxy-client-ca-1,etcd-metrics-proxy-serving-ca-1,etcd-peer-client-ca-1,etcd-pod-1,etcd-serving-ca-1, secrets: etcd-all-peer-1,etcd-all-serving-1,etcd-all-serving-metrics-1]\nStaticPodsDegraded: pods \"etcd-ci-op-7lhbj7qi-761c8-jm6jf-master-2\" not found\nStaticPodsDegraded: pods \"etcd-ci-op-7lhbj7qi-761c8-jm6jf-master-1\" not found\nStaticPodsDegraded: pods \"etcd-ci-op-7lhbj7qi-761c8-jm6jf-master-0\" not found\nRevisionControllerDegraded: configmaps \"etcd-pod\" not found\nTargetConfigControllerDegraded: \"configmap/kube-apiserver-pod\": node/ci-op-7lhbj7qi-761c8-jm6jf-master-2 missing InternalDNS"

Sam did some digging and came up with: https://github.com/openshift/cluster-etcd-operator/issues/115
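
For readers unfamiliar with the failure mode: the `missing InternalDNS` error above suggests that an address of type `InternalDNS` was expected on the node object and was not present on these platforms. The Python sketch below only illustrates that kind of lookup with a fallback to `InternalIP`. It is not the operator's code (the operator is written in Go), and the function name `pick_node_address` and the sample addresses are hypothetical. The address type names (`Hostname`, `InternalIP`, `InternalDNS`) are the ones defined by the Kubernetes Node API.

```python
from typing import Dict, List, Optional, Sequence


def pick_node_address(
    addresses: List[Dict[str, str]],
    preferred: Sequence[str] = ("InternalIP", "InternalDNS"),
) -> Optional[str]:
    """Return the first non-empty address matching the preferred types.

    `addresses` mirrors the shape of `status.addresses` on a Node object:
    a list of {"type": ..., "address": ...} entries. Types are tried in
    the order given by `preferred`; None is returned if nothing matches.
    """
    for wanted in preferred:
        for entry in addresses:
            if entry.get("type") == wanted and entry.get("address"):
                return entry["address"]
    return None


if __name__ == "__main__":
    # Hypothetical node that, like the failing masters, has no InternalDNS entry.
    node_addresses = [
        {"type": "Hostname", "address": "master-2"},
        {"type": "InternalIP", "address": "10.0.0.5"},
    ]
    # Requiring InternalDNS alone finds nothing on such a node.
    print(pick_node_address(node_addresses, preferred=("InternalDNS",)))
    # Preferring InternalIP still yields a usable address.
    print(pick_node_address(node_addresses))
```

Trying `InternalIP` first is an assumption made for this illustration. The linked pull request title only states that the dependency on the node internal DNS name was removed.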

This is blocking many of our jobs, for example:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/793

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/7091/rehearse-7091-pull-ci-openshift-installer-master-e2e-ovirt/2

Version-Release number of selected component (if applicable):
4.4

How reproducible:
Always

Comment 6 Abhinav Dahiya 2020-02-13 17:19:26 UTC
*** Bug 1802678 has been marked as a duplicate of this bug. ***

Comment 7 Abhinav Dahiya 2020-02-13 17:19:28 UTC
*** Bug 1802649 has been marked as a duplicate of this bug. ***

Comment 11 Ray Ashworth 2020-02-17 17:02:58 UTC
Tested the latest nightly build (it looks like it was posted Saturday, 2/15): no change. Do we need a new RHCOS image?

Comment 12 isaic 2020-02-17 19:56:06 UTC
Can you confirm what the latest 4.4 CoreOS OVA file should be?

We were originally told to use this image, dated 1/24/20: https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.4/44.81.202001241431.0/x86_64/rhcos-44.81.202001241431.0-vmware.x86_64.ova. It is the one that fails when we try to install OCP 4.4 on VMware.

We then checked whether a newer CoreOS OVA file was available. We noticed one dated 2/07/20, which is newer than the one we are using (but likely NOT related to this Bugzilla), here: https://github.com/openshift/installer/blob/master/data/data/rhcos.json#L123-L127

https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.4/44.81.202002071430-0/x86_64/rhcos-44.81.202002071430-0-vmware.x86_64.ova

Let us know. Thanks!

Comment 13 ge liu 2020-02-20 07:45:46 UTC
Verified on UPI OSP with 4.4.0-0.nightly-2020-02-19-173908. Also tried other platforms (azure/vsphere/...), but those are blocked by another bug: https://bugzilla.redhat.com/show_bug.cgi?id=1798945.

Comment 15 errata-xmlrpc 2020-05-04 11:36:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Comment 16 Nicolas Marcq 2020-05-29 08:06:08 UTC
Hi,

The mirror still contains only the 4.4.3 image:
https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.4/latest/

Is there a way to specify the image version to pull during the bootstrap installation? I use the latest openshift-install, 4.4.5, but it seems that it is the OVA that actually determines the installed OpenShift version.

Thanks.

Comment 17 Sam Batschelet 2020-05-29 13:55:49 UTC
>Hi,

>The mirror still contains only the 4.4.3 image:
>https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.4/latest/

>Is there a way to specify the image version to pull during the bootstrap installation? I use the latest openshift-install, 4.4.5, but it seems that it is the OVA that actually determines the installed OpenShift version.

>Thanks.


Thank you for the report; we are looking into this.

Comment 18 Sam Batschelet 2020-05-29 22:12:49 UTC
I spoke to the ART team, which handles these assets. They said that these images, although they reference 4.4.3, are the latest for the RHCOS dependencies. So, in short, what you are seeing is expected.

Comment 19 Nicolas Marcq 2020-06-02 09:41:54 UTC
OK thanks.

It's just that I still have the issue with the installer 4.4.5 and the RHCOS image 4.4.3.

###################################

oc describe co etcd                                                    
Name:         etcd
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-06-02T09:11:16Z
  Generation:          1
  Resource Version:    45747
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/etcd
  UID:                 d5f61fc1-5064-409a-b7ca-c7357e22e759
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-06-02T09:13:39Z
    Message:               StaticPodsDegraded: pods "etcd-localhost" not found
InstallerControllerDegraded: missing required resources: [configmaps: etcd-scripts,restore-etcd-pod, configmaps: config-1,etcd-metrics-proxy-client-ca-1,etcd-metrics-proxy-serving-ca-1,etcd-peer-client-ca-1,etcd-pod-1,etcd-serving-ca-1, secrets: etcd-all-peer-1,etcd-all-serving-1,etcd-all-serving-metrics-1]
EnvVarControllerDegraded: at least three nodes are required to have a valid configuration
RevisionControllerDegraded: configmaps "etcd-pod" not found
ScriptControllerDegraded: "configmap/etcd-pod": missing env var values

###################################
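
The multi-line `Message` above packs several controller conditions into one string, with each line prefixed by the reporting controller (for example `StaticPodsDegraded`). A small Python sketch, assuming that `Prefix: detail` line format holds, can group the lines by controller for easier reading. The helper name `group_by_controller` is hypothetical and is not part of `oc` or the operator.

```python
from typing import Dict, List


def group_by_controller(message: str) -> Dict[str, List[str]]:
    """Group condition-message lines by their '<Something>Degraded' prefix.

    Lines that do not start with such a prefix are collected under "unknown".
    """
    grouped: Dict[str, List[str]] = {}
    for raw_line in message.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # Split only on the first ": " so details containing colons stay intact.
        prefix, sep, detail = line.partition(": ")
        if sep and prefix.endswith("Degraded"):
            grouped.setdefault(prefix, []).append(detail)
        else:
            grouped.setdefault("unknown", []).append(line)
    return grouped


if __name__ == "__main__":
    # Lines copied from the ClusterOperator message in this comment.
    sample = "\n".join([
        'StaticPodsDegraded: pods "etcd-localhost" not found',
        "EnvVarControllerDegraded: at least three nodes are required to have a valid configuration",
        'RevisionControllerDegraded: configmaps "etcd-pod" not found',
        'ScriptControllerDegraded: "configmap/etcd-pod": missing env var values',
    ])
    for controller, details in group_by_controller(sample).items():
        print(controller)
        for detail in details:
            print("  " + detail)
```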

