Bug 1801898

Summary: [etcd-operator] etcd operator failing due to node name inconsistencies across platforms
Product: OpenShift Container Platform
Reporter: Yu Qi Zhang <jerzhang>
Component: Etcd Operator
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED ERRATA
QA Contact: ge liu <geliu>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.4
CC: ashworth, cglombek, eslutsky, isaic, jcallen, jiajliu, juriarte, nicolas.marcq, rgolan, slowrie, wjiang, wsun, yanyang, yprokule
Target Milestone: ---
Keywords: TestBlocker
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-04 11:36:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Yu Qi Zhang 2020-02-11 21:02:40 UTC
Description of problem:

On azure/metal/ovirt, installs are failing with:

level=fatal msg="failed to initialize the cluster: Cluster operator etcd is reporting a failure: InstallerControllerDegraded: missing required resources: [configmaps: config-1,etcd-metrics-proxy-client-ca-1,etcd-metrics-proxy-serving-ca-1,etcd-peer-client-ca-1,etcd-pod-1,etcd-serving-ca-1, secrets: etcd-all-peer-1,etcd-all-serving-1,etcd-all-serving-metrics-1]\nStaticPodsDegraded: pods \"etcd-ci-op-7lhbj7qi-761c8-jm6jf-master-2\" not found\nStaticPodsDegraded: pods \"etcd-ci-op-7lhbj7qi-761c8-jm6jf-master-1\" not found\nStaticPodsDegraded: pods \"etcd-ci-op-7lhbj7qi-761c8-jm6jf-master-0\" not found\nRevisionControllerDegraded: configmaps \"etcd-pod\" not found\nTargetConfigControllerDegraded: \"configmap/kube-apiserver-pod\": node/ci-op-7lhbj7qi-761c8-jm6jf-master-2 missing InternalDNS"
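
The last part of the message says that the node object has no address of type InternalDNS. A minimal sketch of that check, assuming node objects shaped like the `items` of `oc get nodes -o json` (each `status.addresses` entry has a `type` and an `address`). The function name is hypothetical, the sample addresses are invented for illustration, and this is not the operator's actual code:

```python
def nodes_missing_address_type(nodes, address_type="InternalDNS"):
    """Return names of nodes whose status.addresses has no entry of address_type."""
    missing = []
    for node in nodes:
        addresses = node.get("status", {}).get("addresses", [])
        if not any(addr.get("type") == address_type for addr in addresses):
            missing.append(node["metadata"]["name"])
    return missing


# Hypothetical sample: the node name comes from the log above,
# the address values are made up.
sample_nodes = [
    {
        "metadata": {"name": "ci-op-7lhbj7qi-761c8-jm6jf-master-2"},
        "status": {
            "addresses": [
                {"type": "Hostname", "address": "ci-op-7lhbj7qi-761c8-jm6jf-master-2"},
                {"type": "InternalIP", "address": "10.0.0.6"},
            ]
        },
    }
]

print(nodes_missing_address_type(sample_nodes))
```

A node that reports only Hostname and InternalIP addresses, like the sample, would be listed as missing InternalDNS.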

Sam did some digging and came up with: https://github.com/openshift/cluster-etcd-operator/issues/115

This is blocking many of our jobs, for example:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/793

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/7091/rehearse-7091-pull-ci-openshift-installer-master-e2e-ovirt/2

Version-Release number of selected component (if applicable):
4.4

How reproducible:
Always

Comment 6 Abhinav Dahiya 2020-02-13 17:19:26 UTC
*** Bug 1802678 has been marked as a duplicate of this bug. ***

Comment 7 Abhinav Dahiya 2020-02-13 17:19:28 UTC
*** Bug 1802649 has been marked as a duplicate of this bug. ***

Comment 11 Ray Ashworth 2020-02-17 17:02:58 UTC
Tested the latest nightly build (it looks like it was posted Saturday 2/15): no change. Do we need a new RHCOS image?

Comment 12 isaic 2020-02-17 19:56:06 UTC
Can you confirm what the latest 4.4 CoreOS OVA file should be?  

We were originally told to use this one, dated 1/24/20: https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.4/44.81.202001241431.0/x86_64/rhcos-44.81.202001241431.0-vmware.x86_64.ova  It is the one that fails when we try to install OCP 4.4 on VMware.

We then checked whether a newer version of the CoreOS OVA file exists, and noticed one that is newer than the one we are using (though likely NOT related to this Bugzilla), dated 2/07/20, here: https://github.com/openshift/installer/blob/master/data/data/rhcos.json#L123-L127

https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.4/44.81.202002071430-0/x86_64/rhcos-44.81.202002071430-0-vmware.x86_64.ova

Let us know. Thanks!
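
For comparing the two OVA builds, the build IDs in the URLs above can be ordered by date. This is only a sketch: it assumes that the third field of the build ID is a `YYYYMMDDHHMM` timestamp, which matches the 1/24/20 and 2/07/20 dates mentioned above but is not confirmed anywhere in this bug. The `build_timestamp` helper is hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: <major>.<minor>.<YYYYMMDDHHMM><"." or "-"><counter>
BUILD_ID = re.compile(r"^(\d+)\.(\d+)\.(\d{12})[.-](\d+)$")


def build_timestamp(build_id: str) -> datetime:
    """Extract the assumed timestamp field from an RHCOS build ID."""
    match = BUILD_ID.match(build_id)
    if match is None:
        raise ValueError(f"unrecognized build ID: {build_id!r}")
    return datetime.strptime(match.group(3), "%Y%m%d%H%M")


# The two build IDs taken from the OVA URLs in this comment.
builds = ["44.81.202001241431.0", "44.81.202002071430-0"]
newest = max(builds, key=build_timestamp)
print(newest, build_timestamp(newest).date())
```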

Comment 13 ge liu 2020-02-20 07:45:46 UTC
Verified on UPI OSP with 4.4.0-0.nightly-2020-02-19-173908. Also tried other platforms (azure/vsphere/...), but was blocked by another bug: https://bugzilla.redhat.com/show_bug.cgi?id=1798945.

Comment 15 errata-xmlrpc 2020-05-04 11:36:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Comment 16 Nicolas Marcq 2020-05-29 08:06:08 UTC
Hi,

The mirror still contains only the 4.4.3 image:
https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.4/latest/

Is there a way to specify the image version to pull during the bootstrap installation? I use the latest openshift-install, 4.4.5, but it seems that it is the OVA that actually points to the installed OpenShift version.

Thanks.

Comment 17 Sam Batschelet 2020-05-29 13:55:49 UTC
>Hi,

>The mirror still contains only the 4.4.3 image:
>https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.4/latest/

>there is a way to precise images version to pull from the bootstrap installation? I use the lasted openshift-install 4.4.5 but it seems that is the OVA that actually point the the installed Openshift version.

>Thanks.


Thank you for the report; we are looking into this.

Comment 18 Sam Batschelet 2020-05-29 22:12:49 UTC
I spoke to the ART team, which handles these assets. They said that these images, although referencing 4.4.3, are the latest for the RHCOS dependencies. So, in short, what you are seeing is expected.

Comment 19 Nicolas Marcq 2020-06-02 09:41:54 UTC
OK thanks.

It's just that I still have the issue with the installer 4.4.5 and the RHCOS image 4.4.3.

###################################

oc describe co etcd                                                    
Name:         etcd
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-06-02T09:11:16Z
  Generation:          1
  Resource Version:    45747
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/etcd
  UID:                 d5f61fc1-5064-409a-b7ca-c7357e22e759
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-06-02T09:13:39Z
    Message:               StaticPodsDegraded: pods "etcd-localhost" not found
InstallerControllerDegraded: missing required resources: [configmaps: etcd-scripts,restore-etcd-pod, configmaps: config-1,etcd-metrics-proxy-client-ca-1,etcd-metrics-proxy-serving-ca-1,etcd-peer-client-ca-1,etcd-pod-1,etcd-serving-ca-1, secrets: etcd-all-peer-1,etcd-all-serving-1,etcd-all-serving-metrics-1]
EnvVarControllerDegraded: at least three nodes are required to have a valid configuration
RevisionControllerDegraded: configmaps "etcd-pod" not found
ScriptControllerDegraded: "configmap/etcd-pod": missing env var values

###################################
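
To make the InstallerControllerDegraded line in the output above easier to read, the bracketed list can be split into resource kinds and names. A minimal sketch, assuming the message always has the `missing required resources: [kind: a,b, kind: c]` shape seen here; the `missing_resources` helper is hypothetical and not part of `oc` or the operator:

```python
import re
from collections import defaultdict


def missing_resources(message):
    """Map each resource kind to the names listed as missing in the message."""
    match = re.search(r"missing required resources: \[([^\]]*)\]", message)
    if match is None:
        return {}
    found = defaultdict(list)
    # Groups are separated by ", "; names inside a group by "," without a space.
    for group in match.group(1).split(", "):
        kind, _, names = group.partition(": ")
        found[kind].extend(name for name in names.split(",") if name)
    return dict(found)


# The InstallerControllerDegraded line copied from the output above.
message = (
    "InstallerControllerDegraded: missing required resources: "
    "[configmaps: etcd-scripts,restore-etcd-pod, "
    "configmaps: config-1,etcd-metrics-proxy-client-ca-1,"
    "etcd-metrics-proxy-serving-ca-1,etcd-peer-client-ca-1,etcd-pod-1,"
    "etcd-serving-ca-1, "
    "secrets: etcd-all-peer-1,etcd-all-serving-1,etcd-all-serving-metrics-1]"
)

for kind, names in missing_resources(message).items():
    print(kind, len(names), names)
```

Repeated kinds, such as the two `configmaps` groups here, are merged into one list.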