Description of problem:
UPI installed an OCP 4.2 cluster on GCP and upgraded it to a 4.3 nightly successfully, but after upgrading from 4.3 to a 4.4 nightly the kube-apiserver operator is stuck Progressing and repeatedly reports NodeInstallerDegraded errors.

$ oc get co
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver   4.4.0-0.nightly-2021-02-06-081501   True        True          False      3h7m
...

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2021-02-06-081501

How reproducible:
Always

Steps to Reproduce:
1. UPI install a 4.2 cluster on GCP.
2. Upgrade to 4.3, then to 4.4.

Actual results:
$ oc get no -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
leap090550-02082152-m-0.c.openshift-qe.internal Ready master 3h12m v1.17.1+5ef953f 10.0.0.4 Red Hat Enterprise Linux CoreOS 44.82.202102032134-0 (Ootpa) 4.18.0-193.41.1.el8_2.x86_64 cri-o://1.17.5-11.rhaos4.4.git7f979af.el8
leap090550-02082152-m-1.c.openshift-qe.internal Ready master 3h12m v1.17.1+5ef953f 10.0.0.5 Red Hat Enterprise Linux CoreOS 44.82.202102032134-0 (Ootpa) 4.18.0-193.41.1.el8_2.x86_64 cri-o://1.17.5-11.rhaos4.4.git7f979af.el8
leap090550-02082152-m-2.c.openshift-qe.internal Ready master 3h12m v1.17.1+5ef953f 10.0.0.6 Red Hat Enterprise Linux CoreOS 44.82.202102032134-0 (Ootpa) 4.18.0-193.41.1.el8_2.x86_64 cri-o://1.17.5-11.rhaos4.4.git7f979af.el8
leap090550-02082152-w-a-0.c.openshift-qe.internal Ready worker 177m v1.17.1+5ef953f 10.0.32.2 Red Hat Enterprise Linux CoreOS 44.82.202102032134-0 (Ootpa) 4.18.0-193.41.1.el8_2.x86_64 cri-o://1.17.5-11.rhaos4.4.git7f979af.el8
leap090550-02082152-w-a-l-0 Ready worker 123m v1.16.2+853223d 10.0.32.6 Red Hat Enterprise Linux Server 7.7 (Maipo) 3.10.0-1160.15.2.el7.x86_64 cri-o://1.16.6-18.rhaos4.3.git538d861.el7
leap090550-02082152-w-a-l-1 Ready worker 123m v1.16.2+853223d 10.0.32.5 Red Hat Enterprise Linux Server 7.7 (Maipo) 3.10.0-1160.15.2.el7.x86_64 cri-o://1.16.6-18.rhaos4.3.git538d861.el7
leap090550-02082152-w-b-1.c.openshift-qe.internal Ready worker 176m v1.17.1+5ef953f 10.0.32.3 Red Hat Enterprise Linux CoreOS 44.82.202102032134-0 (Ootpa) 4.18.0-193.41.1.el8_2.x86_64 cri-o://1.17.5-11.rhaos4.4.git7f979af.el8
leap090550-02082152-w-c-2.c.openshift-qe.internal Ready worker 176m v1.17.1+5ef953f 10.0.32.4 Red Hat Enterprise Linux CoreOS 44.82.202102032134-0 (Ootpa) 4.18.0-193.41.1.el8_2.x86_64 cri-o://1.17.5-11.rhaos4.4.git7f979af.el8

$ oc get co
authentication 4.4.0-0.nightly-2021-02-06-081501 True False False 169m
cloud-credential 4.4.0-0.nightly-2021-02-06-081501 True False False 3h10m
cluster-autoscaler 4.4.0-0.nightly-2021-02-06-081501 True False False 3h1m
console 4.4.0-0.nightly-2021-02-06-081501 True False False 15m
csi-snapshot-controller 4.4.0-0.nightly-2021-02-06-081501 True False False 42m
dns 4.4.0-0.nightly-2021-02-06-081501 True False False 3h9m
etcd 4.4.0-0.nightly-2021-02-06-081501 True False False 51m
image-registry 4.4.0-0.nightly-2021-02-06-081501 True False False 14m
ingress 4.4.0-0.nightly-2021-02-06-081501 True False False 7m48s
insights 4.4.0-0.nightly-2021-02-06-081501 True False False 3h10m
kube-apiserver 4.4.0-0.nightly-2021-02-06-081501 True True False 3h7m
kube-controller-manager 4.4.0-0.nightly-2021-02-06-081501 True False False 48m
kube-scheduler 4.4.0-0.nightly-2021-02-06-081501 True False False 48m
kube-storage-version-migrator 4.4.0-0.nightly-2021-02-06-081501 True False False 9m33s
machine-api 4.4.0-0.nightly-2021-02-06-081501 True False False 3h10m
machine-config 4.4.0-0.nightly-2021-02-06-081501 True False False 59m
marketplace 4.4.0-0.nightly-2021-02-06-081501 True False False 11m
monitoring 4.4.0-0.nightly-2021-02-06-081501 True False False 14m
network 4.4.0-0.nightly-2021-02-06-081501 True False False 3h9m
node-tuning 4.4.0-0.nightly-2021-02-06-081501 True False False 37m
openshift-apiserver 4.4.0-0.nightly-2021-02-06-081501 True False False 10m
openshift-controller-manager 4.4.0-0.nightly-2021-02-06-081501 True False False 3h7m
openshift-samples 4.4.0-0.nightly-2021-02-06-081501 True False False 42m
operator-lifecycle-manager 4.4.0-0.nightly-2021-02-06-081501 True False False 3h3m
operator-lifecycle-manager-catalog 4.4.0-0.nightly-2021-02-06-081501 True False False 3h3m
operator-lifecycle-manager-packageserver 4.4.0-0.nightly-2021-02-06-081501 True False False 11m
service-ca 4.4.0-0.nightly-2021-02-06-081501 True False False 3h10m
service-catalog-apiserver 4.4.0-0.nightly-2021-02-06-081501 True False False 7m47s
service-catalog-controller-manager 4.4.0-0.nightly-2021-02-06-081501 True False False 51m
storage 4.4.0-0.nightly-2021-02-06-081501 True False False 42m

$ oc describe co/kube-apiserver
Name:         kube-apiserver
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-02-08T22:09:47Z
  Generation:          1
  Resource Version:    118542
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/kube-apiserver
  UID:                 5b578b0b-6a5a-11eb-b551-42010a000003
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-02-09T01:11:52Z
    Message:               NodeInstallerDegraded: 1 nodes are failing on revision 12:
NodeInstallerDegraded: 
StaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container="kube-apiserver" is not ready
StaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container="kube-apiserver-cert-regeneration-controller" is not ready
StaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container="kube-apiserver-cert-syncer" is not ready
StaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container="kube-apiserver-insecure-readyz" is not ready
NodeControllerDegraded: All master nodes are ready
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2021-02-09T01:05:50Z
    Message:               NodeInstallerProgressing: 1 nodes are at revision 11; 2 nodes are at revision 12; 0 nodes have achieved new revision 13
    Reason:                NodeInstaller
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-02-08T22:12:14Z
    Message:               StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 11; 2 nodes are at revision 12; 0 nodes have achieved new revision 13
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2021-02-08T22:09:47Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  kubeapiservers
    Group:     apiextensions.k8s.io
    Name:      
    Resource:  customresourcedefinitions
    Group:     
    Name:      openshift-config
    Resource:  namespaces
    Group:     
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:     
    Name:      openshift-kube-apiserver-operator
    Resource:  namespaces
    Group:     
    Name:      openshift-kube-apiserver
    Resource:  namespaces
  Versions:
    Name:     raw-internal
    Version:  4.4.0-0.nightly-2021-02-06-081501
    Name:     operator
    Version:  4.4.0-0.nightly-2021-02-06-081501
    Name:     kube-apiserver
    Version:  1.17.1
Events:  <none>

Checking the kube-apiserver operator logs from the must-gather:
...
2021-02-09T01:10:30.621125506Z I0209 01:10:30.621046 1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"14aaa3a9-6a5a-11eb-b551-42010a000003", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'ObserveStorageFailed' Required openshift-etcd/host-etcd-2 endpoint not found
2021-02-09T01:10:30.676780533Z E0209 01:10:30.676721 1 config_observer_controller.go:180] key failed with : endpoints/host-etcd-2.openshift-etcd: not found
2021-02-09T01:10:30.723510766Z I0209 01:10:30.723417 1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"14aaa3a9-6a5a-11eb-b551-42010a000003", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeInstallerDegraded: 1 nodes are failing on revision 11:\nNodeInstallerDegraded: \nStaticPodsDegraded: nodes/leap090550-02082152-m-1.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-1.c.openshift-qe.internal container=\"kube-apiserver\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-1.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-1.c.openshift-qe.internal container=\"kube-apiserver-cert-regeneration-controller\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-1.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-1.c.openshift-qe.internal container=\"kube-apiserver-cert-syncer\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-1.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-1.c.openshift-qe.internal container=\"kube-apiserver-insecure-readyz\" is not ready\nNodeControllerDegraded: All master nodes are ready" to "NodeInstallerDegraded: 1 nodes are failing on revision 11:\nNodeInstallerDegraded: \nStaticPodsDegraded: nodes/leap090550-02082152-m-1.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-1.c.openshift-qe.internal container=\"kube-apiserver\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-1.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-1.c.openshift-qe.internal container=\"kube-apiserver-cert-regeneration-controller\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-1.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-1.c.openshift-qe.internal container=\"kube-apiserver-cert-syncer\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-1.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-1.c.openshift-qe.internal container=\"kube-apiserver-insecure-readyz\" is not ready\nNodeControllerDegraded: All master nodes are ready\nConfigObservationDegraded: endpoints/host-etcd-2.openshift-etcd: not found"
2021-02-09T01:10:30.767716706Z I0209 01:10:30.767618 1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"14aaa3a9-6a5a-11eb-b551-42010a000003", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeInstallerDegraded: 1 nodes are failing on revision 11:\nNodeInstallerDegraded: \nStaticPodsDegraded: nodes/leap090550-02082152-m-1.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-1.c.openshift-qe.internal container=\"kube-apiserver\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-1.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-1.c.openshift-qe.internal container=\"kube-apiserver-cert-regeneration-controller\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-1.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-1.c.openshift-qe.internal container=\"kube-apiserver-cert-syncer\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-1.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-1.c.openshift-qe.internal container=\"kube-apiserver-insecure-readyz\" is not ready\nNodeControllerDegraded: All master nodes are ready\nConfigObservationDegraded: endpoints/host-etcd-2.openshift-etcd: not found" to "NodeInstallerDegraded: 1 nodes are failing on revision 11:\nNodeInstallerDegraded: \nStaticPodsDegraded: nodes/leap090550-02082152-m-1.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-1.c.openshift-qe.internal container=\"kube-apiserver\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-1.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-1.c.openshift-qe.internal container=\"kube-apiserver-cert-regeneration-controller\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-1.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-1.c.openshift-qe.internal container=\"kube-apiserver-cert-syncer\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-1.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-1.c.openshift-qe.internal container=\"kube-apiserver-insecure-readyz\" is not ready\nNodeControllerDegraded: All master nodes are ready"
...
2021-02-09T01:17:31.787792886Z I0209 01:17:31.787685 1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"14aaa3a9-6a5a-11eb-b551-42010a000003", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "StaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container=\"kube-apiserver\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container=\"kube-apiserver-cert-regeneration-controller\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container=\"kube-apiserver-cert-syncer\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container=\"kube-apiserver-insecure-readyz\" is not ready\nNodeControllerDegraded: All master nodes are ready" to "StaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container=\"kube-apiserver\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container=\"kube-apiserver-cert-regeneration-controller\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container=\"kube-apiserver-cert-syncer\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container=\"kube-apiserver-insecure-readyz\" is not ready\nNodeControllerDegraded: All master nodes are ready\nConfigObservationDegraded: endpoints/host-etcd-2.openshift-etcd: not found"
2021-02-09T01:17:31.959257415Z I0209 01:17:31.959195 1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"14aaa3a9-6a5a-11eb-b551-42010a000003", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "StaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container=\"kube-apiserver\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container=\"kube-apiserver-cert-regeneration-controller\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container=\"kube-apiserver-cert-syncer\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container=\"kube-apiserver-insecure-readyz\" is not ready\nNodeControllerDegraded: All master nodes are ready\nConfigObservationDegraded: endpoints/host-etcd-2.openshift-etcd: not found" to "StaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container=\"kube-apiserver\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container=\"kube-apiserver-cert-regeneration-controller\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container=\"kube-apiserver-cert-syncer\" is not ready\nStaticPodsDegraded: nodes/leap090550-02082152-m-0.c.openshift-qe.internal pods/kube-apiserver-leap090550-02082152-m-0.c.openshift-qe.internal container=\"kube-apiserver-insecure-readyz\" is not ready\nNodeControllerDegraded: All master nodes are ready"

And the openshift-etcd-operator logs show matching errors:

$ grep 'endpoints "host-etcd-2" not found' current.log
2021-02-09T01:10:30.666366059Z E0209 01:10:30.666318 1 envvarcontroller.go:197] key failed with : endpoints "host-etcd-2" not found
2021-02-09T01:10:30.82976445Z E0209 01:10:30.814643 1 host_endpoints_controller.go:283] key failed with : endpoints "host-etcd-2" not found
2021-02-09T01:17:31.484422134Z E0209 01:17:31.484360 1 envvarcontroller.go:197] key failed with : endpoints "host-etcd-2" not found
2021-02-09T01:17:31.525872016Z E0209 01:17:31.525776 1 clustermembercontroller.go:119] key failed with : endpoints "host-etcd-2" not found
2021-02-09T01:17:31.57601325Z E0209 01:17:31.574202 1 etcdmemberipmigrator.go:363] key failed with : endpoints "host-etcd-2" not found
2021-02-09T01:17:31.686712619Z E0209 01:17:31.678058 1 host_endpoints_controller.go:283] key failed with : endpoints "host-etcd-2" not found

Expected results:
The kube-apiserver cluster operator should report the correct status.

Additional info:
We have been hitting this problem consistently in our 4.3 -> 4.4 upgrade tests; the kube-apiserver operator's Degraded messages did eventually clear on their own.
Looks like the operator started at 01:08:11:

> 2021-02-09T01:08:11.431099654Z I0209 01:08:11.430946 1 cmd.go:196] Using service-serving-cert provided certificates

created the endpoint resource at 01:10:30:

> namespaces/openshift-etcd-operator/pods/etcd-operator-7bccfd8865-6kjh5/operator/operator/logs/current.log:1540:2021-02-09T01:10:30.644284521Z I0209 01:10:30.643124 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"e9d55a70-c20a-40ac-9b2c-a973fdee0bab", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'EndpointsCreated' Created endpoints/host-etcd-2 -n openshift-etcd because it was missing

and created it again at 01:17:31:

> namespaces/openshift-etcd-operator/pods/etcd-operator-7bccfd8865-6kjh5/operator/operator/logs/current.log:3654:2021-02-09T01:17:31.509572929Z I0209 01:17:31.500276 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"e9d55a70-c20a-40ac-9b2c-a973fdee0bab", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'EndpointsCreated' Created endpoints/host-etcd-2 -n openshift-etcd because it was missing

There is a known issue with endpoint resources being removed automatically when they have no backing service [2],[3]. We resolved this in later versions (4.5+) by moving from the endpoint resource to a configmap.

```
- apiVersion: v1
  kind: Endpoints
  metadata:
    creationTimestamp: "2021-02-09T01:22:58Z"
    name: host-etcd-2
    namespace: openshift-etcd
[..]
```

As 4.4 goes EOL at 4.7 GA, I am not sure this will be resolved in 4.4.

[1] https://github.com/openshift/cluster-etcd-operator/blob/release-4.4/pkg/operator/hostendpointscontroller2/host_endpoints_controller.go#L165
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1836927
[3] https://github.com/openshift/cluster-etcd-operator/pull/354
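The delete/recreate loop described above can be sketched with a small, purely illustrative Python model (function and variable names are invented for this sketch; the real logic is the Go controller code linked in [1] and [3]):

```python
# Hypothetical model of the race: a cleanup pass removes any Endpoints
# object that has no Service with the same namespace/name, while the
# etcd operator recreates its required host-etcd-2 Endpoints whenever
# it is missing ("Created endpoints/host-etcd-2 ... because it was
# missing" in the events above).

def prune_orphan_endpoints(endpoints, services):
    """Keep only Endpoints objects backed by a same-named Service."""
    return {ep for ep in endpoints if ep in services}

def operator_reconcile(endpoints, required):
    """Re-add the operator's required Endpoints object if missing."""
    return endpoints | {required}

HOST_ETCD = ("openshift-etcd", "host-etcd-2")
services = set()           # host-etcd-2 has no backing Service in 4.4
endpoints = {HOST_ETCD}

history = []
for _ in range(3):
    endpoints = prune_orphan_endpoints(endpoints, services)
    history.append(HOST_ETCD in endpoints)   # gone: readers see "not found"
    endpoints = operator_reconcile(endpoints, HOST_ETCD)
    history.append(HOST_ETCD in endpoints)   # recreated again
```

Any reader that fetches the object between a prune and the next reconcile (here, the kube-apiserver operator's config observer and the etcd operator's own controllers) fails with `endpoints "host-etcd-2" not found`, which matches the flapping Degraded messages above. Storing the same data in a ConfigMap, as done in 4.5+, takes it out of this cleanup path entirely.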
4.4 is now EOL