Bug 1775922

Summary: InstallerControllerDegraded: missing required resources: [configmaps: aggregator-client-ca,client-ca...
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Node Assignee: Ryan Phillips <rphillips>
Status: CLOSED DUPLICATE QA Contact: Sunil Choudhary <schoudha>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.3.0 CC: aos-bugs, deads, hongkliu, jokerman, mfojtik, yinzhou
Target Milestone: --- Keywords: Reopened
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-04-07 18:07:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description W. Trevor King 2019-11-23 15:51:38 UTC
In a 4.3->4.3 rollback release informer [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/128/build-log.txt | grep 'changed Degraded to\|Degraded message changed' | sort | head -3
Nov 23 12:04:06.872 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: All master node(s) are ready" to "NodeControllerDegraded: All master node(s) are ready\nInstallerControllerDegraded: missing required resources: [configmaps: aggregator-client-ca,client-ca, configmaps: config-6,etcd-serving-ca-6,kube-apiserver-cert-syncer-kubeconfig-6,kube-apiserver-pod-6,kubelet-serving-ca-6,sa-token-signing-certs-6]"
Nov 23 12:04:16.060 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: All master node(s) are ready\nInstallerControllerDegraded: missing required resources: [configmaps: aggregator-client-ca,client-ca, configmaps: config-6,etcd-serving-ca-6,kube-apiserver-cert-syncer-kubeconfig-6,kube-apiserver-pod-6,kubelet-serving-ca-6,sa-token-signing-certs-6]" to "NodeControllerDegraded: All master node(s) are ready"
Nov 23 12:04:59.860 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: All master node(s) are ready" to "NodeControllerDegraded: All master node(s) are ready\nStaticPodsDegraded: nodes/ip-10-0-157-200.ec2.internal pods/kube-apiserver-ip-10-0-157-200.ec2.internal container=\"kube-apiserver-7\" is not ready\nStaticPodsDegraded: nodes/ip-10-0-157-200.ec2.internal pods/kube-apiserver-ip-10-0-157-200.ec2.internal container=\"kube-apiserver-cert-syncer-7\" is not ready\nStaticPodsDegraded: nodes/ip-10-0-157-200.ec2.internal pods/kube-apiserver-ip-10-0-157-200.ec2.internal container=\"kube-apiserver-insecure-readyz-7\" is not ready"

Possibly related to the fixed-in-4.2 bug 1749478.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3/128

Comment 2 Maciej Szulik 2019-12-02 15:34:14 UTC
This is not a bug, but exposing the information from the resource syncer for ease of debugging. 
If you carefully examine the events for any particular operator, you'll notice that
the missing resources are copied back from the source within a few seconds.
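The "few seconds" can be checked against the first two events quoted in the description. The following is a minimal sketch, not part of any OpenShift tooling: the function name `degraded_windows` and the `EVENTS` list are hypothetical, and the timestamps are copied by hand from the build log excerpt above.

```python
from datetime import datetime

# Timestamps copied from the first two events in the description. The log
# carries no year, so strptime defaults it to 1900; that does not affect deltas.
EVENTS = [
    ("Nov 23 12:04:06.872", True),   # InstallerControllerDegraded message appears
    ("Nov 23 12:04:16.060", False),  # message is cleared again
]


def degraded_windows(events):
    """Return the length in seconds of each appeared-to-cleared window."""
    windows = []
    start = None
    for stamp, present in events:
        when = datetime.strptime(stamp, "%b %d %H:%M:%S.%f")
        if present and start is None:
            start = when
        elif not present and start is not None:
            windows.append((when - start).total_seconds())
            start = None
    return windows


print(degraded_windows(EVENTS))
```

For the two events above this gives a single window of roughly nine seconds.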

Comment 3 W. Trevor King 2019-12-02 17:27:53 UTC
> This is not a bug, but exposing the information from the resource syncer for ease of debugging. 

But this is setting clusteroperator/kube-apiserver Degraded=True for a few seconds (via [1]), which violates the Degraded API [2]:

  Degraded indicates that the operator's current state does not match its desired state over a period of time resulting in a lower quality of service.  The period of time may vary by component, but a Degraded state represents persistent observation of a condition.  As a result, a component should not oscillate in and out of Degraded state.

So the oscillation is probably not causing the test failures in these cases, but it is a bug in the operator to go Degraded=True when the quality of service is not impacted (as seems to be the case here).

[1]: https://github.com/openshift/library-go/blob/fc96c897f3b00d5ac27bc3a0017a0bf75aa790fc/pkg/operator/staticpod/controller/installer/installer_controller.go#L815-L837
[2]: https://github.com/openshift/api/blob/2ea89d203c53704f1fcfeb55c13ededab14fd020/config/v1/types_cluster_operator.go#L151-L168

Comment 4 Maciej Szulik 2019-12-05 21:26:52 UTC
This is not going to impact 4.3. Moving to 4.4 for now, until we agree on how Degraded should be set.

Comment 5 David Eads 2019-12-06 13:01:05 UTC
The snippet above definitely looks degraded to me.  The current state does not match the desired state: an entire kube-apiserver is serving *with the wrong configuration*.  This looks like a very clear-cut case of degraded.  We can talk about momentum times for particular sub-conditions, but degraded is accurate for that condition.

Comment 6 W. Trevor King 2019-12-06 15:45:45 UTC
> The snippet above definitely looks degraded to me.

"There's no QoS impact.  We'll adjust the operator logic to not report Degraded=True for this" is an internally-consistent ruling, and "There is a QoS impact, so Degraded=True is appropriate.  We'll adjust... something... to avoid this situation" is another internally-consistent ruling.  Sounds like we want the latter.  What do we need to change so we don't hit this and go degraded?

Comment 7 David Eads 2019-12-06 16:30:40 UTC
You're dealing with different levels of status, some more responsive than others. There *is* a quality of service impact.  The operator is *not* running as desired.  This *is* expected in many cases (rolling upgrades, configuration changes, etc): there are server outages when this happens, you aren't as HA as you desire, and different servers will give different query results.  The pertinent question is when a cluster-admin needs to know, not when you are degraded.

You're degraded immediately; when you let a cluster admin know is a different choice.  We're discussing the latter, which is open to negotiation. We are clearly degraded.

Comment 8 W. Trevor King 2019-12-06 16:52:02 UTC
> ...rolling upgrades, configuration changes...

These are not QoS degradations.  They mean you are Progressing=True, moving from one healthy state to another in a highly available way.  From [1]:

  A component may be Progressing but not Degraded because the transition from one state to another does not persist over a long enough period to report Degraded.  A service should not report Degraded during the course of a normal upgrade.

Different servers giving different results is fine; that's just like cache freshness, and clients need to be able to handle slightly stale results.  Or the handoff needs to become more HA to avoid staleness (e.g. spin up a new Deployment and cut the Service over atomically once it's up).

[1]: https://github.com/openshift/api/blob/2ea89d203c53704f1fcfeb55c13ededab14fd020/config/v1/types_cluster_operator.go#L159-L162

Comment 9 David Eads 2019-12-06 16:55:45 UTC
Being down and currently unable to get back up is degraded.  I don't see another way to see it just because another server is still running.  If you're asking for a high level summary to hide that information for a period of time, we can do that (in fact we do it already for one minute), but it is very definitely degraded.
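The idea of hiding a degraded sub-condition for a period of time can be sketched as follows. This is an illustration only, not the operator's actual implementation: the class name `InertialDegraded`, its `observe` method, and the injected clock value are all hypothetical, and the 60-second default simply mirrors the "one minute" mentioned above.

```python
class InertialDegraded:
    """Report Degraded only after the raw condition has held continuously.

    Hypothetical sketch: names and the 60-second default are assumptions.
    """

    def __init__(self, inertia_seconds=60.0):
        self.inertia_seconds = inertia_seconds
        self._bad_since = None  # time at which the raw condition first turned bad

    def observe(self, now, raw_degraded):
        """Record one observation at time `now` (seconds) and return the
        summarized Degraded value that would be shown to a cluster admin."""
        if not raw_degraded:
            # Any healthy observation resets the window.
            self._bad_since = None
            return False
        if self._bad_since is None:
            self._bad_since = now
        return now - self._bad_since >= self.inertia_seconds
```

With this shape, a raw condition that clears after a few seconds never surfaces in the summary, while one that persists past the inertia period does.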

Comment 10 Maciej Szulik 2020-01-31 11:37:15 UTC
A lot has changed in the 4.4 time frame around status information. Moving to QA for verification.

Comment 12 zhou ying 2020-02-04 03:08:06 UTC
I can't see the error in the latest CI jobs; neither command below prints any matching lines:

[root@dhcp-140-138 roottest]# curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.4/40/build-log.txt | grep 'changed Degraded to\|Degraded message changed' | sort | head -3

[root@dhcp-140-138 roottest]# curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.4/41/build-log.txt | grep 'changed Degraded to\|Degraded message changed' | sort | head -3
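The same filter can be sketched in Python for use in a script. This is a rough equivalent of the grep, sort, and head pipeline above, not an existing tool: the function name `first_degraded_events` is hypothetical, and fetching the build log text is left to the caller.

```python
import re

# Same two alternatives as the grep pattern used in the commands above.
DEGRADED_EVENT = re.compile(r"changed Degraded to|Degraded message changed")


def first_degraded_events(log_text, limit=3):
    """Return up to `limit` matching lines, sorted, like `grep | sort | head`."""
    matches = [line for line in log_text.splitlines() if DEGRADED_EVENT.search(line)]
    # Note: sorted() orders by code point, whereas sort(1) follows the locale.
    return sorted(matches)[:limit]
```

An empty list corresponds to the empty output seen for both jobs.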

Comment 14 zhou ying 2020-04-02 01:53:43 UTC
@Hongkai Liu :

I'm OK with reopening it.

Comment 15 Maciej Szulik 2020-04-03 09:16:05 UTC
This one looks node-related, since it says:

NodeControllerDegraded: The master nodes not ready: node \"ip-10-0-135-11.us-west-2.compute.internal\" not ready since 2020-03-27 14:09:08 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"

Sending it over to the node team to figure out why the kubelet stopped posting its status.

Comment 16 Ryan Phillips 2020-04-07 18:02:42 UTC
This should be fixed with https://bugzilla.redhat.com/show_bug.cgi?id=1821341 and the pending 4.3 PR https://github.com/openshift/origin/pull/24841

Comment 17 Ryan Phillips 2020-04-07 18:07:33 UTC

*** This bug has been marked as a duplicate of bug 1821341 ***