Bug 1702829 - [upgrade] clusteroperator/kube-scheduler changed Degraded to True: StaticPodsDegradedError... container="scheduler" is terminated
Summary: [upgrade] clusteroperator/kube-scheduler changed Degraded to True: StaticPods...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.4.0
Assignee: Mike Dame
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-24 21:33 UTC by W. Trevor King
Modified: 2020-05-04 11:13 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:12:48 UTC
Target Upstream Version:


Links:
Red Hat Product Errata RHBA-2020:0581 (last updated 2020-05-04 11:13:13 UTC)

Description W. Trevor King 2019-04-24 21:33:27 UTC
Description of problem:

From [1]:

  $ curl -s https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/706 | grep 'clusteroperator/kube-scheduler changed Degraded to True' | head -n1 | sed 's|\\n|\n|g'
  Apr 24 19:25:24.484 E clusteroperator/kube-scheduler changed Degraded to True: StaticPodsDegradedError: StaticPodsDegraded: nodes/ip-10-0-140-230.ec2.internal pods/openshift-kube-scheduler-ip-10-0-140-230.ec2.internal container="scheduler" is not ready
  StaticPodsDegraded: nodes/ip-10-0-140-230.ec2.internal pods/openshift-kube-scheduler-ip-10-0-140-230.ec2.internal container="scheduler" is terminated: "Error" - "/localhost:6443/api/v1/namespaces/openshift-monitoring/pods/kube-state-metrics-5cb588685f-696cx: dial tcp [::1]:6443: connect: connection refused; retrying...
  E0424 19:19:52.196464       1 factory.go:1570] Error getting pod e2e-tests-sig-apps-replicaset-upgrade-9jwhx/rs-2dvsn for retry: Get https://localhost:6443/api/v1/namespaces/e2e-tests-sig-apps-replicaset-upgrade-9jwhx/pods/rs-2dvsn: dial tcp [::1]:6443: connect: connection refused; retrying...
  E0424 19:19:52.196846       1 factory.go:1570] Error getting pod openshift-monitoring/grafana-6c56d45755-zjslh for retry: Get https://localhost:6443/api/v1/namespaces/openshift-monitoring/pods/grafana-6c56d45755-zjslh: dial tcp [::1]:6443: connect: connection refused; retrying...
  E0424 19:19:52.250882       1 factory.go:1570] Error getting pod openshift-console/downloads-8df7b68d5-gkllb for retry: Get https://localhost:6443/api/v1/namespaces/openshift-console/pods/downloads-8df7b68d5-gkllb: dial tcp [::1]:6443: connect: connection refused; retrying...
  E0424 19:19:52.264865       1 factory.go:1570] Error getting pod openshift-image-registry/cluster-image-registry-operator-f5d964df5-6jtcz for retry: Get https://localhost:6443/api/v1/namespaces/openshift-image-registry/pods/cluster-image-registry-operator-f5d964df5-6jtcz: dial tcp [::1]:6443: connect: connection refused; retrying...
  E0424 19:19:52.458056       1 factory.go:1570] Error getting pod openshift-dns-operator/dns-operator-54b4d748bf-gx4dw for retry: Get https://localhost:6443/api/v1/namespaces/openshift-dns-operator/pods/dns-operator-54b4d748bf-gx4dw: dial tcp [::1]:6443: connect: connection refused; retrying...
  E0424 19:19:52.470869       1 factory.go:1570] Error getting pod openshift-monitoring/cluster-monitoring-operator-56cd5488d8-6p44h for retry: Get https://localhost:6443/api/v1/namespaces/openshift-monitoring/pods/cluster-monitoring-operator-56cd5488d8-6p44h: dial tcp [::1]:6443: connect: connection refused; retrying...
  I0424 19:19:52.491125       1 secure_serving.go:180] Stopped listening on [::]:10251
  "
  StaticPodsDegraded: nodes/ip-10-0-175-155.ec2.internal pods/openshift-kube-scheduler-ip-10-0-175-155.ec2.internal container="scheduler" is not ready

Michal suspects this happens while the local Kubernetes API server is upgrading, and that we want the local scheduler to release its leadership when that happens.  But crashing a Pod is a somewhat noisy way to hand off, and setting your ClusterOperator Degraded is not something that should happen as part of a vanilla upgrade.

Can we only complain if we go more than $minutes without a backup scheduler?  I don't know whether the underlying operator libraries expose "you're the leader, and there were $x other Pods participating in the last election" information to their callers.  Or we can solve this another way, as long as it doesn't involve going Degraded during each upgrade ;).

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/706
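The grace-period idea above ("only complain if we go more than $minutes without a backup scheduler") can be sketched as pure debouncing logic.  This is a hypothetical illustration, not the operator's actual API: the function name, signature, and 5-minute default are all assumptions.

```python
from datetime import datetime, timedelta

def should_degrade(not_ready_since, now, grace=timedelta(minutes=5)):
    """Report Degraded only if the scheduler pod has been not-ready
    longer than the grace period (hypothetical debouncing sketch;
    the 5-minute default is an assumption, not the operator's value)."""
    if not_ready_since is None:
        # Pod is ready: never Degraded.
        return False
    return now - not_ready_since > grace

t0 = datetime(2019, 4, 24, 19, 19, 52)
# A brief restart during an upgrade stays within the grace period:
assert not should_degrade(t0, t0 + timedelta(minutes=2))
# A prolonged outage crosses it and should set Degraded:
assert should_degrade(t0, t0 + timedelta(minutes=6))
```

The point is that a transient not-ready blip during a rolling upgrade would no longer flip the ClusterOperator to Degraded; only a sustained outage would.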

Comment 1 Seth Jennings 2019-04-24 21:42:11 UTC
What if we use the internal LB name `api-int` to connect to the apiserver?  Is there a reason we are connecting to the local master over localhost, other than improved latency?
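The suggestion amounts to pointing the scheduler's kubeconfig at the internal load balancer instead of the node-local apiserver, so the client survives the local apiserver's restart.  A minimal sketch of the trade-off; the helper name is hypothetical, though the `api-int.<cluster domain>` hostname pattern is the real OpenShift 4 internal endpoint:

```python
def apiserver_endpoint(cluster_domain, prefer_local=False):
    """Pick an apiserver URL.  localhost avoids a network hop but dies
    with the local apiserver during its upgrade; api-int rides the
    internal load balancer across all masters (hypothetical helper)."""
    if prefer_local:
        return "https://localhost:6443"
    return f"https://api-int.{cluster_domain}:6443"

# The errors quoted in the description come from the localhost path:
assert apiserver_endpoint("example.com", prefer_local=True) == \
    "https://localhost:6443"
# The LB path keeps working while any single master upgrades:
assert apiserver_endpoint("example.com") == \
    "https://api-int.example.com:6443"
```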

Comment 2 Seth Jennings 2019-04-24 21:43:21 UTC
Not a blocker due to low severity but could you look into this Ravi?

Comment 8 Maciej Szulik 2020-02-26 19:27:32 UTC
A lot has changed in between when this was opened and now, moving to qa for verification against the current release.

Comment 18 errata-xmlrpc 2020-05-04 11:12:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

