Bug 1702829

Summary: [upgrade] clusteroperator/kube-scheduler changed Degraded to True: StaticPodsDegradedError... container="scheduler" is terminated
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: kube-scheduler Assignee: Mike Dame <mdame>
Status: CLOSED ERRATA QA Contact: RamaKasturi <knarra>
Severity: low Docs Contact:
Priority: low    
Version: 4.1.0 CC: aos-bugs, deads, jokerman, knarra, maszulik, mdame, mfojtik, mmccomas, rgudimet
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-05-04 11:12:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description W. Trevor King 2019-04-24 21:33:27 UTC
Description of problem:

From [1]:

  $ curl -s https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/706 | grep 'clusteroperator/kube-scheduler changed Degraded to True' | head -n1 | sed 's|\\n|\n|g'
  Apr 24 19:25:24.484 E clusteroperator/kube-scheduler changed Degraded to True: StaticPodsDegradedError: StaticPodsDegraded: nodes/ip-10-0-140-230.ec2.internal pods/openshift-kube-scheduler-ip-10-0-140-230.ec2.internal container="scheduler" is not ready
  StaticPodsDegraded: nodes/ip-10-0-140-230.ec2.internal pods/openshift-kube-scheduler-ip-10-0-140-230.ec2.internal container="scheduler" is terminated: "Error" - "/localhost:6443/api/v1/namespaces/openshift-monitoring/pods/kube-state-metrics-5cb588685f-696cx: dial tcp [::1]:6443: connect: connection refused; retrying...
  E0424 19:19:52.196464       1 factory.go:1570] Error getting pod e2e-tests-sig-apps-replicaset-upgrade-9jwhx/rs-2dvsn for retry: Get https://localhost:6443/api/v1/namespaces/e2e-tests-sig-apps-replicaset-upgrade-9jwhx/pods/rs-2dvsn: dial tcp [::1]:6443: connect: connection refused; retrying...
  E0424 19:19:52.196846       1 factory.go:1570] Error getting pod openshift-monitoring/grafana-6c56d45755-zjslh for retry: Get https://localhost:6443/api/v1/namespaces/openshift-monitoring/pods/grafana-6c56d45755-zjslh: dial tcp [::1]:6443: connect: connection refused; retrying...
  E0424 19:19:52.250882       1 factory.go:1570] Error getting pod openshift-console/downloads-8df7b68d5-gkllb for retry: Get https://localhost:6443/api/v1/namespaces/openshift-console/pods/downloads-8df7b68d5-gkllb: dial tcp [::1]:6443: connect: connection refused; retrying...
  E0424 19:19:52.264865       1 factory.go:1570] Error getting pod openshift-image-registry/cluster-image-registry-operator-f5d964df5-6jtcz for retry: Get https://localhost:6443/api/v1/namespaces/openshift-image-registry/pods/cluster-image-registry-operator-f5d964df5-6jtcz: dial tcp [::1]:6443: connect: connection refused; retrying...
  E0424 19:19:52.458056       1 factory.go:1570] Error getting pod openshift-dns-operator/dns-operator-54b4d748bf-gx4dw for retry: Get https://localhost:6443/api/v1/namespaces/openshift-dns-operator/pods/dns-operator-54b4d748bf-gx4dw: dial tcp [::1]:6443: connect: connection refused; retrying...
  E0424 19:19:52.470869       1 factory.go:1570] Error getting pod openshift-monitoring/cluster-monitoring-operator-56cd5488d8-6p44h for retry: Get https://localhost:6443/api/v1/namespaces/openshift-monitoring/pods/cluster-monitoring-operator-56cd5488d8-6p44h: dial tcp [::1]:6443: connect: connection refused; retrying...
  I0424 19:19:52.491125       1 secure_serving.go:180] Stopped listening on [::]:10251
  "
  StaticPodsDegraded: nodes/ip-10-0-175-155.ec2.internal pods/openshift-kube-scheduler-ip-10-0-175-155.ec2.internal container="scheduler" is not ready

Michal feels this is probably a side effect of the local Kubernetes API server being upgraded, and that we want the local scheduler to release its leadership when that happens.  But crashing a Pod is a somewhat noisy way to hand off, and setting your ClusterOperator Degraded is not something that should happen as part of a vanilla upgrade.
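
For illustration only, a graceful hand-off via client-go's leader election with ReleaseOnCancel could look something like the sketch below.  The namespace, lock name, and timings here are assumptions for the sketch, not what the shipped scheduler actually uses (it wires up its leader election elsewhere).

  package main

  import (
      "context"
      "os"
      "time"

      "k8s.io/client-go/kubernetes"
      "k8s.io/client-go/rest"
      "k8s.io/client-go/tools/leaderelection"
      "k8s.io/client-go/tools/leaderelection/resourcelock"
      "k8s.io/klog/v2"
  )

  func main() {
      cfg, err := rest.InClusterConfig()
      if err != nil {
          klog.Fatal(err)
      }
      client := kubernetes.NewForConfigOrDie(cfg)

      id, _ := os.Hostname()
      // Namespace and lock name are assumptions for this sketch.
      lock, err := resourcelock.New(resourcelock.LeasesResourceLock,
          "openshift-kube-scheduler", "kube-scheduler",
          client.CoreV1(), client.CoordinationV1(),
          resourcelock.ResourceLockConfig{Identity: id})
      if err != nil {
          klog.Fatal(err)
      }

      ctx, cancel := context.WithCancel(context.Background())
      defer cancel() // e.g. call cancel() from a SIGTERM handler before exiting

      leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
          Lock:            lock,
          LeaseDuration:   15 * time.Second,
          RenewDeadline:   10 * time.Second,
          RetryPeriod:     2 * time.Second,
          ReleaseOnCancel: true, // give up the lease on shutdown so a standby can take over immediately
          Callbacks: leaderelection.LeaderCallbacks{
              OnStartedLeading: func(ctx context.Context) {
                  // run the scheduling loop until ctx is cancelled
                  <-ctx.Done()
              },
              OnStoppedLeading: func() {
                  klog.Info("lost leadership, shutting down")
              },
          },
      })
  }

With ReleaseOnCancel the outgoing leader deletes/hands off its lease instead of letting it expire, which is the quieter hand-off the comment above is asking for.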

Can we only complain if we go more than $minutes without a backup scheduler?  I don't know whether the underlying operator libraries expose "you're the leader, and there were $x other Pods participating in the last election" information to their callers.  Or we could solve this another way, as long as it doesn't involve going Degraded during each upgrade ;).
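
A minimal sketch of the "tolerate a short window of unreadiness before going Degraded" idea, assuming the operator can look at the scheduler pod's Ready condition.  This is not how the actual library-go static-pod controller is written; it just illustrates the grace-window check.

  package operator

  import (
      "time"

      corev1 "k8s.io/api/core/v1"
  )

  // unreadyLongerThan reports whether the scheduler static pod has been
  // unready for longer than the tolerated grace window.  The operator
  // would only flip Degraded to True when this returns true, so a brief
  // restart during an apiserver rollout stays quiet.
  func unreadyLongerThan(pod *corev1.Pod, grace time.Duration, now time.Time) bool {
      for _, cond := range pod.Status.Conditions {
          if cond.Type != corev1.PodReady {
              continue
          }
          if cond.Status == corev1.ConditionTrue {
              return false // pod is ready, nothing to report
          }
          return now.Sub(cond.LastTransitionTime.Time) > grace
      }
      return false // no Ready condition recorded yet; stay within the grace window
  }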

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/706

Comment 1 Seth Jennings 2019-04-24 21:42:11 UTC
What if we use the internal LB name `api-int` to connect to the apiserver?  Is there a reason we are connecting to the local master over localhost, other than maybe improved latency?
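
Sketching what that would mean for the client config: build the scheduler's client against the internal load balancer endpoint instead of the node-local loopback.  The host name and kubeconfig path below are made up for illustration, not the files the operator actually lays down.

  package main

  import (
      "k8s.io/client-go/kubernetes"
      "k8s.io/client-go/tools/clientcmd"
      "k8s.io/klog/v2"
  )

  func main() {
      // Override the kubeconfig's server with the internal LB (api-int)
      // rather than https://localhost:6443, so a local apiserver restart
      // does not take the scheduler's connection down with it.
      cfg, err := clientcmd.BuildConfigFromFlags(
          "https://api-int.example-cluster.example.com:6443", // illustrative host
          "/etc/kubernetes/scheduler.kubeconfig",              // illustrative path
      )
      if err != nil {
          klog.Fatal(err)
      }
      client := kubernetes.NewForConfigOrDie(cfg)
      _ = client // hand the client to the scheduler as usual
  }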

Comment 2 Seth Jennings 2019-04-24 21:43:21 UTC
Not a blocker due to low severity, but could you look into this, Ravi?

Comment 8 Maciej Szulik 2020-02-26 19:27:32 UTC
A lot has changed between when this was opened and now; moving to QA for verification against the current release.

Comment 18 errata-xmlrpc 2020-05-04 11:12:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581