Bug 1874597 - Static pod installer controller deadlocks with non-existing installer pod, WAS: kube-apiserver of cluster operator always reports incorrect status due to PLEG error
Summary: Static pod installer controller deadlocks with non-existing installer pod, WAS: kube-apiserver of cluster operator always reports incorrect status due to PLEG error
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Duplicates: 1817419
Depends On: 1858763
Blocks: 1876484 1822016
 
Reported: 2020-09-01 17:37 UTC by Scott Dodson
Modified: 2020-09-22 12:44 UTC
CC: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1858763
Clones: 1876484
Environment:
Last Closed:
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 434 None closed Bug 1874597: static-pod-installer: remove deadlock by recreating installers that disappear 2020-09-21 13:55:23 UTC
Github openshift cluster-kube-apiserver-operator pull 942 None closed Bug 1874597: static-pod-installer: remove deadlock by recreating inst… 2020-09-21 13:55:26 UTC
Github openshift cluster-kube-controller-manager-operator pull 445 None closed [release-4.5] Bug 1874597: static-pod-installer: remove deadlock by recreating installers that disappear 2020-09-21 13:55:23 UTC
Github openshift cluster-kube-scheduler-operator pull 278 None closed [release-4.5] Bug 1874597: static-pod-installer: remove deadlock by recreating installers that disappear 2020-09-21 13:55:23 UTC
Github openshift library-go pull 879 None closed [release-4.5] Bug 1874597: static-pod-installer: remove deadlock by recreating installers that disappear 2020-09-21 13:55:23 UTC

Description Scott Dodson 2020-09-01 17:37:27 UTC
+++ This bug was initially created as a clone of Bug #1858763 +++

Description of problem:
UPI install of a disconnected cluster on vSphere 6.7. Checking the cluster operators:
$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
...
kube-apiserver                             4.5.2     True        True          True       124m


Version-Release number of selected component (if applicable):
4.5.2-x86_64

How reproducible:
Always

Steps to Reproduce:
UPI install of a disconnected cluster on vSphere 6.7

Actual results:
$ oc get node
NAME              STATUS   ROLES    AGE     VERSION
compute-0         Ready    worker   3h49m   v1.18.3+b74c5ed
compute-1         Ready    worker   3h49m   v1.18.3+b74c5ed
compute-2         Ready    worker   3h49m   v1.18.3+b74c5ed
control-plane-0   Ready    master   4h2m    v1.18.3+b74c5ed
control-plane-1   Ready    master   4h2m    v1.18.3+b74c5ed
control-plane-2   Ready    master   4h2m    v1.18.3+b74c5ed

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.2     True        False         109m    Error while reconciling 4.5.2: the cluster operator kube-apiserver is degraded

$ oc describe co kube-apiserver 
...
Status:
  Conditions:
    Last Transition Time:  2020-07-20T05:41:04Z
    Message:               NodeInstallerDegraded: 1 nodes are failing on revision 7:
NodeInstallerDegraded: pods "installer-7-control-plane-0" not found
    Reason:                NodeInstaller_InstallerPodFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-07-20T05:20:51Z
    Message:               NodeInstallerProgressing: 1 nodes are at revision 5; 2 nodes are at revision 7
    Reason:                NodeInstaller
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-07-20T05:23:46Z
    Message:               StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 5; 2 nodes are at revision 7
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-07-20T05:20:44Z
...

However, checking the kube-apiserver pods shows that they are all at revision 7 as expected.
$ oc get pod -n openshift-kube-apiserver --show-labels
NAME                                READY   STATUS      RESTARTS   AGE    LABELS
kube-apiserver-control-plane-0      4/4     Running     0          120m   apiserver=true,app=openshift-kube-apiserver,revision=7
kube-apiserver-control-plane-1      4/4     Running     0          123m   apiserver=true,app=openshift-kube-apiserver,revision=7
kube-apiserver-control-plane-2      4/4     Running     0          124m   apiserver=true,app=openshift-kube-apiserver,revision=7
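
For comparison, the per-node revisions the operator itself has recorded (which is what the degraded condition is computed from) can be read from the kubeapiservers resource. The following jsonpath query is an illustrative sketch; the nodeStatuses field names match the operator CR excerpts later in this bug:

$ oc get kubeapiservers cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" currentRevision="}{.currentRevision}{" targetRevision="}{.targetRevision}{" lastFailedRevision="}{.lastFailedRevision}{"\n"}{end}'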

$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-56498d4959-sgprw | grep -n '1 nodes are failing on revision 7'

181:I0720 06:06:12.740042       1 status_controller.go:172] clusteroperator/kube-apiserver diff {"status":{"conditions":[{"lastTransitionTime":"2020-07-20T05:41:04Z","message":"NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: pods \"installer-7-control-plane-0\" not found\nNodeControllerDegraded: The master nodes not ready: node \"control-plane-2\" not ready since 2020-07-20 06:06:12 +0000 UTC because KubeletNotReady ([container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful])","reason":"NodeController_MasterNodesReady::NodeInstaller_InstallerPodFailed","status":"True","type":"Degraded"},{"lastTransitionTime":"2020-07-20T05:20:51Z","message":"NodeInstallerProgressing: 1 nodes are at revision 5; 2 nodes are at revision 7","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2020-07-20T05:23:46Z","message":"StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 5; 2 nodes are at revision 7","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2020-07-20T05:20:44Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
182:I0720 06:06:12.801423       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"2a69b33f-6ce3-4cae-84d2-599cd26dafd5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: pods \"installer-7-control-plane-0\" not found" to "NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: pods \"installer-7-control-plane-0\" not found\nNodeControllerDegraded: The master nodes not ready: node \"control-plane-2\" not ready since 2020-07-20 06:06:12 +0000 UTC because KubeletNotReady ([container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful])"
183:I0720 06:06:12.882579       1 status_controller.go:172] clusteroperator/kube-apiserver diff {"status":{"conditions":[{"lastTransitionTime":"2020-07-20T05:41:04Z","message":"NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: pods \"installer-7-control-plane-0\" not found","reason":"NodeInstaller_InstallerPodFailed","status":"True","type":"Degraded"},{"lastTransitionTime":"2020-07-20T05:20:51Z","message":"NodeInstallerProgressing: 1 nodes are at revision 5; 2 nodes are at revision 7","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2020-07-20T05:23:46Z","message":"StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 5; 2 nodes are at revision 7","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2020-07-20T05:20:44Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
184:I0720 06:06:12.947788       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"2a69b33f-6ce3-4cae-84d2-599cd26dafd5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: pods \"installer-7-control-plane-0\" not found\nNodeControllerDegraded: The master nodes not ready: node \"control-plane-2\" not ready since 2020-07-20 06:06:12 +0000 UTC because KubeletNotReady ([container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful])" to "NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: pods \"installer-7-control-plane-0\" not found"
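
The same Degraded condition can also be read directly from the clusteroperator instead of grepping the operator logs; these are plain oc jsonpath queries (illustrative, nothing bug-specific):

$ oc get clusteroperator kube-apiserver -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
$ oc get clusteroperator kube-apiserver -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'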


Expected results:
The kube-apiserver cluster operator should report the correct status.

Additional info:

--- Additional comment from Ke Wang on 2020-07-20 06:52:55 EDT ---

must-gather logs: http://file.apac.redhat.com/~kewang/logs/1858763-must-gather.local.6400944162076975835.tar.gz

--- Additional comment from Eric Paris on 2020-07-20 07:00:17 EDT ---

This bug sets Target Release equal to a z-stream but has no bug in the 'Depends On' field. As such this is not a valid bug state and the target release is being unset.

Any bug targeting 4.1.z must have a bug targeting 4.2 in 'Depends On.'
Similarly, any bug targeting 4.2.z must have a bug with Target Release of 4.3 in 'Depends On.'

--- Additional comment from Seth Jennings on 2020-07-30 13:30:09 EDT ---

Something doesn't match up here.

All the nodes show Ready and all the static pods show Running at revision 7.

It seems like the operator is seeing something we aren't, or not seeing something that we are.

--- Additional comment from Stefan Schimanski on 2020-08-26 08:05:20 EDT ---

From kubelet logs:

767:Jul 20 05:41:36.046042 control-plane-0 hyperkube[1463]: E0720 05:41:36.045882    1463 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
19751:Jul 20 05:43:46.238236 control-plane-1 hyperkube[1472]: I0720 05:43:46.237981    1472 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-07-20 05:43:46.237942637 +0000 UTC m=+7.361174164 LastTransitionTime:2020-07-20 05:43:46.237942637 +0000 UTC m=+7.361174164 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
19761:Jul 20 05:43:46.272003 control-plane-1 hyperkube[1472]: E0720 05:43:46.271855    1472 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
50484:Jul 20 05:46:49.843351 control-plane-2 hyperkube[1462]: E0720 05:46:49.843327    1462 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
172942:Jul 20 06:01:41.422914 control-plane-0 hyperkube[1482]: I0720 06:01:41.421909    1482 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-07-20 06:01:41.421861118 +0000 UTC m=+7.287055459 LastTransitionTime:2020-07-20 06:01:41.421861118 +0000 UTC m=+7.287055459 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
173201:Jul 20 06:01:41.510137 control-plane-0 hyperkube[1482]: E0720 06:01:41.509003    1482 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
204954:Jul 20 06:03:52.931798 control-plane-1 hyperkube[1472]: I0720 06:03:52.931676    1472 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-07-20 06:03:52.931638668 +0000 UTC m=+7.225051242 LastTransitionTime:2020-07-20 06:03:52.931638668 +0000 UTC m=+7.225051242 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
204963:Jul 20 06:03:52.952973 control-plane-1 hyperkube[1472]: E0720 06:03:52.951399    1472 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
239275:Jul 20 06:06:12.498371 control-plane-2 hyperkube[1471]: I0720 06:06:12.460705    1471 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-07-20 06:06:12.460663895 +0000 UTC m=+7.069575540 LastTransitionTime:2020-07-20 06:06:12.460663895 +0000 UTC m=+7.069575540 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
239293:Jul 20 06:06:12.500399 control-plane-2 hyperkube[1471]: E0720 06:06:12.498580    1471 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]

So the operators (note that this affects all control-plane operators!) do not lie. The kubelet reports the PLEG error.
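
For reference, those kubelet PLEG messages can be pulled straight from a control-plane node with the usual oc debug / journalctl pattern (node name taken from this cluster; adjust as needed):

$ oc debug node/control-plane-2 -- chroot /host journalctl -u kubelet --no-pager | grep 'PLEG is not healthy'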

--- Additional comment from Stefan Schimanski on 2020-08-26 08:19:14 EDT ---

I also don't see the PLEG error showing up in the kubeapiserver operator CR later on. So as far as I can see, the condition is right (and the transient PLEG error in between is also right).

What seems to be the case though is that installer pods have disappeared:

  - lastTransitionTime: "2020-07-20T05:39:04Z"
    message: |-
      1 nodes are failing on revision 7:
      pods "installer-7-control-plane-0" not found
    reason: InstallerPodFailed
    status: "True"
    type: NodeInstallerDegraded
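
This is the deadlock: the nodeStatuses entry still records a failed installer pod that no longer exists, so the controller never retries. Both halves can be confirmed from the cluster; the first command is expected to fail with NotFound, matching the condition above, and the jsonpath query is an illustrative sketch over the same nodeStatuses fields:

$ oc get pod installer-7-control-plane-0 -n openshift-kube-apiserver
$ oc get kubeapiservers cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" lastFailedRevision="}{.lastFailedRevision}{"\n"}{end}'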

--- Additional comment from errata-xmlrpc on 2020-08-26 09:55:28 EDT ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2020:54579-02
https://errata.devel.redhat.com/advisory/54579

--- Additional comment from errata-xmlrpc on 2020-08-26 09:55:31 EDT ---

This bug has been added to advisory RHBA-2020:54579 by OpenShift Release Team Bot (ocp-build/buildvm.openshift.eng.bos.redhat.com@REDHAT.COM)

--- Additional comment from Venkata Siva Teja Areti on 2020-08-26 10:41:16 EDT ---



--- Additional comment from Stefan Schimanski on 2020-08-26 12:17:16 EDT ---

There is still one PR pending.

--- Additional comment from Stefan Schimanski on 2020-08-26 12:18:16 EDT ---



--- Additional comment from OpenShift Automated Release Tooling on 2020-08-31 01:32:38 EDT ---

Elliott changed bug status from ('MODIFIED',) to ON_QA.

--- Additional comment from Christoph Blecker on 2020-09-01 12:41:09 EDT ---

If a cluster gets stuck in this state, the following steps can be used to allow the kube-apiserver-operator to recover:

- Identify the node that has the failed installation
oc get kubeapiservers cluster -o json | jq .status.nodeStatuses
- Zero out the currentRevision of the node in question (in this example, it was the third master)
curl --header "Content-Type: application/json-patch+json" --header "Authorization: Bearer $(oc whoami --show-token)" --request PATCH --data '[{"op": "replace", "path": "/status/nodeStatuses/2/currentRevision", "value": 0}]' "$(oc whoami --show-server)/apis/operator.openshift.io/v1/kubeapiservers/cluster/status"
- Zero out the lastFailedRevision of the node in question (in this example, it was the third master)
curl --header "Content-Type: application/json-patch+json" --header "Authorization: Bearer $(oc whoami --show-token)" --request PATCH --data '[{"op": "replace", "path": "/status/nodeStatuses/2/lastFailedRevision", "value": 0}]' "$(oc whoami --show-server)/apis/operator.openshift.io/v1/kubeapiservers/cluster/status"
- If needed, delete and restart the kube-apiserver-operator pod
oc delete pods -n openshift-kube-apiserver-operator -l app=kube-apiserver-operator
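
The same recovery can be wrapped into a small script. This is a minimal sketch based only on the steps above: the zero-based index of the failing node in .status.nodeStatuses is a parameter (2 for the third master in this example), and it assumes a logged-in user with permission to patch the kubeapiservers status subresource:

# zero-based index of the failing node in .status.nodeStatuses
NODE_INDEX=2
TOKEN=$(oc whoami --show-token)
SERVER=$(oc whoami --show-server)
# reset both revision fields that keep the installer controller stuck
for FIELD in currentRevision lastFailedRevision; do
  curl --header "Content-Type: application/json-patch+json" \
       --header "Authorization: Bearer ${TOKEN}" \
       --request PATCH \
       --data "[{\"op\": \"replace\", \"path\": \"/status/nodeStatuses/${NODE_INDEX}/${FIELD}\", \"value\": 0}]" \
       "${SERVER}/apis/operator.openshift.io/v1/kubeapiservers/cluster/status"
done
# if needed, restart the operator pod so it reconciles from the cleaned status
oc delete pods -n openshift-kube-apiserver-operator -l app=kube-apiserver-operator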

Comment 1 Scott Dodson 2020-09-01 17:38:42 UTC
This led to an OSD upgrade deadlock when upgrading from 4.4.11 to 4.4.16 and as such we should backport to 4.5.z and ideally 4.4.z if reasonable.

Comment 2 Stefan Schimanski 2020-09-04 09:05:58 UTC
*** Bug 1817419 has been marked as a duplicate of this bug. ***

Comment 3 Stefan Schimanski 2020-09-11 15:29:05 UTC
PRs are up and have to be approved by branch managers. Adding UpcomingSprint.

Comment 5 Ke Wang 2020-09-18 04:05:04 UTC
The related PRs were verified before being merged, and 4.5.0-0.nightly-2020-09-17-145245 contains them, so this is moving to VERIFIED directly.

$ pr-check.sh 4.5.0-0.nightly-2020-09-17-145245 cluster-kube-controller-manager 445
5cd0f37a OpenShift Merge Robot Thu Sep 17 20:39:40 2020 - Merge pull request #445 from soltysh/bug1874597

$ pr-check.sh 4.5.0-0.nightly-2020-09-17-145245 cluster-kube-scheduler-operator 278
eaaf196b OpenShift Merge Robot Thu Sep 17 21:13:07 2020 - Merge pull request #278 from soltysh/bug1874597

$ pr-check.sh 4.5.0-0.nightly-2020-09-17-145245 cluster-kube-apiserver-operator 942
66704667 OpenShift Merge Robot Sat Sep 12 00:42:17 2020 - Merge pull request #942 from soltysh/bug1874597

$ pr-check.sh 4.5.0-0.nightly-2020-09-17-145245 cluster-etcd-operator 434
be2b8d97 OpenShift Merge Robot Sat Sep 12 00:23:34 2020 - Merge pull request #434 from hexfusion/cp-lb-bump

