Bug 1858763
Summary: | Static pod installer controller deadlocks with non-existing installer pod, WAS: kube-apiserver of cluster operator always with incorrect status due to PLEG error | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Ke Wang <kewang> | |
Component: | kube-apiserver | Assignee: | Luis Sanchez <sanchezl> | |
Status: | CLOSED ERRATA | QA Contact: | Ke Wang <kewang> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 4.5 | CC: | aos-bugs, cblecker, cmeadors, jokerman, jupierce, mfojtik, schoudha, sttts, travi, wking, xxia | |
Target Milestone: | --- | Keywords: | ServiceDeliveryImpact, Upgrades | |
Target Release: | 4.6.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | No Doc Update | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1874597 1909600 (view as bug list) | Environment: | ||
Last Closed: | 2020-10-27 16:16:00 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1822018 | |||
Bug Blocks: | 1817419, 1822016, 1874597, 1909600, 1949370 |
Description
Ke Wang
2020-07-20 10:35:44 UTC
Something doesn't match up here. All the nodes show Ready and all the static pods show Running at revision 7. It seems like the operator is seeing something we aren't, or not seeing something that we are.

From the kubelet logs:

767:Jul 20 05:41:36.046042 control-plane-0 hyperkube[1463]: E0720 05:41:36.045882 1463 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
19751:Jul 20 05:43:46.238236 control-plane-1 hyperkube[1472]: I0720 05:43:46.237981 1472 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-07-20 05:43:46.237942637 +0000 UTC m=+7.361174164 LastTransitionTime:2020-07-20 05:43:46.237942637 +0000 UTC m=+7.361174164 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
19761:Jul 20 05:43:46.272003 control-plane-1 hyperkube[1472]: E0720 05:43:46.271855 1472 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
50484:Jul 20 05:46:49.843351 control-plane-2 hyperkube[1462]: E0720 05:46:49.843327 1462 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
172942:Jul 20 06:01:41.422914 control-plane-0 hyperkube[1482]: I0720 06:01:41.421909 1482 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-07-20 06:01:41.421861118 +0000 UTC m=+7.287055459 LastTransitionTime:2020-07-20 06:01:41.421861118 +0000 UTC m=+7.287055459 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
173201:Jul 20 06:01:41.510137 control-plane-0 hyperkube[1482]: E0720 06:01:41.509003 1482 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
204954:Jul 20 06:03:52.931798 control-plane-1 hyperkube[1472]: I0720 06:03:52.931676 1472 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-07-20 06:03:52.931638668 +0000 UTC m=+7.225051242 LastTransitionTime:2020-07-20 06:03:52.931638668 +0000 UTC m=+7.225051242 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
204963:Jul 20 06:03:52.952973 control-plane-1 hyperkube[1472]: E0720 06:03:52.951399 1472 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
239275:Jul 20 06:06:12.498371 control-plane-2 hyperkube[1471]: I0720 06:06:12.460705 1471 setters.go:559] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-07-20 06:06:12.460663895 +0000 UTC m=+7.069575540 LastTransitionTime:2020-07-20 06:06:12.460663895 +0000 UTC m=+7.069575540 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
239293:Jul 20 06:06:12.500399 control-plane-2 hyperkube[1471]: E0720 06:06:12.498580 1471 kubelet.go:1863] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
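For anyone retracing this diagnosis, here is a minimal sketch of how the PLEG messages above could be pulled from the control-plane kubelets. It assumes a logged-in `oc` client, the default `node-role.kubernetes.io/master` label on control-plane nodes, and that `oc adm node-logs` is available; the grep pattern is simply the message text seen in the excerpts above.

```bash
#!/usr/bin/env bash
# Sketch: scan kubelet logs on all control-plane nodes for PLEG health errors.
# Assumptions: logged-in `oc` client; control-plane nodes carry the standard master role label.
set -euo pipefail

for node in $(oc get nodes -l node-role.kubernetes.io/master -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== ${node} ==="
  # `oc adm node-logs -u kubelet` streams the kubelet journal from the node.
  oc adm node-logs "${node}" -u kubelet | grep -n 'PLEG is not healthy' || echo "no PLEG errors found"
done
```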
So the operator(s) (all control-plane operators, notice that!) do not lie. The kubelet reports the PLEG error. I also don't see the PLEG error showing up in the kubeapiserver operator CR later on. So as far as I can see the condition is right (and the PLEG error in between is right as well). What does seem to be the case, though, is that the installer pods have disappeared:

- lastTransitionTime: "2020-07-20T05:39:04Z"
  message: |-
    1 nodes are failing on revision 7:
    pods "installer-7-control-plane-0" not found
  reason: InstallerPodFailed
  status: "True"
  type: NodeInstallerDegraded

*** Bug 1723966 has been marked as a duplicate of this bug. ***

There is still one PR pending.

*** Bug 1861899 has been marked as a duplicate of this bug. ***

If a cluster gets stuck in this state, the following steps can be used to allow the kube-apiserver-operator to recover (a consolidated sketch of these steps follows this comment):

- Identify the node that has the failed installation:

  oc get kubeapiservers cluster -o json | jq .status.nodeStatuses

- Zero out the currentRevision of the node in question (in this example, it was the third master):

  curl --header "Content-Type: application/json-patch+json" --header "Authorization: Bearer $(oc whoami --show-token)" --request PATCH --data '[{"op": "replace", "path": "/status/nodeStatuses/2/currentRevision", "value": 0}]' "$(oc whoami --show-server)/apis/operator.openshift.io/v1/kubeapiservers/cluster/status"

- Zero out the lastFailedRevision of the node in question (in this example, it was the third master):

  curl --header "Content-Type: application/json-patch+json" --header "Authorization: Bearer $(oc whoami --show-token)" --request PATCH --data '[{"op": "replace", "path": "/status/nodeStatuses/2/lastFailedRevision", "value": 0}]' "$(oc whoami --show-server)/apis/operator.openshift.io/v1/kubeapiservers/cluster/status"

- If needed, delete and restart the kube-apiserver-operator pod:

  oc delete pods -n openshift-kube-apiserver-operator -l app=kube-apiserver-operator

*** Bug 1817419 has been marked as a duplicate of this bug. ***
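The sketch below simply consolidates the manual recovery commands from the comment above into one script. It assumes a logged-in `oc` client with cluster-admin, and the `NODE_INDEX` variable is a placeholder for the index of the failing entry in `.status.nodeStatuses` (2 in the example above); as in the original commands, any extra TLS options for the API server certificate are omitted.

```bash
#!/usr/bin/env bash
# Sketch of the manual recovery procedure described in the comment above.
# Assumptions: logged-in `oc` client with cluster-admin; NODE_INDEX set to the index
# of the failing entry in .status.nodeStatuses (2 in the example in this bug).
set -euo pipefail

NODE_INDEX="${NODE_INDEX:-2}"   # placeholder: adjust to the failing node's index
API_SERVER="$(oc whoami --show-server)"
TOKEN="$(oc whoami --show-token)"
STATUS_URL="${API_SERVER}/apis/operator.openshift.io/v1/kubeapiservers/cluster/status"

# 1. Inspect the per-node revision status to find the failing node.
oc get kubeapiservers cluster -o json | jq .status.nodeStatuses

# 2. Zero out currentRevision and lastFailedRevision for that node via a JSON Patch
#    against the /status subresource, as in the commands above.
for field in currentRevision lastFailedRevision; do
  curl --header "Content-Type: application/json-patch+json" \
       --header "Authorization: Bearer ${TOKEN}" \
       --request PATCH \
       --data "[{\"op\": \"replace\", \"path\": \"/status/nodeStatuses/${NODE_INDEX}/${field}\", \"value\": 0}]" \
       "${STATUS_URL}"
done

# 3. If needed, restart the kube-apiserver-operator pod so it picks up the cleared status.
oc delete pods -n openshift-kube-apiserver-operator -l app=kube-apiserver-operator
```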
UPI installed one disconnected cluster on vSphere 6.7 successfully; see cluster details:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-03-084733   True        False         41m     Cluster version is 4.6.0-0.nightly-2020-09-03-084733

$ oc get infrastructures.config.openshift.io -o json | jq .items[0].spec.platformSpec
{
  "type": "VSphere"
}

$ oc get nodes
NAME              STATUS   ROLES    AGE   VERSION
compute-0         Ready    worker   52m   v1.19.0-rc.2+b5dc585-dirty
compute-1         Ready    worker   52m   v1.19.0-rc.2+b5dc585-dirty
control-plane-0   Ready    master   67m   v1.19.0-rc.2+b5dc585-dirty
control-plane-1   Ready    master   67m   v1.19.0-rc.2+b5dc585-dirty
control-plane-2   Ready    master   67m   v1.19.0-rc.2+b5dc585-dirty

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-09-03-084733   True        False         False      52s
cloud-credential                           4.6.0-0.nightly-2020-09-03-084733   True        False         False      68m
cluster-autoscaler                         4.6.0-0.nightly-2020-09-03-084733   True        False         False      54m
config-operator                            4.6.0-0.nightly-2020-09-03-084733   True        False         False      63m
console                                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      17m
csi-snapshot-controller                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      21m
dns                                        4.6.0-0.nightly-2020-09-03-084733   True        False         False      60m
etcd                                       4.6.0-0.nightly-2020-09-03-084733   True        False         False      61m
image-registry                             4.6.0-0.nightly-2020-09-03-084733   True        False         False      22m
ingress                                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      48m
insights                                   4.6.0-0.nightly-2020-09-03-084733   True        False         False      55m
kube-apiserver                             4.6.0-0.nightly-2020-09-03-084733   True        False         False      59m
kube-controller-manager                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      60m
kube-scheduler                             4.6.0-0.nightly-2020-09-03-084733   True        False         False      59m
kube-storage-version-migrator              4.6.0-0.nightly-2020-09-03-084733   True        False         False      23m
machine-api                                4.6.0-0.nightly-2020-09-03-084733   True        False         False      51m
machine-approver                           4.6.0-0.nightly-2020-09-03-084733   True        False         False      60m
machine-config                             4.6.0-0.nightly-2020-09-03-084733   True        False         False      60m
marketplace                                4.6.0-0.nightly-2020-09-03-084733   True        False         False      23m
monitoring                                 4.6.0-0.nightly-2020-09-03-084733   True        False         False      17m
network                                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      48m
node-tuning                                4.6.0-0.nightly-2020-09-03-084733   True        False         False      63m
openshift-apiserver                        4.6.0-0.nightly-2020-09-03-084733   True        False         False      18m
openshift-controller-manager               4.6.0-0.nightly-2020-09-03-084733   True        False         False      46m
openshift-samples                          4.6.0-0.nightly-2020-09-03-084733   True        False         False      17m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-09-03-084733   True        False         False      61m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-09-03-084733   True        False         False      61m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-03-084733   True        False         False      17m
service-ca                                 4.6.0-0.nightly-2020-09-03-084733   True        False         False      63m
storage                                    4.6.0-0.nightly-2020-09-03-084733   True        False         False      63m

The cluster works well, so moving the bug to Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
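As a follow-up to the verification above, here is a small sketch of how the originally reported symptom could be re-checked on a fixed build. It assumes a logged-in `oc` client; the condition names (Degraded on the cluster operator, NodeInstallerDegraded on the operator CR) come from the status snippet earlier in this bug.

```bash
#!/usr/bin/env bash
# Sketch: confirm the kube-apiserver operator is no longer degraded and that the
# NodeInstallerDegraded condition from this bug is not set. Assumes a logged-in `oc` client.
set -euo pipefail

# Cluster-operator level: Degraded should be False.
oc get co kube-apiserver -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}{"\n"}'

# Operator CR level: NodeInstallerDegraded (the condition seen in this bug) should be False.
oc get kubeapiserver cluster -o jsonpath='{.status.conditions[?(@.type=="NodeInstallerDegraded")].status}{"\n"}'

# Per-node revision status, for comparison with the nodeStatuses check described above.
oc get kubeapiserver cluster -o json | jq .status.nodeStatuses
```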