Bug 1779796 - kube-apiserver Progressing=True: 1 nodes are at revision 4; 2 nodes are at revision 6
Summary: kube-apiserver Progressing=True: 1 nodes are at revision 4; 2 nodes are at revision 6
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.4.0
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks: 1781678
 
Reported: 2019-12-04 18:26 UTC by W. Trevor King
Modified: 2020-09-30 09:03 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1781678
Environment:
Last Closed: 2020-05-13 21:54:01 UTC
Target Upstream Version:
Embargoed:




Links
System: Red Hat Product Errata
ID: RHBA-2020:0581
Private: 0  Priority: None  Status: None  Summary: None
Last Updated: 2020-05-13 21:54:04 UTC

Description W. Trevor King 2019-12-04 18:26:53 UTC
Release promotion informer [1]:

level=info msg="Cluster operator authentication Progressing is True with ProgressingWellKnownNotReady: Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://10.0.0.19:6443/.well-known/oauth-authorization-server endpoint data"
level=info msg="Cluster operator authentication Available is False with Available: "
level=info msg="Cluster operator insights Disabled is False with : "
level=info msg="Cluster operator kube-apiserver Progressing is True with Progressing: Progressing: 1 nodes are at revision 4; 2 nodes are at revision 6"
level=fatal msg="failed to initialize the cluster: Working towards 4.3.0-0.nightly-2019-12-04-004448: 100% complete"

Similar errors have been reported in bug 1768252 and bug 1776402, but etcd/disk latency was implicated in those bugs, and Sam checked this job and saw no etcd latency issues.  This happened 13 times in the past 24h [2] and seems well distributed among job names [3].  The per-node revision state can also be checked directly; see the sketch after the links below.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.3/514
[2]: https://search.svc.ci.openshift.org/chart?search=Cluster%20operator%20kube-apiserver%20Progressing%20is%20True.*nodes%20are%20at%20revision
[3]: https://search.svc.ci.openshift.org/?search=Cluster%20operator%20kube-apiserver%20Progressing%20is%20True.*nodes%20are%20at%20revision
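
For that kind of check, the per-node revisions behind the kube-apiserver Progressing message can be read straight from the operator resource.  A minimal sketch, assuming the standard static-pod operator status fields (status.nodeStatuses[].currentRevision / targetRevision) on the kubeapiserver/cluster resource:

$ oc get kubeapiserver cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" current="}{.currentRevision}{" target="}{.targetRevision}{"\n"}{end}'

A node whose currentRevision stays behind its targetRevision for the whole install window is the one holding Progressing=True.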

Comment 1 Standa Laznicka 2019-12-10 08:46:23 UTC
Looking at the attached OpenStack test run, I can see that it took quite a long time for the KAS pods to come up at the revision that would actually be capable of serving the oauth-metadata endpoint; moving to the KAS component.

Also, there's a nil deref panic in KAS-o in the logs of the test run above: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.3/514/artifacts/e2e-openstack-serial/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-55b8787655-zbtbd_kube-apiserver-operator.log
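
Whether a given revision is actually serving that metadata can be probed directly; a minimal sketch, issuing a plain GET through the apiserver against the same well-known path the authentication operator polls:

$ oc get --raw /.well-known/oauth-authorization-server

A 404 here matches the ProgressingWellKnownNotReady message above; once a revision with the oauth metadata wired up rolls out, this returns the discovery JSON.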

Comment 7 Ke Wang 2020-01-19 08:41:05 UTC
Client Version: v4.3.0
Server Version: 4.3.0-rc.2
Kubernetes Version: v1.16.2

$ master=$(oc get node | grep master | awk '{print $1}' | head -1)
$ oc debug node/$master

After logging in to the master debug pod,
- check that the field "bindNetwork":"tcp4" has not been changed; found the following:
# grep -rnw /etc/kubernetes -e '"bindNetwork":"tcp4"' | awk -F: '{print $1}'
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-3/configmaps/config/config.yaml
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-4/configmaps/config/config.yaml
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-6/configmaps/config/config.yaml
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-7/configmaps/config/config.yaml
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-8/configmaps/config/config.yaml
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-9/configmaps/config/config.yaml
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-10/configmaps/config/config.yaml
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-11/configmaps/config/config.yaml

- check which configs have been changed to "bindNetwork":"tcp"; found the following:
# grep -rnw /etc/kubernetes -e '"bindNetwork":"tcp"' | awk -F: '{print $1}'
/etc/kubernetes/static-pod-resources/kube-controller-manager-pod-4/configmaps/cluster-policy-controller-config/config.yaml
/etc/kubernetes/static-pod-resources/kube-controller-manager-pod-6/configmaps/cluster-policy-controller-config/config.yaml
/etc/kubernetes/static-pod-resources/kube-controller-manager-pod-7/configmaps/cluster-policy-controller-config/config.yaml

So I think the fix for this bug is not complete.

Comment 8 Ke Wang 2020-01-19 08:43:43 UTC
Please ignore the previous comment; I pasted it into the wrong bug.

Comment 11 errata-xmlrpc 2020-05-13 21:54:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

