Bug 1892129 - [AWS] Failed to create disconnected and private cluster; etcd and apiserver are not ready. [NEEDINFO]
Keywords:
Status: CLOSED DUPLICATE of bug 1921901
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: ---
Target Release: 4.6.z
Assignee: Akhil Rane
QA Contact: Yunfei Jiang
URL:
Whiteboard: LifecycleReset
Depends On: 1903226
Blocks:
 
Reported: 2020-10-28 01:38 UTC by Yunfei Jiang
Modified: 2021-02-01 16:28 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1903226 (view as bug list)
Environment:
Last Closed: 2021-02-01 16:28:49 UTC
Target Upstream Version:
Embargoed:
mfojtik: needinfo?


Attachments:
install log (149.10 KB, text/plain) - 2020-10-28 01:39 UTC, Yunfei Jiang
01_vpc_disconnected_aws_with_privatelink.yaml (7.28 KB, text/plain) - 2020-11-25 07:25 UTC, Yunfei Jiang

Description Yunfei Jiang 2020-10-28 01:38:21 UTC
The error occurs occasionally; the install log and bootstrap logs are attached.

Error from install log: 
level=info msg="Waiting up to 30m0s for bootstrapping to complete..." 
E1026 09:47:21.258406     885 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get "https://api.yunjiang-26dprr.qe.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=13936&timeoutSeconds=465&watch=true": Service Unavailable 
E1026 09:47:22.314589     885 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get "https://api.yunjiang-26dprr.qe.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0": Service Unavailable 
E1026 09:47:23.366783     885 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get "https://api.yunjiang-26dprr.qe.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0": Service Unavailable 
E1026 09:47:24.428891     885 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get "https://api.yunjiang-26dprr.qe.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0": Service Unavailable 
level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthServiceCheckEndpointAccessibleController_SyncError::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError: OAuthServiceCheckEndpointAccessibleControllerDegraded: Get \"https://172.30.211.114:443/healthz\": dial tcp 172.30.211.114:443: connect: connection refused\nOAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: oauth service endpoints are not ready\nIngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server" 
level=info msg="Cluster operator authentication Progressing is Unknown with NoData: " 
level=info msg="Cluster operator authentication Available is False with OAuthServiceCheckEndpointAccessibleController_EndpointUnavailable::OAuthServiceEndpointsCheckEndpointAccessibleController_EndpointUnavailable::ReadyIngressNodes_NoReadyIngressNodes: OAuthServiceEndpointsCheckEndpointAccessibleControllerAvailable: Failed to get oauth-openshift enpoints\nReadyIngressNodesAvailable: Authentication require functional ingress which requires at least one schedulable and ready node. Got 2 worker nodes and 3 master nodes (none are schedulable or ready for ingress pods).\nOAuthServiceCheckEndpointAccessibleControllerAvailable: Get \"https://172.30.211.114:443/healthz\": dial tcp 172.30.211.114:443: connect: connection refused" 
level=error msg="Cluster operator etcd Degraded is True with InstallerController_Error::StaticPods_Error: InstallerControllerDegraded: Internal error occurred: admission plugin \"MutatingAdmissionWebhook\" failed to complete mutation in 13s\nStaticPodsDegraded: pods \"etcd-ip-10-0-71-163.us-east-2.compute.internal\" not found\nStaticPodsDegraded: pods \"etcd-ip-10-0-49-162.us-east-2.compute.internal\" not found" 
level=info msg="Cluster operator etcd Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 2” 


NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS 
version             False       True          21h     Unable to apply 4.6.0: an unknown error has occurred: MultipleErrors 

NAME                                        STATUS     ROLES    AGE   VERSION 
ip-10-0-49-162.us-east-2.compute.internal   Ready      master   21h   v1.19.0+d59ce34 
ip-10-0-54-178.us-east-2.compute.internal   NotReady   worker   21h   v1.19.0+d59ce34 
ip-10-0-55-251.us-east-2.compute.internal   Ready      master   21h   v1.19.0+d59ce34 
ip-10-0-69-151.us-east-2.compute.internal   NotReady   worker   21h   v1.19.0+d59ce34 
ip-10-0-71-163.us-east-2.compute.internal   Ready      master   21h   v1.19.0+d59ce34 

NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE 
authentication                                       False       Unknown       True       22h 
cloud-credential                           4.6.0     True        False         False      22h 
cluster-autoscaler                         4.6.0     True        False         False      22h 
config-operator                            4.6.0     True        False         False      22h 
console 
csi-snapshot-controller                    4.6.0     True        False         False      51m 
dns                                        4.6.0     True        False         False      22h 
etcd                                       4.6.0     True        True          True       22h 
image-registry 
ingress                                              False       True          True       22h 
insights                                   4.6.0     True        False         True       22h 
kube-apiserver                                       False       True          True       22h 
kube-controller-manager                              False       True          True       22h 
kube-scheduler                             4.6.0     False       True          True       22h 
kube-storage-version-migrator              4.6.0     False       False         False      22h 
machine-api                                4.6.0     True        False         False      22h 
machine-approver                           4.6.0     True        False         False      22h 
machine-config                             4.6.0     True        False         False      22h 
marketplace                                4.6.0     True        False         False      22h 
monitoring                                           False       True          True       21h 
network                                    4.6.0     True        False         False      22h 
node-tuning                                4.6.0     True        False         False      22h 
openshift-apiserver                        4.6.0     False       False         False      22h 
openshift-controller-manager                         False       True          False      22h 
openshift-samples 
operator-lifecycle-manager                 4.6.0     True        False         False      22h 
operator-lifecycle-manager-catalog         4.6.0     True        False         False      22h 
operator-lifecycle-manager-packageserver             False       True          False      22h 
service-ca                                 4.6.0     True        False         False      22h 
storage                                    4.6.0     True        False         False      22h 

oc describe node/ip-10-0-54-178.us-east-2.compute.internal
<—snip—> 
Conditions: 
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message 
  ----             ------  -----------------                 ------------------                ------                       ------- 
  MemoryPressure   False   Tue, 27 Oct 2020 03:25:56 -0400   Mon, 26 Oct 2020 05:39:43 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available 
  DiskPressure     False   Tue, 27 Oct 2020 03:25:56 -0400   Mon, 26 Oct 2020 05:39:43 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure 
  PIDPressure      False   Tue, 27 Oct 2020 03:25:56 -0400   Mon, 26 Oct 2020 05:39:43 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available 
  Ready            False   Tue, 27 Oct 2020 03:25:56 -0400   Mon, 26 Oct 2020 05:39:43 -0400   KubeletNotReady              runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started? 
<—snip—> 

Checked etcd logs:
2020-10-26 09:25:56.696635 I | etcdserver/api: enabled capabilities for version 3.4
2020-10-26 09:34:30.420782 I | embed: rejected connection from "10.0.49.162:48474" (error "remote error: tls: bad certificate", ServerName "")
2020-10-26 09:34:31.427203 I | embed: rejected connection from "10.0.49.162:48484" (error "remote error: tls: bad certificate", ServerName "")

Version-Release number of the following components: 
4.6.0-x86_64 

How reproducible: 
occasionally 

Steps to Reproduce: 
1. create a disconnected and private cluster (configure CCO in manual mode, no proxy)

Actual results: 
Cluster creation fails.

Expected results: 
Cluster creation succeeds.

Additional info:
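For reference, a minimal sketch of the install-config.yaml fields that drive this kind of disconnected, private, manual-CCO install. All values below are illustrative placeholders, not the exact config from this cluster; a complete install-config also needs pullSecret, sshKey, networking, and machine pools, and the real VPC/subnet layout used here is in the attached 01_vpc_disconnected_aws_with_privatelink.yaml:

cat <<'EOF' > install-config.yaml
apiVersion: v1
baseDomain: example.devcluster.openshift.com    # placeholder
metadata:
  name: disconnected-private                    # placeholder
credentialsMode: Manual                         # CCO in manual mode
publish: Internal                               # private cluster, no public endpoints
platform:
  aws:
    region: us-east-2
    subnets:                                    # pre-existing private subnets in the disconnected VPC
    - subnet-0123456789abcdef0
imageContentSources:                            # mirror registry used for the disconnected install
- mirrors:
  - mirror.example.com:5000/ocp4/openshift4
  source: quay.io/openshift-release-dev/ocp-release
EOF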

Comment 1 Yunfei Jiang 2020-10-28 01:39:44 UTC
Created attachment 1724664 [details]
install log

Comment 3 Yunfei Jiang 2020-10-28 02:18:39 UTC
I don't think it's a blocker, since this error occurs occasionally

Comment 4 Victor Pickard 2020-11-24 16:09:18 UTC
Hi,

I need to see the logs from the worker nodes when the install fails. I don't see them in the log bundle. 

Can you please run must-gather and be sure to include the worker node(s) that fail?

Thanks,
Victor
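For reference, a capture along these lines would cover what is being asked for here; node names come from the earlier oc get node output and the destination directory is arbitrary:

oc adm must-gather --dest-dir=./must-gather
# kubelet and crio journals from the NotReady workers; this only works while the API
# server can still proxy log requests to the node
oc adm node-logs ip-10-0-54-178.us-east-2.compute.internal -u kubelet > worker-54-178-kubelet.log
oc adm node-logs ip-10-0-69-151.us-east-2.compute.internal -u crio > worker-69-151-crio.log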

Comment 6 Yunfei Jiang 2020-11-25 07:25:32 UTC
Created attachment 1733283 [details]
01_vpc_disconnected_aws_with_privatelink.yaml

Comment 7 Victor Pickard 2020-11-25 14:25:35 UTC
Thanks Yunfei! I am debugging on this cluster now. I appreciate the detailed steps to set up the cluster, and also thanks for setting up this cluster for me to debug. Much appreciated!

Comment 8 Victor Pickard 2020-11-25 17:58:59 UTC
The daemonsets in this cluster are all failing to create pods. It looks like the connection to the API server is not working. Still investigating why.

[vpickard@rippleRider$][~/bz1892129]$ oc describe ds sdn -n openshift-sdn

Events:
  Type     Reason        Age                   From                  Message
  ----     ------        ----                  ----                  -------
  Warning  FailedCreate  163m (x15 over 3h)    daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  161m                  daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  142m (x15 over 158m)  daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  140m (x9 over 140m)   daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-sdn/pods": dial tcp [::1]:6443: connect: connection refused
  Warning  FailedCreate  126m (x11 over 139m)  daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  120m (x10 over 120m)  daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-sdn/pods": dial tcp [::1]:6443: connect: connection refused
  Warning  FailedCreate  105m (x11 over 119m)  daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  81m (x15 over 97m)    daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  79m (x9 over 79m)     daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-sdn/pods": dial tcp [::1]:6443: connect: connection refused
  Warning  FailedCreate  62m (x11 over 79m)    daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  41m (x15 over 57m)    daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  21m (x14 over 36m)    daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  19m                   daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-sdn/pods": http2: server sent GOAWAY and closed the connection; LastStreamID=1187, ErrCode=NO_ERROR, debug=""
  Warning  FailedCreate  19m (x9 over 19m)     daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-sdn/pods": dial tcp [::1]:6443: connect: connection refused
  Warning  FailedCreate  4m12s (x10 over 18m)  daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s


[vpickard@rippleRider$][~/bz1892129]$ oc describe ds dns-default -n openshift-dns

Events:
  Type     Reason        Age                   From                  Message
  ----     ------        ----                  ----                  -------
  Warning  FailedCreate  170m (x15 over 3h6m)  daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  150m (x14 over 165m)  daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  148m (x9 over 148m)   daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-dns/pods": dial tcp [::1]:6443: connect: connection refused
  Warning  FailedCreate  136m (x9 over 147m)   daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  128m (x10 over 128m)  daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-dns/pods": dial tcp [::1]:6443: connect: connection refused
  Warning  FailedCreate  111m (x9 over 126m)   daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  107m                  daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  88m (x15 over 105m)   daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  87m                   daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-dns/pods": unexpected EOF
  Warning  FailedCreate  87m (x8 over 87m)     daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-dns/pods": dial tcp [::1]:6443: connect: connection refused
  Warning  FailedCreate  76m (x9 over 86m)     daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  49m (x15 over 65m)    daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  27m (x15 over 44m)    daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  26m (x10 over 26m)    daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-dns/pods": dial tcp [::1]:6443: connect: connection refused
  Warning  FailedCreate  11m (x9 over 25m)     daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  2s (x5 over 4m36s)    daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s


And you can see that the DaemonSet desired replica count is 3 for most of them, when it should be 6.

This failed cluster
===================
[vpickard@rippleRider$][~/bz1892129]$ oc get ds -A
NAMESPACE                                NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
openshift-cluster-csi-drivers            aws-ebs-csi-driver-node           3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-cluster-node-tuning-operator   tuned                             3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-controller-manager             controller-manager                3         3         0       0            0           node-role.kubernetes.io/master=                15h
openshift-dns                            dns-default                       3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-image-registry                 node-ca                           3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-machine-api                    machine-api-termination-handler   0         0         0       0            0           machine.openshift.io/interruptible-instance=   15h
openshift-machine-config-operator        machine-config-daemon             3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-machine-config-operator        machine-config-server             3         3         3       3            3           node-role.kubernetes.io/master=                15h
openshift-monitoring                     node-exporter                     3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-multus                         multus                            3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-multus                         multus-admission-controller       3         3         3       3            3           node-role.kubernetes.io/master=                15h
openshift-multus                         network-metrics-daemon            3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-sdn                            ovs                               3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-sdn                            sdn                               3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-sdn                            sdn-controller                    3         3         3       3            3           node-role.kubernetes.io/master=                15h
[vpickard@rippleRider$][~/bz1892129]$ 



Working cluster
===============
[vpickard@rippleRider$][~]$ oc get ds -A
NAMESPACE                                NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
openshift-cluster-csi-drivers            aws-ebs-csi-driver-node           6         6         6       6            6           kubernetes.io/os=linux                         75m
openshift-cluster-node-tuning-operator   tuned                             6         6         6       6            6           kubernetes.io/os=linux                         75m
openshift-controller-manager             controller-manager                3         3         3       3            3           node-role.kubernetes.io/master=                75m
openshift-dns                            dns-default                       6         6         6       6            6           kubernetes.io/os=linux                         74m
openshift-image-registry                 node-ca                           6         6         6       6            6           kubernetes.io/os=linux                         74m
openshift-machine-api                    machine-api-termination-handler   0         0         0       0            0           machine.openshift.io/interruptible-instance=   69m
openshift-machine-config-operator        machine-config-daemon             6         6         6       6            6           kubernetes.io/os=linux                         75m
openshift-machine-config-operator        machine-config-server             3         3         3       3            3           node-role.kubernetes.io/master=                73m
openshift-monitoring                     node-exporter                     6         6         6       6            6           kubernetes.io/os=linux                         75m
openshift-multus                         multus                            6         6         6       6            6           kubernetes.io/os=linux                         76m
openshift-multus                         multus-admission-controller       3         3         3       3            3           node-role.kubernetes.io/master=                76m
openshift-multus                         network-metrics-daemon            6         6         6       6            6           kubernetes.io/os=linux                         76m
openshift-sdn                            ovs                               6         6         6       6            6           kubernetes.io/os=linux                         76m
openshift-sdn                            sdn                               6         6         6       6            6           kubernetes.io/os=linux                         76m
openshift-sdn                            sdn-controller                    3         3         3       3            3           node-role.kubernetes.io/master=                76m
[vpickard@rippleRider$][~]$
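To see which webhook is behind the repeated "MutatingAdmissionWebhook failed to complete mutation in 13s" errors above, the kube-apiserver logs usually name the webhook and the timeout; a sketch of pulling that out (the label and container name are the standard ones for the static kube-apiserver pods, and the grep patterns are the usual error strings, so treat both as assumptions):

oc logs -n openshift-kube-apiserver -l app=openshift-kube-apiserver -c kube-apiserver --tail=2000 \
  | grep -iE 'failed calling webhook|context deadline' | tail -n 20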

Comment 9 Victor Pickard 2020-11-25 19:52:47 UTC
It looks like the apiserver did not start. From the kube-apiserver-operator logs, I see this:


oc logs kube-apiserver-operator-787b8d6458-8ztt6 -n openshift-kube-apiserver-operator|more

I1125 18:08:05.811870       1 cmd.go:200] Using service-serving-cert provided certificates
I1125 18:08:05.821573       1 observer_polling.go:159] Starting file observer
W1125 18:08:05.835340       1 builder.go:207] unable to get owner reference (falling back to namespace): Get "https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver-operator/pods/kube-apiserver-operator-787b8d6458-8ztt6": dial tcp 172.30.0
.1:443: connect: connection refused
I1125 18:08:05.835521       1 builder.go:238] kube-apiserver-operator version v4.0.0-alpha.0-1126-g358d3e91-358d3e915b2e7df4e1557f4c73c3a911a151b456
W1125 18:08:27.021574       1 requestheader_controller.go:193] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLEBINDING_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'



[vpickard@rippleRider$][~/bz1892129]$ oc get pods -A |grep apiserver
openshift-apiserver-operator                       openshift-apiserver-operator-7585c9f557-vv9dn                        1/1     Running     13         17h
openshift-kube-apiserver-operator                  kube-apiserver-operator-787b8d6458-8ztt6                             1/1     Running     13         17h
[vpickard@rippleRider$][~/bz1892129]$

Comment 10 Victor Pickard 2020-11-30 21:35:06 UTC
I've been looking at this more today. It looks like the root of the problem may be that the user is being set to system:anonymous because the certificate is signed by an unknown authority, as seen in the scheduler logs below.

In the kubelet logs on 2 of the nodes, I see errors caused by the user being system:anonymous, like these:

kubelet.log on 10.0.49.162
==========================

The user is system:anonymous... what should it be?
There are lots of these errors:

Oct 26 09:32:48 ip-10-0-49-162 hyperkube[1504]: E1026 09:32:48.979745    1504 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User "system:anonymous" cannot list resource "services" in API group "" at the cluster scope
Oct 26 09:32:48 ip-10-0-49-162 hyperkube[1504]: I1026 09:32:48.979781    1504 manager.go:987] Added container: "/system.slice/systemd-journal-flush.service" (aliases: [], namespace: "")
Oct 26 09:32:48 ip-10-0-49-162 hyperkube[1504]: E1026 09:32:48.979970    1504 reflector.go:127] k8s.io/kubernetes/pkg/kubelet/kubelet.go:438: Failed to watch *v1.Node: failed to list *v1.Node: nodes "ip-10-0-49-162.us-east-2.compute.internal" is forbidden: User "system:anonymous" cannot list resource "nodes" in API group "" at the cluster scope



kubelet.log on 10.0.55.251
===========================
Oct 26 09:32:49 ip-10-0-55-251 hyperkube[1508]: E1026 09:32:49.754065    1508 kubelet_node_status.go:92] Unable to register node "ip-10-0-55-251.us-east-2.compute.internal" with API server: nodes is forbidden: User "system:anonymous" cannot create resource "nodes" in API group "" at the cluster scope


And from the scheduler log on 10.0.49.162, I see these logs indicating there is a cert issue:

log-bundle-20201027031917/control-plane/10.0.49.162/containers/kube-scheduler-446df2e00c661ee9d4ddba97b7435a03e651b7b1bd665c334c4b5c8ab3435372.log
==================================================================================================================================================


W1026 09:35:15.409139       1 authentication.go:294] Error looking up in-cluster authentication configuration: Get "https://api-int.yunjiang-26dprr.qe.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": x509: certificate signed by unknown authority
W1026 09:35:15.409224       1 authentication.go:295] Continuing without authentication configuration. This may treat all requests as anonymous.


E1026 09:35:15.440876       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.StatefulSet: failed to list *v1.StatefulSet: Get "https://api-int.yunjiang-26dprr.qe.devcluster.openshift.com:6443/apis/apps/v1/statefulsets?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
E1026 09:35:15.445589       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ReplicationController: failed to list *v1.ReplicationController: Get "https://api-int.yunjiang-26dprr.qe.devcluster.openshift.com:6443/api/v1/replicationcontrollers?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
E1026 09:35:16.879068       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://api-int.yunjiang-26dprr.qe.devcluster.openshift.com:6443/api/v1/pods?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
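To narrow down which side of the TLS exchange is at fault, checks along these lines can help; the kubelet client-certificate path is the usual RHCOS location and the API hostname is taken from the logs above, so treat both as assumptions:

# On the affected node: which client certificate is the kubelet presenting, and who signed it?
openssl x509 -noout -subject -issuer -dates -in /var/lib/kubelet/pki/kubelet-client-current.pem

# From the same node: which CA signs the certificate actually served on the internal API endpoint?
echo | openssl s_client -connect api-int.yunjiang-26dprr.qe.devcluster.openshift.com:6443 -showcerts 2>/dev/null \
  | openssl x509 -noout -subject -issuer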

Comment 11 Victor Pickard 2020-11-30 22:40:28 UTC
Yunfei,
Can you please set up another cluster today when you get in, and I will continue debugging tomorrow morning my time.

It would be good to capture the output of:

oc get csr

And also, a must-gather for the system in the failed state.

Thanks in advance!

Comment 13 Yunfei Jiang 2020-12-01 07:59:36 UTC
Victor,
the must-gather command (oc adm must-gather) failed; since too many operators are abnormal, the must-gather pod may not be running.
you may need to check logs on a live cluster.

>> oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-7lkqv   4h33m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-58-116.us-east-2.compute.internal                       Approved,Issued
csr-7rhpq   4h39m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-77-49.us-east-2.compute.internal                        Approved,Issued
csr-8cbx2   4h40m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-77-196.us-east-2.compute.internal                       Approved,Issued
csr-fqmkr   4h33m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-klt89   4h33m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-ks7sw   4h40m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-lprdm   4h40m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-mf97n   4h40m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-nlgpg   4h40m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-56-214.us-east-2.compute.internal                       Approved,Issued
csr-rqdth   4h33m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-57-1.us-east-2.compute.internal                         Approved,Issued
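When the must-gather pod cannot be scheduled like this, a narrower client-side collection sometimes still works; a sketch, limited to the namespaces most relevant to this failure (it still needs a reachable API endpoint):

oc adm inspect ns/openshift-etcd ns/openshift-kube-apiserver ns/openshift-kube-apiserver-operator --dest-dir=./inspect
oc get clusteroperators -o yaml > clusteroperators.yaml
oc get csr -o wide > csr.txt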

Comment 17 Michal Fojtik 2020-12-31 17:58:21 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 18 Yunfei Jiang 2021-01-19 05:29:37 UTC
Hello Stefan,

Any progress on this bug? The issue is still there; let me know if you need further information.

Comment 19 Michal Fojtik 2021-01-19 05:58:29 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 20 Johnny Liu 2021-01-19 14:50:42 UTC
Today I tried twice, and both attempts failed. I am adding the "testblocker" keyword.

Comment 21 Johnny Liu 2021-01-19 15:00:17 UTC
[root@preserve-jialiu-ansible ~]# oc get node
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-49-35.us-east-2.compute.internal    Ready    master   97m   v1.19.0+7070803
ip-10-0-58-213.us-east-2.compute.internal   Ready    master   97m   v1.19.0+7070803
ip-10-0-79-220.us-east-2.compute.internal   Ready    master   98m   v1.19.0+7070803

[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                       False       Unknown       True       97m
cloud-credential                           4.6.9     True        False         False      96m
cluster-autoscaler                         4.6.9     True        False         False      95m
config-operator                            4.6.9     True        False         False      97m
console                                                                                   
csi-snapshot-controller                    4.6.9     True        False         False      97m
dns                                        4.6.9     True        False         False      96m
etcd                                       4.6.9     False       True          True       97m
image-registry                                                                            
ingress                                              False       True          True       96m
insights                                   4.6.9     True        False         True       97m
kube-apiserver                                       False       True          True       97m
kube-controller-manager                              False       True          True       97m
kube-scheduler                             4.6.9     False       True          True       97m
kube-storage-version-migrator              4.6.9     False       False         False      97m
machine-api                                4.6.9     True        False         False      93m
machine-approver                           4.6.9     True        False         False      96m
machine-config                                       False       True          True       86m
marketplace                                4.6.9     True        False         False      96m
monitoring                                           False       True          True       92m
network                                    4.6.9     True        False         False      96m
node-tuning                                4.6.9     True        False         False      97m
openshift-apiserver                        4.6.9     False       False         False      97m
openshift-controller-manager                         False       True          False      97m
openshift-samples                                                                         
operator-lifecycle-manager                 4.6.9     True        False         False      96m
operator-lifecycle-manager-catalog         4.6.9     True        False         False      96m
operator-lifecycle-manager-packageserver             False       True          False      96m
service-ca                                 4.6.9     True        False         False      97m
storage                                    4.6.9     True        False         False      96m

[root@preserve-jialiu-ansible ~]# oc get machine -n openshift-machine-api
NAME                                               PHASE         TYPE        REGION      ZONE         AGE
auto-jialiu-616184-tspvm-master-0                  Running       m5.xlarge   us-east-2   us-east-2a   104m
auto-jialiu-616184-tspvm-master-1                  Running       m5.xlarge   us-east-2   us-east-2b   104m
auto-jialiu-616184-tspvm-master-2                  Running       m5.xlarge   us-east-2   us-east-2a   104m
auto-jialiu-616184-tspvm-worker-us-east-2a-cl9zt   Provisioned   m5.large    us-east-2   us-east-2a   97m
auto-jialiu-616184-tspvm-worker-us-east-2a-t678x   Provisioned   m5.large    us-east-2   us-east-2a   97m
auto-jialiu-616184-tspvm-worker-us-east-2b-r8lrg   Provisioned   m5.large    us-east-2   us-east-2b   97m

The workers are provisioned but not running because "Daemonset machine-config-server is not ready", which means the workers cannot fetch the worker ignition file from https://api-int.auto-jialiu-616184.qe.devcluster.openshift.com:22623/config/worker to boot the system.

[root@preserve-jialiu-ansible ~]# oc get po -n openshift-machine-config-operator
NAME                                         READY   STATUS    RESTARTS   AGE
machine-config-controller-5548b9c88f-tzjvd   1/1     Running   5          98m
machine-config-daemon-f9h6f                  2/2     Running   0          99m
machine-config-daemon-q556r                  2/2     Running   0          99m
machine-config-daemon-ww5mk                  2/2     Running   0          99m
machine-config-operator-7677b5fd8b-mp5g2     1/1     Running   5          106m
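A quick way to confirm that the workers are stuck fetching ignition is to request the worker config directly from the machine-config-server endpoint from a host inside the VPC; the Accept header below is the ignition v3 media type used by 4.6-era installs and is an assumption here:

curl -k -o /dev/null -w '%{http_code}\n' \
  -H 'Accept: application/vnd.coreos.ignition+json;version=3.1.0' \
  https://api-int.auto-jialiu-616184.qe.devcluster.openshift.com:22623/config/worker
# with the machine-config-server daemonset not ready, this should fail to connect or time out
# rather than return 200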


It seems like this cluster is totally broken once the apiserver and etcd operators get into a degraded state.

[root@preserve-jialiu-ansible ~]# oc describe co machine-config
Name:         machine-config
Namespace:    
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-01-19T13:06:55Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
      f:spec:
      f:status:
        .:
        f:versions:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-01-19T13:06:55Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
          .:
          f:master:
          f:worker:
        f:relatedObjects:
    Manager:         machine-config-operator
    Operation:       Update
    Time:            2021-01-19T14:42:15Z
  Resource Version:  41218
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:               5b940e9e-1e52-4576-9b59-efcd11a3d510
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-01-19T13:13:57Z
    Message:               Working towards 4.6.9
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-01-19T13:25:23Z
    Message:               Unable to apply 4.6.9: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-server is not ready. status: (desired: 0, updated: 0, ready: 0, unavailable: 0)
    Reason:                MachineConfigServerFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-01-19T13:25:23Z
    Message:               Cluster not available for 4.6.9
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-01-19T13:25:23Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
    Master:  all 3 nodes are at latest configuration rendered-master-372fb00019365868033c7aeb39ad30de
    Worker:  all 0 nodes are at latest configuration rendered-worker-ebfa863b048487ff5638da9807018019
  Related Objects:
    Group:     
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  machineconfigs
    Group:     
    Name:      
    Resource:  nodes
Events:        <none>
[root@preserve-jialiu-ansible ~]# oc describe co etcd
Name:         etcd
Namespace:    
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-01-19T13:06:54Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
      f:spec:
      f:status:
        .:
        f:extension:
        f:relatedObjects:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-01-19T13:06:54Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:versions:
    Manager:         cluster-etcd-operator
    Operation:       Update
    Time:            2021-01-19T14:31:17Z
  Resource Version:  37043
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/etcd
  UID:               703578de-988f-499d-8e00-8dab22b1c5cb
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-01-19T13:17:04Z
    Message:               InstallerControllerDegraded: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
NodeInstallerDegraded: 1 nodes are failing on revision 2:
NodeInstallerDegraded: static pod of revision 2 has been installed, but is not ready while new revision 3 is pending
StaticPodsDegraded: pods "etcd-ip-10-0-58-213.us-east-2.compute.internal" not found
StaticPodsDegraded: pods "etcd-ip-10-0-79-220.us-east-2.compute.internal" not found
    Reason:                InstallerController_Error::NodeInstaller_InstallerPodFailed::StaticPods_Error
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-01-19T13:14:57Z
    Message:               NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 4
    Reason:                NodeInstaller
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-01-19T13:14:16Z
    Message:               StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 4
    Reason:                StaticPods_ZeroNodesActive
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-01-19T13:14:18Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  etcds
    Group:     
    Name:      openshift-config
    Resource:  namespaces
    Group:     
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:     
    Name:      openshift-etcd-operator
    Resource:  namespaces
    Group:     
    Name:      openshift-etcd
    Resource:  namespaces
  Versions:
    Name:     raw-internal
    Version:  4.6.9
    Name:     operator
    Version:  4.6.9
    Name:     etcd
    Version:  4.6.9
Events:       <none>


[root@preserve-jialiu-ansible ~]# oc describe co kube-apiserver
Name:         kube-apiserver
Namespace:    
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-01-19T13:06:54Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
      f:spec:
      f:status:
        .:
        f:extension:
        f:relatedObjects:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-01-19T13:06:54Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:versions:
    Manager:         cluster-kube-apiserver-operator
    Operation:       Update
    Time:            2021-01-19T14:51:19Z
  Resource Version:  44335
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/kube-apiserver
  UID:               6dde2415-cc7e-461c-b48b-1da465ab16f8
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-01-19T13:16:19Z
    Message:               StaticPodsDegraded: pods "kube-apiserver-ip-10-0-49-35.us-east-2.compute.internal" not found
StaticPodsDegraded: pods "kube-apiserver-ip-10-0-79-220.us-east-2.compute.internal" not found
StaticPodsDegraded: pods "kube-apiserver-ip-10-0-58-213.us-east-2.compute.internal" not found
InstallerControllerDegraded: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
    Reason:                InstallerController_Error::StaticPods_Error
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-01-19T13:15:16Z
    Message:               NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 2
    Reason:                NodeInstaller
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-01-19T13:14:21Z
    Message:               StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 2
    Reason:                StaticPods_ZeroNodesActive
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-01-19T13:14:20Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:      operator.openshift.io
    Name:       cluster
    Resource:   kubeapiservers
    Group:      apiextensions.k8s.io
    Name:       
    Resource:   customresourcedefinitions
    Group:      security.openshift.io
    Name:       
    Resource:   securitycontextconstraints
    Group:      
    Name:       openshift-config
    Resource:   namespaces
    Group:      
    Name:       openshift-config-managed
    Resource:   namespaces
    Group:      
    Name:       openshift-kube-apiserver-operator
    Resource:   namespaces
    Group:      
    Name:       openshift-kube-apiserver
    Resource:   namespaces
    Group:      admissionregistration.k8s.io
    Name:       
    Resource:   mutatingwebhookconfigurations
    Group:      admissionregistration.k8s.io
    Name:       
    Resource:   validatingwebhookconfigurations
    Group:      controlplane.operator.openshift.io
    Name:       
    Namespace:  openshift-kube-apiserver
    Resource:   podnetworkconnectivitychecks
  Versions:
    Name:     raw-internal
    Version:  4.6.9
Events:       <none>


[root@preserve-jialiu-ansible ~]# oc get po -n openshift-etcd
NAME                                                   READY   STATUS      RESTARTS   AGE
etcd-ip-10-0-49-35.us-east-2.compute.internal          3/3     Running     1          104m
installer-2-ip-10-0-49-35.us-east-2.compute.internal   0/1     Completed   0          104m
[root@preserve-jialiu-ansible ~]# oc get all -n openshift-etcd
NAME                                                       READY   STATUS      RESTARTS   AGE
pod/etcd-ip-10-0-49-35.us-east-2.compute.internal          3/3     Running     1          104m
pod/installer-2-ip-10-0-49-35.us-east-2.compute.internal   0/1     Completed   0          104m

NAME                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/etcd          ClusterIP   172.30.110.11   <none>        2379/TCP,9979/TCP   113m
service/host-etcd-2   ClusterIP   None            <none>        2379/TCP            113m

Comment 23 Stefan Schimanski 2021-01-20 15:51:59 UTC
The original root cause was an unrestricted MutatingAdmissionWebhook, fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1903226. That bug is VERIFIED for 4.7, but not backported to 4.6. Moving this one to cloud team as a reminder to backport https://bugzilla.redhat.com/show_bug.cgi?id=1903226 to 4.6.
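For anyone hitting this on 4.6 before the backport lands, the offending webhook configuration can be spotted by listing the mutating webhooks and checking which one intercepts pod creation cluster-wide with failurePolicy: Fail; <name> below is a placeholder:

oc get mutatingwebhookconfiguration
# inspect each configuration's rules, failurePolicy, and namespaceSelector
oc get mutatingwebhookconfiguration <name> -o yaml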

Comment 24 Nick Stielau 2021-01-20 18:37:03 UTC
Setting to blocker- given stts' comment that this is a backport reminder.

Comment 25 Johnny Liu 2021-01-21 02:29:29 UTC
> That bug is VERIFIED for 4.7, but not backported to 4.6

Yeah, agreed. From my test results, this issue mainly happens on 4.6.

Comment 28 Akhil Rane 2021-02-01 16:28:49 UTC

*** This bug has been marked as a duplicate of bug 1921901 ***

