test: operator consistently failing for release-openshift-ocp-installer-e2e-aws-4.5 with IngressStateEndpointsDegraded: No endpoints found for oauth-server

Error:

level=error msg="Cluster operator authentication Degraded is True with ConfigObservation_Error::IngressStateEndpoints_MissingEndpoints::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nConfigObservationDegraded: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No endpoints found for oauth-server"

Test links:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1975
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1976
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1977
Marking this as Target release: 4.5.
*** Bug 1836082 has been marked as a duplicate of this bug. ***
In all three CI runs cited in comment 0, the ingress operator is failing due to missing security context constraints:

    {
      "lastTransitionTime": "2020-05-15T15:35:38Z",
      "lastUpdateTime": "2020-05-15T15:35:38Z",
      "message": "pods \"ingress-operator-6c95d9c778-\" is forbidden: unable to validate against any security context constraint: []",
      "reason": "FailedCreate",
      "status": "True",
      "type": "ReplicaFailure"
    },

The reason is probably that the API never came up; its clusteroperator reports that it is crashlooping:

      "name": "kube-apiserver",
      "resourceVersion": "18694",
      "selfLink": "/apis/config.openshift.io/v1/clusteroperators/kube-apiserver",
      "uid": "292a6842-f489-451c-8fcf-7f9afa425e4d"
    },
    "spec": {},
    "status": {
      "conditions": [
        {
          "lastTransitionTime": "2020-05-15T15:43:36Z",
          "message": "StaticPodsDegraded: pod/kube-apiserver-ip-10-0-136-119.ec2.internal container \"kube-apiserver\" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-ip-10-0-136-119.ec2.internal_openshift-kube-apiserver(7213f58333167c460bb82863d7a06142)\nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-136-119.ec2.internal container \"kube-apiserver\" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-ip-10-0-136-119.ec2.internal_openshift-kube-apiserver(7213f58333167c460bb82863d7a06142)\nStaticPodsDegraded: pods \"kube-apiserver-ip-10-0-144-85.ec2.internal\" not found\nStaticPodsDegraded: pods \"kube-apiserver-ip-10-0-133-79.ec2.internal\" not found",
          "reason": "StaticPods_Error",
          "status": "True",
          "type": "Degraded"
        },
        {
          "lastTransitionTime": "2020-05-15T15:41:34Z",
          "message": "NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 2",
          "reason": "NodeInstaller",
          "status": "True",
          "type": "Progressing"
        },
        {
          "lastTransitionTime": "2020-05-15T15:41:27Z",
          "message": "StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 2",
          "reason": "StaticPods_ZeroNodesActive",
          "status": "False",
          "type": "Available"
        },

The above is from the first referenced CI run, and the other two are close to identical:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1975/artifacts/e2e-aws/clusteroperators.json
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1976/artifacts/e2e-aws/clusteroperators.json
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1977/artifacts/e2e-aws/clusteroperators.json

I am re-assigning this Bugzilla report to the API team to investigate why the API is crashlooping.
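For anyone triaging similar runs, here is a minimal sketch (in Go, assuming the clusteroperators.json artifact from one of the links above has been downloaded locally and has the usual ClusterOperatorList shape) that prints each operator's bad conditions, so the unhealthy operator can be spotted without reading the raw JSON by hand:

```go
// scan_clusteroperators.go - minimal sketch; assumes ./clusteroperators.json
// was downloaded from one of the CI artifact links above and is a
// ClusterOperatorList (items[].metadata.name, items[].status.conditions).
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type operatorList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Status struct {
			Conditions []struct {
				Type    string `json:"type"`
				Status  string `json:"status"`
				Reason  string `json:"reason"`
				Message string `json:"message"`
			} `json:"conditions"`
		} `json:"status"`
	} `json:"items"`
}

func main() {
	raw, err := os.ReadFile("clusteroperators.json")
	if err != nil {
		panic(err)
	}
	var list operatorList
	if err := json.Unmarshal(raw, &list); err != nil {
		panic(err)
	}
	for _, co := range list.Items {
		for _, c := range co.Status.Conditions {
			// Flag operators that are Degraded or not Available.
			if (c.Type == "Degraded" && c.Status == "True") ||
				(c.Type == "Available" && c.Status == "False") {
				fmt.Printf("%s: %s=%s (%s): %s\n",
					co.Metadata.Name, c.Type, c.Status, c.Reason, c.Message)
			}
		}
	}
}
```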
There are no SCCs because there is no kube-apiserver, and that's because etcd is in CrashLoopBackOff (see the pod status from the first test link):

```
- containerID: cri-o://c1f3d334f8d342a93f12a9a7209da11e034847cde8036edc675abd79402aa051
  lastState:
    terminated:
      containerID: cri-o://c1f3d334f8d342a93f12a9a7209da11e034847cde8036edc675abd79402aa051
      exitCode: 1
      finishedAt: "2020-05-15T16:17:59Z"
      message: |-
        {"level":"warn","ts":"2020-05-15T16:17:44.914Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2c5850cc-d47b-4fa2-9cea-c4b3eed452ba/10.0.136.119:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.144.85:2379: connect: connection refused\""}
        Error: context deadline exceeded
        context deadline exceeded
      reason: Error
      startedAt: "2020-05-15T16:17:39Z"
  name: etcd
  ready: false
  restartCount: 11
  started: false
  state:
    waiting:
      message: back-off 5m0s restarting failed container=etcd pod=etcd-ip-10-0-136-119.ec2.internal_openshift-etcd(643ee1a35adfb0ebadf5cac91404caf1)
      reason: CrashLoopBackOff
```

I assume this is most likely due to an sdn/node/installer error; moving to Node for now.
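Since the etcd member on 10.0.136.119 is failing with "connect: connection refused" against 10.0.144.85:2379, one quick way to narrow down which members are even reachable is a plain TCP probe of the client port on each master. A minimal diagnostic sketch (Go; the IP list is an assumption taken from the logs in this CI run, and this is only a reachability check, not how the installer or etcd operator assesses health):

```go
// etcd_probe.go - hedged diagnostic sketch: TCP-probe the etcd client port
// on each master IP seen in the logs above to see which members refuse
// connections. The IP list is an assumption taken from this CI run.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	masters := []string{"10.0.136.119", "10.0.144.85", "10.0.133.79"}
	for _, ip := range masters {
		addr := net.JoinHostPort(ip, "2379")
		conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
		if err != nil {
			// e.g. "connect: connection refused", matching the etcd log above.
			fmt.Printf("%s: unreachable: %v\n", addr, err)
			continue
		}
		conn.Close()
		fmt.Printf("%s: TCP port open\n", addr)
	}
}
```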
I saw in the above cluster that some pods cannot start, due to:

kubelet, qe-test7-7wdgj-master-2 (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_cluster-monitoring-operator-558978f45b-f46qg_openshift-monitoring_a43f6336-2885-4039-a9bf-ddfdb4482075_0(b5572b9ee54b40e1f6204d1a5235a23622ebc0d1d9a6820c84a7655140a66a5a): Multus: [openshift-monitoring/cluster-monitoring-operator-558978f45b-f46qg]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking confAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-monitoring/cluster-monitoring-operator-558978f45b-f46qg] failed to get pod annotation: timed out waiting for the condition'

This seems to be the same error as https://bugzilla.redhat.com/show_bug.cgi?id=1801089
Yes, and it is also accompanied by the error stated in this bug, as observed in the install logs:

ConfigObservation_Error::IngressStateEndpoints_MissingEndpoints::RouteStatus_FailedHost::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret "v4-0-config-system-router-certs" not found
zzhao, I think it is a mistake to divert this to an ovn problem. You should start a new bug for the ovn problem and keep this one as sdn. Just my opinion.
(In reply to Phil Cameron from comment #10)
> zzhao zzhao I think it is a mistake to divert this to an ovn problem. You
> should start a new bug for the ovn problem and keep this as sdn. Just my
> opinion.

@Pcameron, sorry for any confusion. This issue is independent of networkType, so it apparently is not a networking issue. The "CNI request failed with status 400" error was accompanied by the error mentioned in this bug's title, as clarified in comment #10. The OVN pods failing to come up with "CNI request failed" is most likely a consequence of the cluster first hitting the "No endpoints found for oauth-server" issue, which left the cluster unstable after that.
Cluster authentication operator has a crash:

```
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: I0515 16:19:09.981188 1450 status_manager.go:435] Ignoring same status for pod "authentication-operator-5fcd45b79d-bw855_openshift-authentication-operator(8545ce48-181a-4684-b26d-46bca7b7bd02)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-15 15:41:15 +0000 UTC Reason: Message:}
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: panic(0x1d2c6c0, 0xc0011e94b0)
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: monis.app/go/openshift/controller.(*controller).Run(0xc000329570, 0x1, 0xc0003349c0)
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /go/src/github.com/openshift/cluster-authentication-operator/vendor/monis.app/go/openshift/controller/controller.go:62 +0x46a
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: monis.app/go/openshift/operator.(*operator).Run(0xc0005be5c0, 0xc0003349c0)
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /go/src/github.com/openshift/cluster-authentication-operator/vendor/monis.app/go/openshift/operator/operator.go:33 +0x8f
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: created by github.com/openshift/cluster-authentication-operator/pkg/operator2.RunOperator
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /go/src/github.com/openshift/cluster-authentication-operator/pkg/operator2/starter.go:261 +0x1804
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: panic: (controller.die) (0x1d2c6c0,0xc0011e94b0) [recovered]
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: panic: AuthenticationOperator2: timed out waiting for caches to sync
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: goroutine 405 [running]:
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: monis.app/go/openshift/controller.crash(0x1d2c6c0, 0xc0011e94b0)
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /go/src/github.com/openshift/cluster-authentication-operator/vendor/monis.app/go/openshift/controller/die.go:8 +0x7a
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc000cdaf30, 0x1, 0x1)
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /go/src/github.com/openshift/cluster-authentication-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51 +0xc7
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: panic(0x1d2c6c0, 0xc0011e94b0)
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: monis.app/go/openshift/controller.(*controller).Run(0xc000329570, 0x1, 0xc0003349c0)
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /go/src/github.com/openshift/cluster-authentication-operator/vendor/monis.app/go/openshift/controller/controller.go:62 +0x46a
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: monis.app/go/openshift/operator.(*operator).Run(0xc0005be5c0, 0xc0003349c0)
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /go/src/github.com/openshift/cluster-authentication-operator/vendor/monis.app/go/openshift/operator/operator.go:33 +0x8f
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: created by github.com/openshift/cluster-authentication-operator/pkg/operator2.RunOperator
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /go/src/github.com/openshift/cluster-authentication-operator/pkg/operator2/starter.go:261 +0x1804
```
The crash is only a symptom of the cluster being broken, not of the operator being broken: `panic: AuthenticationOperator2: timed out waiting for caches to sync`. This means the informer caches failed to sync within 10 minutes of the operator starting; there is nothing the operator can do at that point but die.
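For context, this is the standard client-go pattern that the panic message points at: an operator starts its shared informers and blocks on cache sync with a deadline, and if the API server never serves the list/watch requests, the wait fails and exiting is about the only option left. A minimal sketch of that pattern follows (this is not the actual cluster-authentication-operator code; the 10-minute timeout and the openshift-authentication namespace are illustrative assumptions):

```go
// cachesync.go - minimal sketch of the "wait for informer caches" pattern
// behind the panic above. Not the real operator code; the 10-minute timeout
// and the openshift-authentication namespace are assumptions.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 10*time.Minute,
		informers.WithNamespace("openshift-authentication"))
	secretsSynced := factory.Core().V1().Secrets().Informer().HasSynced

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)

	// Give the caches a bounded time to sync; if the API server never comes
	// up (as in this CI run), this is exactly what times out.
	timeout := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Minute)
		close(timeout)
	}()
	if !cache.WaitForCacheSync(timeout, secretsSynced) {
		// The real operator panics at this point; the effect is the same.
		panic("timed out waiting for caches to sync")
	}
	fmt.Println("caches synced; the operator's sync loop could start now")
}
```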
> This means the informer caches failed to sync within 10 minutes from when the operator was started, nothing the operator can do but to die. Can it set a Degraded message that includes "failing to connect to the Kubernetes API" or some such instead of the current "No endpoints found for oauth-server"? As it stands, it's not clear to me how someone reading the auth operator's Degraded message would know that the auth operator was blaming the kube API (or ingress or... not sure why you picked the Node component...).
> set a Degraded message that includes "failing to connect to the Kubernetes API"

That is a contradiction: if we are seeing this panic in the operator, the operator can't do anything, because it never actually gets to start its sync loop. The explanation of why I moved this to Node is in comment 5.
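For completeness, outside of this crash scenario, reporting a more specific Degraded reason on the clusteroperator would look roughly like the sketch below against the config.openshift.io API (illustrative only; the Reason/Message strings are made up, and in the case discussed here the operator dies before it could run anything like this):

```go
// degraded_condition.go - hedged sketch of setting a Degraded condition on
// the "authentication" clusteroperator. Illustrative only: Reason/Message
// are invented for this example, and a crashing operator never gets here.
package main

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := configclient.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	co, err := client.ConfigV1().ClusterOperators().Get(ctx, "authentication", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	degraded := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorDegraded,
		Status:             configv1.ConditionTrue,
		Reason:             "KubeAPIUnreachable", // hypothetical reason string
		Message:            "failing to connect to the Kubernetes API; informer caches never synced",
		LastTransitionTime: metav1.Now(),
	}

	// Replace the existing Degraded condition, or append one if missing.
	replaced := false
	for i := range co.Status.Conditions {
		if co.Status.Conditions[i].Type == configv1.OperatorDegraded {
			co.Status.Conditions[i] = degraded
			replaced = true
			break
		}
	}
	if !replaced {
		co.Status.Conditions = append(co.Status.Conditions, degraded)
	}

	if _, err := client.ConfigV1().ClusterOperators().UpdateStatus(ctx, co, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
}
```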
Looks like a dupe of 1801089 *** This bug has been marked as a duplicate of bug 1801089 ***
Ryan, I am still not sure what the reason behind this error is, since things start to go bad after it appears. In what situation is this error expected to be logged?

ConfigObservation_Error::IngressStateEndpoints_MissingEndpoints::RouteStatus_FailedHost::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret "v4-0-config-system-router-certs" not found
ConfigObservationDegraded: secret "v4-0-config-system-router-certs" not found

But on the other bug, 1801089, we don't encounter the missing-endpoint issue above at all, only the CNI 400 issue. So I am still not fully convinced that these two issues are the same. Thanks
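As far as I can tell, that Degraded message is the shape of error an operator reports when its sync loop looks up the router-certs secret and the object simply is not there yet (for example because ingress never published it). A minimal sketch of that kind of check follows, using client-go; this is an illustration of when such a message would be generated, not the cluster-authentication-operator's actual code, and the condition handling is simplified to a printout:

```go
// routercerts_check.go - hedged illustration of the check behind
// "RouterCertsDegraded: secret/v4-0-config-system-router-certs ... not found".
// Not the cluster-authentication-operator's real implementation.
package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	const ns, name = "openshift-authentication", "v4-0-config-system-router-certs"
	_, err = client.CoreV1().Secrets(ns).Get(context.TODO(), name, metav1.GetOptions{})
	switch {
	case apierrors.IsNotFound(err):
		// This is the situation that yields a RouterCertsDegraded-style message:
		// the secret the operator depends on has not been published yet.
		fmt.Printf("RouterCertsDegraded: secret/%s -n %s: could not be retrieved: %v\n", name, ns, err)
	case err != nil:
		fmt.Printf("error reading secret: %v\n", err)
	default:
		fmt.Println("router certs secret present")
	}
}
```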