test: operator consistently failing for release-openshift-ocp-installer-e2e-aws-4.5 with IngressStateEndpointsDegraded: No endpoints found for oauth-server

Error:

level=error msg="Cluster operator authentication Degraded is True with ConfigObservation_Error::IngressStateEndpoints_MissingEndpoints::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nConfigObservationDegraded: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No endpoints found for oauth-server"

Test links:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1975
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1976
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1977
Marking this as Target release: 4.5.
*** Bug 1836082 has been marked as a duplicate of this bug. ***
In all three CI runs cited in comment 0, the ingress operator is failing due to missing security context constraints:

    {
      "lastTransitionTime": "2020-05-15T15:35:38Z",
      "lastUpdateTime": "2020-05-15T15:35:38Z",
      "message": "pods \"ingress-operator-6c95d9c778-\" is forbidden: unable to validate against any security context constraint: []",
      "reason": "FailedCreate",
      "status": "True",
      "type": "ReplicaFailure"
    },

The reason is probably that the API never came up; its clusteroperator reports that it is crashlooping:

      "name": "kube-apiserver",
      "resourceVersion": "18694",
      "selfLink": "/apis/config.openshift.io/v1/clusteroperators/kube-apiserver",
      "uid": "292a6842-f489-451c-8fcf-7f9afa425e4d"
    },
    "spec": {},
    "status": {
      "conditions": [
        {
          "lastTransitionTime": "2020-05-15T15:43:36Z",
          "message": "StaticPodsDegraded: pod/kube-apiserver-ip-10-0-136-119.ec2.internal container \"kube-apiserver\" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-ip-10-0-136-119.ec2.internal_openshift-kube-apiserver(7213f58333167c460bb82863d7a06142)\nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-136-119.ec2.internal container \"kube-apiserver\" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-ip-10-0-136-119.ec2.internal_openshift-kube-apiserver(7213f58333167c460bb82863d7a06142)\nStaticPodsDegraded: pods \"kube-apiserver-ip-10-0-144-85.ec2.internal\" not found\nStaticPodsDegraded: pods \"kube-apiserver-ip-10-0-133-79.ec2.internal\" not found",
          "reason": "StaticPods_Error",
          "status": "True",
          "type": "Degraded"
        },
        {
          "lastTransitionTime": "2020-05-15T15:41:34Z",
          "message": "NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 2",
          "reason": "NodeInstaller",
          "status": "True",
          "type": "Progressing"
        },
        {
          "lastTransitionTime": "2020-05-15T15:41:27Z",
          "message": "StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 2",
          "reason": "StaticPods_ZeroNodesActive",
          "status": "False",
          "type": "Available"
        },

The above is from the first referenced CI run, and the other two are close to identical:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1975/artifacts/e2e-aws/clusteroperators.json
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1976/artifacts/e2e-aws/clusteroperators.json
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1977/artifacts/e2e-aws/clusteroperators.json

I am re-assigning this Bugzilla report to the API team to investigate why the API is crashlooping.
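For anyone triaging similar runs, here is a minimal sketch (in Go, assuming the clusteroperators.json artifact from one of the links above has been downloaded locally and has the usual ClusterOperatorList shape) that prints each operator's bad conditions, so the unhealthy operator can be spotted without reading the raw JSON by hand:

```go
// scan_clusteroperators.go - minimal sketch; assumes ./clusteroperators.json
// was downloaded from one of the CI artifact links above and is a
// ClusterOperatorList (items[].metadata.name, items[].status.conditions).
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type operatorList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Status struct {
			Conditions []struct {
				Type    string `json:"type"`
				Status  string `json:"status"`
				Reason  string `json:"reason"`
				Message string `json:"message"`
			} `json:"conditions"`
		} `json:"status"`
	} `json:"items"`
}

func main() {
	raw, err := os.ReadFile("clusteroperators.json")
	if err != nil {
		panic(err)
	}
	var list operatorList
	if err := json.Unmarshal(raw, &list); err != nil {
		panic(err)
	}
	for _, co := range list.Items {
		for _, c := range co.Status.Conditions {
			// Flag operators that are Degraded or not Available.
			if (c.Type == "Degraded" && c.Status == "True") ||
				(c.Type == "Available" && c.Status == "False") {
				fmt.Printf("%s: %s=%s (%s): %s\n",
					co.Metadata.Name, c.Type, c.Status, c.Reason, c.Message)
			}
		}
	}
}
```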
There are no SCCs because there is no kube-apiserver, and that's because etcd is in CrashLoopBackOff (see the pod status from the first test link):

```
- containerID: cri-o://c1f3d334f8d342a93f12a9a7209da11e034847cde8036edc675abd79402aa051
  lastState:
    terminated:
      containerID: cri-o://c1f3d334f8d342a93f12a9a7209da11e034847cde8036edc675abd79402aa051
      exitCode: 1
      finishedAt: "2020-05-15T16:17:59Z"
      message: |-
        {"level":"warn","ts":"2020-05-15T16:17:44.914Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2c5850cc-d47b-4fa2-9cea-c4b3eed452ba/10.0.136.119:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.144.85:2379: connect: connection refused\""}
        Error: context deadline exceeded
        context deadline exceeded
      reason: Error
      startedAt: "2020-05-15T16:17:39Z"
  name: etcd
  ready: false
  restartCount: 11
  started: false
  state:
    waiting:
      message: back-off 5m0s restarting failed container=etcd pod=etcd-ip-10-0-136-119.ec2.internal_openshift-etcd(643ee1a35adfb0ebadf5cac91404caf1)
      reason: CrashLoopBackOff
```

I assume this is most likely due to an sdn/node/installer error; moving to Node for now.
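Since the etcd member on 10.0.136.119 is failing with "connect: connection refused" against 10.0.144.85:2379, one quick way to narrow down which members are even reachable is a plain TCP probe of the client port on each master. A minimal diagnostic sketch (Go; the IP list is an assumption taken from the logs in this CI run, and this is only a reachability check, not how the installer or etcd operator assesses health):

```go
// etcd_probe.go - hedged diagnostic sketch: TCP-probe the etcd client port
// on each master IP seen in the logs above to see which members refuse
// connections. The IP list is an assumption taken from this CI run.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	masters := []string{"10.0.136.119", "10.0.144.85", "10.0.133.79"}
	for _, ip := range masters {
		addr := net.JoinHostPort(ip, "2379")
		conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
		if err != nil {
			// e.g. "connect: connection refused", matching the etcd log above.
			fmt.Printf("%s: unreachable: %v\n", addr, err)
			continue
		}
		conn.Close()
		fmt.Printf("%s: TCP port open\n", addr)
	}
}
```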
I saw in the above cluster that some pods cannot start, due to:

kubelet, qe-test7-7wdgj-master-2 (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_cluster-monitoring-operator-558978f45b-f46qg_openshift-monitoring_a43f6336-2885-4039-a9bf-ddfdb4482075_0(b5572b9ee54b40e1f6204d1a5235a23622ebc0d1d9a6820c84a7655140a66a5a): Multus: [openshift-monitoring/cluster-monitoring-operator-558978f45b-f46qg]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking confAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-monitoring/cluster-monitoring-operator-558978f45b-f46qg] failed to get pod annotation: timed out waiting for the condition'

This seems to be the same error as https://bugzilla.redhat.com/show_bug.cgi?id=1801089
Yes, and it is also accompanied by the error stated in this bug, as observed in the install logs:

ConfigObservation_Error::IngressStateEndpoints_MissingEndpoints::RouteStatus_FailedHost::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret "v4-0-config-system-router-certs" not found
zzhao, I think it is a mistake to divert this to an ovn problem. You should start a new bug for the ovn problem and keep this one as sdn. Just my opinion.
(In reply to Phil Cameron from comment #10)
> zzhao zzhao I think it is a mistake to divert this to an ovn problem. You
> should start a new bug for the ovn problem and keep this as sdn. Just my
> opinion.

@Pcameron, sorry for any confusion. This issue is independent of networkType, so it apparently is not a networking issue. The "CNI request failed with status 400" error was accompanied by the error mentioned in this bug's title, as clarified in comment #10. The OVN pods failing to come up with "CNI request failed" is most likely a consequence of the cluster first hitting the "No endpoints found for oauth-server" issue, which left the cluster unstable after that.
Cluster authentication operator has a crash:

```
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: I0515 16:19:09.981188 1450 status_manager.go:435] Ignoring same status for pod "authentication-operator-5fcd45b79d-bw855_openshift-authentication-operator(8545ce48-181a-4684-b26d-46bca7b7bd02)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-15 15:41:15 +0000 UTC Reason: Message:}
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: panic(0x1d2c6c0, 0xc0011e94b0)
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: monis.app/go/openshift/controller.(*controller).Run(0xc000329570, 0x1, 0xc0003349c0)
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /go/src/github.com/openshift/cluster-authentication-operator/vendor/monis.app/go/openshift/controller/controller.go:62 +0x46a
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: monis.app/go/openshift/operator.(*operator).Run(0xc0005be5c0, 0xc0003349c0)
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /go/src/github.com/openshift/cluster-authentication-operator/vendor/monis.app/go/openshift/operator/operator.go:33 +0x8f
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: created by github.com/openshift/cluster-authentication-operator/pkg/operator2.RunOperator
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /go/src/github.com/openshift/cluster-authentication-operator/pkg/operator2/starter.go:261 +0x1804
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: panic: (controller.die) (0x1d2c6c0,0xc0011e94b0) [recovered]
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: panic: AuthenticationOperator2: timed out waiting for caches to sync
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: goroutine 405 [running]:
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: monis.app/go/openshift/controller.crash(0x1d2c6c0, 0xc0011e94b0)
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /go/src/github.com/openshift/cluster-authentication-operator/vendor/monis.app/go/openshift/controller/die.go:8 +0x7a
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc000cdaf30, 0x1, 0x1)
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /go/src/github.com/openshift/cluster-authentication-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51 +0xc7
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: panic(0x1d2c6c0, 0xc0011e94b0)
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: monis.app/go/openshift/controller.(*controller).Run(0xc000329570, 0x1, 0xc0003349c0)
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /go/src/github.com/openshift/cluster-authentication-operator/vendor/monis.app/go/openshift/controller/controller.go:62 +0x46a
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: monis.app/go/openshift/operator.(*operator).Run(0xc0005be5c0, 0xc0003349c0)
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /go/src/github.com/openshift/cluster-authentication-operator/vendor/monis.app/go/openshift/operator/operator.go:33 +0x8f
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]: created by github.com/openshift/cluster-authentication-operator/pkg/operator2.RunOperator
May 15 16:19:09.981474 ip-10-0-144-85 hyperkube[1450]:   /go/src/github.com/openshift/cluster-authentication-operator/pkg/operator2/starter.go:261 +0x1804
```
The crash is only a symptom of the cluster being broken, not of the operator being broken: `panic: AuthenticationOperator2: timed out waiting for caches to sync`. This means the informer caches failed to sync within 10 minutes of the operator starting; there is nothing the operator can do at that point but die.
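For context, this is the standard client-go pattern that the panic message points at: an operator starts its shared informers and blocks on cache sync with a deadline, and if the API server never serves the list/watch requests, the wait fails and exiting is about the only option left. A minimal sketch of that pattern follows (this is not the actual cluster-authentication-operator code; the 10-minute timeout and the openshift-authentication namespace are illustrative assumptions):

```go
// cachesync.go - minimal sketch of the "wait for informer caches" pattern
// behind the panic above. Not the real operator code; the 10-minute timeout
// and the openshift-authentication namespace are assumptions.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 10*time.Minute,
		informers.WithNamespace("openshift-authentication"))
	secretsSynced := factory.Core().V1().Secrets().Informer().HasSynced

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)

	// Give the caches a bounded time to sync; if the API server never comes
	// up (as in this CI run), this is exactly what times out.
	timeout := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Minute)
		close(timeout)
	}()
	if !cache.WaitForCacheSync(timeout, secretsSynced) {
		// The real operator panics at this point; the effect is the same.
		panic("timed out waiting for caches to sync")
	}
	fmt.Println("caches synced; the operator's sync loop could start now")
}
```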
> This means the informer caches failed to sync within 10 minutes from when the operator was started, nothing the operator can do but to die. Can it set a Degraded message that includes "failing to connect to the Kubernetes API" or some such instead of the current "No endpoints found for oauth-server"? As it stands, it's not clear to me how someone reading the auth operator's Degraded message would know that the auth operator was blaming the kube API (or ingress or... not sure why you picked the Node component...).
> set a Degraded message that includes "failing to connect to the Kubernetes API"

That is a contradiction: if we are seeing this panic in the operator, the operator can't do anything, because it never actually gets to start its sync loop. The explanation of why I moved this to Node is in comment 5.
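For completeness, outside of this crash scenario, reporting a more specific Degraded reason on the clusteroperator would look roughly like the sketch below against the config.openshift.io API (illustrative only; the Reason/Message strings are made up, and in the case discussed here the operator dies before it could run anything like this):

```go
// degraded_condition.go - hedged sketch of setting a Degraded condition on
// the "authentication" clusteroperator. Illustrative only: Reason/Message
// are invented for this example, and a crashing operator never gets here.
package main

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := configclient.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	co, err := client.ConfigV1().ClusterOperators().Get(ctx, "authentication", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	degraded := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorDegraded,
		Status:             configv1.ConditionTrue,
		Reason:             "KubeAPIUnreachable", // hypothetical reason string
		Message:            "failing to connect to the Kubernetes API; informer caches never synced",
		LastTransitionTime: metav1.Now(),
	}

	// Replace the existing Degraded condition, or append one if missing.
	replaced := false
	for i := range co.Status.Conditions {
		if co.Status.Conditions[i].Type == configv1.OperatorDegraded {
			co.Status.Conditions[i] = degraded
			replaced = true
			break
		}
	}
	if !replaced {
		co.Status.Conditions = append(co.Status.Conditions, degraded)
	}

	if _, err := client.ConfigV1().ClusterOperators().UpdateStatus(ctx, co, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
}
```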
Looks like a dupe of 1801089 *** This bug has been marked as a duplicate of bug 1801089 ***
Ryan, I am still not sure what the reason behind this error is, since things start to go bad after it appears. In what situation is this error expected to be logged?

ConfigObservation_Error::IngressStateEndpoints_MissingEndpoints::RouteStatus_FailedHost::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret "v4-0-config-system-router-certs" not found
ConfigObservationDegraded: secret "v4-0-config-system-router-certs" not found

But on the other bug, 1801089, we don't encounter the missing-endpoint issue above at all, only the CNI 400 issue. So I am still not fully convinced that these two issues are the same. Thanks
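As far as I can tell, that Degraded message is the shape of error an operator reports when its sync loop looks up the router-certs secret and the object simply is not there yet (for example because ingress never published it). A minimal sketch of that kind of check follows, using client-go; this is an illustration of when such a message would be generated, not the cluster-authentication-operator's actual code, and the condition handling is simplified to a printout:

```go
// routercerts_check.go - hedged illustration of the check behind
// "RouterCertsDegraded: secret/v4-0-config-system-router-certs ... not found".
// Not the cluster-authentication-operator's real implementation.
package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	const ns, name = "openshift-authentication", "v4-0-config-system-router-certs"
	_, err = client.CoreV1().Secrets(ns).Get(context.TODO(), name, metav1.GetOptions{})
	switch {
	case apierrors.IsNotFound(err):
		// This is the situation that yields a RouterCertsDegraded-style message:
		// the secret the operator depends on has not been published yet.
		fmt.Printf("RouterCertsDegraded: secret/%s -n %s: could not be retrieved: %v\n", name, ns, err)
	case err != nil:
		fmt.Printf("error reading secret: %v\n", err)
	default:
		fmt.Println("router certs secret present")
	}
}
```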