Description of problem:

Additional context: https://issues.redhat.com/browse/OHSS-334

Customer observed on OSD 4.3.18 an inability to access the web console and states "not one of the OpenShift URLs can be accessed". Review of the state of kube-apiserver shows that at the time of the outage certificates were renewed and the kube-apiserver pods in openshift-kube-apiserver were rollbounced. Note that the kube-apiserver restart behavior is reproduced with any revision change.

Version-Release number of selected component (if applicable):
4.3.18

How reproducible:
Presumably every time certificates are renewed.

Steps to Reproduce:
1. Install OCP 4.3.18.
2. Update something to force a new kube-apiserver revision.
3. Review access to the console etc.

Actual results:
Unable to access the web console and "OpenShift URLs" while kube-apiserver restarts are in progress.

Expected results:
No outage of the web console during certificate renewals.

Additional info:
I'll add a link to must-gather once it's ready.
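For step 2 of the reproducer, a minimal sketch of forcing a new kube-apiserver revision and probing the console route while the rollout happens (assumes cluster-admin access; the forceRedeploymentReason value is arbitrary):

    # Force a new kube-apiserver revision; any spec change works, this field exists for that purpose.
    oc patch kubeapiserver/cluster --type=merge \
      -p '{"spec":{"forceRedeploymentReason":"repro-'"$(date +%s)"'"}}'

    # Watch the static pods roll to the new revision.
    oc get pods -n openshift-kube-apiserver -w

    # Meanwhile, probe the console route to see whether it becomes unreachable.
    curl -skI "https://$(oc get route console -n openshift-console -o jsonpath='{.spec.host}')"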
This bug report lacks essential information:
- must-gather output (promised but not available yet)
- which platform?

Without this information, the bug is not actionable.
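If it helps, the must-gather can be collected with the standard command (the destination directory below is only an example):

    # Collect a must-gather for the cluster and attach the resulting archive to this bug.
    oc adm must-gather --dest-dir=./must-gather-ohss-334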
Closing this after a week if no info arrives.
The customer reports that the issue has evolved: they are now seeing multiple instances daily where the web console is not accessible. It is beginning to impact their business and is sowing doubt about the reliability of the product. What is needed to move this issue forward? The customer has escalated the SFDC ticket stating the above.
So far I have been unable to determine anything that would be causing these issues.

Summary:
- The kube-apiserver and openshift-apiserver operators appear to be behaving normally.
- The kube-apiservers log many "http: TLS handshake error from 10.70.1.154:15170: EOF" errors, but none of these come from the openshift-apiserver.
- The DNS operator has reported Degraded only twice in the previous month and a half, so it is probably not causing the observed error either.
- Of the actions in the audit log `ip-<redacted>-136.us-west-1.compute.internal-audit-2020-07-01T12-14-29.443.log`, 65.42% (60598/92630) were performed by the cluster-logging-operator.

I haven't checked the ingress and SDN logs, and since this issue appears on routes and the apiservers appear to be communicating correctly, I'm moving this to routing.
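A per-user breakdown like the one above can be produced along these lines (a sketch, assuming jq is available; <audit-log> is a placeholder for the audit log file named above):

    # Count audit events per username and list the heaviest clients.
    jq -r '.user.username' <audit-log> | sort | uniq -c | sort -rn | head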
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Target reset to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.
Not seeing hints that the API server is the root cause. All recent comments read like this is an ingress issue. Reassigning.

Note:
- "Depends of Browser, in chrome this is the error, in Explorer only says that the server is not active." – the browser never talks to the API server.
- "server: authentication failed: http: named cookie not present" – this is the console backend, not apiserver auth.
Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.
We're looking at https://issues.redhat.com/browse/NE-172 / https://github.com/openshift/cluster-ingress-operator/pull/472 as a solution to the issue with cloud-ingress-operator. We'll continue tracking the issue in the upcoming sprint.
https://github.com/openshift/cluster-ingress-operator/pull/472 has been merged and verified in 4.7 (tracked as bug 1891625). https://github.com/openshift/cluster-ingress-operator/pull/482 backports this change to 4.6 (tracked as bug 1891626) and is awaiting cherry-pick approval. These changes to cluster-ingress-operator add the capability to change the scope of an IngressController's load balancer without deleting and recreating that IngressController.

Following up on the changes to cluster-ingress-operator, https://github.com/openshift/cloud-ingress-operator/pull/118 changes cloud-ingress-operator not to delete and recreate the IngressController. As I understand it, https://github.com/openshift/cloud-ingress-operator/pull/118 is blocked on getting https://github.com/openshift/cluster-ingress-operator/pull/482 merged, verified, and deployed, and should ultimately resolve the issue in this Bugzilla report.

@drow, can you confirm that my understanding is correct?
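To illustrate what the new cluster-ingress-operator capability enables: once these changes land, a scope change like the following sketch (field names per the operator.openshift.io/v1 IngressController API) can be applied in place rather than by deleting and recreating the IngressController:

    # Flip the default IngressController's load balancer from External to Internal scope without recreating it.
    oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge \
      -p '{"spec":{"endpointPublishingStrategy":{"type":"LoadBalancerService","loadBalancer":{"scope":"Internal"}}}}'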
https://github.com/openshift/cluster-ingress-operator/pull/482 was merged but had to be reverted because it introduced a new issue. We have filed a new Jira issue for re-adding the capability: https://issues.redhat.com/browse/NE-623
<https://github.com/openshift/cluster-ingress-operator/pull/582> has merged, re-adding the needed cluster-ingress-operator functionality as mentioned in comment 32. Once OpenShift 4.10 ships, <https://github.com/openshift/cloud-ingress-operator/pull/118> can be re-opened (or an equivalent PR opened) to complete the work required to close this BZ.
This issue is stale and has been closed because it has had no activity for a significant amount of time and is reported against a version that is no longer in maintenance. If this issue should not be closed, please verify that the condition still exists on a supported release and submit an updated bug.
Dustin Row informs me that cloud-ingress-operator was updated with <https://github.com/openshift/cloud-ingress-operator/pull/241> as part of <https://issues.redhat.com/browse/OSD-9580> to take advantage of the functionality that was re-introduced in <https://github.com/openshift/cluster-ingress-operator/pull/582>, and so this issue is resolved.