Bug 1847693 - OSD ingress unstable – was: kube-apiserver restart on cert renewal impacts access to all OCP URLs
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-06-16 19:51 UTC by Naveen Malik
Modified: 2024-10-01 16:39 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1879140 (view as bug list)
Environment:
Last Closed: 2022-11-04 15:01:56 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cloud-ingress-operator pull 241 | 0 | None | Merged | Remove deletion of IngressController to change apps domain scope in 4.10+ | 2022-11-04 19:57:10 UTC
Github | openshift cluster-ingress-operator pull 472 | 0 | None | closed | Bug 1891625: Support changing ingresscontroller load balancer scope | 2021-02-15 04:15:40 UTC
Github | openshift cluster-ingress-operator pull 582 | 0 | None | Merged | NE-621: Support changing ingresscontroller load balancer scope | 2022-11-04 19:57:10 UTC

Description Naveen Malik 2020-06-16 19:51:26 UTC
Description of problem:
Additional context: https://issues.redhat.com/browse/OHSS-334

On OSD 4.3.18, the customer was unable to access the web console and states that "not one of the OpenShift URLs can be accessed".

A review of the kube-apiserver state shows that, at the time of the outage, certificates were renewed and the kube-apiserver pods in openshift-kube-apiserver were restarted in a rolling fashion.

Note that the kube-apiserver restart behavior is reproduced with any revision change.

Version-Release number of selected component (if applicable):
4.3.18

How reproducible:
Presume every time certificates are renewed.

Steps to Reproduce:
1. Install OCP 4.3.18.
2. Update something to force a new kube-apiserver revision.
3. Review access to the web console and other OpenShift URLs.
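
Step 3 can be made measurable by polling a route while the kube-apiserver revision rolls out. The following is a minimal sketch, not part of the original report: the function names (`probe`, `outage_windows`, `watch`), the polling interval, and the choice to treat any HTTP status below 500 as "reachable" are all assumptions made for illustration.

```python
import ssl
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return True if the URL answers with any HTTP status below 500."""
    # Assumption: certificate verification is disabled because test clusters
    # often use self-signed certificates. Do not do this against production.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # The server answered; only 5xx counts as unavailable here.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, reset, DNS failure, TLS failure, or timeout.
        return False


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (start, end) outage windows.

    A window starts at the first failed sample and ends at the next
    successful one; an outage still open at the end is closed at the
    timestamp of the last sample.
    """
    windows = []
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
        last_ts = ts
    if start is not None:
        windows.append((start, last_ts))
    return windows


def watch(url, duration=600.0, interval=2.0):
    """Poll the URL for `duration` seconds and return the outage windows."""
    samples = []
    deadline = time.time() + duration
    while time.time() < deadline:
        samples.append((time.time(), probe(url)))
        time.sleep(interval)
    return outage_windows(samples)
```

Calling `watch()` with the console route URL while step 2 is in progress would return a list of (start, end) timestamps that can be compared against the kube-apiserver pod restart times.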

Actual results:
Unable to access web console and "OpenShift URLs" while kube-apiserver restarts are in progress

Expected results:
No outage of web console for certificate renewals.


Additional info:
I'll add a link to must-gather once it's ready.

Comment 1 Stefan Schimanski 2020-06-17 07:48:05 UTC
This bug report lacks essential information:

- must-gather output (promised but not available yet)
- which platform?

Without this, the bug is not actionable and not helpful.

Comment 3 Stefan Schimanski 2020-06-18 11:42:27 UTC
Closing this after a week if no info arrives.

Comment 9 Greg Rodriguez II 2020-07-01 18:45:17 UTC
Customer has reported that the issue has evolved.  Customer is reporting multiple instances daily where the web console is not accessible.  It is beginning to impact their business and is seeding doubt in the reliability of the product.

What is needed to move this issue further?  The Customer has escalated the SFDC ticket stating the above.

Comment 11 Standa Laznicka 2020-07-07 13:55:41 UTC
So far I have been unable to find anything that would be causing these issues. Summary:
- the kube-apiserver and openshift-apiserver operators appear to be acting normally
- the kube-apiservers have many "http: TLS handshake error from 10.70.1.154:15170: EOF" errors in their logs, but none of these come from the openshift-apiserver
- the DNS operator has reported Degraded twice in the previous month and a half, so it's probably not causing the observed error either
- in the audit log `ip-<redacted>-136.us-west-1.compute.internal-audit-2020-07-01T12-14-29.443.log`, 65.42% (60598/92630) of the actions were performed by the cluster-logging-operator
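
A per-user share like the audit-log figure above can be computed with a short script. This is only a sketch of one way to do it, not the method used in this comment: it assumes the audit log is in the usual Kubernetes format of one JSON event per line with the requester under `user.username`, and the function names are made up for illustration.

```python
import json
from collections import Counter


def actions_by_user(lines):
    """Count audit events per requesting user.

    Assumption: each line is one JSON audit event carrying the requester
    under user.username. Lines that do not parse are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or non-JSON line
        if not isinstance(event, dict):
            continue
        user = event.get("user")
        name = user.get("username", "<unknown>") if isinstance(user, dict) else "<unknown>"
        counts[name] += 1
    return counts


def top_share(counts):
    """Return (user, count, percent of all events) for the busiest user."""
    total = sum(counts.values())
    if total == 0:
        return None
    user, n = counts.most_common(1)[0]
    return user, n, 100.0 * n / total
```

Passing an open file object for the audit log to `actions_by_user()` and the result to `top_share()` yields the busiest user together with its event count and percentage.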

I haven't checked the ingress and SDN logs, and since this issue appears on routes and the apiservers appear to be communicating correctly, I'm moving this to routing.

Comment 12 Andrew McDermott 2020-07-09 12:11:23 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 13 Andrew McDermott 2020-07-30 10:06:33 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 15 mfisher 2020-08-18 19:55:57 UTC
Target reset to 4.7 while investigation is either ongoing or not yet started.  Will be considered for earlier release versions when diagnosed and resolved.

Comment 16 Andrew McDermott 2020-09-10 11:49:43 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 23 Stefan Schimanski 2020-09-15 11:43:09 UTC
Not seeing hints that the API server is the root cause. All recent comments read like it is an ingress issue. Reassigning.

Note:
- "Depends of Browser, in chrome this is the error, in Explorer only says that the server is not active." – the browser never talks to the API server
- "server: authentication failed: http: named cookie not present" – this is console backend, not apiserver auth.

Comment 27 Andrew McDermott 2020-10-02 17:09:01 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 28 Miciah Dashiel Butler Masters 2020-10-26 05:29:53 UTC
We're looking at https://issues.redhat.com/browse/NE-172 / https://github.com/openshift/cluster-ingress-operator/pull/472 as a solution to the issue with cloud-ingress-operator.  We'll continue tracking the issue in the upcoming sprint.

Comment 30 Miciah Dashiel Butler Masters 2020-11-14 00:40:47 UTC
https://github.com/openshift/cluster-ingress-operator/pull/472 has been merged and verified in 4.7 (tracked as bug 1891625).  https://github.com/openshift/cluster-ingress-operator/pull/482 backports this change to 4.6 (tracked as bug 1891626) and is awaiting cherry-pick approval.  These changes to cluster-ingress-operator add the capability to change the scope of an IngressController's load balancer without deleting and recreating that IngressController.  
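
For illustration, an in-place scope change of this kind could be expressed as a merge patch against the IngressController rather than a delete and recreate. The sketch below only builds the patch body. It assumes the `spec.endpointPublishingStrategy.loadBalancer.scope` field of the `operator.openshift.io/v1` IngressController API with the values `Internal` and `External`; the helper name `scope_patch` is hypothetical and is not taken from either operator's code.

```python
import json

# Assumption: these are the scope values accepted by the IngressController API.
VALID_SCOPES = ("Internal", "External")


def scope_patch(scope):
    """Build a JSON merge patch that sets the load balancer scope in place."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"scope must be one of {VALID_SCOPES}, got {scope!r}")
    return {
        "spec": {
            "endpointPublishingStrategy": {
                "type": "LoadBalancerService",
                "loadBalancer": {"scope": scope},
            }
        }
    }


def scope_patch_json(scope):
    """Serialize the patch for use with a merge-patch request."""
    return json.dumps(scope_patch(scope))
```

The resulting JSON could be passed to something like `oc -n openshift-ingress-operator patch ingresscontroller/<name> --type=merge -p '<json>'`. Whether the operator applies the change without further steps may depend on the cluster version and platform.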

Following up on the changes to cluster-ingress-operator, https://github.com/openshift/cloud-ingress-operator/pull/118 changes cloud-ingress-operator not to delete and recreate the IngressController.  

As I understand it, https://github.com/openshift/cloud-ingress-operator/pull/118 is blocked on getting https://github.com/openshift/cluster-ingress-operator/pull/482 merged, verified, and deployed.  Furthermore, https://github.com/openshift/cloud-ingress-operator/pull/118 should ultimately resolve the issue in this Bugzilla report.  @drow, can you confirm that my understanding is correct?

Comment 32 Miciah Dashiel Butler Masters 2021-06-11 22:39:59 UTC
https://github.com/openshift/cluster-ingress-operator/pull/482 was merged but had to be reverted because it introduced a new issue.  We have filed a new Jira issue for re-adding the capability: https://issues.redhat.com/browse/NE-623

Comment 34 Miciah Dashiel Butler Masters 2022-01-10 04:12:01 UTC
<https://github.com/openshift/cluster-ingress-operator/pull/582> has merged, re-adding the needed cluster-ingress-operator functionality as mentioned in comment 32.  Once OpenShift 4.10 ships, <https://github.com/openshift/cloud-ingress-operator/pull/118> can be re-opened (or an equivalent PR opened) to complete the work required to close this BZ.

Comment 35 mfisher 2022-11-04 15:01:56 UTC
This issue is stale and closed because it has no activity for a significant amount of time and is reported on a version no longer in maintenance.  If this issue should not be closed please verify the condition still exists on a supported release and submit an updated bug.

Comment 36 Miciah Dashiel Butler Masters 2022-11-04 19:57:10 UTC
Dustin Row informs me that cloud-ingress-operator was updated with <https://github.com/openshift/cloud-ingress-operator/pull/241> as part of <https://issues.redhat.com/browse/OSD-9580> to take advantage of the functionality that was re-introduced in <https://github.com/openshift/cluster-ingress-operator/pull/582>, and so this issue is resolved.

