Bug 1905991 - [release-4.5] Detecting broken connections to the Kube API takes up to 15 minutes [NEEDINFO]
Summary: [release-4.5] Detecting broken connections to the Kube API takes up to 15 min...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Lukasz Szaszkiewicz
QA Contact: Ke Wang
URL:
Whiteboard: LifecycleReset
Duplicates: 1881878
Depends On: 1905195
Blocks: 1723620 1881878
Reported: 2020-12-09 13:26 UTC by Lukasz Szaszkiewicz
Modified: 2024-06-13 23:40 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Ungraceful network termination could cut off communication between components for up to 15 minutes. The issue has been fixed by setting the TCP_USER_TIMEOUT socket option (https://man7.org/linux/man-pages/man7/tcp.7.html), which controls how long transmitted data may remain unacknowledged before the connection is forcefully closed. After applying the fix, new connections are created after approximately 30 seconds instead of 15 minutes.
Clone Of: 1905195
Environment:
Last Closed: 2021-03-03 04:40:30 UTC
Target Upstream Version:
Embargoed:
mfojtik: needinfo?




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift oauth-server pull 66 | 0 | None | closed | Bug 1905991: Detecting broken connections to the Kube API takes up to 15 minutes | 2021-02-17 08:37:21 UTC
Github | openshift openshift-apiserver pull 164 | 0 | None | closed | Bug 1905991: Detecting broken connections to the Kube API takes up to 15 minutes | 2021-02-17 08:37:21 UTC
Red Hat Product Errata | RHSA-2021:0428 | 0 | None | None | None | 2021-03-03 04:40:52 UTC

Description Lukasz Szaszkiewicz 2020-12-09 13:26:33 UTC
+++ This bug was initially created as a clone of Bug #1905195 +++

+++ This bug was initially created as a clone of Bug #1905194 +++

All API servers (openshift-apiserver, oauth-server) rely on the TCP stack to detect broken network connections to the kube-apiserver (KAS).
This can take up to 15 minutes, during which our platform might be unavailable.


There are already reported cases in which aggregated APIs (i.e. `openshift-apiserver`) were unable to establish a new connection to the Kube API for 15 minutes:

 - after "ungraceful termination" https://bugzilla.redhat.com/show_bug.cgi?id=1881878
 - after a network error https://bugzilla.redhat.com/show_bug.cgi?id=1879232#c39


Detecting a broken connection should be quicker; ideally it should take seconds, not minutes.

Comment 1 Michal Fojtik 2021-01-08 14:17:25 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 2 Lukasz Szaszkiewicz 2021-01-11 09:42:22 UTC
PRs are in the merge queue waiting for the QE team to verify.

Comment 3 Michal Fojtik 2021-01-11 10:38:51 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 4 Xingxing Xia 2021-01-13 12:17:04 UTC
Ke Wang, once the PRs and bug statuses are ready, please help verify (or pre-merge verify) openshift-apiserver, oauth-openshift, and oauth-apiserver. See my steps in the 4.7 clone, bug 1905194#c14 ~ c16 (you can ignore the earlier exploratory comments there). They describe how to craft a broken connection for our components and may be reusable for other bugs in the future.

Comment 5 Lukasz Szaszkiewicz 2021-01-15 10:08:45 UTC
PRs are ready and have already been tagged. They will be merged after https://bugzilla.redhat.com/show_bug.cgi?id=1905195 is verified.

Comment 6 Lukasz Szaszkiewicz 2021-02-05 13:28:30 UTC
There is only one pending PR: https://github.com/openshift/openshift-apiserver/pull/164

Comment 8 Lukasz Szaszkiewicz 2021-02-12 09:32:30 UTC
*** Bug 1881878 has been marked as a duplicate of this bug. ***

Comment 9 Lukasz Szaszkiewicz 2021-02-12 09:35:09 UTC
All pending PRs have merged. Please verify openshift-apiserver and oauth-server.
oauth-apiserver was added in 4.6

Comment 10 weiguo fan 2021-02-12 13:03:14 UTC
(In reply to Lukasz Szaszkiewicz from comment #9)

Hi, Lukasz,

> All pending PRs have merged. Please verify openshift-apiserver and
> oauth-server.
> oauth-apiserver was added in 4.6

Could you let us know which version of 4.6 fixed this problem?
We still see the problem on OCP 4.6.12.

Comment 11 Lukasz Szaszkiewicz 2021-02-12 13:24:51 UTC
(In reply to weiguo fan from comment #10)
> (In reply to Lukasz Szaszkiewicz from comment #9)
> 
> Hi, Lukasz,
> 
> > All pending PRs have merged. Please verify openshift-apiserver and
> > oauth-server.
> > oauth-apiserver was added in 4.6
> 
> Could you let us know which version of 4.6 fixed this problem?
> We still see the problem on OCP 4.6.12.

Hi, sure, it was https://bugzilla.redhat.com/show_bug.cgi?id=1905195

Comment 14 Ke Wang 2021-02-18 14:56:43 UTC
In summary, the previous two comments verify that openshift-apiserver and oauth-openshift detect broken connections to the kube-apiserver immediately, so I am moving the bug to VERIFIED.

Comment 17 errata-xmlrpc 2021-03-03 04:40:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.5.33 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0428

