Bug 2048801 - kube-apiserver remains in ready state even after losing connectivity with pod network [NEEDINFO]
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.8
Hardware: All
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Wally
QA Contact: Ke Wang
URL:
Whiteboard: EmergencyRequest
Depends On:
Blocks:
Reported: 2022-01-31 21:13 UTC by Joe Williams
Modified: 2022-03-28 17:14 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2063953 (view as bug list)
Environment:
Last Closed: 2022-02-01 10:42:20 UTC
Target Upstream Version:
Embargoed:
mfojtik: needinfo?


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift kubernetes pull 1158 | 0 | None | open | Always check connectivity to overlay network | 2022-01-31 21:13:19 UTC

Description Joe Williams 2022-01-31 21:13:20 UTC
Description of problem:
When kube-apiserver can no longer connect to the overlay network where the OAuth pods reside, it remains active but cannot properly handle requests. In that case, kube-apiserver should be marked as not ready by one of the readyz checks. The current checks verify connectivity only once, after kube-apiserver starts.

The changes in the linked pull request modify the checks to continuously monitor the connection to the overlay network.
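
For illustration only, the difference between a one-time check and a continuous one can be sketched as below. This is not the code from the linked pull request, and kube-apiserver itself is written in Go. The class name OverlayConnectivityCheck, the TCP probe, and the failure_threshold parameter are all assumptions made for this sketch.

```python
import socket
import threading


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class OverlayConnectivityCheck:
    """Hypothetical readiness check that re-probes its target on every tick.

    A one-time check would call probe() once at startup and never again;
    here run() keeps probing, so a later loss of connectivity is noticed.
    """

    def __init__(self, host: str, port: int, failure_threshold: int = 3):
        self.host = host
        self.port = port
        # Assumed policy: report not ready only after this many
        # consecutive failed probes, to tolerate brief blips.
        self.failure_threshold = failure_threshold
        self._failures = 0
        self._lock = threading.Lock()

    def probe(self) -> None:
        """Run a single probe and update the consecutive-failure counter."""
        ok = can_connect(self.host, self.port)
        with self._lock:
            self._failures = 0 if ok else self._failures + 1

    def ready(self) -> bool:
        """Report ready while consecutive failures stay below the threshold."""
        with self._lock:
            return self._failures < self.failure_threshold

    def run(self, interval: float, stop: threading.Event) -> None:
        """Probe every `interval` seconds until `stop` is set."""
        while not stop.is_set():
            self.probe()
            stop.wait(interval)
```

In this sketch, run() would be started in a background thread, and a readyz-style handler would consult ready() on each request instead of relying on a result cached at startup.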


Version-Release number of selected component (if applicable):
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cbe068078491c057d6045816392d0e6f04e10135b4d94a7d0139059847eb634d


How reproducible:
Can reproduce every time.


Steps to Reproduce:
1. On one of the master nodes, remove the route to the pod network.


Actual results:
kube-apiserver still listed as ready


Expected results:
kube-apiserver listed as not ready


Additional info:

Comment 1 Michal Fojtik 2022-01-31 21:41:43 UTC
** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority Engineers are asked to stop whatever they are doing, putting everything else on hold.
Please be prepared to have reasonable justification ready to discuss, and ensure your own and engineering management are aware and agree this BZ is urgent. Keep in mind, urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

** INFORMATION REQUIRED **

Please answer these questions before escalation to engineering:

1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the must-gather error message.
2. Give the output of "oc get clusteroperators -o yaml".
3. In case of degraded/unavailable operators, have all their logs and the logs of the operands been analyzed [yes/no]
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is likely the cause of the failure of the other, root-cause at the top.
6. Explain why (5) is likely the right order and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.

Comment 2 Stefan Schimanski 2022-02-01 10:42:20 UTC
I disagree with the assessment that it is not ready. It can perfectly well answer requests that don't require OAuth token authentication, so marking it unavailable or unready would be wrong. It is clearly degraded.

Comment 3 Michal Skalski 2022-02-01 11:20:02 UTC
Hi Stefan, if the kube-apiserver is not able to connect to the overlay network where, for example, the OAuth servers are, we cannot do basic operations such as creating a job. In that case, should kube-apiserver receive these types of requests at all? What options do we have to prevent a problematic kube-apiserver from being a target of such requests, other than removing it from the endpoint list of the kube-apiserver service?

