Bug 1939537 - p&f: probes should not get 429s
Summary: p&f: probes should not get 429s
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.7
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: ---
: 4.7.z
Assignee: Abu Kashem
QA Contact: Ke Wang
URL:
Whiteboard: LifecycleReset
Depends On: 1937916
Blocks: 1939541
TreeView+ depends on / blocked
 
Reported: 2021-03-16 15:18 UTC by Abu Kashem
Modified: 2021-12-01 13:35 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1937916
: 1939541 (view as bug list)
Environment:
Last Closed: 2021-12-01 13:35:22 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 658 0 None open Bug 1939537: UPSTREAM: 100678: apf: exempt probes /healthz /livez /readyz 2021-11-11 01:01:27 UTC
Red Hat Product Errata RHBA-2021:4802 0 None None None 2021-12-01 13:35:43 UTC

Description Abu Kashem 2021-03-16 15:18:13 UTC
+++ This bug was initially created as a clone of Bug #1937916 +++

Messages like

Get "https://10.0.175.171:17697/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

appear to be produced by kubelet even when the pod is able to communicate with itself. This causes outage of system services as they are not deemed healthy:

	Liveness probe failed: HTTP probe failed with statuscode: 429

More details in https://coreos.slack.com/archives/C01RLRP2F9N

--- Additional comment from David Eads on 2021-03-11 18:29:01 UTC ---

I've opened https://github.com/openshift/cluster-kube-apiserver-operator/pull/1060 as a possibility, but I'd like a review from Abu

Comment 1 Michal Fojtik 2021-04-15 15:48:32 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 2 Ke Wang 2021-11-11 10:55:37 UTC
Launched one 4.7 cluster with the PR of the bug by cluster-bot,

$ oc get clusterversion
NAME      VERSION                                                  AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.ci.test-2021-11-11-093118-ci-ln-xrsxv7t-latest   True        False         48m     Cluster version is 4.7.0-0.ci.test-2021-11-11-093118-ci-ln-xrsxv7t-latest

$ oc get flowschema | grep probe
probes                              exempt                              2                    <none>                37m   False

$ oc edit  kubeapiserver/cluster # change the loglevel to TraceAll
kubeapiserver.operator.openshift.io/cluster edited

After the kube-apiservers fnished the restarting, make some readyz requests to the apiserver,

$ for i in {1..30}; do curl -k https://api.ci-ln-xrsxv7t-76ef8.origin-....com:6443/readyz;done;

$ kas_pods=$(oc get pods -n openshift-kube-apiserver | grep 'kube-apiserver' | awk '{print $1}'); for pod in $kas_pods; do oc -n openshift-kube-apiserver logs $pod -c kube-apiserver | grep 'exempt' | grep 'readyz' | head -1;done

I1111 10:50:22.561904      19 apf_controller.go:702] startRequest(RequestDigest{RequestInfo: &request.RequestInfo{IsResourceRequest:false, Path:"/readyz", Verb:"get", APIPrefix:"", APIGroup:"", APIVersion:"", Namespace:"", Resource:"", Subresource:"", Name:"", Parts:[]string(nil)}, User: &user.DefaultInfo{Name:"system:anonymous", UID:"", Groups:[]string{"system:unauthenticated"}, Extra:map[string][]string(nil)}}) => fsName="probes", distMethod=(*v1beta1.FlowDistinguisherMethod)(nil), plName="exempt", immediate
I1111 10:50:41.930591      18 apf_controller.go:702] startRequest(RequestDigest{RequestInfo: &request.RequestInfo{IsResourceRequest:false, Path:"/readyz", Verb:"get", APIPrefix:"", APIGroup:"", APIVersion:"", Namespace:"", Resource:"", Subresource:"", Name:"", Parts:[]string(nil)}, User: &user.DefaultInfo{Name:"system:anonymous", UID:"", Groups:[]string{"system:unauthenticated"}, Extra:map[string][]string(nil)}}) => fsName="probes", distMethod=(*v1beta1.FlowDistinguisherMethod)(nil), plName="exempt", immediate
I1111 10:46:32.752183      17 apf_controller.go:702] startRequest(RequestDigest{RequestInfo: &request.RequestInfo{IsResourceRequest:false, Path:"/readyz", Verb:"get", APIPrefix:"", APIGroup:"", APIVersion:"", Namespace:"", Resource:"", Subresource:"", Name:"", Parts:[]string(nil)}, User: &user.DefaultInfo{Name:"system:anonymous", UID:"", Groups:[]string{"system:unauthenticated"}, Extra:map[string][]string(nil)}}) => fsName="probes", distMethod=(*v1beta1.FlowDistinguisherMethod)(nil), plName="exempt", immediate

The new flowschema probes works as expected, so the bug is pre-merge verified. After the PR gets merged, the bug will be moved to VERIFIED by the bot automatically or, if not working, by me manually.

Comment 3 Michal Fojtik 2021-11-11 11:35:21 UTC
The LifecycleStale keyword was removed because the needinfo? flag was reset.
The bug assignee was notified.

Comment 4 Abu Kashem 2021-11-11 13:05:36 UTC
The PR is green, waiting for it to be cherry-pick approved

Comment 10 errata-xmlrpc 2021-12-01 13:35:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.38 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4802


Note You need to log in before you can comment on or make changes to this bug.