+++ This bug was initially created as a clone of Bug #1937916 +++

Messages like

    Get "https://10.0.175.171:17697/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

appear to be produced by kubelet even when the pod is able to communicate with itself. This causes an outage of system services, as they are not deemed healthy:

    Liveness probe failed: HTTP probe failed with statuscode: 429

More details in https://coreos.slack.com/archives/C01RLRP2F9N

--- Additional comment from David Eads on 2021-03-11 18:29:01 UTC ---

I've opened https://github.com/openshift/cluster-kube-apiserver-operator/pull/1060 as a possibility, but I'd like a review from Abu
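For context on the "statuscode: 429" failure quoted above: kubelet treats an HTTP probe as successful only when the response status is at least 200 and below 400, so a 429 (Too Many Requests, presumably returned here because the API server throttled the request) counts as a failed probe even though the endpoint is reachable. A minimal sketch of that rule follows. The helper name probe_outcome is made up for this illustration and is not part of kubelet or any tool mentioned in this bug.

```shell
# probe_outcome is a hypothetical helper, not part of kubelet or oc.
# It mirrors the kubelet rule for HTTP probes: 200 <= status < 400 is success,
# anything else (including 429) is a failure.
probe_outcome() {
  if [ "$1" -ge 200 ] && [ "$1" -lt 400 ]; then
    echo "success"
  else
    echo "failure"
  fi
}
```

One way to use it against a live endpoint would be probe_outcome "$(curl -k -s -o /dev/null -w '%{http_code}' https://10.0.175.171:17697/healthz)", which feeds the status code reported by curl into the check.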
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
Launched one 4.7 cluster with the PR for this bug via cluster-bot:

$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.ci.test-2021-11-11-093118-ci-ln-xrsxv7t-latest   True        False         48m     Cluster version is 4.7.0-0.ci.test-2021-11-11-093118-ci-ln-xrsxv7t-latest

$ oc get flowschema | grep probe
probes   exempt   2   <none>   37m   False

$ oc edit kubeapiserver/cluster # change the loglevel to TraceAll
kubeapiserver.operator.openshift.io/cluster edited

After the kube-apiservers finished restarting, make some readyz requests to the apiserver:

$ for i in {1..30}; do curl -k https://api.ci-ln-xrsxv7t-76ef8.origin-....com:6443/readyz;done;

$ kas_pods=$(oc get pods -n openshift-kube-apiserver | grep 'kube-apiserver' | awk '{print $1}'); for pod in $kas_pods; do oc -n openshift-kube-apiserver logs $pod -c kube-apiserver | grep 'exempt' | grep 'readyz' | head -1;done
I1111 10:50:22.561904 19 apf_controller.go:702] startRequest(RequestDigest{RequestInfo: &request.RequestInfo{IsResourceRequest:false, Path:"/readyz", Verb:"get", APIPrefix:"", APIGroup:"", APIVersion:"", Namespace:"", Resource:"", Subresource:"", Name:"", Parts:[]string(nil)}, User: &user.DefaultInfo{Name:"system:anonymous", UID:"", Groups:[]string{"system:unauthenticated"}, Extra:map[string][]string(nil)}}) => fsName="probes", distMethod=(*v1beta1.FlowDistinguisherMethod)(nil), plName="exempt", immediate
I1111 10:50:41.930591 18 apf_controller.go:702] startRequest(RequestDigest{RequestInfo: &request.RequestInfo{IsResourceRequest:false, Path:"/readyz", Verb:"get", APIPrefix:"", APIGroup:"", APIVersion:"", Namespace:"", Resource:"", Subresource:"", Name:"", Parts:[]string(nil)}, User: &user.DefaultInfo{Name:"system:anonymous", UID:"", Groups:[]string{"system:unauthenticated"}, Extra:map[string][]string(nil)}}) => fsName="probes", distMethod=(*v1beta1.FlowDistinguisherMethod)(nil), plName="exempt", immediate
I1111 10:46:32.752183 17 apf_controller.go:702] startRequest(RequestDigest{RequestInfo: &request.RequestInfo{IsResourceRequest:false, Path:"/readyz", Verb:"get", APIPrefix:"", APIGroup:"", APIVersion:"", Namespace:"", Resource:"", Subresource:"", Name:"", Parts:[]string(nil)}, User: &user.DefaultInfo{Name:"system:anonymous", UID:"", Groups:[]string{"system:unauthenticated"}, Extra:map[string][]string(nil)}}) => fsName="probes", distMethod=(*v1beta1.FlowDistinguisherMethod)(nil), plName="exempt", immediate

The new "probes" flowschema works as expected: anonymous /readyz requests are matched to it and dispatched immediately at the "exempt" priority level. The bug is therefore pre-merge verified. After the PR gets merged, the bug will be moved to VERIFIED by the bot automatically or, if that does not work, by me manually.
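The grep pipeline above prints one matching trace line per pod. As a rough sketch, the same lines can be reduced to just the request path, flowschema, and priority level, so that a probe landing anywhere other than "probes" / "exempt" stands out. The function name summarize_apf_matches is made up for this illustration, and the pattern assumes the startRequest log format quoted in this comment, which may differ in other releases.

```shell
# summarize_apf_matches is a hypothetical helper (not part of oc or kube-apiserver).
# It reads kube-apiserver trace logs on stdin and, for every startRequest line,
# prints "<path> <flowschema> <priority level>". Lines in any other format are dropped.
summarize_apf_matches() {
  sed -n -E 's/.*startRequest\(.*Path:"([^"]*)".*fsName="([^"]*)".*plName="([^"]*)".*/\1 \2 \3/p'
}
```

It could be used in place of the two grep stages, for example: oc -n openshift-kube-apiserver logs $pod -c kube-apiserver | summarize_apf_matches | sort | uniq -c. Counting per combination rather than taking only the first match makes it easier to spot a stray request that was not classified as expected.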
The LifecycleStale keyword was removed because the needinfo? flag was reset. The bug assignee was notified.
The PR is green and is waiting for cherry-pick approval.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.38 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4802