Messages like

  Get "https://10.0.175.171:17697/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

appear to be produced by the kubelet even when the pod is able to communicate with itself. This causes an outage of system services because they are not deemed healthy:

  Liveness probe failed: HTTP probe failed with statuscode: 429

More details in https://coreos.slack.com/archives/C01RLRP2F9N
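A quick way to see the throttling directly (a sketch, assuming the pod IP and health port from the kubelet message above, run from somewhere that can reach them): a request that is being rate limited comes back with HTTP 429 rather than succeeding.

$ # print only the HTTP status of the pod's own health endpoint; -k skips cert verification
$ # while the endpoint is being rate limited this would print 429
$ curl -ks -o /dev/null -w '%{http_code}\n' https://10.0.175.171:17697/healthz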
I've opened https://github.com/openshift/cluster-kube-apiserver-operator/pull/1060 as a possibility, but I'd like a review from Abu
*** Bug 1939732 has been marked as a duplicate of this bug. ***
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-07-115443   True        False         75m     Cluster version is 4.8.0-0.nightly-2021-04-07-115443

Per the change in PR https://github.com/openshift/cluster-kube-apiserver-operator/pull/1060, a new flowschema named "probes" should be created, but it is actually not found:

$ oc get flowschema
NAME                                PRIORITYLEVEL                       MATCHINGPRECEDENCE   DISTINGUISHERMETHOD   AGE    MISSINGPL
exempt                              exempt                              1                    <none>                105m   False
openshift-apiserver-sar             exempt                              2                    ByUser                92m    False
openshift-oauth-apiserver-sar       exempt                              2                    ByUser                75m    False
system-leader-election              leader-election                     100                  ByUser                105m   False
workload-leader-election            leader-election                     200                  ByUser                105m   False
openshift-sdn                       system                              500                  ByUser                98m    False
system-nodes                        system                              500                  ByUser                105m   False
kube-controller-manager             workload-high                       800                  ByNamespace           105m   False
kube-scheduler                      workload-high                       800                  ByNamespace           105m   False
kube-system-service-accounts        workload-high                       900                  ByNamespace           105m   False
openshift-apiserver                 workload-high                       1000                 ByUser                92m    False
openshift-controller-manager        workload-high                       1000                 ByUser                104m   False
openshift-oauth-apiserver           workload-high                       1000                 ByUser                75m    False
openshift-oauth-server              workload-high                       1000                 ByUser                75m    False
openshift-apiserver-operator        openshift-control-plane-operators   2000                 ByUser                92m    False
openshift-authentication-operator   openshift-control-plane-operators   2000                 ByUser                75m    False
openshift-etcd-operator             openshift-control-plane-operators   2000                 ByUser                96m    False
openshift-kube-apiserver-operator   openshift-control-plane-operators   2000                 ByUser                95m    False
openshift-monitoring-metrics        workload-high                       2000                 ByUser                95m    False
service-accounts                    workload-low                        9000                 ByUser                105m   False
global-default                      global-default                      9900                 ByUser                105m   False
catch-all                           catch-all                           10000                ByUser                105m   False

Checked the CVO pod to make sure the bug's PR manifest is present in the CVO file system:

$ oc get pods -A | grep openshift-cluster-version
openshift-cluster-version   cluster-version-operator-6555549458-r5bdn   1/1   Running   0   107m

$ oc exec -n openshift-cluster-version cluster-version-operator-6555549458-r5bdn -it -- cat /release-manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml | grep -A100 '# probes'
# probes need to always work. If probes get 429s, then the kubelet will treat them as probe failures.
# Since probes are cheap to run, we won't rate limit these at all.
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
metadata:
  name: probes
spec:
  distinguisherMethod:
    type: ByUser
  matchingPrecedence: 2
  priorityLevelConfiguration:
    name: exempt
  rules:
  - nonResourceRules:
    - nonResourceURLs:
      - '/healthz'
      - '/readyz'
      - '/livez'
      verbs:
      - 'get'
    subjects:
    - group:
        name: system:authenticated
      kind: Group
    - group:
        name: system:unauthenticated
      kind: Group

So the manifest added by the bug's PR is shipped, but it doesn't take effect. Compared with the other flowschema manifests, it is missing the following annotations:

  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"

I copied the new probes flowschema to a yaml file and applied it; the probes flowschema was created.
$ cat probes-fs.yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
metadata:
  name: probes
spec:
  distinguisherMethod:
    type: ByUser
  matchingPrecedence: 2
  priorityLevelConfiguration:
    name: exempt
  rules:
  - nonResourceRules:
    - nonResourceURLs:
      - '/healthz'
      - '/readyz'
      - '/livez'
      verbs:
      - 'get'
    subjects:
    - group:
        name: system:authenticated
      kind: Group
    - group:
        name: system:unauthenticated
      kind: Group

$ oc apply -f probes-fs.yaml
flowschema.flowcontrol.apiserver.k8s.io/probes created

$ oc get flowschema | grep probes
probes   exempt   2   ByUser   6m6s   False

Since the PR fix doesn't work as expected, assigning the bug back.
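For reference, a sketch of how the shipped manifest's metadata stanza would likely need to look for the CVO to include it, assuming the same include annotations that the other flowschema manifests carry (quoted above):

metadata:
  name: probes
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"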
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.20    True        False         3h56m   Cluster version is 4.8.20

$ oc get flowschema | grep probe
probes   exempt   2   <none>   3h58m   False

$ oc edit kubeapiserver/cluster   # change the logLevel to TraceAll
kubeapiserver.operator.openshift.io/cluster edited

After the kube-apiservers finished restarting, make some readyz requests to the apiserver:

$ for i in {1..30}; do curl -k https://api.kewang-1048g1.qe.gcp...com:6443/readyz; done

$ kas_pods=$(oc get pods -n openshift-kube-apiserver | grep 'kube-apiserver' | awk '{print $1}'); for pod in $kas_pods; do oc -n openshift-kube-apiserver logs $pod -c kube-apiserver | grep 'exempt' | grep 'readyz' | head -1; done
I1110 14:53:13.679381      20 apf_controller.go:792] startRequest(RequestDigest{RequestInfo: &request.RequestInfo{IsResourceRequest:false, Path:"/readyz", Verb:"get", APIPrefix:"", APIGroup:"", APIVersion:"", Namespace:"", Resource:"", Subresource:"", Name:"", Parts:[]string(nil)}, User: &user.DefaultInfo{Name:"system:anonymous", UID:"", Groups:[]string{"system:unauthenticated"}, Extra:map[string][]string(nil)}}) => fsName="probes", distMethod=(*v1beta1.FlowDistinguisherMethod)(nil), plName="exempt", immediate

The new flowschema probes works as expected.
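Another way to confirm which flowschema such a request matches, without raising the log level: API Priority and Fairness stamps each response with the UID of the matched FlowSchema in the X-Kubernetes-PF-FlowSchema-UID header, which can be compared against the probes object. A sketch, reusing the same (elided) API URL as above:

$ oc get flowschema probes -o jsonpath='{.metadata.uid}{"\n"}'
$ # the X-Kubernetes-PF-FlowSchema-UID response header should carry the same UID
$ curl -ks -D - -o /dev/null https://api.kewang-1048g1.qe.gcp...com:6443/readyz | grep -i 'x-kubernetes-pf-flowschema-uid'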
Per Comment 5, moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.8.21 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4716