If you are on a 4.5 cluster and see KubeAPIErrorsHigh followed by a "Timeout: Too large resource version" error reported by a kubelet, then you have hit https://github.com/kubernetes/kubernetes/issues/91073. Note that this BZ was created from https://bugzilla.redhat.com/show_bug.cgi?id=1748434 and is meant to track issues specific to a 4.5 cluster.
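To confirm you have hit it, check each node's kubelet journal for that error string. A minimal sketch (the node name is a redacted placeholder; this assumes `oc adm node-logs` is available, as on OCP 4.x):

~~~
# Check one node's kubelet journal for the symptom (node name is an example)
oc adm node-logs ip-xx-0-xxx-242.us-east-2.compute.internal -u kubelet \
  | grep -i 'Too large resource version'
~~~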
Hi Lukasz,

My customer, after upgrading to 4.5, is currently experiencing the alerts below. They are similar to, but not the same as, the description, though they match the other linked cases.
- Should I track this BZ too, or open a new one?

~~~
API server is returning errors for 100% of requests for LIST networkpolicies.
The API server is burning too much error budget
API server is returning errors for 100% of requests for LIST csidrivers.
~~~
(In reply to hgomes from comment #22)
> Hi Lukasz,
>
> My customer, after upgrading to 4.5, is currently experiencing the alerts
> below. They are similar to, but not the same as, the description, though
> they match the other linked cases.
> - Should I track this BZ too, or open a new one?
>
> ~~~
> API server is returning errors for 100% of requests for LIST networkpolicies.
> The API server is burning too much error budget
> API server is returning errors for 100% of requests for LIST csidrivers.
> ~~~

Please check the audit logs. If they contain HTTP 504 responses for LIST networkpolicies and csidrivers, then the customer ran into the same issue. You should also be able to tell which pods caused those errors; in that case, make sure the logs contain "Timeout: Too large resource version".
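A sketch of that audit-log check (the jq filter is illustrative, `<master-node>` is a placeholder, and the default OpenShift audit log location under --path=kube-apiserver/ on the masters is assumed):

~~~
# List the audit log files on the masters, then filter for 504 responses
# to LIST networkpolicies/csidrivers and print the requesting user.
oc adm node-logs --role=master --path=kube-apiserver/
oc adm node-logs <master-node> --path=kube-apiserver/audit.log \
  | jq -r 'select(.responseStatus.code == 504 and .verb == "list"
                  and (.objectRef.resource == "networkpolicies"
                       or .objectRef.resource == "csidrivers"))
           | .user.username'
~~~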
Verified with OCP 4.5.0-0.nightly-2020-10-08-211214. Per the upstream issue https://github.com/kubernetes/kubernetes/issues/91073, disconnected a node from the network for a few minutes:

~~~
$ oc debug node/ip-xx-0-xxx-242.us-east-2.compute.internal
sh-4.4# cat > test.sh <<EOF
ifconfig ens3 down
sleep 300
ifconfig ens3 up
EOF
sh-4.4# bash ./test.sh &
~~~

After the connection recovered, reconnected to the node and checked whether the error messages from this bug appear in the kubelet logs:

~~~
$ oc debug node/ip-xx-0-xxx-242.us-east-2.compute.internal
sh-4.4# journalctl -b -u kubelet | grep -i 'Too large resource version'
~~~

The error messages no longer appear in the kubelet logs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5.15 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4228
I'm reopening this BZ as not all operators have been fixed. https://bugzilla.redhat.com/show_bug.cgi?id=1879901 is a top-level issue that captures all operators that should be fixed.
Moving this BZ to QE since all dependent BZs in https://bugzilla.redhat.com/show_bug.cgi?id=1879901 have been verified.
~~~
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2021-02-05-192721   True        False         5h47m   Cluster version is 4.5.0-0.nightly-2021-02-05-192721
~~~

To verify, per the upstream issue https://github.com/kubernetes/kubernetes/issues/91073, disconnected a node from the network for a few minutes:

~~~
$ oc debug node/ip-xx-0-xxx-242.us-east-2.compute.internal
sh-4.4# cat > test.sh <<EOF
ifconfig ens3 down
sleep 300
ifconfig ens3 up
EOF
sh-4.4# bash ./test.sh &
~~~

After the connection recovered, reconnected to the node and checked the kubelet logs for the error messages from this bug:

~~~
$ oc debug node/ip-xx-0-xxx-41.us-east-2.compute.internal
sh-4.4# journalctl -b -u kubelet | grep -i 'Too large resource version'
~~~

The error messages no longer appear in the kubelet logs.

Checked all pod log files:

~~~
sh-4.4# cd /var/log
sh-4.4# ls
audit  btmp  chrony  containers  crio  glusterfs  journal  kube-apiserver  lastlog  openshift-apiserver  openvswitch  pods  private  samba  sssd  wtmp
sh-4.4# grep -nr 'Too large resource version' | grep -v debug
~~~

The error messages from this bug no longer appear. Based on the above test results, moving the bug to VERIFIED.
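For completeness, every node can be checked at once rather than one at a time. A quick sketch (assumes `oc adm node-logs` is available; no output means the symptom is gone):

~~~
# Scan the kubelet journal on every node for the symptom.
for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
  echo "== ${node}"
  oc adm node-logs "${node}" -u kubelet | grep -i 'Too large resource version' || true
done
~~~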
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.5.33 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:0428