Description of problem:

APIRequestCount names must be of the form "resource.version.group", as documented by "oc explain apirequestcount". However, nothing prevents an APIRequestCount from being created without any dot in its name. If somebody creates such an object, the code at [1] panics: it splits the name on "." and then accesses both the first and the second segment, so when there is no second segment, the index is out of range. This code appears to run only at pod startup, so triggering the panic also requires the kube-apiserver pod to restart for some reason (such as a new revision rollout performed by kube-apiserver-operator).

[1] https://github.com/openshift/kubernetes/blob/db16f7d/openshift-kube-apiserver/filters/deprecatedapirequest/apiaccess_count_controller.go#L214

Version-Release number of selected component (if applicable):
4.8.25 (but the code does not seem to have changed in recent versions)

How reproducible:
Always

Steps to Reproduce:
1. Manually create an APIRequestCount object without any dot in its name (example below)
2. Restart the kube-apiserver container

Actual results:
- The user is able to create the malformed APIRequestCount object
- The kube-apiserver panics after following the steps above

Expected results:
Either no panic, or creation of any malformed APIRequestCount rejected at admission.

Additional info:
An example of a malformed APIRequestCount object and the resulting panic follow in the comments.
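To make the failure mode concrete, here is a minimal, self-contained Go sketch of the unguarded parsing pattern described above. The function name is illustrative; this is a hypothetical reduction of the linked controller code, not a copy of it:

package main

import (
	"fmt"
	"strings"
)

// apiRequestCountToGVR splits the object name on "." and indexes the
// segments without checking how many there are -- the pattern that panics.
func apiRequestCountToGVR(name string) (resource, version string) {
	segments := strings.Split(name, ".")
	// For name "test-alert", segments has length 1, so segments[1]
	// panics with: runtime error: index out of range [1] with length 1
	return segments[0], segments[1]
}

func main() {
	fmt.Println(apiRequestCountToGVR("pods.v1"))    // fine: two segments
	fmt.Println(apiRequestCountToGVR("test-alert")) // panics at startup
}

Because this parsing happens while scanning persisted objects during startup, the panic shows up only when the kube-apiserver container restarts, which matches the reproduction steps above.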
------------------------------ Steps to reproduce in UNFIXED build ------------------------------

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-07-053433   True        False         4h25m   Cluster version is 4.11.0-0.nightly-2022-04-07-053433

Apply an APIRequestCount object with an unsupported name:

$ cat wrongapirequestcountyaml
apiVersion: apiserver.openshift.io/v1
kind: APIRequestCount
metadata:
  name: test-alert
spec:
  numberOfUsersToReport: 10
  groups:
  - name: test-alert-rules
    rules:
    # Alert for any instance that is unreachable for >5 minutes.
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
    # Alert for any instance that has a median request latency >1s.
    - alert: APIHighRequestLatency
      expr: api_http_request_latencies_second{quantile="0.5"} > 1
      for: 10m
      annotations:
        summary: "High request latency on {{ $labels.instance }}"
        description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"

$ oc create -f wrongapirequestcountyaml
apirequestcount.apiserver.openshift.io/test-alert created

$ oc get apirequestcount | grep "test-alert"
test-alert

Allow kube-apiserver to roll out a new revision:

$ oc patch apiserver cluster -p '{"spec": {"audit": {"profile": "AllRequestBodies"}}}' --type merge
apiserver.config.openshift.io/cluster patched

$ oc get pods -n openshift-kube-apiserver | grep 'apiserver' | grep -v 'guard'
kube-apiserver-ip-10-0-xxx-xxx.us-east-2.compute.internal   5/5   Running            0               133m
kube-apiserver-ip-10-0-xxx-xxx.us-east-2.compute.internal   5/5   Running            0               137m
kube-apiserver-ip-10-0-xxx-xxx.us-east-2.compute.internal   4/5   CrashLoopBackOff   27 (4m8s ago)   120m

$ oc logs -n openshift-kube-apiserver kube-apiserver-ip-10-0-xxx-xxx.us-east-2.compute.internal | grep -i panic
E0412 09:29:21.979868      16 runtime.go:78] Observed a panic: runtime.boundsError{x:1, y:1, signed:true, code:0x0} (runtime error: index out of range [1] with length 1)
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic({0x4f10b20, 0xc006437968})
panic({0x4f10b20, 0xc006437968})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215
panic: runtime error: index out of range [1] with length 1 [recovered]
	panic: runtime error: index out of range [1] with length 1
panic({0x4f10b20, 0xc006437968})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215

After the new kube-apiserver revision rolled out, kube-apiserver entered a crash loop with the same panic the customer reported.

---------------------- Steps to reproduce in FIXED (latest 4.11) build ----------------------

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-12-000004   True        False         4h41m   Cluster version is 4.11.0-0.nightly-2022-04-12-000004

$ oc create -f wrongapirequestcountyaml
The APIRequestCount "test-alert" is invalid: metadata.name: Invalid value: "test-alert": apirequestcount test-alert: name must be of the form 'resource.version.group'

$ oc get apirequestcount | grep "test-alert"
$

The APIRequestCount object could not be created because its name violates the required "resource.version.group" form, so the issue no longer occurs in the fixed/latest build. I then created an APIRequestCount object with a valid name, as shown below, and it worked as expected.
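The rejection message in the fixed build indicates a name check at validation time. The following is a minimal sketch of what such a check could look like, assuming core-group names such as "pods.v1" (which omit the group segment) must remain valid; the function name and exact rule are assumptions, not the actual OpenShift source:

package main

import (
	"fmt"
	"strings"
)

// validateAPIRequestCountName rejects names that cannot be split into at
// least resource and version segments, mirroring the error text above.
// Assumption: the group segment may be absent for core-group resources.
func validateAPIRequestCountName(name string) error {
	if len(strings.SplitN(name, ".", 3)) < 2 {
		return fmt.Errorf("apirequestcount %s: name must be of the form 'resource.version.group'", name)
	}
	return nil
}

func main() {
	fmt.Println(validateAPIRequestCountName("test-alert"))          // rejected
	fmt.Println(validateAPIRequestCountName("deployments.v1.apps")) // nil
}

Rejecting the object at creation time means no malformed name can ever reach the startup parsing path, regardless of when the kube-apiserver next restarts.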
$ cat apirequestcount.yaml
apiVersion: apiserver.openshift.io/v1
kind: APIRequestCount
metadata:
  name: test-alert.api.v2
spec:
  numberOfUsersToReport: 10
  groups:
  - name: test-alert-rules
    rules:
    # Alert for any instance that is unreachable for >5 minutes.
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
    # Alert for any instance that has a median request latency >1s.
    - alert: APIHighRequestLatency
      expr: api_http_request_latencies_second{quantile="0.5"} > 1
      for: 10m
      annotations:
        summary: "High request latency on {{ $labels.instance }}"
        description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"

$ oc create -f apirequestcount.yaml
apirequestcount.apiserver.openshift.io/test-alert.api.v2 created

$ oc get apirequestcount | grep test-alert.api
test-alert.api.v2

Rolled out a new kube-apiserver revision to check whether it causes any issues:

$ oc patch apiserver cluster -p '{"spec": {"audit": {"profile": "AllRequestBodies"}}}' --type merge
apiserver.config.openshift.io/cluster patched

$ oc get pods -n openshift-kube-apiserver | grep 'apiserver' | grep -v 'guard'
kube-apiserver-xxx-njk-4c7px-master-0.c.openshift-qe.internal   5/5   Running   0   2m29s
kube-apiserver-xxx-njk-4c7px-master-1.c.openshift-qe.internal   5/5   Running   0   8m11s
kube-apiserver-xxx-njk-4c7px-master-2.c.openshift-qe.internal   5/5   Running   0   5m24s

$ oc logs -n openshift-kube-apiserver kube-apiserver-xxx-njk-4c7px-master-0.c.openshift-qe.internal | grep -i panic
$

The issue is not seen with the fixed (latest 4.11) build, so the ticket state has been moved to VERIFIED.
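Independent of the validation-side fix, the startup path that previously panicked can also be written defensively, so a malformed name that somehow still exists (for example, one created before an upgrade) is skipped instead of crashing the server. The following is a sketch with illustrative names, not the actual controller code:

package main

import (
	"fmt"
	"strings"
)

// splitRequestCountName returns ok=false for names that lack a version
// segment, letting the caller log and skip the object rather than panic.
func splitRequestCountName(name string) (resource, version, group string, ok bool) {
	segments := strings.SplitN(name, ".", 3)
	if len(segments) < 2 {
		return "", "", "", false
	}
	resource, version = segments[0], segments[1]
	if len(segments) == 3 {
		group = segments[2]
	}
	return resource, version, group, true
}

func main() {
	if _, _, _, ok := splitRequestCountName("test-alert"); !ok {
		fmt.Println("skipping malformed APIRequestCount name: test-alert")
	}
}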
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069