Bug 2051985

Summary: An APIRequestCount without dots in the name can cause a panic
Product: OpenShift Container Platform
Reporter: Pablo Alonso Rodriguez <palonsor>
Component: kube-apiserver
Assignee: Luis Sanchez <sanchezl>
Status: CLOSED ERRATA
QA Contact: jmekkatt
Severity: medium
Priority: medium
Version: 4.8
CC: akashem, andbartl, aos-bugs, jmekkatt, mfojtik, rsandu, sanchezl, snetting, xxia
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2022-08-10 10:47:58 UTC
Bug Blocks: 2074094

Description Pablo Alonso Rodriguez 2022-02-08 13:34:47 UTC
Description of problem:

APIRequestCount names must be of the form "resource.version.group" (for example, "deployments.v1.apps"), as documented by "oc explain apirequestcount". However, nothing prevents them from being created without any dot in the name.

If somebody creates an apirequestcount without dots in the name, the code at [1] panics: it splits the name on "." and unconditionally accesses both the first and second segments, so a name with no dots yields only one segment and the index access panics.

This code path seems to run only at pod startup, so triggering the panic also requires the kube-apiserver pod to restart for some reason (such as a new revision rollout performed by the kube-apiserver-operator).

[1] - https://github.com/openshift/kubernetes/blob/db16f7d/openshift-kube-apiserver/filters/deprecatedapirequest/apiaccess_count_controller.go#L214
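
For illustration, here is a minimal, self-contained Go sketch of the failure mode. The function apiRequestCountKey is hypothetical and only mimics the problematic pattern; the real logic is in the apiaccess_count_controller.go file linked above:

package main

import (
	"fmt"
	"strings"
)

// apiRequestCountKey mimics the problematic pattern: the name is split on
// "." and the first and second segments are accessed unconditionally.
// A name without dots splits into a single element, so segments[1] fails
// with "index out of range [1] with length 1", matching the panic in the
// logs below.
func apiRequestCountKey(name string) (resource, version string) {
	segments := strings.SplitN(name, ".", 3)
	return segments[0], segments[1] // panics when the name has no dots
}

func main() {
	fmt.Println(apiRequestCountKey("deployments.v1.apps")) // fine
	fmt.Println(apiRequestCountKey("test-alert"))          // panics
}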

Version-Release number of selected component (if applicable):

4.8.25 (but the code does not seem to have changed in recent versions)

How reproducible:

Always

Steps to Reproduce:
1. Manually create an APIRequestCount object without any dot in the name (example below)
2. Restart kube-apiserver container


Actual results:

- The user is able to create the invalid APIRequestCount object
- The kube-apiserver panics after following the steps above

Expected results:

Either the kube-apiserver should not panic, or the creation of any invalid APIRequestCount should be rejected at the admission phase.
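
For illustration, a minimal Go sketch of the kind of admission-time name check that would reject such objects. The helper validateName is hypothetical, not the actual OpenShift validation code; the error wording mirrors the message the fixed build emits (see comment 6 below):

package main

import (
	"fmt"
	"strings"
)

// validateName sketches an admission-time check: an APIRequestCount name
// must have the form "resource.version.group", i.e. at least three
// dot-separated segments (the group itself may contain further dots,
// e.g. "routes.v1.route.openshift.io").
func validateName(name string) error {
	if len(strings.SplitN(name, ".", 3)) < 3 {
		return fmt.Errorf("apirequestcount %s: name must be of the form 'resource.version.group'", name)
	}
	return nil
}

func main() {
	fmt.Println(validateName("test-alert"))          // rejected with an error
	fmt.Println(validateName("deployments.v1.apps")) // accepted: prints <nil>
}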

Additional info:

An example of an invalid APIRequestCount object and the resulting panic follow in the comments.

Comment 6 jmekkatt 2022-04-12 09:53:28 UTC
------------------------------Steps to reproduce in UNFIXED Build------------------------
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-07-053433   True        False         4h25m   Cluster version is 4.11.0-0.nightly-2022-04-07-053433

Apply the apirequestcount object with an unsupported name.

$ cat wrongapirequestcountyaml 
apiVersion: apiserver.openshift.io/v1
kind: APIRequestCount
metadata:
  name: test-alert
spec:
  numberOfUsersToReport: 10
  groups:
  - name: test-alert-rules
    rules:
    # Alert for any instance that is unreachable for >5 minutes.
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
    # Alert for any instance that has a median request latency >1s.
    - alert: APIHighRequestLatency
      expr: api_http_request_latencies_second{quantile="0.5"} > 1
      for: 10m
      annotations:
        summary: "High request latency on {{ $labels.instance }}"
        description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
$ oc create -f wrongapirequestcountyaml 
apirequestcount.apiserver.openshift.io/test-alert created

$ oc get apirequestcount | grep "test-alert"
test-alert     

Allow the kube-apiserver to roll out a new revision:
$ oc patch apiserver cluster -p '{"spec": {"audit": {"profile": "AllRequestBodies"}}}' --type merge
apiserver.config.openshift.io/cluster patched

$ oc get pods -n openshift-kube-apiserver | grep 'apiserver' | grep -v 'guard'
kube-apiserver-ip-10-0-xxx-xxx.us-east-2.compute.internal         5/5     Running            0               133m
kube-apiserver-ip-10-0-xxx-xxx.us-east-2.compute.internal         5/5     Running            0               137m
kube-apiserver-ip-10-0-xxx-xxx.us-east-2.compute.internal         4/5     CrashLoopBackOff   27 (4m8s ago)   120m

$ oc logs -n openshift-kube-apiserver  kube-apiserver-ip-10-0-xxx-xxx.us-east-2.compute.internal | grep -i panic

E0412 09:29:21.979868      16 runtime.go:78] Observed a panic: runtime.boundsError{x:1, y:1, signed:true, code:0x0} (runtime error: index out of range [1] with length 1)
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic({0x4f10b20, 0xc006437968})
panic({0x4f10b20, 0xc006437968})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215
panic: runtime error: index out of range [1] with length 1 [recovered]
	panic: runtime error: index out of range [1] with length 1
panic({0x4f10b20, 0xc006437968})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215

The kube-apiserver was in a crash loop with the panic error the customer reported, after the new kube-apiserver revision rolled out.

---------------------- Steps to reproduce in Fixed (latest 4.11) build---------------------------

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-12-000004   True        False         4h41m   Cluster version is 4.11.0-0.nightly-2022-04-12-000004

$ oc create -f wrongapirequestcountyaml 
The APIRequestCount "test-alert" is invalid: metadata.name: Invalid value: "test-alert": apirequestcount test-alert: name must be of the form 'resource.version.group'

$ oc get apirequestcount | grep "test-alert"
$ 

"apirequestcount" object was unable to create as it violates the name form "resource.version.group" and hence the issue is not happening with in fixed/latest build.
I have tried to create "apiresourcecount" object with valid name as below and worked as expected.

$ cat apirequestcount.yaml 
apiVersion: apiserver.openshift.io/v1
kind: APIRequestCount
metadata:
  name: test-alert.api.v2
spec:
  numberOfUsersToReport: 10
  groups:
  - name: test-alert-rules
    rules:
    # Alert for any instance that is unreachable for >5 minutes.
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
    # Alert for any instance that has a median request latency >1s.
    - alert: APIHighRequestLatency
      expr: api_http_request_latencies_second{quantile="0.5"} > 1
      for: 10m
      annotations:
        summary: "High request latency on {{ $labels.instance }}"
        description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"


$ oc create -f apirequestcount.yaml 
apirequestcount.apiserver.openshift.io/test-alert.api.v2 created

$ oc get apirequestcount | grep  test-alert.api
test-alert.api.v2                                         

Allowed the kube-apiserver to roll out a new revision to see whether that causes issues.

$ oc patch apiserver cluster -p '{"spec": {"audit": {"profile": "AllRequestBodies"}}}' --type merge
apiserver.config.openshift.io/cluster patched

$ oc get pods -n openshift-kube-apiserver | grep 'apiserver' | grep -v 'guard'
kube-apiserver-xxx-njk-4c7px-master-0.c.openshift-qe.internal         5/5     Running     0          2m29s
kube-apiserver-xxx-njk-4c7px-master-1.c.openshift-qe.internal         5/5     Running     0          8m11s
kube-apiserver-xxx-njk-4c7px-master-2.c.openshift-qe.internal         5/5     Running     0          5m24s

$ oc logs -n openshift-kube-apiserver  kube-apiserver-xxx-njk-4c7px-master-0.c.openshift-qe.internal | grep -i panic
$ 

Hence the issue is not seen with the fixed (latest 4.11 build) version, so the ticket state is moved to VERIFIED.

Comment 14 errata-xmlrpc 2022-08-10 10:47:58 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069