Description of problem:

APIRequestCount names must be of the form "resource.version.group", as documented by "oc explain apirequestcount". However, nothing prevents an APIRequestCount from being created without any dot in its name. If somebody creates such an object, the code at [1] panics: it splits the name on "." and then accesses both the first and the second segment, so when there is no second segment, the index is out of range. This code appears to run only at pod startup, so triggering the panic also requires the kube-apiserver pod to restart for some reason (such as a new revision rollout performed by kube-apiserver-operator).

[1] https://github.com/openshift/kubernetes/blob/db16f7d/openshift-kube-apiserver/filters/deprecatedapirequest/apiaccess_count_controller.go#L214

Version-Release number of selected component (if applicable):
4.8.25 (but the code does not seem to have changed in recent versions)

How reproducible:
Always

Steps to Reproduce:
1. Manually create an APIRequestCount object without any dot in its name (example below)
2. Restart the kube-apiserver container

Actual results:
- The user is able to create the malformed APIRequestCount object
- The kube-apiserver panics after following the steps above

Expected results:
Either no panic, or creation of any malformed APIRequestCount rejected at admission.

Additional info:
An example of a malformed APIRequestCount object and the resulting panic follow in the comments.
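To make the failure mode concrete, here is a minimal, self-contained Go sketch of the unguarded parsing pattern described above. The function name is illustrative; this is a hypothetical reduction of the linked controller code, not a copy of it:

package main

import (
	"fmt"
	"strings"
)

// apiRequestCountToGVR splits the object name on "." and indexes the
// segments without checking how many there are -- the pattern that panics.
func apiRequestCountToGVR(name string) (resource, version string) {
	segments := strings.Split(name, ".")
	// For name "test-alert", segments has length 1, so segments[1]
	// panics with: runtime error: index out of range [1] with length 1
	return segments[0], segments[1]
}

func main() {
	fmt.Println(apiRequestCountToGVR("pods.v1"))    // fine: two segments
	fmt.Println(apiRequestCountToGVR("test-alert")) // panics at startup
}

Because this parsing happens while scanning persisted objects during startup, the panic shows up only when the kube-apiserver container restarts, which matches the reproduction steps above.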
------------------------------ Steps to reproduce in UNFIXED build ------------------------------

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-07-053433   True        False         4h25m   Cluster version is 4.11.0-0.nightly-2022-04-07-053433

Apply an APIRequestCount object with an unsupported name:

$ cat wrongapirequestcountyaml
apiVersion: apiserver.openshift.io/v1
kind: APIRequestCount
metadata:
  name: test-alert
spec:
  numberOfUsersToReport: 10
  groups:
  - name: test-alert-rules
    rules:
    # Alert for any instance that is unreachable for >5 minutes.
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
    # Alert for any instance that has a median request latency >1s.
    - alert: APIHighRequestLatency
      expr: api_http_request_latencies_second{quantile="0.5"} > 1
      for: 10m
      annotations:
        summary: "High request latency on {{ $labels.instance }}"
        description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"

$ oc create -f wrongapirequestcountyaml
apirequestcount.apiserver.openshift.io/test-alert created

$ oc get apirequestcount | grep "test-alert"
test-alert

Allow kube-apiserver to roll out a new revision:

$ oc patch apiserver cluster -p '{"spec": {"audit": {"profile": "AllRequestBodies"}}}' --type merge
apiserver.config.openshift.io/cluster patched

$ oc get pods -n openshift-kube-apiserver | grep 'apiserver' | grep -v 'guard'
kube-apiserver-ip-10-0-xxx-xxx.us-east-2.compute.internal   5/5   Running            0               133m
kube-apiserver-ip-10-0-xxx-xxx.us-east-2.compute.internal   5/5   Running            0               137m
kube-apiserver-ip-10-0-xxx-xxx.us-east-2.compute.internal   4/5   CrashLoopBackOff   27 (4m8s ago)   120m

$ oc logs -n openshift-kube-apiserver kube-apiserver-ip-10-0-xxx-xxx.us-east-2.compute.internal | grep -i panic
E0412 09:29:21.979868      16 runtime.go:78] Observed a panic: runtime.boundsError{x:1, y:1, signed:true, code:0x0} (runtime error: index out of range [1] with length 1)
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic({0x4f10b20, 0xc006437968})
panic({0x4f10b20, 0xc006437968})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215
panic: runtime error: index out of range [1] with length 1 [recovered]
	panic: runtime error: index out of range [1] with length 1
panic({0x4f10b20, 0xc006437968})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215

After the new kube-apiserver revision rolled out, kube-apiserver entered a crash loop with the same panic the customer reported.

---------------------- Steps to reproduce in FIXED (latest 4.11) build ----------------------

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-12-000004   True        False         4h41m   Cluster version is 4.11.0-0.nightly-2022-04-12-000004

$ oc create -f wrongapirequestcountyaml
The APIRequestCount "test-alert" is invalid: metadata.name: Invalid value: "test-alert": apirequestcount test-alert: name must be of the form 'resource.version.group'

$ oc get apirequestcount | grep "test-alert"
$

The APIRequestCount object could not be created because its name violates the required "resource.version.group" form, so the issue no longer occurs in the fixed/latest build. I then created an APIRequestCount object with a valid name, as shown below, and it worked as expected.
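The rejection message in the fixed build indicates a name check at validation time. The following is a minimal sketch of what such a check could look like, assuming core-group names such as "pods.v1" (which omit the group segment) must remain valid; the function name and exact rule are assumptions, not the actual OpenShift source:

package main

import (
	"fmt"
	"strings"
)

// validateAPIRequestCountName rejects names that cannot be split into at
// least resource and version segments, mirroring the error text above.
// Assumption: the group segment may be absent for core-group resources.
func validateAPIRequestCountName(name string) error {
	if len(strings.SplitN(name, ".", 3)) < 2 {
		return fmt.Errorf("apirequestcount %s: name must be of the form 'resource.version.group'", name)
	}
	return nil
}

func main() {
	fmt.Println(validateAPIRequestCountName("test-alert"))          // rejected
	fmt.Println(validateAPIRequestCountName("deployments.v1.apps")) // nil
}

Rejecting the object at creation time means no malformed name can ever reach the startup parsing path, regardless of when the kube-apiserver next restarts.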
$ cat apirequestcount.yaml
apiVersion: apiserver.openshift.io/v1
kind: APIRequestCount
metadata:
  name: test-alert.api.v2
spec:
  numberOfUsersToReport: 10
  groups:
  - name: test-alert-rules
    rules:
    # Alert for any instance that is unreachable for >5 minutes.
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
    # Alert for any instance that has a median request latency >1s.
    - alert: APIHighRequestLatency
      expr: api_http_request_latencies_second{quantile="0.5"} > 1
      for: 10m
      annotations:
        summary: "High request latency on {{ $labels.instance }}"
        description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"

$ oc create -f apirequestcount.yaml
apirequestcount.apiserver.openshift.io/test-alert.api.v2 created

$ oc get apirequestcount | grep test-alert.api
test-alert.api.v2

Rolled out a new kube-apiserver revision to check whether it causes any issues:

$ oc patch apiserver cluster -p '{"spec": {"audit": {"profile": "AllRequestBodies"}}}' --type merge
apiserver.config.openshift.io/cluster patched

$ oc get pods -n openshift-kube-apiserver | grep 'apiserver' | grep -v 'guard'
kube-apiserver-xxx-njk-4c7px-master-0.c.openshift-qe.internal   5/5   Running   0   2m29s
kube-apiserver-xxx-njk-4c7px-master-1.c.openshift-qe.internal   5/5   Running   0   8m11s
kube-apiserver-xxx-njk-4c7px-master-2.c.openshift-qe.internal   5/5   Running   0   5m24s

$ oc logs -n openshift-kube-apiserver kube-apiserver-xxx-njk-4c7px-master-0.c.openshift-qe.internal | grep -i panic
$

The issue is not seen with the fixed (latest 4.11) build, so the ticket state has been moved to VERIFIED.
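Independent of the validation-side fix, the startup path that previously panicked can also be written defensively, so a malformed name that somehow still exists (for example, one created before an upgrade) is skipped instead of crashing the server. The following is a sketch with illustrative names, not the actual controller code:

package main

import (
	"fmt"
	"strings"
)

// splitRequestCountName returns ok=false for names that lack a version
// segment, letting the caller log and skip the object rather than panic.
func splitRequestCountName(name string) (resource, version, group string, ok bool) {
	segments := strings.SplitN(name, ".", 3)
	if len(segments) < 2 {
		return "", "", "", false
	}
	resource, version = segments[0], segments[1]
	if len(segments) == 3 {
		group = segments[2]
	}
	return resource, version, group, true
}

func main() {
	if _, _, _, ok := splitRequestCountName("test-alert"); !ok {
		fmt.Println("skipping malformed APIRequestCount name: test-alert")
	}
}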
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069