Bug 2051985
| Summary: | An APIRequestCount without dots in the name can cause a panic | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Pablo Alonso Rodriguez <palonsor> |
| Component: | kube-apiserver | Assignee: | Luis Sanchez <sanchezl> |
| Status: | CLOSED ERRATA | QA Contact: | jmekkatt |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 4.8 | CC: | akashem, andbartl, aos-bugs, jmekkatt, mfojtik, rsandu, sanchezl, snetting, xxia |
| Target Milestone: | --- | ||
| Target Release: | 4.11.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | No Doc Update | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-08-10 10:47:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2074094 | ||
Description
Pablo Alonso Rodriguez
2022-02-08 13:34:47 UTC
------------------------------Steps to reproduce in UNFIXED Build------------------------
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-04-07-053433 True False 4h25m Cluster version is 4.11.0-0.nightly-2022-04-07-053433
Apply the apirequestcount object with an unsupported name.
$ cat wrongapirequestcountyaml
apiVersion: apiserver.openshift.io/v1
kind: APIRequestCount
metadata:
  name: test-alert
spec:
  numberOfUsersToReport: 10
  groups:
  - name: test-alert-rules
    rules:
    # Alert for any instance that is unreachable for >5 minutes.
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
    # Alert for any instance that has a median request latency >1s.
    - alert: APIHighRequestLatency
      expr: api_http_request_latencies_second{quantile="0.5"} > 1
      for: 10m
      annotations:
        summary: "High request latency on {{ $labels.instance }}"
        description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
$ oc create -f wrongapirequestcountyaml
apirequestcount.apiserver.openshift.io/test-alert created
$ oc get apirequestcount | grep "test-alert"
test-alert
Allow kube-apiserver to roll out a new revision.
$ oc patch apiserver cluster -p '{"spec": {"audit": {"profile": "AllRequestBodies"}}}' --type merge
apiserver.config.openshift.io/cluster patched
$ oc get pods -n openshift-kube-apiserver | grep 'apiserver' | grep -v 'guard'
kube-apiserver-ip-10-0-xxx-xxx.us-east-2.compute.internal 5/5 Running 0 133m
kube-apiserver-ip-10-0-xxx-xxx.us-east-2.compute.internal 5/5 Running 0 137m
kube-apiserver-ip-10-0-xxx-xxx.us-east-2.compute.internal 4/5 CrashLoopBackOff 27 (4m8s ago) 120m
$ oc logs -n openshift-kube-apiserver kube-apiserver-ip-10-0-xxx-xxx.us-east-2.compute.internal | grep -i panic
E0412 09:29:21.979868 16 runtime.go:78] Observed a panic: runtime.boundsError{x:1, y:1, signed:true, code:0x0} (runtime error: index out of range [1] with length 1)
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic({0x4f10b20, 0xc006437968})
panic({0x4f10b20, 0xc006437968})
/usr/lib/golang/src/runtime/panic.go:1038 +0x215
panic: runtime error: index out of range [1] with length 1 [recovered]
panic: runtime error: index out of range [1] with length 1
panic({0x4f10b20, 0xc006437968})
/usr/lib/golang/src/runtime/panic.go:1038 +0x215
kube-apiserver went into a crash loop with the same panic that was reported on the customer side, after a new revision of kube-apiserver rolled out.
---------------------- Steps to reproduce in Fixed (latest 4.11) build---------------------------
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-04-12-000004 True False 4h41m Cluster version is 4.11.0-0.nightly-2022-04-12-000004
$ oc create -f wrongapirequestcountyaml
The APIRequestCount "test-alert" is invalid: metadata.name: Invalid value: "test-alert": apirequestcount test-alert: name must be of the form 'resource.version.group'
$ oc get apirequestcount | grep "test-alert"
$
The "apirequestcount" object could not be created because its name violates the form "resource.version.group", so the issue does not occur in the fixed/latest build.
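The rejection above can be pictured as a name check that runs before the object is stored. The following is only a sketch under assumptions: `parseName` is a hypothetical function, it takes the form 'resource.version.group' from the error message literally (three non-empty segments, with any further dots kept in the group), and the actual validation shipped in the fix may accept other forms.

```go
package main

import (
	"fmt"
	"strings"
)

// parseName is a hypothetical validator, not the shipped fix. It splits the
// name into at most three parts so that a dotted group such as
// "apiserver.openshift.io" stays intact in the last part.
func parseName(name string) (resource, version, group string, err error) {
	parts := strings.SplitN(name, ".", 3)
	if len(parts) != 3 {
		return "", "", "", fmt.Errorf("apirequestcount %s: name must be of the form 'resource.version.group'", name)
	}
	for _, p := range parts {
		if p == "" {
			return "", "", "", fmt.Errorf("apirequestcount %s: name must be of the form 'resource.version.group'", name)
		}
	}
	return parts[0], parts[1], parts[2], nil
}

func main() {
	for _, name := range []string{"test-alert", "test-alert.api.v2"} {
		resource, version, group, err := parseName(name)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("accepted: resource=%s version=%s group=%s\n", resource, version, group)
	}
}
```

Checking the number of parts before using them returns an error for "test-alert" instead of reaching an out-of-range index, and "test-alert.api.v2" passes because it has three segments.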
I then created an "apirequestcount" object with a valid name, as below, and it worked as expected.
$ cat apirequestcount.yaml
apiVersion: apiserver.openshift.io/v1
kind: APIRequestCount
metadata:
  name: test-alert.api.v2
spec:
  numberOfUsersToReport: 10
  groups:
  - name: test-alert-rules
    rules:
    # Alert for any instance that is unreachable for >5 minutes.
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
    # Alert for any instance that has a median request latency >1s.
    - alert: APIHighRequestLatency
      expr: api_http_request_latencies_second{quantile="0.5"} > 1
      for: 10m
      annotations:
        summary: "High request latency on {{ $labels.instance }}"
        description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
$ oc create -f apirequestcount.yaml
apirequestcount.apiserver.openshift.io/test-alert.api.v2 created
$ oc get apirequestcount | grep test-alert.api
test-alert.api.v2
Allowed kube-apiserver to roll out a new revision to see whether that causes issues.
$ oc patch apiserver cluster -p '{"spec": {"audit": {"profile": "AllRequestBodies"}}}' --type merge
apiserver.config.openshift.io/cluster patched
$ oc get pods -n openshift-kube-apiserver | grep 'apiserver' | grep -v 'guard'
kube-apiserver-xxx-njk-4c7px-master-0.c.openshift-qe.internal 5/5 Running 0 2m29s
kube-apiserver-xxx-njk-4c7px-master-1.c.openshift-qe.internal 5/5 Running 0 8m11s
kube-apiserver-xxx-njk-4c7px-master-2.c.openshift-qe.internal 5/5 Running 0 5m24s
$ oc logs -n openshift-kube-apiserver kube-apiserver-xxx-njk-4c7px-master-0.c.openshift-qe.internal | grep -i panic
$
Hence the issue is not seen with the fixed (latest 4.11 build) version; moved ticket state to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069