Bug 1863011
Summary: | RBAC refuses requests at times – was: openshift-state-metrics/kube-state-metrics degraded for a few minutes | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> | |
Component: | Monitoring | Assignee: | Pawel Krupa <pkrupa> | |
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> | |
Severity: | urgent | Docs Contact: | ||
Priority: | urgent | |||
Version: | 4.6 | CC: | aivaraslaimikis, alegrand, andbartl, anpicker, aos-bugs, apjagtap, aprajapa, erooth, guilvey, jnordell, john.johansson, kakkoyun, lcosic, lszaszki, mfojtik, mloibl, naoto30, naygupta, pkrupa, rabdulra, sttts, surbania, vjaypurk, wking | |
Target Milestone: | --- | |||
Target Release: | 4.6.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1887354 (view as bug list) | Environment: | ||
Last Closed: | 2020-10-27 16:22:35 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1879019, 1879159 | |||
Bug Blocks: | 1887354 |
Description
Junqi Zhao
2020-08-03 13:29:15 UTC
Could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1861917. I haven't found anything interesting except a few entries in the audit logs. It looks like the request was rejected by ip-10-0-48-82 (KubeAPI) instance. The following entry was found in the audit log of that machine: { "kind": "Event", "apiVersion": "audit.k8s.io/v1", "level": "Metadata", "auditID": "a8a501c6-2411-4604-afb2-d9e3a1681ce4", "stage": "ResponseComplete", "requestURI": "/apis/build.openshift.io/v1/builds?resourceVersion=53019", "verb": "list", "user": { "username": "system:serviceaccount:openshift-monitoring:openshift-state-metrics", "uid": "84a4e18c-b54b-4c47-9d2c-4c22a88be696", "groups": [ "system:serviceaccounts", "system:serviceaccounts:openshift-monitoring", "system:authenticated" ] }, "sourceIPs": [ "10.0.49.232" ], "userAgent": "openshift-state-metrics/UNKNOWN (linux/amd64) kube-state-metrics/UNKNOWN", "objectRef": { "resource": "builds", "apiGroup": "build.openshift.io", "apiVersion": "v1" }, "responseStatus": { "metadata": {}, "status": "Failure", "reason": "Forbidden", "code": 403 }, "requestReceivedTimestamp": "2020-08-03T09:26:41.922395Z", "stageTimestamp": "2020-08-03T09:26:41.922666Z", "annotations": { "authentication.k8s.io/legacy-token": "system:serviceaccount:openshift-monitoring:openshift-state-metrics", "authorization.k8s.io/decision": "forbid", "authorization.k8s.io/reason": "" } } What's interesting, a few seconds later it was actually admitted: { "kind": "Event", "apiVersion": "audit.k8s.io/v1", "level": "Metadata", "auditID": "95530591-1627-43ef-b11d-f7e0eb318da4", "stage": "ResponseComplete", "requestURI": "/apis/build.openshift.io/v1/builds?resourceVersion=53019", "verb": "list", "user": { "username": "system:serviceaccount:openshift-monitoring:openshift-state-metrics", "uid": "84a4e18c-b54b-4c47-9d2c-4c22a88be696", "groups": [ "system:serviceaccounts", "system:serviceaccounts:openshift-monitoring", "system:authenticated" ] }, "sourceIPs": [ "10.0.49.232" ], "userAgent": "openshift-state-metrics/UNKNOWN (linux/amd64) kube-state-metrics/UNKNOWN", "objectRef": { "resource": "builds", "apiGroup": "build.openshift.io", "apiVersion": "v1" }, "responseStatus": { "metadata": {}, "code": 200 }, "requestReceivedTimestamp": "2020-08-03T09:26:44.613235Z", "stageTimestamp": "2020-08-03T09:26:44.618978Z", "annotations": { "authentication.k8s.io/legacy-token": "system:serviceaccount:openshift-monitoring:openshift-state-metrics", "authorization.k8s.io/decision": "allow", "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"openshift-state-metrics\" of ClusterRole \"openshift-state-metrics\" to ServiceAccount \"openshift-state-metrics/openshift-monitoring\"" } } I also found an entry suggesting it had been admitted (09:19:01) before it was rejected (09:26:41) { "kind": "Event", "apiVersion": "audit.k8s.io/v1", "level": "Metadata", "auditID": "50576d07-a133-4ba7-9be0-c3666b77490c", "stage": "ResponseComplete", "requestURI": "/apis/build.openshift.io/v1/builds?allowWatchBookmarks=true&resourceVersion=53019&timeout=7m39s&timeoutSeconds=459&watch=true", "verb": "watch", "user": { "username": "system:serviceaccount:openshift-monitoring:openshift-state-metrics", "uid": "84a4e18c-b54b-4c47-9d2c-4c22a88be696", "groups": [ "system:serviceaccounts", "system:serviceaccounts:openshift-monitoring", "system:authenticated" ] }, "sourceIPs": [ "10.0.49.232" ], "userAgent": "openshift-state-metrics/UNKNOWN (linux/amd64) kube-state-metrics/UNKNOWN", "objectRef": { "resource": "builds", "apiGroup": "build.openshift.io", "apiVersion": "v1" }, "responseStatus": { "metadata": {}, "code": 200 }, "requestReceivedTimestamp": "2020-08-03T09:19:01.932581Z", "stageTimestamp": "2020-08-03T09:26:40.942692Z", "annotations": { "authentication.k8s.io/legacy-token": "system:serviceaccount:openshift-monitoring:openshift-state-metrics", "authorization.k8s.io/decision": "allow", "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"openshift-state-metrics\" of ClusterRole \"openshift-state-metrics\" to ServiceAccount \"openshift-state-metrics/openshift-monitoring\"" } } I also found frequent entries updating the associated cluster role (openshift-state-metrics), it looks like they were updated by the cluster-monitoring-operator. One particular update was very close (09:26:38) to the time the request was rejected (09:26:41). { "kind": "Event", "apiVersion": "audit.k8s.io/v1", "level": "Metadata", "auditID": "0b6c1dfa-24d9-4a9d-8eec-df80897b7d49", "stage": "ResponseComplete", "requestURI": "/apis/rbac.authorization.k8s.io/v1/clusterroles/openshift-state-metrics", "verb": "update", "user": { "username": "system:serviceaccount:openshift-monitoring:cluster-monitoring-operator", "uid": "751dd8ce-3625-41d7-ab5f-b07af70b0719", "groups": [ "system:serviceaccounts", "system:serviceaccounts:openshift-monitoring", "system:authenticated" ] }, "sourceIPs": [ "10.0.78.5" ], "userAgent": "operator/v0.0.0 (linux/amd64) kubernetes/$Format", "objectRef": { "resource": "clusterroles", "name": "openshift-state-metrics", "apiGroup": "rbac.authorization.k8s.io", "apiVersion": "v1" }, "responseStatus": { "metadata": {}, "code": 200 }, "requestReceivedTimestamp": "2020-08-03T09:26:38.536065Z", "stageTimestamp": "2020-08-03T09:26:38.594847Z", "annotations": { "authentication.k8s.io/legacy-token": "system:serviceaccount:openshift-monitoring:cluster-monitoring-operator", "authorization.k8s.io/decision": "allow", "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"cluster-monitoring-operator\" of ClusterRole \"cluster-monitoring-operator\" to ServiceAccount \"cluster-monitoring-operator/openshift-monitoring\"" } } Increasing verb Increasing priority and renamed. We are on this. We don't know the root cause. It has blocker potential depending on the outcome. it looks like the RBAC related error is frequent during an upgrade - https://search.ci.openshift.org/?search=Failed+to+list+.*+is+forbidden%3A+.*+in+API+group&maxAge=48h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job We have opened https://github.com/openshift/origin/pull/25504 that changes the RBAC controller to output the full trace in case a request was denied. It turned out that when a request was rejected there was no matching rule found. It also turned out that cluster-monitoring-operator was hotlooping in creating and deleting a clusterrolebinding. After rechecking the attached audit logs to this BZ it looks like it also applies here. I'm assigning back to the monitoring team to fix the operator. {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"7d373e49-5735-4f14-8f7e-7abedb116f93","stage":"ResponseComplete","requestURI":"/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/openshift-state-metrics","verb":"delete","user":{"username":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","uid":"751dd8ce-3625-41d7-ab5f-b07af70b0719","groups":["system:serviceaccounts","system:serviceaccounts:openshift-monitoring","system:authenticated"]},"sourceIPs":["10.0.78.5"],"userAgent":"operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"clusterrolebindings","name":"openshift-state-metrics","apiGroup":"rbac.authorization.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-08-03T09:26:40.319157Z","stageTimestamp":"2020-08-03T09:26:40.327536Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"cluster-monitoring-operator\" of ClusterRole \"cluster-monitoring-operator\" to ServiceAccount \"cluster-monitoring-operator/openshift-monitoring\""}} {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"d0748c2f-dc2f-4176-84bb-f09f7f3a31cd","stage":"ResponseComplete","requestURI":"/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/openshift-state-metrics","verb":"delete","user":{"username":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","uid":"751dd8ce-3625-41d7-ab5f-b07af70b0719","groups":["system:serviceaccounts","system:serviceaccounts:openshift-monitoring","system:authenticated"]},"sourceIPs":["10.0.78.5"],"userAgent":"operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"clusterrolebindings","name":"openshift-state-metrics","apiGroup":"rbac.authorization.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-08-03T09:32:09.580340Z","stageTimestamp":"2020-08-03T09:32:09.589009Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"cluster-monitoring-operator\" of ClusterRole \"cluster-monitoring-operator\" to ServiceAccount \"cluster-monitoring-operator/openshift-monitoring\""}} {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"4d058af9-3369-46a3-b65f-08d329b598c0","stage":"ResponseComplete","requestURI":"/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/openshift-state-metrics","verb":"delete","user":{"username":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","uid":"751dd8ce-3625-41d7-ab5f-b07af70b0719","groups":["system:serviceaccounts","system:serviceaccounts:openshift-monitoring","system:authenticated"]},"sourceIPs":["10.0.78.5"],"userAgent":"operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"clusterrolebindings","name":"openshift-state-metrics","apiGroup":"rbac.authorization.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-08-03T09:37:37.645928Z","stageTimestamp":"2020-08-03T09:37:37.654610Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"cluster-monitoring-operator\" of ClusterRole \"cluster-monitoring-operator\" to ServiceAccount \"cluster-monitoring-operator/openshift-monitoring\""}} Thank you for the analysis, that is of great help! I believe we converged against the same hypothesis as I found the same in https://bugzilla.redhat.com/show_bug.cgi?id=1856189#c40. *** Bug 1856189 has been marked as a duplicate of this bug. *** upgraded from 4.5.12 to 4.6.0-0.nightly-2020-09-25-054214, and watched the status of monitoring cluster operator, openshift-state-metrics/kube-state-metrics did not degraded during the upgrade progress *** Bug 1832825 has been marked as a duplicate of this bug. *** @Vendanti there is no workaround and we are in the middle of backporting a fix to 4.5 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196 (In reply to Pawel Krupa from comment #22) > @Vendanti there is no workaround and we are in the middle of backporting a > fix to 4.5 Hello Pawel, Do you know if the fix backport was performed on Openshift 4.5 branch, and if yes, starting at which version? Thank you, Guillaume. |