Description of problem:
Upgrading from 4.5.0-0.nightly-2020-08-01-204100 to 4.6.0-0.nightly-2020-08-02-091622, the monitoring cluster operator reports Degraded (UpdatingopenshiftStateMetricsFailed) for a few minutes, then returns to normal.

# oc get co/monitoring -oyaml
...
  - lastTransitionTime: "2020-08-03T09:43:04Z"
    message: 'Failed to rollout the stack. Error: running task Updating openshift-state-metrics failed: reconciling openshift-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics: current generation 24, observed generation 23'
    reason: UpdatingopenshiftStateMetricsFailed
    status: "True"
    type: Degraded
...

# oc -n openshift-monitoring logs openshift-state-metrics-7844b4bcf-mx77n -c openshift-state-metrics
2020-08-03T08:57:24.575024913Z I0803 08:57:24.568691 1 main.go:66] Using default collectors
2020-08-03T08:57:24.575024913Z I0803 08:57:24.568788 1 main.go:73] Using all namespace
2020-08-03T08:57:24.575024913Z I0803 08:57:24.568796 1 main.go:89] metric white-blacklisting: blacklisting the following items:
2020-08-03T08:57:24.579985472Z W0803 08:57:24.575364 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2020-08-03T08:57:24.579985472Z I0803 08:57:24.576669 1 main.go:114] Starting openshift-state-metrics self metrics server: 127.0.0.1:8082
2020-08-03T08:57:24.588194091Z W0803 08:57:24.580645 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2020-08-03T08:57:24.589022873Z W0803 08:57:24.588597 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2020-08-03T08:57:24.602347291Z W0803 08:57:24.591988 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2020-08-03T08:57:24.602347291Z W0803 08:57:24.592766 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2020-08-03T08:57:24.602347291Z W0803 08:57:24.594003 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2020-08-03T08:57:24.602347291Z I0803 08:57:24.594704 1 builder.go:105] Active collectors: buildconfigs,builds,clusterresourcequotas,deploymentConfigs,groups,routes
2020-08-03T08:57:24.602347291Z I0803 08:57:24.594718 1 main.go:140] Starting metrics server: 127.0.0.1:8081
2020-08-03T09:26:40.944140352Z E0803 09:26:40.944105 1 reflector.go:382] github.com/openshift/openshift-state-metrics/pkg/collectors/builder.go:228: Failed to watch *v1.Build: unknown (get builds.build.openshift.io)
2020-08-03T09:26:41.923186249Z E0803 09:26:41.923153 1 reflector.go:178] github.com/openshift/openshift-state-metrics/pkg/collectors/builder.go:228: Failed to list *v1.Build: builds.build.openshift.io is forbidden: User "system:serviceaccount:openshift-monitoring:openshift-state-metrics" cannot list resource "builds" in API group "build.openshift.io" at the cluster scope
...
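For context on the Degraded condition: "current generation 24, observed generation 23" is the operator's rollout wait reporting that the Deployment controller had not yet observed the newest spec. A minimal client-go sketch of that kind of check (illustrative only: the function name, polling intervals, and structure are assumptions, not the actual cluster-monitoring-operator code):

package example

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDeploymentRollout polls until the Deployment controller has observed
// the newest spec and all replicas are updated and available. A mismatch such
// as "current generation 24, observed generation 23" means the first check
// below is still failing: the controller has not caught up with the latest
// spec change yet.
func waitForDeploymentRollout(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	return wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
		d, err := c.AppsV1().Deployments(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if d.Generation != d.Status.ObservedGeneration {
			return false, nil // controller still catching up with the spec
		}
		done := d.Spec.Replicas != nil &&
			d.Status.UpdatedReplicas == *d.Spec.Replicas &&
			d.Status.AvailableReplicas == *d.Spec.Replicas
		return done, nil
	})
}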
Checked the events; the former openshift-state-metrics-64fd69bb9c-qmc9k pod's event:

involvedObject:
  apiVersion: v1
  kind: Pod
  name: openshift-state-metrics-64fd69bb9c-qmc9k
  namespace: openshift-monitoring
  resourceVersion: "13688"
  uid: f6909145-f86e-401f-b821-e860e5c1f9f8
kind: Event
lastTimestamp: null
message: '0/4 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn''t tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn''t tolerate.'
metadata:

Version-Release number of selected component (if applicable):
upgrade from 4.5.0-0.nightly-2020-08-01-204100 to 4.6.0-0.nightly-2020-08-02-091622

How reproducible:
a few times

Steps to Reproduce:
1. See the description
2.
3.

Actual results:

Expected results:

Additional info:
Could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1861917.
I haven't found anything interesting except a few entries in the audit logs. It looks like the request was rejected by the ip-10-0-48-82 (KubeAPI) instance. The following entry was found in the audit log of that machine:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "a8a501c6-2411-4604-afb2-d9e3a1681ce4",
  "stage": "ResponseComplete",
  "requestURI": "/apis/build.openshift.io/v1/builds?resourceVersion=53019",
  "verb": "list",
  "user": {
    "username": "system:serviceaccount:openshift-monitoring:openshift-state-metrics",
    "uid": "84a4e18c-b54b-4c47-9d2c-4c22a88be696",
    "groups": ["system:serviceaccounts", "system:serviceaccounts:openshift-monitoring", "system:authenticated"]
  },
  "sourceIPs": ["10.0.49.232"],
  "userAgent": "openshift-state-metrics/UNKNOWN (linux/amd64) kube-state-metrics/UNKNOWN",
  "objectRef": {"resource": "builds", "apiGroup": "build.openshift.io", "apiVersion": "v1"},
  "responseStatus": {"metadata": {}, "status": "Failure", "reason": "Forbidden", "code": 403},
  "requestReceivedTimestamp": "2020-08-03T09:26:41.922395Z",
  "stageTimestamp": "2020-08-03T09:26:41.922666Z",
  "annotations": {
    "authentication.k8s.io/legacy-token": "system:serviceaccount:openshift-monitoring:openshift-state-metrics",
    "authorization.k8s.io/decision": "forbid",
    "authorization.k8s.io/reason": ""
  }
}

Interestingly, a few seconds later the same request was admitted:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "95530591-1627-43ef-b11d-f7e0eb318da4",
  "stage": "ResponseComplete",
  "requestURI": "/apis/build.openshift.io/v1/builds?resourceVersion=53019",
  "verb": "list",
  "user": {
    "username": "system:serviceaccount:openshift-monitoring:openshift-state-metrics",
    "uid": "84a4e18c-b54b-4c47-9d2c-4c22a88be696",
    "groups": ["system:serviceaccounts", "system:serviceaccounts:openshift-monitoring", "system:authenticated"]
  },
  "sourceIPs": ["10.0.49.232"],
  "userAgent": "openshift-state-metrics/UNKNOWN (linux/amd64) kube-state-metrics/UNKNOWN",
  "objectRef": {"resource": "builds", "apiGroup": "build.openshift.io", "apiVersion": "v1"},
  "responseStatus": {"metadata": {}, "code": 200},
  "requestReceivedTimestamp": "2020-08-03T09:26:44.613235Z",
  "stageTimestamp": "2020-08-03T09:26:44.618978Z",
  "annotations": {
    "authentication.k8s.io/legacy-token": "system:serviceaccount:openshift-monitoring:openshift-state-metrics",
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"openshift-state-metrics\" of ClusterRole \"openshift-state-metrics\" to ServiceAccount \"openshift-state-metrics/openshift-monitoring\""
  }
}

I also found an entry suggesting it had been admitted (09:19:01) before it was rejected (09:26:41):

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "50576d07-a133-4ba7-9be0-c3666b77490c",
  "stage": "ResponseComplete",
  "requestURI": "/apis/build.openshift.io/v1/builds?allowWatchBookmarks=true&resourceVersion=53019&timeout=7m39s&timeoutSeconds=459&watch=true",
  "verb": "watch",
  "user": {
    "username": "system:serviceaccount:openshift-monitoring:openshift-state-metrics",
    "uid": "84a4e18c-b54b-4c47-9d2c-4c22a88be696",
    "groups": ["system:serviceaccounts", "system:serviceaccounts:openshift-monitoring", "system:authenticated"]
  },
  "sourceIPs": ["10.0.49.232"],
  "userAgent": "openshift-state-metrics/UNKNOWN (linux/amd64) kube-state-metrics/UNKNOWN",
  "objectRef": {"resource": "builds", "apiGroup": "build.openshift.io", "apiVersion": "v1"},
  "responseStatus": {"metadata": {}, "code": 200},
  "requestReceivedTimestamp": "2020-08-03T09:19:01.932581Z",
  "stageTimestamp": "2020-08-03T09:26:40.942692Z",
  "annotations": {
    "authentication.k8s.io/legacy-token": "system:serviceaccount:openshift-monitoring:openshift-state-metrics",
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"openshift-state-metrics\" of ClusterRole \"openshift-state-metrics\" to ServiceAccount \"openshift-state-metrics/openshift-monitoring\""
  }
}
I also found frequent entries updating the associated ClusterRole (openshift-state-metrics); they appear to come from the cluster-monitoring-operator. One particular update was very close (09:26:38) to the time the request was rejected (09:26:41).

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "0b6c1dfa-24d9-4a9d-8eec-df80897b7d49",
  "stage": "ResponseComplete",
  "requestURI": "/apis/rbac.authorization.k8s.io/v1/clusterroles/openshift-state-metrics",
  "verb": "update",
  "user": {
    "username": "system:serviceaccount:openshift-monitoring:cluster-monitoring-operator",
    "uid": "751dd8ce-3625-41d7-ab5f-b07af70b0719",
    "groups": ["system:serviceaccounts", "system:serviceaccounts:openshift-monitoring", "system:authenticated"]
  },
  "sourceIPs": ["10.0.78.5"],
  "userAgent": "operator/v0.0.0 (linux/amd64) kubernetes/$Format",
  "objectRef": {"resource": "clusterroles", "name": "openshift-state-metrics", "apiGroup": "rbac.authorization.k8s.io", "apiVersion": "v1"},
  "responseStatus": {"metadata": {}, "code": 200},
  "requestReceivedTimestamp": "2020-08-03T09:26:38.536065Z",
  "stageTimestamp": "2020-08-03T09:26:38.594847Z",
  "annotations": {
    "authentication.k8s.io/legacy-token": "system:serviceaccount:openshift-monitoring:cluster-monitoring-operator",
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"cluster-monitoring-operator\" of ClusterRole \"cluster-monitoring-operator\" to ServiceAccount \"cluster-monitoring-operator/openshift-monitoring\""
  }
}
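To find such correlations systematically rather than by eye, the audit log can be scanned for the suspect user and the decision timeline printed. A small self-contained sketch (the hard-coded username mirrors the entries above; this is an illustrative helper, not a supported tool):

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// auditEntry captures just the fields we need from audit.k8s.io/v1 events.
type auditEntry struct {
	RequestReceivedTimestamp string `json:"requestReceivedTimestamp"`
	Verb                     string `json:"verb"`
	User                     struct {
		Username string `json:"username"`
	} `json:"user"`
	ObjectRef struct {
		Resource string `json:"resource"`
		Name     string `json:"name"`
	} `json:"objectRef"`
	Annotations map[string]string `json:"annotations"`
}

// Usage: go run audit-timeline.go < audit.log
func main() {
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // audit lines can be long
	for sc.Scan() {
		var e auditEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue // skip lines that are not audit JSON
		}
		if e.User.Username != "system:serviceaccount:openshift-monitoring:openshift-state-metrics" {
			continue
		}
		fmt.Printf("%s %s %s -> %s\n", e.RequestReceivedTimestamp, e.Verb,
			e.ObjectRef.Resource, e.Annotations["authorization.k8s.io/decision"])
	}
}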
Increasing verb
Increasing priority and renaming. We are on this; we don't know the root cause yet. It has blocker potential depending on the outcome.
It looks like the RBAC-related error is frequent during upgrades: https://search.ci.openshift.org/?search=Failed+to+list+.*+is+forbidden%3A+.*+in+API+group&maxAge=48h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
We have opened https://github.com/openshift/origin/pull/25504, which changes the RBAC controller to output the full trace when a request is denied. It turned out that when a request was rejected, no matching rule was found. It also turned out that cluster-monitoring-operator was hotlooping, creating and deleting a clusterrolebinding. After rechecking the audit logs attached to this BZ, it looks like the same applies here. I'm assigning back to the monitoring team to fix the operator (a sketch of an in-place reconcile that would avoid the delete/recreate window follows the audit entries below).

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"7d373e49-5735-4f14-8f7e-7abedb116f93","stage":"ResponseComplete","requestURI":"/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/openshift-state-metrics","verb":"delete","user":{"username":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","uid":"751dd8ce-3625-41d7-ab5f-b07af70b0719","groups":["system:serviceaccounts","system:serviceaccounts:openshift-monitoring","system:authenticated"]},"sourceIPs":["10.0.78.5"],"userAgent":"operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"clusterrolebindings","name":"openshift-state-metrics","apiGroup":"rbac.authorization.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-08-03T09:26:40.319157Z","stageTimestamp":"2020-08-03T09:26:40.327536Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"cluster-monitoring-operator\" of ClusterRole \"cluster-monitoring-operator\" to ServiceAccount \"cluster-monitoring-operator/openshift-monitoring\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"d0748c2f-dc2f-4176-84bb-f09f7f3a31cd","stage":"ResponseComplete","requestURI":"/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/openshift-state-metrics","verb":"delete","user":{"username":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","uid":"751dd8ce-3625-41d7-ab5f-b07af70b0719","groups":["system:serviceaccounts","system:serviceaccounts:openshift-monitoring","system:authenticated"]},"sourceIPs":["10.0.78.5"],"userAgent":"operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"clusterrolebindings","name":"openshift-state-metrics","apiGroup":"rbac.authorization.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-08-03T09:32:09.580340Z","stageTimestamp":"2020-08-03T09:32:09.589009Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"cluster-monitoring-operator\" of ClusterRole \"cluster-monitoring-operator\" to ServiceAccount \"cluster-monitoring-operator/openshift-monitoring\""}}
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"4d058af9-3369-46a3-b65f-08d329b598c0","stage":"ResponseComplete","requestURI":"/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/openshift-state-metrics","verb":"delete","user":{"username":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","uid":"751dd8ce-3625-41d7-ab5f-b07af70b0719","groups":["system:serviceaccounts","system:serviceaccounts:openshift-monitoring","system:authenticated"]},"sourceIPs":["10.0.78.5"],"userAgent":"operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"clusterrolebindings","name":"openshift-state-metrics","apiGroup":"rbac.authorization.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-08-03T09:37:37.645928Z","stageTimestamp":"2020-08-03T09:37:37.654610Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"cluster-monitoring-operator\" of ClusterRole \"cluster-monitoring-operator\" to ServiceAccount \"cluster-monitoring-operator/openshift-monitoring\""}}
Thank you for the analysis, that is of great help! I believe we converged on the same hypothesis, as I found the same in https://bugzilla.redhat.com/show_bug.cgi?id=1856189#c40.
*** Bug 1856189 has been marked as a duplicate of this bug. ***
Upgraded from 4.5.12 to 4.6.0-0.nightly-2020-09-25-054214 and watched the status of the monitoring cluster operator; openshift-state-metrics/kube-state-metrics did not degrade during the upgrade process.
*** Bug 1832825 has been marked as a duplicate of this bug. ***
@Vendanti there is no workaround and we are in the middle of backporting a fix to 4.5
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
(In reply to Pawel Krupa from comment #22)
> @Vendanti there is no workaround and we are in the middle of backporting a
> fix to 4.5

Hello Pawel,

Do you know if the fix backport was performed on the OpenShift 4.5 branch, and if yes, starting at which version?

Thank you,
Guillaume.