Bug 1863011 - RBAC refuses requests at times – was: openshift-state-metrics/kube-state-metrics degraded for a few minutes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Pawel Krupa
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1832825 1856189 (view as bug list)
Depends On: 1879019 1879159
Blocks: 1887354
 
Reported: 2020-08-03 13:29 UTC by Junqi Zhao
Modified: 2020-12-15 18:17 UTC
CC: 24 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1887354 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:22:35 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 932 0 None closed Bug 1863011: Fix checking if ClusterRoleBindings need update 2021-02-19 19:06:20 UTC
Red Hat Knowledge Base (Solution) 5582161 0 None None None 2020-11-18 08:13:45 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:22:58 UTC

Description Junqi Zhao 2020-08-03 13:29:15 UTC
Description of problem:
After upgrading from 4.5.0-0.nightly-2020-08-01-204100 to 4.6.0-0.nightly-2020-08-02-091622, the monitoring cluster operator is degraded (UpdatingopenshiftStateMetricsFailed) for a few minutes, then returns to normal.
# oc get co/monitoring -oyaml
...
  - lastTransitionTime: "2020-08-03T09:43:04Z"
    message: 'Failed to rollout the stack. Error: running task Updating openshift-state-metrics
      failed: reconciling openshift-state-metrics Deployment failed: updating Deployment
      object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics:
      current generation 24, observed generation 23'
    reason: UpdatingopenshiftStateMetricsFailed
    status: "True"
    type: Degraded
...
# oc -n openshift-monitoring logs openshift-state-metrics-7844b4bcf-mx77n -c openshift-state-metrics
2020-08-03T08:57:24.575024913Z I0803 08:57:24.568691       1 main.go:66] Using default collectors
2020-08-03T08:57:24.575024913Z I0803 08:57:24.568788       1 main.go:73] Using all namespace
2020-08-03T08:57:24.575024913Z I0803 08:57:24.568796       1 main.go:89] metric white- blacklisting: blacklisting the following items: 
2020-08-03T08:57:24.579985472Z W0803 08:57:24.575364       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2020-08-03T08:57:24.579985472Z I0803 08:57:24.576669       1 main.go:114] Starting openshift-state-metrics self metrics server: 127.0.0.1:8082
2020-08-03T08:57:24.588194091Z W0803 08:57:24.580645       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2020-08-03T08:57:24.589022873Z W0803 08:57:24.588597       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2020-08-03T08:57:24.602347291Z W0803 08:57:24.591988       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2020-08-03T08:57:24.602347291Z W0803 08:57:24.592766       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2020-08-03T08:57:24.602347291Z W0803 08:57:24.594003       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2020-08-03T08:57:24.602347291Z I0803 08:57:24.594704       1 builder.go:105] Active collectors: buildconfigs,builds,clusterresourcequotas,deploymentConfigs,groups,routes
2020-08-03T08:57:24.602347291Z I0803 08:57:24.594718       1 main.go:140] Starting metrics server: 127.0.0.1:8081
2020-08-03T09:26:40.944140352Z E0803 09:26:40.944105       1 reflector.go:382] github.com/openshift/openshift-state-metrics/pkg/collectors/builder.go:228: Failed to watch *v1.Build: unknown (get builds.build.openshift.io)
2020-08-03T09:26:41.923186249Z E0803 09:26:41.923153       1 reflector.go:178] github.com/openshift/openshift-state-metrics/pkg/collectors/builder.go:228: Failed to list *v1.Build: builds.build.openshift.io is forbidden: User "system:serviceaccount:openshift-monitoring:openshift-state-metrics" cannot list resource "builds" in API group "build.openshift.io" at the cluster scope
...

Checked the events; the former openshift-state-metrics-64fd69bb9c-qmc9k pod's event:
  involvedObject:
    apiVersion: v1
    kind: Pod
    name: openshift-state-metrics-64fd69bb9c-qmc9k
    namespace: openshift-monitoring
    resourceVersion: "13688"
    uid: f6909145-f86e-401f-b821-e860e5c1f9f8
  kind: Event
  lastTimestamp: null
  message: '0/4 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready:
    }, that the pod didn''t tolerate, 3 node(s) had taint {node-role.kubernetes.io/master:
    }, that the pod didn''t tolerate.'
  metadata:

Version-Release number of selected component (if applicable):
upgrade from 4.5.0-0.nightly-2020-08-01-204100 to 4.6.0-0.nightly-2020-08-02-091622

How reproducible:
a few times

Steps to Reproduce:
1. See the description

Actual results:


Expected results:


Additional info:

Comment 4 Stefan Schimanski 2020-08-26 09:23:05 UTC
Could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1861917.

Comment 5 Lukasz Szaszkiewicz 2020-09-03 10:58:30 UTC
I haven't found anything interesting except a few entries in the audit logs. It looks like the request was rejected by the ip-10-0-48-82 (KubeAPI) instance.
The following entry was found in that machine's audit log:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "a8a501c6-2411-4604-afb2-d9e3a1681ce4",
  "stage": "ResponseComplete",
  "requestURI": "/apis/build.openshift.io/v1/builds?resourceVersion=53019",
  "verb": "list",
  "user": {
    "username": "system:serviceaccount:openshift-monitoring:openshift-state-metrics",
    "uid": "84a4e18c-b54b-4c47-9d2c-4c22a88be696",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:openshift-monitoring",
      "system:authenticated"
    ]
  },
  "sourceIPs": [
    "10.0.49.232"
  ],
  "userAgent": "openshift-state-metrics/UNKNOWN (linux/amd64) kube-state-metrics/UNKNOWN",
  "objectRef": {
    "resource": "builds",
    "apiGroup": "build.openshift.io",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "status": "Failure",
    "reason": "Forbidden",
    "code": 403
  },
  "requestReceivedTimestamp": "2020-08-03T09:26:41.922395Z",
  "stageTimestamp": "2020-08-03T09:26:41.922666Z",
  "annotations": {
    "authentication.k8s.io/legacy-token": "system:serviceaccount:openshift-monitoring:openshift-state-metrics",
    "authorization.k8s.io/decision": "forbid",
    "authorization.k8s.io/reason": ""
  }
}


Interestingly, a few seconds later the same request was actually admitted:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "95530591-1627-43ef-b11d-f7e0eb318da4",
  "stage": "ResponseComplete",
  "requestURI": "/apis/build.openshift.io/v1/builds?resourceVersion=53019",
  "verb": "list",
  "user": {
    "username": "system:serviceaccount:openshift-monitoring:openshift-state-metrics",
    "uid": "84a4e18c-b54b-4c47-9d2c-4c22a88be696",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:openshift-monitoring",
      "system:authenticated"
    ]
  },
  "sourceIPs": [
    "10.0.49.232"
  ],
  "userAgent": "openshift-state-metrics/UNKNOWN (linux/amd64) kube-state-metrics/UNKNOWN",
  "objectRef": {
    "resource": "builds",
    "apiGroup": "build.openshift.io",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 200
  },
  "requestReceivedTimestamp": "2020-08-03T09:26:44.613235Z",
  "stageTimestamp": "2020-08-03T09:26:44.618978Z",
  "annotations": {
    "authentication.k8s.io/legacy-token": "system:serviceaccount:openshift-monitoring:openshift-state-metrics",
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"openshift-state-metrics\" of ClusterRole \"openshift-state-metrics\" to ServiceAccount \"openshift-state-metrics/openshift-monitoring\""
  }
}

I also found an entry suggesting the request had been admitted (09:19:01) before it was rejected (09:26:41):

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "50576d07-a133-4ba7-9be0-c3666b77490c",
  "stage": "ResponseComplete",
  "requestURI": "/apis/build.openshift.io/v1/builds?allowWatchBookmarks=true&resourceVersion=53019&timeout=7m39s&timeoutSeconds=459&watch=true",
  "verb": "watch",
  "user": {
    "username": "system:serviceaccount:openshift-monitoring:openshift-state-metrics",
    "uid": "84a4e18c-b54b-4c47-9d2c-4c22a88be696",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:openshift-monitoring",
      "system:authenticated"
    ]
  },
  "sourceIPs": [
    "10.0.49.232"
  ],
  "userAgent": "openshift-state-metrics/UNKNOWN (linux/amd64) kube-state-metrics/UNKNOWN",
  "objectRef": {
    "resource": "builds",
    "apiGroup": "build.openshift.io",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 200
  },
  "requestReceivedTimestamp": "2020-08-03T09:19:01.932581Z",
  "stageTimestamp": "2020-08-03T09:26:40.942692Z",
  "annotations": {
    "authentication.k8s.io/legacy-token": "system:serviceaccount:openshift-monitoring:openshift-state-metrics",
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"openshift-state-metrics\" of ClusterRole \"openshift-state-metrics\" to ServiceAccount \"openshift-state-metrics/openshift-monitoring\""
  }
}
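
The correlation above (an allowed watch, a forbidden list, then an allowed list a few seconds later) can be pulled out of the audit logs mechanically. A minimal sketch of such a filter — the helper name and the newline-delimited file layout are assumptions, but the field names match the audit.k8s.io/v1 events shown above:

```python
import json

# Hypothetical helper: given a file of newline-delimited audit.k8s.io/v1
# events, list every authorization decision made for one username,
# sorted by request time.
def decisions(path, username):
    out = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            ev = json.loads(line)
            if ev.get("user", {}).get("username") != username:
                continue
            out.append((
                ev.get("requestReceivedTimestamp"),
                ev.get("verb"),
                ev.get("annotations", {}).get("authorization.k8s.io/decision"),
            ))
    return sorted(out)
```

Run against the logs quoted in this comment, `decisions(path, "system:serviceaccount:openshift-monitoring:openshift-state-metrics")` would yield a timeline such as ("...T09:26:41...", "list", "forbid") followed by ("...T09:26:44...", "list", "allow"), which is exactly the flip-flop described above.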

Comment 6 Lukasz Szaszkiewicz 2020-09-03 11:01:45 UTC
I also found frequent audit entries updating the associated ClusterRole (openshift-state-metrics); it looks like the updates came from the cluster-monitoring-operator.
One particular update landed very close (09:26:38) to the time the request was rejected (09:26:41).

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "0b6c1dfa-24d9-4a9d-8eec-df80897b7d49",
  "stage": "ResponseComplete",
  "requestURI": "/apis/rbac.authorization.k8s.io/v1/clusterroles/openshift-state-metrics",
  "verb": "update",
  "user": {
    "username": "system:serviceaccount:openshift-monitoring:cluster-monitoring-operator",
    "uid": "751dd8ce-3625-41d7-ab5f-b07af70b0719",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:openshift-monitoring",
      "system:authenticated"
    ]
  },
  "sourceIPs": [
    "10.0.78.5"
  ],
  "userAgent": "operator/v0.0.0 (linux/amd64) kubernetes/$Format",
  "objectRef": {
    "resource": "clusterroles",
    "name": "openshift-state-metrics",
    "apiGroup": "rbac.authorization.k8s.io",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 200
  },
  "requestReceivedTimestamp": "2020-08-03T09:26:38.536065Z",
  "stageTimestamp": "2020-08-03T09:26:38.594847Z",
  "annotations": {
    "authentication.k8s.io/legacy-token": "system:serviceaccount:openshift-monitoring:cluster-monitoring-operator",
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"cluster-monitoring-operator\" of ClusterRole \"cluster-monitoring-operator\" to ServiceAccount \"cluster-monitoring-operator/openshift-monitoring\""
  }
}

Comment 7 Stefan Schimanski 2020-09-11 10:57:47 UTC
Increasing severity.

Comment 8 Stefan Schimanski 2020-09-11 10:59:07 UTC
Increasing priority and renaming the bug. We are on this. We don't know the root cause yet. It has blocker potential depending on the outcome.

Comment 10 Lukasz Szaszkiewicz 2020-09-16 08:45:09 UTC
We have opened https://github.com/openshift/origin/pull/25504, which changes the RBAC controller to output the full trace when a request is denied.
It turned out that when a request was rejected, there was no matching rule.

It also turned out that cluster-monitoring-operator was hotlooping, repeatedly creating and deleting a ClusterRoleBinding.
After rechecking the audit logs attached to this BZ, it looks like the same applies here.
I'm assigning this back to the monitoring team to fix the operator.


{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"7d373e49-5735-4f14-8f7e-7abedb116f93","stage":"ResponseComplete","requestURI":"/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/openshift-state-metrics","verb":"delete","user":{"username":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","uid":"751dd8ce-3625-41d7-ab5f-b07af70b0719","groups":["system:serviceaccounts","system:serviceaccounts:openshift-monitoring","system:authenticated"]},"sourceIPs":["10.0.78.5"],"userAgent":"operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"clusterrolebindings","name":"openshift-state-metrics","apiGroup":"rbac.authorization.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-08-03T09:26:40.319157Z","stageTimestamp":"2020-08-03T09:26:40.327536Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"cluster-monitoring-operator\" of ClusterRole \"cluster-monitoring-operator\" to ServiceAccount \"cluster-monitoring-operator/openshift-monitoring\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"d0748c2f-dc2f-4176-84bb-f09f7f3a31cd","stage":"ResponseComplete","requestURI":"/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/openshift-state-metrics","verb":"delete","user":{"username":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","uid":"751dd8ce-3625-41d7-ab5f-b07af70b0719","groups":["system:serviceaccounts","system:serviceaccounts:openshift-monitoring","system:authenticated"]},"sourceIPs":["10.0.78.5"],"userAgent":"operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"clusterrolebindings","name":"openshift-state-metrics","apiGroup":"rbac.authorization.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-08-03T09:32:09.580340Z","stageTimestamp":"2020-08-03T09:32:09.589009Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"cluster-monitoring-operator\" of ClusterRole \"cluster-monitoring-operator\" to ServiceAccount \"cluster-monitoring-operator/openshift-monitoring\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"4d058af9-3369-46a3-b65f-08d329b598c0","stage":"ResponseComplete","requestURI":"/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/openshift-state-metrics","verb":"delete","user":{"username":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","uid":"751dd8ce-3625-41d7-ab5f-b07af70b0719","groups":["system:serviceaccounts","system:serviceaccounts:openshift-monitoring","system:authenticated"]},"sourceIPs":["10.0.78.5"],"userAgent":"operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"clusterrolebindings","name":"openshift-state-metrics","apiGroup":"rbac.authorization.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-08-03T09:37:37.645928Z","stageTimestamp":"2020-08-03T09:37:37.654610Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:openshift-monitoring:cluster-monitoring-operator","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"cluster-monitoring-operator\" of ClusterRole \"cluster-monitoring-operator\" to ServiceAccount \"cluster-monitoring-operator/openshift-monitoring\""}}

Comment 11 Sergiusz Urbaniak 2020-09-17 07:07:11 UTC
Thank you for the analysis, that is of great help!

I believe we converged on the same hypothesis; I found the same pattern in https://bugzilla.redhat.com/show_bug.cgi?id=1856189#c40.

Comment 12 Sergiusz Urbaniak 2020-09-17 07:08:39 UTC
*** Bug 1856189 has been marked as a duplicate of this bug. ***

Comment 14 Junqi Zhao 2020-09-27 00:57:44 UTC
Upgraded from 4.5.12 to 4.6.0-0.nightly-2020-09-25-054214 and watched the status of the monitoring cluster operator; openshift-state-metrics/kube-state-metrics did not degrade during the upgrade process.

Comment 15 Sergiusz Urbaniak 2020-10-01 05:50:10 UTC
*** Bug 1832825 has been marked as a duplicate of this bug. ***

Comment 22 Pawel Krupa 2020-10-12 08:27:39 UTC
@Vendanti there is no workaround, and we are in the middle of backporting a fix to 4.5.

Comment 24 errata-xmlrpc 2020-10-27 16:22:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 28 Guillaume 2020-12-03 11:22:08 UTC
(In reply to Pawel Krupa from comment #22)
> @Vendanti there is no workaround and we are in the middle of backporting a
> fix to 4.5

Hello Pawel,
Do you know if the fix backport was performed on Openshift 4.5 branch, and if yes, starting at which version?
Thank you,
Guillaume.

