Bug 1706478

Summary: Could not get elasticsearch metrics in prometheus server -- happens again.
Product: OpenShift Container Platform
Component: Logging
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Target Milestone: ---
Target Release: 4.1.0
Reporter: Qiaoling Tang <qitang>
Assignee: Josef Karasek <jkarasek>
QA Contact: Anping Li <anli>
Docs Contact:
CC: aos-bugs, jcantril, pweil, rmeggins, wsun
Keywords: Regression
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1712423 (view as bug list)
Environment:
Last Closed: 2019-06-04 10:48:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1712423

Description Qiaoling Tang 2019-05-05 02:38:42 UTC
Description of problem:
Could not get elasticsearch metrics in the prometheus server. Querying the metrics endpoint with the prometheus-k8s service account token fails:

$ oc get svc
NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
elasticsearch           ClusterIP   172.30.255.185   <none>        9200/TCP    27m
elasticsearch-cluster   ClusterIP   172.30.67.28     <none>        9300/TCP    27m
elasticsearch-metrics   ClusterIP   172.30.31.72     <none>        60000/TCP   27m
fluentd                 ClusterIP   172.30.163.225   <none>        24231/TCP   27m
kibana                  ClusterIP   172.30.42.149    <none>        443/TCP     27m

$ oc exec fluentd-k852g -- curl -k -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`"   -H "Content-type: application/json" https://172.30.31.72:60000/_prometheus/metrics
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   674  100   674    0     0    424      0  0:00:01  0:00:01 --:--:--   424
{"error":{"root_cause":[{"type":"security_exception","reason":"no permissions for [indices:monitor/stats] and User [name=system:serviceaccount:openshift-monitoring:prometheus-k8s, roles=[gen_user_647a750f1787408bf50088234ec0edd5a6a9b2ac, gen_kibana_647a750f1787408bf50088234ec0edd5a6a9b2ac, prometheus]]"}],"type":"exception","reason":"Indices stats request failed","caused_by":{"type":"security_exception","reason":"no permissions for [indices:monitor/stats] and User [name=system:serviceaccount:openshift-monitoring:prometheus-k8s, roles=[gen_user_647a750f1787408bf50088234ec0edd5a6a9b2ac, gen_kibana_647a750f1787408bf50088234ec0edd5a6a9b2ac, prometheus]]"}},"status":500}

$ oc exec fluentd-k852g -- curl -k -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`"   -H "Content-type: application/json" https://172.30.255.185:9200/_prometheus/metrics
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0{"error":{"root_cause":[{"type":"security_exception","reason":"no permissions for [indices:monitor/stats] and User [name=system:serviceaccount:openshift-monitoring:prometheus-k8s, roles=[gen_user_647a750f1787408bf50088234ec0edd5a6a9b2ac, gen_kibana_647a750f1787408bf50088234ec0edd5a6a9b2ac, prometheus]]"}],"type":"exception","reason":"Indices stats request failed","caused_by":{"type":"security_exception","reason":"no permissions for [indices:monitor/stats] and User [name=system:serviceaccount:openshift-monitoring:prometheus-k8s, roles=[gen_user_647a750f1787408bf50088234ec0edd5a6a9b2ac, gen_kibana_647a750f178100   674  100   674    0     0    807      0 --:--:-- --:--:-- --:--:--   807

Logs in ES pod:
[2019-05-05T02:16:15,924][INFO ][c.f.s.c.PrivilegesEvaluator] No index-level perm match for User [name=system:serviceaccount:openshift-monitoring:prometheus-k8s, roles=[gen_user_647a750f1787408bf50088234ec0edd5a6a9b2ac, gen_kibana_647a750f1787408bf50088234ec0edd5a6a9b2ac, prometheus]] [IndexType [index=project.project-5.1b7b290e-6ed9-11e9-be40-0ae526d081c4.2019.05.05, type=*], IndexType [index=.searchguard, type=*], IndexType [index=project.project-1.054edf75-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=project.project-4.15f05262-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=project.project-7.267f7de0-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=project.project-3.107c79e0-6ed9-11e9-be40-0ae526d081c4.2019.05.05, type=*], IndexType [index=.operations.2019.05.05, type=*], IndexType [index=project.project-2.0afb16f4-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=.kibana, type=*], IndexType [index=project.qitang1.b3789746-6ed2-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=project.project-6.210ceb80-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=.kibana.647a750f1787408bf50088234ec0edd5a6a9b2ac, type=*]] [Action [[indices:monitor/stats]]] [RolesChecked [gen_kibana_647a750f1787408bf50088234ec0edd5a6a9b2ac, gen_user_647a750f1787408bf50088234ec0edd5a6a9b2ac, sg_role_prometheus]]
[2019-05-05T02:16:15,924][INFO ][c.f.s.c.PrivilegesEvaluator] No permissions for {gen_user_647a750f1787408bf50088234ec0edd5a6a9b2ac=[IndexType [index=.searchguard, type=*], IndexType [index=project.project-1.054edf75-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=project.project-7.267f7de0-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=project.project-6.210ceb80-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=project.project-5.1b7b290e-6ed9-11e9-be40-0ae526d081c4.2019.05.05, type=*], IndexType [index=project.project-4.15f05262-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=project.project-3.107c79e0-6ed9-11e9-be40-0ae526d081c4.2019.05.05, type=*], IndexType [index=.operations.2019.05.05, type=*], IndexType [index=project.project-2.0afb16f4-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=.kibana, type=*], IndexType [index=project.qitang1.b3789746-6ed2-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=.kibana.647a750f1787408bf50088234ec0edd5a6a9b2ac, type=*]], sg_role_prometheus=[IndexType [index=.searchguard, type=*], IndexType [index=project.project-1.054edf75-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=project.project-7.267f7de0-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=project.project-6.210ceb80-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=project.project-5.1b7b290e-6ed9-11e9-be40-0ae526d081c4.2019.05.05, type=*], IndexType [index=project.project-4.15f05262-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=project.project-3.107c79e0-6ed9-11e9-be40-0ae526d081c4.2019.05.05, type=*], IndexType [index=.operations.2019.05.05, type=*], IndexType [index=project.project-2.0afb16f4-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=.kibana, type=*], IndexType [index=project.qitang1.b3789746-6ed2-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=.kibana.647a750f1787408bf50088234ec0edd5a6a9b2ac, type=*]], gen_kibana_647a750f1787408bf50088234ec0edd5a6a9b2ac=[IndexType [index=.searchguard, type=*], IndexType [index=project.project-1.054edf75-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=project.project-7.267f7de0-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=project.project-6.210ceb80-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=project.project-5.1b7b290e-6ed9-11e9-be40-0ae526d081c4.2019.05.05, type=*], IndexType [index=project.project-4.15f05262-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=project.project-3.107c79e0-6ed9-11e9-be40-0ae526d081c4.2019.05.05, type=*], IndexType [index=.operations.2019.05.05, type=*], IndexType [index=project.project-2.0afb16f4-6ed9-11e9-94ff-0201c63c112c.2019.05.05, type=*], IndexType [index=.kibana, type=*], IndexType [index=project.qitang1.b3789746-6ed2-11e9-94ff-0201c63c112c.2019.05.05, type=*]]}
[2019-05-05T02:16:15,924][WARN ][r.suppressed             ] path: /_prometheus/metrics, params: {}
org.elasticsearch.ElasticsearchException: Indices stats request failed
	at org.elasticsearch.action.TransportNodePrometheusMetricsAction$AsyncAction$2.onFailure(TransportNodePrometheusMetricsAction.java:154) [prometheus-exporter-5.6.13.2.jar:5.6.13]
	at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94) [elasticsearch-5.6.13.jar:5.6.13]
	at com.floragunn.searchguard.filter.SearchGuardFilter.apply(SearchGuardFilter.java:143) [search-guard-5-5.6.13-19.2.jar:?]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:168) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:142) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:84) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:408) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.execute(AbstractClient.java:1256) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.stats(AbstractClient.java:1577) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.action.TransportNodePrometheusMetricsAction$AsyncAction$3.onResponse(TransportNodePrometheusMetricsAction.java:164) [prometheus-exporter-5.6.13.2.jar:5.6.13]
	at org.elasticsearch.action.TransportNodePrometheusMetricsAction$AsyncAction$3.onResponse(TransportNodePrometheusMetricsAction.java:159) [prometheus-exporter-5.6.13.2.jar:5.6.13]
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:84) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.finishHim(TransportNodesAction.java:254) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onOperation(TransportNodesAction.java:229) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$100(TransportNodesAction.java:153) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleResponse(TransportNodesAction.java:206) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleResponse(TransportNodesAction.java:198) [elasticsearch-5.6.13.jar:5.6.13]
	at com.floragunn.searchguard.transport.SearchGuardInterceptor$RestoringTransportResponseHandler.handleResponse(SearchGuardInterceptor.java:158) [search-guard-5-5.6.13-19.2.jar:?]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1078) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1152) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1142) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1131) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.transport.DelegatingTransportChannel.sendResponse(DelegatingTransportChannel.java:60) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.transport.RequestHandlerRegistry$TransportChannelWrapper.sendResponse(RequestHandlerRegistry.java:111) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:262) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:258) [elasticsearch-5.6.13.jar:5.6.13]
	at com.floragunn.searchguard.ssl.transport.SearchGuardSSLRequestHandler.messageReceivedDecorate(SearchGuardSSLRequestHandler.java:178) [search-guard-ssl-5.6.13-23.jar:5.6.13-23]
	at com.floragunn.searchguard.transport.SearchGuardRequestHandler.messageReceivedDecorate(SearchGuardRequestHandler.java:107) [search-guard-5-5.6.13-19.2.jar:?]
	at com.floragunn.searchguard.ssl.transport.SearchGuardSSLRequestHandler.messageReceived(SearchGuardSSLRequestHandler.java:92) [search-guard-ssl-5.6.13-23.jar:5.6.13-23]
	at com.floragunn.searchguard.SearchGuardPlugin$5$1.messageReceived(SearchGuardPlugin.java:493) [search-guard-5-5.6.13-19.2.jar:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:662) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:675) [elasticsearch-5.6.13.jar:5.6.13]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.13.jar:5.6.13]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
Caused by: org.elasticsearch.ElasticsearchSecurityException: no permissions for [indices:monitor/stats] and User [name=system:serviceaccount:openshift-monitoring:prometheus-k8s, roles=[gen_user_647a750f1787408bf50088234ec0edd5a6a9b2ac, gen_kibana_647a750f1787408bf50088234ec0edd5a6a9b2ac, prometheus]]
	... 38 more
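
Both the 500 response and the log above point at the Search Guard configuration inside Elasticsearch rather than at Kubernetes RBAC: the prometheus role mapped to the prometheus-k8s service account is checked (RolesChecked [... sg_role_prometheus]) but carries no grant for indices:monitor/stats. A minimal sketch of the kind of sg_roles.yml entry that would allow it -- role name taken from the log, layout assumed from the Search Guard 5 roles format, not necessarily the config actually shipped:

sg_role_prometheus:
  cluster:
    - 'cluster:monitor/*'
  indices:
    '*':
      '*':
        - 'indices:monitor/*'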

The metrics are already exposed:
$ oc exec elasticsearch-cdm-g342mj5c-1-fcbbf47d5-l7kcj -- es_util --query=_prometheus/metrics
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-g342mj5c-1-fcbbf47d5-l7kcj -n openshift-logging' to see all of the containers in this pod.
# HELP es_index_querycache_hit_count Count of hits in query cache
# TYPE es_index_querycache_hit_count gauge
es_index_querycache_hit_count{cluster="elasticsearch",index="project.project-7.267f7de0-6ed9-11e9-94ff-0201c63c112c.2019.05.05",context="total",} 0.0
es_index_querycache_hit_count{cluster="elasticsearch",index=".searchguard",context="total",} 0.0
es_index_querycache_hit_count{cluster="elasticsearch",index="project.project-4.15f05262-6ed9-11e9-94ff-0201c63c112c.2019.05.05",context="primaries",} 0.0
es_index_querycache_hit_count{cluster="elasticsearch",index="project.project-2.0afb16f4-6ed9-11e9-94ff-0201c63c112c.2019.05.05",context="primaries",} 0.0
es_index_querycache_hit_count{cluster="elasticsearch",index="project.project-2.0afb16f4-6ed9-11e9-94ff-0201c63c112c.2019.05.05",context="total",} 0.0
es_index_querycache_hit_count{cluster="elasticsearch",index=".kibana.647a750f1787408bf50088234ec0edd5a6a9b2ac",context="primaries",} 0.0
es_index_querycache_hit_count{cluster="elasticsearch",index=".kibana",context="primaries",} 0.0
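
es_util runs the query from inside the pod with the local admin client certificate, so it bypasses the per-user Search Guard check that rejects the prometheus-k8s token. Roughly the equivalent raw call, with the cert paths assumed from the usual secret mount (not verified here):

$ oc exec elasticsearch-cdm-g342mj5c-1-fcbbf47d5-l7kcj -c elasticsearch -- \
    curl -s --cacert /etc/elasticsearch/secret/admin-ca \
         --cert /etc/elasticsearch/secret/admin-cert \
         --key /etc/elasticsearch/secret/admin-key \
         https://localhost:9200/_prometheus/metrics | head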


Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-04-210601   True        False         93m     Cluster version is 4.1.0-0.nightly-2019-05-04-210601

quay.io/openshift/origin-cluster-logging-operator@sha256:c2988870f4f47617394e6510aca690ddf6ec448b2579c4829e7d34e67e1129ee
quay.io/openshift/origin-logging-elasticsearch5@sha256:eaa6d1f258bc58758a9275b0b097cf38db5bf923b261e3f0c57d3cd087997ee4
quay.io/openshift/origin-oauth-proxy@sha256:f73bfe880c1caaf4a0a03cb6ffdb58baab2170e12ebafab26ea8e6abba66b3f4

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging via OLM.
2. Wait until all pods are running, then check the ES metrics in the prometheus server (see the status check under Additional info below).

Actual results:
Requests to /_prometheus/metrics as the prometheus-k8s service account fail with HTTP 500 ("no permissions for [indices:monitor/stats]"), so no Elasticsearch metrics show up in the prometheus server.

Expected results:
The prometheus-k8s service account can query /_prometheus/metrics and the Elasticsearch metrics appear in the prometheus server.

Additional info:
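A quick pass/fail check of the scrape endpoint itself, reusing the command from the description but only printing the HTTP status (service DNS name assumed; the cluster IP from the svc listing above works the same):

$ oc exec fluentd-k852g -- curl -s -o /dev/null -w '%{http_code}\n' -k \
    -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" \
    https://elasticsearch-metrics.openshift-logging.svc:60000/_prometheus/metrics

This returns 500 while the bug is present and should return 200 once the prometheus role is allowed indices:monitor/stats.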

Comment 1 Qiaoling Tang 2019-05-05 03:16:59 UTC
$ oc get clusterrole -o yaml -n openshift-logging elasticsearch-metrics
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: 2019-05-05T02:49:05Z
  name: elasticsearch-metrics
  ownerReferences:
  - apiVersion: logging.openshift.io/v1
    controller: true
    kind: Elasticsearch
    name: elasticsearch
    uid: 58f9e953-6ee0-11e9-8423-0622f9bac76a
  resourceVersion: "76248"
  selfLink: /apis/rbac.authorization.k8s.io/v1/clusterroles/elasticsearch-metrics
  uid: 58fc1373-6ee0-11e9-be40-0ae526d081c4
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - services
  - endpoints
  verbs:
  - list
  - watch
- nonResourceURLs:
  - /metrics
  verbs:
  - get

$ oc get clusterrolebinding -o yaml -n openshift-logging elasticsearch-metrics
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: 2019-05-05T02:49:05Z
  name: elasticsearch-metrics
  ownerReferences:
  - apiVersion: logging.openshift.io/v1
    controller: true
    kind: Elasticsearch
    name: elasticsearch
    uid: 58f9e953-6ee0-11e9-8423-0622f9bac76a
  resourceVersion: "76250"
  selfLink: /apis/rbac.authorization.k8s.io/v1/clusterrolebindings/elasticsearch-metrics
  uid: 58fcbe86-6ee0-11e9-be40-0ae526d081c4
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: elasticsearch-metrics
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: openshift-monitoring
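
The ClusterRole/ClusterRoleBinding above only grant the prometheus-k8s service account Kubernetes-level access; the "no permissions" error in the ES log is decided by Search Guard inside Elasticsearch and is not affected by this RBAC. A quick sanity check that the Kubernetes side is fine (nonResourceURL check, sketch only):

$ oc auth can-i get /metrics --as=system:serviceaccount:openshift-monitoring:prometheus-k8s

This should answer "yes" given the binding above.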

$ oc get servicemonitor monitor-elasticsearch-cluster -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: 2019-05-05T02:49:32Z
  generation: 1
  labels:
    cluster-name: elasticsearch
    scrape-metrics: enabled
  name: monitor-elasticsearch-cluster
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: logging.openshift.io/v1
    controller: true
    kind: Elasticsearch
    name: elasticsearch
    uid: 58f9e953-6ee0-11e9-8423-0622f9bac76a
  resourceVersion: "76764"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/openshift-logging/servicemonitors/monitor-elasticsearch-cluster
  uid: 694633ad-6ee0-11e9-be40-0ae526d081c4
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    path: /_prometheus/metrics
    port: elasticsearch
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: elasticsearch-metrics.openshift-logging.svc
  jobLabel: monitor-elasticsearch
  namespaceSelector:
    matchNames:
    - openshift-logging
  selector:
    matchLabels:
      cluster-name: elasticsearch
      scrape-metrics: enabled
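
The ServiceMonitor tells Prometheus to scrape /_prometheus/metrics over https with its service-account token against services in openshift-logging that match the selector labels. A quick check that the metrics service actually carries those labels (label values copied from the selector above):

$ oc get svc -n openshift-logging -l cluster-name=elasticsearch,scrape-metrics=enabled

If the elasticsearch-metrics service is listed, the scrape wiring is in place and the failure is purely the Search Guard permission shown in the ES log.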

Comment 4 Qiaoling Tang 2019-05-08 00:51:37 UTC
Verified with quay.io/openshift/origin-logging-elasticsearch5@sha256:cb880d5d4758b9155e5143c2024af414b76bd8f7e3a70ad8512efb6f09084d16

Comment 6 errata-xmlrpc 2019-06-04 10:48:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758