Bug 2021342 - Prometheus could not scrape fluentd for more than 10m
Summary: Prometheus could not scrape fluentd for more than 10m
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Jeff Cantrill
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-11-08 22:34 UTC by Steven Walter
Modified: 2022-03-02 07:50 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-17 17:56:06 UTC
Target Upstream Version:
Embargoed:



Description Steven Walter 2021-11-08 22:34:38 UTC
Description of problem:
Getting the message "Prometheus could not scrape fluentd for more than 10m."

Version-Release number of selected component (if applicable):
4.7.34

How reproducible:
Unconfirmed



Additional info:
The customer set the label openshift.io/cluster-monitoring: "true", but the error is still not clearing.
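
For reference, a minimal sketch of how that label is normally applied (the openshift-logging namespace is assumed here, based on the errors below):

# oc label namespace openshift-logging openshift.io/cluster-monitoring="true" --overwrite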

The prometheus pods are logging this error repeatedly:

2021-10-31T03:05:06.385693354Z level=error ts=2021-10-31T03:05:06.385Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-logging\""
2021-10-31T03:05:08.607296440Z level=error ts=2021-10-31T03:05:08.607Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"openshift-logging\""
2021-10-31T03:05:31.197590776Z level=error ts=2021-10-31T03:05:31.197Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:426: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"openshift-logging\""
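
These can be pulled straight from the Prometheus pod logs; a sketch (the pod name prometheus-k8s-0 is an assumption, there is usually more than one replica):

# oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep -i forbidden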

We found a similar bug from an older version:
https://bugzilla.redhat.com/show_bug.cgi?id=1774907
Using diagnostic steps from that bug:

# token=`oc -n openshift-monitoring sa get-token prometheus-k8s`
# oc auth can-i list endpoints -n openshift-logging --token $token
# oc auth can-i list endpoints -n openshift-logging --token $token
# oc auth can-i list endpoints -n openshift-logging --token $token
# oc auth can-i list endpoints -n openshift-logging --token $token
# oc auth can-i list endpoints -n openshift-logging --token $token
# oc auth can-i list endpoints -n openshift-logging --token $token

These all return "no". I suspect something has failed to create the proper role bindings for prometheus-k8s. Are there roles that should be added? Can they be added manually?
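
For completeness, the same check can be run against each resource type Prometheus complains about (pods, services, endpoints); a small sketch using the same token:

# token=`oc -n openshift-monitoring sa get-token prometheus-k8s`
# for res in pods services endpoints; do oc auth can-i list $res -n openshift-logging --token $token; done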

Comment 1 Arunprasad Rajkumar 2021-11-09 06:35:35 UTC
Other cluster operators (e.g. cluster-etcd-operator) define explicit role [1] and role binding [2] manifests for the `prometheus-k8s` service account. You may need to do the same.

But I'm wondering why this was not done by the cluster-logging operator!


[1] https://github.com/openshift/cluster-etcd-operator/blob/master/manifests/0000_90_etcd-operator_01_prometheusrole.yaml
[2] https://github.com/openshift/cluster-etcd-operator/blob/master/manifests/0000_90_etcd-operator_02_prometheusrolebinding.yaml
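
If a manual workaround is needed in the meantime, roughly the same grant can be created with oc; a sketch (these are not the operator's own manifests, and the role/binding names here are illustrative):

# oc -n openshift-logging create role prometheus-k8s-view --verb=get,list,watch --resource=pods,services,endpoints
# oc -n openshift-logging create rolebinding prometheus-k8s-view --role=prometheus-k8s-view --serviceaccount=openshift-monitoring:prometheus-k8s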

Comment 2 Arunprasad Rajkumar 2021-11-09 07:52:24 UTC
It seems the cluster-logging-operator already ships the necessary role [1] and role binding [2] for the `prometheus-k8s` service account.

[1] https://github.com/openshift/cluster-logging-operator/blob/release-4.7/manifests/4.7/0100_clusterroles.yaml
[2] https://github.com/openshift/cluster-logging-operator/blob/release-4.7/manifests/4.7/0110_clusterrolebindings.yaml
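
To confirm those bindings actually exist on the affected cluster, a rough grep-based check (exact object names may differ from the manifests):

# oc get clusterrolebindings -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.subjects}{"\n"}{end}' | grep prometheus-k8s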

Comment 4 Periklis Tsirakidis 2021-11-09 14:54:03 UTC
The OpenShift Logging product does not ship any 4.7 version. The last release bound to the OCP version was 4.6.z; everything after that uses a 5.x version (e.g. 5.0, 5.1, 5.2). If your image bundle registry shows anything like 4.7, it is a registry issue. Please use 5.x.

Comment 5 Steven Walter 2021-11-17 17:49:07 UTC
Hi, my apologies, the OpenShift Logging version is: cluster-logging.5.2.2-21
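
For reference, the installed version can be read from the operator's ClusterServiceVersion (assuming the default openshift-logging install namespace):

# oc -n openshift-logging get csv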

It slipped my mind that the products are tracking different release cycles now. Can we re-open this against 5.2.2?

Comment 6 Steven Walter 2021-11-17 17:50:58 UTC
Actually, it looks like I set the version to 4.7 when the component was set to Monitoring...
Should we move this to JIRA, since this is Logging 5.2? I know new bugs should be filed there, but I'm not sure what the protocol is when a bug may also involve other components such as monitoring.

Comment 7 Steven Walter 2021-11-17 17:56:06 UTC
Apologies for the comment spam; I'll close this and move to JIRA.

