Description of problem:
Alert `FluentdNodeDown` fires when log forwarding is enabled and the logStore is not set. I checked the fluentd metrics: they were exposed, but they did not show up in the Prometheus metrics console, and there were many error logs in the prometheus-k8s pod:

level=error ts=2019-11-21T06:27:43.316Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:263: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"openshift-logging\""
level=error ts=2019-11-21T06:27:44.316Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:265: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-logging\""
level=error ts=2019-11-21T06:27:44.316Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:264: Failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"openshift-logging\""

However, I could retrieve the fluentd metrics directly as user system:serviceaccount:openshift-monitoring:prometheus-k8s:

$ oc exec cluster-logging-operator-64ccbb7b68-svcvg -- curl -ks -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`" -H "Content-type: application/json" https://172.30.225.76:24231/metrics
# TYPE fluentd_output_status_buffer_total_bytes gauge
# HELP fluentd_output_status_buffer_total_bytes Current total size of stage and queue buffers.
fluentd_output_status_buffer_total_bytes{hostname="fluentd-sbfzk",plugin_id="retry_user_created_es",type="elasticsearch"} 0.0
fluentd_output_status_buffer_total_bytes{hostname="fluentd-sbfzk",plugin_id="user_created_es",type="elasticsearch"} 56180.0
# TYPE fluentd_output_status_buffer_stage_length gauge
# HELP fluentd_output_status_buffer_stage_length Current length of stage buffers.
fluentd_output_status_buffer_stage_length{hostname="fluentd-sbfzk",plugin_id="retry_user_created_es",type="elasticsearch"} 0.0
fluentd_output_status_buffer_stage_length{hostname="fluentd-sbfzk",plugin_id="user_created_es",type="elasticsearch"} 5.0
# TYPE fluentd_output_status_buffer_stage_byte_size gauge
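The "forbidden" errors above suggest that, when no logStore is deployed, nothing grants the prometheus-k8s service account read access for service discovery in the openshift-logging namespace. As a rough workaround sketch (the role and rolebinding names here are assumptions, not objects the operator actually creates), granting that access manually should let Prometheus discover the fluentd endpoints again:

# Hypothetical workaround: allow prometheus-k8s to discover targets
# in openshift-logging (role/binding names are made up for this sketch).
$ oc create role prometheus-k8s-discovery \
    --verb=get,list,watch \
    --resource=services,endpoints,pods \
    -n openshift-logging
$ oc create rolebinding prometheus-k8s-discovery \
    --role=prometheus-k8s-discovery \
    --serviceaccount=openshift-monitoring:prometheus-k8s \
    -n openshift-logging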
$ oc get sa
NAME                       SECRETS   AGE
builder                    2         81m
cluster-logging-operator   2         81m
default                    2         81m
deployer                   2         81m
elasticsearch-server       2         64m
logcollector               2         36m

$ oc get secret | grep fluentd
fluentd           Opaque              3   36m
fluentd-metrics   kubernetes.io/tls   2   36m

$ oc get svc
NAME      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
fluentd   ClusterIP   172.30.225.76   <none>        24231/TCP   41m

$ oc get servicemonitor fluentd -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2019-11-21T05:59:14Z"
  generation: 1
  name: fluentd
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: logging.openshift.io/v1
    controller: true
    kind: ClusterLogging
    name: instance
    uid: 34da88dc-2122-463f-829d-18d293c4cbf5
  resourceVersion: "187593"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/openshift-logging/servicemonitors/fluentd
  uid: 14255069-1dd4-4a1a-adac-7f6055a8cc7d
spec:
  endpoints:
  - path: /metrics
    port: metrics
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: fluentd.openshift-logging.svc
  jobLabel: monitor-fluentd
  namespaceSelector:
    matchNames:
    - openshift-logging
  selector:
    matchLabels:
      logging-infra: support

FluentdNodeDown alert details:
alert: FluentdNodeDown
expr: absent(up{job="fluentd"} == 1)

Evaluating absent(up{job="fluentd"} == 1) in the Prometheus console returns:
Element   Value
{}        1

That is, Prometheus sees no healthy fluentd scrape target at all, so it considers fluentd down.

Version-Release number of selected component (if applicable):
ose-cluster-logging-operator-v4.3.0-201911201806
ose-logging-fluentd-v4.3.0-201911151317

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-11-19-122017   True        False         5h51m   Cluster version is 4.3.0-0.nightly-2019-11-19-122017

How reproducible:
Always

Steps to Reproduce:
1. Deploy the logging operators.
2. Deploy a log receiver.
3. Create a logforwarding instance.
4. Create a clusterlogging instance without setting logStore:

apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  annotations:
    clusterlogging.openshift.io/logforwardingtechpreview: enabled
  name: "instance"
  namespace: "openshift-logging"
spec:
  managementState: "Managed"
  collection:
    logs:
      type: "fluentd"
      fluentd: {}

Actual results:
Alert `FluentdNodeDown` fires even though all the fluentd pods are working as expected.

Expected results:
The `FluentdNodeDown` alert does not fire while the fluentd pods are healthy and their metrics endpoint is reachable.

Additional info:
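One way to confirm that Prometheus never discovered the fluentd target (as opposed to discovering it and failing the scrape) is to query the Prometheus targets API from inside the pod. This is a diagnostic sketch only, assuming the default prometheus-k8s-0 pod name, the in-pod listener on localhost:9090, and that curl is available in the prometheus container:

# No fluentd entry in the output means the target was never discovered,
# which matches the absent(up{job="fluentd"} == 1) alert condition.
$ oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- \
    curl -s http://localhost:9090/api/v1/targets | grep fluentd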
Verified with clusterlogging.4.4.0-202002170216
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581