Bug 1776533 - Alert `FluentdNodeDown` is firing when logforwarding is enabled and the logstore is not set in the clusterlogging instance.
Summary: Alert `FluentdNodeDown` is firing when logforwarding is enabled and the logstore is not set in the clusterlogging instance.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.3.0
Assignee: Jeff Cantrill
QA Contact: Qiaoling Tang
URL:
Whiteboard:
Depends On: 1774864
Blocks: 1813085
 
Reported: 2019-11-25 22:38 UTC by Jeff Cantrill
Modified: 2020-03-13 13:47 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1774864
Environment:
Last Closed: 2020-01-23 11:14:14 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-logging-operator pull 321 0 None closed [release-4.3] Bug 1776533: Enable metrics for collector when LF enabled 2021-02-16 16:33:55 UTC
Github openshift cluster-logging-operator pull 324 0 None closed Bug 1776533: Enable metrics for collector when LF enabled 2021-02-16 16:33:56 UTC
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-01-23 11:14:31 UTC

Description Jeff Cantrill 2019-11-25 22:38:49 UTC
+++ This bug was initially created as a clone of Bug #1774864 +++

Description of problem:
Alert `FluentdNodeDown` is firing when logforwarding is enabled and the logstore is not set. I checked the fluentd metrics: they were exposed, but they did not show up in the Prometheus metrics console, and there were many error logs in the prometheus-k8s pod:
level=error ts=2019-11-21T06:27:43.316Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:263: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"openshift-logging\""
level=error ts=2019-11-21T06:27:44.316Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:265: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-logging\""
level=error ts=2019-11-21T06:27:44.316Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:264: Failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"openshift-logging\""

I could get the fluentd metrics as user system:serviceaccount:openshift-monitoring:prometheus-k8s:
oc exec cluster-logging-operator-64ccbb7b68-svcvg -- curl -ks -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`"   -H "Content-type: application/json" https://172.30.225.76:24231/metrics
# TYPE fluentd_output_status_buffer_total_bytes gauge
# HELP fluentd_output_status_buffer_total_bytes Current total size of stage and queue buffers.
fluentd_output_status_buffer_total_bytes{hostname="fluentd-sbfzk",plugin_id="retry_user_created_es",type="elasticsearch"} 0.0
fluentd_output_status_buffer_total_bytes{hostname="fluentd-sbfzk",plugin_id="user_created_es",type="elasticsearch"} 56180.0
# TYPE fluentd_output_status_buffer_stage_length gauge
# HELP fluentd_output_status_buffer_stage_length Current length of stage buffers.
fluentd_output_status_buffer_stage_length{hostname="fluentd-sbfzk",plugin_id="retry_user_created_es",type="elasticsearch"} 0.0
fluentd_output_status_buffer_stage_length{hostname="fluentd-sbfzk",plugin_id="user_created_es",type="elasticsearch"} 5.0
# TYPE fluentd_output_status_buffer_stage_byte_size gauge

$ oc get sa
NAME                       SECRETS   AGE
builder                    2         81m
cluster-logging-operator   2         81m
default                    2         81m
deployer                   2         81m
elasticsearch-server       2         64m
logcollector               2         36m

$ oc get secret |grep fluentd
fluentd                                    Opaque                                3      36m
fluentd-metrics                            kubernetes.io/tls                     2      36m

$ oc get svc
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
fluentd                ClusterIP   172.30.225.76    <none>        24231/TCP           41m


$ oc get servicemonitor fluentd -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2019-11-21T05:59:14Z"
  generation: 1
  name: fluentd
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: logging.openshift.io/v1
    controller: true
    kind: ClusterLogging
    name: instance
    uid: 34da88dc-2122-463f-829d-18d293c4cbf5
  resourceVersion: "187593"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/openshift-logging/servicemonitors/fluentd
  uid: 14255069-1dd4-4a1a-adac-7f6055a8cc7d
spec:
  endpoints:
  - path: /metrics
    port: metrics
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: fluentd.openshift-logging.svc
  jobLabel: monitor-fluentd
  namespaceSelector:
    matchNames:
    - openshift-logging
  selector:
    matchLabels:
      logging-infra: support

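The ServiceMonitor selects services labeled logging-infra: support in openshift-logging and scrapes the endpoint port named "metrics" over HTTPS. For Prometheus to discover the collector, the fluentd Service listed above therefore has to carry that label and expose port 24231 under the name "metrics", roughly like this (a sketch; the pod selector and any serving-cert annotations on the operator-generated Service may differ):

apiVersion: v1
kind: Service
metadata:
  name: fluentd
  namespace: openshift-logging
  labels:
    logging-infra: support      # must match the ServiceMonitor selector
spec:
  selector:
    component: fluentd          # assumption: pod label used by the collector daemonset
  ports:
  - name: metrics               # must match the ServiceMonitor endpoint port name
    port: 24231
    targetPort: 24231
    protocol: TCP
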
FluentdNodeDown alert details
alert: FluentdNodeDown 
expr: absent(up{job="fluentd"} == 1)

absent(up{job="fluentd"} == 1)
Element  Value                                                
{}       1
This means Prometheus thinks fluentd is down.
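
absent() returns a single series {} with value 1 whenever no series matches up{job="fluentd"} == 1, i.e. either the fluentd job is not being scraped at all or every scrape is failing, so the alert fires even though the pods themselves are healthy. For reference, a sketch of how such an alert is typically packaged as a PrometheusRule (the labels and `for` duration shipped by the operator may differ):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fluentd
  namespace: openshift-logging
spec:
  groups:
  - name: logging_fluentd.alerts
    rules:
    - alert: FluentdNodeDown
      # fires when no "up" series for the fluentd job equals 1,
      # i.e. Prometheus is not successfully scraping any fluentd pod
      expr: absent(up{job="fluentd"} == 1)
      for: 10m                  # assumption: the shipped duration may differ
      labels:
        severity: critical
      annotations:
        message: Prometheus could not scrape fluentd for more than 10m.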



Version-Release number of selected component (if applicable):
ose-cluster-logging-operator-v4.3.0-201911201806
ose-logging-fluentd-v4.3.0-201911151317
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-11-19-122017   True        False         5h51m   Cluster version is 4.3.0-0.nightly-2019-11-19-122017


How reproducible:
Always

Steps to Reproduce:
1. Deploy the logging operators
2. Deploy a log receiver
3. Create a logforwarding instance (a sketch is included after the ClusterLogging example below)
4. Create a clusterlogging instance without setting logStore:
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  annotations:
    clusterlogging.openshift.io/logforwardingtechpreview: enabled
  name: "instance"
  namespace: "openshift-logging"
spec:
  managementState: "Managed"
  collection:
    logs:
      type: "fluentd"
      fluentd: {}

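For step 3, a logforwarding instance of the following shape can be used (a sketch of the v1alpha1 tech-preview API; the output name, endpoint, and secret are placeholders for whatever log receiver was deployed in step 2):

apiVersion: logging.openshift.io/v1alpha1
kind: LogForwarding
metadata:
  name: instance
  namespace: openshift-logging
spec:
  disableDefaultForwarding: true
  outputs:
  - name: user-created-es                                      # placeholder output name
    type: elasticsearch
    endpoint: elasticsearch-server.openshift-logging.svc:9200  # placeholder endpoint
    secret:
      name: fluentd                                            # assumption: secret with client certs for the receiver
  pipelines:
  - name: app-logs
    inputSource: logs.app
    outputRefs:
    - user-created-es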

Actual results:
Alert `FluentdNodeDown` is firing even though all the fluentd pods are working as expected.

Expected results:


Additional info:

Comment 2 Qiaoling Tang 2019-12-16 01:03:31 UTC
Verified with clusterlogging.4.3.0-201912130552

ls tempdir/manifests/cluster-logging/4.3/
0100_clusterroles.yaml         cluster-loggings.crd.yaml                          collectors.crd.yaml
0110_clusterrolebindings.yaml  cluster-logging.v4.3.0.clusterserviceversion.yaml  logforwardings.crd.yaml

Comment 4 errata-xmlrpc 2020-01-23 11:14:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

