Bug 1776533 - Alert `FluentdNodeDown` is firing when logforwarding is enabled and the logstore is not set in the clusterlogging instance.
Summary: Alert `FluentdNodeDown` is firing when logforwarding is enabled and the logstore is not set in the clusterlogging instance.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.3.0
Assignee: Jeff Cantrill
QA Contact: Qiaoling Tang
URL:
Whiteboard:
Depends On: 1774864
Blocks: 1813085
 
Reported: 2019-11-25 22:38 UTC by Jeff Cantrill
Modified: 2020-03-13 13:47 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1774864
Environment:
Last Closed: 2020-01-23 11:14:14 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-logging-operator pull 321 0 None closed [release-4.3] Bug 1776533: Enable metrics for collector when LF enabled 2021-02-16 16:33:55 UTC
Github openshift cluster-logging-operator pull 324 0 None closed Bug 1776533: Enable metrics for collector when LF enabled 2021-02-16 16:33:56 UTC
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-01-23 11:14:31 UTC

Description Jeff Cantrill 2019-11-25 22:38:49 UTC
+++ This bug was initially created as a clone of Bug #1774864 +++

Description of problem:
Alert `FluentdNodeDown` is firing when logforwarding is enabled and the logstore is not set. I checked the fluentd metrics: they were exposed, but they did not show up in the Prometheus metrics console, and there were many error logs in the prometheus-k8s pod:
level=error ts=2019-11-21T06:27:43.316Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:263: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"openshift-logging\""
level=error ts=2019-11-21T06:27:44.316Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:265: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-logging\""
level=error ts=2019-11-21T06:27:44.316Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:264: Failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"openshift-logging\""

I could get the fluentd metrics as user system:serviceaccount:openshift-monitoring:prometheus-k8s:
oc exec cluster-logging-operator-64ccbb7b68-svcvg -- curl -ks -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`"   -H "Content-type: application/json" https://172.30.225.76:24231/metrics
# TYPE fluentd_output_status_buffer_total_bytes gauge
# HELP fluentd_output_status_buffer_total_bytes Current total size of stage and queue buffers.
fluentd_output_status_buffer_total_bytes{hostname="fluentd-sbfzk",plugin_id="retry_user_created_es",type="elasticsearch"} 0.0
fluentd_output_status_buffer_total_bytes{hostname="fluentd-sbfzk",plugin_id="user_created_es",type="elasticsearch"} 56180.0
# TYPE fluentd_output_status_buffer_stage_length gauge
# HELP fluentd_output_status_buffer_stage_length Current length of stage buffers.
fluentd_output_status_buffer_stage_length{hostname="fluentd-sbfzk",plugin_id="retry_user_created_es",type="elasticsearch"} 0.0
fluentd_output_status_buffer_stage_length{hostname="fluentd-sbfzk",plugin_id="user_created_es",type="elasticsearch"} 5.0
# TYPE fluentd_output_status_buffer_stage_byte_size gauge

$ oc get sa
NAME                       SECRETS   AGE
builder                    2         81m
cluster-logging-operator   2         81m
default                    2         81m
deployer                   2         81m
elasticsearch-server       2         64m
logcollector               2         36m

$ oc get secret |grep fluentd
fluentd                                    Opaque                                3      36m
fluentd-metrics                            kubernetes.io/tls                     2      36m

$ oc get svc
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
fluentd                ClusterIP   172.30.225.76    <none>        24231/TCP           41m


$ oc get servicemonitor fluentd -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2019-11-21T05:59:14Z"
  generation: 1
  name: fluentd
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: logging.openshift.io/v1
    controller: true
    kind: ClusterLogging
    name: instance
    uid: 34da88dc-2122-463f-829d-18d293c4cbf5
  resourceVersion: "187593"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/openshift-logging/servicemonitors/fluentd
  uid: 14255069-1dd4-4a1a-adac-7f6055a8cc7d
spec:
  endpoints:
  - path: /metrics
    port: metrics
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: fluentd.openshift-logging.svc
  jobLabel: monitor-fluentd
  namespaceSelector:
    matchNames:
    - openshift-logging
  selector:
    matchLabels:
      logging-infra: support

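The ServiceMonitor selects services labeled logging-infra: support in openshift-logging and scrapes the endpoint port named "metrics" over HTTPS. For Prometheus to discover the collector, the fluentd Service listed above therefore has to carry that label and expose port 24231 under the name "metrics", roughly like this (a sketch; the pod selector and any serving-cert annotations on the operator-generated Service may differ):

apiVersion: v1
kind: Service
metadata:
  name: fluentd
  namespace: openshift-logging
  labels:
    logging-infra: support      # must match the ServiceMonitor selector
spec:
  selector:
    component: fluentd          # assumption: pod label used by the collector daemonset
  ports:
  - name: metrics               # must match the ServiceMonitor endpoint port name
    port: 24231
    targetPort: 24231
    protocol: TCP
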
FluentdNodeDown alert details
alert: FluentdNodeDown 
expr: absent(up{job="fluentd"} == 1)

absent(up{job="fluentd"} == 1)
Element  Value                                                
{}       1
This means Prometheus thinks fluentd is down.
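
absent() returns a single series {} with value 1 whenever no series matches up{job="fluentd"} == 1, i.e. either the fluentd job is not being scraped at all or every scrape is failing, so the alert fires even though the pods themselves are healthy. For reference, a sketch of how such an alert is typically packaged as a PrometheusRule (the labels and `for` duration shipped by the operator may differ):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fluentd
  namespace: openshift-logging
spec:
  groups:
  - name: logging_fluentd.alerts
    rules:
    - alert: FluentdNodeDown
      # fires when no "up" series for the fluentd job equals 1,
      # i.e. Prometheus is not successfully scraping any fluentd pod
      expr: absent(up{job="fluentd"} == 1)
      for: 10m                  # assumption: the shipped duration may differ
      labels:
        severity: critical
      annotations:
        message: Prometheus could not scrape fluentd for more than 10m.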



Version-Release number of selected component (if applicable):
ose-cluster-logging-operator-v4.3.0-201911201806
ose-logging-fluentd-v4.3.0-201911151317
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-11-19-122017   True        False         5h51m   Cluster version is 4.3.0-0.nightly-2019-11-19-122017


How reproducible:
Always

Steps to Reproduce:
1. Deploy the logging operators
2. Deploy a log receiver
3. Create a logforwarding instance (a sketch is included after the ClusterLogging example below)
4. Create a clusterlogging instance without setting logStore:
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  annotations:
    clusterlogging.openshift.io/logforwardingtechpreview: enabled
  name: "instance"
  namespace: "openshift-logging"
spec:
  managementState: "Managed"
  collection:
    logs:
      type: "fluentd"
      fluentd: {}

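For step 3, a logforwarding instance of the following shape can be used (a sketch of the v1alpha1 tech-preview API; the output name, endpoint, and secret are placeholders for whatever log receiver was deployed in step 2):

apiVersion: logging.openshift.io/v1alpha1
kind: LogForwarding
metadata:
  name: instance
  namespace: openshift-logging
spec:
  disableDefaultForwarding: true
  outputs:
  - name: user-created-es                                      # placeholder output name
    type: elasticsearch
    endpoint: elasticsearch-server.openshift-logging.svc:9200  # placeholder endpoint
    secret:
      name: fluentd                                            # assumption: secret with client certs for the receiver
  pipelines:
  - name: app-logs
    inputSource: logs.app
    outputRefs:
    - user-created-es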

Actual results:
Alert `FluentdNodeDown` is firing even though all the fluentd pods are working as expected.

Expected results:


Additional info:

Comment 2 Qiaoling Tang 2019-12-16 01:03:31 UTC
Verified with clusterlogging.4.3.0-201912130552

ls tempdir/manifests/cluster-logging/4.3/
0100_clusterroles.yaml         cluster-loggings.crd.yaml                          collectors.crd.yaml
0110_clusterrolebindings.yaml  cluster-logging.v4.3.0.clusterserviceversion.yaml  logforwardings.crd.yaml

Comment 4 errata-xmlrpc 2020-01-23 11:14:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

