Bug 1774864

Summary: Alert `FluentdNodeDown` fires when logforwarding is enabled and the logstore is not set in the clusterlogging instance.
Product: OpenShift Container Platform
Reporter: Qiaoling Tang <qitang>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED ERRATA
QA Contact: Qiaoling Tang <qitang>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.3.0
CC: aos-bugs, rmeggins, surbania
Target Milestone: ---
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1776533
Environment:
Last Closed: 2020-05-04 11:16:09 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1776533    

Description Qiaoling Tang 2019-11-21 07:04:18 UTC
Description of problem:
Alert `FluentdNodeDown` fires when logforwarding is enabled and the logstore is not set in the clusterlogging instance. I checked the fluentd metrics: they are exposed, but they do not show up in the Prometheus metrics console, and the prometheus-k8s pod logs many errors like these:
level=error ts=2019-11-21T06:27:43.316Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:263: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"openshift-logging\""
level=error ts=2019-11-21T06:27:44.316Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:265: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-logging\""
level=error ts=2019-11-21T06:27:44.316Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:264: Failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"openshift-logging\""
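These errors suggest that nothing in openshift-logging grants the prometheus-k8s service account permission to discover scrape targets there (something the operator presumably sets up when a logstore is configured). A minimal sketch of the kind of Role/RoleBinding that would allow the discovery, with illustrative names, is:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-k8s            # illustrative name
  namespace: openshift-logging
rules:
- apiGroups: [""]
  resources:
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-k8s            # illustrative name
  namespace: openshift-logging
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: openshift-monitoring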

However, I could get the fluentd metrics directly using the system:serviceaccount:openshift-monitoring:prometheus-k8s token:
oc exec cluster-logging-operator-64ccbb7b68-svcvg -- curl -ks -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`"   -H "Content-type: application/json" https://172.30.225.76:24231/metrics
# TYPE fluentd_output_status_buffer_total_bytes gauge
# HELP fluentd_output_status_buffer_total_bytes Current total size of stage and queue buffers.
fluentd_output_status_buffer_total_bytes{hostname="fluentd-sbfzk",plugin_id="retry_user_created_es",type="elasticsearch"} 0.0
fluentd_output_status_buffer_total_bytes{hostname="fluentd-sbfzk",plugin_id="user_created_es",type="elasticsearch"} 56180.0
# TYPE fluentd_output_status_buffer_stage_length gauge
# HELP fluentd_output_status_buffer_stage_length Current length of stage buffers.
fluentd_output_status_buffer_stage_length{hostname="fluentd-sbfzk",plugin_id="retry_user_created_es",type="elasticsearch"} 0.0
fluentd_output_status_buffer_stage_length{hostname="fluentd-sbfzk",plugin_id="user_created_es",type="elasticsearch"} 5.0
# TYPE fluentd_output_status_buffer_stage_byte_size gauge

$ oc get sa
NAME                       SECRETS   AGE
builder                    2         81m
cluster-logging-operator   2         81m
default                    2         81m
deployer                   2         81m
elasticsearch-server       2         64m
logcollector               2         36m

$ oc get secret |grep fluentd
fluentd                                    Opaque                                3      36m
fluentd-metrics                            kubernetes.io/tls                     2      36m

$ oc get svc
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
fluentd                ClusterIP   172.30.225.76    <none>        24231/TCP           41m


$ oc get servicemonitor fluentd -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2019-11-21T05:59:14Z"
  generation: 1
  name: fluentd
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: logging.openshift.io/v1
    controller: true
    kind: ClusterLogging
    name: instance
    uid: 34da88dc-2122-463f-829d-18d293c4cbf5
  resourceVersion: "187593"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/openshift-logging/servicemonitors/fluentd
  uid: 14255069-1dd4-4a1a-adac-7f6055a8cc7d
spec:
  endpoints:
  - path: /metrics
    port: metrics
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: fluentd.openshift-logging.svc
  jobLabel: monitor-fluentd
  namespaceSelector:
    matchNames:
    - openshift-logging
  selector:
    matchLabels:
      logging-infra: support
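For this ServiceMonitor to pick up fluentd, the fluentd Service has to carry the logging-infra: support label from the selector above and expose a port named metrics. A sketch of such a Service follows; the pod selector and the serving-cert annotation are assumptions based on the fluentd-metrics TLS secret listed earlier:

apiVersion: v1
kind: Service
metadata:
  name: fluentd
  namespace: openshift-logging
  labels:
    logging-infra: support        # must match spec.selector.matchLabels in the ServiceMonitor
  annotations:
    # assumed: asks the service-ca operator to issue the serving cert stored in the fluentd-metrics secret
    service.beta.openshift.io/serving-cert-secret-name: fluentd-metrics
spec:
  selector:
    component: fluentd            # illustrative pod selector
  ports:
  - name: metrics                 # must match endpoints[].port in the ServiceMonitor
    port: 24231
    targetPort: 24231
    protocol: TCP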

FluentdNodeDown alert details:
alert: FluentdNodeDown
expr: absent(up{job="fluentd"} == 1)

Evaluating absent(up{job="fluentd"} == 1) returns:
Element  Value
{}       1

This means Prometheus considers fluentd down: there is no up{job="fluentd"} series with value 1, because service discovery in openshift-logging fails with the permission errors above and the fluentd endpoints are never scraped.
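The alert itself comes from a PrometheusRule in the namespace. A sketch of such a rule is shown below; only the alert name and expression come from the details above, the duration, labels, and annotation are illustrative:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fluentd                   # illustrative name
  namespace: openshift-logging
spec:
  groups:
  - name: logging_fluentd.alerts  # illustrative group name
    rules:
    - alert: FluentdNodeDown
      expr: absent(up{job="fluentd"} == 1)
      for: 10m                    # illustrative duration
      labels:
        severity: critical        # illustrative severity
      annotations:
        message: Prometheus has not scraped fluentd for more than 10 minutes.  # illustrative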



Version-Release number of selected component (if applicable):
ose-cluster-logging-operator-v4.3.0-201911201806
ose-logging-fluentd-v4.3.0-201911151317
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-11-19-122017   True        False         5h51m   Cluster version is 4.3.0-0.nightly-2019-11-19-122017


How reproducible:
Always

Steps to Reproduce:
1. Deploy the logging operators.
2. Deploy a log receiver.
3. Create a logforwarding instance (a sketch of one follows the clusterlogging example below).
4. Create a clusterlogging instance without setting logStore:
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  annotations:
    clusterlogging.openshift.io/logforwardingtechpreview: enabled
  name: "instance"
  namespace: "openshift-logging"
spec:
  managementState: "Managed"
  collection:
    logs:
      type: "fluentd"
      fluentd: {}
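For step 3, a sketch of the tech-preview LogForwarding instance is shown here; the output name, endpoint, and pipeline name are illustrative (the user_created_es plugin_id in the fluentd metrics above suggests an output of roughly this shape):

apiVersion: logging.openshift.io/v1alpha1
kind: LogForwarding
metadata:
  name: instance
  namespace: openshift-logging
spec:
  disableDefaultForwarding: true
  outputs:
  - name: user-created-es          # illustrative output name
    type: elasticsearch
    endpoint: elasticsearch-server.openshift-logging.svc:9200   # illustrative endpoint
  pipelines:
  - name: app-logs                 # illustrative pipeline name
    inputSource: logs.app
    outputRefs:
    - user-created-es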


Actual results:
Alert `FluentdNodeDown` fires even though all the fluentd pods are working as expected.

Expected results:
No `FluentdNodeDown` alert fires while all the fluentd pods are running and their metrics endpoints are being scraped.

Additional info:

Comment 8 Qiaoling Tang 2020-02-18 00:38:56 UTC
Verified with clusterlogging.4.4.0-202002170216

Comment 10 errata-xmlrpc 2020-05-04 11:16:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581