Bug 1848186 - Web console generates bad YAML for default clusterlogging CR - results in bad retentionPolicy configuration where infra and app indices never get created.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Periklis Tsirakidis
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks: 1864284
 
Reported: 2020-06-17 21:24 UTC by Mike Fiedler
Modified: 2020-10-27 16:08 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1864284 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:08:03 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-logging-operator pull 568 0 None closed Bug 1848186: Fix retention policy in CSV alm example manifest 2020-09-29 18:52:26 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:08:39 UTC

Description Mike Fiedler 2020-06-17 21:24:09 UTC
Description of problem:

ELO (Elasticsearch Operator): 4.5.0-202006161654 and CLO (Cluster Logging Operator): 4.5.0-202006161654

On the latest ART builds, the app-* and infra-* indices are never created. Only .security and .kibana exist:

# oc exec -n openshift-logging -c elasticsearch $POD -- curl --connect-timeout 2 -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/indices?v                                                               
health status index     uuid                   pri rep docs.count docs.deleted store.size pri.store.size                                                                                                                                                                                                                      
green  open   .security HAp4A55hQluOFGSDcZicQA   1   1          5            0     59.2kb         29.6kb                                                                                                                                                                                                                      
green  open   .kibana_1 ngZnLKi0TDif_KmKgpOPTA   1   1          0            0       460b           230b  

Operations and app pod logging is occurring, and the fluentd logs look reasonably normal (initial connection failures while ES starts, followed by the "retry succeeded" message), but the indices are never created.

fluentd log excerpt (full logs in must-gather):

2020-06-17 21:07:26 +0000 [warn]: suppressed same stacktrace
2020-06-17 21:07:43 +0000 [warn]: [clo_default_output_es] failed to flush the buffer. retry_time=6 next_retry_seconds=2020-06-17 21:08:13 +0000 chunk="5a84e038f011085568ca17f6789532da" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Connection refused - connect(2) for 172.30.233.113:9200 (Errno::ECONNREFUSED)"
2020-06-17 21:07:43 +0000 [warn]: suppressed same stacktrace
2020-06-17 21:07:49 +0000 [warn]: [clo_default_output_es] failed to flush the buffer. retry_time=7 next_retry_seconds=2020-06-17 21:08:57 +0000 chunk="5a84e038d9b8856f22109530caf228b2" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Connection refused - connect(2) for 172.30.233.113:9200 (Errno::ECONNREFUSED)"
2020-06-17 21:07:49 +0000 [warn]: suppressed same stacktrace
2020-06-17 21:08:57 +0000 [warn]: [clo_default_output_es] retry succeeded. chunk_id="5a84e038d9b8856f22109530caf228b2"
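
For reference, the same collector logs can be pulled directly without a full must-gather. A minimal sketch, assuming the component=fluentd pod label that CLO applies to its collectors (adjust the selector if your labels differ):

# Tail recent logs from all fluentd collector pods
oc logs -n openshift-logging -l component=fluentd --tail=100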



Version-Release number of selected component (if applicable):
ELO: 4.5.0-202006161654 and CLO: 4.5.0-202006161654 on the latest 4.5 nightly

How reproducible: Always so far


Steps to Reproduce:
1. Install the latest 4.5 nightly on an AWS cluster. Install ELO and CLO at the versions specified above.
2. Create a ClusterLogging CR with the spec below (a sketch of applying it follows the steps).

spec:
  collection:
    logs:
      type: fluentd
  curation:
    curator:
      schedule: 30 3 * * *
    type: curator
  logStore:
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      resources:
        requests:
          cpu: 28
          memory: 61Gi
      storage:
        size: 400G
        storageClassName: gp2
    retentionPolicy:
      logs.app:
        maxAge: 1d
    type: elasticsearch
  managementState: Managed
  visualization:
    kibana:
      replicas: 1
    type: kibana


3. Start app logging traffic
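
For step 2, a minimal sketch of creating the CR, assuming the spec above is wrapped in the standard apiVersion/kind/metadata header shown later in comment 3 and saved as clusterlogging.yaml (the filename is illustrative):

# clusterlogging.yaml = the header from comment 3 plus the spec above
oc apply -f clusterlogging.yaml

# Confirm what the operator stored
oc get clusterlogging instance -n openshift-logging -o yaml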

Actual results:

The ES cluster and all fluentd pods start and become Ready, but only the .kibana and .security indices are created.



Additional info:

Will add link to must-gather

Comment 2 Qiaoling Tang 2020-06-18 00:56:30 UTC
Hi @Mike, I didn't hit this issue.

I noticed that your clusterlogging YAML file has:
    retentionPolicy:
      logs.app:
        maxAge: 1d

I'm afraid the format is not correct. Could you please try again with the format below?
    retentionPolicy: 
      application:
        maxAge: 1d

Besides, if you only specify a retentionPolicy for app logs, then only app logs will be received; details are in https://bugzilla.redhat.com/show_bug.cgi?id=1845788#c0 and https://bugzilla.redhat.com/show_bug.cgi?id=1845788#c4.

Comment 3 Mike Fiedler 2020-06-18 01:26:19 UTC
The default YAML generated in the console turned out to be the issue (as hinted at by Qiaoling in comment 2). Changing this bz to reflect the root issue; let me know if you prefer a new bz. I think this is a must-fix for 4.5 GA.

After installing CLO and ELO, going to Installed Operators -> Cluster Logging -> YAML view presents the user with this ClusterLogging YAML:

apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  namespace: openshift-logging
  name: instance
  labels: {}
spec:
  collection:
    logs:
      type: fluentd
  curation:
    curator:
      schedule: 30 3 * * *
    type: curator
  logStore:
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      storage:
        size: 200G
        storageClassName: gp2
    retentionPolicy:
      logs.app:
        maxAge: 7d
    type: elasticsearch
  managementState: Managed
  visualization:
    kibana:
      replicas: 1
    type: kibana

The retentionPolicy is incorrect (an API change since 4.4? an upgrade issue?) and results in a logging config where the infra-* and app-* indices never get created. See https://bugzilla.redhat.com/show_bug.cgi?id=1845788#c0 and https://bugzilla.redhat.com/show_bug.cgi?id=1845788#c4.

Changing the retentionPolicy to the one below allows logging to work correctly.


    retentionPolicy: 
      application:
        maxAge: 1d
      infra:
        maxAge: 3h
      audit:
        maxAge: 2w
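
For an already-created instance, a sketch of applying the same correction in place with a JSON merge patch. The explicit null on the bad logs.app key is an assumption about how to drop it (RFC 7386 merge patches delete keys set to null); this was not verified as part of this bug:

# Merge-patch retentionPolicy on the live CR; "logs.app": null removes
# the bad key left behind by the old console default
oc -n openshift-logging patch clusterlogging instance --type merge -p '
{"spec": {"logStore": {"retentionPolicy": {
  "logs.app": null,
  "application": {"maxAge": "1d"},
  "infra": {"maxAge": "3h"},
  "audit": {"maxAge": "2w"}}}}}'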

Comment 6 Anping Li 2020-06-24 02:15:05 UTC
Tested in 4.6; the spec below was created from the console:
# oc get clusterlogging instance -o json | jq '.spec'
{
  "collection": {
    "logs": {
      "type": "fluentd"
    }
  },
  "curation": {
    "curator": {
      "schedule": "30 3 * * *"
    },
    "type": "curator"
  },
  "logStore": {
    "elasticsearch": {
      "nodeCount": 3,
      "redundancyPolicy": "SingleRedundancy",
      "storage": {
        "size": "200G",
        "storageClassName": "gp2"
      }
    },
    "retentionPolicy": {
      "application": {
        "maxAge": "7d"
      }
    },
    "type": "elasticsearch"
  },
  "managementState": "Managed",
  "visualization": {
    "kibana": {
      "replicas": 1
    },
    "type": "kibana"
  }
}
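
With that spec in place, re-running the index check from the description should show the app-* and infra-* indices alongside .security and .kibana (same command as in the description; $POD is any elasticsearch pod name):

# List indices from inside an ES pod, as in the original report
oc exec -n openshift-logging -c elasticsearch $POD -- \
  curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  'https://localhost:9200/_cat/indices?v'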

Comment 8 errata-xmlrpc 2020-10-27 16:08:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

