Bug 1897482 - Elasticsearch does not retain log data as per the log retention policy defined in cluster logging instance in OCP 4.6
Summary: Elasticsearch does not retain log data as per the log retention policy defined in cluster logging instance in OCP 4.6
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.6
Hardware: s390x
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Jeff Cantrill
QA Contact: Anping Li
URL:
Whiteboard: logging-exploration
Depends On:
Blocks: ocp-46-z-tracker
 
Reported: 2020-11-13 07:37 UTC by Sanjaya
Modified: 2020-12-22 15:31 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-22 15:31:50 UTC
Target Upstream Version:
Embargoed:


Attachments
app log last timestamp (79.34 KB, image/png) - 2020-11-13 07:39 UTC, Sanjaya
app log current timestamp (80.40 KB, image/png) - 2020-11-13 07:40 UTC, Sanjaya
audit log last timestamp (82.98 KB, image/png) - 2020-11-13 07:41 UTC, Sanjaya
audit log current timestamp (83.66 KB, image/png) - 2020-11-13 07:42 UTC, Sanjaya
infra log last timestamp (79.98 KB, image/png) - 2020-11-13 07:42 UTC, Sanjaya
infra log current timestamp (69.28 KB, image/png) - 2020-11-13 07:43 UTC, Sanjaya
retention policy yaml config (48.70 KB, image/png) - 2020-11-13 07:44 UTC, Sanjaya
app log last time-2d unit (84.73 KB, image/png) - 2020-11-23 08:53 UTC, Sanjaya
app log current time-2d unit (69.57 KB, image/png) - 2020-11-23 08:54 UTC, Sanjaya
infra log last time-3h unit (65.37 KB, image/png) - 2020-11-23 08:55 UTC, Sanjaya
infra log current time-3h unit (66.21 KB, image/png) - 2020-11-23 08:59 UTC, Sanjaya
audit log last time-3h unit (118.39 KB, image/png) - 2020-11-23 09:01 UTC, Sanjaya
audit log current time-3h unit (91.32 KB, image/png) - 2020-11-23 09:02 UTC, Sanjaya
updated-1-logging-yaml (41.44 KB, image/png) - 2020-11-23 09:03 UTC, Sanjaya
audit log forwarding yaml (43.02 KB, image/jpeg) - 2020-12-18 11:54 UTC, Sanjaya

Description Sanjaya 2020-11-13 07:37:31 UTC
Description of problem:

Hi,

I am trying to verify the log retention policy within a specific time frame (e.g. 60 min) for all three log sources (app/infra/audit), as per the doc below.

I observed that the application logs retain only approximately 1 hr (sometimes 45 min) of data, but the audit and infra logs were retaining 2-3 hr and 18-19 hr of data respectively, even though a 60 min retention policy is configured in the logging instance YAML.

Is this intended for audit and infra logs, or is it different from what the document suggests?

https://docs.openshift.com/container-platform/4.6/logging/config/cluster-logging-log-store.html

To determine the last timestamp of retained data, I sorted by timestamp in ascending order in Kibana, and used the descending sort as the current timestamp.

Screenshots are attached for all log sources and retention times.

Version-Release number of selected component (if applicable): OCP 4.6


How reproducible:


Steps to Reproduce:
1. Set up a cluster logging instance.
2. Edit the logging instance to add a retention policy (oc edit clusterlogging instance); a sketch of the relevant spec is shown below.
3. Add the timestamp as a selected field for each index (app/infra/audit) in Kibana and sort on it to find the last timestamp of retained data.
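
For reference, the retention policy is set in the ClusterLogging custom resource. A minimal sketch of the 60 min case, assuming the retentionPolicy/maxAge fields described in the linked log-store doc (the exact values used here are in the attached "retention policy yaml config" screenshot):

apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  logStore:
    type: elasticsearch
    retentionPolicy:      # per log source; indices older than maxAge become eligible for deletion
      application:
        maxAge: 1h
      infra:
        maxAge: 1h
      audit:
        maxAge: 1h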

Actual results:

Audit and infra logs are retained well beyond the configured 60 min window (approximately 2-3 hr and 18-19 hr, respectively); only application logs retain roughly 60 min of data.

Expected results:

All log types should retain data only for the time frame configured in the logging instance.

Additional info:

Comment 1 Sanjaya 2020-11-13 07:39:42 UTC
Created attachment 1729008 [details]
app log last timestamp

Comment 2 Sanjaya 2020-11-13 07:40:38 UTC
Created attachment 1729009 [details]
app log current timestamp

Comment 3 Sanjaya 2020-11-13 07:41:29 UTC
Created attachment 1729010 [details]
audit log last timestamp

Comment 4 Sanjaya 2020-11-13 07:42:07 UTC
Created attachment 1729011 [details]
audit log current timestamp

Comment 5 Sanjaya 2020-11-13 07:42:52 UTC
Created attachment 1729012 [details]
infra log last timestamp

Comment 6 Sanjaya 2020-11-13 07:43:24 UTC
Created attachment 1729013 [details]
infra log current timestamp

Comment 7 Sanjaya 2020-11-13 07:44:29 UTC
Created attachment 1729014 [details]
retention policy yaml config

Comment 8 Sanjaya 2020-11-23 08:49:33 UTC
Hi,

Please find the updated observations on log retention for the time frames below. I set the retention times (app: 2 days, infra: 3 hours, audit: 3 hours) via "oc edit clusterlogging instance" and observed:

log type    | configured time | actual retention time (approx.)
------------|-----------------|--------------------------------
application | 2 days          | 2 days     // OK
infra       | 3 hours         | 3 hours    // OK
audit       | 3 hours         | 4-5 hours  // NOT OK - expected 3 hrs, but logs are only deleted once they are 4-5 hours old
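
Besides the Kibana timestamp checks, the index creation dates can be inspected directly on the Elasticsearch pods. A hedged sketch (es_util is the query helper shipped in the managed Elasticsearch image; the pod name is a placeholder to be replaced with one of the elasticsearch-cdm-* pods in your cluster):

# list indices with their creation dates and document counts
oc -n openshift-logging exec -c elasticsearch <elasticsearch-pod> -- \
  es_util --query="_cat/indices?v&h=index,creation.date.string,docs.count&s=creation.date"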

Screenshots are attached with details of the configured YAML and the actual results in the Kibana UI.

Thanks,
Sanjay

Comment 9 Sanjaya 2020-11-23 08:53:12 UTC
Created attachment 1732363 [details]
app log last time-2d unit

Comment 10 Sanjaya 2020-11-23 08:54:03 UTC
Created attachment 1732364 [details]
app log current time-2d unit

Comment 11 Sanjaya 2020-11-23 08:55:12 UTC
Created attachment 1732365 [details]
infra log last time-3h unit

Comment 12 Sanjaya 2020-11-23 08:59:23 UTC
Created attachment 1732369 [details]
infra log current time-3h unit

Comment 13 Sanjaya 2020-11-23 09:01:10 UTC
Created attachment 1732370 [details]
audit log last time-3h unit

Comment 14 Sanjaya 2020-11-23 09:02:05 UTC
Created attachment 1732371 [details]
audit log current time-3h unit

Comment 15 Sanjaya 2020-11-23 09:03:18 UTC
Created attachment 1732373 [details]
updated-1-logging-yaml

Comment 16 wvoesch 2020-12-07 13:53:59 UTC
We expect this behavior to also occur on x86. Could someone please check this on x86? Thank you.

Comment 17 Periklis Tsirakidis 2020-12-18 10:57:18 UTC
@sabeher2.com

Could you please elaborate on how you managed to get the audit logs into our managed Elasticsearch instance? By default, we don't support storing audit logs in the default managed Elasticsearch store.

In addition, can you share the fluentd config map:

oc -n openshift-logging get configmap fluentd -o yaml > fluentd.yaml

Comment 18 Sanjaya 2020-12-18 11:52:08 UTC
(In reply to Periklis Tsirakidis from comment #17)
> @sabeher2.com
> 
> Could you please elaborate on how you managed to get the audit logs into our
> managed Elasticsearch instance? By default, we don't support storing audit
> logs in the default managed Elasticsearch store.
> 
> In addition, can you share the fluentd config map:
> 
> oc -n openshift-logging get configmap fluentd -o yaml > fluentd.yaml

@periklis

Hi,
Since audit logs are not stored in the internal Elasticsearch instance by default, we used the Log Forwarding API (kind: ClusterLogForwarder) to forward the audit logs, along with the app/infra logs, to the internal Elasticsearch instance (i.e. outputRefs: default), as sketched below.

I have attached a copy of the configured YAML file for reference.

Referenced RH doc (Forward audit logs to the log store): https://docs.openshift.com/container-platform/4.6/logging/config/cluster-logging-log-store.html
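
For context, the pipeline in the attached YAML follows the shape documented there. A minimal sketch, assuming the documented ClusterLogForwarder fields (this is not the exact attached file):

apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  pipelines:
  - name: all-to-default
    inputRefs:       # forward all three log types
    - infrastructure
    - application
    - audit
    outputRefs:
    - default        # the internal managed Elasticsearch log store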

NOTE regarding the fluentd config map output:
Due to a technical issue we are currently unable to access the existing cluster; the team is working on allocating new VMs to set up a new cluster.
Once the new cluster is ready, I will share the fluentd config map output.

Thanks.

Comment 19 Sanjaya 2020-12-18 11:54:31 UTC
Created attachment 1740260 [details]
audit log forwarding yaml

Comment 20 Jeff Cantrill 2020-12-18 12:28:45 UTC
The retention policy is a mechanism that exposes part of the ES rollover API to assist in maintaining indices and cluster stability. The retention policy is an input to the rollover conditions and does not directly determine which documents will be retained. New indices are created when any of the rollover conditions [1] are satisfied, which means there is not necessarily a uniform balance; some indices may be larger or smaller simply because they contain more documents or larger documents. Removal of an index is based on the creation date of the index as compared to the retention policy, not the age of any of its documents. This means that many documents may be removed that do not explicitly meet the retention policy.
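
To illustrate, the policy ultimately feeds rollover conditions like these (a hedged sketch per the ES 6.8 rollover docs in [1]; the alias name and the extra conditions are examples, not the operator's exact request):

POST /app-write/_rollover
{
  "conditions": {
    "max_age": "3h",
    "max_docs": 1000000,
    "max_size": "40gb"
  }
}

A new backing index is created as soon as any one condition is satisfied, and whole indices are later removed once their creation date falls outside the retention window, regardless of the timestamps of the documents inside them.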

The source of the logs is immaterial to how they are curated. The jobs are all the same, with the exception of the indices upon which they act.

> audit      -  3hours -  (4-5)hours // NOT OK- expected 3hrs , but its deleting logs older than last 4-5 hours.

This observation is not clear to me. It reads like it is deleting indices older than 3 hours, which is what it is configured to do.

IMO, there is not a bug here to be fixed.


[1] https://www.elastic.co/guide/en/elasticsearch/reference/6.8/indices-rollover-index.html

Comment 21 Periklis Tsirakidis 2020-12-22 15:31:50 UTC
As per [1], there is not a bug to fix here:

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1897482#c20

