Bug 1532955 - Container logs were not sent to ES stack
Summary: Container logs were not sent to ES stack
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.9.0
Assignee: Jeff Cantrill
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-10 05:26 UTC by Anping Li
Modified: 2018-03-28 14:18 UTC
CC: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The metadata pipeline relied on information that could be missing from a record.
Consequence: The missing information caused record processing to error out.
Fix: Update the pipeline to cache metadata better and to fall back to pushing the log into an orphaned index if needed.
Result: Logs are pushed into storage as desired.
Clone Of:
Environment:
Last Closed: 2018-03-28 14:18:26 UTC
Target Upstream Version:
Embargoed:


Attachments
Log dump files (382.49 KB, application/x-gzip)
2018-01-10 05:29 UTC, Anping Li
fluentd log (1.28 KB, application/x-gzip)
2018-01-10 05:30 UTC, Anping Li


Links
GitHub openshift/origin-aggregated-logging pull 898 (closed): Metadata cpu cache fix (last updated 2021-01-27 12:41:08 UTC)
Red Hat Product Errata RHBA-2018:0489 (last updated 2018-03-28 14:18:45 UTC)

Description Anping Li 2018-01-10 05:26:30 UTC
Description of problem:
There are no projects.* indices in Elasticsearch. The fluentd logs show lots of orphaned records, and the ES stack is filling up with orphaned documents.


Version-Release number of selected component (if applicable):
ocp:v3.9.0-0.16.0 
openshift3/logging-fluentd/images/v3.9.0-0.16.0.2

How reproducible:
always

Steps to Reproduce:
1. deploy logging
2. create projects and applications
   oc new-project anlitest
   oc new-app httpd-example
3. check the indices in Elasticsearch
#oc exec -c elasticsearch logging-es-data-master-y8m32zuh-2-6dfwl -- curl -s -XGET --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/indices?v

4. Check the fluentd logs
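For step 4, one hedged way to pull a collector's logs (this assumes the default logging project and the component=fluentd pod label used by a standard deployment):
# oc logs -n logging $(oc get pods -n logging -l component=fluentd -o name | head -1)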

Actual results:


Step 3) No projects.xxx indices:
#oc exec -c elasticsearch logging-es-data-master-y8m32zuh-2-6dfwl -- curl -s -XGET --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/indices?v
health status index                                        pri rep docs.count docs.deleted store.size pri.store.size 
green  open   .searchguard.logging-es-data-master-y8m32zuh   1   0          5            0     33.5kb         33.5kb 
green  open   .operations.2018.01.10                         1   0       5688            0      3.3mb          3.3mb 
green  open   .kibana                                        1   0          1            0      3.1kb          3.1kb 
green  open   .orphaned.2018.01.10                           1   0     319114            0    502.2mb        502.2mb 

Step 4) Lots of errors like the following:
2018-01-10 00:02:16 -0500 [error]: record cannot use elasticsearch index name type project_full: record is missing kubernetes.namespace_id field: {"docker"=>{"container_id"=>"ea71e08fceefe10b4499b6deba0d64a20bc9163dcb931b29dfd65b7b96704966"}, "kubernetes"=>{"container_name"=>"fluentd-elasticsearch", "namespace_name"=>"logging", "pod_name"=>"logging-fluentd-lc6qm", "pod_id"=>"64a74cb5-f5c3-11e7-9e19-fa163e78c39e", "labels"=>{"component"=>"fluentd", "controller-revision-hash"=>"254787015", "logging-infra"=>"fluentd", "pod-template-generation"=>"3", "provider"=>"openshift"}, "host"=>"192.168.1.223", "master_url"=>"https://kubernetes.default.svc.cluster.local"}, "message"=>"umount: /var/lib/docker/containers/fc901a4e983f1639b4c09ca045fae84cf1aa879c0342be5004429e27ac45ce74/shm: not mounted\n", "level"=>"err", "hostname"=>"192.168.1.223", "pipeline_metadata"=>{"collector"=>{"ipaddr4"=>"10.128.0.17", "ipaddr6"=>"fe80::403:aeff:fe5c:886b", "inputname"=>"fluent-plugin-systemd", "name"=>"fluentd", "received_at"=>"2018-01-10T05:02:16.571670+00:00", "version"=>"0.12.42 1.6.0"}}, "@timestamp"=>"2018-01-10T05:02:10.186175+00:00"}
2018-01-10 00:02:16 -0500 [error]: record cannot use elasticsearch index name type project_full: record is missing kubernetes.namespace_id field: {"docker"=>{"container_id"=>"ea71e08fceefe10b4499b6deba0d64a20bc9163dcb931b29dfd65b7b96704966"}, "kubernetes"=>{"container_name"=>"fluentd-elasticsearch", "namespace_name"=>"logging", "pod_name"=>"logging-fluentd-lc6qm", "pod_id"=>"64a74cb5-f5c3-11e7-9e19-fa163e78c39e", "labels"=>{"component"=>"fluentd", "controller-revision-hash"=>"254787015", "logging-infra"=>"fluentd", "pod-template-generation"=>"3", "provider"=>"openshift"}, "host"=>"192.168.1.223", "master_url"=>"https://kubernetes.default.svc.cluster.local"}, "message"=>"umount: /var/lib/docker/containers/ff72869a18e86f380e7b5b6b31928372a9f71e633ae4b66d57007cfa5285ca4c/shm: not mounted\n", "level"=>"err", "hostname"=>"192.168.1.223", "pipeline_metadata"=>{"collector"=>{"ipaddr4"=>"10.128.0.17", "ipaddr6"=>"fe80
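One hedged way to count how many records hit this error on a given collector (the pod name is taken from the record above):
# oc logs -n logging logging-fluentd-lc6qm | grep -c 'record is missing kubernetes.namespace_id'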

Expected results:
There should be indices named like projects.logging.**.2018.01.10 and projects.anlitest.**.2018.01.10.

There should not be so many .orphaned documents.
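A hedged way to check for the expected indices (the same curl as in step 3, with the pod name from this cluster, filtered to project indices):
# oc exec -c elasticsearch logging-es-data-master-y8m32zuh-2-6dfwl -- curl -s --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/indices | grep project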

Additional info:

Comment 1 Anping Li 2018-01-10 05:29:18 UTC
Created attachment 1379359 [details]
Log dump files

Comment 2 Anping Li 2018-01-10 05:30:46 UTC
Created attachment 1379361 [details]
fluentd log

There are too many fluentd logs; I only attached the one from the OpenShift master.

Comment 3 Anping Li 2018-01-11 08:42:59 UTC
Container log collection testing is blocked.

Comment 4 Anping Li 2018-01-16 06:18:18 UTC
The container logs can be found in OpenShift v3.9.0-0.19.0 when using the RPM installation, so moving the bug to medium severity. I will continue to check whether the issue exists in the containerized installation.

Comment 5 Anping Li 2018-01-17 02:38:46 UTC
Still no project indices in ES when using logging-fluentd/v3.9.0-0.20.0.0 in one cluster. There are too many fluentd logs. Will try again when https://bugzilla.redhat.com/show_bug.cgi?id=1531157 is fixed.

Comment 6 Junqi Zhao 2018-01-17 06:27:04 UTC
After running for a few minutes, many ES pods are in Evicted status. Describing one ES pod shows the following events:
Events:
  Type     Reason     Age   From                               Message
  ----     ------     ----  ----                               -------
  Warning  Evicted    4m    kubelet, qe-juzhao-39-gcs-1-nrr-1  The node was low on resource: [DiskPressure].

# oc get po | grep logging-es
logging-es-data-master-ea1bmc86-1-5tnqs   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-78rvl   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-8v22l   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-b57vf   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-djdhq   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-fd8bb   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-gx48r   0/2       Evicted   0          44m
logging-es-data-master-ea1bmc86-1-h9ksk   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-jbvn2   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-jfddt   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-jfxs7   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-lm25j   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-m7fds   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-mvrwz   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-pmmfs   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-pt78j   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-sc66g   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-tdm8w   0/2       Pending   0          4m
logging-es-data-master-ea1bmc86-1-zdbpv   0/2       Evicted   0          4m
logging-es-data-master-ea1bmc86-1-zwhq4   0/2       Evicted   0          4m
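A hedged way to confirm the node condition (the node name is taken from the kubelet event above):
# oc describe node qe-juzhao-39-gcs-1-nrr-1 | grep -i diskpressure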

Comment 7 Anping Li 2018-01-17 11:22:37 UTC
The project indices can be found after the following workaround (a command sketch follows the list).
1. delete /var/log/es-containers.log.pos
2. modify daemonset to v3.7 fluentd
3. modify daemonset back to v3.9 fluentd
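A minimal command sketch of the workaround, assuming the daemonset is named logging-fluentd, its container is named fluentd, and <registry> stands in for the image registry (these names are illustrative, not confirmed from this cluster); the rm must be run on each node:
# rm /var/log/es-containers.log.pos
# oc set image ds/logging-fluentd fluentd=<registry>/openshift3/logging-fluentd:v3.7
# oc set image ds/logging-fluentd fluentd=<registry>/openshift3/logging-fluentd:v3.9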

Comment 8 Rich Megginson 2018-01-17 12:48:57 UTC
(In reply to Junqi Zhao from comment #6)
> After running for a few minutes, there are so many es pods which status is
> Evicted, describe one es pod, and find the events:
> Events:
>   Type     Reason     Age   From                               Message
>   ----     ------     ----  ----                               -------
>   Warning  Evicted    4m    kubelet, qe-juzhao-39-gcs-1-nrr-1  The node was
> low on resource: [DiskPressure].

If you deploy logging on a system with enough disk space, does it fix this bug?

> 
> # oc get po | grep logging-es
> logging-es-data-master-ea1bmc86-1-5tnqs   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-78rvl   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-8v22l   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-b57vf   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-djdhq   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-fd8bb   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-gx48r   0/2       Evicted   0          44m
> logging-es-data-master-ea1bmc86-1-h9ksk   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-jbvn2   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-jfddt   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-jfxs7   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-lm25j   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-m7fds   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-mvrwz   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-pmmfs   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-pt78j   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-sc66g   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-tdm8w   0/2       Pending   0          4m
> logging-es-data-master-ea1bmc86-1-zdbpv   0/2       Evicted   0          4m
> logging-es-data-master-ea1bmc86-1-zwhq4   0/2       Evicted   0          4m

Comment 9 Junqi Zhao 2018-01-18 08:58:03 UTC
(In reply to Rich Megginson from comment #8)
> (In reply to Junqi Zhao from comment #6)
> > After running for a few minutes, there are so many es pods which status is
> > Evicted, describe one es pod, and find the events:
> > Events:
> >   Type     Reason     Age   From                               Message
> >   ----     ------     ----  ----                               -------
> >   Warning  Evicted    4m    kubelet, qe-juzhao-39-gcs-1-nrr-1  The node was
> > low on resource: [DiskPressure].
> 
> If you deploy logging on a system with enough disk space, does it fix this
> bug?

It is not reproduced every time; I did not see this issue today, so I am not sure about your question.

Comment 10 Jeff Cantrill 2018-01-18 19:55:32 UTC
Moving to Modified with the merge of https://github.com/openshift/origin-aggregated-logging/pull/898

Comment 14 Anping Li 2018-01-23 02:17:31 UTC
The fix isn't in 3.9.0-0.22.0.0; waiting for the next build.

Comment 15 Anping Li 2018-01-24 03:37:53 UTC
The fix wasn't merged into logging-fluentd/images/v3.9.0-0.23.0.0

sh-4.2# gem list |grep fluent-plugin-kubernetes_metadata_filter
fluent-plugin-kubernetes_metadata_filter (0.33.0)
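For reference, a hedged one-liner to run the same check from outside the container (<fluentd-pod> is a placeholder):
# oc exec <fluentd-pod> -- gem list fluent-plugin-kubernetes_metadata_filter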

Comment 16 Junqi Zhao 2018-01-24 05:30:32 UTC
(In reply to Rich Megginson from comment #8)
> (In reply to Junqi Zhao from comment #6)
> > After running for a few minutes, there are so many es pods which status is
> > Evicted, describe one es pod, and find the events:
> > Events:
> >   Type     Reason     Age   From                               Message
> >   ----     ------     ----  ----                               -------
> >   Warning  Evicted    4m    kubelet, qe-juzhao-39-gcs-1-nrr-1  The node was
> > low on resource: [DiskPressure].
> 
> If you deploy logging on a system with enough disk space, does it fix this
> bug?

Reproduced today; maybe related to
https://bugzilla.redhat.com/show_bug.cgi?id=1531157

Comment 17 Anping Li 2018-01-25 09:05:59 UTC
The bug is fixed in logging-fluentd:v3.9.0-0.24.0.0.

Comment 20 errata-xmlrpc 2018-03-28 14:18:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

